Monday, June 17, 2013

QuickData DMA performance

Recently we have been working on porting the Marlin and Ladon architecture from PLX's platform to a more vendor-neutral platform, using Intel's CPU with its DMA engine and on-board NTB support. The latest Linux kernel (3.9) has included the NTB driver, and also an Ethernet driver based on NTB.

However, it seems that the Ethernet driver does not use the DMA channel to move data between back-to-back connected hosts. To understand how well Intel's DMA performs, I tested Intel's DMA engine, also known as QuickData Technology.


System Description:
I'm testing on an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz with 8GB of memory. For both experiments, the PCIe max payload size is set to 128 bytes.

Two experiments are shown:
1) Send a total payload of 4MB (non-contiguous) with different transfer sizes per DMA entry. Ex: a 4KB transfer size results in 1024 entries to make up 4MB. (The X-axis is in KB, not bytes.)
2) With a particular transfer size per entry, batch a number of entries per DMA transaction. Ex: 4KB transfer size per entry, batching 1024 entries in one DMA transaction.
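
The relationship between the per-entry transfer size and the number of DMA entries in experiment 1 is plain integer division. A minimal sketch in C; the helper name dma_entries is hypothetical and is not part of the test module below:

```c
#include <stddef.h>

/* Hypothetical helper: number of DMA entries needed to move total_bytes
 * when every entry carries entry_bytes. Returns 0 for a zero entry size. */
size_t dma_entries(size_t total_bytes, size_t entry_bytes)
{
    if (entry_bytes == 0)
        return 0;
    return total_bytes / entry_bytes;
}
```

Calling dma_entries(4u << 20, 4u << 10) gives 1024, which matches the 4MB / 4KB example above.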


Short Summary:
I did not set up the NTB, so these experiments are local-to-local memory copies. The best performance is around 54 Gbps, reached when copying 4MB of data with a transfer size larger than 4KB per entry. Copying 1KB of data in a single entry takes 8 us. Compared to PLX's DMA, which reaches around 20 Gbps for local-to-local memory copy under a 128B MaxPayload size, Intel's DMA engine performs considerably better.
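
To put the 8 us figure next to the 54 Gbps figure, it helps to convert both to the same unit. The sketch below uses the same arithmetic as the printk in the source code (bits divided by microseconds gives Mbps); the function name bandwidth_mbps is an assumption made for illustration:

```c
/* Bits per microsecond is numerically equal to megabits per second.
 * Returns 0 when the elapsed time is zero to avoid dividing by zero. */
unsigned long bandwidth_mbps(unsigned long bytes, unsigned long usec)
{
    if (usec == 0)
        return 0;
    return bytes * 8 / usec;
}
```

With these numbers, bandwidth_mbps(1024, 8) is 1024 Mbps, so a single isolated 1KB copy runs at roughly 1 Gbps, far below the batched peak. The gap suggests that per-transaction overhead dominates small copies.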

The throughput increases as the transfer size per entry increases. I think this is due to the maximum number of outstanding read requests, usually set to 32. If every PCIe TLP has to be broken down to the 128-byte maximum payload size, then 32 outstanding requests cover 128B * 32 = 4KB, which would explain why the throughput saturates at a 4KB transfer size per entry.
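
The saturation argument is a single multiplication. A small sketch, assuming (as the paragraph above does) that each outstanding request carries one maximum-payload TLP; the function name is hypothetical:

```c
/* Bytes that can be in flight at once if every outstanding request
 * carries exactly one TLP of max_payload bytes (an assumption taken
 * from the reasoning above, not a measured property of the hardware). */
unsigned long in_flight_bytes(unsigned long max_payload,
                              unsigned long outstanding_requests)
{
    return max_payload * outstanding_requests;
}
```

in_flight_bytes(128, 32) is 4096, the 4KB point at which the measured throughput flattens. Under the same assumption, a 256-byte payload would move that point to 8KB.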

I did not test the effect of address alignment with Intel's DMA; however, we have found that address alignment matters a lot for performance with PLX's DMA engine.

Source code:
#include <linux/sched.h>
#include <linux/pci.h>
#include <linux/ioport.h>
#include <linux/init.h>
#include <linux/dmi.h>
#include <linux/slab.h>
#include <linux/module.h>
#include <linux/dmaengine.h>
#include <linux/async_tx.h>
#include <linux/kernel.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>

#define RAND_BASE 1024
#define GB  (1<<30)
#define MB  (1<<20)
#define KB  (1<<10)


MODULE_LICENSE("GPL");

/* Optional completion callback; not installed by default (tx->callback is NULL). */
static void __maybe_unused callback(void *param)
{
    if (!param)
        printk(KERN_WARNING "NULL param\n");
}

dma_cookie_t
itri_dma_async_memcpy(struct dma_chan *chan, void *dest,
                        void *src, size_t len)
{
        struct dma_device *dev = chan->device;
        struct dma_async_tx_descriptor *tx;
        dma_addr_t dma_dest, dma_src;
        dma_cookie_t cookie;
        unsigned long flags;

        dma_src = dma_map_single(dev->dev, src, len, DMA_TO_DEVICE);
        dma_dest = dma_map_single(dev->dev, dest, len, DMA_FROM_DEVICE);

        flags = DMA_CTRL_ACK |
                DMA_COMPL_SRC_UNMAP_SINGLE |
                DMA_COMPL_DEST_UNMAP_SINGLE;

        tx = dev->device_prep_dma_memcpy(chan, dma_dest, dma_src, len, flags);

        if (!tx) {
                dma_unmap_single(dev->dev, dma_src, len, DMA_TO_DEVICE);
                dma_unmap_single(dev->dev, dma_dest, len, DMA_FROM_DEVICE);
                return -ENOMEM;
        }

        tx->callback = NULL;
        //tx->callback = callback;
        //tx->callback_param = tx; //william
        cookie = tx->tx_submit(tx);

        preempt_disable();
        __this_cpu_add(chan->local->bytes_transferred, len);
        __this_cpu_inc(chan->local->memcpy_count);
        preempt_enable();

    return cookie;
}

int dma_throughput_test(int tr_size, unsigned long total_size)
{
    int i, num_entry, ret = 0;
    void *src, *dst, *ran_src, *ran_dst;
    unsigned long offset, total_usec;
    struct timeval time_e, time_s;
    struct dma_chan *chan;
    dma_cookie_t cookie, last = 0;

    num_entry = total_size / tr_size;
    if (num_entry <= 0)
        return -EINVAL;

    /* Hold a dmaengine reference for the whole test; dropped at out_put. */
    dmaengine_get();
    chan = dma_find_channel(DMA_MEMCPY);
    if (!chan) {
        ret = -ENODEV;
        goto out_put;
    }

    /* Each buffer holds RAND_BASE slots of tr_size bytes. */
    src = kmalloc(tr_size * RAND_BASE, GFP_KERNEL);
    if (!src) {
        ret = -ENOMEM;
        goto out_put;
    }
    dst = kmalloc(tr_size * RAND_BASE, GFP_KERNEL);
    if (!dst) {
        ret = -ENOMEM;
        goto out_free_src;
    }

    memset(dst, 0xff, tr_size * RAND_BASE);
    do_gettimeofday(&time_s);

    for (i = 0; i < num_entry; i++) {
        /* Walk the slots round-robin; the offset always stays inside
         * the RAND_BASE * tr_size buffer. */
        offset = (unsigned long)(i % RAND_BASE) * tr_size;
        ran_src = src + offset;
        ran_dst = dst + offset;

        cookie = itri_dma_async_memcpy(chan, ran_dst, ran_src, tr_size);
        if (cookie < 0) {
            pr_err("Invalid cookie\n");
            ret = -EINVAL;
            break;
        }
        last = cookie;
    }

    /* Start the queued descriptors and wait for the last one submitted,
     * so the buffers are never freed while a copy is still in flight. */
    dma_issue_pending_all();
    if (i > 0)
        dma_sync_wait(chan, last);

    do_gettimeofday(&time_e);
    if (ret)
        goto out_free;

    total_usec = (time_e.tv_sec - time_s.tv_sec) * 1000 * 1000
            + (time_e.tv_usec - time_s.tv_usec);
    if (total_usec == 0)
        total_usec = 1; /* avoid dividing by zero on very short runs */

    /* Bits per microsecond is numerically equal to Mbps. */
    printk(KERN_INFO "Local2Local DMA %6d entry, per entry size %3dKB, "
            "total size: %6luKB, take %6lu us, bandwidth %6luMbps\n",
        num_entry, tr_size / KB, total_size / KB,
        total_usec, total_size * 8 / total_usec
    );

out_free:
    kfree(dst);
out_free_src:
    kfree(src);
out_put:
    dmaengine_put();

    return ret;
}


void test_fixed_total_entry(void)
{
    // assume run time should be the same
    dma_throughput_test(1 * KB, 10 * MB);
    dma_throughput_test(2 * KB, 20 * MB);
    dma_throughput_test(4 * KB, 40 * MB);
    dma_throughput_test(8 * KB, 80 * MB);
    dma_throughput_test(16 * KB, 160 * MB);
    return;
}

void test_throughput(void)
{
    dma_throughput_test(128 , 4UL * GB);
    dma_throughput_test(256 , 4UL * GB);
    dma_throughput_test(512 , 4UL * GB);
    dma_throughput_test(1024 , 4UL * GB);
    return;
}

static int itri_dma_init(void)
{

    test_fixed_total_entry();
    test_throughput();
    return 0;
}

static void itri_dma_exit(void)
{
    printk(KERN_INFO "unloading itri dma\n");
    return;
}

module_init(itri_dma_init);
module_exit(itri_dma_exit);


System supporting NTB (Non-Transparent Bridge)

I recently purchased a set of machines to run my NTB experiments on Intel hardware, so I searched through Intel's specs to find out which products have this cool feature. It seems that currently no vendor other than Intel offers an NTB-capable motherboard. My local vendor even made a mistake and gave me a machine with an NTB-capable CPU but a server board that is not NTB-capable (R1000GZ/GL). It turns out I have to check the specs carefully and tell them in detail what I need.

NTB-capable CPU: E5-2620


Vol1 and Vol2:
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-1600-2600-vol-1-datasheet.html
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-1600-2600-vol-2-datasheet.pdf

NTB: in chapter 3.3
QuickData: DMA channel support in chapter 3.4


NTB-capable Server Board: S1600JP4

http://download.intel.com/support/motherboards/server/sb-s1600jp/sb/g68018004_s1600jp_tps_r1_4.pdf


Intel® Server Board S1600JP4 BIOS

Root Complex Peer-to-Peer support

In 3.4.1, S1600JP4
The Intel® C600 PCH provides up to eight PCI Express* Root Ports, supporting the PCI
http://download.intel.com/support/motherboards/server/sb-s1600jp/sb/g68018004_s1600jp_tps_r1_4.pdf

Processor Integrated I/O (IIO) configuration, peer-to-peer configurations are mentioned.
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-1600-2600-vol-2-datasheet.pdf

Intel C600PCH
http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/c600-series-chipset-thermal-guide.pdf

Other chipsets with a P2P root complex: Intel MCH 5100, 5400, and 7520 all have a peer-to-peer RC
http://www.intel.com/content/dam/doc/datasheet/5400-chipset-memory-controller-hub-datasheet.pdf
However, I'm not 100% sure this supports peer-to-peer root complex.


References

Intel's forum
http://communities.intel.com/message/194756#194756

Server Board with NTB support
S1600JP/S2600JF/S2600CO/S2600WP

Server board without NTB support
S2600CP, S2600GZ/S2600GL, R1000GZ/GL server
If your product doesn't have JP, JF, CO, or WP in it, then it does not support NTB.

Jon Mason’s blog
https://github.com/jonmason/ntb/wiki





Wednesday, June 5, 2013

Trap/Redirect a function call in Linux kernel module

Trapping a kernel module function call

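Today I tested how to replace one function with another inside a kernel module, mainly following
http://www.phrack.org/issues.html?issue=68&id=11
The tool it provides, elfchger.c, is written for 32-bit ELF headers, so it has to be changed to handle 64-bit ELF.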
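
A 32-bit tool cannot be reused unchanged because the fields of a symbol table entry are ordered differently in the two ELF classes, so a fixed byte offset that reaches st_value or st_info in Elf32_Sym lands on a different field in Elf64_Sym. A minimal sketch that asks the compiler for the offsets, assuming a Linux system where <elf.h> is available (the four function names are made up for this example):

```c
#include <elf.h>
#include <stddef.h>

/* Offset of the symbol value (address) inside one symbol table entry. */
size_t sym32_value_offset(void) { return offsetof(Elf32_Sym, st_value); }
size_t sym64_value_offset(void) { return offsetof(Elf64_Sym, st_value); }

/* Offset of the binding/type byte inside one symbol table entry. */
size_t sym32_info_offset(void) { return offsetof(Elf32_Sym, st_info); }
size_t sym64_info_offset(void) { return offsetof(Elf64_Sym, st_info); }
```

In Elf32_Sym, st_value sits at offset 4 and st_info at offset 12. In Elf64_Sym, st_info moves up to offset 4 and st_value sits at offset 8. Using offsetof instead of hand-summed sizes keeps the seek offsets correct for either class.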

Step 1: The original module: evil is never called

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>

MODULE_LICENSE("GPL");

int fun(void) {
    printk(KERN_ALERT "calling fun!\n");
    return 0;
}

int evil(void) {
    printk(KERN_ALERT "===== EVIL ====\n");
    return 0;
}

int init(void) {
    printk(KERN_ALERT "Init Original!\n");
    fun();
    return 0;
}

void clean(void) {
    printk(KERN_ALERT "Exit Original!\n");
    return;
}
module_init(init);
module_exit(clean);


Step 2: Find the corresponding addresses

[root@new_host1 kprobe]# objdump -t orig.ko

orig.ko:     file format elf64-x86-64

SYMBOL TABLE:
000000000000001b g     F .text    000000000000001b evil
0000000000000000 g     O .gnu.linkonce.this_module    0000000000000280 __this_module
0000000000000056 g     F .text    0000000000000019 cleanup_module
0000000000000036 g     F .text    0000000000000020 init_module
0000000000000056 g     F .text    0000000000000019 clean
0000000000000000 g     F .text    000000000000001b fun
0000000000000000         *UND*    0000000000000000 mcount
0000000000000000         *UND*    0000000000000000 printk
0000000000000036 g     F .text    0000000000000020 init

Step 3: Use the patched elfchger.c to modify fun, rewriting its value to the address of evil (0x1b)

[root@new_host1 kprobe]# ./elfchger -s fun -v 1b orig.ko
[+] Opening orig.ko file...
[+] Reading Elf header...
    >> Done!
[+] Finding ".symtab" section...
    >> Found at 0xc618
[+] Finding ".strtab" section...
    >> Found at 0xc658
[+] Getting symbol' infos:
    >> Symbol found at 0x159c8
    >> Index in symbol table: 0x1c
[+] Replacing 0x00000000 with 0x0000001b... done!


Step 4: Check the symbol table again; fun has now been changed to 1b

000000000000001b g     F .text    000000000000001b evil
0000000000000000 g     O .gnu.linkonce.this_module    0000000000000280 __this_module
0000000000000056 g     F .text    0000000000000019 cleanup_module
0000000000000036 g     F .text    0000000000000020 init_module
0000000000000056 g     F .text    0000000000000019 clean
000000000000001b g     F .text    000000000000001b fun
0000000000000000         *UND*    0000000000000000 mcount
0000000000000000         *UND*    0000000000000000 printk
0000000000000036 g     F .text    0000000000000020 init


Step 5: Verify

> insmod orig.ko
> dmesg
[  692.169913] Init Original!
[  692.169975] ===== EVIL ====


==== patch elfchger.c for 64bit system ====

/*
 * elfchger.c by styx^ <the.styx@gmail.com> (based on truff's code)
 * June 6, 2013, Cheng-Chun Tu <u9012063@gmail.com>
 * Patch for 64 bit system
 * Script with two features:
 *
 * Usage 1: Change the symbol name value (address) in a kernel module.
 * Usage 2: Change the symbol binding (from local to global) in a kernel
 *          module.
 *
 * Usage:
 * 1: ./elfchger -s [symbol] -v [value] <module_name>
 * 2: ./elfchger -g [symbol] <module_name>
 */

#include <stdlib.h>
#include <stdio.h>
#include <elf.h>
#include <string.h>
#include <getopt.h>

int ElfGetSectionByName (FILE *fd, Elf64_Ehdr *ehdr, char *section,
                         Elf64_Shdr *shdr);

int ElfGetSectionName (FILE *fd, Elf64_Word sh_name,
                       Elf64_Shdr *shstrtable, char *res, size_t len);

Elf64_Off ElfGetSymbolByName (FILE *fd, Elf64_Shdr *symtab,
                       Elf64_Shdr *strtab, char *name, Elf64_Sym *sym);

void ElfGetSymbolName (FILE *fd, Elf64_Word sym_name,
                       Elf64_Shdr *strtable, char *res, size_t len);

unsigned long ReorderSymbols (FILE *fd, Elf64_Shdr *symtab,
                       Elf64_Shdr *strtab, char *name);

int ReoderRelocation(FILE *fd, Elf64_Shdr *symtab,
                       Elf64_Shdr *strtab, char *name, Elf64_Sym *sym);

int ElfGetSectionByIndex (FILE *fd, Elf64_Ehdr *ehdr, Elf64_Half index,
                          Elf64_Shdr *shdr);

void usage(char *cmd);

int main (int argc, char **argv) {

  FILE *fd;
  Elf64_Ehdr hdr;
  Elf64_Shdr symtab, strtab;
  Elf64_Sym sym;
  Elf64_Off symoffset;
  Elf64_Addr value;

  unsigned long new_index = 0;
  int gflag = 0, vflag = 0, fflag = 0;
  char *sym_name;
  int sym_value = 0;

  long sym_off, str_off;
  int opt;

  if ( argc != 4 && argc != 6 ) {
    usage(argv[0]);
    exit(-1);
  }

  /* Every option takes an argument, so let getopt() deliver it in optarg. */
  while ((opt = getopt(argc, argv, "g:s:v:")) != -1) {

    switch (opt) {

      case 'g':
        gflag = 1;
        sym_name = optarg;
        break;

      case 's':
        fflag = 1;
        sym_name = optarg;
        break;

      case 'v':
        vflag = 1;
        sym_value = strtol(optarg, (char **) NULL, 16);
        break;

      default:
        /* getopt() has already reported the unknown option or the
         * missing argument. */
        usage(argv[0]);
        exit(-1);
    }
  }

  printf("[+] Opening %s file...\n", argv[argc-1]);

  fd = fopen (argv[argc-1], "r+");

  if (fd == NULL) {

    printf("[-] File \"%s\" not found!\n", argv[argc-1]);
    exit(-1);
  }

  printf("[+] Reading Elf header...\n");

  if (fread (&hdr, sizeof (Elf64_Ehdr), 1, fd) < 1) {

    printf("[-] Elf header corrupted!\n");
    exit(-1);
  }

  printf("\t>> Done!\n");

  printf("[+] Finding \".symtab\" section...\n");

  sym_off = ElfGetSectionByName (fd, &hdr, ".symtab", &symtab);

  if (sym_off == -1) {

    printf("[-] Can't get .symtab section\n");
    exit(-1);
  }

  printf("\t>> Found at 0x%x\n", (int )sym_off);
  printf("[+] Finding \".strtab\" section...\n");

  str_off = ElfGetSectionByName (fd, &hdr, ".strtab", &strtab);

  if (str_off  == -1) {

    printf("[-] Can't get .strtab section!\n");
    exit(-1);
  }

  printf("\t>> Found at 0x%x\n", (int )str_off);

  printf("[+] Getting symbol' infos:\n");

  symoffset = ElfGetSymbolByName (fd, &symtab, &strtab, sym_name, &sym);

  if ( (int) symoffset == -1) {

    printf("[-] Symbol \"%s\" not found!\n", sym_name);
    exit(-1);
  }

  if ( gflag == 1 ) {

    if ( ELF64_ST_BIND(sym.st_info) == STB_LOCAL ) {

      unsigned char global;
      unsigned long offset = 0;

      printf("[+] Reordering symbols:\n");

      new_index = ReorderSymbols(fd, &symtab, &strtab, sym_name);

      printf("[+] Updating symbol' infos:\n");

      symoffset = ElfGetSymbolByName(fd, &symtab, &strtab, sym_name, &sym);

      if ( (int) symoffset == -1) {

        printf("[-] Symbol \"%s\" not found!\n", sym_name);
        exit(-1);
      }

      /* In Elf64_Sym, st_info is the byte right after st_name. */
      offset = symoffset + sizeof(Elf64_Word);

      printf("\t>> Replacing flag 'LOCAL' located at 0x%x with 'GLOBAL'\
              \n", (unsigned int)offset);

      if (fseek (fd, offset, SEEK_SET) == -1) {

        perror("[-] fseek: ");
        exit(-1);
      }

      global = ELF64_ST_INFO(STB_GLOBAL, STT_FUNC);

      if (fwrite (&global, sizeof(unsigned char), 1, fd) < 1) {

        perror("[-] fwrite: ");
        exit(-1);
      }

      printf("[+] Updating symtab infos at 0x%x\n", (int )sym_off);

      if ( fseek(fd, sym_off, SEEK_SET) == -1 ) {

        perror("[-] fseek: ");
        exit(-1);
      }

      symtab.sh_info = new_index; // updating sh_info with the new index
                                  // in symbol table.

      if( fwrite(&symtab, sizeof(Elf64_Shdr), 1, fd) < 1 )  {

        perror("[-] fwrite: ");
        exit(-1);
      }

    } else {

      printf("[-] Already global function!\n");
    }

  } else if ( fflag == 1 && vflag == 1 ) {

      /* Widen the parsed value to a full 64-bit address. */
      value = (Elf64_Addr) sym_value;

      printf("[+] Replacing 0x%.8lx with 0x%.8lx... ",
             (unsigned long) sym.st_value, (unsigned long) value);

      /* In Elf64_Sym, st_value follows st_name, st_info, st_other and
       * st_shndx, so it sits 8 bytes into the entry. */
      if (fseek (fd, symoffset + sizeof(Elf64_Word)
                     + 2 * sizeof(unsigned char) + sizeof(Elf64_Half),
                 SEEK_SET) == -1) {

        perror("[-] fseek: ");
        exit(-1);
      }

      if (fwrite (&value, sizeof(Elf64_Addr), 1, fd) < 1 )  {

        perror("[-] fwrite: ");
        exit(-1);
      }

      printf("done!\n");

  }

  fclose (fd);

  return 0;
}

/* This function returns the offset relative to the symbol name "name" */

Elf64_Off ElfGetSymbolByName(FILE *fd, Elf64_Shdr *symtab,
        Elf64_Shdr *strtab, char *name, Elf64_Sym *sym) {

  unsigned int i;
  char symname[255];

  for ( i = 0; i < (symtab->sh_size/symtab->sh_entsize); i++) {

    if (fseek (fd, symtab->sh_offset + (i * symtab->sh_entsize),
               SEEK_SET) == -1) {

      perror("\t[-] fseek: ");
      exit(-1);
    }

    if (fread (sym, sizeof (Elf64_Sym), 1, fd) < 1) {

      perror("\t[-] read: ");
      exit(-1);
    }

    memset (symname, 0, sizeof (symname));

    ElfGetSymbolName (fd, sym->st_name, strtab, symname, sizeof (symname));

    if (!strcmp (symname, name)) {

      printf("\t>> Symbol found at 0x%lx\n",
             (unsigned long) (symtab->sh_offset + (i * symtab->sh_entsize)));

      printf("\t>> Index in symbol table: 0x%x\n", i);

      return symtab->sh_offset + (i * symtab->sh_entsize);
    }
  }

  return -1;
}

/* This function returns the new index of symbol "name" inside the symbol
 * table after re-ordering. */

unsigned long ReorderSymbols (FILE *fd, Elf64_Shdr *symtab,
              Elf64_Shdr *strtab, char *name) {

  unsigned int i = 0, j = 0;
  char symname[255];
  Elf64_Sym *all;
  Elf64_Sym temp;
  unsigned long new_index = 0;
  unsigned long my_off = 0;

  printf("\t>> Starting:\n");

  all = (Elf64_Sym *) malloc(sizeof(Elf64_Sym) *
                      (symtab->sh_size/symtab->sh_entsize));

  if ( all == NULL ) {

    return -1;

  }

  memset(all, 0, sizeof(Elf64_Sym) * (symtab->sh_size/symtab->sh_entsize));

  my_off = symtab->sh_offset;

  for ( i = 0; i < (symtab->sh_size/symtab->sh_entsize); i++) {

    if (fseek (fd, symtab->sh_offset + (i * symtab->sh_entsize),
                             SEEK_SET) == -1) {

      perror("\t[-] fseek: ");
      exit(-1);
    }

    if (fread (&all[i], sizeof (Elf64_Sym), 1, fd) < 1) {

      printf("\t[-] fread: ");
      exit(-1);
    }

    memset (symname, 0, sizeof (symname));

    ElfGetSymbolName(fd, all[i].st_name, strtab, symname, sizeof(symname));

    if (!strcmp (symname, name)) {

      j = i;

      continue;
    }
  }

  temp = all[j];

  for ( i = j; i < (symtab->sh_size/symtab->sh_entsize); i++ ) {

    if ( i+1 >= symtab->sh_size/symtab->sh_entsize )
      break;

    if ( ELF64_ST_BIND(all[i+1].st_info) == STB_LOCAL ) {

      printf("\t>> Moving symbol from %x to %x\n", i+1, i);

      all[i] = all[i+1];

    } else {

      new_index = i;

      printf("\t>> Moving our symbol from %d to %x\n", j, i);

      all[i] = temp;
      break;
    }
  }

  printf("\t>> Last LOCAL symbol: 0x%x\n", (unsigned int)new_index);

  if ( fseek (fd, my_off, SEEK_SET) == -1 ) {

      perror("\t[-] fseek: ");
      exit(-1);
  }

  if ( fwrite(all, sizeof( Elf64_Sym), symtab->sh_size/symtab->sh_entsize,
              fd) < (symtab->sh_size/symtab->sh_entsize )) {

      perror("\t[-] fwrite: ");
      exit(-1);
  }

  printf("\t>> Done!\n");

  free(all);

  return new_index;
}


int ElfGetSectionByIndex (FILE *fd, Elf64_Ehdr *ehdr, Elf64_Half index,
                          Elf64_Shdr *shdr) {

  if (fseek (fd, ehdr->e_shoff + (index * ehdr->e_shentsize),
             SEEK_SET) == -1) {

    perror("\t[-] fseek: ");
    exit(-1);
  }

  if (fread (shdr, sizeof (Elf64_Shdr), 1, fd) < 1) {

    printf("\t[-] Sections header corrupted");
    exit(-1);
  }

  return 0;
}


int ElfGetSectionByName (FILE *fd, Elf64_Ehdr *ehdr, char *section,
                         Elf64_Shdr *shdr) {

  int i;
  char name[255];
  Elf64_Shdr shstrtable;

  ElfGetSectionByIndex (fd, ehdr, ehdr->e_shstrndx, &shstrtable);

  memset (name, 0, sizeof (name));

  for ( i = 0; i < ehdr->e_shnum; i++) {

    if (fseek (fd, ehdr->e_shoff + (i * ehdr->e_shentsize),
               SEEK_SET) == -1) {

      perror("\t[-] fseek: ");
      exit(-1);
    }

    if (fread (shdr, sizeof (Elf64_Shdr), 1, fd) < 1) {

      printf("[-] Sections header corrupted");
      exit(-1);
    }

    ElfGetSectionName (fd, shdr->sh_name, &shstrtable,
                       name, sizeof (name));

    if (!strcmp (name, section)) {

      return ehdr->e_shoff + (i * ehdr->e_shentsize);

    }
  }

  return -1;
}

int ElfGetSectionName (FILE *fd, Elf64_Word sh_name,
                       Elf64_Shdr *shstrtable, char *res, size_t len) {

  size_t i = 0;

  if (fseek (fd, shstrtable->sh_offset + sh_name, SEEK_SET) == -1) {

    perror("\t[-] fseek: ");
    exit(-1);
  }

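  /* Copy bytes until the terminating NUL, end of file, or a full buffer. */
  while (i < len-1) {

    int c = fgetc (fd);

    if (c == EOF || c == '\0')
      break;

    res[i++] = (char) c;
  }

  res[i] = '\0';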

  return 0;
}


void ElfGetSymbolName (FILE *fd, Elf64_Word sym_name,
                       Elf64_Shdr *strtable, char *res, size_t len)
{
  size_t i = 0;

  if (fseek (fd, strtable->sh_offset + sym_name, SEEK_SET) == -1) {

    perror("\t[-] fseek: ");
    exit(-1);
  }

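  /* Copy bytes until the terminating NUL, end of file, or a full buffer. */
  while (i < len-1) {

    int c = fgetc (fd);

    if (c == EOF || c == '\0')
      break;

    res[i++] = (char) c;
  }

  res[i] = '\0';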

  return;
}

void usage(char *cmd) {

  printf("Usage: %s <option(s)> <module_name>\n", cmd);
  printf("Option(s):\n");
  printf(" -g [symbol]\tSymbol we want to change the binding as global\n");
  printf("Or:\n");
  printf(" -s [symbol]\tSymbol we want to change the value (address)\n");
  printf(" -v [value] \tNew value (address) for symbol\n");

  return;
}






Wednesday, March 20, 2013

Interrupt Virtualization


Reference:
- Hardware Assisted Virtualization: Intel Virtualization Technology, by Matías Zabaljáuregui


VMX support for handling interrupts



External interrupt virtualization



Example of handling external interrupts:

1. Guest setup:
The VMM must first configure the guest so that an external interrupt causes a VM exit
(set the "external-interrupt exiting" bit in the VMCS)

2. How the CPU handles an external interrupt
Interrupts are masked automatically by clearing RFLAGS.IF.
If the VMM uses the acknowledge-interrupt-on-exit feature:
The processor acknowledges the interrupt, retrieves the host vector, and saves the interrupt in the VM-exit interruption-information field before transferring control to the VMM.
In other words, before handing control to the VMM, the CPU automatically retrieves the host vector and saves the current state into the VMCS.
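
With acknowledge-interrupt-on-exit, the VMM finds the host vector in the VM-exit interruption-information field. The sketch below decodes that 32-bit value using the layout from the Intel SDM (bits 7:0 vector, bits 10:8 interruption type with 0 meaning external interrupt, bit 31 valid). The macro and function names are hypothetical, and reading the field out of the VMCS (with VMREAD) is left out:

```c
#include <stdint.h>

#define INTR_INFO_VECTOR_MASK  0x000000ffu  /* bits 7:0: vector */
#define INTR_INFO_TYPE_MASK    0x00000700u  /* bits 10:8: interruption type */
#define INTR_INFO_TYPE_SHIFT   8
#define INTR_INFO_VALID        0x80000000u  /* bit 31: field is valid */
#define INTR_TYPE_EXT_INTR     0u           /* type 0: external interrupt */

/* Returns the host vector (0..255) when the field describes a valid
 * external interrupt, or -1 otherwise. */
int host_vector_from_exit_info(uint32_t info)
{
    uint32_t type = (info & INTR_INFO_TYPE_MASK) >> INTR_INFO_TYPE_SHIFT;

    if (!(info & INTR_INFO_VALID) || type != INTR_TYPE_EXT_INTR)
        return -1;
    return (int)(info & INTR_INFO_VECTOR_MASK);
}
```

For example, the value 0x80000030 decodes to host vector 0x30, while the same value without bit 31 set is rejected.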

3. The VMM handles the interrupt
As in the case above, if acknowledge-interrupt-on-exit is set, the VMM can use the host vector directly to call the corresponding interrupt handler. At this point the VM is not directly involved.
If it is not set, the VMM must re-enable interrupts (by setting RFLAGS.IF) to allow vectoring of external interrupts through the monitor/host IDT. Two cases then arise:

[a] Host owned I/O devices
If the device belongs to the VMM, the corresponding ISR is called, just as with an ordinary interrupt service routine. When the ISR finishes, however, the VMM checks whether this interrupt requires other virtual interrupts to be generated (for example, after the VMM receives a packet, it needs to forward it to the VM's virtual NIC).
For each affected virtual device, the VMM then injects a virtual external interrupt event.

[b] Direct pass-through I/O devices
If the device belongs to the VM, the ISR of the driver inside the VM handles the interrupt.
- The interrupt causes a VM exit to the VMM and is vectored through the host IDT to a registered handler (presumably a handler dedicated to pass-through devices).
- The VMM then maps the host vector to the corresponding guest vector to inject a virtual interrupt into the assigned VM.
- The guest software does an EOI write to the virtual interrupt controller.

How is a virtual interrupt injected?


4. Generating a virtual interrupt
[a] First, check the processor interruptibility state.
[b] If the processor is "not interruptible", the VMM uses the "interrupt-window exiting" feature: when the processor becomes interruptible, a VM exit is generated to notify the VMM.
[c] Check the state of the virtual interrupt controller
- Is the local APIC in use? Is the interrupt routed through the local vector table (LVT)? Does the I/O APIC mask the virtual interrupt?
[d] Priority:
Because virtual interrupts are queued in the VMM and delivered on VM entry, the VMM can implement different priority schemes.
[e] Update the virtual interrupt controller state
"When the above checks have passed, before generating the virtual interrupt to the guest, the VMM updates the virtual interrupt controller state (Local-APIC, IO-APIC and/or PIC) to reflect assertion of the virtual interrupt."
[f] Inject the virtual interrupt on VM entry
The VMM generates the virtual interrupt by setting fields in the VMCS.
On VM entry, the processor vectors through the guest IDT to the corresponding handler, which completes the handling of the interrupt.
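
Steps [a], [b] and [f] can be sketched as a small decision function plus the value written to the VM-entry interruption-information field. This is a simplified illustration, not a complete VMM: the struct and function names are hypothetical, the guest state is assumed to have been read from the VMCS already, and steps [c] to [e] (virtual interrupt controller checks and priority) are omitted. The bit positions (RFLAGS.IF is bit 9; blocking by STI and by MOV SS are bits 0 and 1 of the interruptibility state; bit 31 marks the field valid, type 0 is an external interrupt) follow the Intel SDM.

```c
#include <stdint.h>

#define RFLAGS_IF               (1u << 9)    /* guest interrupt flag */
#define GUEST_BLOCKED_BY_STI    (1u << 0)    /* interruptibility-state bit 0 */
#define GUEST_BLOCKED_BY_MOVSS  (1u << 1)    /* interruptibility-state bit 1 */
#define ENTRY_INTR_INFO_VALID   0x80000000u  /* bit 31 of the entry field */

/* Hypothetical snapshot of the guest state the VMM read from the VMCS. */
struct guest_intr_state {
    uint64_t rflags;
    uint32_t interruptibility;
};

enum inject_action { INJECT_NOW, REQUEST_INTERRUPT_WINDOW };

/* Steps [a] and [b]: inject only if the guest can take an interrupt now;
 * otherwise ask for a VM exit once it becomes interruptible. */
enum inject_action choose_inject_action(const struct guest_intr_state *g)
{
    uint32_t blocked = g->interruptibility &
                       (GUEST_BLOCKED_BY_STI | GUEST_BLOCKED_BY_MOVSS);

    if (!(g->rflags & RFLAGS_IF) || blocked)
        return REQUEST_INTERRUPT_WINDOW;
    return INJECT_NOW;
}

/* Step [f]: value for the VM-entry interruption-information field.
 * Interruption type 0 (external interrupt) leaves bits 10:8 clear. */
uint32_t entry_intr_info(uint8_t guest_vector)
{
    return ENTRY_INTR_INFO_VALID | guest_vector;
}
```

A guest with RFLAGS = 0x202 and no blocking can be injected immediately, and entry_intr_info(0x41) yields 0x80000041.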



Intel's Hardware Assisted Virtualization Technology

Reference:
- Hardware Assisted Virtualization: Intel Virtualization Technology, by Matías Zabaljáuregui


Background
- Intel processors use 4 privilege levels (0 - 3): 0 is the most privileged and 3 is the least privileged (user-level programs).
- For an OS to control a CPU, it must run at privilege level 0.

- 0/1/3 model: let the VMM run at level 0, the guest VM kernel at level 1, and the guest VM user space at level 3. This is called ring deprivileging. However, ring deprivileging causes many challenges (e.g., every component, such as the page tables, must be aware of the additional level 1, whereas most components only know about levels 0 and 3).
- Intel VT-x aims to solve these challenges by allowing the guest to run at its intended level (ring 0); guest software is then constrained not by privilege level but by VMX non-root operation.

- Drawback of privilege-based protection --> higher overhead
IA-32 uses SYSENTER and SYSEXIT to support low-latency system calls. In a guest, however, executing SYSENTER/SYSEXIT causes a transition to the VMM, so the VMM must emulate every guest execution of SYSENTER/SYSEXIT. --> Hence Intel VT-x.

- Interrupt Virtualization
The IA-32 architecture allows the OS to mask and unmask external interrupts, preventing delivery when it is not ready for them. The VMM needs to control this masking and deny the guest's attempts to change it. Such a mechanism can hurt performance, since an OS masks and unmasks interrupts frequently, and it complicates the design of the VMM.

- Address-space compression
The VMM must have control of some portion of the guest's virtual address space for its control structures (these include the IDT and GDT). Guest accesses to the IDT or GDT generate transitions to the VMM so that it can do further handling.

Given the drawbacks above, the two current solutions are explained below.

Paravirtualization vs. binary translation
- Paravirtualization: source-level modification of the guest OS, as in Xen. However, it does not support MS Windows.
- Binary translation: modifying guest-OS binaries directly, as in VMware and Virtual PC. It supports a broader range of OSes but has higher overhead.
* VT-x is designed so that binary translation is no longer needed and the VMM can support more operating systems.


Virtual Machine eXtension (VMX)
The VMM runs in VMX root operation and the guest OS runs in VMX non-root operation. Transitions to VMX non-root are called VM entries, while transitions to VMX root are called VM exits.

- VMX non-root: although the guest is in ring 0, VMX non-root operation places restrictions on it, so guest software stays under the control of the VMM, which runs in VMX root operation.


- The VMM executes VMXON to enter VMX root operation.
- The VMM puts the guest software into the VM through VM entries (VMLAUNCH / VMRESUME). The VMM regains control on a VM exit.
- On a VM exit, the VMM can take appropriate action by reading the cause of the VM exit from the VMCS.
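
The last step, reading the exit cause and acting on it, can be sketched as a dispatch on the basic exit reason (the low 16 bits of the exit-reason field). The reason numbers below are the ones listed in the Intel SDM; the action enum and the function name are hypothetical, and a real VMM handles many more reasons:

```c
#include <stdint.h>

/* Basic exit reasons as numbered in the Intel SDM (a small subset). */
enum {
    EXIT_REASON_EXTERNAL_INTERRUPT = 1,
    EXIT_REASON_INTERRUPT_WINDOW   = 7,
    EXIT_REASON_CPUID              = 10,
    EXIT_REASON_HLT                = 12
};

/* Hypothetical actions a VMM might take after a VM exit. */
enum vmm_action {
    HANDLE_HOST_INTERRUPT,   /* run the host handler for the vector */
    INJECT_PENDING_VIRQ,     /* guest is interruptible again */
    EMULATE_INSTRUCTION,     /* emulate the instruction, then resume */
    UNHANDLED_EXIT
};

enum vmm_action dispatch_vm_exit(uint32_t exit_reason)
{
    /* Bits 15:0 hold the basic exit reason; upper bits are flags. */
    switch (exit_reason & 0xffffu) {
    case EXIT_REASON_EXTERNAL_INTERRUPT:
        return HANDLE_HOST_INTERRUPT;
    case EXIT_REASON_INTERRUPT_WINDOW:
        return INJECT_PENDING_VIRQ;
    case EXIT_REASON_CPUID:
    case EXIT_REASON_HLT:
        return EMULATE_INSTRUCTION;
    default:
        return UNHANDLED_EXIT;
    }
}
```

After the action completes, the VMM would normally re-enter the guest with VMRESUME.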


Virtual Machine Control Structure (VMCS)
Each logical CPU has a corresponding VMCS region; the VMCS is the bridge through which the host and the VM communicate.
On a VM exit, the host learns the reason for the exit from the VMCS.
On a VM entry (VMLAUNCH or VMRESUME), the host can also pass in events, such as interrupts and exceptions, through the VMCS.

- Each logical processor is associated with a VMCS region in memory. Software makes a VMCS active by executing VMPTRLD.
- The format of a VMCS region includes a header (revision identifier and abort indicator) and the VMCS data.
- The VMCS data includes:
1. Guest-state area
2. Host-state area
3. VM-execution control fields
4. VM-exit control fields
5. VM-entry control fields
6. VM-exit information fields
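
The region format described above (a header followed by data) can be pictured as a C struct. This is only a sketch: the struct and function names are hypothetical, the 4096-byte size is an assumed upper bound (the real size is reported by the IA32_VMX_BASIC MSR), and the data area is opaque because software is expected to access it with VMREAD/VMWRITE rather than through memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VMCS_REGION_SIZE 4096  /* assumed size for this sketch */

struct vmcs_region {
    uint32_t revision_id;      /* bits 30:0: VMCS revision identifier */
    uint32_t abort_indicator;  /* VMX-abort indicator */
    uint8_t  data[VMCS_REGION_SIZE - 8];  /* implementation-specific layout */
};

/* Zero the region and store the revision identifier, as software does
 * before making the region current with VMPTRLD. */
void vmcs_region_init(struct vmcs_region *r, uint32_t revision_id)
{
    memset(r, 0, sizeof(*r));
    r->revision_id = revision_id & 0x7fffffffu;
}
```

The data area starts 8 bytes into the region, right after the two header words.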

x86 instruction

x86 http://en.wikipedia.org/wiki/X86_instruction_listings
STI: Set interrupt flag
IRET: Return from interrupt





Tuesday, February 19, 2013

How to use volatile



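When to use it: if a global variable is declared in one place but may also be modified by another program without the compiler knowing about it, the variable needs to be declared volatile.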


Use

A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables could change:

  • Memory-mapped peripheral registers
  • Global variables modified by an interrupt service routine
  • Global variables within a multi-threaded application

http://www.embedded.com/electronics-blogs/beginner-s-corner/4023801/Introduction-to-the-Volatile-Keyword
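
As an illustration of the second case in the list above, the sketch below uses a signal handler as a stand-in for an interrupt service routine. It is a minimal example using only ISO C signal() and raise(); the function names are made up for this example. Without volatile, the compiler would be free to read the flag once and assume it never changes inside the loop.

```c
#include <signal.h>

/* Written by the handler, read by normal code: must be volatile. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Installs the handler, raises the signal, and waits for the flag.
 * Returns 1 once the flag has been observed, or -1 if the handler
 * could not be installed. */
int wait_for_flag_demo(void)
{
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;

    raise(SIGINT);

    /* volatile forces a fresh read of got_signal on every iteration. */
    while (!got_signal)
        ;

    return 1;
}
```

The same pattern applies to a memory-mapped status register: declare the pointer target volatile so that every poll actually reads the device.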