2013年6月17日 星期一

QuickData DMA performance

Recently we have been working on porting the Marlin and Ladon architectures from PLX's platform to a more vendor-neutral one, using an Intel CPU with its DMA engine and on-board NTB support. The latest Linux kernel (3.9) includes the NTB driver, as well as an Ethernet driver based on NTB.

However, it seems that the Ethernet driver does not leverage the DMA channel to move data between back-to-back connected hosts. To understand its performance, I've benchmarked Intel's DMA engine, a.k.a. QuickData Technology.


System Description:
The test machine has an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz with 8GB of memory. For both experiments, the PCIe max payload size is set to 128 bytes.

Two experiments are shown:
1) Send a total payload of 4MB (non-contiguous) with different transfer sizes per DMA entry. E.g. a 4KB transfer size results in 1024 entries to make up the 4MB. (The X-axis is in KB, not bytes.)
2) For a given transfer size per entry, batch a number of entries per DMA transaction. E.g. with a 4KB transfer size per entry, batch 1024 entries into one DMA transaction.


Short Summary:
I did not set up the NTB, so these experiments are local-to-local memory copies. The best performance is around 54Gbps, using a transfer size > 4KB per entry to copy 4MB of data. Copying 1KB of data in a single entry takes 8 us. Compared to PLX's DMA, which achieves around 20Gbps of local-to-local memory copy bandwidth at a 128B MaxPayload size, Intel's DMA outperforms it by a wide margin.

The throughput increases as the transfer size per entry increases. I think this is due to the maximum number of outstanding read requests, usually set to 32. If all PCIe TLPs must be broken down into the 128-byte maximum payload size, then 32 outstanding read requests cover 128B * 32 = 4KB, which explains why the throughput saturates at a 4KB transfer size per entry.

I did not test the effect of address alignment with Intel's DMA; however, we've found that address alignment matters a lot to performance with PLX's DMA engine.

Source code:
#include <linux/sched.h>
#include <linux/pci.h>
#include <linux/ioport.h>
#include <linux/init.h>
#include <linux/dmi.h>
#include <linux/slab.h>
#include <linux/module.h>
#include <linux/dmaengine.h>
#include <linux/async_tx.h>
#include <linux/kernel.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>
#include <linux/time.h>

#define RAND_BASE 1024
#define GB  (1<<30)
#define MB  (1<<20)
#define KB  (1<<10)


MODULE_LICENSE("GPL");

/* Optional completion callback; param carries the descriptor. */
static void callback(void *param)
{
    struct dma_async_tx_descriptor *tx =
        (struct dma_async_tx_descriptor *)param;

    if (!tx)
        printk("NULL param\n");
}
/* Modeled on the kernel's dma_async_memcpy_buf_to_buf(): map both
 * buffers, prepare a memcpy descriptor, and submit without waiting. */
dma_cookie_t
itri_dma_async_memcpy(struct dma_chan *chan, void *dest,
                        void *src, size_t len)
{
        struct dma_device *dev = chan->device;
        struct dma_async_tx_descriptor *tx;
        dma_addr_t dma_dest, dma_src;
        dma_cookie_t cookie;
        unsigned long flags;

        dma_src = dma_map_single(dev->dev, src, len, DMA_TO_DEVICE);
        dma_dest = dma_map_single(dev->dev, dest, len, DMA_FROM_DEVICE);

        flags = DMA_CTRL_ACK |
                DMA_COMPL_SRC_UNMAP_SINGLE |
                DMA_COMPL_DEST_UNMAP_SINGLE;

        tx = dev->device_prep_dma_memcpy(chan, dma_dest, dma_src, len, flags);

        if (!tx) {
                dma_unmap_single(dev->dev, dma_src, len, DMA_TO_DEVICE);
                dma_unmap_single(dev->dev, dma_dest, len, DMA_FROM_DEVICE);
                return -ENOMEM;
        }

        tx->callback = NULL;
        //tx->callback = callback;
        //tx->callback_param = tx; //william
        cookie = tx->tx_submit(tx);

        preempt_disable();
        __this_cpu_add(chan->local->bytes_transferred, len);
        __this_cpu_inc(chan->local->memcpy_count);
        preempt_enable();

    return cookie;
}

int dma_throughput_test(int tr_size, unsigned long total_size)
{
    int i, num_entry;
    void *src, *dst, *ran_src, *ran_dst;
    unsigned long offset, total_usec;
    static struct timeval  time_e, time_s;
    struct dma_chan *chan;
    dma_cookie_t cookie;

    dmaengine_get();
    chan = dma_find_channel(DMA_MEMCPY);
    if (!chan) {
        dmaengine_put();
        return -ENODEV;
    }

    num_entry = total_size / tr_size;
    if (num_entry <= 0) {
        dmaengine_put();
        return -EINVAL;
    }

    src = kmalloc(tr_size * RAND_BASE, GFP_KERNEL);
    if (!src)
        return -ENOMEM;
    dst = kmalloc(tr_size * RAND_BASE, GFP_KERNEL);
    if (!dst) {
        kfree(src);
        return -ENOMEM;
    }

    memset(dst, 0xff, tr_size * RAND_BASE);
    do_gettimeofday(&time_s);

    for (i = 0; i < num_entry; i++) {
        /* rotate through the RAND_BASE slots so consecutive entries
         * hit different (non-contiguous) addresses */
        offset = (i % RAND_BASE) * tr_size;
        if (offset >= RAND_BASE * tr_size) {
            pr_err("Invalid addr\n");
            kfree(src);
            kfree(dst);
            return -EINVAL;
        }
        ran_src = src + offset;
        ran_dst = dst + offset;

        cookie = itri_dma_async_memcpy(chan, ran_dst, ran_src, tr_size);
        if (cookie < 0) {
            printk("Invalid cookie\n");
            kfree(src);
            kfree(dst);
            dmaengine_put();
            return -EINVAL;
        }
       // dma_sync_wait(chan, cookie);
    }

    dma_issue_pending_all();
    dma_sync_wait(chan, cookie);

    do_gettimeofday(&time_e);

    total_usec = (time_e.tv_sec - time_s.tv_sec) * 1000 * 1000
            + (time_e.tv_usec - time_s.tv_usec);

    printk(KERN_EMERG "Local2Local DMA %6d entry, per entry size %3dKB, "
            "total size: %6ldKB, take %6ld us, bandwidth %6ldMbps\n",
        num_entry, tr_size / (1<<10), total_size / (1<<10),
        total_usec, total_size * 8 / (total_usec)
    );

    kfree(src);
    kfree(dst);

    dma_issue_pending_all(); //flush all pending descriptor
    dmaengine_put();

    return 0;
}


void test_fixed_total_entry(void)
{
    // assume run time should be the same
    dma_throughput_test(1 * KB, 10 * MB);
    dma_throughput_test(2 * KB, 20 * MB);
    dma_throughput_test(4 * KB, 40 * MB);
    dma_throughput_test(8 * KB, 80 * MB);
    dma_throughput_test(16 * KB, 160 * MB);
    return;
}

void test_throughput(void)
{
    dma_throughput_test(128, 4UL * GB);
    dma_throughput_test(256, 4UL * GB);
    dma_throughput_test(512, 4UL * GB);
    dma_throughput_test(1024, 4UL * GB);
    return;
}

static int itri_dma_init(void)
{

    test_fixed_total_entry();
    test_throughput();
    return 0;
}

static void itri_dma_exit(void)
{
    printk(KERN_INFO "unloading itri dma\n");
    return;
}

module_init(itri_dma_init);
module_exit(itri_dma_exit);


System supporting NTB (Non-Transparent Bridge)

I'm in the process of purchasing a set of machines to run my NTB experiments on Intel hardware, so I did a few searches through Intel's specs for this cool feature. It seems that currently no vendor other than Intel ships an NTB-capable motherboard. My local vendor even made a mistake by giving me a machine with an NTB-capable CPU but a server board (R1000GZ/GL) that is not NTB-capable. It turns out I have to check the specs carefully and tell them exactly what I need.

NTB-capable CPU: E5-2620


Vol1 and Vol2:
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-1600-2600-vol-1-datasheet.html
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-1600-2600-vol-2-datasheet.pdf

NTB: in chapter 3.3
QuickData: DMA channel support in chapter 3.4


NTB-capable Server Board: S1600JP4

http://download.intel.com/support/motherboards/server/sb-s1600jp/sb/g68018004_s1600jp_tps_r1_4.pdf


Intel® Server Board S1600JP4 BIOS

Root Complex Peer-to-Peer support

In section 3.4.1 of the S1600JP4 TPS:
"The Intel® C600 PCH provides up to eight PCI Express* Root Ports, supporting the PCI..."
http://download.intel.com/support/motherboards/server/sb-s1600jp/sb/g68018004_s1600jp_tps_r1_4.pdf

Peer-to-peer configurations are mentioned in the Processor Integrated I/O (IIO) configuration chapter:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-1600-2600-vol-2-datasheet.pdf

Intel C600 PCH
http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/c600-series-chipset-thermal-guide.pdf

Other chipsets with a P2P root complex: the Intel MCH 5100, 5400, and 7520 all have a peer-to-peer root complex.
http://www.intel.com/content/dam/doc/datasheet/5400-chipset-memory-controller-hub-datasheet.pdf
However, I'm not 100% sure these support a peer-to-peer root complex.


References

Intel's forum
http://communities.intel.com/message/194756#194756

Server Board with NTB support
S1600JP/S2600JF/S2600CO/S2600WP

Server board without NTB support
S2600CP, S2600GZ/S2600GL, R1000GZ/GL server
If your product doesn't have JP, JF, CO, or WP in it, then it does not support NTB.

Jon Mason’s blog
https://github.com/jonmason/ntb/wiki





2013年6月5日 星期三

Trap/Redirect a function call in Linux kernel module

Trapping a kernel module function call

Today I tested how to replace one function with another inside a kernel module, mainly following
http://www.phrack.org/issues.html?issue=68&id=11
The tool it provides, elfchger.c, targets the 32-bit ELF header, so it had to be modified for 64-bit ELF.

Step 1: the original module -- evil() is never called

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>

MODULE_LICENSE("GPL");

int fun(void) {
    printk(KERN_ALERT "calling fun!");
    return 0;
}

int evil(void) {
    printk(KERN_ALERT "===== EVIL ====\n");
    return 0;
}

int init(void) {
    printk(KERN_ALERT "Init Original!");
    fun();
    return 0;
}

void clean(void) {
    printk(KERN_ALERT "Exit Original!");
    return;
}
module_init(init);
module_exit(clean);


Step 2: find the corresponding addresses

[root@new_host1 kprobe]# objdump -t orig.ko

orig.ko:     file format elf64-x86-64

SYMBOL TABLE:
000000000000001b g     F .text    000000000000001b evil
0000000000000000 g     O .gnu.linkonce.this_module    0000000000000280 __this_module
0000000000000056 g     F .text    0000000000000019 cleanup_module
0000000000000036 g     F .text    0000000000000020 init_module
0000000000000056 g     F .text    0000000000000019 clean
0000000000000000 g     F .text    000000000000001b fun
0000000000000000         *UND*    0000000000000000 mcount
0000000000000000         *UND*    0000000000000000 printk
0000000000000036 g     F .text    0000000000000020 init

Step 3: use the patched elfchger.c to modify fun, setting its value to evil's address (0x1b)

[root@new_host1 kprobe]# ./elfchger -s fun -v 1b orig.ko
[+] Opening orig.ko file...
[+] Reading Elf header...
    >> Done!
[+] Finding ".symtab" section...
    >> Found at 0xc618
[+] Finding ".strtab" section...
    >> Found at 0xc658
[+] Getting symbol' infos:
    >> Symbol found at 0x159c8
    >> Index in symbol table: 0x1c
[+] Replacing 0x00000000 with 0x0000001b... done!


Step 4: check the symbol table again; fun has now been changed to 0x1b

000000000000001b g     F .text    000000000000001b evil
0000000000000000 g     O .gnu.linkonce.this_module    0000000000000280 __this_module
0000000000000056 g     F .text    0000000000000019 cleanup_module
0000000000000036 g     F .text    0000000000000020 init_module
0000000000000056 g     F .text    0000000000000019 clean
000000000000001b g     F .text    000000000000001b fun
0000000000000000         *UND*    0000000000000000 mcount
0000000000000000         *UND*    0000000000000000 printk
0000000000000036 g     F .text    0000000000000020 init


Step 5: verify

> insmod orig.ko
> dmesg
[  692.169913] Init Original!
[  692.169975] ===== EVIL ====


==== patch elfchger.c for 64bit system ====

/*
 * elfchger.c by styx^ <the.styx@gmail.com> (based on truff's code)
 * June 6, 2013, Cheng-Chun Tu <u9012063@gmail.com>
 * Patch for 64 bit system
 * Script with two features:
 *
 * Usage 1: Change the symbol name value (address) in a kernel module.
 * Usage 2: Change the symbol binding (from local to global) in a kernel
 *          module.
 *
 * Usage:
 * 1: ./elfchger -s [symbol] -v [value] <module_name>
 * 2: ./elfchger -g [symbol] <module_name>
 */

#include <stdlib.h>
#include <stdio.h>
#include <elf.h>
#include <string.h>
#include <getopt.h>

int ElfGetSectionByName (FILE *fd, Elf64_Ehdr *ehdr, char *section,
                         Elf64_Shdr *shdr);

int ElfGetSectionName (FILE *fd, Elf64_Word sh_name,
                       Elf64_Shdr *shstrtable, char *res, size_t len);

Elf64_Off ElfGetSymbolByName (FILE *fd, Elf64_Shdr *symtab,
                       Elf64_Shdr *strtab, char *name, Elf64_Sym *sym);

void ElfGetSymbolName (FILE *fd, Elf64_Word sym_name,
                       Elf64_Shdr *strtable, char *res, size_t len);

unsigned long ReorderSymbols (FILE *fd, Elf64_Shdr *symtab,
                       Elf64_Shdr *strtab, char *name);

int ReorderRelocation(FILE *fd, Elf64_Shdr *symtab,
                       Elf64_Shdr *strtab, char *name, Elf64_Sym *sym);

int ElfGetSectionByIndex (FILE *fd, Elf64_Ehdr *ehdr, Elf64_Half index,
                          Elf64_Shdr *shdr);

void usage(char *cmd);

int main (int argc, char **argv) {

  FILE *fd;
  Elf64_Ehdr hdr;
  Elf64_Shdr symtab, strtab;
  Elf64_Sym sym;
  Elf64_Off symoffset;
  Elf64_Addr value;

  unsigned long new_index = 0;
  int gflag = 0, vflag = 0, fflag = 0;
  char *sym_name;
  int sym_value = 0;

  long sym_off, str_off;
  int opt;

  if ( argc != 4 && argc != 6 ) {
    usage(argv[0]);
    exit(-1);
  }

  while ((opt = getopt(argc, argv, "vsg")) != -1) {

    switch (opt) {

      case 'g':

        if( argc-1 < optind) {
        printf("[-] You must specify symbol name!\n");
        usage(argv[0]);
        exit(-1);
        }

        gflag = 1;
        sym_name = argv[optind];

        break;

      case 's':

        if( argc-1 < optind) {
          printf("[-] You must specify symbol name!\n");
          usage(argv[0]);
          exit(-1);
        }

        fflag = 1;
        sym_name = argv[optind];

        break;

      case 'v':

        if( argc-1 < optind) {
          printf("[-] You must specify new symbol address\n");
          usage(argv[0]);
          exit(-1);
        }

        vflag = 1;
        sym_value = strtol(argv[optind], (char **) NULL, 16);

        break;

      default:
        usage(argv[0]);
        exit(-1);
    }
  }

  printf("[+] Opening %s file...\n", argv[argc-1]);

  fd = fopen (argv[argc-1], "r+");

  if (fd == NULL) {

    printf("[-] File \"%s\" not found!\n", argv[1]);
    exit(-1);
  }

  printf("[+] Reading Elf header...\n");

  if (fread (&hdr, sizeof (Elf64_Ehdr), 1, fd) < 1) {

    printf("[-] Elf header corrupted!\n");
    exit(-1);
  }

  printf("\t>> Done!\n");

  printf("[+] Finding \".symtab\" section...\n");

  sym_off = ElfGetSectionByName (fd, &hdr, ".symtab", &symtab);

  if (sym_off == -1) {

    printf("[-] Can't get .symtab section\n");
    exit(-1);
  }

  printf("\t>> Found at 0x%x\n", (int )sym_off);
  printf("[+] Finding \".strtab\" section...\n");

  str_off = ElfGetSectionByName (fd, &hdr, ".strtab", &strtab);

  if (str_off  == -1) {

    printf("[-] Can't get .strtab section!\n");
    exit(-1);
  }

  printf("\t>> Found at 0x%x\n", (int )str_off);

  printf("[+] Getting symbol' infos:\n");

  symoffset = ElfGetSymbolByName (fd, &symtab, &strtab, sym_name, &sym);

  if ( (int) symoffset == -1) {

    printf("[-] Symbol \"%s\" not found!\n", sym_name);
    exit(-1);
  }

  if ( gflag == 1 ) {

    if ( ELF64_ST_BIND(sym.st_info) == STB_LOCAL ) {

      unsigned char global;
      unsigned long offset = 0;

      printf("[+] Reordering symbols:\n");

      new_index = ReorderSymbols(fd, &symtab, &strtab, sym_name);

      printf("[+] Updating symbol' infos:\n");

      symoffset = ElfGetSymbolByName(fd, &symtab, &strtab, sym_name, &sym);

      if ( (int) symoffset == -1) {

        printf("[-] Symbol \"%s\" not found!\n", sym_name);
        exit(-1);
      }

      /* st_info immediately follows st_name (an Elf64_Word) in Elf64_Sym */
      offset = symoffset + sizeof(Elf64_Word);

      printf("\t>> Replacing flag 'LOCAL' located at 0x%x with 'GLOBAL'\
              \n", (unsigned int)offset);

      if (fseek (fd, offset, SEEK_SET) == -1) {

        perror("[-] fseek: ");
        exit(-1);
      }

      global = ELF64_ST_INFO(STB_GLOBAL, STT_FUNC);

      if (fwrite (&global, sizeof(unsigned char), 1, fd) < 1) {

        perror("[-] fwrite: ");
        exit(-1);
      }

      printf("[+] Updating symtab infos at 0x%x\n", (int )sym_off);

      if ( fseek(fd, sym_off, SEEK_SET) == -1 ) {

        perror("[-] fseek: ");
        exit(-1);
      }

      symtab.sh_info = new_index; // updating sh_info with the new index
                                  // in symbol table.

      if( fwrite(&symtab, sizeof(Elf64_Shdr), 1, fd) < 1 )  {

        perror("[-] fwrite: ");
        exit(-1);
      }

    } else {

      printf("[-] Already global function!\n");
    }

  } else if ( fflag == 1 && vflag == 1 ) {

      /* widen the 32-bit value safely instead of memcpy'ing
       * sizeof(Elf64_Addr) bytes out of a 4-byte int */
      value = (Elf64_Addr) sym_value;

      printf("[+] Replacing 0x%.8x with 0x%.8x... ", sym.st_value, value);

      //if (fseek (fd, symoffset+sizeof(Elf64_Word), SEEK_SET) == -1) {
        //william
    if (fseek (fd, symoffset + sizeof(Elf64_Word) + 2 * sizeof(unsigned char) + sizeof(Elf64_Half), SEEK_SET) == -1) {

        perror("[-] fseek: ");
        exit(-1);
      }

      if (fwrite (&value, sizeof(Elf64_Addr), 1, fd) < 1 )  {

        perror("[-] fwrite: ");
        exit(-1);
      }

      printf("done!\n");

      fclose (fd);
}

return 0;
}

/* This function returns the offset relative to the symbol name "name" */

Elf64_Off ElfGetSymbolByName(FILE *fd, Elf64_Shdr *symtab,
        Elf64_Shdr *strtab, char *name, Elf64_Sym *sym) {

  unsigned int i;
  char symname[255];

  for ( i = 0; i < (symtab->sh_size/symtab->sh_entsize); i++) {

    if (fseek (fd, symtab->sh_offset + (i * symtab->sh_entsize),
               SEEK_SET) == -1) {

      perror("\t[-] fseek: ");
      exit(-1);
    }

    if (fread (sym, sizeof (Elf64_Sym), 1, fd) < 1) {

      perror("\t[-] read: ");
      exit(-1);
    }

    memset (symname, 0, sizeof (symname));

    ElfGetSymbolName (fd, sym->st_name, strtab, symname, sizeof (symname));

    if (!strcmp (symname, name)) {

      printf("\t>> Symbol found at 0x%x\n",
                    symtab->sh_offset + (i * symtab->sh_entsize));

      printf("\t>> Index in symbol table: 0x%x\n", i);

      return symtab->sh_offset + (i * symtab->sh_entsize);
    }
  }

  return -1;
}

/* This function returns the new index of symbol "name" inside the symbol
 * table after re-ordering. */

unsigned long ReorderSymbols (FILE *fd, Elf64_Shdr *symtab,
              Elf64_Shdr *strtab, char *name) {

  unsigned int i = 0, j = 0;
  char symname[255];
  Elf64_Sym *all;
  Elf64_Sym temp;
  unsigned long new_index = 0;
  unsigned long my_off = 0;

  printf("\t>> Starting:\n");

  all = (Elf64_Sym *) malloc(sizeof(Elf64_Sym) *
                      (symtab->sh_size/symtab->sh_entsize));

  if ( all == NULL ) {

    return -1;

  }

  memset(all, 0, sizeof(Elf64_Sym) * (symtab->sh_size/symtab->sh_entsize));

  my_off = symtab->sh_offset;

  for ( i = 0; i < (symtab->sh_size/symtab->sh_entsize); i++) {

    if (fseek (fd, symtab->sh_offset + (i * symtab->sh_entsize),
                             SEEK_SET) == -1) {

      perror("\t[-] fseek: ");
      exit(-1);
    }

    if (fread (&all[i], sizeof (Elf64_Sym), 1, fd) < 1) {

      printf("\t[-] fread: ");
      exit(-1);
    }

    memset (symname, 0, sizeof (symname));

    ElfGetSymbolName(fd, all[i].st_name, strtab, symname, sizeof(symname));

    if (!strcmp (symname, name)) {

      j = i;

      continue;
    }
  }

  temp = all[j];

  for ( i = j; i < (symtab->sh_size/symtab->sh_entsize); i++ ) {

    if ( i+1 >= symtab->sh_size/symtab->sh_entsize )
      break;

    if ( ELF64_ST_BIND(all[i+1].st_info) == STB_LOCAL ) {

      printf("\t>> Moving symbol from %x to %x\n", i+1, i);

      all[i] = all[i+1];

    } else {

      new_index = i;

      printf("\t>> Moving our symbol from %d to %x\n", j, i);

      all[i] = temp;
      break;
    }
  }

  printf("\t>> Last LOCAL symbol: 0x%x\n", (unsigned int)new_index);

  if ( fseek (fd, my_off, SEEK_SET) == -1 ) {

      perror("\t[-] fseek: ");
      exit(-1);
  }

  if ( fwrite(all, sizeof( Elf64_Sym), symtab->sh_size/symtab->sh_entsize,
              fd) < (symtab->sh_size/symtab->sh_entsize )) {

      perror("\t[-] fwrite: ");
      exit(-1);
  }

  printf("\t>> Done!\n");

  free(all);

  return new_index;
}


int ElfGetSectionByIndex (FILE *fd, Elf64_Ehdr *ehdr, Elf64_Half index,
                          Elf64_Shdr *shdr) {

  if (fseek (fd, ehdr->e_shoff + (index * ehdr->e_shentsize),
             SEEK_SET) == -1) {

    perror("\t[-] fseek: ");
    exit(-1);
  }

  if (fread (shdr, sizeof (Elf64_Shdr), 1, fd) < 1) {

    printf("\t[-] Sections header corrupted");
    exit(-1);
  }

  return 0;
}


int ElfGetSectionByName (FILE *fd, Elf64_Ehdr *ehdr, char *section,
                         Elf64_Shdr *shdr) {

  int i;
  char name[255];
  Elf64_Shdr shstrtable;

  ElfGetSectionByIndex (fd, ehdr, ehdr->e_shstrndx, &shstrtable);

  memset (name, 0, sizeof (name));

  for ( i = 0; i < ehdr->e_shnum; i++) {

    if (fseek (fd, ehdr->e_shoff + (i * ehdr->e_shentsize),
               SEEK_SET) == -1) {

      perror("\t[-] fseek: ");
      exit(-1);
    }

    if (fread (shdr, sizeof (Elf64_Shdr), 1, fd) < 1) {

      printf("[-] Sections header corrupted");
      exit(-1);
    }

    ElfGetSectionName (fd, shdr->sh_name, &shstrtable,
                       name, sizeof (name));

    if (!strcmp (name, section)) {

      return ehdr->e_shoff + (i * ehdr->e_shentsize);

    }
  }

  return -1;
}

int ElfGetSectionName (FILE *fd, Elf64_Word sh_name,
                       Elf64_Shdr *shstrtable, char *res, size_t len) {

  size_t i = 0;

  if (fseek (fd, shstrtable->sh_offset + sh_name, SEEK_SET) == -1) {

    perror("\t[-] fseek: ");
    exit(-1);
  }

  /* copy the NUL-terminated name, never writing past len-1 bytes */
  while (i < len - 1) {

    int c = fgetc (fd);
    *res = (char) c;
    if (c == '\0' || c == EOF)
      break;
    i++;
    res++;

  }
  *res = '\0';

  return 0;
}


void ElfGetSymbolName (FILE *fd, Elf64_Word sym_name,
                       Elf64_Shdr *strtable, char *res, size_t len)
{
  size_t i = 0;

  if (fseek (fd, strtable->sh_offset + sym_name, SEEK_SET) == -1) {

    perror("\t[-] fseek: ");
    exit(-1);
  }

  /* copy the NUL-terminated name, never writing past len-1 bytes */
  while (i < len - 1) {

    int c = fgetc (fd);
    *res = (char) c;
    if (c == '\0' || c == EOF)
      break;
    i++;
    res++;

  }
  *res = '\0';

  return;
}

void usage(char *cmd) {

  printf("Usage: %s <option(s)> <module_name>\n", cmd);
  printf("Option(s):\n");
  printf(" -g [symbol]\tSymbol we want to change the binding as global\n");
  printf("Or:\n");
  printf(" -s [symbol]\tSymbol we want to change the value (address)\n");
  printf(" -v [value] \tNew value (address) for symbol\n");

  return;
}






2013年3月20日 星期三

Interrupt Virtualization


Reference:
- Hardware Assisted Virtualization: Intel Virtualization Technology, by Matías Zabaljáuregui


VMX support for handling interrupts



External interrupt virtualization



Example of handling external interrupts:

1. Guest setup:
The VMM must first configure the guest so that an external interrupt causes a VM exit
(set the "external-interrupt exiting" bit in the VMCS).

2. CPU handling of an external interrupt:
The interrupt is automatically masked by clearing RFLAGS.IF.
If the VMM uses the acknowledge-on-exit feature,
the processor acknowledges the interrupt, retrieves the host vector, and saves the interrupt in the exit-interruption-information field before transitioning control to the VMM.
In other words, before handing control to the VMM, the CPU automatically fetches the host vector and saves the current state into the VMCS.

3. The VMM handles the interrupt
As above, if acknowledge-interrupt-on-exit is set, the VMM can use the host vector directly to invoke the corresponding interrupt handler; the VM is not directly involved.
If it is not set, the VMM must re-enable interrupts (by setting RFLAGS.IF) to allow vectoring of external interrupts through the monitor/host IDT. Two cases arise:

[a] Host-owned I/O devices
If the device belongs to the VMM, the corresponding ISR is invoked, just like a normal interrupt service routine. When the ISR finishes, however, the VMM checks whether this interrupt requires generating further virtual interrupts (for example, after the VMM receives a packet, it must forward it to a VM's virtual NIC).
For each "affected virtual device", the VMM injects a virtual external interrupt event.

[b] Direct pass-through I/O devices
If the device is assigned to a VM, the ISR of the driver inside the VM handles the interrupt.
- The interrupt causes a VM exit to the VMM and vectors through the host IDT to a registered handler (presumably one dedicated to pass-through devices).
- The VMM then maps the host vector to the corresponding guest vector to inject a virtual interrupt into the assigned VM.
- The guest software does the EOI write to the virtual interrupt controller.

How is a virtual interrupt injected?


4. Generating a virtual interrupt
[a] First check the processor interruptibility state.
[b] If the processor is "not interruptible", the VMM uses the "interrupt-window exiting" feature; that is, a VM exit is generated to notify the VMM once the processor becomes interruptible.
[c] Check the state of the virtual interrupt controller:
- Is a local APIC in use? Is the interrupt routed through the local vector table (LVT)? Does the I/O APIC mask the virtual interrupt?
[d] Priority:
Because virtual interrupts are queued in the VMM and delivered on VM entry, the VMM is free to implement its own priority scheme.
[e] Update the virtual interrupt controller state
"When the above checks have passed, before generating the virtual interrupt to the guest, the VMM updates the virtual interrupt controller state (Local-APIC, IO-APIC and/or PIC) to reflect assertion of the virtual interrupt."
[f] Inject the virtual interrupt on VM entry
The VMM generates the virtual interrupt by programming the VMCS.
On VM entry, the processor vectors through the corresponding guest IDT entry, completing delivery of the interrupt.
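
To make step [f] concrete, here is a minimal sketch of how a VMM could inject an external interrupt through the VMCS. It assumes a KVM-style vmcs_write32() helper and the architectural layout of the VM-entry interruption-information field (vector in bits 7:0, type in bits 10:8, "valid" in bit 31); the field-encoding constant follows the Linux convention and should be treated as an assumption, not a portable API.

#include <stdint.h>

#define INTR_TYPE_EXT_INTR       (0u << 8)   /* type 0: external interrupt */
#define INTR_INFO_VALID          (1u << 31)  /* event injection is armed   */
#define VM_ENTRY_INTR_INFO_FIELD 0x4016      /* VMCS field encoding (Linux convention) */

/* Hypothetical VMCS writer; in a real VMM this wraps the VMWRITE insn. */
extern void vmcs_write32(uint32_t field, uint32_t value);

static void inject_external_interrupt(uint8_t guest_vector)
{
        uint32_t info = guest_vector | INTR_TYPE_EXT_INTR | INTR_INFO_VALID;

        /* On the next VM entry the CPU vectors through the guest IDT
         * entry for guest_vector, as if a real interrupt had arrived. */
        vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, info);
}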



Intel's Hardware Assisted Virtualization Technology

Reference:
- Hardware Assisted Virtualization: Intel Virtualization Technology, by Matías Zabaljáuregui


Background
- Intel processors use four privilege levels (0-3), 0 being the most privileged and 3 the least privileged (user-level programs).
- For an OS to control the CPU, it must run at privilege level 0.

- 0/1/3 model: run the VMM at level 0, the guest VM kernel at level 1, and guest VM user space at level 3. This is called ring deprivileging. However, ring deprivileging causes many challenges (e.g. every component, such as the paging logic, must be aware of the additional level 1; most code assumes only levels 0 and 3).
- Intel VT-x aims to solve these challenges by letting the guest run at its intended level (ring 0), while guest software is constrained not by privilege level but by non-root VMX operation.

- Drawback of privilege-based protection: higher overhead.
IA-32 uses SYSENTER and SYSEXIT to support low-latency system calls; in a guest, however, every execution of sysenter/sysexit transitions to the VMM, which must emulate it. --> Hence Intel VT-x.

- Interrupt Virtualization
The IA-32 architecture allows the OS to mask/unmask external interrupts, holding off incoming interrupts when it is not ready. The VMM must control these masks and deny access when a guest tries to touch them. This can become a performance issue, since an OS masks and unmasks interrupts frequently, and it complicates the design of the VMM.

- Ring compression
The VMM must control some amount of a guest's virtual address space for its control structures (these include the IDT and GDT). Guest accesses to the IDT or GDT generate transitions to the VMM, for the VMM to do further handling.

Given the drawbacks above, two current approaches are explained below.

Paravirtualization vs. binary translation
- Source-level modification of the guest OS, as in Xen; however, this cannot support Microsoft Windows.
- Making modifications directly to guest-OS binaries, as in VMware and Virtual PC; this supports a broader range of OSes but with higher overhead.
* VT-x was designed precisely to eliminate binary translation and to let the VMM support more operating systems.


Virtual Machine eXtension (VMX)
The VMM runs in VMX root mode and the guest OS runs in VMX non-root mode. Transitions to VMX non-root are called VM entries, while transitions to VMX root are called VM exits.

- VMX non-root: although guest software runs at ring 0, VMX operation places restrictions on it, so it remains under the control of the VMM, which runs at the VMX root level.


- The VMM executes VMXON to enter VMX root mode.
- The VMM places guest software into a VM via VM entries (VMLAUNCH/VMRESUME). The VMM regains control on VM exits.
- On a VM exit, the VMM can take the appropriate action by reading the cause of the exit from the VMCS.


Virtual Machine Control Structure (VMCS)
Each logical CPU has a corresponding VMCS region. The VMCS is the bridge for communication between the host and the VM:
on a VM exit, the host learns the reason for the exit from the VMCS;
on VM entry (or VMRESUME), the host can also use the VMCS to pass in events such as interrupts and exceptions.

- Each logical processor is associated with a VMCS region in its memory. Software makes a VMCS active by executing VMPTRLD.
- The format of a VMCS region includes header (identifier and abort indicator) and VMCS data.
- The VMCS data includes:
1. Guest-state area,
2. Host-state area,
3. VM-execution control fields
4. VM-exit control fields

x86 instructions

x86 http://en.wikipedia.org/wiki/X86_instruction_listings
STI: Set interrupt flag
IRET: Return from interrupt





2013年2月19日 星期二

How to use volatile



Premise: if a global variable is declared in one place but may also be modified by something else (another thread, an ISR, or hardware) without the compiler's knowledge, it must be declared volatile.


Use

A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables could change:

  • Memory-mapped peripheral registers
  • Global variables modified by an interrupt service routine
  • Global variables within a multi-threaded application

http://www.embedded.com/electronics-blogs/beginner-s-corner/4023801/Introduction-to-the-Volatile-Keyword
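
As a concrete illustration of the second case above (a minimal sketch; the flag name and using a signal as a stand-in ISR are my own):

#include <signal.h>

/* Written from an interrupt/signal handler, read in the main loop.
 * Without volatile, the compiler may cache the flag in a register and
 * optimize the wait loop into an infinite `while (1);`. */
static volatile sig_atomic_t data_ready = 0;

static void isr(int signo)
{
    (void) signo;
    data_ready = 1;              /* the asynchronous write */
}

int main(void)
{
    signal(SIGINT, isr);         /* stand-in for a hardware interrupt */
    while (!data_ready)
        ;                        /* volatile forces a re-read each pass */
    return 0;
}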



2011年10月25日 星期二

ELF, Linking and Loading (draft)

Discusses the ELF file format, how a program starts executing, and how the linker/loader integrates libraries with the ELF image.

ELF & ELF Loading
- Executable and Linkable Format
An ELF file contains:
1. Program header: describes the segments (data, text, bss)
.text – where read/execute code goes
.data – programmer-initialized read/write data
.bss – uninitialized data (initially zero by convention)

2. Section header: describes the sections, used for linking

- Static ELF: load the .text, .data, and .bss sections
- Dynamic ELF: the dynamic linker is loaded first, and it then loads .text, .data, and so on.

Dynamic Linking
1. Static linking is the result of the linker copying all library routines used in the program into the executable image.
2. Dynamic linking is accomplished by placing the name of a sharable library in the executable image. Actual linking with the library routines does not occur until the image is run, when both executable and library are placed in memory. An advantage of dynamic linking is that multiple programs can share a single copy of the library.



With dynamic linking, main() is not called directly; the following happens first:
1. Check the header for the libraries that are needed
2. Call mmap to map each library into the current process's address space
3. Do some bookkeeping and call main()

How is a library function called?
1. The library, e.g. libc, is usually already loaded in memory, since other programs are already using it.
2. The linker must fill in, for each function the current program calls, that function's address inside the library.
3. If the linker doesn't know where a function will end up, it creates a relocation -> resolution is deferred until later.
4. As part of loading, the linker marches through each relocation and overwrites the call target -> at this point the linker patches the function addresses referenced from .text.

Concept:
The compiler creates a jump table for all external calls
- This table is the PLT (Procedure Linkage Table); the PLT points into the GOT (Global Offset Table)
- The GOT stores <function, address in memory> pairs
So through the PLT, the linker can obtain the current in-memory address of any library function.

A detailed explanation of the PLT, step by step:
A PLT is a table of absolute addresses to functions. It is used because the link editor doesn’t know where functions in shared objects will be located. Instead, a table is created so that the program and the dynamic linker can work together to find and execute functions in shared objects. I’ve simplified the explanation a bit, but at a high level:
  1. Program calls a function in a shared object, the link editor makes sure that the program jumps to a slot in the PLT.
  2. The program sets some data up for the dynamic linker and then hands control over to it.
  3. The dynamic linker looks at the info set up by the program and fills in the absolute address of the function that was called in the PLT.
  4. Then the dynamic linker calls the function.
  5. Subsequent calls to the same function jump to the same slot in the PLT, but every time after the first call the absolute address is already in the PLT (because when the dynamic linker is invoked the first time, it fills in the absolute address in the PLT).
This approach is also called lazy binding (lazy resolution): only when a function is first called through the PLT does the linker figure out the relocation and fetch the address from the GOT. ltrace works by inserting breakpoints in the PLT to discover which library functions a program uses.
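
The lookup the dynamic linker performs can be imitated from user space with dlopen()/dlsym(): name in, absolute address out, which is exactly what ends up in the GOT slot. A sketch (compile with -ldl; the library name assumes glibc):

#include <dlfcn.h>

int main(void)
{
    /* Ask the dynamic linker for libc's puts, just as the PLT stub
     * does on the first call. */
    void *handle = dlopen("libc.so.6", RTLD_LAZY);
    if (!handle)
        return 1;

    int (*my_puts)(const char *) =
        (int (*)(const char *)) dlsym(handle, "puts");
    if (my_puts)
        my_puts("resolved at run time, like a PLT/GOT entry");

    dlclose(handle);
    return 0;
}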


reference:
1. address-space slide from cse506 Don Porter
2. Linker and Loader

2011年10月23日 星期日

CSE506: Networking Softirq and NAPI

Why top half and bottom half?
1. To minimize the time spent in an interrupt handler with other interrupts disabled.
2. Give kernel more scheduling flexibility.
3. Simplify service routines.

Top half:
1. allocate an skb, 2. copy the packet into the buffer, 3. initialize fields, 4. call the bottom half (for later processing)
Buffer preallocation using DMA:

In some cases, sk_buffs can be pre-allocated, and the network card can copy data in (via DMA) before firing the interrupt.
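
A minimal sketch of such a top half, in the style of the LDD network-driver chapter (the mydev struct and its rx fields are hypothetical placeholders for device-specific state):

#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

/* Hypothetical per-device state; a real driver reads the length and
 * payload from its registers or a pre-allocated DMA ring. */
struct mydev {
    struct net_device *netdev;
    void *rx_buf;
    int rx_len;
};

static irqreturn_t my_rx_interrupt(int irq, void *dev_id)
{
    struct mydev *dev = dev_id;
    struct sk_buff *skb;

    /* 1. allocate the skb (2 bytes of padding aligns the IP header) */
    skb = dev_alloc_skb(dev->rx_len + 2);
    if (!skb)
        return IRQ_HANDLED;
    skb_reserve(skb, 2);

    /* 2. copy the packet into the buffer */
    memcpy(skb_put(skb, dev->rx_len), dev->rx_buf, dev->rx_len);

    /* 3. initialize fields */
    skb->dev = dev->netdev;
    skb->protocol = eth_type_trans(skb, dev->netdev);

    /* 4. hand off to the bottom half (raises the NET_RX softirq) */
    netif_rx(skb);
    return IRQ_HANDLED;
}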

Softirqs:
- A hardware IRQ is the hardware interrupt line; it invokes the top half
- A soft IRQ is the associated software “interrupt” handler: the bottom half
- The kernel’s view: per-CPU work lists of <function, data> tuples
When do they run?
1. On return from an exception, interrupt, or system call
2. Each CPU has a kernel thread "ksoftirqd_CPU#" that processes pending requests
* A softirq always runs on the CPU that raised it. Why? Locking: it avoids deadlocks, reduces code complexity, and scales better!

From the programmer's point of view:
1. Only one instance of a softirq function will run on a CPU at a time
2. It doesn’t need to be reentrant
3. If interrupted, it won’t be called again by the interrupt handler -- subsequent calls are enqueued!
4. One instance can run on each CPU concurrently, though, so locks are required

Tasklet: constrained to run one at a time on any CPU (see the sketch below).
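
For comparison, deferring work to a tasklet takes only a few lines. A sketch using the classic tasklet API of that era (the handler body and names are mine):

#include <linux/interrupt.h>

static void my_bottom_half(unsigned long data)
{
    /* Runs later in softirq context (TASKLET_SOFTIRQ), at most one
     * instance at a time system-wide. */
    pr_info("tasklet ran with data %lu\n", data);
}

static DECLARE_TASKLET(my_tasklet, my_bottom_half, 0);

static irqreturn_t my_irq(int irq, void *dev_id)
{
    /* Top half: just enqueue the bottom half and return quickly. */
    tasklet_schedule(&my_tasklet);
    return IRQ_HANDLED;
}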

Softirq Priority:
HI_SOFTIRQ(high/first) 
TIMER
NET_TX
NET_RX
SCSI
TASKLET(low/last)
- Devices can decide whether their bottom half is higher or lower priority than network traffic (HI or TASKLET) 
- Transmit traffic prioritized above receive. Why? 
- The ability to send packets may stem the tide of incoming packets

Receive livelock: under extreme receive load, the CPU spends all its time in interrupt handling, so packets keep arriving but applications never get to run; NAPI's switch from interrupts to polling is the standard mitigation.

Reference:
1. cse506 course slide by Don Porter.

Non-transparent Bridging (NTB)

What is NTB (Non-Transparent Bridging)?
Introduction:
The traditional PCI architecture was designed to connect several I/O devices to a single PC; the protocol never considered multiple CPUs. A single processor enumerates the entire memory space and controls all the devices. Under this design, PCI operation breaks down as soon as another CPU joins.

NTB is a solution that lets multiple hosts share PCI devices. An NTB essentially emulates a PCI endpoint on each of its two sides and relies on address translation to move data between them. The NTB exposes a Type 0 CSR header, representing an endpoint, so the host on that side does not continue forwarding or discovery beyond it.

Downstream/upstream side:
The downstream side is the side of the NTB that faces the intelligent system (controller).
The upstream side faces the host/root complex, so the upstream host sees the NTB as a PCI endpoint device and stops enumeration there.
"Device on the downstream side is not visible from the upstream side, this allows an intelligent controller on the downstream side to manage devices here" -> all the remaining devices are therefore managed by the downstream-side controller.

from ref2:
"A system host will enumerate through Bridges A and B (both transparent) on the left branch of the figure 2 until it reaches the endpoint X.  On the right side branch, the host will stop enumeration at Bridge D (NTBD).  Similarly, the local CPU will enumerate through Bridge E and F (both virtual bridges within the switch) and discover the endpoint Y, but will not attempt to discover elements beyond Bridge D. "

Device Identification:
A normal transparent PCI-PCI bridge has class code 0604h, while an NTB has class code 050000h, i.e., a RAM controller. That is fitting, because an NTB's job is memory mapping from one address domain to another.
Each device is identified by <device, port, type>.
- A TB (transparent bridge) switch port can be either upstream or downstream; to the NTB, each side is simply an endpoint.
These upstream/downstream properties are controlled through the CSR BARs.

CSR header:
A transparent bridge uses a Type 1 CSR header, whereas an NTB uses a Type 0 CSR header.

Scratchpad and doorbell registers:
- The scratchpad registers, eight in total, serve as the communication bridge between the two sides.
- They can be accessed in either memory or I/O space from both the primary and secondary interfaces.
- They can carry control and status information between primary- and secondary-bus devices.
- The doorbell registers are used to raise an interrupt from one side to the other.
- Each interface (primary and secondary) has its own set of registers, accessible through memory or I/O space.
- Standard PCIe allows the interrupt to be configured as INTx or MSI (Message Signaled Interrupt).
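
A minimal sketch of ringing a doorbell from one side; the register offset is a made-up placeholder, since the real offsets come from the specific switch's datasheet (e.g. PLX's):

#include <linux/io.h>

/* Assumption: BAR0 maps the NTB's CSRs and the peer doorbell register
 * sits at some device-specific offset inside that window. */
#define NTB_PEER_DOORBELL_OFF  0x60   /* hypothetical offset */

static void ntb_ring_doorbell(void __iomem *csr_base, u32 db_bit)
{
    /* Setting a doorbell bit raises an interrupt (INTx or MSI) on the
     * other side of the bridge. */
    iowrite32(db_bit, csr_base + NTB_PEER_DOORBELL_OFF);
}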

Address Translation
The NTB uses its BARs to translate addresses between the two sides.
- A BAR defines an address translation window into memory space on the other side of the NTB
- BAR0 and BAR1 are reserved for memory- and I/O-mapped access to the CSRs
- Each BAR has a setup register that specifies the window size and type.
- A transparent bridge forwards all CSRs based on bus number; an NTB only accepts CSR transactions addressed to the device itself.

The two address translation mechanisms are as follows (see the arithmetic sketch below):
1. Direct Address Translation (using the BAR Setup register and the BAR Address Translation register)

How to set up the address translation of NT BARs:
A. Program the “BAR Setup register” to reserve a range of memory addresses
              (normally done in EEPROM)
B. Program the “BAR Address Translation Register” to map that range of
          memory addresses onto memory addresses on the other NT side
              (normally done at runtime)

From PLX:
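
In effect, direct address translation is just base-and-offset arithmetic; a sketch with made-up values:

#include <stdint.h>
#include <stdio.h>

/* An inbound address that hits the NT BAR window keeps its offset
 * within the BAR, but the base is swapped for the value programmed
 * into the BAR Address Translation register. */
static uint64_t nt_translate(uint64_t addr,
                             uint64_t bar_base,    /* BAR setup       */
                             uint64_t xlat_base)   /* translation reg */
{
    return xlat_base + (addr - bar_base);
}

int main(void)
{
    /* BAR window at 0x90000000; translation register points at the
     * peer's 0x120000000: an access to 0x90001000 on this side lands
     * at 0x120001000 on the other side. */
    printf("0x%llx\n", (unsigned long long)
           nt_translate(0x90001000ULL, 0x90000000ULL, 0x120000000ULL));
    return 0;
}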



2. Lookup Table Based Address Translation
The other approach uses a lookup table to perform the address translation.

Request ID Translation
(I haven't completely figured this part out yet.)
When a request enters one side of the NTB and crosses to the other, the Request ID (bus number, device, function) differs between the two sides, so a translation table must be maintained to convert them one-to-one.

From ref1:


Complete ID Translation


Reference:
1. Using Non-transparent Bridging in PCI Express Systems
2. Multi-Host System and Intelligent I/O Design with PCI Express
3. Enabling Multi-Host System Designs with PCI Express Technology

2011年10月19日 星期三

Intel 8255X (draft)

Chapter 6: Host Software Interface
1. The device establishes a shared-memory communication system with the host CPU.
2. The host controls the device by writing and reading data to and from this shared memory space.
Where is this shared memory? A BAR?

The shared memory architecture:
Divided into three parts:
1. CSR: Control/Status Register, reside on LAN controller
2. CBL: Command Block List, reside on host memory
3. RFA: Receive Frame Area, reside on host memory

CSR:
1. The first 8 bytes of the CSR are called the System Control Block (SCB). The SCB serves as a central communication point for exchanging control and status information between the host CPU and the 8255x.
2. The host software controls the state of the Command Unit (CU) and Receive Unit (RU) (for example, active, suspended or idle) by writing commands to the SCB.
3. The device posts the status of the CU and RU in the SCB Status word and indicates status changes with an interrupt.
4. The SCB also holds pointers to a linked list of action commands called the CBL and a linked list of receive resources called the RFA.
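
Writing the SCB layout down as a struct makes the description concrete (a sketch based on the manual's statement that the SCB occupies the first 8 bytes of the CSR; the field names are mine):

#include <stdint.h>

/* First 8 bytes of the 8255x CSR: the System Control Block (SCB).
 * The host writes CU/RU commands into scb_command, reads device state
 * from scb_status, and points scb_general_ptr at the CBL or RFA. */
struct e100_csr {
    volatile uint16_t scb_status;       /* CU/RU states, interrupt bits */
    volatile uint16_t scb_command;      /* CUC/RUC command fields       */
    volatile uint32_t scb_general_ptr;  /* physical address of CBL/RFA  */
    /* ... the remaining CSR registers follow (PORT, EEPROM, MDI, ...) */
};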



Glossary:
1. CSR
2. SCB (System Control Block) 
The CUC and RUC fields of the Command byte specify the actions to be performed by the 8255x
3. CU (Command Unit)
4. RU (Receive Unit)


Reference:
1. Intel 8255x 10/100 Mbps Ethernet Controller Family Open Source Software Developer Manual

2011年10月17日 星期一

Seborrheic dermatitis

1. Oct 17, 2011
- Clindamycin Phosphate Lotion (1% clindamycin)
- OXY face wash, acne treatment (10% Benzoyl Peroxide)

2. Nov 4, 2011
- Eryth/Benz Perox Top GEL (ERYTHROMYCIN)
http://www.healthno1.com/component/content/article/134-antibiotics/1634-hna1670.html

- DOXYCYCLINE HYC
http://tw.myblog.yahoo.com/skindr-wang/article?mid=2311

2011年10月11日 星期二

Chapter 17: Page Frame Reclaiming

Page Frame Reclaiming Algorithm (PFRA)
From Understanding the Linux Kernel, chapter 17.
Given a physical page, this discusses how to find out who is using the page -- the possible users are 1. the file cache, 2. a process address space, and 3. a device cache -- and the heuristics that decide which pages get freed first.

"The page frame reclaiming algorithm of the Linux kernel refills the lists of free blocks of the buddy system by "stealing" page frames from both User Mode processes and kernel caches."
- So pages to reclaim can be taken from user-mode processes or from kernel caches.

Page types:
Anonymous page:
A page is said to be anonymous if it belongs to an anonymous memory region of a process (for instance, all pages in the User Mode heap or stack of a process are anonymous)
Mapped page: a page that maps a file.
For instance, all pages in the User Mode address spaces belonging to file memory mappings are mapped, as well as any other page included in the page cache.

Anonymous Page Reclaim
When the kernel reclaims a given physical page, it must find the processes using that page. If only a single process uses it, the page is non-shared; if multiple do, it is a shared anonymous page.


How to recognize an anonymous page:
struct page {
  _mapcount: if >= 0 the page is mapped; values > 0 mean several processes share it
  struct address_space *mapping: if NULL -> anonymous page; else -> mapped to a file (inode)
}
mapping: pointer to the owning object

Shared and non-shared
If the page is anonymous, the kernel keeps a list called anon_vma.
anon_vma is a list holding all the vm_area_structs, from different processes, that point to the same page -- a shared anonymous page. When the kernel needs to free anonymous pages, it can follow this list to find all the related processes, get each vm_area_struct, and invalidate its page table entry.

"We have seen in the previous section that descriptors for anonymous memory regions are collected in doubly linked circular lists; retrieving all page table entries referencing a given page frame involves a linear scanning of the elements in the list. The number of shared anonymous page frames is never very large, hence this approach works well." from ref2:

From ref2, fig 17-1: two processes (mm_structs) share the same page; anon_vma links all the vm_area_structs that share it.
To free the page, the kernel must iterate over all the vm_area_structs, find each one's page table, scan it linearly, and then swap the page out or write it to disk.


Mapped Page Reclaim
Problem 1: given a page to reclaim that is mapped by a file, we must find which offset in the file corresponds to this page.

For a mapped page: the struct address_space *mapping field in struct page leads to the inode. The address_space has a radix_tree_root, and the radix tree maps file offsets to positions in the page cache. When the corresponding page is freed, the radix tree entry in the address_space must be updated (set to NULL).

Problem 2: having found the file offset, many processes may map that offset through mmap, e.g. libc.
Idea: as with anonymous pages, keep a list of all the vmas.
Complication 1:
Not all file mappings map the entire file.
To reclaim, say, page 4 of a file, we must filter out all the vmas that don't include page 4.
Complication 2: anonymous mappings won’t be shared much.
Anonymous pages are rarely shared (only through fork), whereas mapped files are shared all the time, e.g. libc.
Problem: lots of entries on the list + many that might not overlap

Solution: Priority Search Tree
Idea: binary search tree that uses overlapping ranges as node keys
Bigger, enclosing ranges are the parents, smaller ranges are children
Use case: Search for all ranges that include page N

The three trees and their different uses
1. Radix tree (inode -> physical memory)
The page cache of a file (inode); it handles the mapping from the file (inode) to its address space. The inode has i_mapping (a struct address_space), and the address_space has a radix_tree_root:
struct address_space {
 635        struct inode            *host;          /* owner: inode, block_device */
 636        struct radix_tree_root  page_tree;      /* radix tree of all pages */
...
This page_tree provides the mapping from file offsets to physical memory.

2. Red-black tree (process address space -> page)
Handles the mappings of a process address space. Each process has an mm_struct, and the mm_struct manages all the process's vm_area_structs, keeping them on a list. (Note: a red-black tree is balanced.)
from ref1:

3. Priority tree (process address space -> file)
"There is a priority search tree for every file; its root is stored in the i_mmap field of the address_space object embedded in the file's inode object. It is always possible to quickly retrieve the root of the search tree, because the mapping field in the descriptor of a mapped page points to the address_space object. " from ref2
Case: a file-mapped page shared by many processes.
Many programs use mmap to map a file directly into their address space; libc is the classic example. (So this is neither the file page cache nor the process page-table mapping.) Each program may map only part of the libc file, not the entire file, into its own address space, so Linux uses a PST (Priority Search Tree) to manage this mapping.

From ref2, fig 17-2:
Goal: given a file offset, find which vm_area_structs contain that offset (through mmap).
Each node in the PST contains a list of vmas mapping that interval: each node represents one interval, and every vm_area_struct on the node's list maps that interval.
The figure below covers file pages 0 through 5.
Radix: the start of the interval; heap: the end of the interval.
Invariants: 1. a parent's heap number is larger than its children's heap numbers; 2. the siblings are ordered by their radix numbers.
Traversal for page n:
if n > heap number, stop
else
  if n is in the range [radix, heap], it's an answer
  continue checking ALL the children.

Example: n = 4
4 < 5 and 4 is in (0,5) -> (0,5,5); 4 <= 4 -> (0,4,4); 4 > 0, 4 > 2; 4 < 5 and 4 is in (2,5) -> (2,3,5); 4 > 3, 4 > 2
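
A sketch of that traversal over a hypothetical node type (radix = interval start, heap = interval end); the struct is illustrative, not the kernel's actual prio_tree:

#include <stdio.h>

/* Illustrative PST node: the node's vma list maps file pages
 * [radix, heap]; children are kept in a NULL-terminated array. */
struct pst_node {
    unsigned long radix;             /* interval start */
    unsigned long heap;              /* interval end   */
    struct pst_node **child;         /* NULL-terminated child array */
};

/* Report every node whose interval contains page n. */
static void pst_search(struct pst_node *node, unsigned long n)
{
    struct pst_node **c;

    if (!node || n > node->heap)     /* heap property prunes the subtree */
        return;

    if (n >= node->radix)            /* n in [radix, heap]: a match */
        printf("vma list of (%lu,%lu) maps page %lu\n",
               node->radix, node->heap, n);

    for (c = node->child; c && *c; c++)   /* must check ALL children */
        pst_search(*c, n);
}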

In the figure below, the upper right corner is the inode; the inode points to struct prio_tree i_mmap, and i_mmap points to a priority tree. The priority tree records, for each file-offset page of the file, the mm_structs of the processes mapping it, and each mm_struct contains several vm_area_structs. This is how, starting from a file's inode, one finds every process that mmaps the file.

From ref1, fig 4-7:

struct address_space {
 635        struct inode            *host;          /* owner: inode, block_device */
 636        struct radix_tree_root  page_tree;      /* radix tree of all pages */
 637        spinlock_t              tree_lock;      /* and lock protecting it */
 638        unsigned int            i_mmap_writable;/* count VM_SHARED mappings */
 639        struct prio_tree_root   i_mmap;         /* tree of private and shared mappings */
 640        struct list_head        i_mmap_nonlinear;/*list VM_NONLINEAR mappings */

LRU (Least Recently Used)
Every page is on one of two LRU lists, active or inactive. A page that gets accessed moves to the active list; a page unused for a long time moves to the inactive list.

- Anonymous pages are used by user space, so the kernel must know whether a page is frequently used; a rarely used page can be reclaimed.
- The page table entry has an accessed bit, set by the x86 hardware whenever the entry is used, to inform the Linux kernel. The kernel periodically resets and checks this bit to learn whether the page is frequently used.

Page Reclaim Heuristics:
Triggers: low memory, hibernation, free memory below a threshold.
Which page should be freed first?
Goal: minimize the performance disruption.
1. Free harmless pages first
2. Steal pages from user programs, especially those that haven’t been used recently
    - after reclaiming a page, remove all references to it
3. Laziness: favor pages that are “cheaper” to free
    - waiting on write-back of dirty data takes time


Reference:
1. Professional Linux Kernel Architecture, chapter 4.
2. Understanding the Linux Kernel, chapter 17.
3. Segment tree: http://en.wikipedia.org/wiki/Segment_tree (or Priority tree)

2011年10月10日 星期一

C pointer practice

From JOS Lab1:
http://www.cs.sunysb.edu/~porter/courses/cse506/f11/lab1.html

1. Read 5.1 (Pointers and Addresses) through 5.5 (Character Pointers and Functions) in K&R. Then download the code for pointers.c, run it, and make sure you understand where all of the printed values come from. In particular, make sure you understand where the pointer addresses in lines 1 and 6 come from, how all the values in lines 2 through 4 get there, and why the values printed in line 5 are seemingly corrupted.


2. There are other references on pointers in C, though not as strongly recommended. A tutorial by Ted Jensen that cites K&R heavily is available in the course readings.

3. Here are a few specific points you read about in K&R Chapter 5 that are worth remembering for the following exercise and for future labs.
  • If int *p = (int*)100, then (int)p + 1 and (int)(p + 1) are different numbers: the first is 101 but the second is 104. When adding an integer to a pointer, as in the second case, the integer is implicitly multiplied by the size of the object the pointer points to.
  • p[i] is defined to be the same as *(p+i), referring to the i'th object in the memory pointed to by p. The above rule for addition helps this definition work when the objects are larger than one byte.
  • &p[i] is the same as (p+i), yielding the address of the i'th object in the memory pointed to by p.
pointers.c
-----------
#include <stdio.h>
#include <stdlib.h>

void
f(void)
{
    int a[4];
    int *b = malloc(16);
    int *c;
    int i;

    printf("1: a = %p, b = %p, c = %p\n", a, b, c);

    c = a;
    for (i = 0; i < 4; i++)
        a[i] = 100 + i;
    c[0] = 200;
    printf("2: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c[1] = 300;
    *(c + 2) = 301;
    3[c] = 302;
    printf("3: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c = c + 1;
    *c = 400;
    printf("4: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c = (int *) ((char *) c + 1);
    *c = 500;
    printf("5: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    b = (int *) a + 1;
    c = (int *) ((char *) a + 1);
    printf("6: a = %p, b = %p, c = %p\n", a, b, c);
}

int
main(int ac, char **av)
{
    f();
    return 0;
}

2011年10月6日 星期四

CSE506: Block Device Scheduling (DRAFT)

The big picture:
From ref1: "Below the generic block layer, the "I/O scheduler " sorts the pending I/O data transfer requests according to predefined kernel policies. The purpose of the scheduler is to group requests of data that lie near each other on the physical medium. "
The goals of scheduling:
1. Throughput and latency
2. Safety: the file system can be recovered after a crash
3. Fairness – surprisingly, very little attention is given to storage access fairness

Problems:
- Between page cache and disk, you have a queue of pending requests
- Requests are a tuple of (block #, read/write, buffer addr)
- You can reorder these as you like to improve throughput

Below is from ref2.
Elevator Algorithm
- Like an elevator: going up, take everyone headed up together; coming down, take everyone headed down together.
Idea: require the disk arm to move in continuous “sweeps” in and out
- Reorder requests within a sweep
- This approach prevents starvation: sectors at the “inside” or “outside” get service after a bounded time (a toy sketch follows below)
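
A toy sketch of one such sweep (block numbers stand in for sector positions; real schedulers sort struct request queues, not arrays):

#include <stdio.h>
#include <stdlib.h>

/* Sort pending requests by block number so one outward sweep of the
 * disk arm services them in order instead of seeking back and forth. */
static int by_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *) a;
    unsigned long y = *(const unsigned long *) b;
    return (x > y) - (x < y);
}

int main(void)
{
    unsigned long pending[] = { 912, 17, 430, 5, 871 };
    size_t i, n = sizeof(pending) / sizeof(pending[0]);

    qsort(pending, n, sizeof(pending[0]), by_block);

    for (i = 0; i < n; i++)    /* one sweep: 5, 17, 430, 871, 912 */
        printf("service block %lu\n", pending[i]);
    return 0;
}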

Complete Fairness Queueing
- Use round-robin to guarantee each process a fair share of requests.
Idea: add a second layer of queues, one per process, then do the round-robin.
Drawback: ping-pongs the disk head around

Deadline Scheduler
- Requests closest to their deadline are serviced first.
Idea: associate an expiration time with each request; as requests get close to expiration, make sure they are dispatched. Good for real-time applications.

Anticipatory Scheduler
- When a process shows burstiness, wait a moment (batching), then service the burst together.
Idea: try to anticipate the locality of requests.
If process P tends to issue bursts of requests for close disk blocks, then when you see a request from P, wait a bit to see whether more come in before scheduling them.

Reference:
1. Understanding the Linux Kernel, Chapter 14.3
2. cse506 slide by Don Porter

CSE506: Linux Page Cache

Page Cache
The page cache's main job is to act as a cache for files on disk. Every on-disk file has a corresponding page cache (a radix tree); files are accessed in 4K page-size units, and whatever is read from disk is kept in the page cache.
To read a file, find_get_page() takes the inode and an index, the offset within the file's address space. The Linux kernel walks the radix tree by this index to find the corresponding physical page; if the page is not in the page cache, it must be read back in from disk.

Recap: the uses of memory
1. Process virtual memory, 2. files, 3. I/O mappings, and 4. device buffers.


Note:
- There are several kinds of address space:
1. Virtual address space: 4G, anonymous (vm_area_struct, vma)
2. Physical address space: depends on the size of your physical memory, and
3. File address space: the offset from the start of the file
- Every inode has its own page cache
- When a file grows beyond 4G, its file address space can no longer be expressed in a 32-bit address space; 64 bits (or a concatenation of registers) is needed.

Radix tree
For processes, Linux uses page tables to manage the virtual-to-physical mapping. For files, it uses radix trees. (The difference: a virtual address space is always 0-4G, but a file can be any size.)
"In order to perform page cache lookup efficiently, Linux 2.6 makes use of a large set of search trees, one for each address_space object. "
The radix tree uses the prefix of the file address to find the corresponding page. The page cache is stored in a radix tree whose every node holds 64 slots, ultimately pointing at physical pages.
- Branching factor: 64
- TRIE: prefix tree. Rather than storing the entire key in the node, traversal of the parents builds a prefix; the node stores just a suffix.
- Height 1: size = 64 * 4K = 256K; height 2: 64 * 64 * 4K = 16M

Positioning:
The lowest 12 bits (4K) of a file address are stripped off first; after that, each 6-bit group is the index into one level of the radix tree.
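
A sketch of that index computation for a height-2 tree (pure arithmetic, no kernel API; the constants just restate the 12-bit page offset and 6-bit fan-out described above):

#include <stdio.h>

#define PAGE_SHIFT  12     /* strip the low 12 bits (the 4K page offset) */
#define RADIX_BITS   6     /* 6 bits per level -> 64-way branching       */
#define RADIX_MASK 0x3f

int main(void)
{
    unsigned long file_addr = 0x345678;
    unsigned long index = file_addr >> PAGE_SHIFT;   /* page index 0x345 */

    /* For a height-2 tree, the upper 6 bits pick the slot in the root
     * node and the lower 6 bits pick the slot in the leaf node. */
    unsigned long lvl1 = (index >> RADIX_BITS) & RADIX_MASK;
    unsigned long lvl2 = index & RADIX_MASK;

    printf("page index 0x%lx -> root slot %lu, leaf slot %lu\n",
           index, lvl1, lvl2);
    return 0;
}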

from ref1:
Buffer heads (BH), 15.2
(Mapping pages to disk blocks)
Buffer heads exist because the disk's unit is the 512-byte sector while the OS's unit is the 4K page, so each page contains 8 buffers. The Linux kernel uses buffer_head for the housekeeping: when a write covers less than one sector, the page cache page is marked dirty, but only the affected BH is updated on disk, not the whole page containing 8 buffers.
from ref2:
Example: write() system call for first 5 bytes
1. Look up first page in radix tree
2. Modify page, mark dirty
3. Only mark first buffer head dirty

Raw device caching:
A disk can also have its own radix tree + address space!

Page cache vs. page table
While a process address space's 4K virtual pages are translated through page tables (10 bits for the page directory, 10 bits for the page table, and 12 bits for the offset), the page cache is used to cache on-disk files.

As the figure below shows, the page cache handles caching from an address-space region to the backing store (left half), while the page table handles mapping a virtual address space onto physical page frames.

Review of page tables, from ref3:

Linux represents portions of a process with a vm_area_struct, or vma, and uses a red-black tree to manage the mappings. from ref4


Reference:
1. Understanding the Linux Kernel, Chapter 15
2. http://www.ibm.com/developerworks/cn/linux/l-cache/index.html
3. Professional Linux Kernel architecture
4. CSE506 slide, Don Porter

2011年10月5日 星期三

PCI useful links

Focus: linking TCP/IP with PCIe HW.

Links
1. LDD example code
http://www.cs.fsu.edu/~baker/devices/lxr/http/source/ldd-examples
2. LDD, network driver
http://www.xml.com/ldd/chapter/book/ch14.html
3. Kernel documents
http://www.mjmwired.net/kernel/Documentation/PCI/pci.txt
4. Introduction to PCI
http://teknirvana.com/documents/Pci.pdf


Papers:
Unread:
1. SR-IOV Networking in Xen: Architecture, Design and Implementation
2. Standardized but Flexible I/O for Self-Virtualizing Devices
3. Towards an Integrated IO and Clustering Solution using PCI Express

Read: