Tuesday, October 25, 2011

ELF, Linking and Loading (draft)

This post discusses the ELF file format, how a program starts executing, and how the linker/loader combines libraries with an ELF image.

ELF & ELF Loading
- Executable and Linkable Format
An ELF file contains:
1. Program headers: describe the segments to be loaded (text, data, bss)
.text – Where read/execute code goes
.data – Programmer initialized read/write data
.bss – Uninitialized data (initially zero by convention)

2. Section headers: describe the sections used for linking

- Static ELF: load the .text, .data, and .bss sections directly.
- Dynamic ELF: the dynamic linker is loaded first, and it then loads .text, .data, and so on.

Dynamic Linking
1. Static linking is the result of the linker copying all library routines used in the program into the executable image.
2. Dynamic linking is accomplished by placing the name of a sharable library in the executable image. Actual linking with the library routines does not occur until the image is run, when both executable and library are placed in memory. An advantage of dynamic linking is that multiple programs can share a single copy of the library.
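As a concrete illustration of runtime linking, the sketch below loads a shared library and resolves a symbol explicitly through the dlopen/dlsym API. This is explicit runtime loading rather than the PLT's implicit lazy binding, but the same dynamic linker does the work; the library name and function are just examples.

/* explicit_dl.c - resolve a libm symbol at run time (link with -ldl) */
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    /* Ask the dynamic linker to map libm into our address space. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    /* Look up the in-memory address of cos(), much like the PLT/GOT
     * machinery does on the first call through a PLT slot. */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (!cosine) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }

    printf("cos(0.0) = %f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}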



With dynamic linking, main() is not called directly; the following happens first:
1. Check the headers for the libraries the program needs.
2. Call mmap to map each library into the current process's address space.
3. Do some bookkeeping and call main()

How is a library function called?
1. A library such as libc is usually already loaded in memory, because other programs may already be using it.
2. The linker must fill in the addresses, inside the library, of the functions that the program being executed calls.
3. If the linker doesn’t know where a function will end up, it creates a relocation -> the actual relocation is deferred until later.
4. Part of loading: the linker marches through each relocation and overwrites the call target -> at this point the linker patches the function addresses in .text.

Concept:
Compiler creates a jump table for all external calls
- This table is the PLT (Procedure Linkage Table); PLT entries jump through the GOT (Global Offset Table).
- The GOT stores <function, address in memory> pairs.
Through the PLT/GOT the linker can therefore obtain the current in-memory address of a library function.

A detailed explanation of the PLT and the steps involved:
A PLT is a table of absolute addresses to functions. It is used because the link editor doesn’t know where functions in shared objects will be located. Instead, a table is created so that the program and the dynamic linker can work together to find and execute functions in shared objects. I’ve simplified the explanation a bit, but at a high level:
  1. Program calls a function in a shared object, the link editor makes sure that the program jumps to a slot in the PLT.
  2. The program sets some data up for the dynamic linker and then hands control over to it.
  3. The dynamic linker looks at the info set up by the program and fills in the absolute address of the function that was called in the PLT.
  4. Then the dynamic linker calls the function.
  5. Subsequent calls to the same function jump to the same slot in the PLT, but every time after the first call the absolute address is already in the PLT (because when the dynamic linker is invoked the first time, it fills in the absolute address in the PLT).
This approach is also called lazy resolution (lazy binding): only when a function is first called through the PLT does the dynamic linker figure out the relocation and fill in the address via the GOT. ltrace works by inserting breakpoints at PLT entries to discover which library functions a program uses.


reference:
1. address-space slide from cse506 Don Porter
2. Linker and Loader

Sunday, October 23, 2011

CSE506: Networking Softirq and NAPI

Why top half and bottom half?
1. To minimize the time spent in an interrupt handler with other interrupts disabled.
2. To give the kernel more scheduling flexibility.
3. To simplify service routines.

Top half:
1. allocate an skb, 2. copy the packet into the buffer, 3. initialize fields, 4. schedule the bottom half (for later processing)
Buffer preallocation using DMA:

In some cases, sk_buff can be pre-allocated, and network card can copy data in (DMA) before firing the interrupt. 

Softirqs:
- A hardware IRQ corresponds to a hardware interrupt line and invokes the top half.
- Soft IRQ is the associated software “interrupt” handler : the bottom half
- Kernel’s view: per-CPU work lists: Tuples of <function, data>
When do softirqs run?
1. On return from an exception, interrupt, or system call.
2. Each CPU has a kernel thread "ksoftirqd_CPU#" that processes pending requests.
* A softirq instance always runs on the same CPU that raised it. Why? Locking: it avoids deadlock, reduces code complexity, and scales better!

From the programmer's point of view:
1. Only one instance of a softirq function will run on a CPU at a time.
2. It doesn't need to be reentrant.
3. If interrupted, it won't be re-invoked by the interrupt handler; subsequent requests are enqueued!
4. One instance can run on each CPU concurrently, though, so shared data must use locks.

Tasklet: Constrained to run one at a time on any CPU.
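A minimal sketch of a tasklet-style bottom half, using the classic 2.6-era tasklet API (DECLARE_TASKLET / tasklet_schedule); the names are made up for illustration and the top-half device handling is elided.

#include <linux/interrupt.h>

/* Bottom half: runs later in softirq context, with interrupts enabled. */
static void my_bottom_half(unsigned long data)
{
    /* process the data captured by the top half */
}

DECLARE_TASKLET(my_tasklet, my_bottom_half, 0);

/* Top half: keep it short, then defer the rest. */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    /* grab data from the device, e.g. into a preallocated skb */
    tasklet_schedule(&my_tasklet);   /* enqueue the bottom half */
    return IRQ_HANDLED;
}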

Softirq Priority:
HI_SOFTIRQ(high/first) 
TIMER
NET_TX
NET_RX
SCSI
TASKLET(low/last)
- Devices can decide whether their bottom half is higher or lower priority than network traffic (HI or TASKLET) 
- Transmit traffic prioritized above receive. Why? 
- The ability to send packets may stem the tide of incoming packets

Receive livelock

Reference:
1. cse506 course slide by Don Porter.

Non-transparent Bridging (NTB)

What is NTB (Non-Transparent Bridging)?
Introduction:
The traditional PCI architecture was designed for PCs to attach several I/O devices. The protocol did not anticipate multiple CPUs: a single processor enumerates the entire memory space and controls all of the devices. Under this design, adding another CPU (host) to the same hierarchy breaks PCI operation.

NTB is a solution that lets multiple hosts share PCI devices. An NTB presents a PCI endpoint on each of its two sides and uses address translation to move data between them. Because the NTB exposes a Type 0 CSR header (an endpoint), the host on that side does not forward transactions through it or continue discovery beyond it.

downstream/upstream side:
The downstream side is the side of the NTB that faces the intelligent subsystem (the local controller).
The upstream side faces the host/root complex, so the upstream host sees the NTB simply as a PCI endpoint device and stops there.
"Device on the downstream side is not visible from the upstream side, this allows an intelligent controller on the downstream side to manage devices here" -> all of the other devices are therefore managed by the downstream-side controller.

from ref2:
"A system host will enumerate through Bridges A and B (both transparent) on the left branch of the figure 2 until it reaches the endpoint X.  On the right side branch, the host will stop enumeration at Bridge D (NTBD).  Similarly, the local CPU will enumerate through Bridge E and F (both virtual bridges within the switch) and discover the endpoint Y, but will not attempt to discover elements beyond Bridge D. "

Device Identification:
A normal transparent PCI-PCI bridge has class code 0604h, while an NTB uses class code 050000h, which denotes a RAM controller; this fits, since an NTB essentially maps memory from one address domain to another.
Each device is identified by <device, port, type>.
- A TB (transparent bridge) switch port is further classified as upstream or downstream; for an NTB, the port appears as an endpoint.
These upstream/downstream properties are controlled through the CSR BARs.

CSR header:
A transparent bridge uses a Type 1 CSR header, whereas an NTB uses a Type 0 CSR header.

Scratchpad and doorbell registers:
- The scratchpad registers (eight of them) act as a bridge for communication between the two sides.
- can be accessed in either memory or I/O space from both the primary and secondary interface.
- They can be used to pass control and status information between primary and secondary bus devices.
- Doorbell registers are used to generate an interrupt from one side to the other.
- Each interface (primary and secondary) has its own set of registers, accessible through memory or I/O space.
- Standard PCIe allows interrupt to be configured as INTx or MSI (Message Signal Interrupt)

Address Translation
The NTB uses BARs to translate addresses between its two sides.
- A BAR defines an address translation window into the memory space on the other side of the NTB.
- BAR0 and BAR1 are reserved for memory- and I/O-mapped access to the CSRs.
- Each BAR has a setup register that specifies the window size and type.
- Transparent bridge forwards all CSRs based on bus number, NTB only accept CSR transactions addressed to the device itself.

There are two ways to perform address translation:
1. Direct address translation (using the BAR Setup register and the BAR Address Translation register)

How to set up the address translation of the NT BARs:
A. Program the "BAR Setup Register" to reserve a range of memory addresses (normally done in EEPROM).
B. Program the "BAR Address Translation Register" to map that range onto memory addresses on the other NT side (normally done at runtime).

From PLX:



2. Lookup Table Based Address Translation
The other approach uses a lookup table to perform the address translation.

Request ID Translation
(I have not fully understood this part yet.)
When a request enters one side of the NTB and must cross to the other side, its Requester ID (bus number, device, function) differs between the two sides, so a translation table must be maintained to perform a one-to-one conversion.

From ref1:


Complete ID Translation


Reference:
1. Using Non-transparent Bridging in PCI Express Systems
2. Multi-Host System and Intelligent I/O Design with PCI Express
3. Enabling Multi-Host System Designs with PCI Express Technology

Wednesday, October 19, 2011

Intel 8255X (draft)

Chapter 6: Host Software Interface
1. The 8255x establishes a shared-memory communication system with the host CPU.
2. The host controls the device by writing and reading data to and from this shared memory space.
Where is this shared memory? BAR?

The shared memory architecture:
Divided into three parts:
1. CSR: Control/Status Register, reside on LAN controller
2. CBL: Command Block List, reside on host memory
3. RFA: Receive Frame Area, reside on host memory

CSR:
1. The first 8 bytes of the CSR is called the System Control Block (SCB). The SCB serves as a central communication point for exchanging control and status information between the host CPU and the 8255x.
2. The host software controls the state of the Command Unit (CU) and Receive Unit (RU) (for example, active, suspended or idle) by writing commands to the SCB.
3. The device posts the status of the CU and RU in the SCB Status word and indicates status changes with an interrupt.
4. The SCB also holds pointers to a linked list of action commands called the CBL and a linked list of receive resources called the RFA.



Glossary:
1. CSR
2. SCB (System Control Block) 
The CUC and RUC fields of the Command byte specify the actions to be performed by the 8255x
3. CU (Command Unit)
4. RU (Receive Unit)


Reference:
1. Intel 8255x 10/100 Mbps Ethernet Controller Family Open Source Software Developer Manual

Monday, October 17, 2011

Seborrheic Dermatitis

1. Oct 17, 2011
- Clindamycin Phosphate Lotion (1% clindamycin)
- OXY face wash, acne treatment (10% Benzoyl Peroxide)

2. Nov 4, 2011
- Eryth/Benz Perox Top GEL (ERYTHROMYCIN)
http://www.healthno1.com/component/content/article/134-antibiotics/1634-hna1670.html

- DOXYCYCLINE HYC
http://tw.myblog.yahoo.com/skindr-wang/article?mid=2311

Tuesday, October 11, 2011

Chapter 17: Page Frame Reclaiming

Page Frame Reclaiming Algorithm (PFRA)
From Understanding the Linux Kernel, chapter 17.
This chapter discusses how, given a physical page, the kernel finds out who is using it (possible users: 1. the file cache, 2. a process address space, 3. a device cache) and then applies heuristics to decide which pages should be freed first.

"The page frame reclaiming algorithm of the Linux kernel refills the lists of free blocks of the buddy system by "stealing" page frames from both User Mode processes and kernel caches."
- Pages to reclaim can therefore be chosen from either User Mode processes or kernel caches.

Page type:
Anonymous page:
A page is said to be anonymous if it belongs to an anonymous memory region of a process (for instance, all pages in the User Mode heap or stack of a process are anonymous)
Mapped page: a page is mapped to a file.
For instance, all pages in the User Mode address spaces belonging to file memory mappings are mapped, as well as any other page included in the page cache.

Anonymous Page Reclaim
When the kernel reclaims a page, it must find, for a given physical page, the processes using it. If only a single process uses it, the page is non-shared; if several do, it is a shared anonymous page.


How to tell whether a page is anonymous:
struct page {
  _mapcount: number of page table entries referencing this page; a value > 0 means multiple processes share it
  struct address_space *mapping: if NULL -> anonymous page; otherwise -> mapped to a file (inode)
}
mapping: Pointer to the owning object

Shared and non-shared
If the page is anonymous, the kernel keeps a list called anon_vma.
anon_vma is a list that collects the vm_area_structs, from different processes, that map the same page (a shared anonymous page). When the kernel needs to free anonymous pages, it can walk this list to find all the processes involved, get their vm_area_structs, and invalidate the corresponding page table entries.

"We have seen in the previous section that descriptors for anonymous memory regions are collected in doubly linked circular lists; retrieving all page table entries referencing a given page frame involves a linear scanning of the elements in the list. The number of shared anonymous page frames is never very large, hence this approach works well." from ref2:

From ref2, fig 17-1: two processes (mm_structs) share the same page; anon_vma links together all the vm_area_structs that share this page.
To free the page, the kernel must iterate over all the vm_area_structs, find the corresponding page tables by linear scanning, and then swap the page out or write it to disk.


Mapped Page Reclaim
Problem 1: given a page to reclaim that is mapped from a file, the kernel must find which offset within the file corresponds to this page.

For a mapped page, the struct address_space *mapping field in struct page leads to the inode. The address_space contains a radix_tree_root, and the radix tree maps file offsets to page cache positions. When the corresponding page is freed, the slot in the address_space's radix tree must be updated (set to NULL).

Problem 2: after finding the file offset, many processes may have mmap'ed that offset, e.g., libc.
Idea: as with anonymous pages, keep a list of all the vmas.
Complication 1:
Not all file mappings map the entire file
If we want page 4 of a file, we must filter out all the vmas that don't include page 4.
Complication 2: anonymous mappings won’t be shared much
Anonymous pages are rarely shared (except via fork), but mapped files are shared all the time, e.g., libc.
Problem: Lots of entries on the list + many that might not overlap

Solution: Priority Search Tree
Idea: binary search tree that uses overlapping ranges as node keys
Bigger, enclosing ranges are the parents, smaller ranges are children
Use case: Search for all ranges that include page N

Three trees, three different uses
1. Radix tree: (inode -> physical memory)
The page cache of a file (inode). It handles the mapping from the file (inode) to its address space: the inode has i_mapping (a struct address_space), and the address_space has a radix_tree_root.
struct address_space {
        struct inode            *host;          /* owner: inode, block_device */
        struct radix_tree_root  page_tree;      /* radix tree of all pages */
...
The page_tree provides the mapping from file offsets to physical memory.

2.  Red-black tree: (process address space -> page)
It handles the mappings within a process address space. Each process has an mm_struct, which manages all of its vm_area_structs; the vm_area_structs are kept on a list and indexed by a red-black tree. (Note: a red-black tree is balanced.)
from ref1:

3. Priority tree (file -> process address spaces)
"There is a priority search tree for every file; its root is stored in the i_mmap field of the address_space object embedded in the file's inode object. It is always possible to quickly retrieve the root of the search tree, because the mapping field in the descriptor of a mapped page points to the address_space object. " from ref2
Case: file-mapped pages shared by many processes
Many programs use mmap to map a file directly into their address space; libc is the most common example. (So this is neither the file's page cache nor an ordinary process page table mapping.) Each program may map only part of the libc file rather than the whole file into its own address space, so Linux uses a PST (Priority Search Tree) to manage this mapping.

from ref2 fig17-2:
Goal: given a file offset, find which vm_area_structs include that offset (via mmap).
Each node in the PST represents an interval and contains a list of the vm_area_structs that map that interval.
The figure below covers file pages 0 through 5.
radix: the start of the interval; heap: the end of the interval.
Invariants: 1. a parent's heap number > its children's heap numbers; 2. the radix number on the left > the radix number on the right.
Traversal for page n:
if n > heap number, stop
else
  if n is in the range of radix and heap, it's the answer
  continue checking ALL the children.

example: n = 4
4 < 5 and 4 is in (0,5) -> (0,5,5), 4<=4-> (0,4,4), 4 > 0, 4 >2, 4 < 5 and 4 is in (2,5) -> (2,3,5), 4>3, 4>2
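A small C sketch of the traversal described above, over a hypothetical node structure (radix = start, heap = end of the interval); it only mirrors the rules listed here, not the kernel's actual prio_tree code.

struct pst_node {
    unsigned long radix;           /* first page of the interval */
    unsigned long heap;            /* last page of the interval  */
    struct pst_node *left, *right;
    /* ... list of vm_area_structs mapping this interval ... */
};

/* Report every node whose interval [radix, heap] contains page n. */
static void pst_search(struct pst_node *node, unsigned long n)
{
    if (!node || n > node->heap)       /* rule 1: stop when n > heap */
        return;
    if (n >= node->radix)              /* rule 2: n lies in [radix, heap] */
        /* this node's vma list maps page n */ ;
    pst_search(node->left, n);         /* rule 3: keep checking ALL children */
    pst_search(node->right, n);
}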

In the upper right of the figure below is the inode; it points to struct prio_tree_root i_mmap, which is the root of a priority tree. The priority tree records, for each file-offset page of this file, the mm_structs of the processes mapping it, and each mm_struct contains several vm_area_structs. This shows how, starting from a file's inode, one can find every process that mmaps the file.

From ref1, fig 4-7:

struct address_space {
        struct inode            *host;          /* owner: inode, block_device */
        struct radix_tree_root  page_tree;      /* radix tree of all pages */
        spinlock_t              tree_lock;      /* and lock protecting it */
        unsigned int            i_mmap_writable;/* count VM_SHARED mappings */
        struct prio_tree_root   i_mmap;         /* tree of private and shared mappings */
        struct list_head        i_mmap_nonlinear;/* list VM_NONLINEAR mappings */

LRU (Least Recently Used)
Every page belongs to one of two LRU lists, active or inactive. When a page is accessed it is moved to the active list; if it has not been used for a long time it is moved to the inactive list.

- Anonymous pages are used by user space; the kernel must know whether a page is used frequently, because rarely used pages can be reclaimed.
- Each page table entry has an accessed bit; the x86 hardware sets it when the entry is used, which informs the Linux kernel. The kernel periodically resets and checks this bit to learn whether the page is frequently used.

Page Reclaim Heuristics: 
Triggered when: memory is low, the system hibernates, or free memory falls below a threshold.
Which pages should be freed first?
Goal: minimize performance disruption.
1. Free harmless pages first
2. Steal pages from user programs, especially those that haven’t been used recently
    - After reclaiming a page, all references to the page must be removed.
3. Laziness: favor pages that are “cheaper” to free
    - Waiting on write back of dirty data takes time


Reference:
1. Professional Linux Kernel Architecture, chapter 4.
2. Understanding the Linux Kernel, chapter 17.
3. Segment tree: http://en.wikipedia.org/wiki/Segment_tree (or Priority tree)

Monday, October 10, 2011

C Pointer Practice

From JOS Lab1:
http://www.cs.sunysb.edu/~porter/courses/cse506/f11/lab1.html

1. Read 5.1 (Pointers and Addresses) through 5.5 (Character Pointers and Functions) in K&R. Then download the code for pointers.c, run it, and make sure you understand where all of the printed values come from. In particular, make sure you understand where the pointer addresses in lines 1 and 6 come from, how all the values in lines 2 through 4 get there, and why the values printed in line 5 are seemingly corrupted.


2. There are other references on pointers in C, though not as strongly recommended. A tutorial by Ted Jensen that cites K&R heavily is available in the course readings.

3. Here are a few specific points you read about in K&R Chapter 5 that are worth remembering for the following exercise and for future labs.
  • If int *p = (int*)100, then (int)p + 1 and (int)(p + 1) are different numbers: the first is 101 but the second is 104. When adding an integer to a pointer, as in the second case, the integer is implicitly multiplied by the size of the object the pointer points to.
  • p[i] is defined to be the same as *(p+i), referring to the i'th object in the memory pointed to by p. The above rule for addition helps this definition work when the objects are larger than one byte.
  • &p[i] is the same as (p+i), yielding the address of the i'th object in the memory pointed to by p.
pointers.c
-----------
#include <stdio.h>
#include <stdlib.h>

void
f(void)
{
    int a[4];
    int *b = malloc(16);
    int *c;
    int i;

    printf("1: a = %p, b = %p, c = %p\n", a, b, c);

    c = a;
    for (i = 0; i < 4; i++)
        a[i] = 100 + i;
    c[0] = 200;
    printf("2: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c[1] = 300;
    *(c + 2) = 301;
    3[c] = 302;
    printf("3: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c = c + 1;
    *c = 400;
    printf("4: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c = (int *) ((char *) c + 1);
    *c = 500;
    printf("5: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    b = (int *) a + 1;
    c = (int *) ((char *) a + 1);
    printf("6: a = %p, b = %p, c = %p\n", a, b, c);
}

int
main(int ac, char **av)
{
    f();
    return 0;
}

Thursday, October 6, 2011

CSE506: Block Device Scheduling (DRAFT)

The big picture:
From ref1: "Below the generic block layer, the "I/O scheduler " sorts the pending I/O data transfer requests according to predefined kernel policies. The purpose of the scheduler is to group requests of data that lie near each other on the physical medium. "
Goals of I/O scheduling:
1. Throughput, Latency,
2. Safety: file system can be recovered after a crash
3. Fairness – surprisingly, very little attention is given to storage access fairness

Problems:
- Between page cache and disk, you have a queue of pending requests
- Requests are a tuple of (block #, read/write, buffer addr)
- You can reorder these as you like to improve throughput

Below is from ref2.
Elevator Algorithm
- Like an elevator: going up, take everyone who is headed up; going down, take everyone headed down.
idea: Require disk arm to move in continuous “sweeps” in and out
- Reorder requests within a sweep
- This approach prevents starvation: Sectors at “inside” or “outside”get service after a bounded time
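A toy sketch of the elevator idea: sort the pending requests by block number and serve them in one sweep starting from the current head position, then handle the rest on the way back. The request structure and queue are invented for illustration.

#include <stdlib.h>

struct io_request {
    unsigned long block;   /* starting block/sector number */
    int write;             /* 0 = read, 1 = write */
    void *buffer;
};

static int by_block(const void *a, const void *b)
{
    const struct io_request *x = a, *y = b;
    return (x->block > y->block) - (x->block < y->block);
}

/* Serve requests at or beyond the head position in ascending order
 * (the outward sweep), then the remaining ones (the return sweep). */
static void elevator_dispatch(struct io_request *q, int n, unsigned long head)
{
    int i;
    qsort(q, n, sizeof(*q), by_block);
    for (i = 0; i < n; i++)
        if (q[i].block >= head)
            /* submit q[i] to the disk */ ;
    for (i = 0; i < n; i++)
        if (q[i].block < head)
            /* submit q[i] on the way back */ ;
}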

Complete Fairness Queue
- Uses round-robin to guarantee that every process's requests get a fair share of service.
idea: add a second layer of queues, one per process, and then round-robin among them.
Drawback: ping-pongs the disk head around.

Deadline Scheduler
- Requests whose deadlines are about to expire are served first.
idea: associate an expiration time with each request; as requests get close to expiration, make sure they are dispatched. Good for real-time applications.

Anticipatory Scheduler
- When a process shows bursty behavior, wait a little (batching) and then service the requests together.
idea: Try to anticipate locality of requests.
If process P tends to issue bursts of requests for nearby disk blocks, then when you see a request from P, wait a bit and see whether more come in before scheduling them.

Reference:
1. Understanding the Linux Kernel, Chapter 14.3
2. cse506 slide by Don Porter

CSE506: Linux Page Cache

Page Cache
The page cache's main job is to cache on-disk files. Each on-disk file has a corresponding page cache (a radix tree); the smallest unit of file access is a 4 KB page, and data fetched from disk is kept in the page cache.
When a file is read, find_get_page() takes the file's address_space (from the inode) and an index, where the index is the page offset within the file's address space. The Linux kernel searches the radix tree with this index to obtain the corresponding physical page; if the page is not in the page cache, it must be read back in from disk.
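A hedged kernel-side sketch of that lookup path, assuming the 2.6-era find_get_page() interface; the miss path (reading the page in from disk) is only indicated in a comment.

#include <linux/pagemap.h>

/* Look up the page caching byte 'pos' of the file behind 'inode'. */
struct page *lookup_cached_page(struct inode *inode, loff_t pos)
{
    struct address_space *mapping = inode->i_mapping;
    pgoff_t index = pos >> PAGE_CACHE_SHIFT;         /* byte offset -> page index */

    struct page *page = find_get_page(mapping, index);  /* radix tree lookup */
    if (!page) {
        /* miss: the caller must read the page from disk
         * (e.g. via mapping->a_ops->readpage) and add it to the cache */
    }
    return page;   /* reference held; caller drops it with page_cache_release() */
}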

Recap: what memory is used for
1. Process virtual memory, 2. File, 3. I/O mapping, and 4. device buffer.


Note:
- There are several kinds of address space:
1. Virtual address space: 4 GB, anonymous regions (vm_area_struct, vma)
2. physical address space: depends on the size of your physical memory, and
3. file address space: offset from the head of the file
- Each inode has its own page cache.
- When a file is larger than 4 GB, its file address space cannot be represented in a 32-bit address space; 64-bit offsets (or a concatenation of registers) are then needed.

Radix tree
For a process, Linux uses page tables to manage the virtual-to-physical mapping; for a file, Linux uses a radix tree. (The difference: a virtual address space is always 0-4 GB, while a file can be of any size.)
"In order to perform page cache lookup efficiently, Linux 2.6 makes use of a large set of search trees, one for each address_space object. "
The radix tree uses the prefix of the file offset to find the corresponding page. The page cache is stored as a radix tree in which each node holds 64 slots, with the leaves pointing to physical pages.
- Branching factor: 64
- TRIE: prefix tree. Rather than storing the entire key in the node, traversal of the parents builds a prefix; the node just stores the suffix.
- Height 1: covers 64 * 4 KB = 256 KB; height 2: 64 * 64 * 4 KB = 16 MB.

Positioning:
The low 12 bits of a file address (the 4 KB page offset) are dropped first; after that, each group of 6 bits is the index into one level of the radix tree.
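A small sketch of that index arithmetic, assuming 4 KB pages and the 6-bit (64-way) fanout described above:

#define PAGE_SHIFT       12      /* 4 KB pages */
#define RADIX_MAP_SHIFT   6      /* 64 slots per node */
#define RADIX_MAP_MASK   0x3f

/* Slot to follow at a given level (level 0 = the node closest to the page). */
static unsigned int radix_slot(unsigned long long file_offset, int level)
{
    unsigned long long index = file_offset >> PAGE_SHIFT;   /* drop the low 12 bits */
    return (unsigned int)((index >> (level * RADIX_MAP_SHIFT)) & RADIX_MAP_MASK);
}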

from ref1:
Buffer heads (BH) 15.2
(Mapping pages to disk blocks)
Buffer heads are needed because the disk's unit is a 512-byte sector while the OS's unit is a 4 KB page, so each page contains 8 buffers. The Linux kernel uses buffer_head structures for the housekeeping. When a write touches only a small part of a page, the page cache page is marked dirty, but only the affected buffer head is written back to disk, not the whole page of 8 buffers.
from ref2:
Example: write() system call for first 5 bytes
1. Look up first page in radix tree
2. Modify page, mark dirty
3. Only mark first buffer head dirty

Raw device caching:
A raw disk device can also have its own radix tree and address_space!

Page cache vs. page table
The page cache caches on-disk files, whereas the 4 KB pages of a process address space are translated through the page table (10 bits for the page directory, 10 bits for the page table, and 12 bits for the offset).

As the figure below shows, the page cache handles caching from an address space region to its backing store (left half), while the page table maps the virtual address space onto physical page frames.

review page table from ref3:

Linux represents portions of a process with a vm_area_struct, or vma, and manages these mappings with a red-black tree. (from ref4)


Reference:
1. Understanding the Linux Kernel, Chapter 15
2. http://www.ibm.com/developerworks/cn/linux/l-cache/index.html
3. Professional Linux Kernel architecture
4. CSE506 slide, Don Porter

Wednesday, October 5, 2011

PCI useful links

Focus: linking TCP/IP with PCIe HW.

Links
1. LDD example code
http://www.cs.fsu.edu/~baker/devices/lxr/http/source/ldd-examples
2.LDD, network driver
http://www.xml.com/ldd/chapter/book/ch14.html
3. Kernel documents
http://www.mjmwired.net/kernel/Documentation/PCI/pci.txt
4. Introduction to PCI
http://teknirvana.com/documents/Pci.pdf


Papers:
Unread:
1. SR-IOV Networking in Xen: Architecture, Design and Implementation
2. Standardized but Flexible I/O for Self-Virtualizing Devices
3. Towards an Integrated IO and Clustering Solution using PCI Express

Read:

VIM Notes

1. Build the tag file
> ctags -R .

2. Configure .vimrc
set tag=tags

3. cscope

4. Common commands
ctrl + n: show completion candidates
shift + d: delete from the cursor to the end of the line
:ta: search for / jump to a tag
:!<command>: run an external program

Device Driver CSE506

I/O ports:

1. I/O ports use an address space separate from the CPU's memory address space.
2. A port is just an address.
3. x86 uses dedicated instructions to access them: inb, inw, ...
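A small user-space sketch (needs root) of touching a port with the x86 in/out helpers from <sys/io.h>; the port number 0x378 (the legacy parallel port) is just an example.

#include <stdio.h>
#include <sys/io.h>     /* ioperm, inb, outb (x86 Linux) */

#define LPT_DATA 0x378  /* example: legacy parallel port data register */

int main(void)
{
    if (ioperm(LPT_DATA, 3, 1)) {        /* request access to 3 ports */
        perror("ioperm (are you root?)");
        return 1;
    }
    outb(0xff, LPT_DATA);                           /* write one byte to the port */
    printf("status = 0x%02x\n", inb(LPT_DATA + 1)); /* read the status port */
    return 0;
}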

Port reads and writes have side effects:
- Ordinary memory reads and writes may be cached.
- Operations may be reordered.
- Idiosyncrasy: composition doesn’t necessarily work
outw 0x1010 <port> != outb 0x10 <port> + outb 0x10 <port+1>

Port permissions
1. Governed by the IOPL field in EFLAGS.
2. "the processor first checks whether CPL <= IOPL. If this condition is true, the I/O operation may proceed. If not true, the processor checks the I/O permission map." ref1
3. The processor then checks the I/O permission bitmap in the TSS: each I/O port is a byte address, and each bit in the bitmap encodes the permission to access that byte address (port).

Memory Mapped IO
1. Map devices onto regions of physical memory
2. Win: Cast interface regions to a structure
3. Hardware basically redirects these accesses away from RAM at same location (if any), to devices

Side Effect
The drawback is that side effects remain (hence the need for barriers):
Common optimizations that get in the way:
1. out-of-order execution
2. reordered writes
3. caching values in registers (the longer they stay cached, the better for performance)
But when we want to access a device, we want the read or write to reach it immediately.

__volatile__: 
- Read and write memory directly, bypassing the cache and compiler optimizations.
- write must go directly to memory
- read must come from memory
- Prevents the compiler from optimizing away or reordering the memory accesses.

Compiler and CPU barrier:
1. Clobber list in the inline assembly:
2. Basic idea: In some cases, CPU can issue loads and stores out of program order (optimize perf)
A fence instruction is used to enforce the execution order.
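A minimal freestanding sketch combining both ideas: a volatile pointer so the compiler does not cache or drop the accesses, plus an explicit barrier so the stores are not reordered (in kernel code this would be wmb(); here a GCC full barrier stands in). The register layout is invented.

struct fake_dev_regs {          /* hypothetical device register layout */
    unsigned int data;
    unsigned int doorbell;
};

static void kick_device(volatile struct fake_dev_regs *regs, unsigned int v)
{
    regs->data = v;             /* volatile: the store really happens, uncached */
    __sync_synchronize();       /* full barrier: 'data' visible before the kick */
    regs->doorbell = 1;         /* tell the device to go look at 'data' */
}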


Reference:
1. http://pdos.csail.mit.edu/6.828/2011/readings/i386/s08_03.htm#fig8-2
2. CSE506 slide by Professor Don Porter

Sunday, October 2, 2011

JOS lab3

env_run: 1. Entering user mode
First call env_pop_tf(&e->env_tf).
This restores the environment's register values from its trapframe into the CPU registers and then issues iret.

__asm __volatile("movl %0,%%esp\n"
        "\tpopal\n"
        "\tpopl %%es\n"
        "\tpopl %%ds\n"
        "\taddl $0x8,%%esp\n" /* skip tf_trapno and tf_errcode */
        "\tiret" /* goto eip, set eip to elf->entry */
        : : "g" (tf) : "memory");
 
After iret we are in user mode, with eip pointing to elf->entry, the program's entry point.

2. load_icode
In load_icode, the program must be loaded according to the binary (ELF) format.
An ELF file contains multiple program headers, each describing a segment.
The ELF header's e_entry field holds the virtual address at which the process starts.
"This member gives the virtual address to which the system first transfers control, thus starting the process."

Each environment has a trapframe that stores its registers while the environment is not running. (So, to run the environment, these registers saved in the trapframe are loaded back.)
"holds the saved register values for the environment while that environment is not running: i.e., when the kernel or a different environment is running. The kernel saves these when switching from user to kernel mode, so that the environment can later be resumed where it left off."


The first member of the Trapframe is tf_regs, which stores the general-purpose registers.
At the end of load_icode, env_tf.tf_eip must be set to elf->e_entry.

3. Setting up the IDT: idt_init
- First install the kernel's TSS in the GDT; it is needed later when an interrupt occurs.
- load the TSS: ltr(GD_TSS)
- Set up the IDT entries:
SETGATE(idt[0x0], 1, GD_KT, do_divide_error, 0);
SETGATE(idt[0xd], 1, GD_KT, do_general_protection, 0);
SETGATE(idt[0xe], 1, GD_KT, do_page_fault, 0);
SETGATE(idt[0x28], 1, GD_KT, do_badsegment, 0);

The interrupt handlers themselves are defined in assembly, in trapentry.S:
TRAPHANDLER_NOEC(do_divide_error, 0x0)

_alltraps:
    pushl %ds;
    pushl %es;
    pushal;
    movl $GD_KD, %eax;
    movl %eax, %ds;
    movl %eax, %es;
    movl %esp, %eax;
    pushl %eax;
    call trap;


Reference:
1. ELF header: http://www.sco.com/developers/gabi/1998-04-29/ch4.eheader.html
2. Program header: http://www.sco.com/developers/gabi/latest/ch5.pheader.html

Monday, September 26, 2011

Linux Memory Allocator

NUMA, Node, and Zone 
NUMA and Node
Since 2.6, Linux supports NUMA (Non-Uniform Memory Access): memory allocation is locality-aware, and physical memory is partitioned into nodes. A CPU accesses its own memory node with uniform latency; accessing a different node takes longer.
Per-node data is kept in struct pg_data_t.

Zone and Region
Each node is further divided into zones; pg_data_t contains node_zonelists and node_zones.

Hardware constraints (p. 299):
1. (ISA) DMA can only address the first 16 MB of the address space.
2. A 32-bit system can address at most 4 GB.
3. The kernel address space has only 1 GB (minus a reserved 128 MB), 0xc0000000 - 0xffffffff.

Zones are therefore divided into three kinds:
1. ZONE_DMA: below 16MB
2. ZONE_NORMAL: 16MB ~ 896MB
3. ZONE_HIGHMEM: > 896MB
"the ZONE_HIGHMEM zone includes page frames that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of linear address space" from ref1

The kernel address space spans 3 GB to 4 GB. Memory above 896 MB is not linearly mapped into the kernel; memory below 896 MB has its physical addresses mapped directly to linear addresses.

* 896 MB = 1024 MB - 128 MB; the upper 128 MB is used for fixed-mapped linear addresses for HIGHMEM.

from ref1, fig 8-2: each zone has its own buddy system, with the slab allocator drawn beneath it. When ZONE_HIGHMEM is used, allocations go through the highmem buddy system.


Ways to handle highmem mapping:
1. permanent kernel mapping, 2. temporary kernel mapping, 3. noncontiguous memory allocation. (p. 306)

Buddy Allocator
Buddy algorithm:
- no external fragmentation
- worse case: waste less than half of the requested size
- kernel need to keep track of the existing blocks of free, contiguous page frames.
- All free page frames are grouped into 11 lists, for blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames.
- The zone allocator is the front end of the buddy system.
- On top of the buddy allocator there is also the slab allocator.

The buddy system's page granularity is too coarse, so the slab allocator manages smaller memory requests.
- The buddy allocator manages memory in pages, while the slab allocator manages it in caches of objects.

Slab allocator
- the default cache allocator
- "kernel creates 13 geometrically distributed lists of free memory areas whose sizes range from 32 to 131, 072 bytes" from ref1
- The slab allocator groups objects into caches . Each cache is a "store" of objects of the same type
- 缺點: Similar basic ideas, but substantially more complex bookkeeping.
- 缺點: 2 groups upset 1. Users of very small system (ex: watch) 2. Users of large multi-processor systems
- example: Slab再不同CPU之間溝通需要較高的overhead.
For very large (thousands of CPU) systems, complex allocator bookkeeping gets out of hand. slabs try to migrate objects from one CPU to another to avoid synchronization. (ref2)
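A minimal sketch of the slab interface: create a cache of same-sized objects, allocate from it, and free back to it. The object type and names are made up for illustration.

#include <linux/slab.h>

struct my_object {               /* hypothetical fixed-size object */
    int id;
    char payload[56];
};

static struct kmem_cache *my_cache;

static int my_cache_init(void)
{
    /* One cache per object type: a "store" of objects of the same size. */
    my_cache = kmem_cache_create("my_object_cache",
                                 sizeof(struct my_object), 0,
                                 SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;
    return 0;
}

static void my_cache_use(void)
{
    struct my_object *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
    if (obj)
        kmem_cache_free(my_cache, obj);   /* back to the same cache */
}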

SLOB allocator
- For small systems, e.g., embedded systems; lower overhead.
- Simple List of Blocks
- Keeps a free list of available chunks and their sizes.
- Grabs the first chunk that is big enough.
- No internal fragmentation
- Has external fragmentation, traded for low overhead.

SLUB allocator
- SLAB has bookkeeping overhead because different CPUs must coordinate.
- The unqueued slab allocator.
- All objects of the same size come from the same slab.
- A simple free list per slab.
- no cross-CPU nonsense
- Does better than SLAB in many cases


from ref4, fig 8-1


Cache line alignment:
1. cache line: 64 ~ 128 bytes on most CPUs
2. word: 32 bits or 64 bits
3. A cache line is the basic unit of the memory cache.


Reference:
1. Understanding the Linux Kernel, chapter 8
2. Slides from cse506, Don Porter
3. False Sharing:
Performance overhead under multiprocessor cache system, see http://en.wikipedia.org/wiki/False_sharing
4. Linux device driver programming

I/O Addressing and Memory Mapped I/O

A driver's job is to help the kernel control hardware: devices the kernel does not understand are handed to a driver, and through the driver, user programs can control the hardware via system calls.

The 80386 provides two ways to read and write the registers on a device:
1. A separate I/O address space (port-mapped I/O)
2. Memory-mapped I/O (sharing the memory address space with main memory)

I/O address space
- cat /proc/ioports
- Provides a 2^16 (64K) address space.
- Each access reads or writes 1, 2, or 4 bytes.
- inb(), outb(), inw(), outw()...
- Every hardware device has its own I/O port range, which must be guaranteed not to overlap with other hardware.
- instructions in and out move data between a register and a port I/O address space

Memory-mapped I/O
- cat /proc/iomem
- Device registers are read and written as ordinary memory accesses instead of I/O ports.
- Place the I/O devices in the 80386 memory address space.
- The driver simply reads and writes a predefined address range (the 3 GB-4 GB region can serve as the address mapping).
- ioremap(), ioremap_nocache(), iounmap() (a minimal sketch follows this list)
- This approach is more flexible: mov, and, or can operate directly on a device register.
- The mapping targets the 3 GB-4 GB physical address range; the installed RAM may be smaller than 4 GB, so when software accesses this region the hardware treats it as a memory-mapped access and routes it to the device.
(see ref2, chapter 12 PCI).
(If more than 3 GB of physical memory is installed, PAE (Physical Address Extension) support is needed.)
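A hedged kernel-side sketch of memory-mapped I/O through the portable accessors; the physical base address and register offset are invented.

#include <linux/io.h>

#define DEV_PHYS_BASE  0xfe9e0000   /* hypothetical BAR value */
#define DEV_REG_CTRL   0x10         /* hypothetical register offset */

static void __iomem *regs;

static int map_device(void)
{
    regs = ioremap(DEV_PHYS_BASE, 0x1000);   /* map 4 KB of device memory */
    if (!regs)
        return -ENOMEM;

    iowrite32(0x1, regs + DEV_REG_CTRL);     /* write a device register */
    (void)ioread32(regs + DEV_REG_CTRL);     /* read it back */
    return 0;
}

static void unmap_device(void)
{
    iounmap(regs);
}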

PCI
- A PCI device can be accessed through either the I/O address space or memory-mapped I/O.
- Linux provides a unified interface that handles both access methods.
- pci_iomap(), ioread8(), iowrite8()
- On IA-32 platforms, the configuration space is mostly accessed through I/O mapping.

IOMMU 
- Without an IOMMU, DMA needs a contiguous block of physical memory for the device mapping; the IOMMU maintains a translation table, so contiguous physical memory is not required and vectored I/O (scatter-gather lists) can be avoided.
- In a virtualized environment, the VM sets up DMA against guest-physical addresses, but guest-physical addresses cannot directly reach host physical memory. The IOMMU remaps guest-physical addresses onto host-physical addresses so the DMA completes, which lets the guest VM use the native device driver directly.
- The IOMMU is conceptually similar to the Extended Page Table: it performs the second half of the guest virtual -> guest physical -> host physical translation. (EPT is really an extension of the MMU that adds the guest-physical-to-host-physical step.)
- The MMU translates the CPU's virtual address space to main memory; the IOMMU translates a device's address space to main memory.
- IOMMU is not used when the CPU communicates with devices via I/O ports.
- Connect a DMA-capable I/O bus to the main memory.
- Intel VT-d (Virtualization Technology for Directed I/O)

IOMMU in relation to virtualization:
A guest OS would like to access I/O devices directly, using DMA, to improve performance. If the guest OS could instruct the hardware directly with guest-physical addresses, it would likely corrupt memory. The corruption is avoided by the hypervisor's intervention to apply the translation; unfortunately this slows down the I/O operation.
An IOMMU can solve this problem by re-mapping the addresses accessed by the hardware according to the same (or a compatible) translation table [from wiki IOMMU]

From wiki: IOMMU

Reference:
1. http://www.cs.sunysb.edu/~porter/courses/cse506/f11/ref/i386/toc.htm
2. Linux Device Driver Programming 驅動程式設計
3. http://en.wikipedia.org/wiki/IOMMU

Saturday, September 24, 2011

Medications

VapoRub
A surprisingly effective ointment; it works as:
1. cough suppressant
2. topical analgesic
It relieves coughs and muscle aches.

Friday, September 23, 2011

JOS lab2

boot_alloc: 
A simple memory allocator used only at boot time to allocate physical memory; afterwards page_alloc is used instead.

i386_vm_init 
( called by i386_init(void) )
1. Allocate space for the page directory; cr3 points to it.
2. Allocate space for the struct Page array, which tracks every physical page.
3. page_init
4. Map the pages array at UPAGES.
5. Map the physical bootstack to its virtual address.
6. Map the region from KERNBASE to 4 GB onto physical address 0x00000000.
7. Because segmentation initially computes va = va - KERNBASE, pgdir[0] must be set up before paging is enabled:
accesses above KERNBASE have KERNBASE subtracted by segmentation and therefore fall into pgdir[0];
- pgdir[0] = pgdir[PDX(KERNBASE)]
Then turn on paging, install the new segment mapping (identity, no translation), and only then clear pgdir[0].

page_init
Links all free physical pages into a linked list. Pages in 0-640 KB are usable; the ranges from 640 KB to 1 MB and from 1 MB to the end of the kernel are not.

page_alloc:
allocate a physical page, remove it from the page_free_list

page_free: 
return the page to the free list

page_insert: 
inserts a (virtual to physical) mapping into the page table; it needs to call pgdir_walk.

page_lookup:
given a virtual address, return the physical page it maps to and its pte.

pgdir_walk:
given a virtual address, walk the page directory to find its page table entry.
If the page table already exists, return the existing entry.
If not, allocate a new page table and then return the page table entry (a rough sketch follows).
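A rough sketch of pgdir_walk along those lines, using JOS-style macros (PDX/PTX/PTE_ADDR/KADDR); the page_alloc calling convention varies between lab versions, so treat this as pseudocode under those assumptions rather than the reference solution.

pte_t *
pgdir_walk(pde_t *pgdir, const void *va, int create)
{
    pde_t *pde = &pgdir[PDX(va)];
    pte_t *pgtab;

    if (!(*pde & PTE_P)) {                    /* page table not present yet */
        struct Page *pp;
        if (!create || page_alloc(&pp) < 0)   /* assumed page_alloc(&pp) API */
            return NULL;
        pp->pp_ref++;
        memset(page2kva(pp), 0, PGSIZE);      /* zero the new page table */
        *pde = page2pa(pp) | PTE_P | PTE_W | PTE_U;
    }

    pgtab = (pte_t *)KADDR(PTE_ADDR(*pde));
    return &pgtab[PTX(va)];
}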


boot_map_segment:
Maps a range of virtual addresses to physical addresses.

Wednesday, September 21, 2011

Chapter 7: Process Scheduling

7.1 Scheduling Policy
- Based on time-sharing technique. CPU time is divided into slices, one for each runnable process.
- The scheduling policy is based on ranking process according to priority.
- Linux applies dynamic process priority assignment: if a process is denied the CPU for a long time, its priority is increased dynamically.

Linux classifies processes into three kinds:
- Interactive processes: spend lots of time waiting for events; examples: shells, text editors.
- Batch processes: need not be responsive, no user interaction, run in the background.
- Real-time processes: have very stringent scheduling requirements, e.g., video and sound applications.

"Linux 2.6 scheduler implements a sophisticated heuristic algorithm based on the past behavior of the processes to decide whether a given process should be considered as interactive or batch. " [ref1]

7.1.1 Preemption
- When a process becomes runnable (TASK_RUNNING) and has a higher priority than the currently running process, the current process is preempted and a context switch occurs.
- The second case is when the current process's quantum expires.

- Choosing the quantum duration is critical for system performance: if the average quantum is too short, the overhead caused by context switching is high.
- A long quantum does not hurt interactive programs, because interactive programs have higher priority and preempt the currently running process.
- The rule of thumb adopted by Linux is to choose a duration as long as possible while keeping good system response time.

7.2 The scheduling algorithm
- Each processor has a swapper process (PID 0); when a CPU has no runnable process, it runs the swapper.
- Every Linux process is always scheduled according to one of the scheduling classes:
1. SCHED_FIFO: FIFO real-time processes. If no higher-priority real-time process exists, the process keeps using the CPU. (Leave the process descriptor in its current position in the runqueue.)
2. SCHED_RR: round-robin real-time processes; the process descriptor is put at the end of the runqueue, so after being preempted by a higher-priority process it resumes at the back of the queue.
3. SCHED_NORMAL: conventional, time-shared processes.

7.2.1 Scheduling of Conventional Process
A process has both a static priority and a dynamic priority.
Static
- static priority: 100(high) - 139(low)
- use nice() or setpriority() system calls
- The base time quantum is adjusted according to the static priority.
- A higher priority gets a longer time quantum.
- ex: priority: 100, nice value: -20, base time quantum: 800ms
Dynamic
- 100 - 139
- dynamic priority = max(100, min(static priority - bonus + 5, 139))
- The bonus is determined from the process's past behavior; it is related to the average sleep time of the process.
- The bonus ranges from 0 to 10; a value greater than 5 is a premium that raises the dynamic priority.
- e.g., sleeping less than 100 ms gives bonus = 0; sleeping more than 1 second gives bonus = 10
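Spelled out as code, the formula above is just a clamp into [100, 139] (remember that a lower value means a higher priority):

/* dynamic priority = max(100, min(static priority - bonus + 5, 139)) */
static int dynamic_prio(int static_prio, int bonus)
{
    int prio = static_prio - bonus + 5;
    if (prio > 139) prio = 139;
    if (prio < 100) prio = 100;
    return prio;
    /* e.g. static 120, bonus 10 -> 115 (higher priority);
            static 120, bonus 0  -> 125 (lower priority)   */
}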

Starvation:
Higher-priority processes could keep lower-priority processes from ever getting the CPU.
"when a process finishes its time quantum, it can be replaced by a lower priority process whose time quantum has not yet been exhausted"
The scheduler keeps two sets of runnable processes:

Active processes: runnable processes that have not yet used up their time quantum.
Expired processes: runnable processes that have exhausted their time quantum and are thus forbidden to run until all active processes expire; once a process uses up its quantum it is moved to the expired list.
from ref1: figure 7-1

Each CPU has one runqueue containing all of its runnable processes, and a process can be on only one runqueue. struct runqueue has two pointers, one to the active set and one to the expired set; the active and expired pointers are swapped periodically.
schedule() picks a new process from the runqueue to run; if there is none, it runs the idle process (swapper). The kernel gauges a CPU's load by looking at its runqueue length; if the load is uneven, load balancing is needed.

Runqueue Balancing
Multiprocessor machines come in three flavors:
1. SMP (Symmetric Multi-Processing): all CPUs have the same capabilities, so the kernel needs no special handling for CPUs with different characteristics.
2. Hyper-threading: a single CPU can run multiple threads; it has several register sets that are used alternately to present multiple virtual CPUs to the kernel.
3. NUMA (Non-Uniform Memory Access): CPUs do not access all RAM at the same cost; part of the RAM is given to each CPU as its local memory, so other CPUs need not be considered for that region and synchronization overhead is saved.

Linux uses the concept of scheduling domains to keep the CPUs balanced.

Related system calls
Scheduling:
1. nice(), getpriority(), setpriority(): set a process's static base priority; without root privileges, the priority can only be lowered.
2. sched_getaffinity(), sched_setaffinity(): choose which CPUs the process may run on (the cpus_allowed field in the process descriptor); see the sketch after this list.

Real-time processes:
1. sched_getscheduler(), sched_setscheduler(): get or set the scheduling policy the process uses (FIFO, RR, or NORMAL).
2. sched_yield(): allow a process to relinquish the CPU voluntarily without being suspended.
3. sched_get_priority_min(), sched_get_priority_max():
4. sched_rr_get_interval():
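A small user-space sketch of sched_setaffinity(), pinning the calling process to CPU 0 (the CPU number is chosen arbitrarily):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                               /* allow CPU 0 only */
    if (sched_setaffinity(0, sizeof(set), &set)) {  /* pid 0 = current process */
        perror("sched_setaffinity");
        return 1;
    }

    printf("now pinned; policy = %d\n", sched_getscheduler(0));
    return 0;
}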

Question:
1. Does each process have its own scheduling policy, or does the whole system use a single policy?

--- Appendix ---
from ref2
TASK_RUNNING:
means that a task is in a runnable state. It does not mean that a CPU is actually allocated. The task can wait until it is selected by the scheduler. This state guarantees that the process really is ready to run and is not waiting for an external event.
TASK_INTERRUPTIBLE:
is set for a sleeping process that is waiting for some event or other. When the kernel signals to the process that the event has occurred, it is placed in the TASK_RUNNING state and may resume execution as soon as it is selected by the scheduler.
TASK_UNINTERRUPTIBLE:
is used for sleeping processes disabled on the instructions of the kernel. They may not be woken by external signals, only by the kernel itself.
TASK_STOPPED:
indicates that the process was stopped on purpose — by a debugger, for example.

Reference:
1. Understanding the Linux Kernel, Chapter 7
2. Professional Linux Kernel Architecture

Monday, September 19, 2011

PCI-2

Questions:
1. How does PCI work at boot time?
2. How does a driver look for its device?
The driver uses (vendor id, device id, class, ...) to tell kernel the devices it supports. The struct pci_device_id is used to define a list of different types of PCI devices the driver supports.
3. How does a driver access the device's configuration space?

Register a PCI driver:
1. name: driver name, must be unique
2. id_table: all supported devices; points to an array of struct pci_device_id (one driver can support several devices, all listed here)
3. int (*probe): "This function is called by the PCI core when it has a struct pci_dev that it thinks this driver wants to control" [ref1]
The PCI core finds a pci_dev, checks it against the driver's pci_device_id list, and calls the driver's probe. If the PCI driver claims the pci_dev that is passed to it, it should initialize the device properly and return 0 (a registration sketch follows this list).
4. void (*remove):
5. int (*suspend):
6. int (*resume):
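A hedged skeleton of the registration path described above; the vendor/device IDs are placeholders, and the probe body only shows the typical pci_enable_device / BAR lookup steps rather than a real driver.

#include <linux/module.h>
#include <linux/pci.h>

static const struct pci_device_id my_ids[] = {
    { PCI_DEVICE(0x8086, 0x1026) },   /* placeholder vendor/device ID */
    { 0, }
};
MODULE_DEVICE_TABLE(pci, my_ids);

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    unsigned long bar0;

    if (pci_enable_device(pdev))          /* wake the device up */
        return -ENODEV;

    bar0 = pci_resource_start(pdev, 0);   /* where BAR 0 was mapped */
    dev_info(&pdev->dev, "BAR0 at %#lx, len %#llx\n",
             bar0, (unsigned long long)pci_resource_len(pdev, 0));
    return 0;                             /* claim the device */
}

static void my_remove(struct pci_dev *pdev)
{
    pci_disable_device(pdev);
}

static struct pci_driver my_pci_driver = {
    .name     = "my_pci_driver",
    .id_table = my_ids,
    .probe    = my_probe,
    .remove   = my_remove,
};

static int __init my_init(void)
{
    return pci_register_driver(&my_pci_driver);
}

static void __exit my_exit(void)
{
    pci_unregister_driver(&my_pci_driver);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");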

Kernel PCI functions:
1. pci_get_device: scan the list of PCI devices in the system.
2. pci_get_subsys: allow subsystem vendor and subsystem device ID to be specified.
3. pci_get_slot: search the pci device on specific slot.
(all these functions cannot be called from interrupt context)
4. pci_enable_device: enables the PCI device and sets up its interrupt line and I/O regions.

Accessing the configuration space:
Once the device driver has detected its device, it reads and writes the configuration space to find out which memory or I/O regions the device has been mapped to.
pci_read_config_byte
pci_read_config_word
pci_write_config_byte

Accessing the I/O and Memory space:
Each PCI device has up to six address regions; each region can consist of memory or I/O locations.
pci_resource_start
pci_resource_end
pci_resource_flags


>lspci | cut -d: -f1,2
[bus number, 8 bit]:[device number, 5 bit].[function, 3 bit]

00:00.0 Host bridge
00:01.0 PCI bridge
00:1c.0 PCI bridge
00:1c.4 PCI bridge
00:1c.5 PCI bridge
00:1d.0 USB Controller
00:1d.1 USB Controller
00:1d.2 USB Controller
00:1d.7 USB Controller
00:1e.0 PCI bridge
00:1f.0 ISA bridge
00:1f.1 IDE interface
00:1f.2 IDE interface
00:1f.3 SMBus
02:00.0 PCI bridge
03:02.0 Ethernet controller
04:00.0 Ethernet controller
06:05.0 VGA compatible controller


ecslgw@ecslgw-cewit:~$ cat /proc/bus/pci/devices  | cut -f1,2,3

Field 1 - Bus Dev Func
Field 2 - Vendor Id + Device Id 
Field 3 - Interrupt Line 
Field 4 - BAR 0 
and the rest of the BAR registers (0 - 5) after that.

0000 80862778 0               0
0008 80862779 df               0
00e0 808627d0 de               0
00e4 808627e0 dd               0
00e5 808627e2 dc               0
00e8 808627c8 14               0
00e9 808627c9 15               0
00ea 808627ca 16               0
0200 8086032c 0               0
0310 80861026 23        fe9e0004
0400 14e41659 10        fe6f0004
0628 18ca0020 0        fd000008


3. /lib/modules/2.6.39.4/modules.pcimap
This file is searched when hot-pluggable devices are found, to match them to driver modules.

4. drivers/pci/search.c
/* For boot time work */
EXPORT_SYMBOL(pci_find_bus);
EXPORT_SYMBOL(pci_find_next_bus);
/* For everyone */
EXPORT_SYMBOL(pci_get_device);
EXPORT_SYMBOL(pci_get_subsys);
EXPORT_SYMBOL(pci_get_slot);
EXPORT_SYMBOL(pci_get_class);


Reference:
1. Linux Device Driver, Chapter 12

Writing and Presentation Guides


reference:  http://www.cs.sunysb.edu/~ezk/grad-res/index.html#writing

Strunk and White's Elements of Style
Encyclopedias, Dictionaries, and Thesauri
Checklist for Articles
How Not to Write a Paper
Mathematical Writing
Writing Up Your Final Project
Views and Feelings: A Word of Encouragement
A course on Writing and Oral Skills in Computer Science
Tips on Oral Presentations, an annual Usenet post by Henry Spencer.
The task of the referee, by Smith, A J., Computer 23(4): p. 65-71. Apr. 1990.
Collected Advice on Research and Writing
The Science of Scientific Writing, an excellent article written by English professors for scientists (PDF).

http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mleone/web/how-to.html

Sunday, September 18, 2011

PCI

PCI (Peripheral Component Interconnect)
Purpose: replace the ISA standard with three main goals:
1. better performance when moving data between peripherals, 2. platform independence, 3. simpler adding and removing of devices.
PCI devices are jumperless (no manual jumper settings are required) and are automatically configured at boot time.

- Identification of a device
PCI defines three parameters to identify a device:
1. bus number: the bus the device sits on
2. device number: a unique number assigned on that bus; each bus can hold at most 32 devices
3. function number: several functions can share one PCI slot and are distinguished by the function number.
Much used in laptops are multifunction chipsets, which are attached to the PCI and integrate a whole range of expansions (IDE controller, USB controller, modem, network, etc.) in a minimum of space [ref2]
4. pci domain: Each PCI domain can host up to 256 buses.
Each bus can have 32 devices, and each device can be a multifunction board with up to eight functions (the whole triple is identified by a 16-bit address).

- Address space
PCI uses three address spaces to communicate with a device:
1. I/O space: described by 32-bit addresses
2. Data space (memory locations): 32-bit or 64-bit addressing is available for data space, depending on the processor type
3. Configuration space: contains the type and characteristics of the individual devices; 256 bytes per device function.

* All devices on a PCI bus share the same I/O space and data space, but each device function has its own configuration space.

from fig 6.1, ref1


PCI address space and configuration header
Every PCI device has a configuration header in its configuration address space; the system determines that address from the PCI topology (it is tied to the physical PCI device location, i.e., which PCI slot it occupies).
Every PCI slot has a corresponding PCI configuration header at a known offset, so the system can walk the offsets, read each header to detect PCI devices, and use the vendor ID to determine the type.
From: ref2
Class code: identifies the kind of device, e.g., SCSI = 0x0100.
BARs (Base Address Registers): configure which PCI I/O and PCI memory ranges the device may use.
Interrupt Pin: one of four pins (A, B, C, D) used to raise interrupts.
Interrupt Line: determines which interrupt handler runs when an interrupt occurs.
Six base addresses (32 bits each) define the addresses the device uses to talk to the rest of the system; a 64-bit address consumes two base address registers.

PCI I/O and PCI Memory address:
The address spaces a PCI device and its driver use to communicate.

PCI-ISA bridge:
Converts PCI I/O and memory addresses into ISA I/O and memory addresses for legacy ISA devices.

PCI-PCI bridge:
Devices that glue the system's PCI buses together. A PCI bus can support only a limited number of devices, so PCI-PCI bridges are used to attach more of them. Two kinds of cycles, Type 0 and Type 1, are used to configure PCI-PCI bridges.

"The PCI-PCI bridges must be programmed with a base and limit for PCI I/O and PCI Memory space access that they have to pass from their primary bus onto their secondary bus. Once the PCI-PCI Bridges in a system have been configured then so long as the Linux device drivers only access PCI I/O and PCI Memory space via these windows, the PCI-PCI Bridges are invisible." [ref1]
A PCI-PCI bridge is programmed with such a window; device drivers read and write PCI I/O and memory space directly through the window, which keeps the bridge invisible to the system.

PCI configuration cycle: A cycle is just an address as it appears on the PCI bus. 
Type 0: fig 6.3
Carries no bus number; it is interpreted as a PCI configuration address on the current bus.

Type 1: fig 6.4
Used for configuration cycles whose target lies on a bus behind a PCI-PCI bridge.
Each PCI-PCI bridge has three values: 1) primary bus number: the upstream bus; 2) secondary bus number: the downstream bus; 3) subordinate bus number: the highest-numbered bus reachable downstream of the bridge.
When a PCI-PCI bridge sees a Type 1 frame (a small decision sketch follows this list):
1. If bus number > subordinate number: ignore it.
2. If bus number == secondary number: the device is on the secondary bus, so convert the cycle to Type 0.
3. If secondary < bus number <= subordinate: pass it through to the secondary interface.
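The three rules written out as a small decision function (the bridge bus numbers are the three values above; the struct and enum names are made up):

enum cfg_action { CFG_IGNORE, CFG_CONVERT_TO_TYPE0, CFG_PASS_THROUGH };

struct pci_pci_bridge {
    int primary;      /* upstream bus number   */
    int secondary;    /* downstream bus number */
    int subordinate;  /* highest bus reachable downstream */
};

static enum cfg_action route_type1(const struct pci_pci_bridge *br, int bus)
{
    if (bus > br->subordinate)
        return CFG_IGNORE;             /* target is not below this bridge */
    if (bus == br->secondary)
        return CFG_CONVERT_TO_TYPE0;   /* target sits on the secondary bus */
    if (bus > br->secondary)
        return CFG_PASS_THROUGH;       /* target is further downstream */
    return CFG_IGNORE;                 /* bus <= primary: not ours */
}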

Linux PCI Initialization:

Linux Kernel PCI data structure:
- each pci device is described by pci_dev struct
- the result is a tree topology from root to all leaf nodes
- pci_root points to the pci_bus structures; each pci_bus corresponds to a PCI-PCI bridge, since a PCI bus can only be reached through a PCI-PCI bridge.
Fig 6.5 from ref1:


Initialization can be divided into three parts: the PCI device driver, the BIOS, and fixup code.


The PCI device driver:
- a function of the OS called at system initialization.
- the PCI init code scans all the PCI buses looking for PCI devices
- It uses the PCI BIOS code to check whether a device is present in each PCI slot; if so, it sets up a pci_dev.
- Each PCI bus is represented by a struct pci_bus.
PCI device driver: a pseudo-device driver that searches the PCI system starting at bus 0 and allocates data structures for every PCI device and bridge, building the PCI topology as linked data structures. Next comes PCI-PCI bridge numbering, which walks the whole topology and assigns the primary, secondary, and subordinate numbers for each PCI-PCI bridge.

BIOS: apparently just a collection of helper functions.
Fixup: system-specific fixup code.

PCI BAR (from Wikipedia, PCI configuration space)



Reference:
ref1: http://tldp.org/LDP/tlk/dd/pci.html
ref2: Professional Linux Kernel Architecture
ref3: Linux Device Driver, chapter 12

Saturday, September 17, 2011

Recipes

1. Dumplings
Wrappers
Two bags of all-purpose flour; add water and knead into a dough.

Filling
Ground pork + sesame oil + ginger + scallions + a little soy sauce, mixed well.
Finally mix in chives (an egg can be added).
Some salt and five-spice powder can also be added.

2. Home-style tofu
Slice firm tofu, dip it in egg, pan-fry it, and set it aside on a plate.
Briefly stir-fry sliced shiitake mushrooms and carrots, add the tofu, and season with mushroom vegetarian oyster sauce.
Finally add cucumber, scallions, and cilantro.

3. Minced pork sauce
One and a half onions, ground pork, five-spice powder, fried shallots.
Stir-fry the onions, then add the ground pork, fried shallots, and soy sauce.
Once cooked through, add water and five-spice powder and simmer for an hour.
Hard-boiled eggs can be added.

Page table, shadow and extended page table (Intel VT support)

Page table basics
The page table translates virtual addresses to physical addresses. A virtual address space is a contiguous 4 GB range of addresses private to one process, but the physical memory it refers to is scattered across physical memory and may not even be resident (it may be swapped out). Using page tables, a modern OS gives each process its own contiguous-looking memory and provides isolation between process address spaces.

Page fault 
A page fault can arise in two ways. First, a lookup of a virtual address may find no corresponding entry in the page table at all, which results in a segmentation fault.
Second, the entry exists but the page is not in physical memory (the present bit in the page table entry is 0), meaning the page has been swapped out and must be loaded back into memory.

Paravirtualization 
Requires modifying the guest OS code to call hypercalls, so in practice only OSes whose source can be changed (e.g., Linux) can use it.
Otherwise binary translation is needed (e.g., for Windows), which rewrites the binary directly and has higher overhead (this is full virtualization).
From Xen wiki about hypercall:
A hypercall is to a syscall what a hypervisor is to an OS. Alternatively, a hypercall is to a hypervisor what a syscall is to a kernel. A hypercall is a software trap from a domain to the hypervisor, just as a syscall is a software trap from an application to the kernel. Domains will use hypercalls to request privileged operations like updating pagetables.

Shadow page table
- guest virtual to host physical
On a context switch the host points cr3 at the next page table. When this happens inside a VM, the hypervisor traps the instruction and points cr3 at the corresponding page table it keeps in host physical memory. The hypervisor therefore has to maintain a mapping from each guest page table to a host (shadow) page table.
The host marks these page tables read-only; when the guest modifies a pte (page table entry), the host traps the write and keeps the shadow page table in sync with the updated value.
Because trapping cr3 changes and synchronizing page tables adds significant overhead, EPT was introduced.

Extended page table
EPT is an x86 feature for virtualization; its main purpose is to extend the MMU to translate memory addresses between the VM and the physical machine ("extending" from guest physical to host physical). With EPT support, a guest physical address can be translated by the MMU directly into a host physical address (so there is no need for the traps that make shadow page tables expensive). cr3 now points at the guest's own page tables, so the guest VM can modify its page tables (guest physical addresses) directly and the MMU carries the translation the rest of the way to host physical addresses.
Intel x86 provides two operations, VM entry and VM exit; on VM entry the MMU's extended page table translation is enabled, so the VM's accesses reach host physical memory directly.

Reference:
http://en.wikipedia.org/wiki/Page_tables

Quotes Worth Sharing

Edward Yang said:
"When we make films we should not pander to the audience, because the audience never knows what it wants. But once you have made it, they will follow and watch."

Steve Jobs said:

"Apple does not do market research; they only build things they themselves want to use."
"In the real world it is not the good who beat the bad, nor the bad who beat the good; it is the proactive who beat the lazy."
"A good traveler has no fixed plans and is not intent on arriving." - Lao Tzu

Friday, September 16, 2011

x86 Segment Translation

Purpose
The purpose of virtual memory is to provide:
1. abstraction of contiguous, isolated memory to a program.
2. prevention of illegal operations.

x86 processor modes

1. real mode: 20-bit address space (1MB), direct memory access, segmentation available
2. protected mode: segmentation and paging become available, along with user/kernel separation
3. long mode: 64-bit mode for x86-64
The machine boots into real mode; the boot loader switches to protected mode and then hands control to the OS.

x86 segmentation
a segment has: base address (linear addr), length, and type (code, data, etc)
a segment for: code, data, stack, and some extra...
A program can use up to six segment registers at a time, e.g., CS, DS, SS, ES, FS, GS.

The instruction type implies which segment register is used:
1. Control flow: CS
2. Stack management: SS
3. Most load/stores: DS
A segment register is 16 bits and contains an index that points to an entry in a descriptor table.
So when these instructions execute, the processor automatically uses the corresponding segment selector and reads the base address, limit, and type out of the descriptor table (see 5.1.4, ref2).
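A tiny sketch of what the hardware does with a selector, assuming simplified descriptor/table structures (real descriptors pack the base and limit into several bit fields):

#include <stdint.h>

struct seg_descriptor {        /* simplified: real ones are packed bit fields */
    uint32_t base;             /* linear base address of the segment */
    uint32_t limit;            /* segment length */
    uint32_t type;             /* code, data, ... */
};

/* selector layout: index (bits 15..3), TI (bit 2: 0=GDT, 1=LDT), RPL (bits 1..0) */
static uint32_t segment_to_linear(uint16_t selector, uint32_t offset,
                                  const struct seg_descriptor *gdt,
                                  const struct seg_descriptor *ldt)
{
    const struct seg_descriptor *table = (selector & 0x4) ? ldt : gdt;
    unsigned index = selector >> 3;

    /* a real CPU would also check the limit, type, and privilege here */
    return table[index].base + offset;
}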


Why does the OS define segments?
1. memory protection (similar to paging)
2. Each segment has its own permissions, e.g., CS is execute/read-only.
3. A process can only perform the permitted kinds of access on a given segment; e.g., writing to CS causes a segmentation fault!
4. provide location information in memory and flag to indicate whether it resides in memory.
MMU: performs virtual-to-physical memory address translation.
reference: http://en.wikipedia.org/wiki/Memory_segmentation

Memory management register
"The processor provides four memory-management registers (GDTR, LDTR, IDTR, and TR) that specify the locations of the data structures which control segmented memory management" -- IA-32
The x86 processor defines four registers, gdtr, ldtr, idtr, and tr, so these tables can be located easily; the tables are then used to obtain linear addresses.
GDT: any process can use these segments
LDT: segment definitions for a specific process

From: IA-32 chapter2
1. When an exception occurs: idtr -> trap gate -> LDT -> exception handler
2. When an interrupt occurs: IDT -> GDT -> interrupt handler
3. On a process context switch: TR -> GDT -> TSS (state is saved there, then the context switch happens)

There are three kinds of gates: interrupt gates, trap gates, and task gates.
When an interrupt occurs, the IRQ number is used to index the IDT, and the GDT then supplies the code segment to execute; the processor effectively performs "CALL CS:IP" to jump to the handler.
A gate usually means a transition from one privilege level to another.








 Reference:
1. CSE506 slides by Don Porter
2. http://www.cs.sunysb.edu/~porter/courses/cse506/f11/ref/i386/s05_01.htm