Tuesday, October 25, 2011

ELF, Linking and Loading (draft)

Discusses the ELF file format, how a program starts executing, and how the linker/loader integrates libraries with an ELF binary.

ELF & ELF Loading
- Executable and Linkable Format
An ELF file contains:
1. Program header table: describes the segments to be loaded (e.g., those holding .text, .data, .bss)
.text – Where read/execute code goes
.data – Programmer-initialized read/write data
.bss – Uninitialized data (zeroed by convention)

2. Section header table: describes the sections, used for linking

- Static ELF: the loader simply maps the .text, .data, and .bss sections
- Dynamic ELF: the dynamic linker is loaded first, and it then loads .text, .data, and so on.
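
A minimal sketch of peeking at these structures: dump the entry point and the header-table locations using the Elf64_Ehdr type from <elf.h> (assumes a 64-bit ELF; a robust reader would validate e_ident first):

#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    Elf64_Ehdr eh;
    FILE *f;

    if (argc < 2 || !(f = fopen(argv[1], "rb")))
        return 1;
    if (fread(&eh, sizeof eh, 1, f) != 1)
        return 1;
    /* e_ident begins with the magic bytes \x7f 'E' 'L' 'F' */
    printf("entry point     : %#lx\n", (unsigned long)eh.e_entry);
    printf("program headers : %u at offset %lu\n", eh.e_phnum, (unsigned long)eh.e_phoff);
    printf("section headers : %u at offset %lu\n", eh.e_shnum, (unsigned long)eh.e_shoff);
    fclose(f);
    return 0;
}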

Dynamic Linking
1. Static linking is the result of the linker copying all library routines used in the program into the executable image.
2. Dynamic linking is accomplished by placing the name of a sharable library in the executable image. Actual linking with the library routines does not occur until the image is run, when both executable and library are placed in memory. An advantage of dynamic linking is that multiple programs can share a single copy of the library.



When dynamic linking is used, control does not jump straight to main(); the following happens first:
1. Check the headers for the libraries the program needs
2. Call mmap to map each library into the current process's address space
3. Do some bookkeeping and call main()
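
The same map-and-resolve steps can be exercised by hand at runtime with the dlopen/dlsym API; a minimal example (link with -ldl):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *h = dlopen("libm.so.6", RTLD_LAZY);    /* map the library, like step 2 */
    double (*cosine)(double);

    if (!h) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }
    cosine = (double (*)(double))dlsym(h, "cos"); /* resolve one symbol by name */
    printf("cos(0) = %f\n", cosine(0.0));
    dlclose(h);
    return 0;
}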

How is a library function called?
1. A library such as libc is usually already loaded in memory, because other programs are probably using it.
2. The linker must fill in, at each call site in the program being executed, the function's address inside the library.
3. If the linker doesn't know where a function will end up, it creates a relocation -> the resolution is deferred until later.
4. Part of loading: the linker marches through each relocation and overwrites the call target -> at this point the linker patches the function addresses in .text.

The concept:
The compiler creates a jump table for all external calls.
- This table is the PLT (Procedure Linkage Table); the PLT points into the GOT (Global Offset Table)
- The GOT stores <function, address in memory> pairs
So through the PLT, the linker can obtain a library function's current address in memory.

PLT的詳細解釋以及步驟:
A PLT is a table of absolute addresses to functions. It is used because the link editor doesn’t know where functions in shared objects will be located. Instead, a table is created so that the program and the dynamic linker can work together to find and execute functions in shared objects. I’ve simplified the explanation a bit, but at a high level:
  1. Program calls a function in a shared object, the link editor makes sure that the program jumps to a slot in the PLT.
  2. The program sets some data up for the dynamic linker and then hands control over to it.
  3. The dynamic linker looks at the info set up by the program and fills in the absolute address of the function that was called in the PLT.
  4. Then the dynamic linker calls the function.
  5. Subsequent calls to the same function jump to the same slot in the PLT, but every time after the first call the absolute address is already in the PLT (because when the dynamic linker is invoked the first time, it fills in the absolute address in the PLT).
This scheme is called lazy resolution (lazy binding): only when a function is first called through the PLT does the linker figure out the relocation and fetch the address into the GOT. ltrace exploits this by inserting breakpoints in the PLT to discover which library functions a program uses.
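
The lazy-binding idea can be imitated in plain C with a table of function pointers whose slots initially point at a resolver that patches the table on first use. This is only a toy illustration, not glibc's actual PLT/GOT machinery:

#include <stdio.h>

static void real_puts_hello(void) { puts("hello"); }

typedef void (*slot_fn)(void);
static slot_fn got[1];                 /* one "GOT" slot */

static void resolver(void)
{
    printf("resolver: first call, patching slot\n");
    got[0] = real_puts_hello;          /* overwrite the slot (the "relocation") */
    got[0]();                          /* then call the real target */
}

int main(void)
{
    got[0] = resolver;                 /* slot initially points to the resolver */
    got[0]();                          /* first call: resolved lazily */
    got[0]();                          /* later calls: go straight to the target */
    return 0;
}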


Reference:
1. Address-space slides from CSE506, Don Porter
2. Linkers and Loaders

Sunday, October 23, 2011

CSE506: Networking Softirq and NAPI

Why top half and bottom half?
1. To minimize time spent in an interrupt handler with other interrupts disabled.
2. To give the kernel more scheduling flexibility.
3. To simplify service routines.

Top half:
1. allocate an skb, 2. copy the packet into the buffer, 3. initialize fields, 4. call the bottom half (for later processing)
Buffer preallocating using DMA:

In some cases, the sk_buff can be pre-allocated, and the network card can copy data in (via DMA) before firing the interrupt.

Softirqs:
- hardware irq is hardware interrupt line, calls the top half
- Soft IRQ is the associated software “interrupt” handler : the bottom half
- Kernel’s view: per-CPU work lists: Tuples of <function, data>
When do they run?
1. On return from exceptions, interrupts, and system calls
2. Each CPU has a kernel thread "ksoftirqd_CPU#" that processes pending requests
* A softirq always runs on the CPU that raised it. Why? Locking (no deadlock, less code complexity, scales better!)

From the programmer's point of view:
1. Only one instance of a softirq function will run on a CPU at a time
2. It doesn't need to be reentrant
3. If interrupted, it won't be called again by the interrupt handler; subsequent calls are enqueued!
4. One instance can run on each CPU concurrently, so locks are still required

Tasklet: Constrained to run one at a time on any CPU.
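
A sketch of the top-half/bottom-half split using the classic (pre-5.9) tasklet API; the device specifics in my_dev_isr (acknowledging the hardware, etc.) are hypothetical:

#include <linux/interrupt.h>

static void my_bottom_half(unsigned long data)
{
    /* deferred work: runs later with interrupts enabled,
     * at most one instance at a time across all CPUs */
}
static DECLARE_TASKLET(my_tasklet, my_bottom_half, 0);

static irqreturn_t my_dev_isr(int irq, void *dev)    /* top half */
{
    /* ack the hardware and capture what must be read right now, then: */
    tasklet_schedule(&my_tasklet);   /* enqueue the bottom half */
    return IRQ_HANDLED;
}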

Softirq Priority:
HI_SOFTIRQ(high/first) 
TIMER
NET_TX
NET_RX
SCSI
TASKLET(low/last)
- Devices can decide whether their bottom half is higher or lower priority than network traffic (HI or TASKLET) 
- Transmit traffic prioritized above receive. Why? 
- The ability to send packets may stem the tide of incoming packets

Receive livelock

Reference:
1. CSE506 course slides by Don Porter.

Non-transparent Bridging (NTB)

What is NTB (Non-Transparent Bridging)?
Introduction:
Traditional PCI was designed to connect several I/O devices to a single PC. The protocol never anticipated multiple CPUs: one processor enumerates the entire memory space and controls all the devices. Under this design, PCI operation breaks down as soon as another CPU is added.

NTB is a solution that lets multiple hosts share multiple PCI devices. An NTB effectively emulates a PCI endpoint on each of its two sides and uses address translation to pass data between them. The NTB exposes a Type 0 CSR header to present itself as an endpoint, so the host on that side does not continue forwarding or discovery beyond it.

Downstream/upstream side:
The downstream side of the NTB faces the intelligent system (controller).
The upstream side faces the host/root complex, so the upstream host sees the NTB as a PCI device endpoint, and stops there.
"Device on the downstream side is not visible from the upstream side, this allows an intelligent controller on the downstream side to manage devices here" -> so all the other devices are controlled by the downstream-side controller.

from ref2:
"A system host will enumerate through Bridges A and B (both transparent) on the left branch of the figure 2 until it reaches the endpoint X.  On the right side branch, the host will stop enumeration at Bridge D (NTBD).  Similarly, the local CPU will enumerate through Bridge E and F (both virtual bridges within the switch) and discover the endpoint Y, but will not attempt to discover elements beyond Bridge D. "

Device Identification:
A normal transparent PCI-PCI bridge has class code 0604h, while an NTB has class code 050000h, i.e., a RAM controller. That fits, because an NTB's job is memory mapping from one address domain to another.
Each device is identified by <device, port, type>.
- A TB (Transparent Bridge) switch port is further divided into upstream and downstream; to the NTB it appears as an endpoint.
These upstream/downstream properties are controlled through the CSR BARs.

CSR header:
A transparent bridge uses a Type 1 CSR header, whereas an NTB uses a Type 0 CSR header.

Scratchpad and doorbell registers:
- Scratchpad registers are the bridge for communication between the two sides; there are eight of them.
- They can be accessed in either memory or I/O space from both the primary and secondary interfaces.
- They carry control and status information between primary- and secondary-bus devices.
- Doorbell registers are used to fire an interrupt from one side to the other.
- Each interface (primary and secondary) has its own set of registers, accessible through memory or I/O space.
- Standard PCIe allows the interrupt to be configured as INTx or MSI (Message Signaled Interrupt)

Address Translation
The NTB uses BARs to translate addresses between its two sides.
- A BAR defines an address translation window into memory space on the other side of the NTB
- BAR0 and BAR1 are reserved for memory- and I/O-mapped access to the CSRs
- Each BAR has a setup register that specifies the window size and type
- A transparent bridge forwards all CSRs based on bus number; an NTB only accepts CSR transactions addressed to the device itself

The two address translation mechanisms are:
1. Direct Address Translation (using the BAR Setup register and the BAR Address Translation register)

How to set up the address translation of NT BARs:
A. Set up the "BAR Setup register" to reserve a range of memory addresses
   (normally done in EEPROM)
B. Set up the "BAR Address Translation register" to map that range of
   memory addresses to memory addresses on the other NT side
   (normally done at runtime)
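
A kernel-style sketch of step B under stated assumptions: the register offsets below are hypothetical placeholders; the real offsets come from the switch vendor's datasheet (e.g., PLX):

#include <linux/io.h>
#include <linux/kernel.h>

#define NT_BAR2_SETUP  0xd4   /* hypothetical offset of the BAR2 setup register */
#define NT_BAR2_XLAT   0xc8   /* hypothetical offset of the BAR2 translation register */

/* Map a window of 'size' bytes (power of two) onto 'peer_addr' on the
 * other side of the NTB. 'csr' is the ioremap()ed CSR space. */
static void nt_map_window(void __iomem *csr, u64 peer_addr, u32 size)
{
	iowrite32(~(size - 1), csr + NT_BAR2_SETUP);              /* A: reserve window size */
	iowrite32(lower_32_bits(peer_addr), csr + NT_BAR2_XLAT);  /* B: aim it at the peer */
}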

From PLX:



2. Lookup Table Based Address Translation
The other approach uses a lookup table to perform the address translation.

Request ID Translation
(I have not fully figured this part out yet.)
When a request enters one side of the NTB and must cross to the other, its Requester ID (bus number, device, function) differs between the two sides, so the NTB maintains a translation table that performs a one-to-one mapping.

From ref1:


Complete ID Translation


Reference:
1. Using Non-transparent Bridging in PCI Express Systems
2. Multi-Host System and Intelligent I/O Design with PCI Express
3. Enabling Multi-Host System Designs with PCI Express Technology

Wednesday, October 19, 2011

Intel 8255X (draft)

Chapter 6: Host Software Interface
1. The 8255x establishes a shared-memory communication system with the host CPU.
2. The host controls the device by writing and reading data to and from this shared memory space.
Where is this shared memory? In a BAR?

The shared memory architecture:
Divided into three parts:
1. CSR: Control/Status Registers, residing on the LAN controller
2. CBL: Command Block List, residing in host memory
3. RFA: Receive Frame Area, residing in host memory

CSR:
1. The first 8 bytes of the CSR are called the System Control Block (SCB). The SCB serves as a central communication point for exchanging control and status information between the host CPU and the 8255x.
2. The host software controls the state of the Command Unit (CU) and Receive Unit (RU) (for example, active, suspended or idle) by writing commands to the SCB.
3. The device posts the status of the CU and RU in the SCB Status word and indicates status changes with an interrupt.
4. The SCB also holds pointers to a linked list of action commands called the CBL and a linked list of receive resources called the RFA.
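
A sketch of those first 8 bytes in C, following the field offsets in the manual and the Linux e100 driver (status word at 0x00, command word at 0x02, general pointer at 0x04); this is illustrative, not a drop-in driver structure:

#include <stdint.h>

struct scb {
    volatile uint16_t status;   /* CU/RU states + interrupt status bits */
    volatile uint16_t command;  /* CUC/RUC command fields + interrupt masks */
    volatile uint32_t gen_ptr;  /* physical pointer to the CBL or RFA in host memory */
};

The CUC and RUC commands mentioned in the glossary below are written into the command field, and the gen_ptr field tells the device where the CBL or RFA lives.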



Glossary:
1. CSR
2. SCB (System Control Block) 
The CUC and RUC fields of the Command byte specify the actions to be performed by the 8255x
3. CU (Command Unit)
4. RU (Receive Unit)


Reference:
1. Intel 8255x 10/100 Mbps Ethernet Controller Family Open Source Software Developer Manual

Monday, October 17, 2011

Seborrheic dermatitis

1. Oct 17, 2011
- Clindamycin Phosphate Lotion (1% clindamycin)
- OXY face wash, acne treatment (10% Benzoyl Peroxide)

2. Nov 4, 2011
- Eryth/Benz Perox Top GEL (ERYTHROMYCIN)
http://www.healthno1.com/component/content/article/134-antibiotics/1634-hna1670.html

- DOXYCYCLINE HYC
http://tw.myblog.yahoo.com/skindr-wang/article?mid=2311

Tuesday, October 11, 2011

Chapter 17: Page Frame Reclaiming

Page Frame Reclaiming Algorithm (PFRA)
From Understanding the Linux Kernel, chapter 17.
Discusses how, given a physical page, to find who is using it. The possible users are: 1. the file cache, 2. a process address space, 3. a device cache. Heuristics then decide which pages should be released first.

"The page frame reclaiming algorithm of the Linux kernel refills the lists of free blocks of the buddy system by "stealing" page frames from both User Mode processes and kernel caches."
- So pages to reclaim can be chosen from either user-mode processes or kernel caches.

Page type:
Anonymous page:
A page is said to be anonymous if it belongs to an anonymous memory region of a process (for instance, all pages in the User Mode heap or stack of a process are anonymous)
Mapped page: a page is mapped to a file.
For instance, all pages in the User Mode address spaces belonging to file memory mappings are mapped, as well as any other page included in the page cache.

Anonymous Page Reclaim
When the kernel does page reclaim, it must find, for a given physical page, the processes using it. If a single process uses the page, it is non-shared; if several do, it is a shared anonymous page.


How to identify an anonymous page:
struct page {
  _mapcount: counts page-table mappings; starts at -1, so >= 0 means mapped and > 0 means shared by multiple processes
  struct address_space *mapping: if NULL -> anonymous page; else -> mapped to a file (inode)
}
mapping: pointer to the owning object

Shared and non-shared
If the page is anonymous, the kernel keeps a list called anon_vma.
anon_vma is a list of all the vm_area_structs, from different processes, that point to the same page: a shared anonymous page. When the kernel needs to free anonymous pages, it walks this list to find all the related processes, gets each vm_area_struct, and invalidates the corresponding page table entries.

"We have seen in the previous section that descriptors for anonymous memory regions are collected in doubly linked circular lists; retrieving all page table entries referencing a given page frame involves a linear scanning of the elements in the list. The number of shared anonymous page frames is never very large, hence this approach works well." from ref2:

From ref2, fig 17-1: two processes (mm_structs) share the same page; the anon_vma links together all the vm_area_structs sharing that page.
To release the page, the kernel must iterate over all the vm_area_structs, find each one's page table, do a linear scan, and then swap the page out or write it to disk.


Mapped Page Reclaim
Problem 1: given a page to reclaim that is mapped from a file, find which offset within the file corresponds to this page.

For a mapped page: the struct address_space *mapping in struct page leads to the inode. The address_space has a radix_tree_root, and the radix tree maps file offsets to positions in the page cache. When the page is freed, the radix tree in the address_space must be updated (the slot set back to NULL).

Problem 2: having found the file offset, many processes may map that offset via mmap, e.g., libc.
Idea: as with anonymous pages, keep a list of all the vmas.
Complication 1:
Not all file mappings map the entire file.
If we want page 4 of a file, we must filter out all vmas that don't include page 4.
Complication 2: anonymous mappings won't be shared much.
An anonymous page is rarely shared except through fork, whereas mapped files are shared constantly, e.g., libc.
Problem: lots of entries on the list, many of which might not overlap.

Solution: Priority Search Tree
Idea: binary search tree that uses overlapping ranges as node keys
Bigger, enclosing ranges are the parents, smaller ranges are children
Use case: Search for all ranges that include page N

How the three trees are used
1. Radix tree (inode -> physical memory)
The page cache of a file (inode); handles the mapping from a file (inode) to its address space. The inode has i_mapping (a struct address_space), and the address_space has a radix_tree_root:
struct address_space {
        struct inode            *host;          /* owner: inode, block_device */
        struct radix_tree_root  page_tree;      /* radix tree of all pages */
...
This page_tree yields the mapping from file offset to physical memory.
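
A minimal lookup sketch in 2.6-era kernel style (the real find_get_page also takes a reference on the page and handles races with truncation):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/radix-tree.h>

/* Find the page caching byte 'offset' of the file, if it is resident. */
static struct page *lookup_cached_page(struct address_space *mapping, loff_t offset)
{
	struct page *page;

	rcu_read_lock();
	page = radix_tree_lookup(&mapping->page_tree, offset >> PAGE_SHIFT);
	rcu_read_unlock();
	return page;    /* NULL -> not cached, must read from disk */
}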

2. Red-black tree (process address space -> page)
Handles the mappings within a process address space. Each process has an mm_struct, which manages all of its vm_area_structs, keeping them on a list and in a red-black tree. (Note: a red-black tree is balanced.)
from ref1:

3. Priority tree (process address space -> file)
"There is a priority search tree for every file; its root is stored in the i_mmap field of the address_space object embedded in the file's inode object. It is always possible to quickly retrieve the root of the search tree, because the mapping field in the descriptor of a mapped page points to the address_space object. " from ref2
Case: mapped file pages
Many programs use mmap to map a file directly into their address space; libc is the common example. (So this is neither the file page cache nor a process page-table mapping.) Each program may map only part of the libc file rather than the whole thing into its own address space, so Linux uses a PST (Priority Search Tree) to manage this mapping.

from ref2 fig17-2:
Goal: given a file offset, find which vm_area_structs contain that offset (via mmap).
Each node in the PST represents an interval and contains a list of the vm_area_structs mapping that interval.
The figure below shows file pages 0 ~ 5.
Radix: the start of the interval; heap: the end of the interval.
Invariants: 1. a parent's heap number > its children's heap numbers. 2. a left child's radix number < a right child's radix number.
Traversal for page n:
if n > heap number, stop
else
  if n is in the range of radix and heap, it's the answer
  continue checking ALL the children.

Example: n = 4
4 < 5 and 4 is in (0,5) -> match (0,5,5); 4 <= 4 -> match (0,4,4); 4 > 0, 4 > 2; 4 < 5 and 4 is in (2,5) -> match (2,3,5); 4 > 3, 4 > 2 -> stop.
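
A toy, self-contained C version of this traversal; the node layout and the two-node tree in main are illustrative, not the kernel's prio_tree structures:

#include <stdio.h>

/* Toy PST node: interval of file pages [radix, heap]; in the kernel each
 * node also carries the list of vm_area_structs mapping that interval. */
struct pst_node {
    unsigned radix, heap;
    struct pst_node *left, *right;
};

/* Report every interval containing page n. A parent's heap number is >=
 * its children's, so once n > heap the whole subtree can be pruned. */
static void pst_for_page(const struct pst_node *node, unsigned n)
{
    if (!node || n > node->heap)
        return;
    if (n >= node->radix)
        printf("interval (%u,%u) maps page %u\n", node->radix, node->heap, n);
    pst_for_page(node->left, n);   /* must check ALL the children */
    pst_for_page(node->right, n);
}

int main(void)
{
    struct pst_node child = { 0, 4, 0, 0 };      /* pages 0-4 */
    struct pst_node root  = { 0, 5, &child, 0 }; /* pages 0-5 */
    pst_for_page(&root, 4);                      /* both intervals match */
    return 0;
}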

In the figure below, the inode is in the upper right; it points to struct prio_tree_root i_mmap, which points to a priority tree. The priority tree records, for each file-offset page, the mm_structs that map it, and each mm_struct contains multiple vm_area_structs. This is how, starting from a file's inode, one can find every process that has mmap'ed the file.

From ref1, fig 4-7:

struct address_space {
        struct inode            *host;          /* owner: inode, block_device */
        struct radix_tree_root  page_tree;      /* radix tree of all pages */
        spinlock_t              tree_lock;      /* and lock protecting it */
        unsigned int            i_mmap_writable;/* count VM_SHARED mappings */
        struct prio_tree_root   i_mmap;         /* tree of private and shared mappings */
        struct list_head        i_mmap_nonlinear;/*list VM_NONLINEAR mappings */

LRU (Least Recently Used)
Every page sits on one of two LRU lists: active or inactive. A page that gets accessed moves to the active list; one unused for a long time moves to the inactive list.

- Anonymous pages are used by user space, so the kernel must know whether a page is in frequent use; rarely used pages can be reclaimed.
- The page table entry contains an accessed bit; x86 hardware sets it whenever the entry is used, which informs the Linux kernel. The kernel periodically resets and checks this bit to learn whether the page is frequently used.

Page Reclaim Heuristics:
Triggered by: low memory, hibernation, free memory below a threshold.
Which pages should be freed first?
Goal: minimize the performance disruption.
1. Free harmless pages first
2. Steal pages from user programs, especially those that haven't been used recently
    - after reclaiming a page, remove all references to it
3. Laziness: favor pages that are "cheaper" to free
    - waiting on write-back of dirty data takes time


Reference:
1. Professional Linux Kernel Architecture, chapter 4.
2. Understanding the Linux Kernel, chapter 17.
3. Segment tree: http://en.wikipedia.org/wiki/Segment_tree (or Priority tree)

Monday, October 10, 2011

C pointer exercises

From JOS Lab1:
http://www.cs.sunysb.edu/~porter/courses/cse506/f11/lab1.html

1. Read 5.1 (Pointers and Addresses) through 5.5 (Character Pointers and Functions) in K&R. Then download the code for pointers.c, run it, and make sure you understand where all of the printed values come from. In particular, make sure you understand where the pointer addresses in lines 1 and 6 come from, how all the values in lines 2 through 4 get there, and why the values printed in line 5 are seemingly corrupted.


2. There are other references on pointers in C, though not as strongly recommended. A tutorial by Ted Jensen that cites K&R heavily is available in the course readings.

3. Here are a few specific points you read about in K&R Chapter 5 that are worth remembering for the following exercise and for future labs.
  • If int *p = (int*)100, then (int)p + 1 and (int)(p + 1) are different numbers: the first is 101 but the second is 104. When adding an integer to a pointer, as in the second case, the integer is implicitly multiplied by the size of the object the pointer points to.
  • p[i] is defined to be the same as *(p+i), referring to the i'th object in the memory pointed to by p. The above rule for addition helps this definition work when the objects are larger than one byte.
  • &p[i] is the same as (p+i), yielding the address of the i'th object in the memory pointed to by p.
pointers.c
-----------
#include <stdio.h>
#include <stdlib.h>

void
f(void)
{
    int a[4];
    int *b = malloc(16);
    int *c;
    int i;

    printf("1: a = %p, b = %p, c = %p\n", a, b, c);

    c = a;
    for (i = 0; i < 4; i++)
        a[i] = 100 + i;
    c[0] = 200;
    printf("2: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c[1] = 300;
    *(c + 2) = 301;
    3[c] = 302;
    printf("3: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c = c + 1;
    *c = 400;
    printf("4: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    c = (int *) ((char *) c + 1);
    *c = 500;
    printf("5: a[0] = %d, a[1] = %d, a[2] = %d, a[3] = %d\n",
    a[0], a[1], a[2], a[3]);

    b = (int *) a + 1;
    c = (int *) ((char *) a + 1);
    printf("6: a = %p, b = %p, c = %p\n", a, b, c);
}

int
main(int ac, char **av)
{
    f();
    return 0;
}

Thursday, October 6, 2011

CSE506: Block Device Scheduling (DRAFT)

The big picture:
From ref1: "Below the generic block layer, the "I/O scheduler " sorts the pending I/O data transfer requests according to predefined kernel policies. The purpose of the scheduler is to group requests of data that lie near each other on the physical medium. "
The goals of scheduling:
1. Throughput, Latency,
2. Safety: file system can be recovered after a crash
3. Fairness – surprisingly, very little attention is given to storage access fairness

Problems:
- Between page cache and disk, you have a queue of pending requests
- Requests are tuples of (block #, read/write, buffer addr)
- You can reorder these as you like to improve throughput

Below is from ref2.
Elevator Algorithm
- Like an elevator: going up, pick up everyone headed up; coming down, drop off everyone headed down.
Idea: require the disk arm to move in continuous "sweeps" in and out
- Reorder requests within a sweep
- This approach prevents starvation: sectors at the "inside" or "outside" get service within a bounded time
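
A toy sketch of one sweep decision under these assumptions (pick_next and the array queue are illustrative, not Linux's elevator code):

#include <stddef.h>

struct request { unsigned long sector; /* plus rw flag, buffer addr, ... */ };

/* C-SCAN-style choice: among pending requests, pick the closest sector at
 * or beyond the head position; if none remain, wrap around to the lowest. */
static struct request *pick_next(struct request *q, size_t n, unsigned long head)
{
    struct request *next = NULL, *lowest = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!lowest || q[i].sector < lowest->sector)
            lowest = &q[i];
        if (q[i].sector >= head && (!next || q[i].sector < next->sector))
            next = &q[i];
    }
    return next ? next : lowest;   /* continue the sweep, or start a new one */
}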

Completely Fair Queueing
- Uses round-robin so every process gets a fair share of requests.
Idea: add a second layer of queues, one per process, then round-robin among them.
(x) Ping-pongs the disk head around

Deadline Scheduler
- Requests nearest their deadline run first.
Idea: associate an expiration time with each request; as requests approach expiration, make sure they get dispatched. Good for real-time applications.

Anticipatory Scheduler
- When a process issues bursts of requests, wait briefly (batching), then service the burst together.
Idea: try to anticipate locality of requests.
If process P tends to issue bursts of requests for nearby disk blocks, then on seeing a request from P, wait a bit to see whether more arrive before scheduling them.

Reference:
1. Understanding the Linux Kernel, Chapter 14.3
2. cse506 slide by Don Porter

CSE506: Linux Page Cache

Page Cache
The page cache's main job is to act as a cache for files on disk. Every on-disk file has a corresponding page cache (a radix tree); the smallest unit of file access is the 4K page, and whatever is fetched from disk is kept in the page cache.
To read a file, find_get_page() takes an inode and an index, the offset within that file's address space. The Linux kernel searches the radix tree with this index to obtain the corresponding physical page. If the page is not in the page cache, it has to be read from disk again.

Recap: what memory is used for
1. Process virtual memory, 2. files, 3. I/O mappings, and 4. device buffers.


Note:
- There are several kinds of address spaces:
1. Virtual address space: 4G, anonymous (vm_area_struct, vma)
2. Physical address space: depends on the size of your physical memory
3. File address space: offset from the start of the file
- Each inode has its own page cache
- Once a file exceeds 4G, its address space no longer fits in 32 bits; you need 64-bit addresses or a concatenation of registers.

Radix tree
For processes, Linux manages virtual-to-physical mappings with page tables; for files, it uses a radix tree. (The difference: a virtual address is always in 0-4G, but a file can be any size.)
"In order to perform page cache lookup efficiently, Linux 2.6 makes use of a large set of search trees, one for each address_space object."
The radix tree uses the prefix of the file offset to find the corresponding page. The page cache is kept in one radix tree per file; each node has 64 slots, and the leaves point to physical pages.
- Branching factor: 64
- TRIE: prefix tree. Rather than storing the entire key in a node, the traversal of the parents builds the prefix, and the node stores only the suffix.
- Height 1: size = 64 * 4K = 256K; height 2: 64 * 64 * 4K = 16M

Positioning:
The low 12 bits of a file offset (the 4K page offset) are stripped off first; after that, each group of 6 bits is the index into one level of the radix tree.
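
A small sketch of that bit-slicing, assuming a height-4 tree:

#include <stdio.h>

#define PAGE_SHIFT 12   /* strip the low 12 bits: offset within the 4K page */
#define RADIX_BITS 6    /* 64-way branching: 6 bits per tree level */

int main(void)
{
    unsigned long file_offset = 0x12345678;
    unsigned long index = file_offset >> PAGE_SHIFT;   /* page number in the file */

    /* Peel off 6 bits per level, most significant group first. */
    for (int level = 3; level >= 0; level--)
        printf("level %d slot: %lu\n", 4 - level,
               (index >> (level * RADIX_BITS)) & ((1UL << RADIX_BITS) - 1));
    return 0;
}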

from ref1:
Buffer heads (BH) 15.2
(Mapping pages to disk blocks)
Buffer heads are needed because the disk's unit is the 512-byte sector while the OS's unit is the 4K page, so each page holds 8 buffers. The Linux kernel uses buffer_head for the housekeeping. When a write covers less than a full page, the page in the page cache is marked dirty, but only the affected buffer heads are written to disk, not the whole page containing 8 buffers.
from ref2:
Example: write() system call for first 5 bytes
1. Look up first page in radix tree
2. Modify page, mark dirty
3. Only mark first buffer head dirty

Raw device caching:
A disk can also have its own radix tree + address space!

Page cache vs. page table
The page cache caches on-disk files, whereas a process address space is managed in 4K pages through page tables (10 bits of page directory, 10 bits of page table, and 12 bits of offset).

As the figure below shows, the page cache covers the path from an address-space region to the backing store (left half), while the page table maps the virtual address space onto physical page frames.

review page table from ref3:

Linux represents portions of a process with a vm_area_struct, or vma, and manages these mappings with a red-black tree. from ref4


Reference:
1. Understanding the Linux Kernel, Chapter 15
2. http://www.ibm.com/developerworks/cn/linux/l-cache/index.html
3. Professional Linux Kernel architecture
4. CSE506 slide, Don Porter

Wednesday, October 5, 2011

PCI useful links

Focus: linking TCP/IP with PCIe HW.

Links
1. LDD example code
http://www.cs.fsu.edu/~baker/devices/lxr/http/source/ldd-examples
2.LDD, network driver
http://www.xml.com/ldd/chapter/book/ch14.html
3. Kernel documents
http://www.mjmwired.net/kernel/Documentation/PCI/pci.txt
4. Introduction to PCI
http://teknirvana.com/documents/Pci.pdf


Papers:
Unread:
1. SR-IOV Networking in Xen: Architecture, Design and Implementation
2. Standardized but Flexible I/O for Self-Virtualizing Devices
3. Towards an Integrated IO and Clustering Solution using PCI Express

Read:

VIM notes

1. Build the tags file
> ctags -R .

2. Configure .vimrc
set tags=tags

3. cscope

4. Common commands
ctrl + n: show all completion candidates
shift + d: delete from the cursor to the end of the line
:ta: look up a tag
:!<command>: run an external program

Device Driver CSE506

I/O ports:

1. I/O uses a different address space from the CPU's memory space
2. A port is just an address in that space
3. x86 distinguishes them with separate instructions: inb, inw, ...
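
A hedged userspace example on Linux/x86 (requires root): read the CMOS RTC seconds register through the classic 0x70/0x71 index/data ports:

#include <stdio.h>
#include <sys/io.h>

int main(void)
{
    if (ioperm(0x70, 2, 1) < 0) {     /* ask the kernel for access to ports 0x70-0x71 */
        perror("ioperm");
        return 1;
    }
    outb(0x00, 0x70);                 /* write the index port: select CMOS register 0 */
    printf("RTC seconds: %02x\n", inb(0x71));   /* read the data port */
    return 0;
}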

Port reads and writes have side effects, unlike memory:
- memory reads/writes may be cached
- operations may be reordered
- Idiosyncrasy: composition doesn't necessarily work:
  outw 0x1010 <port>  !=  outb 0x10 <port> + outb 0x10 <port+1>

Port permissions
1. Via the IOPL flag in EFLAGS
2. "the processor first checks whether CPL <= IOPL. If this condition is true, the I/O operation may proceed. If not true, the processor checks the I/O permission map." ref1
3. It then checks the I/O permission bitmap in the TSS: each I/O port is a byte address, and each bit in the bitmap encodes the permission to access that byte address (port).

Memory Mapped IO
1. Map devices onto regions of physical memory
2. Win: Cast interface regions to a structure
3. Hardware basically redirects these accesses away from RAM at the same location (if any), to the devices

Side Effect
The drawback is that there are still side effects (hence the need for barriers).
Common optimizations:
1. out-of-order execution
2. write reordering
3. caching values in registers (for as long as possible)
But when we read or write devices, we want the access to happen immediately.

__volatile__:
- read/write memory directly, bypassing cache and optimization
- a write must go directly to memory
- a read must come from memory
- prevents the compiler from optimizing away or eliminating memory accesses
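
A minimal sketch of why this matters for device polling; status_reg and READY_BIT are hypothetical device details:

#include <stdint.h>

#define READY_BIT 0x1   /* hypothetical status bit */

/* Without volatile, the compiler may read the register once, cache the value
 * in a CPU register, and spin forever; volatile forces a fresh load each time. */
static void wait_ready(volatile uint32_t *status_reg)
{
    while ((*status_reg & READY_BIT) == 0)
        ;   /* each iteration re-reads the device register */
}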

Compiler and CPU barriers:
1. Clobber list in the inline assembly (a compiler barrier)
2. Basic idea: in some cases the CPU can issue loads and stores out of program order (to optimize performance);
use fence instructions to enforce the required ordering.
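
A sketch of the two kinds of barrier on x86, in the kernel's usual inline-assembly idiom (the real kernel macros differ across versions):

/* Compiler-only barrier: the "memory" clobber forbids the compiler from
 * reordering or caching memory accesses across this point. */
#define barrier()  __asm__ __volatile__("" ::: "memory")

/* CPU + compiler barrier: mfence orders all earlier loads/stores before
 * any later ones as seen by other CPUs and devices. */
#define mb()       __asm__ __volatile__("mfence" ::: "memory")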


Reference:
1. http://pdos.csail.mit.edu/6.828/2011/readings/i386/s08_03.htm#fig8-2
2. CSE506 slide by Professor Don Porter

Sunday, October 2, 2011

JOS lab3

env_run: 1. Entering user mode
First call env_pop_tf(&e->env_tf),
which restores this environment's saved register values from its trapframe into the registers and then executes iret.

__asm __volatile("movl %0,%%esp\n"
        "\tpopal\n"
        "\tpopl %%es\n"
        "\tpopl %%ds\n"
        "\taddl $0x8,%%esp\n" /* skip tf_trapno and tf_errcode */
        "\tiret" /* goto eip, set eip to elf->entry */
        : : "g" (tf) : "memory");
 
After the iret we are in user mode, with eip already pointing at elf->entry, the starting address of the program.

2. load_icode
load_icode must load the program according to the ELF binary format.
An ELF file contains multiple program headers, each describing one segment.
The ELF header has an e_entry virtual address, the starting address of the process:
"This member gives the virtual address to which the system first transfers control, thus starting the process."

Each environment has a trapframe that stores its registers while the environment is not running. (So to run the env, those registers must be reloaded from the trapframe.)
"holds the saved register values for the environment while that environment is not running: i.e., when the kernel or a different environment is running. The kernel saves these when switching from user to kernel mode, so that the environment can later be resumed where it left off."


The trapframe's first member is tf_regs, which holds the saved general-purpose registers.
At the end of load_icode, env_tf.tf_eip must be set to elf->e_entry.

3. Setting up the IDT: idt_init
- First install the kernel's TSS descriptor in the GDT; it is needed when an interrupt occurs
- Load the TSS: ltr(GD_TSS)
- Set up the IDT entries:
SETGATE(idt[0x0], 1, GD_KT, do_divide_error, 0);
SETGATE(idt[0xd], 1, GD_KT, do_general_protection, 0);
SETGATE(idt[0xe], 1, GD_KT, do_page_fault, 0);
SETGATE(idt[0x28], 1, GD_KT, do_badsegment, 0);

The interrupt handlers themselves are defined in assembly, in trapentry.S:
TRAPHANDLER_NOEC(do_divide_error, 0x0)

_alltraps:
    pushl %ds;          # finish building the trapframe: segment registers
    pushl %es;
    pushal;             # push all general-purpose registers (tf_regs)
    movl $GD_KD, %eax;  # switch to the kernel data segment
    movl %eax, %ds;
    movl %eax, %es;
    movl %esp, %eax;    # %esp now points at the trapframe
    pushl %eax;         # pass it as the argument to trap()
    call trap;


Reference:
1. ELF header: http://www.sco.com/developers/gabi/1998-04-29/ch4.eheader.html
2. Program header: http://www.sco.com/developers/gabi/latest/ch5.pheader.html