2011年9月26日 星期一

Linux Memory Allocator

NUMA, Node, and Zone 
NUMA and Node
Linux 2.6之後支援NUMA (Non Uniform Memory Access), 記憶體的分配是有區域性的, 實體記憶體被切割成多個node. 每個CPU access屬於自己的memory node的時間相同, 若是不同的node則需要較多的時間.
每個node的相關資料存在struct pg_data_t

Zone and Region
每個node裡面在區分成多個zone, pg_data_t裡面含有node_zonelists and node_zones

因為硬體限制 P.299
1. DMA必須要access前16MB的address space
2. 32-bit系統最多只可以定址到4G
3. Kernel address space只有1G的空間 (在扣掉保留的128MB), 0xc0000000 - 0xffffffff

所以將Zone區分成三種
1. ZONE_DMA: below 16MB
2. ZONE_NORMAL: 16MB ~ 896MB
3. ZONE_HIGHMEM: > 896MB
"the ZONE_HIGHMEM zone includes page frames that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of linear address space" from ref1

Kernel address space 介於3G ~ 4G, 大於896MB的記憶體不值接linear map到kernel. 小於896MB可值接將physical memory address對應到linear address.

* 896MB = 1024 - 128MB, 比較高的128MB用來做Fixed Mapped Linear Address for HIGHMEM.

from ref1, fig 8-2: 每區的zone有自己的buddy system, buddy system底下有slab. 當使用ZONE_HIGHMEM時, 會始用highmem的buddy system.


解決highmem mapping的方法
1. permanent kernel mapping, 2. temporary kernel mapping, 3. noncontiguous memory allocation. (p. 306)

Buddy Allocator
Buddy algorithm:
- no external fragmentation
- worse case: waste less than half of the requested size
- kernel need to keep track of the existing blocks of free, contiguous page frames.
- 所有的free page frame被分配到11個list, 根據大小1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames.
- Zone allocator is the frontend of buddy system
- 在buddy allocator之上還有slab allocator

Buddy system每次以page為單位太大了, 所以有slab allocator來管理較小的memory request
- buddy allocator是以page為管理單位, 而slab是以cache為管理單位

Slab allocator
- the default cache allocator
- "kernel creates 13 geometrically distributed lists of free memory areas whose sizes range from 32 to 131, 072 bytes" from ref1
- The slab allocator groups objects into caches . Each cache is a "store" of objects of the same type
- 缺點: Similar basic ideas, but substantially more complex bookkeeping.
- 缺點: 2 groups upset 1. Users of very small system (ex: watch) 2. Users of large multi-processor systems
- example: Slab再不同CPU之間溝通需要較高的overhead.
For very large (thousands of CPU) systems, complex allocator bookkeeping gets out of hand. slabs try to migrate objects from one CPU to another to avoid synchronization. (ref2)

SLOB allocator
- 給小的系統使用,例如一些嵌入式系統, overhead較小
- Simple List of Blocks
- Keep a free list of available chunk and its size
- Grab the first one big enough chunk to work
- No internal fragmentation
- Has external fragmentation but trade with low overhead

SLUB allocator
- SLAB有bookkeeping overhead, 因為不同CPU之間需要協調
- The Unqueue Slab Allocator
- All objects of same size from same slab
- simple free list of per slab
- no cross-CPU nonsense
- Does better than SLAB in many cases


from ref4, fig 8-1


Cache line alignment:
1. cache line: 64 ~ 128 bytes on most CPUs
2. word: 32 bits or 64 bits
3. cache line是memory cache的基本單位


Reference:
1. Understanding the Linux Kernel, chapter 8
2. Slides from cse506, Don Porter
3. False Sharing:
Performance overhead under multiprocessor cache system, see http://en.wikipedia.org/wiki/False_sharing
4. Linux device driver programming

I/O Addressing and Memory Mapped I/O

Driver的功能是幫助kernel控制硬體, Kernel不了解的裝置就交給driver, 透過drive, user program可以藉由system call來控制硬體

80386提供兩種input/output的控制方式去讀寫裝置上的register
1. 獨立的I/O address space (I/O mapped I/O)
2. Memory mapped I/O (和main memory共同使用memory address space)

I/O address space
- cat /proc/ioports
- 提供2^16 (64K) address space
- 每次可讀寫1, 2, 4btyes
- inb(), outb(), inw(), outw()...
- 每個硬體裝置都有自己的I/O port address space, 必須保證其他硬體不會重複使用
- instructions in and out move data between a register and a port I/O address space

Memory-mapped I/O
- cat /proc/iomem
- 以讀寫記憶體代替I/O port
- Place the I/O devices in the 80386 memory address space.
- Driver只要向預先定義好的位址範圍做讀寫 (4G - 3G 空間可當作address mapping)
- ioremap(), ioremap_nocache(), iounmap()
- 此方法提供較大的彈性, 使用mov, and, or就可值接對裝置的device register做運算
- Memory map定址到3G-4G的physical address space, 但實體記憶體可能小於4G, 所以當軟體讀寫這個區段的時候, 會被硬體當作memory map的讀寫動作, 然後轉換成對裝置的讀寫.
(see ref2, chapter 12 PCI).
(如果所安裝的實體記憶體超過3G, 就需要PAE: physical address extension support)

PCI
- 讀寫PCI裝置時可用I/O address space or Memory mapped I/O
- Linux提供interface來統一處理兩種不同的讀寫方式
- pci_iomap(), ioread8(), iowrite8()
- IA-32的平台式大部份透過I/O mapping來讀寫配置空間 (configuration space).

IOMMU 
- MMIO必須要占掉一塊連續的physical memory讓device做mapping, IOMMU maintain一個table用來做translation, 因此不需要連續的physical memory. Vector I/O (scatter-gather list) can be avoided.
-  In virtual environment, VM會設定DMA map到guest physical, 但guest physical無法直接存取host physical memory. IOMMU 會做remap, 把guest physical address map到host physical上來達成DMA. 所以native device driver就可以直接讓gust VM使用.
- IOMMU概念類似Extended Page Table, 負責做guest virtual -> guest physical -> host physical的後半段translation. (其實EPT就是MMU的延伸版, 多做一次轉換從guest physical to host physical)
- MMU是用來做CPU virtual address space轉換到Main memory, IOMMU則是用來轉換device address space 到 Main memory
- IOMMU is not used when the CPU communicates with devices via I/O ports.
- Connect a DMA-capable I/O bus to the main memory.
- Intel VT-d (Virtualization Technology for Directed I/O)

IOMMU in relation to virtualization:
Guest OS would like to directly access the I/O to improve performance using DMA. If the guest OS can directly instruct the hardware from guest physical memory, it would likely to corrupt the memory. The corruption is avoided because of hypervior's intervention to apply translation. Unfortunately this delays the I/O operation.
An IOMMU can solve this problem by re-mapping the addresses accessed by the hardware according to the same (or a compatible) translation table [from wiki IOMMU]

From wiki: IOMMU

Reference:
1. http://www.cs.sunysb.edu/~porter/courses/cse506/f11/ref/i386/toc.htm
2. Linux Device Driver Programming 驅動程式設計
3. http://en.wikipedia.org/wiki/IOMMU

2011年9月24日 星期六

藥品

VapoRub
蠻神奇的藥膏, 功能是
1. cough suppresant
2. topical analgesic
可以治咳嗽和肌肉酸痛

2011年9月23日 星期五

JOS lab2

boot_alloc: 
simple memory allocator
只有在boot的時候使用, allocate physical memory, 之後就使用page_alloc

i386_vm_init 
( called by i386_init(void) )
1. 分配一塊空間給page directory, cr3指向此地址
2. 分配一塊空間給struct Page, 用來儲存physical page entry
3. page_init
4. map pages toUPAGES
5. map physical bootstack 到 virtual address上
6. 把從KERNBASE到4G map到physical address 0x00000000的地方
7. 因為一開始segmentation會把va = va - KERNBASE, 所以在開paging之前要設定pgdir[0]
access to address higher than KERNBASE will subtract KERNBASE by segmentation and map to pgdir[0];
- pgdir[0] = pgdir[PDX(KERNBASE)]
然後turn on paging, 把新的segment mapping(不做轉換)寫入, 然後才清空pgdir[0].

page_init
用一個linked list把所有的free physical page都link起來, 0-640K是可以用, 大於640K到1M, 1M到kernel end是不可以用.

page_alloc:
allocate a physical page, remove it from the page_free_list

page_free: 
return the page to the free list

page_insert: 
insert a mapping (virtual to physical) at page table. Need to call pgdir_walk.

page_lookup:
給一個virtual address, 回傳mapped到的physical page和pte

pgdir_walk:
給一個virtual address, 到page directory裡面去找出paget table entry.
如果已經存在pdt, 回傳現有的 
如果沒有pdt, allocate一個新的page table, 最後回傳page table entry.


boot_map_segment:
把virtual address map到physical address

2011年9月21日 星期三

Chapter 7: Process Scheduling

7.1 Scheduling Policy
- Based on time-sharing technique. CPU time is divided into slices, one for each runnable process.
- The scheduling policy is based on ranking process according to priority.
- Linux applies dynamic process priority assignment. If a process is denied for using CPU for a long time, its priority is increased dynamically. 

Linux 把process 區分為三種
- Interactive process: spend lots of time waiting for events, example: shell, text editor
- Batch process: no need to be responsive, no user interaction, run in background
- Real-time process: have very stringent scheduling requirements. ex: video and sound application

"Linux 2.6 scheduler implements a sophisticated heuristic algorithm based on the past behavior of the processes to decide whether a given process should be considered as interactive or batch. " [ref1]

7.1.1 Preemption
- 當Process在執行時(TASK_RUNNING), 如果priority比現在current的process較高, current process 會被preempt然後context switch.
- 第二種情形是quantum expires

- Determine the quantum duration is critical for system performance. If the average quantum is too short, the system overhead caused by context switching is high.
- Quantum太長不會影響interactive program的performance, 因為interactive program有比較高的priority, 會preempt current running process.
- The rule of thumb adopted by Linux is choose a duration as long as possible, while keeping good system response time.

7.2 The scheduling algorithm
- 每個processor都有個swapper (PID=0) , 當CPU沒有process可執行時, 執行swapper
- 每個linux process is always scheduled according to one of the scheduling classes:
1. SCHED_FIFO: FIFO real-time process. 如果沒有更高優先權的real-time process, 此process會持續使用CPU. (Leave the process descr. in its current position in the runqueue)
2. SCHED_RR: put the process descr. at the end of the runqueue. 當被higher priority process interrupt, 就會從後面開始
3. SCHED_NORMAL: conventional, time-shared process.

7.2.1 Scheduling of Conventional Process
一個process會有static priority跟dynamic priority
Static
- static priority: 100(high) - 139(low)
- use nice() or setpriority() system calls
- base quantum time會根據不同的static priority來作調整
- 較高的priority有較大的quantum time
- ex: priority: 100, nice value: -20, base time quantum: 800ms
Dynamic
- 100 - 139
- dynamic priority = max( 100, min (static priority - bonus + 5, 139))
- bonus會根據過去的紀錄來決定, related to the average sleep time of the process.
- bonus range between 0 -10, greater than 5 is a premium that raises the dynamic priority.
- ex: sleep less than 100 ms: bonus = 0, greater than 1 second: bonus = 10

Starvation:
較高priority的process可能造成較低的process無法使用CPU.
"when a process finishes its time quantum, it can be replaced by a lower priority process whose time quantum has not yet been exhausted"
The scheduler keeps two set of runnable processes:

Active process: process自已的quantum還沒用完.
Expired process: the runnable process have exhausted their time quantum and are thus forbidden to run until all active processes expire. quantum用完後就放到expire list
from ref1: figure 7-1

每一個CPU有一個runqueue, 裡面包含了所有的runnable processes.  而每一個process只能存在於一個runqueue. Struct runqueue有兩個pointer分別指到active processes跟expired processes. active 跟expire的pointer會週期性的互相調換.
schedule()的功能: 從runqueue之中選一個新的process來執行, 如果沒有就執行idle process (swapper). Kernel藉由觀察runqueue的大小來決決定一個CPU的覆載量. 如果不平均, 就需要load balance...

Runqueue Balancing
Multiprocessor machines有這三種:
1. SMP (Symmetric Multi Processing)指的是多顆CPU有相同的能力, Kernel在處理時不續要特別針對不同特性的CPU做特別的處理
2. Hyper-threading: CPU上可執行多個thread, CPU內有多組的registers互相交替使用並模擬出多個虛擬的CPU給kernel使用
3. NUMA (Non Uniform Memory Access): 所有CPU對RAM的存取並不是相同的cost, 而是某部份的RAM分給CPU當他的local memory, 因此不需要考慮其他CPU會始用到此區記憶體, 而省去了synchronization overhead.

Linux利用scheduling Domain的概念來keep the CPUs balanced by the kernel.

相關的System Calls
Scheduling:
1. nice(), getpriority(), setpriority(): 設定process的static base priority,  如果沒有root權限, 只能降低priority
2. sched_getaffinity(), sched_setaffinity():  設定此process要執行在那個CPU. cpu_allowed field in the process descriptor.

real-time process:
1. sched_getscheduler(), sched_setscheduler(): 回傳此process所使用的scheduling policy (FIFO, RR, or NORMAL)
2. sched_yield(): allow a process to relinquish the CPU voluntarily without being suspended.
3. sched_get_priority_min(), sched_get_priority_max():
4. sched_rr_get_interval():

Question:
1. 每個process有自己的scheduling policy?還是整個系統採用一種policy?

--- Appendix ---
from ref2
TASK_RUNNING:
means that a task is in a runnable state. It does not mean that a CPU is actually allocated. The task can wait until it is selected by the scheduler. This state guarantees that the process really is ready to run and is not waiting for an external event.
TASK_INTERRUPTIBLE:
is set for a sleeping process that is waiting for some event or other. When the kernel signals to the process that the event has occurred, it is placed in the TASK_RUNNING state and may resume execution as soon as it is selected by the scheduler.
TASK_UNINTERRUPTIBLE:
is used for sleeping processes disabled on the instructions of the kernel. They may not be woken by external signals, only by the kernel itself.
TASK_STOPPED:
indicates that the process was stopped on purpose — by a debugger, for example.

Reference:
1. Understanding the Linux Kernel, Chapter 7
2. Professional Linux Kernel Architecture

2011年9月19日 星期一

PCI-2

Questions:
1. How does PCI work at boot time?
2. How does a driver look for its device?
The driver uses (vendor id, device id, class, ...) to tell kernel the devices it supports. The struct pci_device_id is used to define a list of different types of PCI devices the driver supports.
3. How does a driver access the device's configuration space?

Register a PCI driver:
1. name: driver name, must be unique
2. id_table: all supported device, point to struct pci_device_id (一個driver可以support多個device, 列在此list之中)
3. int (*probe): "This function is called by the PCI core when it has a struct pci_dev that it thinks this driver wants to control" [ref1]
PCI core找到一個pci_dev, 呼叫driver, 檢查driver裡面的pci_device_id list. If the PCI driver claims the pci_dev that is passed to it, it should initialize the device properly and return 0.
4. void (*remove):
5. int (*suspend):
6. int (*resume):

Kernel PCI functions:
1. pci_get_device: scan the list of PCI devices in the system.
2. pci_get_subsys: allow subsystem vendor and subsystem device ID to be specified.
3. pci_get_slot: search the pci device on specific slot.
(all these functions can no be called from interrupt context)
4. pci_enable_device:Enabling the PCI device, setup interrupt line and I/O region

Accessing the configuration space:
當device driver已經偵測到device, driver就會去讀寫configuration space去找到這個device被map到那些memory or I/O space.
pci_read_config_byte
pci_read_config_word
pci_write_config_byte

Accessing the I/O and Memory space:
每個PCI device有六個I/O address regions.每個region可以是memory or I/O locations.
pci_resource_start
pci_resource_end
pci_resource_flags


>lspci | cut -d: -f1,2
[bus number, 8 bit]:[device number, 5 bit].[function, 3 bit]

00:00.0 Host bridge
00:01.0 PCI bridge
00:1c.0 PCI bridge
00:1c.4 PCI bridge
00:1c.5 PCI bridge
00:1d.0 USB Controller
00:1d.1 USB Controller
00:1d.2 USB Controller
00:1d.7 USB Controller
00:1e.0 PCI bridge
00:1f.0 ISA bridge
00:1f.1 IDE interface
00:1f.2 IDE interface
00:1f.3 SMBus
02:00.0 PCI bridge
03:02.0 Ethernet controller
04:00.0 Ethernet controller
06:05.0 VGA compatible controller


ecslgw@ecslgw-cewit:~$ cat /proc/bus/pci/devices  | cut -f1,2,3

Field 1 - Bus Dev Func
Field 2 - Vendor Id + Device Id 
Field 3 - Interrupt Line 
Field 4 - BAR 0 
and the rest of the BAR registers (0 - 5) after that.

0000 80862778 0               0
0008 80862779 df               0
00e0 808627d0 de               0
00e4 808627e0 dd               0
00e5 808627e2 dc               0
00e8 808627c8 14               0
00e9 808627c9 15               0
00ea 808627ca 16               0
0200 8086032c 0               0
0310 80861026 23        fe9e0004
0400 14e41659 10        fe6f0004
0628 18ca0020 0        fd000008


3. /lib/modules/2.6.39.4/modules.pcimap
The kernel search this file when hot-pluggable devices are found.

4. drivers/pci/search.c
/* For boot time work */
EXPORT_SYMBOL(pci_find_bus);
EXPORT_SYMBOL(pci_find_next_bus);
/* For everyone */
EXPORT_SYMBOL(pci_get_device);
EXPORT_SYMBOL(pci_get_subsys);
EXPORT_SYMBOL(pci_get_slot);
EXPORT_SYMBOL(pci_get_class);


Reference:
1. Linux Device Driver, Chapter 12

Writing and Presentation Guides

Writing and Presentation Guides

reference:  http://www.cs.sunysb.edu/~ezk/grad-res/index.html#writing

Strunk and White's Elements of Style
Encyclopedias, Dictionaries, and Thesauri
Checklist for Articles
How Not to Write a Paper
Mathematical Writing
Writing Up Your Final Project
Views and Feelings: A Word of Encouragement
A course on Writing and Oral Skills in Computer Science
Tips on Oral Presentations, an annual Usenet post by Henry Spencer.
The task of the referee, by Smith, A J., Computer 23(4): p. 65-71. Apr. 1990.
Collected Advice on Research and Writing
The Science of Scientfic Writing, an excellent article written by English professors for scientists (PDF).

http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mleone/web/how-to.html

2011年9月18日 星期日

PCI

PCI (Peripheral Component Interconnect)
Purpose: replace the ISA standard with three main goals:
1. better performance between peripherals, 2. platform independent, 3. simplify adding/removing devices.
PCI devices are jumperless (不懂?), and are automatically configured at boot time.

- Identification of a device
PCI定義三個參數來決定一個device:
1. bus number: device所屬的bus
2. device number: bus所指定的一個unique number, 每個bus最多可指定32個devices
3. function number: 多個device可以共用一個PCI slot, 利用function number來區分.
Much used in laptops are multifunction chipsets, which are attached to the PCI and integrate a whole range of expansions (IDE controller, USB controller, modem, network, etc.) in a minimum of space [ref2]
4. pci domain: Each PCI domain can host up to 256 buses.
每個bus可以有32 devices, 每個device可以有multifunction board, 最多八個function (identified by 16-bit address).

- Address space
PCI使用三個address space來跟device溝通
1. I/O space: described by 32 bits address
2. Data space (memory locations): 32 byte or 64 byte are available for data space, depending on the processor type
3. Configuration space: contains type and characteristics of the individual devices. 256 byte for each device function.

*每個pci bus使用同一個I/O space and data space, 每一個device有自己的configuration space.
*

from fig 6.1, ref1


PCI address space and configuration header
每一個PCI device都一個configuration header 存在他的address space, 系統根據PCI topology來決定此address space的位址 (tied to the physical PCI device location, which PCI slot it locates)
每一個PCI slot都有一個對應到的PCI configuration header 和offset. 所以系統可以從第一個offset去讀所有的offset來偵測PCI device, 並屆由vendor ID來判斷type.
From: ref2
Class code:  決定哪種device, ex: SCSI = 0x0100
BAR (Base Address Registers):  用來設定device可使用哪些PCI I/O and PCI memory
Interrupt Pin: 有四個pin可用來產生interrupt, ABCD.
Interrupt Line: 用來決定當interrupt發生時要執行那個interrupt handler.
六個Base address (32 bit each)用來定義PCI和其他系統溝通用的位址, 如果是64 bit, 那就要用兩個base addresses.

PCI I/O and PCI Memory address:
PCI device和driver溝通所使用的address space. 

PCI-ISA bridge:
用來轉換PCI I/O and memory address to ISA I/O and memory address for legacy ISA devices.

PCI-PCI bridge:
Devices that glue PCI buses of the system together. PCI bus 可以支援的device數量有限, 所以利用PCI-PCI bridge來連接多個devices. 利用兩種cycles: Type 0 and type 1來設定PCI-PCI bridges. 

"The PCI-PCI bridges must be programmed with a base and limit for PCI I/O and PCI Memory space access that they have to pass from their primary bus onto their secondary bus. Once the PCI-PCI Bridges in a system have been configured then so long as the Linux device drivers only access PCI I/O and PCI Memory space via these windows, the PCI-PCI Bridges are invisible." [ref1]
PCI-PCI bridge會被設定有一個memory map window, 讓device driver直接透過window去讀寫PCI I/O and memory space, 這樣讓系統感覺不到PCI-PCI bridge的存在.

PCI configuration cycle: A cycle is just an address as it appears on the PCI bus. 
Type 0: fig 6.3
沒有bus number, interpret as PCI configuration address. 

Type 1: fig 6.4
用來設定PCI-PCI bridge.
每個PCI-PCI bridge有三個值1) primary bus number: the upstream bus. 2) secondary bus number: the downstream bus. 3) subordinary bus number: the highest number of all buses that can be reached downstream of the bridge. 
如果PCI-PCI bridge收到type 1 frame. 
1. If bus number > subordinary number: ignore
2. If bus number == secondary number: 表示device在secondary bus上, convert to type 0
3. If subordinary > bus number > secondary: pass to the secondary interface.

Linux PCI Initialization:

Linux Kernel PCI data structure:
- each pci device is described by pci_dev struct
- the result is a tree topology from root to all leaf nodes
- pci_root points to pci_bus, pci_bus is PCI-PCI bridge as PCI bus can only be reached by PCI-PCI bridge.
Fig 6.5 from ref1:


Initialization可分成三部份: Device driver, BIOS, and Fixup


The PCI device driver:
- a function of the OS called at system initialization.
- the PCI init code scan all the PCI buses looking for PCI devices
- 利用PCI bios code去確認是否PCI slot上有裝置, if yes, setup the pci_dev.
- 每個pci bus存在struct pci_bus
PCI Device driver: pseudo-device driver searches the PCI system starting at BUS 0 and allocates all PCI devices and bridges. A PCI topology is built by a linked list of data structure. 接下來要做PCI-PCI bridge numbering, 用來偵測整個PCI topology和決定primary, secondary, and subordinary number for PCI-PCI bridge.

BIOS: 好像只是一些functions.
Fixup: system specific fixup code.

PCI Bar (From Wiki, PCI configuration space)



Reference:
ref1: http://tldp.org/LDP/tlk/dd/pci.html
ref2: The professional linux kernel
ref3: Linux Device Driver, chapter 12

2011年9月17日 星期六

食譜

1. 水餃
麵皮
中筋麵粉兩包,  加水後揉成麵粉團

內餡
絞肉 + 香油 + 薑 + 蔥 + 一點醬油後攪拌
最後跟韭菜拌在一起 (可以打一顆蛋)
也可以加一些鹽巴跟五香粉

2. 家常豆腐
老豆腐切片後沾但下去煎, 在拿出來放盤子
把香菇跟紅蘿蔔切片後炒一下, 把豆腐放進去, 加點香菇素蠔油
最後在加小黃瓜, 蔥, 和香菜

3. 肉燥
洋蔥一顆半, 絞肉, 五香粉, 油蔥
把洋蔥炒一炒後加入絞肉, 油蔥和醬油
炒熟後加水跟五香粉慢慢煮一小時
可加入水煮蛋

Page table, shadow and extended page table (Intel VT support)

Page table基本概念
Page table的功能是用來做virtual address to physical address轉換. Virtual address space是一個4G的連續記憶體位址, 專屬於某一個process, 但其所指到的實體記憶體其實是分散在不同physical memory 上, 甚至有可能不在physical memory裡面 (swap out). Modern OS利用Page table的功能提供了讓process之間有自己的連續記憶體區塊, 並提供process address space之間的isolation.

Page fault 
Page fault的發生有兩種可能, 第一個是當要去查詢某個virtual address時發現沒有此entry在page table裡面, 此時會發生segmentation fault.
第二種是查詢到的entry不在physical memory裡面, (page table entry裡面有個present bit = 0), 此時表示此區記憶體已經swap out, 必須要在被load進memory.

Paravirtualization 
需要修改guest OS code to call hypercall. 只有Linux能用 因為需要改OS
不然則需要利用binary translation (ex: windows), 直接修改binary, overhead較高 (called: full virtualization)
From Xen wiki about hypercall:
A hypercall is to a syscall what a hypervisor is to an OS. Alternatively, a hypercall is to a hypervisor what a syscall is to a kernel. A hypercall is a software trap from a domain to the hypervisor, just as a syscall is a software trap from an application to the kernel. Domains will use hypercalls to request privileged operations like updating pagetables.

Shadow page table
- guest virtual to host physical
當host要做context switching時, cr3會指向下一個page table. 當此情形發生在VM時, hypervisor會trap 此 instruction, 並將此cr3重新指向在host physical底下的page table. Hypervisor因此需要管理一個mapping, 從guest page table 轉到 host page table.
Host會將此page table設定成read-only, 當guest對pte (page table entry)做修改時, host 會trap此情況, 然後將修改後的直sync回guest page table.
因為修改cr3, trap instruction, synchronize需要較多的overhead, 因此有了EPT.

Extended page table
EPT是x86提供給虛擬畫技術的功能, 主要目的在加強MMU做VM和PM之間memory address的轉換("Extend" from guest physical to host physical). 當有EPT support時, VM的 physical address可以直接經由MMU轉換到PM的 physical address (因此不需要像shadow page table利用trap, 造成overhead). 此時cr3指到guest VM的 address, 因此Guest VM可以直接修改 page table( guest physical address) 然後MMU會直接寫到 host physical address.
Intel x86提供兩個instruction: vmentry and vmexit. 當vmentry時, MMU extended page table 功能被啓動, 因此VM可值接寫入host physical.

Reference:
http://en.wikipedia.org/wiki/Page_tables

好句分享

楊德昌說:
"我們拍片不要迎合觀眾,因為觀眾從來也不知道他們要什麼。但是你做出來以後,他會跟著去看"

賈柏斯說:

"Apple不做市場調查,他們只做自己想要用的東西"
"在真正的世界裡,好人不會擊敗壞人,壞人也不會擊敗好人,而是主動的人會擊敗懶惰的人"
"A good traveler has no fixed plans and is not intent on arriving." - Lao Tzu

2011年9月16日 星期五

x86 Segment Translation

Purpose
Virtual memory的目的是為了提供
1. abstraction of contiguous, isolated memory to a program.
2. prevent illegal operation

x86 processor modes

1. real mode: 20-bit address space (1MB), direct memory access, segmentation available
2. protected mode: 這時候就有segmentation and paging, 加上user / kernel space
3. long mode: 64-bit mode for x86-64
機器一開機後先進入real mode, boot loader轉成protected mode之後交給OS

x86 segmentation
a segment has: base address (linear addr), length, and type (code, data, etc)
a segment for: code, data, stack, and some extra...
一個program最多可以有6個segments, ex: CS, DS, SS, ES, FS, GS

利用不同的instruction type來推測現在是使用甚麼segment register
1. Control flow: CS
2. Stack management: SS
3. Most load/stores: DS
segment register有16bit, 包含index, 指向table裡面的entry.
所以當值行這些instruction時, processor自動去把相對映的segment selector load進來, 然後把base address, limit, type從Descriptor table中讀出 (see 5.1.4, ref2)


為何OS要定義segment?
1. memory protection (similar to paging)
2. 每個segment都有自己的permission, ex: CS: read-only
3. Process只能對特定的segment做特定的存取, ex: write to CS會造成segmentation fault!
4. provide location information in memory and flag to indicate whether it resides in memory.
MMU: 用來做virtual to physical memory位址轉換
reference: http://en.wikipedia.org/wiki/Memory_segmentation

Memory management register
"The processor provides four memory-management registers (GDTR, LDTR, IDTR, and TR) that specify the locations of the data structures which control segmented memory management" -- IA-32
x86 processor定義了四個register: gdtr, ldtr, idtr, and tr. 目的是為了方便存取這些table, 由這些table去拿到linear address.
GDT: any process can use these segments
LDT: segment definitions for a specific process

From: IA-32 chapter2
1. Exception發生: idtr -> trap gate -> LDT -> exception handler
2. Interrupt發生: IDT -> GDT-> interrupt handler
3. Process context switch: TR -> GDT -> TSS (把資料存下後 context switch)

Interrupt分為三種: Interrupt Gate, Trap Gate, and Task Gate
Interrupt發生時先根據IRQ number查IDT, 然後到GDT裡面拿到要執行code的segment. 之後就下instruction "CALL CS:IP"跳到handler
Gate通常指的是從一種privilege跳到另一個level of privilege.








 Reference:
1. CSE506 slides by Don Porter
2. http://www.cs.sunysb.edu/~porter/courses/cse506/f11/ref/i386/s05_01.htm

System Call

System call
- System call 也是一種interrupt, 早期是利用int 0x80來產生software interrupt
- the dispatch routine 就是interrupt handler, Linux check arch/x86/kernel/syscall_table_32.S去執行對應到的function, or interrupt handler.
-  interrupt 的index存在eax, 然後開始執行

有兩種方法
1. int 0x80, iret (把syscall當作一種interrupt, 但比較慢)
“The int assembly language instruction is inherently slow because it performs several consistency and security checks. (The instruction is described in detail in the section "Hardware Handling of Interrupts and Exceptions" in Chapter 4.)”

2. Intel Pentium II 之後支援 instruction: sysenter / sysexit (比較快 但不是每個processor都支援)
These instructions use MSRs (machine specific registers) to store: (多利用三個register去儲存)
- Syscall entry point and code segment
- Kernel stack
- Syscall return address

好處 :
1. Indeed faster than int instruction
2. Security arguments:
- Easier to sandbox a program (prevent illegal system calls) 
- Limits ability of a program to issue errant system calls

壞處: Programmer inconvenience
1. Can’t just drop an ‘int 0x80’ in my program anymore
2. Tighter contract between program and kernel
3. Also, not all x86 CPUs have this instruction

Reference: CSE506 slides interrupt.pdf

2011年9月15日 星期四

Interrupt 中斷

Interrupt
可分為兩種
1. Synchronous: 由內部驅動, 例如system call, divide by zero, 此類又叫做exception
2. Asynchronous: caused by external , 例如device I/O, time ticks

x86 interrupt可分為index 0-255
0-31: processor interrupt, or exception
ex: 14 for page fault
32-255: 由軟體控制
32-47: device interrupt (IRQ) in JOS
int 0x80 for system call interrupt
ex: 
IRQ=0, INT=32, Timer interrupt
IRQ=1, INT=33, Keyboard interrupt


Interrupt Procedure
當interrupt發生時, 控制權轉移至kernel的interrupt handler, CPU register儲存到kernel stack (TSS), 然後執行interrupt handler. 執行完後resume program (iret instruction)
HW要如何知道執行甚麼?
A. interrupt descriptor table (IDT) 在開機時必需設定好, 當interrupt發生時查此table然後執行
idtr: register pointing to the IDT, similar to gdtr, ldtr
entry 0 代表interrupt 0

Hardware Handling of Interrupts (4.2.4)
步驟如下
CS and EIP contain logical address of next instruction to be executed. 在執行之前會檢查是否發生interrupt, if yes, do the following:
1. determine the vector i (0 < i < 255)
2. 從IDT (idtr) 中讀出第i個entry, entry中包含了segment selector and offset
3. 從segment selector中去找GDT, 拿到base address. This descriptor specifies the base address of the segment that includes the interrupt or exception handler.
4,5,6. Makes sure the interrupt was issued by an authorized source. (就一些檢查, 先跳過)
7. Saves the contents of eflags , cs, and eip in the stack.
8. If the exception carries a hardware error code, it saves it on the stack.
9. 跳至interrupt handler:
Loads cs and eip, respectively, with the Segment Selector and the Offset fields of the Gate Descriptor stored in the i th entry of the IDT. These values define the logical address of the first instruction of the interrupt or exception handler.

Interrupt Descriptor
the fields in the interrupt descriptor includes:
1. code segment selector
2. segment offset
3. PrivilegeLevel(ring)
4. Present bit–disable unused interrupts
5. gate type

IRQ sharing (from chapter 4.6.1)
IRQ可以被多個devices共用:
The interrupt handler executes several interrupt service routines (ISRs). Each ISR is a function related to a single device sharing the IRQ line. Because it is not possible to know in advance which particular device issued the IRQ, each ISR is executed to verify whether its device needs attention; if so, the ISR performs all the operations that need to be executed when the device raises an interrupt.

Disable Interrupt
1. cli disable interrupt and sti enable it
2. clear the bit 9 of EFLAGS (IF flag)
Before touching a shared data structure (or grabbing a lock), an interrupt handler should disable I/O interrupts
當interrupt handler要拿一個lock時, 必須要disable interrupt. 如果拿到lock, 然後被另一個handler preempt, then deadlock.
“Interrupt handlers need to synchronize, both with locks (multi-processor) and by disabling interrupts (same CPU)” - ref2

Gate Type
在IDT裡面有兩種gate type
1. Interrupt: 必須要disable all other interrupt
2. Exception: 不必


TSS (Task Stack Segment)
The Task State Segment is a special structure on x86-based computers which holds information about a task. It is used by the operating system kernel for task management.  (from wiki)
TSS用來儲存關於task的資料, 例如
1. processor register state (當要做context switching時存下現狀)
2. I/O port permission ,...
暫存器TR(task register)儲存segment selector指向GDT, 內有valid TSS.
- Linux使用one TSS per CPU, and modify TSS fields when context switching

Protection
Interrupt提供一個界面讓user space可以進入kernel space, x86 processor必須提供保護kernel的機制已避免user space application bug造成kernel crashes. 兩種方法如下:
from ref3:
  1. The Interrupt Descriptor Table. The processor ensures that interrupts and exceptions can only cause the kernel to be entered at a few specific, well-defined entry-points determined by the kernel itself, and not by the code running when the interrupt or exception is taken.
    The x86 allows up to 256 different interrupt or exception entry points into the kernel, each with a different interrupt vector. A vector is a number between 0 and 255. An interrupt's vector is determined by the source of the interrupt: different devices, error conditions, and application requests to the kernel generate interrupts with different vectors. The CPU uses the vector as an index into the processor's interrupt descriptor table (IDT), which the kernel sets up in kernel-private memory, much like the GDT. From the appropriate entry in this table the processor loads:
    • the value to load into the instruction pointer (EIP) register, pointing to the kernel code designated to handle that type of exception.
    • the value to load into the code segment (CS) register, which includes in bits 0-1 the privilege level at which the exception handler is to run. (In JOS, all exceptions are handled in kernel mode, privilege level 0.)

  2. The Task State Segment. The processor needs a place to save the old processor state before the interrupt or exception occurred, such as the original values of EIP and CS before the processor invoked the exception handler, so that the exception handler can later restore that old state and resume the interrupted code from where it left off. But this save area for the old processor state must in turn be protected from unprivileged user-mode code; otherwise buggy or malicious user code could compromise the kernel.
    For this reason, when an x86 processor takes an interrupt or trap that causes a privilege level change from user to kernel mode, it also switches to a stack in the kernel's memory. A structure called the task state segment (TSS) specifies the segment selector and address where this stack lives. The processor pushes (on this new stack) SS, ESP, EFLAGS, CS, EIP, and an optional error code. Then it loads the CS and EIP from the interrupt descriptor, and sets the ESP and SS to refer to the new stack.
    Although the TSS is large and can potentially serve a variety of purposes, JOS only uses it to define the kernel stack that the processor should switch to when it transfers from user to kernel mode. Since "kernel mode" in JOS is privilege level 0 on the x86, the processor uses the ESP0 and SS0 fields of the TSS to define the kernel stack when entering kernel mode. JOS doesn't use any other TSS fields.
Interrupt Procedure in x86
1. Interrupt gate跟call gate的作用機制很像
2. IDT entry裡面有selector跟offset, selector指向GDT, 從GDT拿到base address.
3. 最後利用CALL CS:IP, CS=base, IP=offset去職行

from ref4.




Reference:
1. Understanding the Linux kernel
2. CSE506 course slide
3. CSE506 lab3, JOS lab3
4. 80386 Programmer's Reference Manual: http://www.cs.stonybrook.edu/~porter/courses/cse506/f11/ref/i386/c09.htm

PC開機要做的事情

PC的實體記憶體位址分配如下:
1. 早期的電腦只能定址到1MB (20bits), 而0 - 640KB 是給早期電腦的RAM使用, 稱為low memory
2. 384KB: 0x000A0000 through 0x000FFFFF: reserved for video and firmware held in non-volatile memory
3. Basic Input/Output System (BIOS), which occupies the 64KB region from 0x000F0000 through 0x000FFFFF. (960KB - 1MB)
4. BIOS會找到boot loader, 放到 0x7C00 - 0x7DFF

+------------------+  <- 0xFFFFFFFF (4GB)
|      32-bit      |
|  memory mapped   |
|     devices      |
|                  |
/\/\/\/\/\/\/\/\/\/\

/\/\/\/\/\/\/\/\/\/\
|                  |
|      Unused      |
|                  |
+------------------+  <- depends on amount of RAM
|                  |
|                  |
| Extended Memory  |  <- OS
|                  |
|                  |
+------------------+  <- 0x00100000 (1MB)
|     BIOS ROM     |
+------------------+  <- 0x000F0000 (960KB)
|  16-bit devices, |
|  expansion ROMs  |
+------------------+  <- 0x000C0000 (768KB)
|   VGA Display    |
+------------------+  <- 0x000A0000 (640KB)
|                  |
|    Low Memory    |
|                  |
+------------------+  <- 0x00000000
 
1. BIOS
開機後第一行執行的code
[f000:fff0] 0xffff0: ljmp   $0xf000,$0xe05b
[segment:OFFSET] linear addr: instruction 
在real mode 底下, segmentation的轉換方式為
linear address = 16 * segment + offset
16 * 0xf000 + 0xfff0   # in hex multiplication by 16 is
   = 0xf0000 + 0xfff0     # easy--just append a 0.
   = 0xffff0 
 
0xffff0落在960KB到1MB之間, 所以是BIOS, BIOS必須再往前跳一點, 在開始執行
BIOS會初始化interrupt table,和一些devices, 然後把OS從硬碟中讀出來 
之後就交給boot loader
The bios read the boot loader from disk and transfer control to it.

2. Boot Loader
boot loader的位置固定在0x7c00 through 0x7dff, 當開機後BIOS會去尋找可以用來開機的boot sector (512byte), 可能在disk上或其他地方, 找到後把boot sector load到0x7c00, 利用jmp instruction to set the CS:IP  = 0000: 7c00.

boot loader的主要兩個工作
1. switch the processor mode from real to protected mode.
2. 把kernel image從硬碟中讀出來 放到physical address 0x00100000 (1MB)的地方

用CR0來enable protected mode (boot/boot.S):
  lgdt    gdtdesc
  movl    %cr0, %eax
  orl     $CR0_PE_ON, %eax
  movl    %eax, %cr0
(lgdt: load GDT)

Real mode和Protected mode:
傳統16 bit 8086 CPU有16位元暫存器, 匯流排等等, 利用CS: IP 的方法來定址到1MB的記憶體, 物理位址 = CS * 16 + IP. 當轉換到protected mode時, CS不在代表一個address, 而是代表GDT裡面的index, 而GDT entry裡面有base, limit, 等資料結構 (Descriptor).
這就是為什麼上面那段code要先lgdt gdtdesc之後才可以切換到protected mode
See orange 3-10

3. Loading the kernel
OS通常linked and run在high virtual address such as 0xf0100000, 但有些機器無法提供如此高的memory address, 所以我們使用了processor的memory map把0xf0100000 map到0x00100000也就是RAM 1MB的地方 (就是在BIOS上方).



reference:
1. http://www.cs.sunysb.edu/~porter/courses/cse506/f11/lab1.html

Inline Assembly: 在C語言裡面嵌入assembly

Reference:
1. http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html
2. http://stenlyho.blogspot.com/2008/08/cinline-casmasm-code-asmmovl-1eaxnt.html
3. Linux device driver programming, Chapter 7-8

The Syntax
兩種不同的表示方法src和dst是相反的
AT&T: movl %eax, %ebx
Intel: mov ebx, eax

The Basic Inline Assembly
1. asm("nop")
2. push the register onto the stack.
asm (
  "pushl %eax\n\t"
  "movl $0, %eax\n\t"
  "popl %eax");
 eax的value放到stack, 在pop出來

3. inline assembly的通式
asm ( "statements" : output_registers : input_registers : clobbered_registers);
asm("組合語言指令碼"
      : 輸出暫存器
      : 輸入暫存器
      : 會更變的暫存器);

輸出暫存器語法, "="代表寫入專用, "+"代表讀寫兩用:
"=暫存器字源(輸出變數)"
"+暫存器字源(輸出變數)"

example from ref3:
u16 cx, ax, dx;
ams("int $0x11"
    : "+a" (ax), "=c" (cx), "=d", (dx)
    :
    : "ebx", "esi", "edi");

example from ref2 (自動分配暫存器):

int main (void) {
      int operand1, operand2, sum, accumulator;
   
      operand1 = rand (); operand2 = rand ();
     
      asm("movl %1, %0\n\t"
              "addl %2, %0"
        : "=r" (sum)   /* output operands */
        : "r" (operand1), "r" (operand2) /* input operands */
        : "0");    /* clobbered operands */
     
      accumulator = sum;
     
      asm("addl %1, %0\n\t"
        "addl %2, %0"
        : "=r" (accumulator)
        : "0" (accumulator), "g" (operand1), "r" (operand2)
        : "0");
      return accumulator;
}

- Note the use of = to specify an output register. You just have to do it that way.
- %0, %1, %2分別代表 "=r", "r" (operand1), 和 "r" (operand2), 順序是從output -> input
- clobber list = "0" 告訴gcc說%0的質已經被改變了 必須重新考慮合法性, "you can no longer count on the values you loaded into ecx or edi to be valid." This doesn't mean they will be reloaded for certain. This is the clobberlist.
- When you do the clobber list, you specify the registers as above with the %. If you write to a variable, you must include "memory" as one of The Clobbered.
- 如果statement裡真要指定register要多加個%變成%%eax

example:
asm ("leal (%0,%0,4), %0"
     : "=r" (x)
     : "0" (x) );
This also works, by the way:
asm ("leal (%%ebx,%%ebx,4), %%ebx"
     : "=b" (x)
     : "b" (x) );

--- list of register loading code ---
a        eax
b        ebx
c        ecx
d        edx
S        esi
D        edi
I        constant value (0 to 31)
q,r      dynamically allocated register (see below) 自動分配
g        eax, ebx, ecx, edx or variable in memory
A        eax and edx combined into a 64-bit integer (use long longs)
 
Example:
asm volatile("int %1\n"
 : "=a" (ret)
 : "i" (T_SYSCALL),
  "a" (num),
  "d" (a1),
  "c" (a2),
  "b" (a3),
  "D" (a4),
  "S" (a5)
 : "cc", "memory"); 
Here %1 means the 2nd parameter in the extended inline assembly code 
(and not the immediate value 1) which is T_SYSCALL (similarly %0
corresponds to the 1st parameter that is ret). 

Linux segmentation and paging

看了好久總是不太董linux的memory managemet....
From chapter 2. page 38
1. Logical address translation
- Logical address 經由segmentation轉成linear address
- Linear address 經由 paging 轉成physical address

2. Segment selector and GDT 
(用來做logical to linear 轉換)
一個logical address包含segment identifier (16bits) and offset (32bits)
segment identifier就是segment selector
Linux提供了segment registers:
CS: code segment, 指向program instruction
SS: stack segment
DS: data segment
...
Segment selector裡面有個13bit index, 指向Global Description Table (GDT)
每個segment description entry有8byte, 裡面有Base and Limit, 用來做address translation

3. Segment selector
分成兩部份visible and invisible. Visible: 16 bit, 前13bit為index指到GDT entry. 從GDT entry 拿到segment descriptor後會把descriptor存到invisible part給processor當cache.
"The processor automatically fetches the base address, limit, type, and other information from a descriptor table and loads them into the invisible part of the segment register." from 5.1.4 ref2


  
4. 80386系統其實有兩種memory organization 
reference: http://www.cs.sunysb.edu/~porter/courses/cse506/f11/ref/i386/s02_01.htm
- flat model: 一個process只有4G記憶體空間
- segmentation model: 每個process有64 Terabyte, 因為每個segment有14 bits (2^14) 
再加上32 bits的offset, 總共有46 bits, 64 Terabyte.

Extract from the ref:
- A segment selector, which is a 16-bit field that identifies a segment
- An offset, which is a 32-bit ordinal that addresses to the byte level within a segment.

5. Page Translation (Paging)
- Linear address包含三部份: DIR to index the page directory, Page to index the page table, OFFSET則指到最後4G address space的地方
- CR3存的是目前task的page table physical address
- CR0有個PG bit, set時paging才會enable
from ref2:

6. Page table entry
因為page table指到的是4K page boundary, 所以address的前12 bit都是零, Page table entry利用這12 bit來放一些page table的property.
Page table entry跟page directory entry的格式差不多相同.



Reference:1. Understanding the Linux Kernel
2. 80386 Programmer's Reference Manual, Chapter 5

Intel x86 Function-call Conventions

reference:
http://www.unixwiz.net/techtips/win32-callconv-asm.html
** the stack grows from high memory address to low memory addr **

When a function is called, the compiler generates assembly code by doing:
1. push parameters onto the stack

2. call the function
the processor push content of %EIP to the stack.
%EIP points to the next instruction (the instruction after CALL). So that after calling it can return to the caller.

The caller has lost control.

**The figure below grows from up to down.


3. Save and update %ebp
push %ebp
mov %esp, %ebp
The new base pointer (base frame) is from the %esp

4. save CPU register states

5. Allocate local variables

- Function Parameters are positive offsets from EBP (stack frame register)
- Local Variables occur as negative offsets from the EBP (stack frame pointer) register

At this point, the stack frame is set up correctly, and this is represented by the diagram to the right. All the parameters and locals are offsets from the %ebp register:
16(%ebp) - third function parameter
12(%ebp) - second function parameter
8(%ebp) - first function parameter
4(%ebp) - old %EIP (the function's "return address")
0(%ebp) - old %EBP (previous function's base pointer)
-4(%ebp) - first local variable
-8(%ebp) - second local variable
-12(%ebp) - third local variable

2011年9月9日 星期五

first article

恩今天開始要好好寫點東西
隨便寫甚麼都好
但要可以持續下去