2011年9月26日 星期一

Linux Memory Allocator

NUMA, Node, and Zone 
NUMA and Node
Linux 2.6之後支援NUMA (Non Uniform Memory Access), 記憶體的分配是有區域性的, 實體記憶體被切割成多個node. 每個CPU access屬於自己的memory node的時間相同, 若是不同的node則需要較多的時間.
每個node的相關資料存在struct pg_data_t

Zone and Region
每個node裡面在區分成多個zone, pg_data_t裡面含有node_zonelists and node_zones

因為硬體限制 P.299
1. DMA必須要access前16MB的address space
2. 32-bit系統最多只可以定址到4G
3. Kernel address space只有1G的空間 (在扣掉保留的128MB), 0xc0000000 - 0xffffffff

所以將Zone區分成三種
1. ZONE_DMA: below 16MB
2. ZONE_NORMAL: 16MB ~ 896MB
3. ZONE_HIGHMEM: > 896MB
"the ZONE_HIGHMEM zone includes page frames that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of linear address space" from ref1

Kernel address space 介於3G ~ 4G, 大於896MB的記憶體不值接linear map到kernel. 小於896MB可值接將physical memory address對應到linear address.

* 896MB = 1024 - 128MB, 比較高的128MB用來做Fixed Mapped Linear Address for HIGHMEM.

from ref1, fig 8-2: 每區的zone有自己的buddy system, buddy system底下有slab. 當使用ZONE_HIGHMEM時, 會始用highmem的buddy system.


解決highmem mapping的方法
1. permanent kernel mapping, 2. temporary kernel mapping, 3. noncontiguous memory allocation. (p. 306)

Buddy Allocator
Buddy algorithm:
- no external fragmentation
- worse case: waste less than half of the requested size
- kernel need to keep track of the existing blocks of free, contiguous page frames.
- 所有的free page frame被分配到11個list, 根據大小1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames.
- Zone allocator is the frontend of buddy system
- 在buddy allocator之上還有slab allocator

Buddy system每次以page為單位太大了, 所以有slab allocator來管理較小的memory request
- buddy allocator是以page為管理單位, 而slab是以cache為管理單位

Slab allocator
- the default cache allocator
- "kernel creates 13 geometrically distributed lists of free memory areas whose sizes range from 32 to 131, 072 bytes" from ref1
- The slab allocator groups objects into caches . Each cache is a "store" of objects of the same type
- 缺點: Similar basic ideas, but substantially more complex bookkeeping.
- 缺點: 2 groups upset 1. Users of very small system (ex: watch) 2. Users of large multi-processor systems
- example: Slab再不同CPU之間溝通需要較高的overhead.
For very large (thousands of CPU) systems, complex allocator bookkeeping gets out of hand. slabs try to migrate objects from one CPU to another to avoid synchronization. (ref2)

SLOB allocator
- 給小的系統使用,例如一些嵌入式系統, overhead較小
- Simple List of Blocks
- Keep a free list of available chunk and its size
- Grab the first one big enough chunk to work
- No internal fragmentation
- Has external fragmentation but trade with low overhead

SLUB allocator
- SLAB有bookkeeping overhead, 因為不同CPU之間需要協調
- The Unqueue Slab Allocator
- All objects of same size from same slab
- simple free list of per slab
- no cross-CPU nonsense
- Does better than SLAB in many cases


from ref4, fig 8-1


Cache line alignment:
1. cache line: 64 ~ 128 bytes on most CPUs
2. word: 32 bits or 64 bits
3. cache line是memory cache的基本單位


Reference:
1. Understanding the Linux Kernel, chapter 8
2. Slides from cse506, Don Porter
3. False Sharing:
Performance overhead under multiprocessor cache system, see http://en.wikipedia.org/wiki/False_sharing
4. Linux device driver programming

沒有留言:

張貼留言