Why is DRAM Important?
• Why do we need bigger and faster memory?
• Data-intensive computing
  – Bigger, more complex applications
  – Large, high-bandwidth data processing
  – DRAM is often a performance bottleneck

Why is DRAM Important?
• Parallelism
  – Out-of-order core
    • A single core can generate many memory requests
  – Multicore
    • Multiple cores share DRAM
  – Accelerator
    • GPU
Memory Performance Isolation
[Figure: Core1-Core4, each with an LLC partition (Part 1-Part 4), above a shared Memory Controller and DRAM]
• Q. How can we guarantee predictable memory performance?
Memory System Architecture
[Figure: CORE 0-CORE 3, each with its own L2 cache (L2 CACHE 0-L2 CACHE 3), a SHARED L3 CACHE, the DRAM MEMORY CONTROLLER, the DRAM INTERFACE, and the DRAM BANKS]
This slide is from Prof. Onur Mutlu
DRAM Organization
• Channel
• Rank
• Chip
• Bank
• Row
• Row/Column
The DRAM subsystem
[Figure: the processor connects to DIMMs (dual in-line memory modules) through memory channels; each such connection is a "channel"]
These DRAM organization slides are from Prof. Onur Mutlu.

Breaking down a DIMM
[Figure: side view, front, and back of a DIMM (dual in-line memory module)]
• Rank 0: collection of 8 chips (front of the DIMM)
• Rank 1: the chips on the back of the DIMM

Rank
[Figure: Rank 0 (front) and Rank 1 (back) share the memory channel: Addr/Cmd, CS <0:1>, and Data <0:63>]

Breaking down a Rank
[Figure: Rank 0 consists of Chip 0 to Chip 7. Chip 0 drives Data <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63>; together they form Data <0:63>]

Breaking down a Chip
[Figure: Chip 0 contains multiple banks (Bank 0, ...), all connected to the chip's 8-bit data interface <0:7>]

Breaking down a Bank
[Figure: Bank 0 is an array of rows, row 0 to row 16k-1. Each row is 2kB wide and each column is 1B. Data moves through the row buffer, 1B at a time, over <0:7>]
Example: Transferring a cache block
[Figure: the physical memory space (0x00 to 0xFFFF…F) maps onto Channel 0, DIMM 0, Rank 0. The 64B cache block between 0x00 and 0x40 is spread across Chip 0 to Chip 7, which drive Data <0:7>, <8:15>, ..., <56:63>]
• Row 0, Col 0: the 8 chips together supply 8B on Data <0:63>
• Row 0, Col 1: the next 8B are supplied, and so on
• A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.
This slide is from Prof. Onur Mutlu
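The arithmetic behind the "8 I/O cycles" figure can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the byte-to-chip assignment (byte 0 of each 8B beat comes from Chip 0) is an assumption based on the Data <0:7> ... <56:63> labels, not something the slides state.

```python
# Sketch of the cache-block transfer arithmetic from the slides.
# Assumption (not stated on the slides): within each 8B beat,
# byte i is supplied by chip i.

CHIPS_PER_RANK = 8        # "Rank 0: collection of 8 chips"
BITS_PER_CHIP = 8         # each chip drives 8 data bits, e.g. <0:7>
CACHE_BLOCK_BYTES = 64    # 64B cache block


def bytes_per_io_cycle():
    """Bytes the whole rank moves in one I/O cycle (one column read)."""
    return CHIPS_PER_RANK * BITS_PER_CHIP // 8


def io_cycles_for_block(block_bytes=CACHE_BLOCK_BYTES):
    """Number of I/O cycles (columns) needed for one cache block."""
    return block_bytes // bytes_per_io_cycle()


def locate_byte(byte_offset):
    """Return (column, chip) for a byte offset inside the cache block."""
    beat = bytes_per_io_cycle()
    return byte_offset // beat, byte_offset % beat


if __name__ == "__main__":
    print(bytes_per_io_cycle(), "bytes per I/O cycle")
    print(io_cycles_for_block(), "I/O cycles per 64B block")
```
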
DRAM Organization
[Figure: Core1-Core4 share an L3 cache and a Memory Controller (MC), which drives a DRAM DIMM with Bank 1-Bank 4]
• A DRAM DIMM has multiple banks
• Different banks can be accessed in parallel
Best-case
[Figure: Core1-Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank 1-Bank 4; labeled "Fast"]
• Peak = 10.6 GB/s
  – DDR3-1333 (1333 MT/s on a 64-bit data bus)
• Out-of-order processors
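The peak figure follows from the transfer rate and the bus width. A minimal sketch, assuming a single 64-bit channel (the Data <0:63> bus shown earlier) and a nominal 1333 MT/s; the function name is illustrative:

```python
# Peak DRAM bandwidth = transfers per second x bytes per transfer.
# Assumption: one 64-bit channel at a nominal 1333 MT/s.

def peak_bandwidth_gb_s(mega_transfers_per_sec, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # Roughly the 10.6 GB/s quoted on the slide.
    print(round(peak_bandwidth_gb_s(1333), 3), "GB/s")
```
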
Most-cases
[Figure: Core1-Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank 1-Bank 4; labeled "Mess"]
• Performance = ??

Worst-case
[Figure: Core1-Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank 1-Bank 4; labeled "Slow"]
• 1-bank bandwidth
  – Less than peak bandwidth
  – How much?
DRAM Chip
[Figure: a DRAM chip with Bank 1-Bank 4. For READ (Bank 1, Row 3, Col 7), the bank is precharged, Row 3 is activated into the row buffer, and Col 7 is read from the row buffer]
• State-dependent access latency
  – Row miss: 19 cycles, Row hit: 9 cycles (*)
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
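The slide does not break down the 19-cycle and 9-cycle figures. One decomposition that reproduces them, offered only as a hedged sketch, treats each "5" as the precharge, activate, and column-access latency, and assumes the data burst takes 4 cycles (8 transfers at double data rate). The parameter names and the 4-cycle burst are assumptions, not taken from the slide.

```python
# One possible breakdown of the row-hit / row-miss latencies.
# Assumptions (not from the slide): t_rp = t_rcd = cl = 5 cycles,
# and a 4-cycle data burst (8 transfers, two per clock).

def read_latency_cycles(row_hit, t_rp=5, t_rcd=5, cl=5, burst_cycles=4):
    """Cycles from issuing the first command until the burst completes."""
    if row_hit:
        # Row already in the row buffer: column access + data burst.
        return cl + burst_cycles
    # Different row open: precharge, activate, column access, data burst.
    return t_rp + t_rcd + cl + burst_cycles


if __name__ == "__main__":
    print("row hit :", read_latency_cycles(True), "cycles")
    print("row miss:", read_latency_cycles(False), "cycles")
```
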
DDR3 Timing Parameters
Kim et al., “Bounding Memory Interference Delay in COTS-based Multi-Core Systems,” RTAS’14
DRAM Controller
• Service DRAM requests (from the CPU) while obeying timing/resource constraints
  – Translate requests to DRAM command sequences
  – Timing constraints: e.g., minimum write-to-read delay, activation time, …
  – Resource conflicts: bank, bus, channel
• Maximize performance
  – Buffering, reordering, pipelining in scheduling requests

DRAM Controller
[Figure: Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk,” Fig. 13.1]
• Request queue
  – Buffers read/write requests from CPU cores
  – Unpredictable queuing delay due to reordering
Request Reordering
• Initial queue (2 row switches)
  – Core1: READ Row 1, Col 1
  – Core2: READ Row 2, Col 1
  – Core1: READ Row 1, Col 2
• Reordered queue (1 row switch)
  – Core1: READ Row 1, Col 1
  – Core1: READ Row 1, Col 2
  – Core2: READ Row 2, Col 1
• Improves row hit ratio and throughput
• Unpredictable queuing delay
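The effect of reordering on row switches can be reproduced with a small sketch. The scheduler below is a simplified row-hit-first policy for a single bank (serve the oldest request that hits the open row, otherwise the oldest request); it is written for illustration and is not the logic of any particular controller. `Request`, `count_row_switches`, and `reorder_row_hit_first` are hypothetical names.

```python
from collections import namedtuple

# A read request to one bank: issuing core, target row, target column.
Request = namedtuple("Request", ["core", "row", "col"])


def count_row_switches(queue):
    """Count how often the open row changes while serving queue in order.

    The first activation (no row open yet) is not counted as a switch.
    """
    switches = 0
    open_row = None
    for req in queue:
        if open_row is not None and req.row != open_row:
            switches += 1
        open_row = req.row
    return switches


def reorder_row_hit_first(queue):
    """Serve the oldest row-hit request first, else the oldest request."""
    pending = list(queue)
    served = []
    open_row = None
    while pending:
        hit = next((r for r in pending if r.row == open_row), None)
        chosen = hit if hit is not None else pending[0]
        pending.remove(chosen)
        served.append(chosen)
        open_row = chosen.row
    return served


if __name__ == "__main__":
    initial = [
        Request("Core1", 1, 1),
        Request("Core2", 2, 1),
        Request("Core1", 1, 2),
    ]
    reordered = reorder_row_hit_first(initial)
    print("initial  :", count_row_switches(initial), "row switches")
    print("reordered:", count_row_switches(reordered), "row switches")
```

Note that Core2's request is served last even though it arrived second, which is the source of the unpredictable queuing delay.
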
Row Management Policy
• Open row
  – Keep the row open after an access
    • If the next access targets the same row: CAS
    • If the next access targets a different row: PRE + ACT + CAS
• Close row
  – Close the row after an access
    • Always pay the same (longer) cost: ACT + CAS
• Adaptive policies
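The two policies can be compared by listing the commands each access has to wait for. The sketch below models a single bank; it assumes the bank starts precharged (so the very first access under the open-row policy needs only ACT + CAS) and that the close-row precharge happens after the access, off the critical path. Both are simplifying assumptions, and `commands_on_critical_path` is a hypothetical helper.

```python
# Commands on the critical path of each access to one bank,
# under the open-row and close-row policies.
# Assumption: the bank starts in the precharged (no row open) state.

def commands_on_critical_path(policy, rows):
    """Return, per access, the DRAM commands it must wait for."""
    result = []
    open_row = None
    for row in rows:
        if policy == "open":
            if open_row == row:
                result.append(["CAS"])                # row hit
            elif open_row is None:
                result.append(["ACT", "CAS"])         # bank was precharged
            else:
                result.append(["PRE", "ACT", "CAS"])  # row conflict
            open_row = row                            # row stays open
        elif policy == "close":
            # Row is closed right after every access, so each access
            # finds the bank precharged and pays ACT + CAS.
            result.append(["ACT", "CAS"])
        else:
            raise ValueError("unknown policy: " + policy)
    return result


if __name__ == "__main__":
    accesses = [1, 1, 2]  # target rows of three consecutive accesses
    for policy in ("open", "close"):
        print(policy, commands_on_critical_path(policy, accesses))
```
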
COTS DRAM Controller
• FR-FCFS scheduling (out-of-order)
• Separate read/write buffers
  – High/low watermark-based switching
• Reads prioritized over writes
PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms
Heechul Yun*, Renato Mancuso+, Zheng-Pei Wu#, Rodolfo Pellizzoni#
*University of Kansas, +University of Illinois, #University of Waterloo
Background: DRAM Organization
[Figure: Core1-Core4 share an L3 cache and a Memory Controller (MC), which drives a DRAM DIMM with Bank 1-Bank 4]
• A DRAM DIMM has multiple banks
• Different banks can be accessed in parallel
Timing Diagram
• Row hits are fast but can still suffer interference at the bus
• Row misses are slow but can overlap when they target different banks
Heechul Yun, Rodolfo Pellizzoni, Prathap Kumar Valsan. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems. Euromicro Conference on Real-Time Systems (ECRTS), 2015.
Problem
[Figure: an SMP OS runs on Core1-Core4, which share an L3 and a Memory Controller (MC); the placement of pages on Bank 1-Bank 4 of the DRAM DIMM is unknown ("????")]
• The OS does NOT know about DRAM banks
• OS memory pages are spread all over multiple banks
• Result: unpredictable memory performance

PALLOC
[Figure: an SMP OS runs on Core1-Core4 above the Memory Controller (MC) and a DRAM DIMM with Bank 1-Bank 4]
• The OS is aware of the DRAM mapping
• Each page can be allocated to a desired DRAM bank
• Result: flexible allocation policy

PALLOC
[Figure: Core1-Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank 1-Bank 4]
• Private banking
  – Allocate pages on certain exclusively assigned banks
• Result: eliminates inter-core bank conflicts
Challenges
• Finding the DRAM address mapping
• Modifying Linux’s memory allocation mechanism
Identifying Memory Mapping
[Figure: physical address bit layouts for two platforms]
  – Intel Xeon 3530 + 4GiB DDR3 DIMM (16 banks): bit positions marked 31, 21, 19, 14, 12, 6, 0; fields labeled cache-sets, banks, banks
  – Freescale P4080 + 2x2 GiB DDR3 DIMM (32 banks): bit positions marked 31, 18, 16, 14, 7, 6, 0; fields labeled cache-sets, banks, channel
• Memory mappings are platform specific
• We developed a detection tool
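Once the bank bits of a platform are known, the bank of any physical address can be computed by gathering those bits. The sketch below shows the idea; the bit positions in `EXAMPLE_BANK_BITS` are placeholders chosen for illustration and are not a claim about either platform above. Real mappings may also XOR several address bits together, which this sketch does not model.

```python
# Extract a DRAM bank index from a physical address, given the
# address bits that select the bank.
# Assumption: EXAMPLE_BANK_BITS is a made-up 16-bank mapping.

EXAMPLE_BANK_BITS = (12, 13, 19, 20)  # hypothetical; 4 bits -> 16 banks
PAGE_SHIFT = 12                       # 4KB pages


def bank_index(phys_addr, bank_bits=EXAMPLE_BANK_BITS):
    """Gather the selected address bits into a compact bank number."""
    index = 0
    for position, bit in enumerate(bank_bits):
        index |= ((phys_addr >> bit) & 1) << position
    return index


def page_stays_in_one_bank(bank_bits=EXAMPLE_BANK_BITS):
    """True if no bank bit lies inside the page offset.

    In that case every byte of a page maps to the same bank, so a
    page-granularity allocator can choose the bank per page.
    """
    return all(bit >= PAGE_SHIFT for bit in bank_bits)


if __name__ == "__main__":
    for addr in (0x0000, 0x1000, 0x2000, 0x80000):
        print(hex(addr), "-> bank", bank_index(addr))
```
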
DRAM Address Map Detector
[Figure labels: Slow, Fast]
https://github.com/heechul/palloc/blob/master/README-map-detector.md

More Recent Solution
https://github.com/IAIK/drama

Complex Addressing

Implementation
• Modified the Linux kernel’s buddy allocator
  – DRAM bank-aware page frame allocation at each page fault

Background
• On how Linux allocates memory
User-level Memory Allocation
• When does a process actually allocate memory from the kernel?
  – On a page fault
  – Allocate a page (e.g., 4KB)
• What does malloc() do?
  – Doesn’t physically allocate pages
  – Manages a process’s heap
  – Variable-size objects in the heap

Kernel-level Memory Allocation
• Page-level allocator (low-level)
  – Page granularity (4K)
  – Buddy allocator
• Other kernel-memory allocators
  – Support fine-grained allocations
  – Slab, kmalloc, vmalloc allocators

Kernel-Level Memory Allocators
[Figure: allocator layers used by kernel code, top to bottom]
• vmalloc: (large) non-physically contiguous memory
• kmalloc: arbitrary size objects
• SLAB allocator: multiple fixed-sized object caches
• Page allocator (buddy): allocates a power of two pages: 4K, 8K, 16K, …
Buddy Allocator
• Linux’s page-level allocator
• Allocates a power-of-two number of pages: 1, 2, 4, 8, … pages
• Requests are rounded up to the next power of 2
• When a smaller allocation is needed than is available, the current chunk is split into two buddies of the next-lower power of 2
• Quickly expand/shrink across the lists (32KB, 16KB, 8KB, 4KB)

Buddy Allocator
• Example
  – Assume a 256KB chunk is available and the kernel requests 21KB
  – 256 (free)
  – 128 (free) | 128 (free)
  – 64 (free) | 64 (free) | 128 (free)
  – 32 (free) | 32 (free) | 64 (free) | 128 (free)
  – 32 (A) | 32 (free) | 64 (free) | 128 (free)

Buddy Allocator
• Example
  – 32 (A) | 32 (free) | 64 (free) | 128 (free)
  – Free A
  – 32 (free) | 32 (free) | 64 (free) | 128 (free)
  – 64 (free) | 64 (free) | 128 (free)
  – 128 (free) | 128 (free)
  – 256 (free)
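The split-and-merge behavior in the example can be reproduced with a toy buddy allocator. This is a teaching sketch in user-space Python, not the Linux implementation: sizes are in KB, the class and method names are made up, and there is no locking or per-order list management beyond a dictionary.

```python
# Toy buddy allocator reproducing the 256KB / 21KB example.
# Sizes are in KB; names are illustrative, not Linux kernel APIs.

class BuddyAllocator:
    def __init__(self, total_kb, min_kb=4):
        self.total_kb = total_kb
        self.min_kb = min_kb
        self.free = {total_kb: [0]}  # block size -> free block offsets

    def _round_up(self, kb):
        """Round a request up to the next power-of-two block size."""
        size = self.min_kb
        while size < kb:
            size *= 2
        return size

    def alloc(self, kb):
        """Return (offset, size) of the allocated block, or None."""
        size = self._round_up(kb)
        cand = size
        # Find the smallest free block that is large enough.
        while cand <= self.total_kb and not self.free.get(cand):
            cand *= 2
        if cand > self.total_kb:
            return None
        offset = self.free[cand].pop()
        # Split repeatedly; the upper half of each split stays free.
        while cand > size:
            cand //= 2
            self.free.setdefault(cand, []).append(offset + cand)
        return offset, size

    def free_block(self, offset, size):
        """Free a block and merge with its buddy while possible."""
        while size < self.total_kb:
            buddy = offset ^ size  # buddy differs only in the size bit
            peers = self.free.get(size, [])
            if buddy not in peers:
                break
            peers.remove(buddy)
            offset = min(offset, buddy)
            size *= 2
        self.free.setdefault(size, []).append(offset)

    def free_sizes(self):
        """Sorted sizes of all free blocks."""
        return sorted(s for s, offs in self.free.items() for _ in offs)


if __name__ == "__main__":
    buddy = BuddyAllocator(256)
    block_a = buddy.alloc(21)          # rounded up to 32KB
    print("A =", block_a, "free:", buddy.free_sizes())
    buddy.free_block(*block_a)         # buddies merge back
    print("after free:", buddy.free_sizes())
```
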
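PALLOC in Linux
[Figure: the same allocator layers used by kernel code, with PALLOC in place of the page (buddy) allocator]
• vmalloc: (large) non-physically contiguous memory
• kmalloc: arbitrary size objects
• SLAB allocator: multiple fixed-sized object caches
• PALLOC

Simplified Pseudocode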
PALLOC Interface
• Example: per-core private banking (PB)
  # cd /sys/fs/cgroup
  # mkdir core0 core1 core2 core3          (create 4 cgroup partitions)
  # echo 0-3 > core0/palloc.dram_bank      (assign banks 0 ~ 3 to the core0 partition)
  # echo 4-7 > core1/palloc.dram_bank
  # echo 8-11 > core2/palloc.dram_bank
  # echo 12-15 > core3/palloc.dram_bank
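The bank ranges in the example follow a simple even split of the banks among the cores. The sketch below only computes those ranges and prints the corresponding commands; it does not touch the cgroup filesystem. The helper name is hypothetical, and the `palloc.dram_bank` file name and `coreN` directory names are copied from the example above.

```python
# Compute an even private-banking split: each core gets a contiguous,
# exclusive range of banks. Only generates the strings; it does not
# write to /sys/fs/cgroup.

def private_bank_ranges(num_banks, num_cores):
    """Map each partition name to its bank range, e.g. 'core0' -> '0-3'."""
    if num_cores <= 0 or num_banks % num_cores != 0:
        raise ValueError("banks must divide evenly among cores")
    per_core = num_banks // num_cores
    return {
        "core%d" % core: "%d-%d" % (core * per_core, (core + 1) * per_core - 1)
        for core in range(num_cores)
    }


if __name__ == "__main__":
    for name, banks in private_bank_ranges(16, 4).items():
        print("echo %s > %s/palloc.dram_bank" % (banks, name))
```
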
Evaluation Platforms
• Platform #1: Intel Xeon 3530
  – x86-64, 4 cores, 8MB shared L3 cache
  – 1 x 4GB DDR3 DRAM module (16 banks)
  – Modified Linux 3.6.0
• Platform #2: Freescale P4080
  – PowerPC, 8 cores, 2MB shared LLC
  – 2 x 2GB DDR3 DRAM modules (32 banks)
  – Modified Linux 3.0.6

Samebank vs. Diffbank
• Samebank: all cores use Bank0
• Diffbank: Core0 uses Bank0, Core1-3 use Bank 1-3
  – Zero interference!

Real-Time Performance
[Figure panels: Buddy (solo), PALLOC (diffbank), Buddy]
• Setup: HRT on Core0, X-server on Core1
• Buddy: no bank control (uses all banks, Bank 0-15)
• Diffbank: Core0 uses Bank0-7, Core1 uses Bank8-15

SPEC2006
• Uses 15 high-to-medium memory-intensive benchmarks