memory hierarchy

Memory Hierarchy (Heechul Yun)

Topics: Introduction to Real-Time Systems, CPS; CPS Applications; Real-time architecture/OS; Fault tolerance, safety, security


  1. Why is DRAM Important? • Why do we need bigger and faster memory? • Data-intensive computing – Bigger, more complex applications – Large, high-bandwidth data processing – DRAM is often a performance bottleneck

  2. Why is DRAM Important? • Parallelism – Out-of-order core • A single core can generate many memory requests – Multicore • Multiple cores share DRAM – Accelerator • GPU

  3. Memory Performance Isolation • Q. How to guarantee predictable memory performance? [Figure: Core1 to Core4 run Part 1 to Part 4, each with an LLC block, and share the Memory Controller and DRAM.]

  4. Memory System Architecture [Figure: Core 0 to Core 3, each with its own L2 cache (L2 Cache 0 to 3), a shared L3 cache, and a DRAM memory controller that reaches the DRAM banks through the DRAM interface.] This slide is from Prof. Onur Mutlu

  5. DRAM Organization • Channel • Rank • Chip • Bank • Row • Row/Column

  6. The DRAM subsystem [Figure: a processor connected to DIMMs (dual in-line memory modules) over two memory channels; the link is labeled "Channel".] This slide is from Prof. Onur Mutlu

  7. Breaking down a DIMM [Figure: side view of a DIMM (dual in-line memory module), showing the front of the DIMM and the back of the DIMM.] This slide is from Prof. Onur Mutlu

  8. Breaking down a DIMM [Figure: side view of the DIMM. Front of DIMM: Rank 0, a collection of 8 chips. Back of DIMM: Rank 1.] This slide is from Prof. Onur Mutlu

  9. Rank [Figure: Rank 0 (front) and Rank 1 (back), each 64 bits wide (<0:63>), share the memory channel signals Addr/Cmd, CS <0:1>, and Data <0:63>.] This slide is from Prof. Onur Mutlu

  10. Breaking down a Rank [Figure: Rank 0 consists of Chip 0, Chip 1, ..., Chip 7. Chip 0 drives data bits <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63>; together they form Data <0:63>.] This slide is from Prof. Onur Mutlu

  11. Breaking down a Chip [Figure: Chip 0 contains multiple banks (Bank 0, ...); each bank connects to the chip's 8-bit interface <0:7>.] This slide is from Prof. Onur Mutlu

  12. Breaking down a Bank [Figure: Bank 0 holds row 0 to row 16k-1; each row is 2kB and each column is 1B. The row buffer holds one row and delivers 1B columns over <0:7>.] This slide is from Prof. Onur Mutlu

  13. Example: Transferring a cache block [Figure: the physical memory space (0x00 to 0xFFFF…F) maps onto Channel 0, DIMM 0, Rank 0; a 64B cache block lies between 0x00 and 0x40.] This slide is from Prof. Onur Mutlu

  14. Example: Transferring a cache block [Figure: the 64B cache block maps onto Rank 0, Chip 0 to Chip 7. Chip 0 drives <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63> of Data <0:63>.] This slide is from Prof. Onur Mutlu

  15. Example: Transferring a cache block [Figure: as above, with Row 0, Col 0 selected in every chip of Rank 0.] This slide is from Prof. Onur Mutlu

  16. Example: Transferring a cache block [Figure: Row 0, Col 0 is read; the chips together return 8B over Data <0:63>, which fills the first 8B of the cache block.] This slide is from Prof. Onur Mutlu

  17. Example: Transferring a cache block [Figure: as above, with Row 0, Col 1 selected next.] This slide is from Prof. Onur Mutlu

  18. Example: Transferring a cache block [Figure: Row 0, Col 1 is read; another 8B is returned over Data <0:63> and fills the next 8B of the cache block.] This slide is from Prof. Onur Mutlu

  19. Example: Transferring a cache block • A 64B cache block takes 8 I/O cycles to transfer. • During the process, 8 columns are read sequentially. This slide is from Prof. Onur Mutlu
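
The count of 8 I/O cycles follows from the rank geometry on the earlier slides: 8 chips, each driving 8 data bits, give 8B per transfer. A minimal Python sketch of that arithmetic is given here; the function names are illustrative and do not come from the slides.

```python
# Sketch of the cache-block transfer arithmetic from the slides.
CHIPS_PER_RANK = 8       # slides: a rank is a collection of 8 chips
BITS_PER_CHIP = 8        # slides: each chip drives 8 data bits (<0:7>, <8:15>, ...)
CACHE_BLOCK_BYTES = 64   # slides: 64B cache block


def bytes_per_transfer(chips=CHIPS_PER_RANK, bits_per_chip=BITS_PER_CHIP):
    """Bytes moved over the data bus in one I/O cycle."""
    return chips * bits_per_chip // 8


def transfers_per_block(block_bytes=CACHE_BLOCK_BYTES):
    """Number of I/O cycles needed to move one cache block."""
    return block_bytes // bytes_per_transfer()


def column_schedule(block_bytes=CACHE_BLOCK_BYTES, row=0):
    """(row, column, first_byte, last_byte) for each sequential column read."""
    step = bytes_per_transfer()
    return [(row, col, col * step, col * step + step - 1)
            for col in range(transfers_per_block(block_bytes))]


if __name__ == "__main__":
    print(bytes_per_transfer(), transfers_per_block())
    for entry in column_schedule():
        print(entry)
```

Each column read touches the same row in all 8 chips, so the 8 sequential reads stay within Row 0 of the bank.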

  20. DRAM Organization • A DRAM DIMM has multiple banks • Different banks can be accessed in parallel [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4.]

  21. Best-case • Peak = 10.6 GB/s – DDR3-1333 (1333 MT/s) [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Fast".]

  22. Best-case • Peak = 10.6 GB/s – DDR3-1333 (1333 MT/s) • Out-of-order processors [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Fast".]
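
The 10.6 GB/s peak can be reproduced with a short calculation, assuming the slide's 1333 denotes mega-transfers per second on a single 64-bit (8B) channel. The helper name below is illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, bus_width_bits=64):
    """Peak channel bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # 1333 MT/s * 8 B per transfer, which the slide quotes as 10.6 GB/s
    print(round(peak_bandwidth_gb_s(1333), 3))
```

This is a bus-level upper bound; it is reached only when every data-bus cycle carries data.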

  23. Most-cases • Performance = ?? [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Mess".]

  24. Worst-case • 1 bank b/w – Less than peak b/w – How much? [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Slow".]

  25. DRAM Chip • State-dependent access latency – Row miss: 19 cycles, Row hit: 9 cycles (*) • (*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting) [Figure: a chip with Banks 1 to 4. READ (Bank 1, Row 3, Col 7): precharge the open row, activate Row 3 into the row buffer, then read/write Col 7.]
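
A minimal sketch of the state-dependent latency, using the 9-cycle and 19-cycle figures quoted on the slide. The single-bank model and the choice to charge the first access (no row open yet) as a miss are simplifying assumptions, not something the slide states.

```python
ROW_HIT_CYCLES = 9    # slide: row hit
ROW_MISS_CYCLES = 19  # slide: row miss


class Bank:
    """One DRAM bank with a single row buffer."""

    def __init__(self):
        self.open_row = None  # no row in the row buffer yet

    def access(self, row):
        """Return the latency of accessing `row` and update the row buffer."""
        hit = self.open_row == row
        self.open_row = row
        # Assumption: an access to an idle bank is charged as a miss.
        return ROW_HIT_CYCLES if hit else ROW_MISS_CYCLES


def trace_latency(rows):
    """Per-access latency for a sequence of row numbers sent to one bank."""
    bank = Bank()
    return [bank.access(r) for r in rows]


if __name__ == "__main__":
    print(trace_latency([3, 3, 5, 3]))
```

The same request (Row 3) costs 9 or 19 cycles depending on what was accessed before it, which is what makes the latency state dependent.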

  26. DDR3 Timing Parameters [Table from Kim et al., “Bounding Memory Interference Delay in COTS-based Multi-Core Systems,” RTAS’14]

  27. DRAM Controller • Service DRAM requests (from CPU) while obeying timing/resource constraints – Translate requests to DRAM command sequences – Timing constraints: e.g., minimum write-to-read delay, activation time, … – Resource conflicts: bank, bus, channel • Maximize performance – Buffering, reordering, pipelining in scheduling requests

  28. DRAM Controller [Figure: Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk”, Fig 13.1.] • Request queue – Buffer read/write requests from CPU cores – Unpredictable queuing delay due to reordering

  29. Request Reordering • Initial queue (2 row switches): Core1: READ Row 1, Col 1; Core2: READ Row 2, Col 1; Core1: READ Row 1, Col 2 • Reordered queue (1 row switch): Core1: READ Row 1, Col 1; Core1: READ Row 1, Col 2; Core2: READ Row 2, Col 1 • Improve row hit ratio and throughput • Unpredictable queuing delay
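
The reordering example can be replayed with a small sketch. It assumes a single bank and a scheduler that serves the oldest request hitting the currently open row first and otherwise falls back to the oldest request; real controllers handle many banks and additional timing constraints. All names are illustrative.

```python
from collections import namedtuple

Request = namedtuple("Request", ["core", "row", "col"])


def count_row_switches(queue):
    """Number of times consecutive requests target different rows."""
    return sum(1 for prev, cur in zip(queue, queue[1:]) if cur.row != prev.row)


def reorder_row_hits_first(queue):
    """Serve the oldest row-hit request if one exists, else the oldest request."""
    pending = list(queue)
    served = []
    open_row = None
    while pending:
        nxt = next((r for r in pending if r.row == open_row), pending[0])
        pending.remove(nxt)
        served.append(nxt)
        open_row = nxt.row
    return served


if __name__ == "__main__":
    initial = [Request(1, 1, 1), Request(2, 2, 1), Request(1, 1, 2)]
    reordered = reorder_row_hits_first(initial)
    print(count_row_switches(initial), count_row_switches(reordered))
    print(reordered)
```

Core2's request is served last even though it arrived second, which illustrates the unpredictable queuing delay mentioned on the slide.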

  30. Row Management Policy • Open row – Keep the row open after an access • If next access targets the same row: CAS • If next access targets a different row: PRE + ACT + CAS • Close row – Close the row after an access • Always pay the same (longer) cost: ACT + CAS • Adaptive policies
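
The two policies can be compared by listing the commands that sit on each access's critical path. This sketch assumes one bank, assumes that under the close-row policy the precharge happens right after the access and is therefore not charged to the next one, and treats the first open-row access to an idle bank as ACT + CAS. The function name is illustrative.

```python
def critical_path_commands(policy, rows):
    """Commands each access must wait for, under an 'open' or 'close' row policy."""
    open_row = None
    result = []
    for row in rows:
        if policy == "open":
            if open_row == row:
                result.append(["CAS"])                # row hit
            elif open_row is None:
                result.append(["ACT", "CAS"])         # idle bank, nothing to close
            else:
                result.append(["PRE", "ACT", "CAS"])  # row conflict
            open_row = row
        elif policy == "close":
            result.append(["ACT", "CAS"])             # row was closed after last access
        else:
            raise ValueError("policy must be 'open' or 'close'")
    return result


if __name__ == "__main__":
    rows = [1, 1, 2]
    print(critical_path_commands("open", rows))
    print(critical_path_commands("close", rows))
```

Open row wins when accesses have row locality and loses on conflicts; close row gives a constant cost, which is easier to bound.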

  31. COTS DRAM Controller • FR-FCFS scheduling (out-of-order) • Separate read/write buffers – High/low watermark based switching • Reads prioritized over writes

  32. PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms. Heechul Yun*, Renato Mancuso+, Zheng-Pei Wu#, Rodolfo Pellizzoni# (*University of Kansas, +University of Illinois, #University of Waterloo)

  33. Background: DRAM Organization • A DRAM DIMM has multiple banks • Different banks can be accessed in parallel [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4.]

  34. Best-case • Peak = 10.6 GB/s – DDR3-1333 (1333 MT/s) [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Fast".]

  35. Best-case • Peak = 10.6 GB/s – DDR3-1333 (1333 MT/s) • Out-of-order processors [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Fast".]

  36. Most-cases • Performance = ?? [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Mess".]

  37. Worst-case • 1 bank b/w – Less than peak b/w – How much? [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; labeled "Slow".]

  38. Timing Diagram • Row hits are fast but can still suffer interference at the bus • Row misses are slow but can overlap when they target different banks. Source: Heechul Yun, Rodolfo Pellizzoni, Prathap Kumar Valsan. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems. Euromicro Conference on Real-Time Systems (ECRTS), 2015.

  39. Problem • The OS does NOT know about DRAM banks • OS memory pages are spread all over multiple banks • Result: unpredictable memory performance [Figure: an SMP OS on Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4; the page-to-bank placement is marked "????".]

  40. PALLOC • The OS is aware of the DRAM mapping • Each page can be allocated to a desired DRAM bank • Result: flexible allocation policy [Figure: an SMP OS on Core1 to Core4, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4.]

  41. PALLOC • Private banking – Allocate pages on certain exclusively assigned banks • Result: eliminates inter-core bank conflicts [Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Banks 1 to 4.]

  42. Challenges • Finding the DRAM address mapping • Modifying Linux’s memory allocation mechanism

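  43. Identifying Memory Mapping • Intel Xeon 3530 + 4GiB DDR3 DIMM (16 banks) [Figure: physical address bits 31 to 0 with boundaries marked at bits 21, 19, 14, 12, and 6; fields labeled cache-sets, banks, banks.] • Freescale P4080 + 2x2 GiB DDR3 DIMM (32 banks) [Figure: physical address bits 31 to 0 with boundaries marked at bits 18, 16, 14, 7, and 6; fields labeled cache-sets, banks, channel.] • Memory mappings are platform specific • We developed a software detection tool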
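
Once the bank bits are known, the bank of any physical address is a matter of bit extraction. The sketch below takes the bit positions as a parameter; the example positions (12, 13, 19, 20) are only one possible reading of the Xeon 3530 figure and should be treated as an assumption, as should the helper names.

```python
def bank_index(phys_addr, bank_bits):
    """Pack the selected physical-address bits into a bank number.

    bank_bits[0] becomes the least significant bit of the result.
    """
    idx = 0
    for i, bit in enumerate(bank_bits):
        idx |= ((phys_addr >> bit) & 1) << i
    return idx


# Assumption: an illustrative reading of the slide's 16-bank layout.
EXAMPLE_BANK_BITS = (12, 13, 19, 20)

if __name__ == "__main__":
    for addr in (0x0000, 0x1000, 0x2000, 0x80000, 0x183000):
        print(hex(addr), bank_index(addr, EXAMPLE_BANK_BITS))
```

With bank bits at position 12 and above, every 4KB page lies entirely within one bank, which is what allows a page-granularity allocator to control bank placement.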

  44. DRAM Address Map Detector [Figure labels: Slow, Fast] https://github.com/heechul/palloc/blob/master/README-map-detector.md

  45. More Recent Solution https://github.com/IAIK/drama

  46. Complex Addressing

  47. Implementation • Modified Linux kernel’s buddy allocator – DRAM bank-aware page frame allocation at each page fault
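
The idea of bank-aware page frame allocation can be illustrated with a toy model. This is not the PALLOC kernel code: it is a sketch that scans a free list for the first page frame whose bank is in the allowed set, and the `bank_of` mapping (page frame number modulo 16) is purely hypothetical.

```python
def alloc_page(free_pages, allowed_banks, bank_of):
    """Pop and return the first free page frame whose bank is allowed, else None."""
    for i, pfn in enumerate(free_pages):
        if bank_of(pfn) in allowed_banks:
            return free_pages.pop(i)
    return None


if __name__ == "__main__":
    # Hypothetical mapping: 16 banks, selected by the low bits of the frame number.
    def bank_of(pfn):
        return pfn % 16

    free_pages = list(range(64))
    core0_banks = {0, 1, 2, 3}
    print([alloc_page(free_pages, core0_banks, bank_of) for _ in range(6)])
```

In this model a page fault from a task restricted to banks 0 to 3 only ever receives frames from those banks, so its memory traffic stays off the other banks.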

  48. Background • How Linux allocates memory

  49. User-level Memory Allocation • When does a process actually allocate memory from the kernel? – On a page fault – Allocate a page (e.g., 4KB) • What does malloc() do? – Doesn’t physically allocate pages – Manages a process’s heap – Variable size objects in heap

  50. Kernel-level Memory Allocation • Page-level allocator (low-level) – Page granularity (4K) – Buddy allocator • Other kernel-memory allocators – Support fine-grained allocations – Slab, kmalloc, vmalloc allocators

  51. Kernel-Level Memory Allocators [Figure: layered stack. Top: kernel code. Below it: vmalloc ((large) non-physically contiguous memory) and kmalloc (arbitrary size objects). Below those: SLAB allocator (multiple fixed-sized object caches). Bottom: page allocator (buddy), which allocates power-of-two pages: 4K, 8K, 16K, …]

  52. Buddy Allocator • Linux’s page-level allocator • Allocates a power-of-two number of pages: 1, 2, 4, 8, … pages • Requests are rounded up to the next highest power of 2 • When a smaller allocation is needed than is available, the current chunk is split into two buddies of the next-lower power of 2 • Quickly expand/shrink across the lists [Figure: free lists for 32KB, 16KB, 8KB, 4KB]

  53. Buddy Allocator • Example – Assume a 256KB chunk is available and the kernel requests 21KB • [256 Free] → [128 Free | 128 Free] → [64 Free | 64 Free | 128 Free] → [32 Free | 32 Free | 64 Free | 128 Free] → [32 A | 32 Free | 64 Free | 128 Free]

  54. Buddy Allocator • Example – Free A • [32 A | 32 Free | 64 Free | 128 Free] → [32 Free | 32 Free | 64 Free | 128 Free] → [64 Free | 64 Free | 128 Free] → [128 Free | 128 Free] → [256 Free]
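
The split and merge sequences in the example can be reproduced with a small sketch. It models only the leftmost chunk being split and then merged back, tracks block sizes in KB rather than real addresses, and assumes every other block is free; it is not how the kernel's free lists are implemented. Function names are illustrative.

```python
def round_up_pow2(n):
    """Smallest power of two that is >= n."""
    size = 1
    while size < n:
        size *= 2
    return size


def buddy_alloc_trace(chunk_kb, request_kb, min_kb=4):
    """Split the leftmost block until it matches the rounded-up request.

    Returns (allocated_size_kb, list of block-size lists after each split).
    """
    target = max(round_up_pow2(request_kb), min_kb)
    if target > chunk_kb:
        raise ValueError("request larger than available chunk")
    blocks = [chunk_kb]
    trace = [list(blocks)]
    while blocks[0] > target:
        half = blocks[0] // 2
        blocks[0:1] = [half, half]  # split into two buddies
        trace.append(list(blocks))
    return target, trace


def buddy_free_trace(blocks):
    """Merge the leftmost block with its equal-sized free buddy, repeatedly."""
    blocks = list(blocks)
    trace = [list(blocks)]
    while len(blocks) > 1 and blocks[0] == blocks[1]:
        blocks[0:2] = [blocks[0] * 2]  # coalesce two buddies
        trace.append(list(blocks))
    return trace


if __name__ == "__main__":
    size, splits = buddy_alloc_trace(256, 21)
    print(size, splits)
    print(buddy_free_trace(splits[-1]))
```

The 21KB request is rounded up to 32KB, so 11KB of the allocated block is unused (internal fragmentation).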

  55. PALLOC in Linux [Figure: the same allocator stack: kernel code, vmalloc ((large) non-physically contiguous memory), kmalloc (arbitrary size objects), SLAB allocator (multiple fixed-sized object caches), with PALLOC at the page allocator level.]

  56. Simplified Pseudocode

  57. PALLOC Interface • Example: per-core private banking (PB)
      # cd /sys/fs/cgroup
      # mkdir core0 core1 core2 core3  → create 4 cgroup partitions
      # echo 0-3 > core0/palloc.dram_bank  → assign banks 0 to 3 to the core0 partition
      # echo 4-7 > core1/palloc.dram_bank
      # echo 8-11 > core2/palloc.dram_bank
      # echo 12-15 > core3/palloc.dram_bank
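
The per-core bank ranges in this example follow a simple even split. A hedged sketch that derives the same ranges and prints the matching `echo` lines is shown here; the helper names are illustrative, and only the `palloc.dram_bank` file name and the `coreN` directory names come from the slide.

```python
def private_bank_ranges(num_cores, num_banks):
    """Evenly split banks among cores; returns {'core0': '0-3', ...}."""
    if num_cores <= 0 or num_banks % num_cores != 0:
        raise ValueError("banks must divide evenly among cores")
    per_core = num_banks // num_cores
    return {
        f"core{c}": f"{c * per_core}-{c * per_core + per_core - 1}"
        for c in range(num_cores)
    }


def palloc_commands(ranges):
    """Shell lines that would write each range to its cgroup's palloc.dram_bank."""
    return [f"echo {banks} > {name}/palloc.dram_bank" for name, banks in ranges.items()]


if __name__ == "__main__":
    for line in palloc_commands(private_bank_ranges(4, 16)):
        print(line)
```

The script only prints the commands; running them still requires a PALLOC-enabled kernel and the cgroup directories created as on the slide.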

  58. Evaluation Platforms • Platform #1: Intel Xeon 3530 – x86-64, 4 cores, 8MB shared L3 cache – 1 x 4GB DDR3 DRAM module (16 banks) – Modified Linux 3.6.0 • Platform #2: Freescale P4080 – PowerPC, 8 cores, 2MB shared LLC – 2 x 2GB DDR3 DRAM modules (32 banks) – Modified Linux 3.0.6

  59. Samebank vs. Diffbank • Samebank: All cores → Bank0 • Diffbank: Core0 → Bank0, Core1-3 → Bank 1-3 – Zero interference !!!

  60. Real-Time Performance • Setup: HRT → Core0, X-server → Core1 • Buddy: no bank control (uses all of Bank 0-15) • Diffbank: Core0 → Bank0-7, Core1 → Bank8-15 [Figure: panels labeled Buddy (solo), PALLOC (diffbank), Buddy.]

  61. SPEC2006 • Uses 15 high-medium memory intensive benchmarks
