

  1. Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies. Po-An Tsai, Changping Chen, and Daniel Sanchez

  2. Die-stacking has enabled near-data processing

  3. Die-stacking has enabled near-data processing. Conventional multicore processors use a multi-level, deep cache hierarchy to reduce data movement. [Figure: cores, private caches, and a shared LLC]

  4. Die-stacking has enabled near-data processing. Conventional multicore processors use a multi-level, deep cache hierarchy to reduce data movement. Near-data processors place cores close to main memory to reduce data movement. [Figure: processor die with cores, private caches, and a shared LLC; stacked DRAM dies over a logic layer with vault controllers and an NDP core that has a private cache only (a shallow hierarchy)]

  5. Die-stacking has enabled near-data processing. Conventional multicore processors use a multi-level, deep cache hierarchy to reduce data movement. Near-data processors place cores close to main memory to reduce data movement. [Figure: processor die with cores, private caches, and a shared LLC; stacked DRAM dies over a logic layer with vault controllers and an NDP core that has a private cache only (a shallow hierarchy)] Neither shallow nor deep hierarchies work well for all applications…

  6. Asymmetric hierarchies get the best of both worlds

  8. Asymmetric hierarchies get the best of both worlds. Prior work proposes hybrid systems with asymmetric memory hierarchies to get the best of both [Ahn et al., ISCA’15] [Gao et al., PACT’15] [Hsieh et al., ISCA’16] [Boroumand et al., ASPLOS’18]

  9. Applications have strong hierarchy preferences

  10. Applications have strong hierarchy preferences. [Chart: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss]

  11. Applications have strong hierarchy preferences. [Charts: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss; normalized performance/J of milc on the deep and shallow hierarchies]

  12. Applications have strong hierarchy preferences. [Charts: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss; normalized performance/J of milc and of xalanc on the deep and shallow hierarchies]

  13. Applications have strong hierarchy preferences. [Charts: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss; normalized performance/J of milc and of xalanc on the deep and shallow hierarchies] How well each application can use the shared LLC is critical to its preference.

  14. Scheduling programs to the right hierarchy is hard

  15. Scheduling programs to the right hierarchy is hard. [Chart: performance/J of gems over time] Many applications prefer different hierarchies over time because they have different phases.

  16. Scheduling programs to the right hierarchy is hard. [Charts: performance/J of gems over time; normalized performance/J of xalanc on the shallow hierarchy and on deep hierarchies with 2MB, 4MB, 8MB, and 16MB LLCs] Many applications prefer different hierarchies over time because they have different phases. Applications may also prefer different hierarchies due to resource contention with other applications.

  17. Prior schedulers focus on different systems and constraints

  18. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2]

  19. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2] Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12] [Cong, ISLPED’11]) focuses on asymmetric core microarchitectures (big.LITTLE systems). [Figure: OoO cores and in-order cores]

  20. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2] Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12] [Cong, ISLPED’11]) focuses on asymmetric core microarchitectures (big.LITTLE systems). [Figure: OoO cores and in-order cores] NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16]) focuses on single workloads and requires software modifications or compiler support.

  21. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2] Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12] [Cong, ISLPED’11]) focuses on asymmetric core microarchitectures (big.LITTLE systems). [Figure: OoO cores and in-order cores] NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16]) focuses on single workloads and requires software modifications or compiler support. By contrast, our goal is to schedule threads considering both memory and core asymmetries, with no program modifications and transparently to users.

  22. AMS: An asymmetry-aware scheduler. [Diagram: hardware utility monitors sample accesses and produce miss curves (misses vs. cache size); software uses these curves to schedule threads] First contribution: an analytical model that estimates performance under different hierarchies. Second contribution: two thread placement algorithms (AMS-Greedy/AMS-DP) that extend techniques originally designed for cache partitioning.
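To make the flow on this slide concrete, here is a minimal Python sketch of the per-interval scheduling loop. All names here (sample_miss_curve, estimate_latencies, place_threads, the thread dictionaries) are hypothetical stand-ins for the hardware utility monitors, the analytical model, and the AMS-Greedy/AMS-DP placers; only the overall structure comes from the slide.

```python
# Hypothetical sketch of the AMS control loop: hardware monitors produce
# miss curves, a software model scores hierarchies, a placer maps threads.

LLC_SIZES_MB = [2, 4, 6, 8]  # candidate LLC allocations tracked by monitors

def sample_miss_curve(thread):
    """Stand-in for a hardware utility monitor: misses per interval
    at each candidate LLC allocation (non-increasing with capacity)."""
    return {size: thread["misses"][size] for size in LLC_SIZES_MB}

def schedule_interval(threads, estimate_latencies, place_threads):
    # 1. Hardware: sample accesses and produce per-thread miss curves.
    curves = {t["name"]: sample_miss_curve(t) for t in threads}
    # 2. Software (first contribution): the analytical model turns each
    #    miss curve into estimated latency under each hierarchy.
    scores = {name: estimate_latencies(curve) for name, curve in curves.items()}
    # 3. Software (second contribution): AMS-Greedy or AMS-DP places
    #    threads on processor-die vs. NDP cores using those scores.
    return place_threads(scores)
```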

  23. AMS analytical model

  24. AMS analytical model. AMS estimates application preferences using total memory access latency.

  25. AMS analytical model. AMS estimates application preferences using total memory access latency. [Chart: miss curve from hardware monitors, # misses vs. LLC capacity (MB)]

  26. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem). [Chart: miss curve from hardware monitors, # misses vs. LLC capacity (MB)]

  27. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. [Chart: miss curve from hardware monitors, # misses vs. LLC capacity (MB)]

  28. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. [Charts: miss curve from hardware monitors (# misses vs. LLC capacity in MB); latency curve model for the processor-die core (latency vs. LLC capacity in MB)]

  29. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. The shallow hierarchy has no shared LLC: Lat = # accesses x Latency of shallow mem. [Charts: miss curve from hardware monitors (# misses vs. LLC capacity in MB); latency curve model with a flat curve for the NDP core and a decreasing curve for the processor-die core]

  30. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. The shallow hierarchy has no shared LLC: Lat = # accesses x Latency of shallow mem. [Charts: miss curve from hardware monitors (# misses vs. LLC capacity in MB); latency curve model with a flat curve for the NDP core and a decreasing curve for the processor-die core; where the processor-die curve is lower, use a processor-die core, otherwise use an NDP core]
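The two latency formulas above are simple enough to compute directly from a miss curve. The sketch below does so in Python; the latency constants and the example miss curve are made-up illustrative numbers, not values from the talk.

```python
# Illustrative latencies (ns); the real values depend on the system.
LLC_HIT_NS     = 20   # assumed shared-LLC hit latency (deep hierarchy)
DEEP_MEM_NS    = 70   # assumed main-memory latency behind the deep hierarchy
SHALLOW_MEM_NS = 35   # assumed latency of near-data (shallow) memory

def deep_latency(accesses, miss_curve):
    """Deep hierarchy: Lat = accesses * Lat_LLC + misses(size) * Lat_deepMem.
    Returns a latency curve, one total per candidate LLC allocation."""
    return {size: accesses * LLC_HIT_NS + misses * DEEP_MEM_NS
            for size, misses in miss_curve.items()}

def shallow_latency(accesses):
    """Shallow hierarchy: no shared LLC, so Lat = accesses * Lat_shallowMem,
    a flat curve independent of LLC capacity."""
    return accesses * SHALLOW_MEM_NS

# Example: a cache-friendly thread whose misses drop quickly with capacity.
miss_curve = {2: 900, 4: 300, 6: 120, 8: 50}   # misses per 1000 accesses
deep = deep_latency(1000, miss_curve)           # {2: 83000, ..., 8: 23500}
shallow = shallow_latency(1000)                 # 35000
# With an 8MB share, the deep hierarchy wins: prefer a processor-die core.
print("deep" if deep[8] < shallow else "shallow")
```

A streaming thread with a flat, high miss curve would flip the comparison and prefer the shallow hierarchy, matching the crossover shown on the slide.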

  31. Handling heterogeneous cores. Combine the model from prior work (PIE) with our memory latency model.

  32. Handling heterogeneous cores. Combine the model from prior work (PIE) with our memory latency model. [Chart: memory latency curves for the NDP core and the processor-die core vs. LLC capacity (MB)]

  33. Handling heterogeneous cores. Combine the model from prior work (PIE) with our memory latency model. [Charts: memory latency curves for the NDP core and the processor-die core vs. LLC capacity (MB), weighted by MLP to produce memory stall curves (memory stalls vs. LLC capacity)]
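One simple way to realize the "weigh by MLP" step on this slide is to divide the total latency at each LLC capacity by the thread's measured memory-level parallelism, since overlapped misses hide a proportional share of the latency. The sketch below assumes exactly that reading; the MLP value and latency curves are illustrative, and the PIE-style core model that the talk combines this with is not reproduced here.

```python
def stall_curve(latency_curve, mlp):
    """Weigh a memory latency curve by MLP: with roughly `mlp` misses
    outstanding at once, the core stalls for about latency / mlp."""
    return {size: lat / mlp for size, lat in latency_curve.items()}

# Illustrative curves (ns per interval), e.g. from the previous sketch:
deep_lat = {2: 83000, 4: 41000, 6: 28400, 8: 23500}  # processor-die core
ndp_lat  = {size: 35000 for size in deep_lat}        # NDP core: flat curve

# A thread with MLP = 2 hides half of each curve's latency in stalls.
print(stall_curve(deep_lat, mlp=2.0))
print(stall_curve(ndp_lat, mlp=2.0))
```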
