Parallel Programming and Heterogeneous Computing: Non-Uniform Memory Access


  1. Parallel Programming and Heterogeneous Computing: Non-Uniform Memory Access
     Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
     Operating Systems and Middleware Group

  2. Recap: Optimization Goals
     ■ Decrease Latency: process a single workload faster (= speedup)
     ■ Increase Throughput: process more workloads in the same time
       ➢ Both are performance metrics
     ■ Scalability: make best use of additional resources
       □ Scale Up: utilize additional resources on a machine
       □ Scale Out: utilize resources on additional machines
     ■ Cost/Energy Efficiency:
       □ minimize cost/energy requirements for given performance objectives
       □ alternatively: maximize performance for given cost/energy budget
     ■ Utilization: minimize idle time (= waste) of available resources
     ■ Precision Tradeoffs: trade performance for precision of results
     ParProg 2020 B4, Non-Uniform Memory Access, Felix Eberhardt

  3. Non-Uniform Memory Access: Context: Scalability
     ■ Two basic approaches to scaling computing hardware:
       □ Scale-Up: combine more resources (memory or cores) in a tightly coupled system
         ➢ User perceives a single large shared-memory system
     [Figure: machine diagram]

  4. Non-Uniform Memory Access: Context: Scalability
     ■ Two basic approaches to scaling computing hardware:
       □ Scale-Out: connect more machines in a loosely coupled network
         ➢ User perceives multiple communicating machines in a shared-nothing system
     [Figure: machine diagram]

  5. Non-Uniform Memory Access: Context: Scalability
     ■ Recent coherent interconnect technologies enable hybrid systems with both scale-up and scale-out characteristics:
       □ Example: Gen-Z strives to connect an entire datacenter of machines coherently
         ➢ User perceives a shared-memory system, but with the performance characteristics (communication latency and bandwidth) of a shared-nothing system
     [Figure: machine diagram]

  6. Non-Uniform Memory Access: Context: Uniform Memory Access Machines
     Multiple sockets access main memory through a shared interconnect.
     Latency and bandwidth characteristics are equal for any pair of socket and memory location.
     [Figure, built up in five animation steps (Charts 6.1 to 6.5): Socket0 to Socket3 with four cores each (C00 to C33) reach the memory modules through one shared interconnect and memory controller; the final step adds the label "Contention" at the interconnect]
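
Why the shared interconnect becomes a point of contention can be illustrated with a back-of-the-envelope model. This is only a sketch under simplifying assumptions that are not from the slides: all active sockets stream from memory at the same time, the interconnect splits its bandwidth evenly, and the figure of 100 GB/s is a made-up example value.

```python
def uma_bandwidth_per_socket(total_bandwidth_gbs: float, active_sockets: int) -> float:
    """Fair share of one shared interconnect among the active sockets.

    Assumption (not from the slides): bandwidth is divided evenly.
    """
    if active_sockets < 1:
        raise ValueError("need at least one active socket")
    return total_bandwidth_gbs / active_sockets


if __name__ == "__main__":
    # 100 GB/s is a hypothetical interconnect bandwidth.
    for sockets in (1, 2, 4):
        share = uma_bandwidth_per_socket(100.0, sockets)
        print(f"{sockets} active socket(s): {share} GB/s each")
```

In this model the share per socket shrinks linearly with the number of active sockets, which is the scaling limit that motivates the NUMA design on the following slides.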

  11. Non-Uniform Memory Access: Concept
      ■ Part of the main memory is directly attached to a socket (local memory)
      ■ Memory attached to a different socket can be accessed indirectly via the other socket's memory controller and interconnect (remote memory)
      ■ Socket + local memory form a NUMA node
      [Figure: four sockets, each with cores, an interconnect, a memory controller and directly attached memory modules]
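
On Linux, the grouping of CPUs into NUMA nodes can be inspected through sysfs. The following sketch is not part of the slides; it assumes the `/sys/devices/system/node/nodeN/cpulist` files that Linux exposes, and the helper names `parse_cpulist` and `numa_nodes` are chosen here for illustration.

```python
import glob
import os
import re


def parse_cpulist(text: str) -> list[int]:
    """Expand a cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list, e.g. a node without CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map NUMA node id to its CPU ids; returns {} if the sysfs tree is absent."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as handle:
                nodes[int(match.group(1))] = parse_cpulist(handle.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: CPUs {cpus}")
```

On a machine with a single memory domain this typically reports one node; on non-Linux systems the function simply returns an empty mapping.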

  12. Non-Uniform Memory Access: Characteristics
      ■ Local memory access does not involve inter-socket links, but they are shared for remote requests
        ➢ Local performance can suffer from remote activity
      ■ Remote memory access involves one or more inter-socket links, as they need not form a complete graph
        ➢ Access to different remote memory regions is non-uniform as well
      [Figure: Socket0 to Socket3 with four cores each (C00 to C33), local memory modules, and inter-socket links]
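
Because the inter-socket links need not form a complete graph, the number of links a remote access crosses depends on the pair of sockets involved. A minimal sketch of this, using breadth-first search over a link graph: the four-socket ring topology below is an assumption made for illustration, since the slide text does not state which links exist.

```python
from collections import deque


def hop_counts(links: dict[int, list[int]], source: int) -> dict[int, int]:
    """Breadth-first search: minimum number of inter-socket links to each socket."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical topology: four sockets connected in a ring (0-1-2-3-0).
RING = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

if __name__ == "__main__":
    for socket in RING:
        print(f"from socket {socket}: {hop_counts(RING, socket)}")
```

In the assumed ring, each socket has two neighbors one hop away and one socket two hops away, so remote accesses already fall into two distinct cost classes.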

  13. Non-Uniform Memory Access: Concept
      ■ Multiple point-to-point links between sockets scale better than a shared interconnect
      ■ Multiple memory controllers partition the address space and provide a higher total memory bandwidth (though the bandwidth to a single local region remains the same)
      ■ Access to local memory behaves exactly like a UMA system
      ■ Access to remote memory traverses more hops (local interconnect → inter-socket link → remote interconnect → remote memory controller)
        ➢ Certainly higher access latency
        ➢ Probably lower bandwidth, as the inter-socket link is likely not as wide as on-chip connections
      ➢ Predominant architecture for current multi-socket machines
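
The two effects named above, higher total bandwidth and higher remote latency, can be put into a small numeric sketch. The model and all numbers are assumptions for illustration only: latency is treated as a weighted average of a local and a remote cost, and every memory controller is assumed to deliver the same bandwidth.

```python
def aggregate_bandwidth_gbs(num_controllers: int, per_controller_gbs: float) -> float:
    """Total bandwidth if every memory controller is used in parallel."""
    return num_controllers * per_controller_gbs


def mean_access_latency_ns(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Average latency when a fraction of the accesses hits local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Hypothetical values: 4 controllers at 50 GB/s, 100 ns local, 300 ns remote.
    print(aggregate_bandwidth_gbs(4, 50.0), "GB/s in total")
    for fraction in (1.0, 0.5, 0.0):
        print(fraction, mean_access_latency_ns(fraction, 100.0, 300.0), "ns")
```

The model makes the trade-off visible: the aggregate bandwidth is only reachable if the accesses are spread over all controllers, while the average latency only stays near the local value if most accesses are local.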

  14. Non-Uniform Memory Access: Terminology
      Physical Perspective:
        1. Hardware Thread
        2. Core
        3. Chip, Die
        4. Multichip Module
        5. Socket, Package, Processor, CPU
        6. Mainboard
        7. Machine, System
      Logical Perspective:
        ■ Core, CPU, Processing Unit, Processing Element
        ■ NUMA Node/Region

  15. Non-Uniform Memory Access: Example: SGI UV 300H
      ■ 240 Cores
      ■ 12 TB RAM
      ■ 16 Sockets
      What is a killer application for such a machine?
        ➢ In-Memory Databases!
      [Figure: Workload Taxonomy by Pfister. Axes: synchronization traffic frequency and data traffic volume. Quadrants: HSLD, HSHD, LSLD, LSHD. Labels: "Parallel Hell", "Parallel Nirvana", UMA, NUMA, Cluster]

  16. Non-Uniform Memory Access: Example: SGI UV 300H
      Experiment: NUMA behavior when scaling a workload
      ■ Machine has 16 sockets x 15 cores x 2-way SMT (allocated in locality order)
        ➢ Performance degrades when using more than two sockets!
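
To relate a thread count in this experiment to the number of sockets involved, the machine dimensions from the slide can be turned into a small calculation. The sketch assumes that "locality order" means all hardware threads of one socket are used before the next socket is touched; the slide does not spell this out.

```python
import math

SOCKETS = 16
CORES_PER_SOCKET = 15
SMT_WAYS = 2
THREADS_PER_SOCKET = CORES_PER_SOCKET * SMT_WAYS  # 30 hardware threads
TOTAL_THREADS = SOCKETS * THREADS_PER_SOCKET      # 480 hardware threads


def sockets_used(num_threads: int) -> int:
    """Sockets touched by num_threads threads under the assumed fill order."""
    if not 1 <= num_threads <= TOTAL_THREADS:
        raise ValueError("thread count outside the machine's capacity")
    return math.ceil(num_threads / THREADS_PER_SOCKET)


if __name__ == "__main__":
    for threads in (30, 60, 61, 480):
        print(f"{threads} threads -> {sockets_used(threads)} socket(s)")
```

Under this assumption, the reported degradation beyond two sockets would correspond to workloads with more than 60 threads.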

  17. Non-Uniform Memory Access: Characteristics
      ■ Unsuitable access patterns can severely degrade performance:
        □ Inter-socket link contention on excessive remote memory accesses
        □ Local memory controller contention on excessive combined local and remote memory accesses
        □ Local interconnect contention also on excessive multi-hop forward traffic
      [Figure: legend with local bandwidth utilization (high to low) and interconnect utilization (high to low)]

  18. Non-Uniform Memory Access: Data Access Patterns
      Single task accesses private buffer on a different node
      A. Relocate remote buffer to local memory
      B. Relocate task to remote node
      ➢ Reduce inter-socket contention
      [Figure: Node0 to Node3 with four cores each (C00 to C33) and local memory modules; options A and B marked]
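
Option B, moving the task to the node that holds its buffer, can be approximated on Linux by restricting the process to that node's CPUs. This is a sketch, not the method used in the lecture: it relies on `os.sched_setaffinity`, which is only available on Linux, and the helper name `run_on_cpus` and the CPU ids in the usage note are hypothetical. Relocating the buffer itself (option A) is not covered here.

```python
import os


def run_on_cpus(cpus: set[int]) -> set[int]:
    """Restrict the calling process to the given CPUs and return the new affinity.

    Linux only: os.sched_setaffinity is not available on other platforms.
    """
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_setaffinity"):
        allowed = os.sched_getaffinity(0)
        # Illustration only: pin to the lowest-numbered CPU we may use.
        print("now running on:", run_on_cpus({min(allowed)}))
    else:
        print("CPU affinity control is not available on this platform")
```

In practice the CPU set would be the CPU list of the target NUMA node, for example `run_on_cpus({8, 9, 10, 11})` if those hypothetical ids belonged to the node holding the buffer.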

  19. Non-Uniform Memory Access: Data Access Patterns
      Multiple tasks on multiple nodes access private buffers on a single node
      A. Relocate remote buffers to local memory
      ➢ Reduce memory controller contention
      [Figure: Node0 to Node3 with four cores each (C00 to C33) and local memory modules; option A marked]
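
The effect of relocating the buffers can be made concrete with a toy placement model. Everything in it is an assumption for illustration: task and buffer names, the four-node layout, and the simplification that each task owns exactly one buffer named after it.

```python
from collections import Counter


def controller_load(buffer_node: dict[str, int]) -> Counter:
    """Number of buffers served by each node's memory controller."""
    return Counter(buffer_node.values())


def remote_accessors(task_node: dict[str, int], buffer_node: dict[str, int]) -> int:
    """Number of tasks whose private buffer lives on a different node."""
    return sum(1 for task, node in task_node.items() if buffer_node[task] != node)


# Hypothetical layout: one task per node, each with one private buffer.
task_node = {"t0": 0, "t1": 1, "t2": 2, "t3": 3}
before = {task: 0 for task in task_node}  # all buffers on node 0
after = dict(task_node)                   # each buffer on its task's node

if __name__ == "__main__":
    print("before:", dict(controller_load(before)), remote_accessors(task_node, before))
    print("after: ", dict(controller_load(after)), remote_accessors(task_node, after))
```

Before the relocation, node 0's controller serves all four buffers and three tasks access remote memory; afterwards every controller serves one buffer and no task crosses an inter-socket link.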

  20. Non-Uniform Memory Access: Data Access Patterns
      Multiple tasks on a single node access private buffers on the same node
      A. Distribute tasks and buffers to different nodes
      ➢ Balance memory controller utilization
      [Figure: Node0 to Node3 with four cores each (C00 to C33) and local memory modules; option A marked]
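
One simple way to distribute tasks together with their buffers is a round-robin assignment over the nodes. The sketch below is only one possible policy, chosen here for illustration; the slide does not prescribe how the distribution is done, and the task names are hypothetical.

```python
from collections import Counter


def distribute_round_robin(tasks: list[str], num_nodes: int) -> dict[str, int]:
    """Assign each task (and, by assumption, its private buffer) to a node in turn."""
    if num_nodes < 1:
        raise ValueError("need at least one node")
    return {task: index % num_nodes for index, task in enumerate(tasks)}


if __name__ == "__main__":
    tasks = ["t0", "t1", "t2", "t3", "t4", "t5"]  # hypothetical task names
    placement = distribute_round_robin(tasks, 4)
    print(placement)
    print("tasks per node:", dict(Counter(placement.values())))
```

Since each buffer stays on the node of its task, all accesses remain local, and the load per memory controller differs by at most one buffer between nodes.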
