Parallel Programming and Heterogeneous Computing
Non-Uniform Memory Access
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group
Non-Uniform Memory Access
Context: Uniform Memory Access Machines
■ Multiple sockets access main memory through a shared interconnect.
■ Latency and bandwidth characteristics are equal for any pair of socket and memory location.
[Figure: Socket0 to Socket3, each with four cores, connected through a shared interconnect to a single memory controller and its memory modules.]
ParProg 2019, Non-Uniform Memory Access, Felix Eberhardt
Context: Uniform Memory Access Machines
Multiple sockets access main memory through a shared interconnect.
Problem:
■ Sockets contend for memory bandwidth
■ Full utilization of the memory controller link means only 1/4 utilization of each socket link (or 1/n utilization for n sockets)
[Figure: Socket0 to Socket3 sharing one interconnect and one memory controller, annotated "Contention".]
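The 1/n relation above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that every socket link has the same capacity as the memory controller link and that all sockets share the controller fairly; the function name `socket_link_utilization` is made up for this illustration.

```python
def socket_link_utilization(num_sockets: int, controller_utilization: float = 1.0) -> float:
    """Usable fraction of each socket link when all sockets share one
    memory controller link equally.

    Assumption: socket links and the controller link have equal capacity.
    """
    if num_sockets < 1:
        raise ValueError("num_sockets must be at least 1")
    return controller_utilization / num_sockets


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        share = socket_link_utilization(n)
        print(f"{n} sockets -> {share:.3f} of each socket link")
```

With four sockets the function returns 0.25, which matches the 1/4 figure on the slide.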
Context: Scalability
Parallelism for…
■ Speedup – compute faster
■ Throughput – compute more in the same time
■ Scalability – compute faster / more with additional resources
■ Price / performance – be as fast as possible for given money
■ Scavenging – compute faster / more with idle resources
Context: Scalability
■ Two basic approaches to scaling computing hardware:
  □ Scale-Up: combine more resources (memory or cores) in a tightly coupled system
    ➢ User perceives a single large shared-memory system
  □ Scale-Out: connect more machines in a loosely coupled network
    ➢ User perceives multiple communicating machines in a shared-nothing system
■ Recent coherent interconnect technologies enable hybrid systems with both scale-up and scale-out characteristics:
  □ Example: Gen-Z strives to connect an entire datacenter of machines coherently
    ➢ User perceives a shared-memory system, but with the performance characteristics (communication latency and bandwidth) of a shared-nothing system
Concept
■ Part of the main memory is directly attached to a socket (local memory)
■ Memory attached to a different socket can be accessed indirectly via the other socket's memory controller and interconnect (remote memory)
■ Socket + local memory form a NUMA node
[Figure: four sockets, each with cores, an interconnect, a memory controller and locally attached memory modules.]
Concept
■ Multiple point-to-point links between sockets scale better than a shared interconnect
■ Multiple memory controllers partition the address space and provide a higher total memory bandwidth (though the bandwidth to a single local region remains the same)
■ Access to local memory behaves exactly like a UMA system
■ Access to remote memory traverses more hops (local interconnect -> inter-socket link -> remote interconnect -> remote memory controller)
  ➢ Certainly higher access latency
  ➢ Probably lower bandwidth, as the inter-socket link is likely not as wide as on-chip connections
➢ Predominant architecture for current multi-socket machines
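The points above can be turned into a toy model. This is only a sketch under stated assumptions: the address space is split into equal contiguous ranges per node (real machines may interleave instead), and the constants `NODE_MEMORY_BYTES`, `LOCAL_LATENCY_NS` and `HOP_PENALTY_NS` are hypothetical values chosen for illustration, not measurements.

```python
NODE_MEMORY_BYTES = 4 * 2**30   # hypothetical: 4 GiB of memory per NUMA node
LOCAL_LATENCY_NS = 100.0        # hypothetical local access latency
HOP_PENALTY_NS = 50.0           # hypothetical extra latency per inter-socket hop


def home_node(physical_address: int, node_memory_bytes: int = NODE_MEMORY_BYTES) -> int:
    """Node whose memory controller owns the address (contiguous partitioning assumed)."""
    return physical_address // node_memory_bytes


def total_bandwidth(num_nodes: int, per_controller_bandwidth: float) -> float:
    """Aggregate bandwidth grows with the number of controllers,
    while each single region still offers only per_controller_bandwidth."""
    return num_nodes * per_controller_bandwidth


def estimated_latency_ns(inter_socket_hops: int) -> float:
    """Zero hops is a local access; every additional hop adds a fixed penalty."""
    return LOCAL_LATENCY_NS + inter_socket_hops * HOP_PENALTY_NS


if __name__ == "__main__":
    address = 5 * 2**30
    print("home node of address:", home_node(address))
    print("total bandwidth of 4 nodes:", total_bandwidth(4, 25.0))
    print("local vs. one-hop latency:", estimated_latency_ns(0), estimated_latency_ns(1))
```

The model ignores the bandwidth limit of the inter-socket link itself, which the slide names as a likely second penalty for remote accesses.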
Terminology
Physical Perspective
1. Hardware Thread
2. Core
3. Chip, Die
4. Multichip Module
5. Socket, Package, Processor, CPU
6. Mainboard
7. Machine, System
Logical Perspective
■ Core, CPU, Processing Unit, Processing Element
■ NUMA Node/Region
Example: SGI UV 300H
■ 240 Cores
■ 12 TB RAM
■ 16 Sockets
➢ What is the Killer Application for such a machine? In-Memory Databases!
[Figure: Workload Taxonomy by Pfister. Axes: synchronization frequency and data traffic volume. Quadrants: HSLD, HSHD, LSLD, LSHD. Labels: "Parallel Hell", "Parallel Nirvana", UMA, NUMA, Cluster.]
Example: SGI UV 300H
Experiment: Deploy a Database Workload on a NUMA Machine
■ 15 Cores / 30 Threads per Socket
➢ Performance degrades when using more than two sockets!
Characteristics
■ Local memory access does not involve inter-socket links
■ Remote memory access involves one or more inter-socket links
■ Inter-socket links might not form a complete graph
  ➢ Performance of remote memory access is non-uniform as well (e.g. S0 can access memory on S3 and S1 with fewer hops than on S2)
[Figure: Socket0 to Socket3, each with four cores and local memory, connected by inter-socket links.]
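The S0 example can be reproduced with a breadth-first search over the link graph. This is a minimal sketch, assuming the four sockets form a ring (S0-S1-S2-S3-S0) as the slide's example suggests; the `LINKS` table and the function name `hop_counts` are illustrative, not taken from any real topology API.

```python
from collections import deque

# Assumed ring topology: each socket is linked to two neighbours only,
# so the inter-socket links do not form a complete graph.
LINKS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}


def hop_counts(start: int, links: dict = LINKS) -> dict:
    """Minimum number of inter-socket hops from `start` to every socket (BFS)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        socket = queue.popleft()
        for neighbour in links[socket]:
            if neighbour not in distance:
                distance[neighbour] = distance[socket] + 1
                queue.append(neighbour)
    return distance


if __name__ == "__main__":
    for target, hops in sorted(hop_counts(0).items()):
        print(f"S0 -> S{target}: {hops} hop(s)")
```

Under the ring assumption, S1 and S3 are one hop away from S0 and S2 is two hops away, so remote accesses from S0 fall into two different cost classes.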
Characteristics
■ Unsuitable access patterns can severely degrade performance:
  □ Inter-socket link contention on excessive remote memory accesses
  □ Local memory controller contention on excessive combined local and remote memory accesses
  □ Local interconnect contention also on excessive multi-hop forward traffic
[Figure: local bandwidth utilization and interconnect utilization, each shown on a scale from low to high.]
Data Access Patterns
Single task accesses private buffer on a different node
1. Relocate remote buffer to local memory
2. Relocate task to remote node
➢ Reduce inter-socket contention
[Figure: Node0 to Node3, each with four cores and local memory.]
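Option 2, relocating the task, could look roughly like this on Linux. This is a sketch under assumptions: it relies on the sysfs file `/sys/devices/system/node/node<N>/cpulist` and on `os.sched_setaffinity`, both of which are Linux-specific, and the helper names `parse_cpulist`, `cpus_of_node` and `move_task_to_node` are made up for this example. Option 1, relocating the buffer, would need page migration and is not shown.

```python
import os


def parse_cpulist(text: str) -> list:
    """Parse a Linux cpulist string such as '0-3,8' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_of_node(node: int) -> list:
    """CPUs that belong to a NUMA node, read from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as handle:
        return parse_cpulist(handle.read())


def move_task_to_node(node: int, pid: int = 0) -> None:
    """Restrict a task (pid 0 means the calling process) to the CPUs of one node.

    This moves the computation next to its data; the buffer itself stays in place.
    """
    os.sched_setaffinity(pid, cpus_of_node(node))
```

A task whose private buffer lives on node 1 could call `move_task_to_node(1)` before its main loop, so that subsequent accesses are served by the local memory controller instead of crossing an inter-socket link.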