I: Performance metrics (cont’d)
II: Parallel programming models and mechanics
Prof. Richard Vuduc
Georgia Institute of Technology
CSE/CS 8803 PNA, Spring 2008 [L.05]
Tuesday, January 22, 2008
1
Algorithms for the 2-D (3-D) Poisson problem, N = n^2 (= n^3)

Algorithm        | Serial            | PRAM                  | Memory            | # procs
Dense LU         | N^3               | N                     | N^2               | N^2
Band LU          | N^2 (N^{7/3})     | N                     | N^{3/2} (N^{5/3}) | N (N^{4/3})
Jacobi           | N^2 (N^{5/3})     | N (N^{2/3})           | N                 | N
Explicit inverse | N^2               | log N                 | N^2               | N^2
Conj. grad.      | N^{3/2} (N^{4/3}) | N^{1/2} (N^{1/3}) log N | N               | N
RB SOR           | N^{3/2} (N^{4/3}) | N^{1/2} (N^{1/3})     | N                 | N
Sparse LU        | N^{3/2} (N^2)     | N^{1/2}               | N log N (N^{4/3}) | N
FFT              | N log N           | log N                 | N                 | N
Multigrid        | N                 | log^2 N               | N                 | N
Lower bound      | N                 | log N                 | N                 |

PRAM = idealized parallel model with zero communication cost.
Source: Demmel (1997)
2
Sources for today’s material
Mike Heath (UIUC)
CS 267 (Yelick & Demmel, UCB)
3
Efficiency and scalability metrics (wrap-up) 4
Example: Summation using a tree algorithm

Efficiency:  E_p ≡ C_1 / C_p = n / (n + p log p) ≈ 1 / (1 + (p/n) log p)
5
Basic definitions

M   Memory complexity          Storage for given problem (e.g., words)
W   Computational complexity   Amount of work for given problem (e.g., flops)
V   Processor speed            Ops / time (e.g., flop/s)
T   Execution time             Elapsed wallclock (e.g., secs)
C   Computational cost         (No. procs) * (exec. time) (e.g., processor-hours)
6
Parallel scalability

An algorithm is scalable if  E_p ≡ C_1 / C_p = Θ(1)  as p → ∞

Why use more processors?
- Solve a fixed problem in less time
- Solve a larger problem in the same time (or any time)
- Obtain sufficient aggregate memory
- Tolerate latency and/or use all available bandwidth (Little’s Law)
7
Is this algorithm scalable?

No: not under fixed problem size, fixed execution time, or fixed work per processor.

Instead, determine the isoefficiency function n(p) for which efficiency stays constant:

  E_p ≡ C_1 / C_p = n / (n + p log p) ≈ 1 / (1 + (p/n) log p) = E (const.)
  ⇒ n(p) = Θ(p log p)

But then execution time grows with p:

  T_p = n/p + log p = Θ(log p)
8
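To make this concrete, here is a minimal C sketch (illustrative only) that evaluates the efficiency formula above for a fixed n and for the isoefficiency scaling n(p) = p log p; the base-2 log and the particular n = 10^6 are arbitrary choices, not values from the slides:

```c
#include <math.h>
#include <stdio.h>

/* Idealized tree-summation cost model from the slides:
 *   T_p = n/p + log2(p),   C_p = p * T_p,   E_p = C_1 / C_p  with C_1 = n. */
static double efficiency(double n, double p) {
    double Tp = n / p + log2(p);   /* parallel execution time */
    double Cp = p * Tp;            /* parallel cost            */
    return n / Cp;
}

int main(void) {
    printf("%10s %16s %18s\n", "p", "E_p, fixed n", "E_p, n = p lg p");
    for (double p = 2; p <= 1048576; p *= 4) {
        double fixed_n  = 1e6;            /* strong scaling: n stays put */
        double scaled_n = p * log2(p);    /* isoefficiency: n = p log p  */
        printf("%10.0f %16.4f %18.4f\n", p,
               efficiency(fixed_n, p), efficiency(scaled_n, p));
    }
    return 0;
}
```

With fixed n the efficiency decays toward zero as p grows, while with n = p log p it stays pinned at 1/2, matching the Θ(p log p) isoefficiency result.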
A simple model of communication performance 9
Time to send a message
[Figure: time (μsec, log scale) vs. message size (bytes, 8 B to 128 KB) for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, and GigE/MPI.]
10
Latency and bandwidth model

Model the time to send an n-byte message in terms of latency α and bandwidth β:

  t(n) = α + n/β

- Usually cost(flop) << 1/β << α
- One long message is cheaper than many short ones
- Can do hundreds or thousands of flops in the time it takes to send one message
- Efficiency demands a large computation-to-communication ratio
11
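A minimal C sketch of the α–β model; the α and β values below (10 μsec, 100 MB/s) are made-up placeholders, not measurements from the plots, and serve only to show why one long message beats many short ones:

```c
#include <stdio.h>

/* alpha-beta model: time to send n bytes is t(n) = alpha + n / beta.
 * The constants are illustrative placeholders, not measured values.  */
static const double alpha = 10e-6;   /* latency: 10 microseconds  */
static const double beta  = 100e6;   /* bandwidth: 100 MB/s       */

static double msg_time(double nbytes) { return alpha + nbytes / beta; }

int main(void) {
    double total = 1 << 20;   /* send 1 MB of data in total */
    for (int chunks = 1; chunks <= 1024; chunks *= 8) {
        double t = chunks * msg_time(total / chunks);
        printf("%5d message(s) of %8.0f bytes: %9.1f usec\n",
               chunks, total / chunks, t * 1e6);
    }
    return 0;
}
```

Splitting the same megabyte into 1024 small messages roughly doubles the total time in this model, because each extra message pays the latency α again.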
Empirical latency and (inverse) bandwidth (μsec) on real machines
12
Time to send a message (measured)
[Figure: same measured plot as slide 10 — time (μsec) vs. message size (bytes) for the machines and network layers listed there.]
13
Time to send a message (model)
[Figure: modeled message time, t(n) = α + n/β, vs. message size (bytes) for the same machines and network layers as the measured plot.]
14
Latency on some current machines (MPI round-trip)
[Figure: 8-byte MPI ping-pong round-trip latency (μsec) on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed, ranging from about 6.6 to 24.2 μsec.]
Source: Yelick (UCB/LBNL)
15
End-to-end latency (1/2 round-trip) over time
[Figure: end-to-end latency (μsec, log scale) vs. year, 1988–2006, falling from roughly 170 μsec on the earliest machines shown to about 1.2 μsec on the best current networks.]
Source: Yelick (UCB/LBNL)
16
Bandwidth vs. Message Size Source: Mike Welcome (NERSC) 17
Parallel programming models 18
A generic parallel architecture

- Physical location of memories and processors?
- Connectivity?

[Diagram: multiple processors and memories attached to an interconnection network.]
19
What is a “parallel programming model”?

Languages + libraries composing an abstract view of the machine.

Major constructs:
- Control: Create parallelism? Execution model?
- Data: Private vs. shared?
- Synchronization: Coordinating tasks? Atomicity?

Variations in models:
- Reflect diversity of machine architectures
- Imply variations in cost
20
Running example: Summation

Compute the sum  s = Σ_{i=1}^{n} f(a_i)

Questions:
- Where is “A”?
- Which processors do what?
- How to combine?

[Diagram: A[1..n] → f(·) → f(A[1..n]) → ⊕ → s]
21
Programming model 1: Shared memory

- Program = collection of threads of control
- Each thread has private variables
- Threads may access shared variables, for communicating implicitly and synchronizing

[Diagram: threads P0, P1, ..., Pn, each with private memory (e.g., i: 5, i: 8, i: 2), reading and writing a shared variable s in shared memory.]
22
Need to avoid race conditions: Use locks

Race condition (data race): two threads access the same variable concurrently, and at least one access is a write.

shared int s = 0;

Thread 1:                  Thread 2:
  for i = 0, n/2-1           for i = n/2, n-1
    s = s + f(A[i])            s = s + f(A[i])
23
Need to avoid race conditions: Use locks

Explicitly lock to guarantee atomic operations:

shared int s = 0;
shared lock lk;

Thread 1:                            Thread 2:
  local_s1 = 0                         local_s2 = 0
  for i = 0, n/2-1                     for i = n/2, n-1
    local_s1 = local_s1 + f(A[i])        local_s2 = local_s2 + f(A[i])
  lock(lk);                            lock(lk);
  s = s + local_s1                     s = s + local_s2
  unlock(lk);                          unlock(lk);
24
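For comparison, a runnable C/pthreads version of the same pattern; the array contents, the function f, and the two-way split are stand-ins chosen for the example:

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 2

static double A[N];
static double s = 0.0;                                  /* shared sum  */
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;  /* shared lock */

static double f(double x) { return 2.0 * x; }  /* placeholder for f(.) */

static void *partial_sum(void *arg) {
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    double local_s = 0.0;                 /* private partial sum       */
    for (long i = lo; i < hi; i++)
        local_s += f(A[i]);
    pthread_mutex_lock(&lk);              /* atomic update of shared s */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1.0;
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, partial_sum, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("s = %g\n", s);   /* expect 2e+06 */
    return 0;
}
```

Each thread accumulates into a private local_s, so the lock is taken only once per thread rather than once per element.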
Machine model 1a: Symmetric multiprocessors (SMPs)

- All processors connect to a large shared memory
- Challenging to scale both hardware and software beyond ~32 procs

[Diagram: processors P1, P2, ..., Pn, each with a cache ($), on a bus in front of a shared cache and shared memory.]
25
[Figure. Source: Pat Worley (ORNL)]
26
Machine model 1b: Simultaneous multithreaded processor (SMT)

- Multiple thread contexts share memory and functional units
- Switch among threads during long-latency memory ops

[Diagram: thread contexts T0, T1, ..., Tn sharing caches, floating-point units, etc., in front of memory.]
27
Cray Eldorado processor
Source: John Feo (Cray)
28
Machine model 1c: Distributed shared memory

- Memory logically shared, but physically distributed
- Challenging to scale cache-coherence protocols beyond ~512 procs
- Cache lines (pages) must be large to amortize overhead; locality is critical to performance

[Diagram: processors P1, P2, ..., Pn, each with a cache ($) and a local memory, connected by a network.]
29
Programming model 2: Message passing

- Program = collection of named processes
- No shared address space
- Processes communicate via explicit send/receive operations

[Diagram: processes P0, P1, ..., Pn, each with private variables (e.g., s: 12, i: 1), exchanging data over a network via send(P1, s) / receive(Pn, s).]
30
Example: Computing A[1] + A[2]

Processor 1:              Processor 2:
  x = A[1]                  x = A[2]
  SEND x → Proc. 2          SEND x → Proc. 1
  RECEIVE y ← Proc. 2       RECEIVE y ← Proc. 1
  s = x + y                 s = x + y

What could go wrong in this code?
- Scenario A: send/receive is like the telephone system (both sides must be on the line)
- Scenario B: send/receive is like the post office (messages are buffered)
31
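In MPI terms (a sketch, assuming exactly two ranks): if both processors issue a blocking, synchronous send first (the “telephone” scenario), each waits for the other and the program can deadlock; pairing the send and receive, e.g. with MPI_Sendrecv, avoids this. The array values here are stand-ins for A[1] and A[2].

```c
#include <mpi.h>
#include <stdio.h>

/* Two ranks exchange one value each and both compute the sum.
 * MPI_Sendrecv pairs the send and the receive so neither rank can
 * block forever waiting for the other. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double A[2] = {1.0, 2.0};     /* stand-in for A[1], A[2] */
    double x = A[rank], y;
    int other = 1 - rank;         /* assumes exactly 2 ranks */

    MPI_Sendrecv(&x, 1, MPI_DOUBLE, other, 0,
                 &y, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double s = x + y;
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}
```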
Machine model 2a: Distributed memory

- Separate processing nodes, each with its own memory
- Nodes communicate through a network interface (NI) over an interconnect

[Diagram: nodes P0, P1, ..., Pn, each with local memory and an NI, attached to an interconnect.]
32
Programming model 2b: Global address space (GAS)

- Program = collection of named threads
- Data are shared, but partitioned over local processes
- Implied cost model: remote accesses cost more

[Diagram: threads P0, P1, ..., Pn, each with private memory (i: 5, i: 8, i: 2) and a local partition s[myThread] of a shared array s[0..n]; any thread may read y = ..s[i]..]
33
Machine model 2b: Global address space

- Same as distributed memory, but the NI can access memory without interrupting the CPU
- One-sided communication; remote direct memory access (RDMA)

[Diagram: same layout as the distributed-memory machine — nodes with local memory and NIs on an interconnect.]
34
Programming model 3: Data parallel

- Program = single thread performing parallel operations on data
- Implicit communication and coordination; easy to understand
- Drawback: not always applicable
- Examples: HPF, MATLAB/StarP

[Diagram: A[1..n] → f(·) → f(A[1..n]) → ⊕ → s]
35
Machine model 3a: Single instruction, multiple data (SIMD)

- A control processor issues each instruction; (usually) simpler processors execute it
- Some processors may be “turned off” for a given instruction
- Examples: CM-2, MasPar

[Diagram: a control processor driving many processor + memory nodes, each with an NI, over an interconnect.]
36
Machine model 3b: Vector processors

- Single processor with multiple functional units performing the same operation
- An instruction specifies a large amount of parallelism; the hardware executes it on a subset at a time
- Rely on the compiler to find parallelism
- Resurgent interest:
  - Large scale: Earth Simulator, Cray X1
  - Small scale: SIMD units (e.g., SSE, AltiVec, VIS)
37
Vector hardware

- Operations on vector registers, O(10–100) elements per register
- A vector add vr3 = vr1 + vr2 (logically) performs # elts adds in parallel, vs. a scalar add r3 = r1 + r2
- Actual hardware has 2–4 vector pipes or lanes

[Diagram: scalar registers r1 + r2 → r3 next to vector registers vr1 + vr2 → vr3.]
38
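As a small-scale illustration of the same picture, here is a C sketch using x86 SSE intrinsics (one of the SIMD units listed above, chosen as an example); each _mm_add_ps logically performs four single-precision additions at once:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers hold 4 floats */

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* One vector add per iteration: logically 4 element-wise adds,
     * executed on however many lanes the hardware actually has.    */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++) printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```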
Programming model 4: Hybrid

- May mix any combination of the preceding models
- MPI + threads
- DARPA HPCS languages mix threads and data parallelism in a global address space
39
Machine model 4: Clusters of SMPs (CLUMPs)

- Use SMPs as building-block nodes
- Many clusters (e.g., GT “warp” cluster)
- Best programming model?
  - “Flat” MPI
  - Shared memory within an SMP, MPI between nodes
40
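A hedged sketch of the second option, using MPI across nodes and OpenMP threads within a node for the running summation example; the slice size, array contents, and f are illustrative stand-ins:

```c
#include <mpi.h>
#include <stdio.h>

/* Hybrid sketch: OpenMP threads reduce within a node (shared memory),
 * MPI reduces across nodes. Assumes the MPI library provides at least
 * MPI_THREAD_FUNNELED support. Compile with OpenMP enabled.          */
static double f(double x) { return 2.0 * x; }  /* placeholder for f(.) */

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("warning: MPI library lacks requested thread support\n");

    enum { LOCAL_N = 250000 };        /* each rank owns a slice of A */
    static double A[LOCAL_N];
    for (int i = 0; i < LOCAL_N; i++) A[i] = 1.0;

    double local_s = 0.0;
    #pragma omp parallel for reduction(+:local_s)  /* threads within the node */
    for (int i = 0; i < LOCAL_N; i++)
        local_s += f(A[i]);

    double s = 0.0;
    MPI_Reduce(&local_s, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);  /* across nodes */
    if (rank == 0) printf("s = %g\n", s);

    MPI_Finalize();
    return 0;
}
```

With one MPI rank per node and one OpenMP thread per core, the same code also degenerates to “flat” MPI if OpenMP is disabled.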
Administrivia 41