


  1. Bridging Models and Machines
     PRAM – BSP – Delay Model – LogP

     Machine Models
     - What are models good for?
       - Abstracting from machine properties
       - Making programming simple
       - Making programs portable
     - Reflecting essential machine properties
       - Functionality (sure)
       - Costs (the programmer should understand that a program is expensive when (s)he writes it), as long as they cannot be hidden by compilers
     - Success of the von-Neumann machine model

     Problem
     - What is the von-Neumann model for parallel machines?
     - Much more sensitive wrt. reflecting performance
     - Much more diverse in existing architectures
       - Message passing networks of different kinds
       - Internet
       - Shared and virtual shared memory machines
       - Vector machines …
     - Conflict between
       - Easy to program, portability of programs
       - Accurately reflecting performance

     Questions
     - Evaluate the models with respect to
       - Programmability
       - Reality
       - Simulations
       - Compilations

     PRAM revisited
     + Easy to program
     + Portable programs
     - Unrealistic assumptions like constant time memory access
     - Expensive simulations on existing message passing architectures
       - Looks ok in the O-calculus
       - Large constants on message passing machines
       - But constant speed-up is all we can hope for

     Theoretical simulation results
     - Deterministic and probabilistic methods
     - Deterministic
       - Each PRAM cell is stored on different nodes (memory organization scheme)
       - A general optimal memory organization scheme is unknown (only its existence for individual topologies)
       - E.g. $O(\log^2 p / \log \log p)$ for a p-PRAM step on a p-mesh
     - Probabilistic
       - Probabilistic distribution of the memory cells
       - E.g. $O(\max(\log p, v/p))$ for a v-PRAM step on a p-CCC, p-hypercube, or p-butterfly (Valiant); optimal if $v > p \log p$

  2. Measurements - time (Zimmermann, Kumm)
     [Figure: time measurements, MASPAR MP-1, 256 processors]

     Measurements - steps (Zimmermann, Kumm)
     [Figure: step measurements, MASPAR MP-1, 256 processors]

     Measurements
     [Figure: MASPAR MP-1, 256 processors]

     BSP (Valiant)
     - Bulk-Synchronous Parallel Machine
     - Avoids the costs in the PRAM simulation for hashing, sorting, queuing, other organizational tricks ☺
     - Let the programmer handle this problem
     - Bridging model for parallel computation
     - Standard results on probabilistic PRAM simulations in Handbook of Theoretical CS are by Valiant
     - Even he obviously sees a need to get closer to reality

     BSP (Valiant)
     [Diagram: processors P P P, shared memory S, local memories M M M]
     - Processor (P),
     - Virtually shared (S) and/or local Memory (M),
     - Common synchronization
     - Router

     BSP Computations
     - In super-steps, each:
       - Processors read values required in a step
       - Perform computations locally
       - Store values computed in that step
       - Bulk-synchronize before the next step
     - Periodicity of L for synchronization

  3. Cost Model
     - Router can handle h-relations in time $h g' + s$
       - Number of messages sent or received: $h$
       - Router throughput: $g'$
       - Startup time: $s$
     - For simplicity, define a $g$ such that the router can handle h-relations in time $h g$ for $h > h_0$ (some initial value), e.g. take $g = 2g'$ assuming $h g' > s$
     - Router implementation is hidden in a library

     Super-steps
     [Figure: one super-step, axes time vs. processors; phases from bottom to top: local read, compute, global write in time $g h$, barrier; period L]

     Periodic synchronization
     - Assumed to be implemented in hardware
       - At least independent of the processors
       - Otherwise there wouldn't be any processor capacity left for computation in the super-steps
     - Bound from below by the hardware
     - Bound from above by the application
       - Larger super-steps mean longer independent parallel computations without the need to establish a consistent memory state
       - Requires higher granularity in the problem to allow that

     BSP (McColl)
     [Diagram: processors P P P, memories M M M]
     - Processor (P),
     - Memory (M),
     - Common synchronization in software
     - Network connected

     Super-steps
     [Figure: one super-step, axes time vs. processors; phases from bottom to top: local read, compute (max. in time $w$), communicate (max. h-relation in time $g h$), barrier $l$]

     Discussion Valiant vs. McColl
     - McColl
       - Gives up periodicity L as unnecessary constraint
       - Introduces explicit synchronization time $l$ accounting for synchronization in processors, i.e., sharing the hardware with computation and communication
       - Assumes message passing to address usual hardware
     - Valiant
       - Preserves the ability of managed data distribution from the deterministic PRAM simulation
       - Allowing user defined data distribution if applicable
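The cost model on the slides above can be made concrete with a few lines of Python. This is a minimal sketch, not code from the lecture: the function names `router_time`, `simplified_router_time`, and `superstep_time` are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def router_time(h, g_prime, s):
    """Time for the router to deliver an h-relation: h * g' + s."""
    return h * g_prime + s


def simplified_router_time(h, g):
    """Simplified router cost h * g, intended for h > h_0."""
    return h * g


def superstep_time(w, h, g, l):
    """Cost of one super-step in the McColl variant.

    w[i] is the local computation time of processor i,
    h[i] is the number of messages processor i sends or receives,
    g is the per-message router cost, l is the synchronization time.
    """
    return max(w) + max(h) * g + l
```

With illustrative values `g_prime = 0.5`, `s = 3`, and `h = 10` (so that `h * g_prime > s`), choosing `g = 2 * g_prime` makes the simplified cost an upper bound on the detailed one, which is the point of the `g = 2g'` rule of thumb.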

  4. Design a BSP program
     - Execution time:
       $T = \sum_{s \in \text{super-steps}} \left( \max_{i \in \text{procs}} w_{i,s} + \max_{i \in \text{procs}} h_{i,s}\, g + l \right)$
     - Implications for algorithm design:
       - Balance computation because of $\max_{i \in \text{procs}} w_i$
       - Balance communications because of $\max_{i \in \text{procs}} h_i\, g$
       - Minimize the number of super-steps because of $|\text{super-steps}| \times l$

     Example Prefix Sums – Plan A

         for (p=0; p<n; p++) in parallel {
           right=init(p); left=0;
           for (i=1; i<n; i*=2) {
             if (p+i < n)
               put(p+i, right, left);   // put(target processor, value from local variable, to target variable)
             barrier_synchronize();
             if (p >= i)
               right=right+left;
           }
         }

     Prefix Sums (cont.)
     [Figure: communication pattern, axes time vs. processors; steps i=1, i=2, i=4, i=8; p=10]

     Prefix Sums (cont.)
     [Figure: communication pattern, axes time vs. processors; steps i=1, i=2, i=4, i=8]

     Analysis of Prefix Sums
     - BSP execution time in general:
       $T = \sum_{s \in \text{super-steps}} \left( \max_{i \in \text{procs}} w_{i,s} + \max_{i \in \text{procs}} h_{i,s}\, g + l \right)$
     - Prefix Sums execution time:
       - Initialization: $w_{i,0} = 1$
       - All steps perform a "+" operation: $w_{i,s} = 1$
       - All steps route a 1-relation: $h_{i,s} = 1$
       - $\lceil \log n \rceil$ super-steps in total
       $T = 1 + \lceil \log n \rceil (1 + 1 \cdot g + l)$

     Prefix Sums – Plan B

         for (p=0; p<n; p++) in parallel {
           right=init(p); array[0…n-1]=0;
           for (i=p+1; i<n; i++)
             put(i, right, array[p]);   // slot p on processor i receives the value of processor p
           barrier_synchronize();
           for (i=0; i<p; i++)
             right=right+array[i];
         }
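The two plans above can be checked with a sequential Python simulation. This is a sketch under stated assumptions, not the lecture's code: a list index stands in for a processor, each loop over `p` stands in for one parallel phase, and for Plan B the value sent by processor `p` is assumed to land in slot `p` of the receiver's buffer. The function names are hypothetical.

```python
import math


def prefix_sums_plan_a(values):
    """Simulate Plan A; returns (prefix sums, number of super-steps)."""
    n = len(values)
    right = list(values)
    supersteps = 0
    i = 1
    while i < n:
        left = [0] * n
        # Communication phase: processor p puts its `right` into `left` on p+i.
        for p in range(n):
            if p + i < n:
                left[p + i] = right[p]
        # The barrier sits here; afterwards every processor adds what it received.
        for p in range(n):
            if p >= i:
                right[p] += left[p]
        supersteps += 1
        i *= 2
    return right, supersteps


def prefix_sums_plan_b(values):
    """Simulate Plan B: one communication phase, one barrier, local summation."""
    n = len(values)
    # array[q] is the receive buffer of processor q.
    array = [[0] * n for _ in range(n)]
    for p in range(n):
        for i in range(p + 1, n):
            array[i][p] = values[p]  # assumption: slot p holds the value of p
    # The barrier sits here; afterwards each processor sums what it received.
    return [values[p] + sum(array[p][:p]) for p in range(n)]


def plan_a_cost(n, g, l):
    """T = 1 + ceil(log n) * (1 + 1*g + l), with log taken to base 2."""
    return 1 + math.ceil(math.log2(n)) * (1 + g + l)
```

Both simulations return inclusive prefix sums, and the super-step count of Plan A matches the $\lceil \log n \rceil$ term in the analysis.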

  5. Plan B (cont.)
     [Figure: communication pattern of Plan B, axes time vs. processors]

     Analysis of Prefix Sums – Plan B
     - Prefix Sums execution time:
       - 2 super-steps, one barrier synchronization
       - Initialization: $w_{i,0} = 1$
       - Processor $n-1$ performs $n$ "+" operations: $\max_{i \in \text{procs}} w_{i,s} = w_{n-1,1} = n$
       - Processor 0 sends and processor $n-1$ receives $n-1$ messages: $\max_{i \in \text{procs}} h_{i,s} = h_{0,0} = n-1$
       $T = 1 + n + (n-1)\, g + l$

     General Prefix Sums
     The assumption $n = P$ ($P$: number of actual processors) can be dropped using either algorithm, plan A or B:
     1. Sum of array blocks of size $n/P$ computed locally (sequential algorithm)
     2. Use plan A or B to compute the prefix sum in every $n/P$-th element (last of each block)
     3. Receive the result of the left neighbor's prefix sum
     4. Add the received value to the local sums

     Design of a BSP program
     - Requires machine parameters: l, g, P
       - Analytically derived: too complex, does not work
       - Benchmarks
     - Requires computation times of sequential algorithm
       - Analytically derived: too complex, does not work
       - Benchmarks: imprecise since
         - Caching, pipelining effects not repeatable
         - Data dependencies of sequential computation
     - In practice: analysis + profiling necessary

     Micro Benchmarks Load
     [Figure: SGI Power Challenge]

     Micro Benchmarks Store
     [Figure: SGI Power Challenge]
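The four steps of the general prefix sums scheme can be sketched sequentially in Python. This is an illustration under assumptions that are not on the slides: `n` is taken to be divisible by `P`, step 1 is read as computing local prefix sums per block, and `itertools.accumulate` stands in for the plan A or plan B run over the block totals. The name `blocked_prefix_sums` is hypothetical.

```python
from itertools import accumulate


def blocked_prefix_sums(values, P):
    """Prefix sums of n values on P simulated processors (n divisible by P)."""
    n = len(values)
    if P <= 0 or n % P != 0:
        raise ValueError("this sketch assumes n is a positive multiple of P")
    size = n // P
    blocks = [values[k * size:(k + 1) * size] for k in range(P)]

    # Step 1: every processor computes the prefix sums of its own block.
    local = [list(accumulate(block)) for block in blocks]

    # Step 2: prefix sums over the last element of each block.
    # (Stand-in for running plan A or plan B across the P processors.)
    block_prefix = list(accumulate(block[-1] for block in local))

    # Step 3: each processor receives the prefix sum of its left neighbor.
    offsets = [0] + block_prefix[:-1]

    # Step 4: add the received value to all local sums.
    return [x + offset for block, offset in zip(local, offsets) for x in block]
```

For example, `blocked_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8], 4)` splits the input into four blocks of two elements and combines them into the same result a plain sequential scan would give.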

  6. Performance Predictions
     [Figure: SGI Power Challenge / Radix sort]

     Some BSP Machines

         Machine               l        g (P-relation)   g (1-relation)   P
         SGI Power Challenge   25.7     0.13 x           0.13 x           4
         Hitachi SR2001        1321.7   0.92 x           0.9 x            32
         Parsytec GC           6700     34.1 x           14.1 x           32
         DEC-Farm              4664     8.1 x (single g value listed)     4
         IBM SP-2              208.2    0.43 x           0.27 x           8
         Cray T3D              16.6     0.48 x           0.36 x           32
         Cray T3D              31.1     0.78 x           0.42 x           256

     x in words and time in μs

     Performance Predictions
     [Figure: SGI Power Challenge / Sample sort]

     Profile Plan A
     [Figure: IBM SP/2, 8 processors; completion time]

     Profile Plan B
     [Figure: IBM SP/2, 8 processors; completion time]

     Profile Plan A
     [Figure: Cray T3D, 32 processors; completion time]
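The machine parameters in the table can be plugged into the two prefix-sum cost formulas from the analysis slides, $T_A = 1 + \lceil \log n \rceil (1 + g + l)$ and $T_B = 1 + n + (n-1) g + l$. The sketch below makes assumptions that are not on the slides: one "+" operation is counted as one time unit on the same scale as $l$, each message is one word, $n$ equals $P$, plan A uses the 1-relation $g$, and plan B uses the P-relation $g$. The results are therefore only a rough comparison, not a reproduction of the measured profiles.

```python
import math

# (l, g for a P-relation, g for a 1-relation, P), copied from the table.
MACHINES = {
    "IBM SP-2": (208.2, 0.43, 0.27, 8),
    "Cray T3D (32 processors)": (16.6, 0.48, 0.36, 32),
}


def plan_a_time(n, g, l):
    """Plan A: ceil(log2 n) super-steps, each routing a 1-relation."""
    return 1 + math.ceil(math.log2(n)) * (1 + g + l)


def plan_b_time(n, g, l):
    """Plan B: a single (n-1)-relation followed by one barrier."""
    return 1 + n + (n - 1) * g + l


def predict(machine):
    """Return (plan A estimate, plan B estimate) for n = P on the given machine."""
    l, g_p, g_1, P = MACHINES[machine]
    return plan_a_time(P, g_1, l), plan_b_time(P, g_p, l)
```

Under these assumptions, plan B comes out cheaper than plan A on both machines, because plan A pays the synchronization cost $l$ once per super-step while plan B pays it only once.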

  7. Profile Plan B
     [Figure: Cray T3D, 32 processors; completion time]

     Observations
     - Plan A could be seen as a PRAM simulation
     - Plan B designed directly for BSP
       - Appears absurd on PRAM
       - Advantages show on the more realistic machine model BSP
       - Programming becomes more difficult
     - Same situation when comparing
       - BSP vs. PRAM
       - PRAM vs. von-Neumann (and parallelization)

     Problems with BSP
     - Algorithms need to be split in global phases
       - Computation
       - Communication
       - Synchronization
     - In many algorithms computation and communication not balanced over processors
     - On almost all machines
       - Different times g for local and global communication in a P-relation compared to a 1-relation
     - Synchronization
       - Not necessary when knowing all data dependencies,
       - Otherwise, only locally necessary

     Example Prefix Sums (revisited)
     [Figure: axes time vs. processors; steps i=1, i=2, i=4, i=8]

     Prefix Sums Data Dependencies
     [Figure: axes time vs. processors]

     Prefix Sums Task Graph
     [Figure]
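The remark that synchronization is only locally necessary can be illustrated by writing down the data dependencies of Plan A explicitly. The figures on these slides are not recoverable, so this is a sketch of one plausible reading rather than a reconstruction: a task is a pair `(step, p)`, step 0 is the initialization, and the function name `plan_a_dependencies` is hypothetical.

```python
def plan_a_dependencies(n):
    """Map each Plan A task (step, p) to the tasks of the previous step it reads."""
    deps = {}
    step, i = 1, 1
    while i < n:
        for p in range(n):
            # Every task needs the previous value held by its own processor.
            preds = [(step - 1, p)]
            # Processors p >= i also need the value sent by processor p - i.
            if p >= i:
                preds.append((step - 1, p - i))
            deps[(step, p)] = preds
        step += 1
        i *= 2
    return deps
```

Each task depends on at most two tasks of the previous step, so a processor would only have to wait for one partner instead of a global barrier, which is the kind of slack a task-graph formulation can exploit.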
