Foundations of Parallel Algorithms: Practical PRAM

Foundations of Parallel Algorithms: Practical PRAM - PowerPoint PPT Presentation

DF21500 Multicore Computing. Lecture on foundations of parallel algorithms. C. Kessler, IDA, Linköpings Universitet, 2011.


  1. Literature

Foundations of parallel algorithms:

[PPP] Keller, Kessler, Träff: Practical PRAM Programming. Wiley Interscience, New York, 2000. Chapter 2.
[JaJa] JaJa: An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
[CLR] Cormen, Leiserson, Rivest: Introduction to Algorithms. Chapter 30. MIT Press, 1989.
[JA] Jordan, Alaghband: Fundamentals of Parallel Processing. Prentice Hall, 2003.

Survey article (see course homepage):
C. Kessler, J. Keller: Models for Parallel Computing – Review and Perspectives. PARS-Mitteilungen 24, Gesellschaft für Informatik, Dec. 2007, ISSN 0177-0454.

Lecture overview:
- PRAM model; time, work, cost
- Self-simulation and Brent's Theorem
- Speedup and Amdahl's Law; NC
- Scalability and Gustafsson's Law
- Fundamental PRAM algorithms: reduction, parallel prefix, list ranking
- PRAM variants, simulation results and separation theorems
- Survey of other models of parallel computation: Asynchronous PRAM, Delay model, BSP, LogP, LogGP

Parallel computation models (1)

A parallel computation model
+ abstracts from hardware and technology (generalization),
+ specifies the basic operations, where applicable,
+ specifies how data can be stored,
→ and thereby focuses on the most characteristic features (w.r.t. their influence on time/space complexity) of a broader class of parallel machines.

It comprises:
- a programming model: shared memory vs. message passing; degree of synchronous execution;
- a cost model: key parameters; cost functions for the basic operations; constraints.

Parallel computation models (2)

A cost model should
+ explain available observations,
+ predict future behaviour,
+ abstract from unimportant details,
→ so that algorithms can be analyzed before implementation, independent of a particular parallel computer.

Simplifications to reduce model complexity:
- use an idealized machine model,
- ignore hardware details: memory hierarchies, network topology, ...
- use asymptotic analysis: drop insignificant effects,
- use empirical studies: calibrate parameters, evaluate the model.
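To make "cost functions for the basic operations" concrete, here is a minimal illustration (mine, not from the slides): the classic latency-bandwidth cost function t(m) = alpha + beta*m for transferring a message of m words, whose two parameters would be calibrated from measurements, as suggested under "use empirical studies". The function name msg_cost and the constant values are hypothetical.

  /* Illustrative sketch (not from the lecture): a latency-bandwidth
     cost function t(m) = alpha + beta*m for a message of m words. */
  #include <stdio.h>

  double msg_cost(double alpha, double beta, long m)
  {
      return alpha + beta * (double)m;   /* startup cost + per-word cost */
  }

  int main(void)
  {
      /* hypothetical parameter values, as fitted from ping-pong benchmarks */
      double alpha = 1.0e-6, beta = 2.0e-9;
      printf("t(1024 words) = %g s\n", msg_cost(alpha, beta, 1024));
      return 0;
  }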

  2. Flashback to DALG, Lecture 1: The RAM model

RAM (Random Access Machine) [PPP 2.1]: programming and cost model for the analysis of sequential algorithms. A CPU (ALU, registers, program counter PC, current-instruction register) executes one instruction per clock cycle, operating via load and store on a program memory and a data memory M[0], M[1], ...

The RAM model (2)

Algorithm analysis by counting instructions. Example: computing the global sum of N elements:

  s = d(0)
  do i = 1, N-1
    s = s + d(i)
  end do

  t = t_load + t_store + (N - 1) · (2 t_load + t_add + t_store + t_branch) = 5N − 3 ∈ Θ(N)

The data dependences of this loop form a linear chain of additions over d[0], ..., d[N-1]; reassociating the sum into a balanced binary tree leads to the arithmetic circuit model / directed acyclic graph (DAG) model.

[Figure: linear addition chain vs. balanced addition tree over d[0] ... d[7].]

PRAM model [PPP 2.2]

Parallel Random Access Machine [Fortune/Wyllie'78]:
- p processors P0, ..., P(p-1), MIMD, with a common clock signal
- arithmetic operation / jump: 1 clock cycle
- shared memory with uniform memory access time; latency: 1 clock cycle (!)
- concurrent memory accesses; sequential consistency
- private memory (optional), processor-local access only

PRAM model: Variants for memory access conflict resolution

- Exclusive Read, Exclusive Write (EREW) PRAM: concurrent access only to different locations in the same cycle.
- Concurrent Read, Exclusive Write (CREW) PRAM: simultaneous reading from, or a single write to, the same location is possible.
- Concurrent Read, Concurrent Write (CRCW) PRAM: simultaneous reading from or writing to the same location is possible. Write-conflict resolution variants: Weak, Common, Arbitrary, Priority, Combining (global sum, max, etc.).

[Figure: in one cycle t, five processors access the same location a, executing *a=0; *a=1; nop; *a=0; *a=2; the CRCW variants differ in which value a holds afterwards.]

There is no need for an ERCW PRAM.
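The slide's pseudocode, transcribed to C (my transcription); the comments map each statement to the instruction classes counted in the formula above, giving 2 + 5(N−1) = 5N − 3 unit-cost instructions in total.

  #include <stdio.h>

  /* sequential global sum on the RAM model; each commented operation
     costs one clock cycle under the slide's cost model */
  int global_sum(const int d[], int N)
  {
      int s = d[0];                 /* 1 load + 1 store */
      for (int i = 1; i < N; i++)   /* 1 branch per iteration */
          s = s + d[i];             /* 2 loads + 1 add + 1 store per iteration */
      return s;
  }

  int main(void)
  {
      int d[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
      printf("%d\n", global_sum(d, 8));   /* prints 36 */
      return 0;
  }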

  3. Global sum computation on EREW and Combining-CRCW PRAM (1)

Given n numbers x_0, x_1, ..., x_{n-1} stored in an array, the global sum ∑_{i=0}^{n-1} x_i can be computed in ⌈log2 n⌉ time steps on an EREW PRAM with n processors.

Parallel algorithmic paradigm used: parallel divide-and-conquer.

  ParSum(n):   ParSum(n/2)   ParSum(n/2)   +

- Divide phase: trivial, time O(1).
- Recursive calls: parallel time T(n/2); the base case is a load operation, time O(1).
- Combine phase: one addition, time O(1).

Hence T(n) = T(n/2) + O(1), and thus T(n) ∈ O(log n). Use induction or the master theorem [CLR 4].

[Figure: balanced binary addition tree over d[0] ... d[7].]
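Unrolling the recurrence makes the bound explicit (a standard step, spelled out here with a constant c standing for the O(1) term):

  T(n) = T(n/2) + c
       = T(n/4) + 2c
       = ...
       = T(1) + c · log2 n  ∈  O(log n),

since the problem size halves at each of the ⌈log2 n⌉ recursion levels and each level contributes constant time.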
Global sum computation on EREW and Combining-CRCW PRAM (2)

Recursive parallel sum program in the PRAM programming language Fork [PPP]:

  sync int parsum( sh int *d, sh int n )
  {
    sh int s1, s2;
    sh int nd2 = n / 2;
    if (n==1) return d[0];     // base case
    $ = rerank();              // re-rank processors within group
    if ($ < nd2)               // split processor group:
      s1 = parsum( d, nd2 );
    else
      s2 = parsum( &(d[nd2]), n-nd2 );
    return s1 + s2;
  }

[Trace of an 8-processor Fork95 run (trv), traced period 6 msecs: 434 sh-loads, 344 sh-stores, 78 mpadd; each of P0 ... P7 passes 7 barriers and spends about 14-15% of its time spinning on barriers, 0% on locks.]

Global sum computation on EREW and Combining-CRCW PRAM (3)

Iterative parallel sum program in Fork:

  int sum( sh int a[], sh int n )
  {
    int d, dd;
    int ID = rerank();
    d = 1;
    while (d < n) {
      dd = d;  d = d * 2;
      if (ID % d == 0)
        a[ID] = a[ID] + a[ID+dd];
    }
    return a[0];               // the total accumulates in a[0]
  }

[Figure: pairwise additions over a(1) ... a(8); in each round, half of the remaining processors fall idle.]

PRAM model: CRCW is stronger than CREW

Example: computing the logical OR of p bits.

CREW: time O(log p), by combining the bits pairwise in a balanced binary OR tree.

[Figure: OR tree over the bits 0 1 0 1 0 0 0 1, reduced over log2 p = 3 levels to the result 1.]

CRCW: time O(1):

  sh int a = 0;
  if (mybit == 1) a = 1;       // else do nothing

All writers store the same value 1 (at some step t the processors execute, e.g., nop; *a=1; nop; *a=1; *a=1;), so this works even on a Common CRCW PRAM. Useful e.g. for termination detection.

On a Combining CRCW PRAM with addition as the combining operation, the global sum problem can be solved in a constant number of time steps using n processors:

  syncadd( &s, a[ID] );        // procs ranked ID in 0...n-1
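For readers without a Fork compiler, here is a rough C/OpenMP analogue of the iterative program above (my sketch, not from the lecture): each OpenMP thread plays one PRAM processor, and the barrier stands in for the PRAM's common clock. Real threads are asynchronous and memory access is not unit-time, so this mimics only the algorithm, not the cost model; n is assumed to be a power of two small enough that n threads can actually run.

  #include <stdio.h>
  #include <omp.h>

  /* logarithmic-time tree summation, one thread per array element;
     after the round with step d, a[ID] holds the sum of a[ID .. ID+d-1]
     for every ID with ID % d == 0 */
  int tree_sum(int a[], int n)    /* n assumed a power of two */
  {
      #pragma omp parallel num_threads(n)
      {
          int ID = omp_get_thread_num();
          for (int d = 2; d <= n; d *= 2) {
              if (ID % d == 0)
                  a[ID] += a[ID + d/2];
              #pragma omp barrier   /* emulate the PRAM's lock-step rounds */
          }
      }
      return a[0];                  /* total accumulates in a[0] */
  }

  int main(void)
  {
      int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
      printf("%d\n", tree_sum(a, 8));   /* prints 36 */
      return 0;
  }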
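The two constant-time CRCW idioms above can likewise be mimicked with C11 atomics (again my sketch; or_flag and global_sum are made-up names). The relaxed store mirrors the Common-CRCW OR, where every writer stores the same value 1; atomic_fetch_add plays the role of Fork's syncadd, though real hardware serializes the updates, so the O(1) bound holds only on the idealized Combining CRCW PRAM.

  #include <stdatomic.h>
  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      int bits[8] = { 0, 1, 0, 1, 0, 0, 0, 1 };
      atomic_int or_flag = 0;      /* Common CRCW: all writers store the same 1 */
      atomic_int global_sum = 0;   /* Combining CRCW: concurrent writes are summed */

      #pragma omp parallel num_threads(8)
      {
          int ID = omp_get_thread_num();
          if (bits[ID] == 1)       /* else do nothing */
              atomic_store_explicit(&or_flag, 1, memory_order_relaxed);
          atomic_fetch_add(&global_sum, bits[ID]);   /* analogue of syncadd */
      }
      printf("OR = %d, sum = %d\n",
             atomic_load(&or_flag), atomic_load(&global_sum));  /* OR = 1, sum = 3 */
      return 0;
  }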
