

  1. Loops and Parallelism (CO444H) Ben Livshits

  2. Why Parallelism? • One way to speed up a computation is to use parallelism. • Unfortunately, it is not easy to develop software that can take advantage of parallel machines. • Dividing the computation into units that can execute on different processors in parallel is already hard enough; yet that by itself does not guarantee a speedup. • We must also minimize inter-processor communication, because communication overhead can easily make the parallel code run even slower than the sequential execution!

  3. Maximizing Data Locality • Minimizing communication can be thought of as a special case of improving a program's data locality. In general, we say that a program has good data locality if a processor often accesses the same data it has used recently. • Clearly, if a processor on a parallel machine has good locality, it does not need to communicate with other processors frequently. Thus, parallelism and data locality need to be considered hand-in-hand. Data locality, by itself, is also important for the performance of individual processors. Why? • Modern processors have one or more levels of cache in the memory hierarchy; a memory access can take tens of machine cycles, whereas a cache hit takes only a few cycles. If a program does not have good data locality and misses in the cache often, its performance will suffer.

  4. Agenda • Introduction • Single Loop • Nested Loops • Data Dependence Analysis (Based on slides taken from Wei Li)

  5. Motivation: Better Parallelism • DOALL loops: loops whose iterations can execute in parallel, e.g. for i = 11, 20: a[i] = a[i] + 3 • A new abstraction is needed • The abstraction used in data flow analysis is inadequate • Information from all instances of a statement, for multiple indices, is combined
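  To make the DOALL idea concrete, here is a minimal C# sketch (my own illustration, not from the slides) that runs the loop above with TPL's Parallel.For; since each iteration touches only its own a[i], the iterations can safely be handed to different threads:

      using System;
      using System.Threading.Tasks;

      class DoallExample
      {
          static void Main()
          {
              int[] a = new int[21];

              // Each iteration reads and writes only a[i], so the iterations are
              // independent (a DOALL loop) and TPL may run them on different threads.
              Parallel.For(11, 21, i =>   // Parallel.For's upper bound is exclusive
              {
                  a[i] = a[i] + 3;
              });

              Console.WriteLine(a[20]);   // prints 3
          }
      }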

  6. Focus on Affine Array Accesses • For example, if i and j are the index variables of surrounding loops, then Z[i][j] and Z[i][i + j] are affine accesses. • A function of one or more variables i1, i2, ..., in is affine if it can be expressed as a sum of a constant plus constant multiples of the variables • i.e., c0 + c1*i1 + c2*i2 + ... + cn*in, where c0, c1, ..., cn are constants. • Affine functions are commonly referred to as linear functions, although strictly speaking, linear functions do not have the c0 term.
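  For contrast, a small fragment (my own, not on the slide) showing subscripts a dependence analyzer can and cannot treat as affine functions of the loop indexes i and j; Z and b are placeholder arrays:

      static void AffineExamples()
      {
          int n = 8;
          int[,] Z = new int[n, 2 * n];
          int[] b  = new int[n];

          for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++)
              {
                  Z[i, j]     = 1;     // affine: subscripts are the index variables themselves
                  Z[i, i + j] = 2;     // affine: i + j is a sum with constant coefficients
                  // Z[i, i * j] = 3;  // NOT affine: product of two index variables
                  // Z[i, b[j]]  = 4;  // NOT affine: subscript depends on array contents
              }
      }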

  7. Try to Minimize Inter-processor Communication • Processors on a symmetric multiprocessor share the same address space. To communicate, a processor can simply write to a memory location, which is then read by any other processor. • Symmetric multiprocessors use a coherent cache protocol to hide the presence of caches from the programmer. • When a processor wishes to write to a cache line, copies from all other caches are removed. When a processor requests data not found in its cache, the request goes out on the shared bus, and the data will be fetched either from memory or from the cache of another processor.

  8. Memory Access Costs • The time taken for one processor to communicate with another is about twice the cost of a memory access: the data, in units of cache lines, must first be written from the first processor's cache to memory, and then fetched from memory into the cache of the second processor. • You may think that interprocessor communication is therefore relatively cheap, since it is only about twice as slow as a memory access. However, memory accesses are very expensive when compared to cache hits: they can be a hundred times slower. • This analysis brings home the similarity between efficient parallelization and locality analysis: for a processor to perform well, either on its own or in the context of a multiprocessor, it must find most of the data it operates on in its cache.

  9. Application-level Parallelism • Loops are a great target to parallelize. What properties of loops are we looking for in terms of better parallelism? • We use two high-level metrics to estimate how well a parallel application will perform: • parallelism coverage, which is the percentage of the computation that runs in parallel, and • granularity of parallelism, which is the amount of computation that each processor can execute without synchronizing or communicating with others

  10. Other Examples of Coarse-Grained Parallelism • How about map-reduce computations? • What about a web server with a back-end database? • Analyzing astronomical data from telescopes, etc.? • Running simulations with multiple parameters?

  11. TPL: Task Parallel Library
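  The code shown on this slide did not survive the transcript; as a stand-in, here is a minimal, generic TPL example (my assumption about the flavor of code the slide showed), using Parallel.Invoke to run independent work items in parallel:

      using System;
      using System.Threading.Tasks;

      class TplIntro
      {
          static void Main()
          {
              // Run three independent pieces of work in parallel and wait for all of them.
              Parallel.Invoke(
                  () => Console.WriteLine("task A"),
                  () => Console.WriteLine("task B"),
                  () => Console.WriteLine("task C"));
          }
      }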

  12. More Elaborate TPL Example • When a ForEach<TSource> loop executes, it divides its source collection into multiple partitions • Each partition gets its own copy of the "thread-local" variable • https://msdn.microsoft.com/en-us/library/dd997393(v=vs.110).aspx
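  A minimal sketch of the pattern that MSDN page describes (my own summing example, not necessarily the slide's exact code): each partition starts with its own thread-local subtotal, accumulates it without locking, and merges it into the shared total exactly once in the final delegate:

      using System;
      using System.Linq;
      using System.Threading;
      using System.Threading.Tasks;

      class ThreadLocalForEach
      {
          static void Main()
          {
              int[] nums = Enumerable.Range(0, 1_000_000).ToArray();
              long total = 0;

              Parallel.ForEach(
                  nums,                        // source collection, split into partitions
                  () => 0L,                    // localInit: one subtotal per partition
                  (n, loopState, subtotal) =>  // body: runs lock-free on the partition's own copy
                      subtotal + n,
                  subtotal =>                  // localFinally: merge each partition's subtotal once
                      Interlocked.Add(ref total, subtotal));

              Console.WriteLine(total);
          }
      }

  The per-partition copy is what keeps the hot loop free of synchronization; only the final merge needs Interlocked.Add.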

  13. Automatic Parallelism • With TPL, we saw some examples of developer-controlled parallelism • The developer has to plan ahead and parallelize their code carefully • They can make mistakes, some of which lead to incorrect results and others to crashes • Some of these bugs may only show up on machines with a large number of processors, because execution schedules become more complex

  14. Matrices: Layout Considerations • Suppose Z is stored in row-major order • We can traverse it column-by-column • Or row-by-row, which matches the layout • Or we can parallelize the outer loop • With M processors, b is the block of rows given to each processor • This is the code for the p-th processor (a sketch appears below)
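  The slide's code did not survive the transcript; the following is a sketch of the usual scheme it describes, assuming an n x n matrix Z, M processors, block size b = n/M, and that M divides n (Z, b, M, and p come from the slide, the rest is my assumption):

      // Sketch: the work executed by processor p (0 <= p < M).
      // Each processor updates its own block of b consecutive rows,
      // so a row-major traversal of Z stays cache-friendly.
      static void ProcessorP(double[,] Z, int n, int M, int p)
      {
          int b  = n / M;                        // rows per processor (assume M divides n)
          int lo = p * b;
          int hi = lo + b;

          for (int i = lo; i < hi; i++)          // outer loop: only this processor's rows
              for (int j = 0; j < n; j++)        // inner loop: walk along the row (row-major)
                  Z[i, j] = Z[i, j] + 1.0;       // placeholder update for illustration
      }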

  15. Examples • Parallel? for i = 11, 20: a[i] = a[i] + 3 • Parallel? for i = 11, 20: a[i] = a[i-1] + 3

  16. Examples • Parallel: for i = 11, 20: a[i] = a[i] + 3 • Not parallel: for i = 11, 20: a[i] = a[i-1] + 3 • Parallel? for i = 11, 20: a[i] = a[i-10] + 3
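  A quick check of the third loop (the slide leaves the question open): it writes a[11], ..., a[20] and reads a[1], ..., a[10]. The read set and the write set are disjoint, so no two iterations ever touch the same element, there is no loop-carried dependence, and the loop is parallel.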

  17. Single Loops

  18. Data Dependence of Scalar Variables • True dependence: a = ... followed by ... = a (write, then read) • Anti-dependence: ... = a followed by a = ... (read, then write) • Output dependence: a = ... followed by a = ... (write, then write) • Input dependence: ... = a followed by ... = a (read, then read)
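  The same four kinds on concrete scalar code (an illustration of mine, not from the slide):

      static void DependenceKinds()
      {
          int x = 1, y = 2, a = 0;

          // True (flow) dependence: write a, then read a.
          a = x + 1;
          y = a * 2;

          // Anti-dependence: read a, then write a.
          x = a + y;
          a = 7;

          // Output dependence: write a, then write a again.
          a = x;
          a = y;

          // Input dependence: two reads of a; this one does not restrict reordering.
          x = a + 1;
          y = a + 2;
      }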

  19. Array Accesses in a Loop • for i = 2, 5: a[i] = a[i] + 3 • Reads, in iteration order i = 2..5: a[2], a[3], a[4], a[5] • Writes, in iteration order i = 2..5: a[2], a[3], a[4], a[5]

  20. Array True-dependence • for i = 2, 5: a[i] = a[i-2] + 3 • Reads, in iteration order i = 2..5: a[0], a[1], a[2], a[3] • Writes, in iteration order i = 2..5: a[2], a[3], a[4], a[5] • a[2] and a[3] are written by earlier iterations and read by later ones: a true dependence

  21. Array Anti-dependence • for i = 2, 5: a[i-2] = a[i] + 3 • Reads, in iteration order i = 2..5: a[2], a[3], a[4], a[5] • Writes, in iteration order i = 2..5: a[0], a[1], a[2], a[3] • a[2] and a[3] are read by earlier iterations and written by later ones: an anti-dependence

  22. Dynamic Data Dependence • Let o and o’ be two (dynamic) operations • Data dependence exists from o to o’, iff • either o or o’ is a write operation • o and o’ may refer to the same location • o executes before o’

  23. Static Data Dependence • Let a and a’ be two static array accesses (not necessarily distinct) • Data dependence exists from a to a’, iff • either a or a’ is a write operation • There exists a dynamic instance of a (o) and a dynamic instance of a’ (o’) such that • o and o’ may refer to the same location • o executes before o’

  24. Recognizing DOALL Loops • Find the data dependences in the loop • Definition: a dependence is loop-carried if it crosses an iteration boundary • If there are no loop-carried dependences, then the loop is parallelizable
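  One illustration of the distinction (my example, not the slide's): the loop below contains a true dependence from the first statement to the second, but the write and the read of a[i] happen in the same iteration, so the dependence is not loop-carried and the loop is still a DOALL loop.

      using System.Threading.Tasks;

      class NotLoopCarried
      {
          static void Main()
          {
              int n = 100;
              double[] a = new double[n];
              double[] b = new double[n];
              double[] c = new double[n];

              Parallel.For(0, n, i =>
              {
                  a[i] = b[i] + 1.0;   // write a[i]
                  c[i] = a[i] * 2.0;   // read the a[i] written in the SAME iteration:
                                       // a true dependence, but not loop-carried
              });
          }
      }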

  25. Compute Dependences • for i = 2, 5: a[i-2] = a[i] + 3 • There is a dependence between a[i] and a[i-2] if there exist two iterations i_r and i_w within the loop bounds such that iteration i_r reads and iteration i_w writes the same array element • That is, there exist i_r, i_w with 2 ≤ i_r, i_w ≤ 5 and i_r = i_w - 2

  26. Compute Dependences • for i = 2, 5: a[i-2] = a[i] + 3 • There is a dependence between a[i-2] and a[i-2] (the write with itself) if there exist two iterations i_v and i_w within the loop bounds such that both write the same array element • That is, there exist i_v, i_w with 2 ≤ i_v, i_w ≤ 5 and i_v - 2 = i_w - 2

  27. Parallelization • for i = 2, 5: a[i-2] = a[i] + 3 • Is there a loop-carried dependence between a[i] and a[i-2]? • Is there a loop-carried dependence between a[i-2] and a[i-2]?
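  Working out the constraints from the previous two slides (the answers are my own, following the slides' setup): for a[i] and a[i-2], i_r = i_w - 2 has solutions within the bounds, namely (i_r, i_w) = (2, 4) and (3, 5); since i_r ≠ i_w, the read and the write of the same element fall in different iterations, so the dependence is loop-carried and the loop cannot be run as a DOALL loop as written. For a[i-2] against itself, i_v - 2 = i_w - 2 forces i_v = i_w, so two writes can only coincide within a single iteration, and there is no loop-carried output dependence.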

  28. Nested Loops

  29. Iteration Spaces • The iteration space is the set of the dynamic execution instances in a computation, that is, the set of combinations of values taken on by the loop indexes. • The data space is the set of array elements accessed. • The processor space is the set of processors in the system. Normally, these processors are assigned integer numbers or vectors of integers to distinguish among them.

  30. Iteration Spaces Illustrated

  31. Nested Loops • Which loop(s) are parallel? • for i1 = 0, 5: for i2 = 0, 3: a[i1,i2] = a[i1-2,i2-1] + 3

  32. Iteration Space • An abstraction for loops: for i1 = 0, 5: for i2 = 0, 3: a[i1,i2] = 3 • Each iteration is represented as a point (its coordinates) in the iteration space, whose axes are i1 and i2

  33. Execution Order • Sequential execution order of iterations: lexicographic order [0,0], [0,1], ..., [0,3], [1,0], [1,1], ..., [1,3], [2,0], ... • Let I = (i_1, i_2, ..., i_n). I is lexicographically less than I', written I < I', iff there exists k such that (i_1, ..., i_{k-1}) = (i'_1, ..., i'_{k-1}) and i_k < i'_k

  34. Parallelism for Nested Loops • Is there a data dependence between a[i1,i2] and a[i1-2,i2-1]? • There is one iff there exist i1_r, i2_r, i1_w, i2_w such that • 0 ≤ i1_r, i1_w ≤ 5 • 0 ≤ i2_r, i2_w ≤ 3 • i1_r - 2 = i1_w • i2_r - 1 = i2_w
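  Solving these constraints (my working, not shown on the slide): any i1_r in [2,5] together with i2_r in [1,3] yields a solution; for example, the write to a[0,0] at iteration (0,0) is read at iteration (2,1). Every such dependence has distance (2,1) in the iteration space, so it is carried by the outer i1 loop, and within a fixed i1 the i2 iterations are independent. Hence the inner loop can run in parallel while the outer loop cannot.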
