ECE 1747H: Parallel Programming (Lecture 1: Overview)


1. ECE 1747H: Parallel Programming
   Lecture 1: Overview
   • Meeting time: Mon 4-6 PM
   • Meeting place: BA 4164
   • Instructor: Cristiana Amza, http://www.eecg.toronto.edu/~amza
     amza@eecg.toronto.edu, office Pratt 484E

   Material
   • Course notes
   • Web material (e.g., published papers)
   • No required textbook, some recommended

   Prerequisites
   • Programming in C or C++
   • Data structures
   • Basics of machine architecture
   • Basics of network programming
   • Please send e-mail to eugenia@eecg to get an eecg account (name, stuid, class, instructor)

2. Other than that
   • No written homeworks, no exams
   • 10% for each small programming assignment (expect 1-2)
   • 10% class participation
   • Rest comes from the major course project

   Programming Project
   • Parallelizing a sequential program, or improving the performance or the functionality of a parallel program
   • Project proposal and final report
   • In-class project proposal and final report presentation
   • “Sample” project presentation posted

   Parallelism (1 of 2)
   • Ability to execute different parts of a single program concurrently on different machines
   • Goal: shorter running time
   • Grain of parallelism: how big are the parts?
   • Can be an instruction, statement, procedure, …

   Parallelism (2 of 2)
   • Coarse-grain parallelism is mainly applicable to long-running, scientific programs
   • Examples: weather prediction, prime number factorization, simulations, …
   • Will mainly focus on relatively coarse grain

3. Lecture material (1 of 4)
   • Parallelism
     – What is parallelism?
     – What can be parallelized?
     – Inhibitors of parallelism: dependences

   Lecture material (2 of 4)
   • Standard models of parallelism
     – shared memory (Pthreads) (see the sketch below)
     – message passing (MPI)
     – shared memory + data parallelism (OpenMP)
   • Classes of applications
     – scientific
     – servers

   Lecture material (3 of 4)
   • Transaction processing
     – classic programming model for databases
     – now being proposed for scientific programs

   Lecture material (4 of 4)
   • Perf. of parallel & distributed programs
     – architecture-independent optimization
     – architecture-dependent optimization
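A minimal, hedged sketch of the shared-memory (Pthreads) model listed above; it is an illustration, not course material. Two threads increment a shared counter, with a mutex protecting the shared update:

    /* Shared-memory sketch: two threads update one counter under a mutex. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;  /* unused */
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);    /* protect the shared write */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* expect 2000 */
        return 0;
    }

The MPI and OpenMP models listed alongside express similar concurrency with different primitives.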

4. Course Organization
   • First month of semester:
     – lectures on parallelism, patterns, models
     – small programming assignments, done individually
   • Rest of the semester:
     – major programming project, done individually or in small group
     – research paper discussions

   Parallel vs. Distributed Programming
   Parallel programming has matured:
   • Few standard programming models
   • Few common machine architectures
   • Portability between models and architectures

   Bottom Line
   • Programmer can now focus on program and use suitable programming model
   • Reasonable hope of portability
   • Problem: much performance optimization is still platform-dependent
     – Performance portability is a problem

   ECE 1747H: Parallel Programming
   Lecture 1-2: Parallelism, Dependences

5. Parallelism
   • Ability to execute different parts of a program concurrently on different machines
   • Goal: shorten execution time

   Measures of Performance
   • To computer scientists: speedup, execution time.
   • To applications people: size of problem, accuracy of solution, etc.

   Speedup of Algorithm
   • Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set).
   [Figure: speedup plotted against the number of processors p]

   Speedup on Problem
   • Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors.
   • A more honest measure of performance.
   • Avoids picking an easily parallelizable algorithm with poor sequential execution time.
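To make the two definitions concrete, here is a small illustrative computation (the timings are invented, not course data): the same parallel time gives a larger "speedup of algorithm" than "speedup on problem" whenever the parallelized algorithm is slower sequentially than the best known sequential algorithm.

    /* Sketch: "speedup of algorithm" vs. "speedup on problem" from example
     * timings.  All numbers are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double t_seq_same_alg = 80.0;  /* sequential time of the parallelized algorithm */
        double t_seq_best     = 50.0;  /* time of the best known sequential algorithm   */
        double t_parallel     = 10.0;  /* time on p processors                          */

        printf("speedup of algorithm: %.1f\n", t_seq_same_alg / t_parallel); /* 8.0 */
        printf("speedup on problem:   %.1f\n", t_seq_best / t_parallel);     /* 5.0 */
        return 0;
    }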

6. What Speedups Can You Get?
   • Linear speedup
     – Confusing term: implicitly means a 1-to-1 speedup per processor.
     – (almost always) as good as you can do.
   • Sub-linear speedup: more normal, due to overhead of startup, synchronization, communication, etc.

   Speedup
   [Figure: linear vs. actual speedup as a function of the number of processors p]

   Scalability
   • No really precise definition.
   • Roughly speaking, a program is said to scale to a certain number of processors p if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5). (See the sketch below.)

   Super-linear Speedup?
   • Due to cache/memory effects:
     – Subparts fit into the cache/memory of each node.
     – Whole problem does not fit in the cache/memory of a single node.
   • Nondeterminism in search problems.
     – One thread finds a near-optimal solution very quickly => leads to drastic pruning of the search space.
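A sketch of the rough scalability test described above, using the slide's example threshold of 0.5; the speedup values are invented for illustration.

    /* Sketch: the program "scales" to p processors if
     * speedup(p) - speedup(p-1) >= threshold. */
    #include <stdio.h>

    int scales_to(const double *speedup, int p, double threshold)
    {
        /* speedup[i] = measured speedup on i processors, valid for i in [1, p] */
        return (speedup[p] - speedup[p - 1]) >= threshold;
    }

    int main(void)
    {
        double speedup[] = { 0.0, 1.0, 1.9, 2.7, 3.0 };  /* illustrative numbers */
        printf("scales to 3: %d\n", scales_to(speedup, 3, 0.5));  /* 1 (2.7 - 1.9 = 0.8) */
        printf("scales to 4: %d\n", scales_to(speedup, 4, 0.5));  /* 0 (3.0 - 2.7 = 0.3) */
        return 0;
    }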

7. Cardinal Performance Rule
   • Don’t leave (too) much of your code sequential!

   Amdahl’s Law
   • If 1/s of the program is sequential, then you can never get a speedup better than s. (See the sketch below.)
     – (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1
     – Best parallel execution time on p processors = 1/s + (1 - 1/s)/p
     – When p goes to infinity, parallel execution time = 1/s
     – Speedup = s.

   Why keep something sequential?
   • Some parts of the program are not parallelizable (because of dependences)
   • Some parts may be parallelizable, but the overhead dwarfs the increased speedup.

   When can two statements execute in parallel?
   • On one processor:
       statement 1;
       statement 2;
   • On two processors:
       processor1:    processor2:
       statement1;    statement2;
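A small sketch of the Amdahl's Law bound in the slide's notation (sequential fraction 1/s, normalized sequential time of 1); the numbers are illustrative only.

    /* Sketch: Amdahl's Law with a sequential fraction of 1/s.
     * The speedup approaches s as p grows but never exceeds it. */
    #include <stdio.h>

    double amdahl_speedup(double s, double p)
    {
        double seq_frac = 1.0 / s;                      /* sequential part            */
        double t_par = seq_frac + (1.0 - seq_frac) / p; /* best time on p processors  */
        return 1.0 / t_par;                             /* sequential time is 1       */
    }

    int main(void)
    {
        /* 10% sequential (s = 10): bounded by a speedup of 10 */
        printf("p=10:   %.2f\n", amdahl_speedup(10.0, 10.0));    /* ~5.26 */
        printf("p=1000: %.2f\n", amdahl_speedup(10.0, 1000.0));  /* ~9.91 */
        return 0;
    }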

8. Fundamental Assumption
   • Processors execute independently: no control over order of execution between processors

   When can 2 statements execute in parallel?
   • Possibility 1
       Processor1:    Processor2:
       statement1;    statement2;
   • Possibility 2
       Processor1:    Processor2:
       statement2;    statement1;

   When can 2 statements execute in parallel?
   • Their order of execution must not matter!
   • In other words,
       statement1;
       statement2;
     must be equivalent to
       statement2;
       statement1;

   Example 1
       a = 1;
       b = 2;
   • Statements can be executed in parallel.
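A hedged Pthreads sketch of Example 1 (Pthreads stands in for "two processors"; it is not part of the slides): because the two assignments touch different variables, both possible orders give the same result.

    /* Sketch of Example 1: independent statements on two threads. */
    #include <pthread.h>
    #include <stdio.h>

    static int a, b;

    static void *set_a(void *arg) { (void)arg; a = 1; return NULL; }  /* statement 1 */
    static void *set_b(void *arg) { (void)arg; b = 2; return NULL; }  /* statement 2 */

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, set_a, NULL);
        pthread_create(&t2, NULL, set_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%d b=%d\n", a, b);  /* always 1 and 2, regardless of order */
        return 0;
    }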

9. Example 2
       a = 1;
       b = a;
   • Statements cannot be executed in parallel
   • Program modifications may make it possible.

   Example 3
       a = f(x);
       b = a;
   • May not be wise to change the program (sequential execution would take longer).

   Example 5
       a = 1;
       a = 2;
   • Statements cannot be executed in parallel.

   True dependence
   Statements S1, S2.
   S2 has a true dependence on S1
   iff
   S2 reads a value written by S1.
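A sketch of why Example 2 cannot run in parallel: b = a reads the value written by a = 1 (a true dependence), so splitting the statements across threads races on a and the result depends on timing. Pthreads is used only for illustration; strictly this is a data race, shown here only to expose the dependence.

    /* Sketch of Example 2's true dependence broken by parallel execution. */
    #include <pthread.h>
    #include <stdio.h>

    static int a = 0, b = 0;

    static void *s1(void *arg) { (void)arg; a = 1; return NULL; }  /* S1 writes a          */
    static void *s2(void *arg) { (void)arg; b = a; return NULL; }  /* S2 reads a (true dep) */

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, s1, NULL);
        pthread_create(&t2, NULL, s2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("b = %d\n", b);  /* may print 0 or 1 depending on timing */
        return 0;
    }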

10. Anti-dependence
    Statements S1, S2.
    S2 has an anti-dependence on S1
    iff
    S2 writes a value read by S1.

    Output Dependence
    Statements S1, S2.
    S2 has an output dependence on S1
    iff
    S2 writes a variable written by S1.

    When can 2 statements execute in parallel?
    S1 and S2 can execute in parallel
    iff
    there are no dependences between S1 and S2
    – true dependences
    – anti-dependences
    – output dependences
    • Some dependences can be removed.

    Example 6
    • Most parallelism occurs in loops.
        for(i=0; i<100; i++)
          a[i] = i;
    • No dependences.
    • Iterations can be executed in parallel.
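A sketch of Example 6 using OpenMP (one of the models named earlier): every iteration writes a different element and there are no true, anti-, or output dependences across iterations, so the loop can be marked parallel. Compile with an OpenMP flag such as -fopenmp.

    /* Sketch of Example 6: a dependence-free loop parallelized with OpenMP. */
    #include <stdio.h>

    int main(void)
    {
        int a[100];

        #pragma omp parallel for          /* iterations distributed over threads */
        for (int i = 0; i < 100; i++)
            a[i] = i;

        printf("a[99] = %d\n", a[99]);    /* 99, same as the sequential loop */
        return 0;
    }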

11. Example 7
        for(i=0; i<100; i++) {
          a[i] = i;
          b[i] = 2*i;
        }
    • Iterations and statements can be executed in parallel.

    Example 8
        for(i=0; i<100; i++)
          a[i] = i;
        for(i=0; i<100; i++)
          b[i] = 2*i;
    • Iterations and loops can be executed in parallel.

    Example 9
        for(i=0; i<100; i++)
          a[i] = a[i] + 100;
    • There is a dependence … on itself!
    • Loop is still parallelizable.

    Example 10
        for( i=0; i<100; i++ )
          a[i] = f(a[i-1]);
    • Dependence between a[i] and a[i-1].
    • Loop iterations are not parallelizable.
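A sketch contrasting Examples 9 and 10 with OpenMP: the dependence in Example 9 stays within a single iteration, so the loop parallelizes; the dependence in Example 10 crosses iterations, so that loop is left sequential. The function f is replaced by a simple stand-in, and the second loop starts at i = 1 to keep a[i-1] in bounds.

    /* Sketch: Example 9 parallelizes, Example 10 does not. */
    #include <stdio.h>

    #define N 100

    int main(void)
    {
        int a[N];
        for (int i = 0; i < N; i++) a[i] = i;

        /* Example 9: dependence only within an iteration -> parallelizable */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 100;

        /* Example 10: loop-carried dependence on a[i-1] -> keep it sequential
         * (a[i-1] + 1 stands in for f(a[i-1])) */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + 1;

        printf("a[99] = %d\n", a[99]);
        return 0;
    }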

12. Loop-carried dependence
    • A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.
    • Otherwise, we call it a loop-independent dependence.
    • Loop-carried dependences prevent loop iteration parallelization.

    Example 11
        for(i=0; i<100; i++)
          for(j=0; j<100; j++)
            a[i][j] = f(a[i][j-1]);
    • Loop-independent dependence on i.
    • Loop-carried dependence on j.
    • Outer loop can be parallelized, inner loop cannot. (See the sketch below.)

    Example 12
        for( j=0; j<100; j++ )
          for( i=0; i<100; i++ )
            a[i][j] = f(a[i][j-1]);
    • Inner loop can be parallelized, outer loop cannot.
    • Less desirable situation.
    • Loop interchange is sometimes possible.

    Level of loop-carried dependence
    • Is the nesting depth of the loop that carries the dependence.
    • Indicates which loops can be parallelized.
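A sketch of Example 11 with OpenMP: only the j loop carries the dependence, so the outer i loop can be parallelized while each row is computed sequentially. f is a stand-in for the slide's unspecified function, and j starts at 1 to keep a[i][j-1] in bounds.

    /* Sketch of Example 11: parallelize the outer (i) loop only. */
    #include <stdio.h>

    #define N 100

    static int f(int x) { return x + 1; }   /* placeholder for f in the slides */

    int main(void)
    {
        static int a[N][N];                 /* zero-initialized */

        #pragma omp parallel for            /* rows are independent of each other */
        for (int i = 0; i < N; i++)
            for (int j = 1; j < N; j++)     /* j carries the dependence: keep sequential */
                a[i][j] = f(a[i][j - 1]);

        printf("a[99][99] = %d\n", a[99][99]);  /* 99 with this placeholder f */
        return 0;
    }

Example 12 is the interchanged form: swapping the two loops (when legal) turns it back into this shape, which is why loop interchange is mentioned as a remedy.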

13. Be careful … Example 13
        printf("a");
        printf("b");
    • Statements have a hidden output dependence due to the output stream.

    Be careful … Example 14
        a = f(x);
        b = g(x);
    • Statements could have a hidden dependence if f and g update the same variable.
    • Also depends on what f and g can do to x.

    Be careful … Example 15
        for( i=0; i<100; i++ )
          a[i+10] = f(a[i]);
    • Dependence between a[10], a[20], …
    • Dependence between a[11], a[21], …
    • …
    • Some parallel execution is possible. (See the sketch below.)

    Be careful … Example 16
        for( i=1; i<100; i++ ) {
          a[i] = ...;
          ... = a[i-1];
        }
    • Dependence between a[i] and a[i-1].
    • Complete parallel execution impossible.
    • Pipelined parallel execution possible.
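A sketch of the partial parallelism available in Example 15: a[i+10] = f(a[i]) only links elements whose indices differ by 10, so the ten chains (one per value of i mod 10) are independent and can run concurrently. The function f, the array size, and the chunking into chains are assumptions for illustration, not from the slides.

    /* Sketch of Example 15: run the 10 independent dependence chains in parallel. */
    #include <stdio.h>

    #define N 110                           /* writes go up to a[109] */

    static int f(int x) { return x + 1; }   /* placeholder for f in the slides */

    int main(void)
    {
        static int a[N];
        for (int i = 0; i < N; i++) a[i] = i;

        #pragma omp parallel for            /* one independent chain per value of k */
        for (int k = 0; k < 10; k++)
            for (int i = k; i < 100; i += 10)   /* follow the chain k, k+10, k+20, ... */
                a[i + 10] = f(a[i]);

        printf("a[109] = %d\n", a[109]);    /* 19 with this placeholder f and init */
        return 0;
    }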

14. Be careful … Example 17
        for( i=0; i<100; i++ )
          a[i] = f(a[indexa[i]]);
    • Cannot tell for sure.
    • Parallelization depends on user knowledge of values in indexa[].
    • User can tell, compiler cannot. (See the sketch below.)

    An aside
    • Parallelizing compilers analyze program dependences to decide parallelization.
    • In parallelization by hand, the user does the same analysis.
    • Compiler is more convenient and more correct.
    • User is more powerful, can analyze more patterns.

    To remember
    • Statement order must not matter.
    • Statements must not have dependences.
    • Some dependences can be removed.
    • Some dependences may not be obvious.
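A sketch of the indexa[] example: whether the loop is safe to parallelize depends on the run-time contents of indexa[], which the user may know but the compiler usually cannot prove. Here indexa is filled with the identity mapping, so each iteration reads and writes only its own element; f and the initialization are assumptions for illustration.

    /* Sketch: the loop is safe only because the user knows indexa[i] == i. */
    #include <stdio.h>

    #define N 100

    static int f(int x) { return 2 * x; }   /* placeholder for f in the slides */

    int main(void)
    {
        int a[N], indexa[N];
        for (int i = 0; i < N; i++) { a[i] = i; indexa[i] = i; }  /* identity mapping */

        #pragma omp parallel for            /* a compiler must assume the worst here */
        for (int i = 0; i < N; i++)
            a[i] = f(a[indexa[i]]);

        printf("a[99] = %d\n", a[99]);      /* 198 with this placeholder f */
        return 0;
    }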
