ECE 1747H: Parallel Programming (Lecture 1: Overview)


1. ECE 1747H: Parallel Programming
   Lecture 1: Overview
   • Meeting time: Mon 4-6 PM
   • Meeting place: BA 4164
   • Instructor: Cristiana Amza, http://www.eecg.toronto.edu/~amza
     amza@eecg.toronto.edu, office Pratt 484E

   Material
   • Course notes
   • Web material (e.g., published papers)
   • No required textbook, some recommended

   Prerequisites
   • Programming in C or C++
   • Data structures
   • Basics of machine architecture
   • Basics of network programming
   • Please send e-mail to eugenia@eecg to get an eecg account (name, stuid, class, instructor)

2. Other than that
   • No written homeworks, no exams
   • 10% for each small programming assignment (expect 1-2)
   • 10% class participation
   • Rest comes from the major course project

   Programming Project
   • Parallelizing a sequential program, or improving the performance or the functionality of a parallel program
   • Project proposal and final report
   • In-class project proposal and final report presentation
   • “Sample” project presentation posted

   Parallelism (1 of 2)
   • Ability to execute different parts of a single program concurrently on different machines
   • Goal: shorter running time
   • Grain of parallelism: how big are the parts?
   • Can be an instruction, statement, procedure, …

   Parallelism (2 of 2)
   • Coarse-grain parallelism is mainly applicable to long-running, scientific programs
   • Examples: weather prediction, prime number factorization, simulations, …
   • Will mainly focus on relatively coarse grain

3. Lecture material (1 of 4)
   • Parallelism
     – What is parallelism?
     – What can be parallelized?
     – Inhibitors of parallelism: dependences

   Lecture material (2 of 4)
   • Standard models of parallelism
     – shared memory (Pthreads) (see the sketch below)
     – message passing (MPI)
     – shared memory + data parallelism (OpenMP)
   • Classes of applications
     – scientific
     – servers

   Lecture material (3 of 4)
   • Transaction processing
     – classic programming model for databases
     – now being proposed for scientific programs

   Lecture material (4 of 4)
   • Perf. of parallel & distributed programs
     – architecture-independent optimization
     – architecture-dependent optimization
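A minimal, hedged sketch of the shared-memory (Pthreads) model listed above; it is an illustration, not course material. Two threads increment a shared counter, with a mutex protecting the shared update:

    /* Shared-memory sketch: two threads update one counter under a mutex. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;  /* unused */
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);    /* protect the shared write */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* expect 2000 */
        return 0;
    }

The MPI and OpenMP models listed alongside express similar concurrency with different primitives.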

4. Course Organization
   • First month of semester:
     – lectures on parallelism, patterns, models
     – small programming assignments, done individually
   • Rest of the semester:
     – major programming project, done individually or in small group
     – research paper discussions

   Parallel vs. Distributed Programming
   Parallel programming has matured:
   • Few standard programming models
   • Few common machine architectures
   • Portability between models and architectures

   Bottom Line
   • Programmer can now focus on program and use suitable programming model
   • Reasonable hope of portability
   • Problem: much performance optimization is still platform-dependent
     – Performance portability is a problem

   ECE 1747H: Parallel Programming
   Lecture 1-2: Parallelism, Dependences

5. Parallelism
   • Ability to execute different parts of a program concurrently on different machines
   • Goal: shorten execution time

   Measures of Performance
   • To computer scientists: speedup, execution time.
   • To applications people: size of problem, accuracy of solution, etc.

   Speedup of Algorithm
   • Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set).
   [Figure: speedup plotted against the number of processors p]

   Speedup on Problem
   • Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors.
   • A more honest measure of performance.
   • Avoids picking an easily parallelizable algorithm with poor sequential execution time.
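To make the two definitions concrete, here is a small illustrative computation (the timings are invented, not course data): the same parallel time gives a larger "speedup of algorithm" than "speedup on problem" whenever the parallelized algorithm is slower sequentially than the best known sequential algorithm.

    /* Sketch: "speedup of algorithm" vs. "speedup on problem" from example
     * timings.  All numbers are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double t_seq_same_alg = 80.0;  /* sequential time of the parallelized algorithm */
        double t_seq_best     = 50.0;  /* time of the best known sequential algorithm   */
        double t_parallel     = 10.0;  /* time on p processors                          */

        printf("speedup of algorithm: %.1f\n", t_seq_same_alg / t_parallel); /* 8.0 */
        printf("speedup on problem:   %.1f\n", t_seq_best / t_parallel);     /* 5.0 */
        return 0;
    }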

6. What Speedups Can You Get?
   • Linear speedup
     – Confusing term: implicitly means a 1-to-1 speedup per processor.
     – (almost always) as good as you can do.
   • Sub-linear speedup: more normal, due to overhead of startup, synchronization, communication, etc.

   Speedup
   [Figure: linear vs. actual speedup as a function of the number of processors p]

   Scalability
   • No really precise definition.
   • Roughly speaking, a program is said to scale to a certain number of processors p if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5). (See the sketch below.)

   Super-linear Speedup?
   • Due to cache/memory effects:
     – Subparts fit into the cache/memory of each node.
     – Whole problem does not fit in the cache/memory of a single node.
   • Nondeterminism in search problems.
     – One thread finds a near-optimal solution very quickly => leads to drastic pruning of the search space.
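A sketch of the rough scalability test described above, using the slide's example threshold of 0.5; the speedup values are invented for illustration.

    /* Sketch: the program "scales" to p processors if
     * speedup(p) - speedup(p-1) >= threshold. */
    #include <stdio.h>

    int scales_to(const double *speedup, int p, double threshold)
    {
        /* speedup[i] = measured speedup on i processors, valid for i in [1, p] */
        return (speedup[p] - speedup[p - 1]) >= threshold;
    }

    int main(void)
    {
        double speedup[] = { 0.0, 1.0, 1.9, 2.7, 3.0 };  /* illustrative numbers */
        printf("scales to 3: %d\n", scales_to(speedup, 3, 0.5));  /* 1 (2.7 - 1.9 = 0.8) */
        printf("scales to 4: %d\n", scales_to(speedup, 4, 0.5));  /* 0 (3.0 - 2.7 = 0.3) */
        return 0;
    }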

7. Cardinal Performance Rule
   • Don’t leave (too) much of your code sequential!

   Amdahl’s Law
   • If 1/s of the program is sequential, then you can never get a speedup better than s. (See the sketch below.)
     – (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1
     – Best parallel execution time on p processors = 1/s + (1 - 1/s)/p
     – When p goes to infinity, parallel execution time = 1/s
     – Speedup = s.

   Why keep something sequential?
   • Some parts of the program are not parallelizable (because of dependences)
   • Some parts may be parallelizable, but the overhead dwarfs the increased speedup.

   When can two statements execute in parallel?
   • On one processor:
       statement 1;
       statement 2;
   • On two processors:
       processor1:    processor2:
       statement1;    statement2;
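A small sketch of the Amdahl's Law bound in the slide's notation (sequential fraction 1/s, normalized sequential time of 1); the numbers are illustrative only.

    /* Sketch: Amdahl's Law with a sequential fraction of 1/s.
     * The speedup approaches s as p grows but never exceeds it. */
    #include <stdio.h>

    double amdahl_speedup(double s, double p)
    {
        double seq_frac = 1.0 / s;                      /* sequential part            */
        double t_par = seq_frac + (1.0 - seq_frac) / p; /* best time on p processors  */
        return 1.0 / t_par;                             /* sequential time is 1       */
    }

    int main(void)
    {
        /* 10% sequential (s = 10): bounded by a speedup of 10 */
        printf("p=10:   %.2f\n", amdahl_speedup(10.0, 10.0));    /* ~5.26 */
        printf("p=1000: %.2f\n", amdahl_speedup(10.0, 1000.0));  /* ~9.91 */
        return 0;
    }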

8. Fundamental Assumption
   • Processors execute independently: no control over order of execution between processors

   When can 2 statements execute in parallel?
   • Possibility 1
       Processor1:    Processor2:
       statement1;    statement2;
   • Possibility 2
       Processor1:    Processor2:
       statement2;    statement1;

   When can 2 statements execute in parallel?
   • Their order of execution must not matter!
   • In other words,
       statement1;
       statement2;
     must be equivalent to
       statement2;
       statement1;

   Example 1
       a = 1;
       b = 2;
   • Statements can be executed in parallel.
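A hedged Pthreads sketch of Example 1 (Pthreads stands in for "two processors"; it is not part of the slides): because the two assignments touch different variables, both possible orders give the same result.

    /* Sketch of Example 1: independent statements on two threads. */
    #include <pthread.h>
    #include <stdio.h>

    static int a, b;

    static void *set_a(void *arg) { (void)arg; a = 1; return NULL; }  /* statement 1 */
    static void *set_b(void *arg) { (void)arg; b = 2; return NULL; }  /* statement 2 */

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, set_a, NULL);
        pthread_create(&t2, NULL, set_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%d b=%d\n", a, b);  /* always 1 and 2, regardless of order */
        return 0;
    }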

9. Example 2
       a = 1;
       b = a;
   • Statements cannot be executed in parallel
   • Program modifications may make it possible.

   Example 3
       a = f(x);
       b = a;
   • May not be wise to change the program (sequential execution would take longer).

   Example 5
       a = 1;
       a = 2;
   • Statements cannot be executed in parallel.

   True dependence
   Statements S1, S2.
   S2 has a true dependence on S1
   iff
   S2 reads a value written by S1.
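A sketch of why Example 2 cannot run in parallel: b = a reads the value written by a = 1 (a true dependence), so splitting the statements across threads races on a and the result depends on timing. Pthreads is used only for illustration; strictly this is a data race, shown here only to expose the dependence.

    /* Sketch of Example 2's true dependence broken by parallel execution. */
    #include <pthread.h>
    #include <stdio.h>

    static int a = 0, b = 0;

    static void *s1(void *arg) { (void)arg; a = 1; return NULL; }  /* S1 writes a          */
    static void *s2(void *arg) { (void)arg; b = a; return NULL; }  /* S2 reads a (true dep) */

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, s1, NULL);
        pthread_create(&t2, NULL, s2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("b = %d\n", b);  /* may print 0 or 1 depending on timing */
        return 0;
    }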

10. Anti-dependence
    Statements S1, S2.
    S2 has an anti-dependence on S1
    iff
    S2 writes a value read by S1.

    Output Dependence
    Statements S1, S2.
    S2 has an output dependence on S1
    iff
    S2 writes a variable written by S1.

    When can 2 statements execute in parallel?
    S1 and S2 can execute in parallel
    iff
    there are no dependences between S1 and S2
    – true dependences
    – anti-dependences
    – output dependences
    • Some dependences can be removed.

    Example 6
    • Most parallelism occurs in loops.
        for(i=0; i<100; i++)
          a[i] = i;
    • No dependences.
    • Iterations can be executed in parallel.
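A sketch of Example 6 using OpenMP (one of the models named earlier): every iteration writes a different element and there are no true, anti-, or output dependences across iterations, so the loop can be marked parallel. Compile with an OpenMP flag such as -fopenmp.

    /* Sketch of Example 6: a dependence-free loop parallelized with OpenMP. */
    #include <stdio.h>

    int main(void)
    {
        int a[100];

        #pragma omp parallel for          /* iterations distributed over threads */
        for (int i = 0; i < 100; i++)
            a[i] = i;

        printf("a[99] = %d\n", a[99]);    /* 99, same as the sequential loop */
        return 0;
    }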

11. Example 7
        for(i=0; i<100; i++) {
          a[i] = i;
          b[i] = 2*i;
        }
    • Iterations and statements can be executed in parallel.

    Example 8
        for(i=0; i<100; i++)
          a[i] = i;
        for(i=0; i<100; i++)
          b[i] = 2*i;
    • Iterations and loops can be executed in parallel.

    Example 9
        for(i=0; i<100; i++)
          a[i] = a[i] + 100;
    • There is a dependence … on itself!
    • Loop is still parallelizable.

    Example 10
        for( i=0; i<100; i++ )
          a[i] = f(a[i-1]);
    • Dependence between a[i] and a[i-1].
    • Loop iterations are not parallelizable.
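A sketch contrasting Examples 9 and 10 with OpenMP: the dependence in Example 9 stays within a single iteration, so the loop parallelizes; the dependence in Example 10 crosses iterations, so that loop is left sequential. The function f is replaced by a simple stand-in, and the second loop starts at i = 1 to keep a[i-1] in bounds.

    /* Sketch: Example 9 parallelizes, Example 10 does not. */
    #include <stdio.h>

    #define N 100

    int main(void)
    {
        int a[N];
        for (int i = 0; i < N; i++) a[i] = i;

        /* Example 9: dependence only within an iteration -> parallelizable */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 100;

        /* Example 10: loop-carried dependence on a[i-1] -> keep it sequential
         * (a[i-1] + 1 stands in for f(a[i-1])) */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + 1;

        printf("a[99] = %d\n", a[99]);
        return 0;
    }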

12. Loop-carried dependence
    • A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.
    • Otherwise, we call it a loop-independent dependence.
    • Loop-carried dependences prevent loop iteration parallelization.

    Example 11
        for(i=0; i<100; i++)
          for(j=0; j<100; j++)
            a[i][j] = f(a[i][j-1]);
    • Loop-independent dependence on i.
    • Loop-carried dependence on j.
    • Outer loop can be parallelized, inner loop cannot. (See the sketch below.)

    Example 12
        for( j=0; j<100; j++ )
          for( i=0; i<100; i++ )
            a[i][j] = f(a[i][j-1]);
    • Inner loop can be parallelized, outer loop cannot.
    • Less desirable situation.
    • Loop interchange is sometimes possible.

    Level of loop-carried dependence
    • Is the nesting depth of the loop that carries the dependence.
    • Indicates which loops can be parallelized.
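A sketch of Example 11 with OpenMP: only the j loop carries the dependence, so the outer i loop can be parallelized while each row is computed sequentially. f is a stand-in for the slide's unspecified function, and j starts at 1 to keep a[i][j-1] in bounds.

    /* Sketch of Example 11: parallelize the outer (i) loop only. */
    #include <stdio.h>

    #define N 100

    static int f(int x) { return x + 1; }   /* placeholder for f in the slides */

    int main(void)
    {
        static int a[N][N];                 /* zero-initialized */

        #pragma omp parallel for            /* rows are independent of each other */
        for (int i = 0; i < N; i++)
            for (int j = 1; j < N; j++)     /* j carries the dependence: keep sequential */
                a[i][j] = f(a[i][j - 1]);

        printf("a[99][99] = %d\n", a[99][99]);  /* 99 with this placeholder f */
        return 0;
    }

Example 12 is the interchanged form: swapping the two loops (when legal) turns it back into this shape, which is why loop interchange is mentioned as a remedy.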

13. Be careful … Example 13
        printf("a");
        printf("b");
    • Statements have a hidden output dependence due to the output stream.

    Be careful … Example 14
        a = f(x);
        b = g(x);
    • Statements could have a hidden dependence if f and g update the same variable.
    • Also depends on what f and g can do to x.

    Be careful … Example 15
        for( i=0; i<100; i++ )
          a[i+10] = f(a[i]);
    • Dependence between a[10], a[20], …
    • Dependence between a[11], a[21], …
    • …
    • Some parallel execution is possible. (See the sketch below.)

    Be careful … Example 16
        for( i=1; i<100; i++ ) {
          a[i] = ...;
          ... = a[i-1];
        }
    • Dependence between a[i] and a[i-1].
    • Complete parallel execution impossible.
    • Pipelined parallel execution possible.
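A sketch of the partial parallelism available in Example 15: a[i+10] = f(a[i]) only links elements whose indices differ by 10, so the ten chains (one per value of i mod 10) are independent and can run concurrently. The function f, the array size, and the chunking into chains are assumptions for illustration, not from the slides.

    /* Sketch of Example 15: run the 10 independent dependence chains in parallel. */
    #include <stdio.h>

    #define N 110                           /* writes go up to a[109] */

    static int f(int x) { return x + 1; }   /* placeholder for f in the slides */

    int main(void)
    {
        static int a[N];
        for (int i = 0; i < N; i++) a[i] = i;

        #pragma omp parallel for            /* one independent chain per value of k */
        for (int k = 0; k < 10; k++)
            for (int i = k; i < 100; i += 10)   /* follow the chain k, k+10, k+20, ... */
                a[i + 10] = f(a[i]);

        printf("a[109] = %d\n", a[109]);    /* 19 with this placeholder f and init */
        return 0;
    }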

14. Be careful … Example 17
        for( i=0; i<100; i++ )
          a[i] = f(a[indexa[i]]);
    • Cannot tell for sure.
    • Parallelization depends on user knowledge of values in indexa[].
    • User can tell, compiler cannot. (See the sketch below.)

    An aside
    • Parallelizing compilers analyze program dependences to decide parallelization.
    • In parallelization by hand, the user does the same analysis.
    • Compiler is more convenient and more correct.
    • User is more powerful, can analyze more patterns.

    To remember
    • Statement order must not matter.
    • Statements must not have dependences.
    • Some dependences can be removed.
    • Some dependences may not be obvious.
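A sketch of the indexa[] example: whether the loop is safe to parallelize depends on the run-time contents of indexa[], which the user may know but the compiler usually cannot prove. Here indexa is filled with the identity mapping, so each iteration reads and writes only its own element; f and the initialization are assumptions for illustration.

    /* Sketch: the loop is safe only because the user knows indexa[i] == i. */
    #include <stdio.h>

    #define N 100

    static int f(int x) { return 2 * x; }   /* placeholder for f in the slides */

    int main(void)
    {
        int a[N], indexa[N];
        for (int i = 0; i < N; i++) { a[i] = i; indexa[i] = i; }  /* identity mapping */

        #pragma omp parallel for            /* a compiler must assume the worst here */
        for (int i = 0; i < N; i++)
            a[i] = f(a[indexa[i]]);

        printf("a[99] = %d\n", a[99]);      /* 198 with this placeholder f */
        return 0;
    }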
