30. Parallel Programming I Moore’s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn’s Taxonomy, Multi-Threading, Parallelism and Concurrency, C++ Threads, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling [Task-Scheduling: Cormen et al, Kap. 27] [Concurrency, Scheduling: Williams, Kap. 1.1 – 1.2] 888
The Free Lunch The free lunch is over. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Herb Sutter, Dr. Dobb’s Journal, 2005 889
Moore’s Law Observation by Gordon E. Moore (1929): The number of transistors on integrated circuits doubles approximately every two years. 890
[Figure: transistor counts over time] Source: ourworldindata.org, https://en.wikipedia.org/wiki/Transistor_count 891
For a long time... sequential execution became faster ("Instruction Level Parallelism", "Pipelining", higher frequencies); more and smaller transistors meant more performance; programmers simply waited for the next processor generation 892
Today the frequency of processors does not increase significantly any more (heat dissipation problems); instruction level parallelism does not increase significantly any more; the execution speed is dominated by memory access times (but caches still become larger and faster) 893
Trends [Figure omitted] Source: http://www.gotw.ca/publications/concurrency-ddj.htm 894
Multicore Use transistors for more compute cores Parallelism in the software Programmers have to write parallel programs to benefit from new hardware 895
Forms of Parallel Execution Vectorization Pipelining Instruction Level Parallelism Multicore / Multiprocessing Distributed Computing 896
Vectorization Parallel execution of the same operation on the elements of a vector (register): scalar: x + y; vector: (x1, x2, x3, x4) + (y1, y2, y3, y4) = (x1 + y1, x2 + y2, x3 + y3, x4 + y4); fused multiply-add: fma ⟨x, y⟩ of (x1, x2, x3, x4) and (y1, y2, y3, y4). 897
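A minimal sketch (not from the slides) of how vectorization is typically reached from C++: the loops below have independent iterations with simple indexing, which an auto-vectorizing compiler (e.g. GCC/Clang with -O3) can map to SIMD vector instructions; array names and the fma pattern are illustrative.

#include <cstddef>

// c[i] = a[i] + b[i]: element-wise addition, candidate for SIMD vectorization
void add(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// d[i] = a[i] * b[i] + c[i]: fused multiply-add pattern as mentioned on the slide
void fma_loop(const float* a, const float* b, const float* c, float* d, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        d[i] = a[i] * b[i] + c[i];
}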
Pipelining in CPUs Multiple stages: Fetch, Decode, Execute, Data Fetch, Writeback. Every instruction takes 5 time units (cycles). In the best case: 1 instruction per cycle, not always possible ("stalls"). Parallelism (several functional units) leads to faster execution. 898
ILP – Instruction Level Parallelism Modern CPUs provide several hardware units and execute independent instructions in parallel: Pipelining, Superscalar CPUs (multiple instructions per cycle), Out-of-Order Execution (the programmer observes the sequential execution), Speculative Execution. 899
30.2 Hardware Architectures 900
Shared vs. Distributed Memory [Diagram: shared memory – several CPUs attached to one common memory; distributed memory – each CPU has its own memory, CPUs communicate via an interconnect] 901
Shared vs. Distributed Memory Programming Categories of programming interfaces: communication via message passing; communication via memory sharing. It is possible to program shared memory systems as distributed systems (e.g. with message passing, MPI) and to program systems with distributed memory as shared memory systems (e.g. partitioned global address space, PGAS) 902
Shared Memory Architectures Multicore (Chip Multiprocessor, CMP); Symmetric Multiprocessor systems (SMP); Simultaneous Multithreading (SMT = Hyperthreading): one physical core, several instruction streams/threads, i.e. several virtual cores – between ILP (several units for one stream) and multicore (several units for several streams), limited parallel performance; Non-Uniform Memory Access (NUMA). Same programming interface. 903
Overview [Diagrams: memory hierarchies of CMP, SMP and NUMA systems] 904
An Example AMD Bulldozer: between CMP and SMT; 2x integer core, 1x floating point core. [Image: Wikipedia] 905
Flynn’s Taxonomy SI = Single Instruction, MI = Multiple Instructions; SD = Single Data, MD = Multiple Data. SISD: single-core; MISD: fault tolerance; SIMD: vector computing / GPU; MIMD: multi-core. 906
Massively Parallel Hardware [General Purpose] Graphical Processing Units ([GP]GPUs): revolution in high performance computing; calculation 4.5 TFlops vs. 500 GFlops; memory bandwidth 170 GB/s vs. 40 GB/s; SIMD; high data parallelism; requires its own programming model, e.g. CUDA / OpenCL 907
30.3 Multi-Threading, Parallelism and Concurrency 908
Processes and Threads Process: instance of a program each process has a separate context, even a separate address space OS manages processes (resource control, scheduling, synchronisation) Threads: threads of execution of a program Threads share the address space fast context switch between threads 909
Why Multithreading? Avoid "polling" resources (files, network, keyboard); interactivity (e.g. responsiveness of GUI programs); several applications / clients in parallel; parallelism (performance!) 910
Multithreading conceptually [Diagram: threads 1–3 interleaved on a single core vs. running simultaneously on multiple cores] 911
Thread switch on one core (Preemption) [Diagram: thread 1 runs (busy) until an interrupt occurs; its state is stored and the state of thread 2 is loaded; thread 2 runs while thread 1 is idle; on the next interrupt the state of thread 2 is stored and thread 1 resumes] 912
Parallelism vs. Concurrency Parallelism: use extra resources to solve a problem faster. Concurrency: correctly and efficiently manage access to shared resources. The terms obviously overlap: parallel computations almost always require synchronisation. 913
Thread Safety Thread safety means that the concurrent execution of a program always yields the desired result. Many optimisations (hardware, compiler) are designed for the correct execution of a sequential program. Concurrent programs need annotations that selectively switch off certain optimisations. 914
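As a small illustration (not from the slides): the following program has a data race, because two threads increment a shared counter without synchronisation, so the printed value is often less than 200000; declaring the counter as std::atomic<int> (header <atomic>) instead makes the increments thread safe.

#include <iostream>
#include <thread>

int counter = 0;                  // shared and unprotected: data race
// std::atomic<int> counter{0};   // thread-safe alternative, needs <atomic>

void work() {
    for (int i = 0; i < 100000; ++i)
        ++counter;                // read-modify-write, not atomic on a plain int
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << "\n"; // often < 200000 with the plain int
    return 0;
}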
Example: Caches Access to registers is faster than to shared memory. Principle of locality. Use of caches (transparent to the programmer). Whether and to what extent cache coherency is guaranteed depends on the system used. 915
30.4 C++ Threads 916
C++11 Threads
#include <iostream>
#include <thread>

void hello(){
    std::cout << "hello\n";
}

int main(){
    // create and launch thread t
    std::thread t(hello);
    // wait for termination of t
    t.join();
    return 0;
}
917
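A practical aside (not on the slide): with GCC or Clang on Linux, such a program is typically compiled and linked with thread support enabled, for example (hello.cpp is a placeholder file name):

g++ -std=c++11 -pthread hello.cpp -o hello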
C++11 Threads
#include <iostream>
#include <thread>
#include <vector>

void hello(int id){
    std::cout << "hello from " << id << "\n";
}

int main(){
    std::vector<std::thread> tv(3);
    int id = 0;
    for (auto & t : tv)
        t = std::thread(hello, id++);
    std::cout << "hello from main \n";
    for (auto & t : tv)
        t.join();
    return 0;
}
918
Nondeterministic Execution!
One execution:
hello from main
hello from 2
hello from 1
hello from 0

Other execution:
hello from 1
hello from main
hello from 0
hello from 2

Other execution:
hello from main
hello from 0
hello from hello from 1
2
919
Technical Detail To let a thread continue as a background thread:
void background();

void someFunction(){
    ...
    std::thread t(background);
    t.detach();
    ...
} // no problem here, thread is detached
920
More Technical Details When a thread is constructed, arguments are copied; to pass a parameter by reference, std::ref must be provided explicitly at construction. A functor or lambda expression can also be run on a thread. In exceptional circumstances, joining threads should be executed in a catch block. More background and details in chapter 2 of the book C++ Concurrency in Action, Anthony Williams, Manning 2012, also available online at the ETH library. 921
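A small sketch (names are illustrative, not from the slides) combining two of these points: a reference parameter must be wrapped in std::ref, and a lambda expression can be run on a thread directly.

#include <iostream>
#include <thread>
#include <functional> // std::ref

void increment(int& x) { ++x; }

int main() {
    int a = 0, b = 0;
    std::thread t1(increment, std::ref(a)); // without std::ref this does not compile:
                                            // thread arguments are copied by default
    std::thread t2([&b]{ ++b; });           // lambda run directly on a thread
    t1.join();
    t2.join();
    std::cout << a << " " << b << "\n";     // 1 1
    return 0;
}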
30.5 Scalability: Amdahl and Gustafson 922
Scalability In parallel programming: speedup when increasing the number p of processors. What happens if p → ∞? A program scales linearly: linear speedup. 923
Parallel Performance Given a fixed amount of computing work W (number of computing steps). Sequential execution time $T_1$. Parallel execution time on p CPUs: $T_p$. Perfection: $T_p = T_1/p$. Performance loss: $T_p > T_1/p$ (usual case). Sorcery: $T_p < T_1/p$. 924
Parallel Speedup Parallel speedup $S_p$ on p CPUs: $S_p = \frac{W/T_p}{W/T_1} = \frac{T_1}{T_p}$. Perfection: linear speedup $S_p = p$. Performance loss: sublinear speedup $S_p < p$ (the usual case). Sorcery: superlinear speedup $S_p > p$. Efficiency: $E_p = S_p/p$. 925
Reachable Speedup? Parallel program: parallel part 80%, sequential part 20%. $T_1 = 10$, $T_8 = \frac{10 \cdot 0.8}{8} + 10 \cdot 0.2 = 1 + 2 = 3$, $S_8 = \frac{T_1}{T_8} = \frac{10}{3} \approx 3.3 < 8$ (!) 926
Amdahl’s Law: Ingredients Computational work W falls into two categories: parallelisable part $W_p$; not parallelisable, sequential part $W_s$. Assumption: W can be processed sequentially by one processor in W time units ($T_1 = W$): $T_1 = W_s + W_p$, $T_p \geq W_s + W_p/p$. 927
Amdahl’s Law $S_p = \frac{T_1}{T_p} \leq \frac{W_s + W_p}{W_s + \frac{W_p}{p}}$ 928
Amdahl’s Law With sequential, not parallelisable fraction $\lambda$: $W_s = \lambda W$, $W_p = (1 - \lambda) W$: $S_p \leq \frac{1}{\lambda + \frac{1-\lambda}{p}}$. Thus $S_\infty \leq \frac{1}{\lambda}$. 929
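Plugging in the numbers from the earlier example (sequential fraction $\lambda = 0.2$) shows how the bound behaves:

$S_8 \leq \frac{1}{0.2 + \frac{0.8}{8}} = \frac{1}{0.3} \approx 3.3, \qquad S_\infty \leq \frac{1}{0.2} = 5.$

Even with arbitrarily many processors the speedup cannot exceed 5.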
Illustration of Amdahl’s Law [Diagram: for p = 1, 2, 4 the sequential part $W_s$ takes the same time while the parallel part $W_p$ shrinks with p; the total time is bounded from below by $W_s$] 930
Amdahl’s Law is bad news All non-parallel parts of a program can cause problems 931
Gustafson’s Law Fix the time of execution. Vary the problem size. Assumption: the sequential part stays constant, the parallel part becomes larger. 932
Illustration of Gustafson’s Law [Diagram: for p = 1, 2, 4 the execution time T stays fixed; the sequential part $W_s$ stays constant while p parallel work packages $W_p$ are processed simultaneously] 933
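A sketch of the law as a formula, following the slide’s assumption and the notation of the Amdahl slides (the derivation is added here for clarity, not taken from the slides): in the fixed execution time T the sequential part is $W_s = \lambda T$ and each processor additionally processes parallel work $W_p = (1 - \lambda) T$. With p processors the total work done in time T is $W_s + p\,W_p$, so the scaled speedup is

$S_p = \frac{W_s + p\,W_p}{W_s + W_p} = \lambda + p(1 - \lambda) = p - \lambda(p - 1),$

which grows linearly in p instead of saturating at $1/\lambda$ as in Amdahl’s bound.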