 
Parallel Programming and Heterogeneous Computing
E2 - Summary
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group
Course Topics
A. The Parallelization Problem
  □ Power wall, memory wall, Moore's law
  □ Terminology and metrics
B. Shared Memory Parallelism
  □ Theory of concurrency, hardware today and in the past
  □ Programming models, optimization, profiling
C. Heterogeneous Computing
  □ On-Chip Accelerators (e.g. SIMD, special purpose accelerators)
  □ External Accelerators (e.g. GPUs, FPGAs)
D. Shared Nothing Parallelism
  □ Theory of concurrency, hardware today and in the past
  □ Programming models, optimization, profiling
ParProg20 E2 Summary, Chart 2
A: Why Parallel?, Terminology, Hardware, Metrics, Workloads, Foster's Methodology
Moore's Law vs. Walls: Speed, Power, Memory, ILP
Dynamic Power ~ Number of Transistors (N) x Capacitance (C) x Voltage² (V²) x Frequency (F)
[Figure: memory hierarchy with four CPU cores, four L1 caches, two L2 caches and one L3 cache, connected by buses]
ParProg 2020 Introduction: Why Parallel?, Max Plauth, Chart 4
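As a rough illustration of the proportionality above, the following sketch compares relative dynamic power for two configurations. The function name `relative_dynamic_power` and the scaling factors (twice the transistors, 20% lower voltage and frequency) are hypothetical values chosen for illustration, not figures from the slides.

```python
def relative_dynamic_power(n_transistors, capacitance, voltage, frequency):
    """Dynamic power up to a constant factor: N * C * V^2 * F."""
    return n_transistors * capacitance * voltage ** 2 * frequency


# Baseline: all factors normalized to 1.
baseline = relative_dynamic_power(1.0, 1.0, 1.0, 1.0)

# Hypothetical scenario: twice the transistors (e.g. a second core),
# with voltage and frequency each lowered by 20 %.
scaled = relative_dynamic_power(2.0, 1.0, 0.8, 0.8)

ratio = scaled / baseline
print(f"relative dynamic power: {ratio:.3f}")
```

Under these assumed numbers, the doubled transistor count is almost exactly offset by the lower voltage and frequency, because voltage enters the product quadratically.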
Three Ways of Doing Things Faster [Pfister1998]
■ Work Harder (execution capacity)
■ Work Smarter (optimization)
■ Get Help (parallelization)
Workload: collection of operations that are executed to produce a desired result ~ Program, Application
Execution Unit: facility that is capable of executing the operations of a workload
[Figure: a workload and three execution units]
ParProg20 A1 Terminology, Lukas Wenzel, Chart 5
An Important Distinction
Concurrency: Capability of a machine to have multiple tasks in progress at any point in time
■ Can be realized without parallel hardware
Parallelism: Capability of a machine to perform multiple tasks simultaneously
■ Requires parallel hardware
Any parallel program is a concurrent program; some concurrent programs cannot be executed correctly in parallel.
Distribution: Form of Parallelism, where tasks are performed by multiple communicating machines
Concurrency ⊃ Parallelism ⊃ Distribution
Concurrency \ Parallelism is sometimes called "Concurrency"
Hardware Taxonomy [Flynn1966]
[Figure: 2x2 grid classifying hardware by the number of instruction streams and data streams (single or multiple), with example instruction sequences for each class: SISD, SIMD, MISD, MIMD]
ParProg 2020 A2 Parallel Hardware, Lukas Wenzel, Chart 7
MIMD Hardware Taxonomy
MIMD
■ SM-MIMD (Shared Memory): Processing elements can directly access a common address space
■ DM-MIMD (Distributed Memory): Processing elements can access their private address spaces and exchange messages
[Figure: processing elements running tasks on data in shared memory (SM-MIMD) and on data in private memory with messages sent over an interconnect / network (DM-MIMD)]
SM-MIMD Hardware
MIMD
■ SM-MIMD (Shared Memory)
  □ UMA (Uniform Memory Access)
  □ NUMA (Non-Uniform Memory Access)
■ DM-MIMD (Distributed Memory)
[Figure: UMA and NUMA organization of processing elements (PE), memory and nodes]
Recap Optimization Goals
■ Decrease Latency: process a single workload faster (= speedup)
■ Increase Throughput: process more workloads in the same time
→ Both are Performance metrics
■ Scalability: make best use of additional resources
  □ Scale Up: Utilize additional resources on a machine
  □ Scale Out: Utilize resources on additional machines
■ Cost/Energy Efficiency:
  □ minimize cost/energy requirements for given performance objectives
  □ alternatively: maximize performance for given cost/energy budget
■ Utilization: minimize idle time (= waste) of available resources
■ Precision-Tradeoffs: trade performance for precision of results
Anatomy of a Workload
The longest task puts a lower bound on the shortest execution time.
[Figure: a workload consisting of tasks T1 to T8]
Modeling discrete tasks is impractical → simplified continuous model.
[Figure: execution time $T_1$ split into $T_{par}$ and $T_{seq}$; execution time $T(N)$ split into $T_{par}/N$ and $T_{seq}$]
$T(N) = \frac{T_{par}}{N} + T_{seq}$
Replace absolute times by parallelizable fraction $P$:
$T_{par} = T_1 \cdot P$
$T_{seq} = T_1 \cdot (1 - P)$
$T(N) = T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)$
ParProg 2020 A3 Performance Metrics, Lukas Wenzel, Chart 11
Amdahl's Law [Amdahl1967]
Amdahl's Law derives the speedup $s_{Amdahl}(N)$ for a parallelization degree $N$:
$s_{Amdahl}(N) = \frac{T_1}{T(N)} = \frac{T_1}{T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)} = \frac{1}{\frac{P}{N} + (1 - P)}$
Even for arbitrarily large $N$, the speedup converges to a fixed limit:
$\lim_{N \to \infty} s_{Amdahl}(N) = \frac{1}{1 - P}$
For getting reasonable speedup out of 1000 processors, the sequential part must be substantially below 0.1%.
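The speedup formula and its limit can be checked numerically with a minimal sketch. The function names `amdahl_speedup` and `amdahl_limit` are illustrative choices; `p` is the parallelizable fraction P and `n` the parallelization degree N.

```python
def amdahl_speedup(p, n):
    """Amdahl speedup 1 / (P/N + (1 - P)) for fraction p on n execution units."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (p / n + (1.0 - p))


def amdahl_limit(p):
    """Limit of the speedup for n -> infinity: 1 / (1 - P)."""
    if p >= 1.0:
        return float("inf")  # fully parallelizable: no finite limit
    return 1.0 / (1.0 - p)


# Sequential part of exactly 0.1 % (P = 0.999) on a growing machine.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.999, n), 2))

print("limit for P = 0.9:", round(amdahl_limit(0.9), 2))
```

With a sequential part of exactly 0.1%, 1000 execution units yield a speedup of only about 500, which is why the sequential part has to be substantially smaller than that to use such a machine well.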
Amdahl's Law [Amdahl1967]
Regardless of processor count, 90% parallelizable code allows not more than a speedup by factor 10.
→ Parallelism requires highly parallelizable workloads to achieve a speedup
■ What is the sense in large parallel machines?
Amdahl's law assumes a simple speedup scenario!
→ isolated execution of a single workload
→ fixed workload size
[Figure: Amdahl's law plot. By Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551]
Gustafson-Barsis' Law [Gustafson1988]
Consider a scaled speedup scenario, allowing a variable workload size $w$.
Amdahl ~ What is the shortest execution time for a given workload?
Gustafson-Barsis ~ What is the largest workload for a given execution time?
Determine the scaled speedup $s_{Gustafson}(N)$ through the increase in workload size $w(N)$ over the fixed execution time $T$:
$w(1) \sim T_{par} + T_{seq}$
$w(N) \sim N \cdot T_{par} + T_{seq}$
$s_{Gustafson}(N) = P \cdot N + (1 - P)$
[Figure: fixed execution time $T$ split into $T_{par}$ and $T_{seq}$, for one and for $N$ execution units]
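A minimal sketch of the scaled speedup, assuming the parallelizable fraction P is measured on the fixed execution time. The function name `gustafson_speedup` and the sample values are illustrative.

```python
def gustafson_speedup(p, n):
    """Scaled speedup P * N + (1 - P) for fraction p on n execution units."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return p * n + (1.0 - p)


# The scaled speedup grows linearly with n instead of converging to a limit.
for n in (1, 10, 100, 1000):
    print(n, round(gustafson_speedup(0.9, n), 1))
```

For P = 0.9 the scaled speedup on 100 execution units is about 90, whereas the fixed-size scenario of Amdahl's law caps the speedup at 10 for the same fraction.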
Karp-Flatt-Metric [Karp1990]
Parallel fraction $P$ is a hypothetical parameter and not easily deduced from a given workload.
→ Karp-Flatt-Metric determines sequential fraction $Q = 1 - P$ empirically
1. Measure baseline execution time $T_1$ by executing workload on a single execution unit
2. Measure parallelized execution time $T(N)$ by executing workload on $N$ execution units
3. Determine speedup $s(N) = \frac{T_1}{T(N)}$
4. Calculate Karp-Flatt-Metric: $Q(N) = \frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}}$
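The four steps can be sketched as a small function. The name `karp_flatt` and the timing values below are hypothetical; in practice `t1` and `tn` would come from real measurements.

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction Q(N) from measured times t1 and tn."""
    if n < 2:
        raise ValueError("n must be at least 2 (1 - 1/N would be zero)")
    if t1 <= 0 or tn <= 0:
        raise ValueError("execution times must be positive")
    speedup = t1 / tn                                    # step 3
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)   # step 4


# Hypothetical measurements: 100 s on one unit, 25 s on eight units.
q = karp_flatt(t1=100.0, tn=25.0, n=8)
print(f"estimated sequential fraction: {q:.4f}")
```

If the measured times follow the continuous model exactly, the metric returns the sequential fraction 1 - P regardless of N; a value that grows with N hints at overhead that the model does not capture.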
Workloads
"task-level parallelism"
■ Different tasks being performed at the same time
■ Might originate from the same or different programs
"data-level parallelism"
■ Parallel execution of the same task on disjoint data sets
ParProg20 A4 Foster's Methodology, Sven Köhler, Chart 16
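The two workload kinds can be contrasted in a short sketch using Python's standard `concurrent.futures` module. The helper functions `total`, `smallest` and `largest` and the sample data are illustrative. Threads keep the example short; in standard CPython builds the global interpreter lock limits CPU-bound threads, so this shows the structure rather than an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def total(chunk):
    return sum(chunk)


def smallest(values):
    return min(values)


def largest(values):
    return max(values)


data = list(range(1, 101))

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data-level parallelism: the same task on disjoint data sets.
    chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
    partial_sums = list(pool.map(total, chunks))
    data_parallel_result = sum(partial_sums)

    # Task-level parallelism: different tasks in progress at the same time.
    future_min = pool.submit(smallest, data)
    future_max = pool.submit(largest, data)
    task_parallel_result = (future_min.result(), future_max.result())

print(data_parallel_result, task_parallel_result)
```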
Designing Parallel Algorithms [Foster]
■ A) Search for concurrency and scalability
  □ Partitioning: Decompose computation and data into the smallest possible tasks
  □ Communication: Define necessary coordination of task execution
■ B) Search for locality and other performance-related issues
  □ Agglomeration: Consider performance and implementation costs
  □ Mapping: Maximize execution unit utilization, minimize communication
■ Might require backtracking or parallel investigation of steps
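One way to picture the agglomeration and mapping steps is the following sketch, which groups fine-grained per-element tasks into contiguous blocks and assigns the blocks to execution units. The helpers `agglomerate` and `map_round_robin` and the round-robin policy are assumptions made for illustration, not a strategy prescribed by the methodology.

```python
def agglomerate(items, n_chunks):
    """Group fine-grained tasks (one per item) into n_chunks contiguous blocks."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        chunks.append(items[start:start + size])
        start += size
    return chunks


def map_round_robin(chunks, n_units):
    """Assign chunk indices to execution units in round-robin order."""
    assignment = {unit: [] for unit in range(n_units)}
    for index in range(len(chunks)):
        assignment[index % n_units].append(index)
    return assignment


items = list(range(10))                  # partitioning: one task per element
chunks = agglomerate(items, 4)           # agglomeration: four coarser tasks
assignment = map_round_robin(chunks, 2)  # mapping: onto two execution units
print(chunks)
print(assignment)
```

Contiguous blocks keep neighboring elements on the same unit, which tends to reduce communication; other workloads may call for a different grouping or mapping.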
Surface-To-Volume Effect [Foster, Breshears]
Visualize the data to be processed (in parallel) as sliced 3D cube
[Figure: sliced 3D cube, image credit nicerweb.com]
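A small numeric sketch of the effect, assuming each task owns a cubic sub-block, that computation is proportional to the block's volume and that communication is proportional to its surface. The function name `surface_to_volume` and the edge lengths are illustrative.

```python
def surface_to_volume(edge):
    """Ratio of surface (~ communication) to volume (~ computation) of a cube."""
    if edge <= 0:
        raise ValueError("edge must be positive")
    surface = 6 * edge ** 2
    volume = edge ** 3
    return surface / volume


# Larger blocks per task have relatively less surface to communicate.
for edge in (1, 2, 4, 8, 16):
    print(edge, surface_to_volume(edge))
```

The ratio falls as 6 / edge, so agglomerating small tasks into larger blocks lowers the communication cost per unit of computation.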
B1: Shared Memory Systems (Concurrency & Synchronization)
Critical Section
Shared Resource (e.g. memory regions)
■ Mutual Exclusion demand: Only one task at a time is allowed into its critical section, among all tasks that have critical sections for the same resource.
■ Progress demand: If no other task is in the critical section, the decision for entering should not be postponed indefinitely. Only tasks that wait for entering the critical section are allowed to participate in decisions.
■ Bounded Waiting demand: It must not be possible for a task requiring access to a critical section to be delayed indefinitely by other threads entering the section (starvation problem).
[Figure: tasks T0, T1 and T2 and a critical section guarding a shared resource]
ParProg20 B1 Concurrency & Synchronization, Sven Köhler, Chart 20
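A minimal sketch of a critical section guarded by a lock, using Python's standard `threading` module. The shared counter, the `worker` function and the iteration counts are illustrative.

```python
import threading

counter = 0                # shared resource
lock = threading.Lock()    # guards every access to counter


def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:         # enter the critical section (mutual exclusion)
            counter += 1   # read-modify-write on the shared resource
        # the lock is released here, so another thread may enter


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

The lock provides mutual exclusion, but which waiting thread acquires it next is not defined by `threading.Lock`, so this primitive alone does not guarantee bounded waiting.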