Parallel Computer Architecture
Lars Karlsson
Umeå University
2009-12-07
Topics Covered
- Multicore processors
- Short vector instructions (SIMD)
- Advanced instruction level parallelism
- Cache coherence
- Hardware multithreading
- Sample multicore processors
- Introduction to parallel programming
Part I: Introduction
Moore's Law
- Moore's law predicts an exponential growth in the number of transistors per chip
- Exponential growth has been observed over the last couple of decades
- The trend appears set to continue for at least another decade
- Enables the construction of faster processors
Turning Transistors into Performance: The Old Approach
- Speed up a single instruction stream:
  ◮ Increase the clock frequency
  ◮ Pipeline the execution of instructions
  ◮ Predict branches to reduce the overhead of pipeline stalls
  ◮ Issue several instructions per clock cycle
  ◮ Schedule instructions out of order
  ◮ Use short vector (SIMD) instructions
  ◮ Hide memory latency with a multilevel cache hierarchy
- Conclusion: the old approach relies on Instruction Level Parallelism (ILP)
Limits of the Old Approach
- The Power Wall
  ◮ Dynamic power consumption grows at least linearly with the clock frequency
  ◮ Power leads to heat
  ◮ Power is expensive
  ◮ Clock frequencies have stayed around 2–3 GHz since 2001
  ◮ Prior to 2001: exponential frequency growth over several decades
- The ILP Wall
  ◮ Already, few applications utilize all functional units
  ◮ Sublinear return on invested resources (transistors/power)
  ◮ Diminishing returns
Turning Transistors into Performance: The New Approach
- The new approach: multicore architectures
  ◮ Several cores on one die – increases peak performance
  ◮ Reduce the clock frequency – saves power
  ◮ Use simpler core designs – frees transistors
- Which of the following choices lead to the highest performance?
  ◮ All cores identical: homogeneous multicore
  ◮ Different types of cores: heterogeneous multicore
- Clearly, heterogeneous multicores are potentially harder to program.
Heterogeneous Multicores: A Simple Model
- Consider the following core designs:
  ◮ Small: 1 unit of area, 1 unit of performance
  ◮ Medium: 4 units of area, 2 units of performance
  ◮ Large: 16 units of area, 4 units of performance
- Suppose we have 16 units of die area. Consider these processors:
  ◮ Large: 1 large core – 4 units of sequential performance, 4 units of parallel performance
  ◮ Medium/Homo: 4 medium cores – 2 units of sequential performance, 8 units of parallel performance
  ◮ Small/Homo: 16 small cores – 1 unit of sequential performance, 16 units of parallel performance
  ◮ Hetero: 1 medium and 12 small cores – 2 units of sequential performance, 14 units of parallel performance
Heterogeneous Multicores: Evaluating Design Choices
- Partition an algorithm's execution time:
  ◮ Serial fraction f ∈ [0, 1]: no parallel speedup possible
    (f ≈ 1: sequential algorithm, very rare; f ≈ 0: perfectly parallel algorithm, quite common)
  ◮ Parallel fraction (1 − f): perfect parallel speedup
- Performance as a function of f:
  [Figure: performance (0–16) versus serial fraction f (0–1) for the Large, Medium/Homo, Small/Homo, and Hetero designs]
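The curves in the figure can be reproduced with a simple Amdahl-style model: the serial fraction f runs at the design's sequential performance and the parallel fraction (1 − f) at its aggregate parallel performance. The sketch below is an illustration of this model, not code from the slides.

    #include <stdio.h>

    /* Amdahl-style model: the serial fraction f runs at the sequential
     * performance (seq), the parallel fraction (1 - f) at the aggregate
     * parallel performance (par).  Performance = 1 / execution time. */
    static double performance(double f, double seq, double par)
    {
        return 1.0 / (f / seq + (1.0 - f) / par);
    }

    int main(void)
    {
        /* (sequential, parallel) performance of the four 16-area designs */
        const char  *name[4] = { "Large", "Medium/Homo", "Small/Homo", "Hetero" };
        const double seq[4]  = { 4.0, 2.0, 1.0, 2.0 };
        const double par[4]  = { 4.0, 8.0, 16.0, 14.0 };

        for (int k = 0; k <= 10; k++) {
            double f = k / 10.0;
            printf("f = %.1f:", f);
            for (int i = 0; i < 4; i++)
                printf("  %-11s %5.2f", name[i], performance(f, seq[i], par[i]));
            printf("\n");
        }
        return 0;
    }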
Memory System
- Machine characteristics:
  ◮ Peak computational performance
  ◮ Memory bandwidth
  ◮ Memory latency
- The first two impose hardware limits on performance:
  ◮ Compute-bound, e.g., most of dense linear algebra
  ◮ Memory-bound, e.g., most of sparse linear algebra
  ◮ Latency-bound, e.g., finite state machines
Memory System: Compute-Bound vs Memory-Bound
- Sample difference in performance between a compute-bound and a memory-bound algorithm on Akka @ HPC2N
- [Figure: Gflop/s (0–80) versus matrix size (200–2000) for a compute-bound and a memory-bound algorithm; annotations mark 4x and 43x performance differences]
Obtaining Peak Floating Point Performance
- To obtain peak performance, an algorithm must:
  ◮ Have a high arithmetic intensity
  ◮ Exploit the ISA effectively
  ◮ Parallelize over all cores
- Exploiting the ISA effectively means:
  ◮ Balancing the number of multiplies with adds
    (fused multiply-add (FMA), or adder and multiplier working in parallel)
  ◮ Using SIMD instructions
  ◮ Having enough instruction level parallelism (ILP)
  ◮ Having a predictable control flow
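As an illustration (not from the slides), a dot-product kernel written with several independent accumulators: every iteration pairs one multiply with one add (FMA-friendly), and the independent accumulators provide enough ILP to keep the floating-point units busy. The unroll factor of 4 is an arbitrary choice for this sketch.

    #include <stddef.h>

    /* Dot product with four independent accumulators.  Each partial sum
     * pairs one multiply with one add, and the four accumulators carry
     * no dependences between them, so the core can keep several
     * floating-point operations in flight at once. */
    double dot(const double *x, const double *y, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i;

        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        for (; i < n; i++)              /* remaining elements */
            s0 += x[i] * y[i];

        return (s0 + s1) + (s2 + s3);
    }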
Part II: SISD / MIMD / SIMD
Flynn's Taxonomy
- Flynn's taxonomy classifies parallel computers based on:
  ◮ Number of instruction streams
  ◮ Number of data streams

  Instr. \ Data    Single    Multi
  Single           SISD      SIMD
  Multi            MISD      MIMD

- SISD: uniprocessor
- MIMD: multicores/clusters
- SIMD: vector processors/instructions
- MISD: ??? (no well-known examples)
Single Instruction Multiple Data (SIMD)
- Several ALUs operating synchronously in parallel
  ◮ Same instruction stream
  ◮ Different data streams
- Several variants
  ◮ SIMD/vector instructions
  ◮ Different control flows
SIMD Programming
- [Figure: issue logic broadcasting one instruction to four ALUs; 4-vector addition a + b = c with a = (1, 2, 3, 4), b = (5, 4, 3, 2), c = (6, 6, 6, 6)]
- Vector data types + vector operations
- SSE example (Intel C intrinsics):

    __m128 vecA, vecB, vecC;
    vecC = _mm_add_ps(vecA, vecB);
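A slightly fuller, self-contained version of the SSE example (the data and the GCC-style alignment attribute are illustrative assumptions): four floats are loaded, added in a single instruction, and stored back.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        /* 16-byte aligned arrays so that aligned loads/stores can be used */
        float a[4] __attribute__((aligned(16))) = { 1.0f, 2.0f, 3.0f, 4.0f };
        float b[4] __attribute__((aligned(16))) = { 5.0f, 4.0f, 3.0f, 2.0f };
        float c[4] __attribute__((aligned(16)));

        __m128 vecA = _mm_load_ps(a);          /* load 4 floats */
        __m128 vecB = _mm_load_ps(b);
        __m128 vecC = _mm_add_ps(vecA, vecB);  /* 4 additions in one instruction */
        _mm_store_ps(c, vecC);                 /* store 4 floats */

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);   /* prints: 6 6 6 6 */
        return 0;
    }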
MIMD: Shared vs Distributed Address Space
- A key issue in MIMD design is whether or not to support a shared address space programming model (abbreviated shared memory) in hardware; the alternative is distributed memory.
- Distributed memory
  ◮ Each process has its own address space
  ◮ Explicit communication (message passing)
- Shared memory
  ◮ All processes share a global address space
  ◮ Implicit communication (reads/writes + synchronization)
- Supporting shared memory in hardware raises several issues:
  ◮ What if two threads access the same memory location?
  ◮ How to manage multiple cached copies of a memory location?
  ◮ How to synchronize the threads?
- Supporting distributed memory is much simpler.
MIMD: Synchronization
- Thread cooperation requires that some threads write data that other threads read.
- To avoid corrupted results, the threads must be synchronized so that data races are avoided.

Definition (Data Race): A data race consists of two accesses to the same memory location by different threads, where at least one access is a write and no synchronization forces one access to complete before the other.

- With data races present, the output depends on the execution order.
- Without any data races, the program is correctly synchronized.
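A minimal illustration (not from the slides) of a data race, using POSIX threads: two threads increment the same counter without synchronization, so updates can be lost and the final value varies from run to run.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;    /* shared and unprotected: the data race */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;          /* non-atomic read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Expected 2000000, but lost updates typically make it smaller. */
        printf("counter = %ld\n", counter);
        return 0;
    }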
MIMD: Synchronization – Hardware Support
- Atomic read/write instructions are not strong enough
  ◮ Synchronization primitives become too expensive to implement
  ◮ The cost grows with the number of processors
- Atomic read-modify-write instructions are required, e.g.:
  ◮ Atomic exchange
  ◮ Fetch-and-increment
  ◮ Test-and-set
  ◮ Compare-and-swap
  ◮ Load-linked / store-conditional
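As an illustration (not from the slides), a fetch-and-increment built on top of compare-and-swap via C11 <stdatomic.h>; compilers expose the hardware's read-modify-write instructions through interfaces such as this.

    #include <stdatomic.h>

    /* Fetch-and-increment implemented with a compare-and-swap loop:
     * read the current value, try to install value + 1, and retry if
     * another thread modified the location in between. */
    long fetch_and_increment(_Atomic long *p)
    {
        long old = atomic_load(p);
        while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
            /* on failure, old is refreshed with the current value; retry */
        }
        return old;
    }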
MIMD: Synchronization – Implementing a Lock with Atomic Exchange
- Represent the state of a lock (locked/free) by an integer
  ◮ 0: free
  ◮ 1: locked
- Locking:
  ◮ Atomically exchange the lock variable with 1
  ◮ (i) Returns 0: the lock was free and is now locked – OK!
  ◮ (ii) Returns 1: the lock was locked and is still locked – retry!
  ◮ Precisely one thread will succeed, since the operations are ordered by the hardware.
- Unlocking:
  ◮ Overwrite the lock variable with 0
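A minimal spinlock sketch of the scheme above, written with C11 atomics (the slides do not prescribe any particular implementation).

    #include <stdatomic.h>

    typedef struct {
        atomic_int state;    /* 0 = free, 1 = locked */
    } spinlock_t;

    void spinlock_init(spinlock_t *l)
    {
        atomic_store(&l->state, 0);
    }

    void spinlock_lock(spinlock_t *l)
    {
        /* Atomically exchange the lock variable with 1: a return value of
         * 0 means the lock was free and we now hold it; a return value of
         * 1 means it was already held, so retry. */
        while (atomic_exchange(&l->state, 1) == 1)
            ;   /* spin */
    }

    void spinlock_unlock(spinlock_t *l)
    {
        /* Unlock by overwriting the lock variable with 0. */
        atomic_store(&l->state, 0);
    }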
Compiling for SIMD / Shared Memory / Distributed Memory
- Compiling for SIMD instructions
  ◮ Alignment
  ◮ Data structures
- Compiling for shared memory
  ◮ Loop-level parallelism
  ◮ The best strategy depends on the usage pattern
  ◮ Speculative multithreading
- Compiling for distributed memory
  ◮ Data distribution
  ◮ Communication
- Summary: it is very difficult to compile for parallel architectures. Programmers are responsible for almost all parallelization.
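As a concrete illustration of loop-level parallelism for shared memory (an OpenMP example chosen for this sketch; the slides do not name a specific tool), the programmer annotates the loop and the compiler and runtime distribute the iterations over the cores.

    /* y := a*x + y, with the iterations distributed over the cores.
     * The iterations are independent, so the loop contains no data races.
     * Compile with, e.g., gcc -fopenmp. */
    void saxpy(float a, const float *x, float *y, long n)
    {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            y[i] += a * x[i];
    }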
Part III: Advanced Instruction Level Parallelism