Parallel programming 01
Walter Boscheri - walter.boscheri@unife.it
University of Ferrara - Department of Mathematics and Computer Science
A.Y. 2018/2019 - Semester I
Outline
1. Introduction and motivation
2. Parallel architectures
3. Computational cost
4. Memory
5. Arrays
1. Introduction and motivation: Moore's law

G. Moore, co-founder and long-time CEO of Intel, observed (1965, revised in 1975) that the number of transistors in a dense integrated circuit doubles approximately every two years. As a consequence, computation speed has doubled roughly every two years as well.

Intel Corporation (2019): transistors have become very small. Intel's new 10 nm architecture (approaching molecular scale) is expected to be released in 2019, arriving more than three years after the previous architecture node.
1. Introduction and motivation: Moore's law
[Figure: growth of transistor counts over time, illustrating Moore's law]
1. Introduction and motivation: Parallel computing

Parallel computing is a type of computation in which many calculations, or the execution of processes, are carried out simultaneously. The code can be executed on many microprocessors, or on the cores of the same processor, in order to improve the computational efficiency of the simulation.

High Performance Computing (HPC) is the use of supercomputers and parallel processing techniques to solve complex computational problems in science, engineering, or business. HPC technology focuses on developing parallel processing algorithms and systems by incorporating both theoretical and parallel computational techniques.
2. Parallel architectures: Von Neumann CPU architecture (1945)

- Central Processing Unit (CPU), split into an Arithmetic Logic Unit (ALU) and a Control Unit; the accumulator (a register) in the ALU connects input and output
- Random Access Memory (RAM)
- Input unit, which enters data into the computer
- Output unit, which returns the elaborated data to the mass storage
2. Parallel architectures: Bus CPU architecture

The bus is a channel which connects all components to each other. A single bus can only access one of the two classes of memory at a time, which slows down the rate at which the CPU can work. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data.
2. Parallel architectures: Bus CPU architecture

Dual Independent Bus (DIB) architecture (Intel and AMD): two independent data I/O buses enable the processor to access data from either of its buses simultaneously and in parallel, rather than in the sequential manner of a single-bus system. The second, or backside, bus in a processor with DIB is used for the L2 cache, allowing it to run at much greater speeds than if it had to share the main processor bus.
2. Parallel architectures: Parallel architecture, memory structure

SHARED: there is a global memory space which is accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. Examples: multicore CPU, GPU.

DISTRIBUTED: all memory is associated with processors. To retrieve information from another processor's memory, a message must be sent. Algorithms should use distributed data structures. Examples: networks of computers, supercomputers.
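To make the shared memory model concrete, here is a minimal Fortran sketch (not part of the original slides): a vector sum parallelized with OpenMP, where all threads read the shared array a directly. It assumes an OpenMP-capable compiler (e.g. gfortran -fopenmp); the array size is an arbitrary choice.

program shared_sum
  use omp_lib
  implicit none
  integer, parameter :: n = 10000000
  real(8), allocatable :: a(:)
  real(8) :: s
  integer :: i
  allocate(a(n))
  a = 1.0d0
  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + a(i)        ! every thread reads the shared array a directly
  end do
  !$omp end parallel do
  print *, 'sum = ', s, '  threads available = ', omp_get_max_threads()
  deallocate(a)
end program shared_sum

In the distributed memory model the same reduction would instead require explicit messages between processes (e.g. with MPI), since no process can read another process's memory directly.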
2. Parallel architectures: Parallel architecture, memory structure

Shared memory architectures (multicore CPU):
- Intel 9th generation (2019): up to 8 cores, 16 threads, 5 GHz and 16 MB cache
- AMD Ryzen (2019): up to 12 cores, 24 threads, 4.6 GHz and 64 MB cache

Distributed memory architectures:
- MARCONI A3 supercomputer (Bologna, Italy): 1512 nodes, 196 GB RAM/node, 2 x 24-core Intel Xeon 8160 (SkyLake) processors at 2.10 GHz, peak performance 8 PFLOP/s
- SUPERMUC-NG supercomputer (Munich, Germany): 6336 nodes, 96 GB RAM/node, 2 x 24-core Intel Xeon Platinum 8174 (SkyLake) processors at 3.90 GHz, peak performance 26.3 PFLOP/s
3. Computational cost: Cost estimation of an algorithm

The cost of an algorithm is measured by the number of operations needed to execute the code.

FLOP = FLoating point OPeration. 1 FLOP is the cost associated with a summation or a multiplication between floating point numbers (real numbers).

1 GFLOP = 10^9 FLOP
1 TFLOP = 10^12 FLOP
1 PFLOP = 10^15 FLOP
1 EFLOP = 10^18 FLOP
3. Computational cost: Example

Dot product: given two vectors a, b \in R^N, the dot product c is evaluated as

c = \sum_{i=1}^{N} a(i) \cdot b(i)

Each iteration i costs a total of 2 floating point operations, namely a sum and a product. Thus, the computational cost of the dot product is 2N.

Exercise: work out the computational cost of i) a matrix-vector and ii) a matrix-matrix multiplication of size [N x N]. Write a FORTRAN code to compare the theoretical results against numerical evidence.
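As a starting point for the exercise, here is a minimal Fortran sketch (not from the slides; the vector length and the use of the intrinsic cpu_time are arbitrary choices) that times the dot product and compares the measured rate with its theoretical cost of 2N FLOP:

program dot_cost
  implicit none
  integer, parameter :: n = 20000000
  real(8), allocatable :: a(:), b(:)
  real(8) :: c, t0, t1
  integer :: i
  allocate(a(n), b(n))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0
  call cpu_time(t0)
  do i = 1, n
     c = c + a(i)*b(i)      ! 2 FLOP per iteration: one product, one sum
  end do
  call cpu_time(t1)
  print *, 'dot product = ', c
  print *, 'time [s]    = ', t1 - t0
  print *, 'GFLOP/s     = ', 2.0d0*real(n,8)/(t1 - t0)/1.0d9
  deallocate(a, b)
end program dot_cost

The same structure, with the loop nest replaced accordingly, can be reused for the matrix-vector and matrix-matrix cases.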
3. Computational cost: Computational speed of a system

The computational speed of a system is measured in terms of the number of floating point operations that can be executed in one second, i.e. FLOP/s.

1 GFLOP/s = 10^9 FLOP/s
1 TFLOP/s = 10^12 FLOP/s
1 PFLOP/s = 10^15 FLOP/s
1 EFLOP/s = 10^18 FLOP/s

Examples:
- SUPERMUC-NG (Munich): 26.3 PFLOP/s
- MARCONI A3 (Bologna): 8 PFLOP/s
- HLRS (Stuttgart): 7.4 PFLOP/s

See the TOP500 list of supercomputers.
3. Computational cost: Computational speed of a processor

Intel Core i7-8750H (8th gen): 6 cores, 12 threads, clock speed 2.20 GHz, 4 FLOP per clock cycle.
Speed: 6 (cores) x 4 (FLOP/cycle) x 2.20 · 10^9 (clock) = 52.8 GFLOP/s

Intel Skylake i7-9800X (9th gen): 8 cores, 16 threads, clock speed 4.40 GHz, 16 FLOP per clock cycle.
Speed: 8 (cores) x 16 (FLOP/cycle) x 4.40 · 10^9 (clock) = 563.2 GFLOP/s
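The same formula, peak speed = cores x FLOP per cycle x clock frequency, can be written as a small hypothetical Fortran helper (not from the slides); the two calls simply reproduce the numbers above:

program peak_speed
  implicit none
  print *, 'i7-8750H: ', peak(6,  4, 2.20d9)/1.0d9, ' GFLOP/s'
  print *, 'i7-9800X: ', peak(8, 16, 4.40d9)/1.0d9, ' GFLOP/s'
contains
  pure function peak(cores, flop_per_cycle, clock_hz) result(p)
    integer, intent(in) :: cores, flop_per_cycle
    real(8), intent(in) :: clock_hz
    real(8) :: p
    p = real(cores,8)*real(flop_per_cycle,8)*clock_hz
  end function peak
end program peak_speed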
4. Memory: Memory bandwidth

Memory bandwidth is the rate at which data can be transferred between memory and processor. It is typically measured in GB/s. The theoretical memory bandwidth is computed as the product of:
- base RAM clock frequency
- number of data transfers per clock: double data rate (DDR) memory performs two transfers per clock cycle
- memory bus width: each DDR memory interface is 64 bits wide (the so-called line)
- number of interfaces: modern computers typically use two memory interfaces (dual-channel mode), giving a 128-bit bus width

Example: dual-channel memory with DDR4-3200 (1600 MHz)
1600 · 10^6 (clock/s) x 2 (line/clock) x 64 (bit/line) x 2 (interfaces) = 409.6 · 10^9 bit/s = 51.2 GB/s
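The theoretical value above can be compared with a measurement. The following is a minimal STREAM-like triad sketch in Fortran (not the official STREAM benchmark and not part of the slides; array size and timing routine are arbitrary choices) that estimates the effective bandwidth by dividing the bytes moved by the elapsed time:

program bandwidth
  implicit none
  integer, parameter :: n = 20000000
  real(8), allocatable :: a(:), b(:), c(:)
  real(8) :: t0, t1, bytes
  integer :: i
  allocate(a(n), b(n), c(n))
  b = 1.0d0
  c = 2.0d0
  call cpu_time(t0)
  do i = 1, n
     a(i) = b(i) + 3.0d0*c(i)    ! triad: two loads and one store per element
  end do
  call cpu_time(t1)
  bytes = 3.0d0*8.0d0*real(n,8)  ! 3 arrays x 8 bytes per double precision element
  print *, 'effective bandwidth = ', bytes/(t1 - t0)/1.0d9, ' GB/s'
  deallocate(a, b, c)
end program bandwidth

The measured value is typically below the theoretical peak, since the latter ignores latencies and protocol overhead.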
4. Memory: Memory bandwidth

Example: the floating point operation c = a · b requires 3 memory accesses (a, b, and c), i.e. 24 bytes must be transferred (a, b, c assumed to be double precision floats).

To keep the floating point units busy, the memory would therefore have to deliver data at the following rates:
- 24 bytes x 52.8 GFLOP/s = 1267.2 GB/s (Intel Core i7-8750H)
- 24 bytes x 563.2 GFLOP/s = 13516.8 GB/s (Intel Skylake i7-9800X)

As the speed gap between CPU and memory widens, the memory hierarchy has become the primary factor limiting program performance.
4. Memory: Memory hierarchy

- registers (on chip): up to 64 bits (vector processors)
- cache memory: L1 on chip, and L2
- random access memory (RAM)
- mass storage

Memory access time: the time interval between the instant at which an instruction control unit initiates a call for data, or a request to store data, and the instant at which the delivery of the data is completed or the storage is started. In other words, it is how long it takes to locate a piece of data in memory.
4. Memory: Memory hierarchy

Cache is a fast access memory that allows data to be temporarily located close to the CPU, thus reducing the memory access time. There are two types of locality:
- temporal locality: the same data are likely to be accessed more than once over time
- spatial locality: when a specific piece of data is accessed and copied to the cache, it is very likely that spatially close data will be accessed as well

All data loaded into the cache are kept there as long as possible.
5. Arrays: FORTRAN array allocation

Let A \in R^{M x N} be a matrix (an array of rank 2). In FORTRAN, arrays of rank greater than 1 are stored by columns (column-major order), so the fastest index is the leftmost one. A matrix A with components a_ij is therefore stored in the following order:

col 1: a_11, a_21, ..., a_M1
col 2: a_12, a_22, ..., a_M2
...
col j: a_1j, a_2j, ..., a_Mj
...
col N: a_1N, a_2N, ..., a_MN

Elements of the same column are close to each other in memory, so the fastest way to access the array data is to loop column by column. The cache is then fully exploited, as the timing sketch below illustrates.
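A minimal timing sketch (not from the slides; matrix size and timing routine are arbitrary choices) that traverses the same matrix column by column and row by row; on most machines the column-ordered loop is noticeably faster because it matches the column-major storage:

program loop_order
  implicit none
  integer, parameter :: m = 5000, n = 5000
  real(8), allocatable :: a(:,:)
  real(8) :: s, t0, t1
  integer :: i, j
  allocate(a(m,n))
  a = 1.0d0

  ! column by column: contiguous memory access, good cache reuse
  s = 0.0d0
  call cpu_time(t0)
  do j = 1, n
     do i = 1, m
        s = s + a(i,j)
     end do
  end do
  call cpu_time(t1)
  print *, 'column order: ', t1 - t0, ' s'

  ! row by row: strided access, poor cache reuse
  s = 0.0d0
  call cpu_time(t0)
  do i = 1, m
     do j = 1, n
        s = s + a(i,j)
     end do
  end do
  call cpu_time(t1)
  print *, 'row order   : ', t1 - t0, ' s'
  deallocate(a)
end program loop_order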