2 • Introduction to parallel computing
Chip Multiprocessors (ACS MPhil)
Robert Mullins
Overview
• Parallel computing platforms
  – Approaches to building parallel computers
  – Today's chip-multiprocessor architectures
• Approaches to parallel programming
  – Programming with threads and shared memory
  – Message-passing libraries
  – PGAS languages
  – High-level parallel languages
Parallel computers
• How might we exploit multiple processing elements and memories in order to complete a large computation quickly?
  – How many processing elements, and how powerful?
  – How do they communicate and cooperate?
    • How are memories and processing elements interconnected?
    • How is the memory hierarchy organised?
  – How might we program such a machine?
The control structure
• How are the processing elements controlled?
  – Centrally, from a single control unit, or can they work independently?
• Flynn's taxonomy:
  – Single Instruction Multiple Data (SIMD)
  – Multiple Instruction Multiple Data (MIMD)
The control structure
• SIMD
  – The scalar pipelines execute in lockstep
  – Data-independent logic is shared
    • Efficient for highly data-parallel applications
    • Much simpler instruction fetch and supply mechanism
  – SIMD hardware can support an SPMD model if the individual threads follow similar control flow
    • Masked execution (see the sketch below)
[Figure: a generic streaming multiprocessor (for graphics applications). Reproduced from "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al.]
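To make masked execution concrete, here is a minimal C sketch (illustrative only, not from the lecture; WIDTH and the array names are hypothetical). An SPMD branch is turned into straight-line SIMD code: every lane evaluates both paths in lockstep and a per-lane predicate selects which result is written back.

    #include <stdio.h>

    #define WIDTH 8   /* hypothetical SIMD width: 8 lanes in lockstep */

    int main(void) {
        int a[WIDTH] = {3, -1, 4, -1, 5, -9, 2, 6};
        int r[WIDTH];

        /* The SPMD source contains: if (a[i] < 0) r = -a[i]; else r = a[i];
         * Lockstep SIMD hardware cannot let lanes branch independently, so it
         * computes a per-lane mask and executes BOTH paths, with each lane's
         * writeback enabled only when its mask bit matches the path being run. */
        for (int lane = 0; lane < WIDTH; lane++) {
            int mask      = (a[lane] < 0);   /* predicate for this lane */
            int taken     = -a[lane];        /* "then" path, computed by all lanes */
            int not_taken =  a[lane];        /* "else" path, computed by all lanes */
            r[lane] = mask ? taken : not_taken;   /* masked writeback */
        }

        for (int lane = 0; lane < WIDTH; lane++)
            printf("%d ", r[lane]);
        printf("\n");
        return 0;
    }

On a real GPU the mask is held in hardware and divergent lanes are simply disabled, but the effect on the results is the same as the selection shown here.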
The communication model
• A clear distinction is made between two common communication models:
• 1. Shared-address-space platforms
  – All processors have access to a shared data space, accessed via a shared address space
  – All communication takes place via a shared memory (see the threads sketch below)
  – Each processing element may also have an area of memory that is private
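As a minimal illustration of the shared-address-space model (a sketch using POSIX threads in C, not taken from the lecture), the threads below communicate purely by reading and writing one shared variable; the mutex provides the synchronisation that shared-memory communication requires.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state: visible to every thread through the single address space. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* synchronise access to shared data */
            counter++;                    /* communication happens via memory  */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* expect 400000 */
        return 0;
    }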
The communication model
• 2. Message-passing platforms
  – Each processing element has its own exclusive address space
  – Communication is achieved by sending explicit messages between processing elements
  – The sending and receiving of messages can be used both to communicate between and to synchronize the actions of multiple processing elements
Multi-core
[Figure: a multi-core processor. Figure courtesy of Tim Harris, MSR]
SMP multiprocessor
[Figure: an SMP multiprocessor. Figure courtesy of Tim Harris, MSR]
NUMA multiprocessor
[Figure: a NUMA multiprocessor. Figure courtesy of Tim Harris, MSR]
Message-passing platforms
• Many early message-passing machines provided hardware primitives that were close to the send/receive user-level communication commands
  – e.g. a pair of processors may be interconnected with a hardware FIFO queue
  – The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network)
[Figure: an early message-passing machine, nodes 000-111 connected in a hypercube. Culler, Figure 1.22]
Message-passing platforms
• The Transputer (1984)
  – The result of an earlier foray into the world of parallel computing!
  – The Transputer contained integrated serial links for building multiprocessors
    • IN/OUT instructions in the ISA for sending and receiving messages
  – Programmed in OCCAM (based on CSP)
• IBM Victor V256 (1991)
  – 16x16 array of transputers
  – The processors could be partitioned dynamically between different users
Message-passing platforms
• Recently some chip-multiprocessors have taken a similar approach (RAW/Tilera and XMOS)
  – Message queues (or communication channels) may be register mapped or accessed via special instructions
  – The processor stalls when reading an empty input queue or when trying to write to a full output buffer (a software model of this behaviour is sketched below)
[Figure: a wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped. (See also the iWarp paper on the wiki)]
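The stalling behaviour described above can be modelled in software. The following C sketch (hypothetical, built on pthreads; it is not the RAW/XMOS hardware interface) implements a one-slot channel in which a receive blocks while the slot is empty and a send blocks while it is full, mirroring a pipeline stalling on an empty input or full output queue.

    #include <pthread.h>
    #include <stdio.h>

    /* One-slot blocking channel: a software model of a hardware FIFO of depth 1. */
    typedef struct {
        int value;
        int full;                 /* 1 if the slot holds an unread value */
        pthread_mutex_t m;
        pthread_cond_t cv;
    } channel_t;

    /* Send blocks while the slot is full, like a core stalling on a full
     * output queue. */
    static void chan_send(channel_t *c, int v) {
        pthread_mutex_lock(&c->m);
        while (c->full)
            pthread_cond_wait(&c->cv, &c->m);
        c->value = v;
        c->full = 1;
        pthread_cond_broadcast(&c->cv);
        pthread_mutex_unlock(&c->m);
    }

    /* Receive blocks while the slot is empty, like a core stalling on an
     * empty input queue. */
    static int chan_recv(channel_t *c) {
        pthread_mutex_lock(&c->m);
        while (!c->full)
            pthread_cond_wait(&c->cv, &c->m);
        int v = c->value;
        c->full = 0;
        pthread_cond_broadcast(&c->cv);
        pthread_mutex_unlock(&c->m);
        return v;
    }

    static channel_t ch = { 0, 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

    static void *producer(void *arg) {
        for (int i = 0; i < 5; i++)
            chan_send(&ch, i * i);   /* stream results to the consumer */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        for (int i = 0; i < 5; i++)
            printf("received %d\n", chan_recv(&ch));
        pthread_join(t, NULL);
        return 0;
    }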
Message-passing platforms
• For larger message-passing machines (typically scientific supercomputers), direct FIFO designs were soon replaced by designs that built message-passing upon remote memory copies (supported by DMA or a more general communication-assist processor)
  – The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes
    • No topology restrictions on the programmer, and no software routing support required
• Hardware and software evolution meant there was a general convergence of parallel machine organisations
Message-passing platforms
• The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations
  – Here data movement must be specified at both ends of the communication; this is known as two-sided communication, e.g. MPI_Send and MPI_Recv*
  – Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped (see the example below)
*The Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines.
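A minimal two-sided MPI example in C (illustrative, not from the lecture): rank 0 sends an array which rank 1 must explicitly receive, so the data movement is specified at both ends.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data[4] = {1, 2, 3, 4};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Sender names the data, the destination rank and a message tag. */
            MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver must post a matching receive: two-sided communication. */
            MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d %d %d %d\n",
                   data[0], data[1], data[2], data[3]);
        }

        MPI_Finalize();
        return 0;
    }

The non-blocking variants MPI_Isend and MPI_Irecv return immediately and are completed later with MPI_Wait, allowing independent computation to overlap the transfer.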
One-sided communication
• SHMEM
  – Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g.:
    • shmem_put(target_addr, source_addr, length, remote_pe)
    • shmem_get, shmem_barrier, etc.
  – One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement (see the sketch below)
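Here is a small one-sided sketch in C using the OpenSHMEM API (routine names such as shmem_long_put and shmem_barrier_all follow OpenSHMEM; older SHMEM libraries spell them slightly differently, e.g. the generic shmem_put shown on the slide). PE 0 writes directly into PE 1's memory; PE 1 never posts a matching receive.

    #include <shmem.h>
    #include <stdio.h>

    /* Symmetric data object: exists at the same address on every PE. */
    static long dest[4];

    int main(void) {
        long src[4] = {1, 2, 3, 4};

        shmem_init();
        int me = shmem_my_pe();

        if (me == 0 && shmem_n_pes() > 1) {
            /* One-sided put: PE 0 writes into PE 1's copy of dest.
             * PE 1 executes no matching receive call. */
            shmem_long_put(dest, src, 4, 1);
        }

        /* Barrier ensures the put is complete and visible before use. */
        shmem_barrier_all();

        if (me == 1)
            printf("PE 1: dest = %ld %ld %ld %ld\n",
                   dest[0], dest[1], dest[2], dest[3]);

        shmem_finalize();
        return 0;
    }

Note that dest must be a symmetric object (here a static array) so that it exists at the same address on every processing element.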
The communication model
• From a hardware perspective we would like to keep the machine simple (message-passing)
• But we inevitably need to simplify the programmer's and compiler's task
  – Efficiently support shared-memory programming
  – Add support for transactional memory?
  – Create a simple but high-performance target
• There are trade-offs between hardware complexity and the complexity of the software and compiler
Today's chip multiprocessors
• Intel Nehalem-EX (2009)
  – 8 cores
    • 2-way hyperthreaded (SMT)
    • 16 hardware threads
  – L1I 32KB, L1D 32KB (per core)
  – 256KB L2 (private, per core)
  – 24MB L3 (shared)
    • 8 banks
    • Inclusive L3
Today's chip multiprocessors
[Figure: Intel Nehalem-EX (2009) memory hierarchy: per-core L1 and L2 caches, shared L3, memory]
Today's chip multiprocessors
• IBM POWER7 (2010)
  – 8 cores (dual-chip module to hold 16 cores)
  – 32MB shared eDRAM L3 cache
  – 2-channel DDR3 controllers
  – Individual cores:
    • 4-thread SMT per core
    • 6 ops/cycle
    • 4GHz
Today's chip multiprocessors
[Figure: IBM POWER7 (2010)]
Today's chip multiprocessors
• Sun Niagara T1 (2005)
  – Each core has its own level-1 cache (16KB for instructions, 8KB for data)
  – The level-2 caches are 3MB in total and are effectively 12-way associative
  – They are interleaved by 64-byte cache lines
Oracle M7 Processor (2014)
• 32 cores
  – Dual-issue, OoO
  – Dynamic multithreading, 1-8 threads/core
• 256KB I & D L2 caches shared by groups of 4 cores
• 64MB L3
• Technology: 20nm, 13 metal layers
• 16 DDR channels
  – 160GB/s (vs. ~20GB/s for T1)
• >10B transistors!
“Manycore” designs: Tilera
• Tilera (now Mellanox)
  – An evolution of MIT RAW
  – 100 cores: a grid of identical tiles
  – Low-power 3-way VLIW cores
  – Cores interconnected by a selection of static and dynamic on-chip networks
“Manycore” designs: Celerity (2017)
• Tiered accelerator fabric:
  – General-purpose tier: 5 “Rocket” RISC-V cores
  – Massively parallel tier: 496 5-stage RISC-V cores in a 16x31 tiled mesh array
  – Specialised tier: a binarized neural network accelerator
GPUs
• Tesla P100
  – 56 streaming multiprocessors x 64 cores = 3584 “cores” or lanes
  – 732GB/s memory bandwidth
  – 4MB L2 cache
  – 15.3 billion transistors
[See also “The NVIDIA GeForce 8800 GPU”, Hot Chips 2007]
Communication latencies
• Chip multiprocessor
  – Some have very fast core-to-core communication, as low as 1-3 cycles
  – Opportunities to add dedicated core-to-core links
  – Typical L1-to-L1 communication latencies may be around 10-100 cycles
• Other types of parallel machine:
  – Shared-memory multiprocessor: ~500 cycles
  – Cluster/supercomputer: ~5000-10000 cycles
Approaches to parallel programming
• “Principles of Parallel Programming”, Calvin Lin and Lawrence Snyder, Pearson, 2009
  – This book provides a good overview of the different approaches to parallel programming
• There is also a significant amount of information on the course wiki
  – Try some examples!