
Overview: Parallel computing platforms



Overview
• Parallel computing platforms
  – Approaches to building parallel computers
  – Today's chip-multiprocessor architectures
• Introduction to parallel computing
• Approaches to parallel programming
  – Programming with threads and shared memory
  – Message-passing libraries
  – PGAS languages
  – High-level parallel languages
Chip Multiprocessors (ACS MPhil), Robert Mullins

Parallel computers
• How might we exploit multiple processing elements and memories in order to complete a large computation quickly?
  – How many processing elements, and how powerful?
  – How do they communicate and cooperate?
• How are memories and processing elements interconnected?
• How is the memory hierarchy organised?
  – How might we program such a machine?

The control structure
• How are the processing elements controlled?
  – Centrally, from a single control unit, or can they work independently?
• Flynn's taxonomy:
  – Single Instruction Multiple Data (SIMD)
  – Multiple Instruction Multiple Data (MIMD)
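To make the SIMD side of Flynn's taxonomy concrete, here is a minimal C sketch (an illustration, not from the slides): every iteration of the loop applies the same operation to different data, so a SIMD machine can execute groups of iterations in lockstep under a single instruction stream, whereas a MIMD machine runs independent instruction streams on separate cores.

```c
/* SAXPY: a data-parallel loop. Each iteration performs the same
 * operation on different elements, the pattern SIMD hardware
 * (vector units, GPU warps) executes in lockstep. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* identical work per element */
}
```

A vectorising compiler (or a GPU, with one thread per element) can map such a loop directly onto SIMD lanes; loops whose iterations take data-dependent branches fit less well and require masked execution, as discussed on the next slide.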

The control structure
• SIMD
  – The scalar pipelines execute in lockstep
  – Data-independent logic is shared
  – Efficient for highly data-parallel applications
  – Much simpler instruction fetch and supply mechanism
  – SIMD hardware can support an SPMD model if the individual threads follow similar control flow
    • Masked execution
• [Figure: a generic streaming multiprocessor (for graphics applications). Reproduced from "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al.]

The communication model
• A clear distinction is made between two common communication models:
• 1. Shared-address-space platforms
  – All processors have access to a shared data space, accessed via a shared address space
  – All communication takes place via a shared memory
  – Each processing element may also have an area of memory that is private

Multi-core
• [Figure courtesy of Tim Harris, MSR]

The communication model
• 2. Message-passing platforms
  – Each processing element has its own exclusive address space
  – Communication is achieved by sending explicit messages between processing elements
  – The sending and receiving of messages can be used both to communicate between and to synchronize the actions of multiple processing elements
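To make the first model concrete, here is a minimal shared-address-space sketch in C with POSIX threads (an illustration, not from the slides); the two threads never exchange a message, they simply read and write the same location. A message-passing counterpart appears with the MPI example later in this section.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared-address-space communication: producer and consumer
 * communicate purely by writing and reading shared memory. */
static int shared_value;   /* visible to all threads in the process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    pthread_mutex_lock(&lock);
    shared_value = 42;     /* "send" by storing to shared memory */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL); /* join also orders the memory accesses */
    printf("received %d via shared memory\n", shared_value);
    return 0;
}
```

Compile with `cc -pthread`. Note that the communication is implicit: there is no destination address or message envelope, only a store by one thread that a later load by another observes.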

SMP multiprocessor
• [Figure courtesy of Tim Harris, MSR]

NUMA multiprocessor
• [Figure courtesy of Tim Harris, MSR]

Message-passing platforms
• The Transputer (1984)
  – The result of an earlier foray into the world of parallel computing!
  – The Transputer contained integrated serial links for building multiprocessors
  – IN/OUT instructions in the ISA for sending and receiving messages
  – Programmed in OCCAM (based on CSP)
• IBM Victor V256 (1991)
  – 16x16 array of transputers
  – The processors could be partitioned dynamically between different users

Message-passing platforms
• Many early message-passing machines provided hardware primitives that were close to the user-level send/receive communication commands
  – e.g. a pair of processors may be interconnected with a hardware FIFO queue
  – The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network)
• [Culler, Figure 1.22: network nodes numbered 000-111]
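The blocking channel discipline that OCCAM inherits from CSP can be sketched in software. The following C fragment (a hypothetical illustration, not Transputer or OCCAM code) models a one-slot hardware FIFO between two processing elements: the sender stalls while the slot is occupied and the receiver stalls while it is empty.

```c
#include <pthread.h>

/* A one-slot channel in the spirit of a Transputer link. */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int             full;   /* 1 if the slot holds an undelivered word */
    int             word;
} channel_t;

void chan_send(channel_t *c, int w)      /* cf. the Transputer OUT instruction */
{
    pthread_mutex_lock(&c->m);
    while (c->full)                      /* stall while the slot is occupied */
        pthread_cond_wait(&c->cv, &c->m);
    c->word = w;
    c->full = 1;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->m);
}

int chan_recv(channel_t *c)              /* cf. the Transputer IN instruction */
{
    pthread_mutex_lock(&c->m);
    while (!c->full)                     /* stall while the channel is empty */
        pthread_cond_wait(&c->cv, &c->m);
    int w = c->word;
    c->full = 0;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->m);
    return w;
}
```

A channel is initialised with `channel_t ch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };`. Note this one-slot buffer is slightly weaker than OCCAM's fully synchronous rendezvous, where the sender may not proceed until the receiver has taken the word.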

Message-passing platforms
• Recently some chip-multiprocessors have taken a similar approach (RAW/Tilera and XMOS)
  – Message queues (or more general communication channels) may be register mapped or accessed via special instructions
  – The processor stalls when reading an empty input queue or when trying to write to a full output buffer
• [Figure: a wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped.]
• (See also the iWarp paper on the wiki)

Message-passing platforms
• For larger message-passing machines (typically scientific supercomputers), direct FIFO designs were soon replaced by designs that built message passing upon remote memory copies (supported by DMA or a more general communication assist processor)
  – The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes
  – No restrictions on the programmer, and no special software support required
• Hardware and software evolution meant there was a general convergence of parallel machine organisations

Message-passing platforms
• The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations
  – Data movement must be specified at both ends of the communication; this is known as two-sided communication, e.g. MPI_Send and MPI_Recv*
  – Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped
• *Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines

One-sided communication
• SHMEM
  – Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g.:
    • shmem_put(target_addr, source_addr, length, remote_pe)
    • shmem_get, shmem_barrier etc.
  – One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement
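A minimal two-sided example using the real MPI API (the program itself is illustrative): rank 0 names the destination in MPI_Send and rank 1 names the source in MPI_Recv, so the transfer is specified at both ends of the communication.

```c
#include <mpi.h>
#include <stdio.h>

/* Two-sided communication: a matching MPI_Send / MPI_Recv pair. */
int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* the sender names the receiver (rank 1) ... */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... and the receiver names the sender (rank 0) */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```

Run with, e.g., `mpirun -np 2 ./a.out`. The non-blocking variants MPI_Isend and MPI_Irecv return immediately and are completed later with MPI_Wait, which is how computation and communication are overlapped. A one-sided shmem_put, by contrast, involves only the initiating processing element.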

The communication model
• From a hardware perspective we would like to keep the machine simple (message-passing)
• But we inevitably need to simplify the programmer's and compiler's task
  – Efficiently support shared-memory programming (a small sketch of the primitives involved follows at the end of this section)
  – Add support for transactional memory?
  – Create a simple but high-performance target
• Trade-offs between hardware complexity and software/compiler complexity

Today's chip multiprocessors
• Intel Nehalem-EX (2009)
  – 8 cores
    • 2-way hyperthreaded (SMT)
    • 16 hardware threads
  – L1I 32KB, L1D 32KB
  – 256KB L2 (private)
  – 24MB L3 (shared)
    • 8 banks
    • Inclusive L3
• [Die figure: Intel Nehalem-EX (2009), showing the L1 and L2 caches, the shared L3 and memory]

Today's chip multiprocessors
• IBM Power 7 (2010)
  – 8 cores (dual-chip module to hold 16 cores)
  – 32MB shared eDRAM L3 cache
  – 2-channel DDR3 controllers
  – Individual cores
    • 4-thread SMT per core
    • 6 ops/cycle
    • 4GHz
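The "efficiently support shared-memory programming" point is ultimately about hardware primitives. Here is a small C11 sketch (an illustration, not vendor code) of the kind of operation the memory system must make cheap: an atomic read-modify-write on a location shared by several cores, which typically turns into coherence traffic through the shared cache hierarchy.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Four threads hammer one shared counter. Each atomic_fetch_add
 * needs exclusive ownership of the counter's cache line, so the
 * line migrates between cores via the coherence protocol. */
static atomic_int counter;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add(&counter, 1);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %d\n", atomic_load(&counter)); /* 400000 */
    return 0;
}
```

Transactional memory generalises this primitive: instead of a single atomic word, a group of loads and stores commits atomically, with the hardware detecting conflicting accesses.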

Today's chip multiprocessors
• Sun Niagara T1 (2005)
  – Each core has its own level 1 cache (16KB for instructions, 8KB for data)
  – The level 2 caches are 3MB in total and are effectively 12-way associative
  – They are interleaved by 64-byte cache lines

Today's chip multiprocessors
• [Die figure: IBM Power 7 (2010)]

Oracle M7 Processor (2014)
• 32 cores
  – Dual-issue, OOO
  – Dynamic multithreading, 1-8 threads/core
  – 256KB I&D L2 caches shared by groups of 4 cores
• 64MB L3
• Technology: 20nm, 13 metal layers
• 16 DDR channels
  – 160GB/s (vs. ~20GB/s for the T1)
• >10B transistors!

"Manycore" designs: Tilera
• Tilera (now Mellanox)
  – Evolution of MIT RAW
  – 100 cores
    • Grid of identical tiles
    • Low-power 3-way VLIW cores
  – Cores interconnected by a selection of static and dynamic on-chip networks

"Manycore" designs: Celerity (2017)
• Tiered accelerator fabric
  – General-purpose tier: 5 "Rocket" RISC-V cores
  – Massively parallel tier: 496 5-stage RISC-V cores in a 16x31 tiled mesh array
  – Specialised tier: binarized neural network accelerator

GPUs
• Tesla P100
  – 56 streaming multiprocessors x 64 cores = 3584 "cores" or lanes
  – 732GB/s memory bandwidth
  – 4MB L2 cache
  – 15.3 billion transistors
• ["The NVIDIA GeForce 8800 GPU", Hot Chips 2007]

Communication latencies
• Chip multiprocessor
  – Some have very fast core-to-core communication, as low as 1-3 cycles
  – Opportunities to add dedicated core-to-core links
  – Typical L1-to-L1 communication latencies may be around 10-100 cycles (a simple way to measure this is sketched at the end of this section)
• Other types of parallel machine:
  – Shared-memory multiprocessor: ~500 cycles
  – Cluster/supercomputer: ~5,000-10,000 cycles

Approaches to parallel programming
• "Principles of Parallel Programming", Calvin Lin and Lawrence Snyder, Pearson, 2009
  – This book provides a good overview of the different approaches to parallel programming
• There is also a significant amount of information on the course wiki
  – Try some examples!
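The on-chip latencies above can be probed with a classic ping-pong microbenchmark. This C sketch (illustrative; results depend on which cores the two threads land on, so pinning them, e.g. with taskset, is advisable) bounces a flag through one shared cache line; each round trip costs roughly two cache-to-cache transfers.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

/* Core-to-core ping-pong: two threads alternately flip a shared
 * flag, forcing its cache line to migrate between their L1s. */
#define ROUNDS 1000000
static atomic_int flag;          /* 0: main's turn, 1: pong's turn */

static void *pong(void *arg)
{
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                    /* spin until the line arrives */
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;
    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                    /* wait for the reply */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("mean round trip: %.1f ns\n", ns / ROUNDS);
    pthread_join(t, NULL);
    return 0;
}
```

Dividing the reported time by the machine's clock period gives a round-trip figure in cycles that can be compared with the numbers above.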
