Parallel Software Design
ASD Shared Memory HPC Workshop
Computer Systems Group, ANU Research School of Computer Science
Australian National University, Canberra, Australia
February 14, 2020
Schedule - Day 5


  1. Software Patterns Finding Concurrency Dependency Analysis - Data Sharing Given a clear ordering between task groups, identify and manage access to shared data Overly coarse-grained synchronization can lead to poor scaling, e.g. using barriers between phases of a computation Shared data can be categorized as follows: Read-only : No access protection required on shared memory systems. Usually replicated on distributed memory systems. Effectively-local : Data partitioned into task-local chunks which need not have access protections Read-write : Data accessed by multiple task groups and must be protected by using exclusive-access methods such as locks, semaphores, monitors etc. Two special cases of read-write data are: Accumulate : A reduction operation that results in multiple tasks updating each shared data item and accumulating a result. For example, global sum, maximum or minimum Multiple-read/single-write : Data read by multiple tasks to obtain initial values, but eventually a single task modifies the data Computer Systems (ANU) Parallel Software Design Feb 14, 2020 17 / 141
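  As a concrete illustration of the accumulate case, a reduction can be expressed directly in OpenMP. The sketch below is a minimal example assuming a global sum over an array a of length n (the function name is illustrative, not part of any pattern library):

      /* Hedged sketch: the "accumulate" case expressed as an OpenMP reduction.
         Each thread accumulates into a private copy of sum; the runtime combines
         the private copies into the shared result when the loop ends. */
      double array_sum(const double *a, long n)
      {
          double sum = 0.0;
          #pragma omp parallel for reduction(+: sum)
          for (long i = 0; i < n; i++)
              sum += a[i];
          return sum;
      }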

  2. Software Patterns Finding Concurrency Design Evaluation Is the decomposition and dependency analysis good enough to move on to the next design space? Software design is an iterative process and is rarely perfected in a single iteration Evaluate the suitability of the design to the intended target platform Does the design need to cater to other parallel systems? Computer Systems (ANU) Parallel Software Design Feb 14, 2020 18 / 141

  3. Software Patterns Finding Concurrency Hands-on Exercise: Finding Concurrency Objective: To identify tasks and the different possible decompositions and perform dependency analysis Computer Systems (ANU) Parallel Software Design Feb 14, 2020 19 / 141

  4. Algorithmic Structure Patterns Outline 1 Software Patterns 2 Algorithmic Structure Patterns: Task Parallelism, Divide & Conquer, Geometric Decomposition, Recursive Data, Pipeline, Event-Based Coordination 3 Program and Data Structure Patterns 4 Systems on chip: Introduction 5 System-on-chip Processors 6 Emerging Paradigms and Challenges in Parallel Computing Computer Systems (ANU) Parallel Software Design Feb 14, 2020 20 / 141

  5. Algorithmic Structure Patterns Algorithm Structure Patterns Computer Systems (ANU) Parallel Software Design Feb 14, 2020 21 / 141

  6. Algorithmic Structure Patterns Algorithmic Structure – Objectives Develop an algorithmic structure for exploiting the concurrency identified in the previous space How can the concurrency be mapped to multiple units of execution? Many parallel algorithms exist but most adhere to six basic patterns as described in this space At this stage, consider: Target platform characteristics such as number of processing elements and how they communicate Avoid the tendency to over-constrain the design by making it too specific for the target platform! What is the major organizing principle implied by exposed concurrency, i.e. is there a particular way of looking at it that stands out? For example, tasks, data, or flow of data Computer Systems (ANU) Parallel Software Design Feb 14, 2020 22 / 141

  7. Algorithmic Structure Patterns Organize by Tasks For when the execution of the tasks themselves is the best organizing principle If the task groups form a linear set or can be spawned linearly, use the Task Parallelism pattern. This includes both embarrassingly parallel problems and situations where task groups have some dependencies, share data, and/or require communication If task groups are recursive, use the Divide and Conquer pattern. The problem is recursively broken into sub-problems and each of them are solved independently, then their solutions are recombined to form the complete solution Computer Systems (ANU) Parallel Software Design Feb 14, 2020 23 / 141

  8. Algorithmic Structure Patterns Task Parallelism The Task Parallelism Pattern Three key aspects: Task Definition : The functionality that constitutes a task or a group of tasks. There must be enough tasks to ensure a proper load balance Inter-task Dependencies : How different task groups interact and what the computation/communication costs or overheads of these interactions are Task Schedule : How task groups are assigned to different units of execution Computer Systems (ANU) Parallel Software Design Feb 14, 2020 24 / 141

  9. Algorithmic Structure Patterns Task Parallelism The Task Parallelism Pattern – Dependencies Dependencies between tasks can be further categorized as follows: Ordering Constraints : The program order in which task groups must execute Shared Data Dependencies : These can be further categorized into: Removable Dependencies : Those that can be resolved by simple code transformations Separable Dependencies : When accumulation into a shared data structure is required. Often the data structure can be replicated for each task group which then accumulates a local result. These local results are then combined to produce the final result once all task groups have finished. This is also known as a reduction operation Computer Systems (ANU) Parallel Software Design Feb 14, 2020 25 / 141

  10. Algorithmic Structure Patterns Task Parallelism The Task Parallelism Pattern – Scheduling The manner in which task groups are assigned to units of execution (UE) in order to ensure a good computational load balance. Two primary categories of schedules are: Static Schedule : the task group distribution among UEs is determined before program execution starts. For example, round-robin assignment of similar-sized task groups to UEs Dynamic Schedule : the task group distribution varies during program execution and is non-deterministic. This form of scheduling is used when: Task group sizes vary widely and/or are unpredictable The capabilities of UEs are either unknown or vary unpredictably Common approaches to implement dynamic scheduling are: Global Task Queue : There exists a global queue containing all task groups. Each UE removes task groups from the global queue when free and executes them Work Stealing : Each UE has its own local task queue which is populated before program execution starts. When a UE is finished with tasks in its local queue, it attempts to steal tasks from the other UEs’ local queues Computer Systems (ANU) Parallel Software Design Feb 14, 2020 26 / 141
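  A minimal sketch of static versus dynamic scheduling in OpenMP (work() and ntasks are placeholders for the real task function and task count, not part of the pattern itself):

      /* Hedged sketch: static vs. dynamic task scheduling with OpenMP loops. */
      void work(int i);   /* assumed to be provided elsewhere */

      void run_tasks_static(int ntasks)
      {
          /* iteration-to-thread mapping fixed before the loop runs */
          #pragma omp parallel for schedule(static)
          for (int i = 0; i < ntasks; i++)
              work(i);        /* suitable for similar-sized task groups */
      }

      void run_tasks_dynamic(int ntasks)
      {
          /* chunks of 4 iterations handed to whichever thread is idle,
             behaving like a global task queue */
          #pragma omp parallel for schedule(dynamic, 4)
          for (int i = 0; i < ntasks; i++)
              work(i);        /* suitable for irregular, unpredictable task sizes */
      }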

  11. Algorithmic Structure Patterns Divide & Conquer The Divide And Conquer Pattern – Sequential The sequential divide-and-conquer strategy solves a problem in the following manner:
      func solve     returns Solution;   // a solution stage
      func baseCase  returns Boolean;    // direct solution test
      func baseSolve returns Solution;   // direct solution
      func merge     returns Solution;   // combine sub-solutions
      func split     returns Problem[];  // split into subproblems

      Solution solve(Problem P) {
          if (baseCase(P))
              return baseSolve(P);
          else {
              Problem  subProblems[N];
              Solution subSolutions[N];
              subProblems = split(P);
              for (int i = 0; i < N; i++)
                  subSolutions[i] = solve(subProblems[i]);
              return merge(subSolutions);
          }
      }
  The recursive solve() function multiplies the available concurrency at each stage (doubling it for a two-way split) whenever the input is not a baseCase(). The baseSolve() function should only be called when the overhead of further splits/merges would worsen performance, or when the size of the problem is optimal for the target node, e.g. when the data fits into cache. Computer Systems (ANU) Parallel Software Design Feb 14, 2020 27 / 141

  12. Algorithmic Structure Patterns Divide & Conquer The Divide And Conquer Pattern – Sequential Computer Systems (ANU) Parallel Software Design Feb 14, 2020 28 / 141

  13. Algorithmic Structure Patterns Divide & Conquer The Divide and Conquer Pattern – Parallel The concurrency is obvious as the sub-problems can be solved independently One task for each invocation of the solve() function is mapped to a single UE Each task in effect dynamically generates and absorbs a task for each sub-problem For efficiency, a sequential solution may be used as soon as the size of a task goes below a particular threshold, or once all processing elements have enough work When the task sizes are irregular, it is helpful to use a Global Task Queue to maintain a healthy load balance Computer Systems (ANU) Parallel Software Design Feb 14, 2020 29 / 141
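  A common shared-memory realization of this pattern uses OpenMP tasks. The sketch below is hedged and concrete: it solves array summation rather than the abstract Problem/Solution types above, with THRESHOLD as a tuning parameter for the sequential cut-off:

      /* Hedged sketch: parallel divide and conquer with OpenMP tasks. */
      #define THRESHOLD 4096

      double sum_dc(const double *a, long n)
      {
          if (n <= THRESHOLD) {                       /* base case: solve directly */
              double s = 0.0;
              for (long i = 0; i < n; i++)
                  s += a[i];
              return s;
          }

          double s_left, s_right;
          #pragma omp task shared(s_left)             /* fork: child solves the left half */
          s_left = sum_dc(a, n / 2);
          s_right = sum_dc(a + n / 2, n - n / 2);     /* parent solves the right half */
          #pragma omp taskwait                        /* join the child task */

          return s_left + s_right;                    /* merge */
      }

      /* Typical invocation:
           #pragma omp parallel
           #pragma omp single
           total = sum_dc(a, n);
      */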

  14. Algorithmic Structure Patterns Divide & Conquer The Divide and Conquer Pattern – Parallel Computer Systems (ANU) Parallel Software Design Feb 14, 2020 30 / 141

  15. Algorithmic Structure Patterns Divide & Conquer Organize by Data Decomposition For when decomposition of data forms the major organizing principle in understanding concurrency of the problem If data can be broken into discrete subsets and operated on independently by task groups, then use the Geometric Decomposition pattern. Solutions for a subset may require data from a small number of other subsets, e.g. to satisfy boundary conditions in grid-based problems If the problem requires use of a recursive data structure such as a binary tree, use the Recursive Data pattern Computer Systems (ANU) Parallel Software Design Feb 14, 2020 31 / 141

  16. Algorithmic Structure Patterns Geometric Decomposition The Geometric Decomposition Pattern An expression of coarse-grained data parallelism Applicable for linear data structures such as arrays The key aspects of this pattern are: Data Decomposition : Decomposition of the data structure to substructures or chunks in a manner analogous to dividing a geometric region into sub-regions Update : Each chunk has an associated update task that computes a local result Exchange : Each task has the required data it needs, possibly from neighbouring chunks, to perform the update Task Schedule : Mapping the task groups and data chunks to UEs Computer Systems (ANU) Parallel Software Design Feb 14, 2020 32 / 141

  17. Algorithmic Structure Patterns Geometric Decomposition The Geometric Decomposition Pattern Important points to note: Chunk Granularity : The granularity or size of data chunks directly impacts the efficiency of the program Small number of large chunks ⇒ Smaller number of large message exchanges ⇒ Reduced communication overhead, Increased load balancing difficulty Large number of small chunks ⇒ Larger number of small message exchanges ⇒ Increased communication overhead, Decreased load balancing difficulty It is important to parameterize the granularity so that it can be fine tuned either at compile time or runtime Chunk Shape : Often data to be exchanged between tasks lie at the boundaries of their respective data chunks. This implies that minimizing the surface area of the chunks should reduce the amount of data that must be exchanged Ghost Boundaries : In order to reduce communication during execution, the boundary data required from other chunks can be replicated. A ghost boundary refers to duplicates of data at boundaries of neighbouring chunks Computer Systems (ANU) Parallel Software Design Feb 14, 2020 33 / 141
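  A minimal sketch of a 1D chunked update with ghost boundaries (a hypothetical Jacobi-style stencil; the chunk layout and halo copies are shown explicitly so the update step works purely on chunk-local data):

      /* Hedged sketch: 1D geometric decomposition with ghost boundaries.
         Each chunk stores n_local interior points plus one ghost cell on each side. */
      #include <string.h>

      typedef struct {
          int     n_local;   /* interior points owned by this chunk */
          double *u;         /* layout: [ghost_left | interior ... | ghost_right], size n_local + 2 */
      } Chunk;

      /* exchange: fill ghost cells from neighbouring chunks (NULL neighbour = fixed boundary) */
      void exchange(Chunk *c, const Chunk *left, const Chunk *right)
      {
          c->u[0]              = left  ? left->u[left->n_local] : 0.0;
          c->u[c->n_local + 1] = right ? right->u[1]            : 0.0;
      }

      /* update: averaging stencil on interior points only; u_new is scratch of the same size */
      void update(Chunk *c, double *u_new)
      {
          for (int i = 1; i <= c->n_local; i++)
              u_new[i] = 0.5 * (c->u[i - 1] + c->u[i + 1]);
          memcpy(&c->u[1], &u_new[1], c->n_local * sizeof(double));
      }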

  18. Algorithmic Structure Patterns Recursive Data The Recursive Data Pattern Applies to problems involving a recursive data structure that appear to require sequential processing to update all of its elements The key aspects of this pattern are: Data Decomposition : The recursive data structure is completely decomposed into individual elements and each element or group of elements is assigned to a task group running on a separate UE, which is responsible for updating its partial result Structure : The top-level operation is a sequential loop. Each iteration is a parallel update on all elements to produce a partial result. The loop ends when a result convergence condition is met. Synchronization : A partial result calculation might require combining results from neighbouring elements, leading to a requirement for communication between UEs at each loop iteration There are distinct similarities to the Divide and Conquer pattern, however this pattern does not require recursive spawning of tasks and has a static task schedule to start with Computer Systems (ANU) Parallel Software Design Feb 14, 2020 34 / 141

  19. Algorithmic Structure Patterns Recursive Data Organize by Flow of Data When the flow of data imposes an ordering on the execution of task groups and represents the major organizing principle If the flow of data is regular (static) and does not change during program execution, the task groups can be structured into a Pipeline pattern through which the data flows If the data flows in an irregular, dynamic or unpredictable manner, use the Event-Based Coordination pattern where the task groups may interact through asynchronous events Computer Systems (ANU) Parallel Software Design Feb 14, 2020 35 / 141

  20. Algorithmic Structure Patterns Pipeline The Pipeline Pattern The overall computation involves performing a calculation on many sets of data The calculation can be viewed as data flowing through a pre-determined sequence of stages similar to a factory assembly line Each stage in the pipeline computes the i-th step of the computation Each stage may be assigned to a different task and data elements may be passed from one task to another as operations are completed Notice that some resources are idle initially when the pipeline is being filled and again during the end of the computation when the pipeline is drained Computer Systems (ANU) Parallel Software Design Feb 14, 2020 36 / 141

  21. Algorithmic Structure Patterns Event-Based Coordination The Event-Based Coordination Pattern This pattern is used when the problem can be decomposed into semi-independent tasks interacting in an irregular fashion and the interactions are determined by the flow of data between these tasks No restriction to a linear structure like for the pipeline pattern Express data flow using abstractions called events Each event must have a task that generates it and a task that processes it The computation within each task can be defined as follows:
      initialize
      while (not done) {
          receive event
          process event
          send event
      }
      finalize
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 37 / 141

  22. Algorithmic Structure Patterns Event-Based Coordination Hands-on Exercise: Algorithm Structure Patterns Objective: To identify an algorithm structure for a parallel stencil-based computation and complete an implementation of it Computer Systems (ANU) Parallel Software Design Feb 14, 2020 38 / 141

  23. Program and Data Structure Patterns Outline 1 Software Patterns 2 Algorithmic Structure Patterns 3 Program and Data Structure Patterns: SPMD, Master-worker, Loop parallelism, Fork-join, Shared data, Shared queue, Distributed Array 4 Systems on chip: Introduction 5 System-on-chip Processors 6 Emerging Paradigms and Challenges in Parallel Computing Computer Systems (ANU) Parallel Software Design Feb 14, 2020 39 / 141

  24. Program and Data Structure Patterns Program and Data Structure Patterns Computer Systems (ANU) Parallel Software Design Feb 14, 2020 40 / 141

  25. Program and Data Structure Patterns Objective – Choosing the Right Pattern Program structures represent an intermediate stage between an algorithmic structure and the implemented source code They describe software constructions or structures that support the expression of parallel algorithms Choosing a program structure pattern is usually straightforward The outcomes of the algorithmic structure design space analysis should point towards a suitable program structure pattern In the table below, the number of ✓'s indicates the likely suitability of a program structure pattern for a particular algorithmic structure:
                           Task         Divide and   Geometric       Recursive   Pipeline   Event-Based
                           Parallelism  Conquer      Decomposition   Data                   Coordination
      SPMD                 ✓✓✓✓         ✓✓✓          ✓✓✓✓            ✓✓          ✓✓✓        ✓✓
      Loop Parallelism     ✓✓✓✓         ✓✓           ✓✓✓             ✗           ✗          ✗
      Master/Worker        ✓✓✓✓         ✓✓           ✓               ✓           ✓          ✓
      Fork/Join            ✓✓           ✓✓✓✓         ✓✓              ✗           ✓✓✓✓       ✓✓✓✓
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 41 / 141

  26. Program and Data Structure Patterns SPMD Program Structures: SPMD Pattern Single Program Multiple Data All UEs run the same program in parallel, but each has its own set of data Different UEs can follow different paths through the program A unique ID is associated with each UE which determines its course through the program and its share of global data that it needs to process Computer Systems (ANU) Parallel Software Design Feb 14, 2020 42 / 141

  27. Program and Data Structure Patterns SPMD Program Structures: SPMD Pattern The following program structure is implied by the SPMD Pattern: 1 Initialize : Load the program on a UE, and perform book-keeping. Establish communications with other UEs 2 Obtain a Unique Identifier : Computation may proceed differently on different UEs, conditional on ID 3 Execution on UE : Start the computation and have different UEs take different paths through the source code using: Branching statements to give specific blocks of code to different UEs Using the UE identifier in loop index calculations to split loop iterations among the UEs 4 Distribute Data : Global data is decomposed into chunks and stored in UE local memory based on the UE’s unique identifier 5 Finalize : Recombine the local results into a global data structure and perform cleanup and book-keeping Computer Systems (ANU) Parallel Software Design Feb 14, 2020 43 / 141
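  A minimal SPMD sketch using OpenMP thread IDs on shared memory (array name and scaling operation are illustrative; in a message-passing setting the same structure would use ranks and explicit communication):

      /* Hedged sketch: SPMD structure on shared memory.
         Every thread runs the same code; the unique ID determines its share of the data. */
      #include <omp.h>

      void spmd_scale(double *a, int n, double factor)
      {
          #pragma omp parallel
          {
              int id  = omp_get_thread_num();    /* unique identifier */
              int nue = omp_get_num_threads();   /* number of UEs     */

              /* split loop iterations among UEs using the ID */
              int chunk = (n + nue - 1) / nue;
              int lo = id * chunk;
              int hi = (lo + chunk < n) ? lo + chunk : n;

              for (int i = lo; i < hi; i++)
                  a[i] *= factor;
          }   /* implicit barrier: all local results visible after the region */
      }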

  28. Program and Data Structure Patterns Master-worker Program Structures: Master-Worker Pattern A master process or thread sets up a pool of worker processes or threads and a bag of tasks The workers execute concurrently, with each worker repeatedly removing a task from the bag of tasks and processing it, until all tasks have been processed or some other termination condition is reached Some implementations may have more than one master or no explicit master Computer Systems (ANU) Parallel Software Design Feb 14, 2020 44 / 141
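  A minimal bag-of-tasks sketch on shared memory, assuming tasks are indexed 0..ntasks-1 and process_task() is a placeholder; the "bag" is a shared counter and each worker atomically claims the next index until the bag is empty:

      /* Hedged sketch: master-worker with a shared bag of tasks. */
      void process_task(int id);     /* assumed to be provided elsewhere */

      void run_bag_of_tasks(int ntasks)
      {
          int next = 0;              /* the bag: index of the next unclaimed task */

          #pragma omp parallel shared(next)
          {
              for (;;) {
                  int mine;
                  #pragma omp atomic capture
                  mine = next++;     /* worker removes one task from the bag */

                  if (mine >= ntasks)
                      break;         /* termination: bag is empty */
                  process_task(mine);
              }
          }
      }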

  29. Program and Data Structure Patterns Master-worker Program Structures: Master Worker Pattern Computer Systems (ANU) Parallel Software Design Feb 14, 2020 45 / 141

  30. Program and Data Structure Patterns Loop parallelism Program Structures: Loop Parallelism Pattern This pattern addresses the problem of transforming a serial program whose runtime is dominated by a set of compute-intensive loops The concurrent tasks are identified as iterations of parallelized loops This pattern applies particularly to problems that already have a mature sequential code base, where major restructuring for high-performance parallelization is not a worthwhile investment When the existing code is available, the goal becomes incremental evolution of the sequential code to its final parallel form, one loop at a time Ideally, most of the changes required are localized around loop transformations to remove any loop-carried dependencies The OpenMP programming API was primarily created to support loop parallelism on shared-memory computers Computer Systems (ANU) Parallel Software Design Feb 14, 2020 46 / 141

  31. Program and Data Structure Patterns Loop parallelism Program Structures: Loop Parallelism Pattern Steps to undertake to apply this pattern to a sequential problem are: Identify Bottlenecks : Locate the most computationally intensive loops either by code inspection, understanding the problem, or by using performance analysis tools Eliminate Loop Carried Dependencies : Loop iterations must be independent in order to be parallelized Parallelize the loops : Distribute iterations among the UEs Optimize the loop schedule : Distribution among UEs must be evenly balanced Two commonly used loop transformations are: Merge Loops : If a problem consists of a sequence of loops that have consistent loop limits, the loops can often be merged into a single loop with more complicated iterations Coalesce Nested Loops : Nested loops can often be combined into a single loop with a larger combined iteration count, which might offset the overhead of nesting and create more tasks per UE, i.e. achieve better load balance Computer Systems (ANU) Parallel Software Design Feb 14, 2020 47 / 141
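  For example, coalescing a nested loop can be done in OpenMP with the collapse clause (a hedged sketch; the matrix addition and array names are illustrative):

      /* Sketch: coalescing nested loops.
         collapse(2) turns the i/j nest into a single iteration space of n*m
         iterations, giving the runtime more units of work to balance across UEs. */
      void add_matrices(int n, int m, double c[n][m],
                        const double a[n][m], const double b[n][m])
      {
          #pragma omp parallel for collapse(2) schedule(static)
          for (int i = 0; i < n; i++)
              for (int j = 0; j < m; j++)
                  c[i][j] = a[i][j] + b[i][j];
      }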

  32. Program and Data Structure Patterns Fork-join Program Structures: Fork/Join Pattern This pattern applies to problems where the number of concurrent tasks may vary during program execution The tasks are spawned dynamically or forked and later terminated or joined with the forking task A main UE forks off some number of other UEs that then continue in parallel to accomplish some portion of the overall work The tasks map onto UEs in different ways such as: A simple direct mapping where there is one task per UE An indirect mapping where a pool of UEs work on sets of tasks. If the problem consists of multiple fork-join sequences (which are expensive), then it is more efficient to first create a pool of UEs to match the number of processing elements. Then use a global task queue to map tasks to UEs as they are created Computer Systems (ANU) Parallel Software Design Feb 14, 2020 48 / 141
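  A sketch of the simple direct mapping using POSIX threads, where the main UE forks a worker and later joins it (the two work functions are hypothetical placeholders):

      /* Hedged sketch: direct fork/join with POSIX threads. */
      #include <pthread.h>

      void do_lower_half(void *args);   /* placeholders, assumed elsewhere */
      void do_upper_half(void *args);

      static void *child_work(void *args)   /* the forked UE */
      {
          do_lower_half(args);
          return NULL;
      }

      void fork_join_example(void *args)
      {
          pthread_t child;
          pthread_create(&child, NULL, child_work, args);  /* fork */
          do_upper_half(args);                             /* main UE works in parallel */
          pthread_join(child, NULL);                       /* join */
      }

  The indirect (pool-of-UEs) variant is what OpenMP tasks over a fixed thread team provide, as in the divide-and-conquer sketch earlier.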

  33. Program and Data Structure Patterns Shared data Data Structures: Shared Data Pattern This pattern addresses the problem of handling data that is shared by more than one UE Typical problems that this pattern applies to are: At least one data structure is accessed by multiple tasks in the course of the program’s execution At least one task modifies the shared data structure The tasks potentially need to use the modified value during the concurrent computation For any order of execution of tasks, the computation must be correct, i.e. shared data access must obey program order Computer Systems (ANU) Parallel Software Design Feb 14, 2020 49 / 141

  34. Program and Data Structure Patterns Shared data Data Structures: Shared Data Pattern To manage shared data, follow these steps: ADT Definition : Start by defining an abstract data type (ADT) with a fixed set of operations on the data If the ADT is a stack, then the operations would be push and pop If executed serially, these operations should leave the data in a consistent state Concurrency-control protocol : Devise a protocol to ensure that, if used concurrently, the operations on the ADT are sequentially consistent. Approaches for this include: Mutual exclusion and critical sections on a shared memory system. Minimize length of critical sections Assign shared data to a particular UE in a distributed memory system Identify non-interfering sets of operations Readers/writers protocol Nested locks Computer Systems (ANU) Parallel Software Design Feb 14, 2020 50 / 141
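  A minimal shared-data example: a trivial counter ADT whose fixed set of operations are each protected by a short critical section (a sketch, not a production implementation):

      /* Hedged sketch: concurrency-control protocol for a trivial ADT (a counter).
         All access goes through the fixed set of operations, each holding the lock briefly. */
      #include <pthread.h>

      typedef struct {
          long            value;
          pthread_mutex_t lock;
      } SharedCounter;

      void counter_init(SharedCounter *c)
      {
          c->value = 0;
          pthread_mutex_init(&c->lock, NULL);
      }

      void counter_add(SharedCounter *c, long x)
      {
          pthread_mutex_lock(&c->lock);    /* keep the critical section short */
          c->value += x;
          pthread_mutex_unlock(&c->lock);
      }

      long counter_read(SharedCounter *c)
      {
          pthread_mutex_lock(&c->lock);
          long v = c->value;
          pthread_mutex_unlock(&c->lock);
          return v;
      }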

  35. Program and Data Structure Patterns Shared queue Data Structures: Shared Queue Pattern An important shared data structure commonly utilized in the master/worker pattern This pattern represents a “thread-safe” implementation of the familiar queue ADT Concurrency-control protocols that encompass too much of the shared queue in a single synchronization construct increase the chances of UEs being blocked waiting for access Maintaining a single queue for systems with complicated memory hierarchies (such as NUMA systems) can cause excess communication and increase parallel overhead Computer Systems (ANU) Parallel Software Design Feb 14, 2020 51 / 141

  36. Program and Data Structure Patterns Shared queue Data Structures: Shared Queue Pattern Types of shared queues that can be implemented: Non-blocking queue : No interference by multiple UEs accessing the queue concurrently Block-on-empty queue : A UE trying to pop from an empty queue will wait until the queue has an element available Distributed shared queue : Each UE has a local queue that is shared with other UEs. Commonly used for work stealing Computer Systems (ANU) Parallel Software Design Feb 14, 2020 52 / 141
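  A block-on-empty shared queue can be sketched with a mutex and a condition variable (a minimal linked-list version; error handling and bounded capacity omitted):

      /* Hedged sketch: block-on-empty shared queue.
         Initialize lock/not_empty with PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER. */
      #include <pthread.h>
      #include <stdlib.h>

      typedef struct Node { void *item; struct Node *next; } Node;

      typedef struct {
          Node *head, *tail;
          pthread_mutex_t lock;
          pthread_cond_t  not_empty;
      } SharedQueue;

      void queue_push(SharedQueue *q, void *item)
      {
          Node *n = malloc(sizeof *n);
          n->item = item; n->next = NULL;
          pthread_mutex_lock(&q->lock);
          if (q->tail) q->tail->next = n; else q->head = n;
          q->tail = n;
          pthread_cond_signal(&q->not_empty);    /* wake one waiting popper */
          pthread_mutex_unlock(&q->lock);
      }

      void *queue_pop(SharedQueue *q)            /* blocks while the queue is empty */
      {
          pthread_mutex_lock(&q->lock);
          while (q->head == NULL)
              pthread_cond_wait(&q->not_empty, &q->lock);
          Node *n = q->head;
          q->head = n->next;
          if (q->head == NULL) q->tail = NULL;
          pthread_mutex_unlock(&q->lock);
          void *item = n->item;
          free(n);
          return item;
      }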

  37. Program and Data Structure Patterns Distributed Array Data Structures: Distributed Array Pattern One of the most commonly used data structures, i.e. an array This pattern represents arrays of one or more dimensions that are decomposed into sub-arrays and distributed among available UEs Particularly important when the geometric decomposition algorithmic structure is being utilized along with the SPMD program structure Although it primarily applies for distributed memory systems, it also has applications for NUMA systems Some commonly used array distributions are: 1D Block : The array is decomposed in one dimension only and distributed one block per UE. Sometimes referred to as column block or row block , in the context of 2D arrays 2D Block : One 2D block or tile per UE Block-cyclic : More blocks than UEs and blocks are assigned to UEs in a round-robin fashion Computer Systems (ANU) Parallel Software Design Feb 14, 2020 53 / 141
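  The distributions differ only in how a global index maps to an owning UE. A hedged sketch of the 1D owner computations (nUE = number of UEs, n = global extent, b = block size; function names are illustrative):

      /* Hedged sketch: owner computation for common 1D array distributions. */

      /* 1D block: ceil(n / nUE) contiguous elements per UE */
      int owner_block(int i, int n, int nUE)
      {
          int block = (n + nUE - 1) / nUE;
          return i / block;
      }

      /* Block-cyclic: blocks of size b dealt to UEs round-robin
         (b = 1 gives a purely cyclic distribution) */
      int owner_block_cyclic(int i, int b, int nUE)
      {
          return (i / b) % nUE;
      }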

  38. Program and Data Structure Patterns Distributed Array Summary Structured thinking is key when designing parallel software Patterns help in thought process but are not set in stone! Quality of software design matters most since software outlives hardware Computer Systems (ANU) Parallel Software Design Feb 14, 2020 54 / 141

  39. Program and Data Structure Patterns Distributed Array Hands-on Exercise: Program and Data Structure Patterns Objective: To identify and exploit parallelism in large matrix multiplication Computer Systems (ANU) Parallel Software Design Feb 14, 2020 55 / 141

  40. Systems on chip: Introduction Outline 1 Software Patterns 2 Algorithmic Structure Patterns 3 Program and Data Structure Patterns 4 Systems on chip: Introduction 5 System-on-chip Processors 6 Emerging Paradigms and Challenges in Parallel Computing Computer Systems (ANU) Parallel Software Design Feb 14, 2020 56 / 141

  41. Systems on chip: Introduction Systems on a Chip former systems can now be integrated into a single chip; usually for special-purpose systems; high speed per price and power; often have hierarchical networks (figure courtesy EDA360 Insider) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 57 / 141

  42. Systems on chip: Introduction On-chip Networks: Bus-based traditional model; buses have address and data pathways can be several 'masters' (operating the bus, e.g. CPU, DMA engine) in a multicore context, there may be many! (scalability issue) hence arbitration is a complex issue (and takes time!) techniques for improving bus utilization: burst transfer mode : multiple requests (in a regular pattern) once granted access pipelining : place the next address on the bus while data from the previous request is still being transferred broadcast : one master sends to all others, e.g. cache coherency - snoop, invalidation Computer Systems (ANU) Parallel Software Design Feb 14, 2020 58 / 141

  43. Systems on chip: Introduction On-Chip Networks: Current and Next Generation buses : only one device can access (make a 'transaction') at one time crossbars : devices are split into 2 groups of size p ; can have p transactions at once, provided at most 1 per device e.g. UltraSPARC T2: p = 8 cores and L2$ banks; Fermi GPU: p = 14 for largish p , the crossbar needs to be internally organized as a series of switches may also be organized as a ring for larger p , may need a more scalable topology such as a 2-D mesh (or torus , e.g. Intel SCC), or hierarchies of these Computer Systems (ANU) Parallel Software Design Feb 14, 2020 59 / 141

  44. Systems on chip: Introduction Sandy Bridge Ring On-Die Interconnect a ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domains has 4 physical rings: Data (32B), Request, Acknowledge and Snoop rings fully pipelined; bandwidth, latency and power scale with cores shortest path chosen to minimize latency has distributed arbitration & sophisticated protocol to handle coherency and ordering (courtesy www.lostcircuits.com) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 60 / 141

  45. Systems on chip: Introduction Cache Coherency Considered Harmful (also known as the 'Coherency Wall') a core writes at address x in its L1$; it must invalidate x in all other L1$s standard protocols require a broadcast message for each invalidation maintaining the (MOESI) protocol also requires a broadcast on every miss this also causes contention (& delay) in the network (worse than O(p²)?) directory-based protocols can direct invalidation messages to only the caches holding the same data far more scalable (e.g. SGI Altix SMP), for lightly-shared data for each cached line, need a bit vector of length p : O(p²) storage cost false sharing in any case results in wasted traffic (diagram: a directory bit vector b0–b7 recording which of cores P0–P7 hold a given cached line) hey, what about GPUs? atomic instructions sync down to the LLC, costing O(p) energy each! cache line size is sub-optimal for messages on on-chip networks Computer Systems (ANU) Parallel Software Design Feb 14, 2020 61 / 141

  46. System-on-chip Processors Outline 1 Software Patterns 2 Algorithmic Structure Patterns 3 Program and Data Structure Patterns 4 Systems on chip: Introduction 5 System-on-chip Processors: Motivation, Case Study: TI Keystone II SoC, Bare-metal Runtime on DSP, Programming DSP cores with OpenCL and OpenMP 6 Emerging Paradigms and Challenges in Parallel Computing Computer Systems (ANU) Parallel Software Design Feb 14, 2020 62 / 141

  47. System-on-chip Processors Motivation High Performance Computing Using more than one computer to solve a large scale problem Using clusters of compute nodes Large clusters = Supercomputers! Applications include: Data analysis Numerical Simulations Modeling Complex mathematical calculations Computations dominated by Floating-point operations (FLOPs) History dates back to the 1960s! Computer Systems (ANU) Parallel Software Design Feb 14, 2020 63 / 141

  48. System-on-chip Processors Motivation High Performance Computing The Summit Supercomputer (image: nbcnews.com) Each node has two 22-core IBM POWER9 processors and six NVIDIA V100 GPUs 4608 nodes = 2,414,592 compute cores Peak performance of 225 PetaFLOPs No. 1 on the Top 500 list, Nov 2019 Power consumption: 13 MWatt Computer Systems (ANU) Parallel Software Design Feb 14, 2020 64 / 141

  49. System-on-chip Processors Motivation High Performance Computing Power consumption is a major problem Power Consumption ∝ Heat Generation ∝ Cooling Requirement ∝ Increase in Maintenance Cost Majority of research targeting energy efficiency Alternative building blocks for Supercomputers? Energy-efficient system-on-chips with accelerators ✓ Computer Systems (ANU) Parallel Software Design Feb 14, 2020 65 / 141

  50. System-on-chip Processors Motivation System-on-chip Processors All components of a computer on a single chip Single/Multi core CPU One or more on-chip accelerators Computer Systems (ANU) Parallel Software Design Feb 14, 2020 66 / 141

  51. System-on-chip Processors Motivation A few of the ones we play with: the Adapteva Parallella Epiphany board, the NVIDIA Jetson TK1 board and the TI Keystone II evaluation module (images: arstechnica.com, anandtech.com, ti.com) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 67 / 141

  52. System-on-chip Processors Case Study: TI Keystone II SoC Case Study: TI Keystone II SoC Overview of the TI Keystone II (Hawking) SoC Very Long Instruction Word (VLIW) architecture Software-managed Cache-coherency on DSP cores How shared memory is managed between ARM and DSP cores Execution of a binary on the DSP cores from ARM Linux Bare-metal Runtime on DSP cores Programming for HPC on the Hawking Computer Systems (ANU) Parallel Software Design Feb 14, 2020 68 / 141

  53. System-on-chip Processors Case Study: TI Keystone II SoC TI 'Hawking' SoC Host: Quad-core ARM Cortex A15 Accelerator: Eight-core floating point C66X DSP Communication: Shared Memory, Hardware Queues Features: DSP: 32 KB L1-D, L1-P Cache DSP: 1 MB L2 Cache (Configurable as SRAM) DSP: Aggregate 157.184 SP GFLOPS ARM: 32 KB L1-D, L1-P Cache ARM: 4 MB L2 Shared Cache ARM: Aggregate 38.4 SP GFLOPS Common: 6 MB Shared SRAM ARM cores are cache coherent but DSP cores are not DSPs have no MMU, no virtual memory Power consumption around 15 Watts TDP (figure: TI K2H ARM-DSP SoC, ti.com) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 69 / 141

  54. System-on-chip Processors Case Study: TI Keystone II SoC ARM Cortex A15 MPCore 1-4 Cache Coherent Cores per on-chip cluster 32-bit ARMv7 Reduced Instruction Set Computer (RISC) instructions 40-bit Large Physical Address Extension (LPAE) addressing Out-of-order execution, branch prediction Vector Floating Point (VFPv4) unit per core Advanced Single Instruction Multiple Data (SIMD) extension aka NEON unit per core: 4 32-bit registers (quad) can hold a single 128-bit vector, 4 SP multiplies/cycle (figure: ARM Cortex-A15 Host CPU, geek.com) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 70 / 141

  55. System-on-chip Processors Case Study: TI Keystone II SoC TI C66x Digital Signal Processor 8-way Very Long Instruction Word (VLIW) processor Instruction level parallelism Compiler generates VLIW instructions composed of instructions for separate functional units that can run in parallel 8 RISC functional units in two sides: Multiplier (M): multiplication Data (D): load/store ALU (L) and Control (S): addition and branch Single Instruction Multiple Data (SIMD) up to 128-bit vectors 4 32-bit registers (quad) can hold a single 128-bit vector M: 4 SP multiplies/cycle L and S: 2 SP add/cycle 8 Multiply-Accumulate (MAC)/cycle (diagram: C66x DSP core block – 32 KB L1P and 32 KB L1D SRAM/cache, 1 MB L2 SRAM/cache, fetch/dispatch/execute pipeline, register files A and B each feeding L M S D functional units, plus prefetch, DMA, interrupt controller, embedded debug, emulation and power management blocks) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 71 / 141

  56. System-on-chip Processors Case Study: TI Keystone II SoC VLIW Architecture A very long instruction word consists of multiple independent instructions, which may be logically unrelated, packed together by the compiler The onus is on the compiler to statically schedule independent instructions into a single VLIW instruction Multiple functional units in the processor Instructions in a bundle are statically aligned to be directly fed into the functional units in lock-step Simple expression of Instruction Level Parallelism (diagram: the C66x core block diagram repeated from the previous slide) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 72 / 141

  57. System-on-chip Processors Case Study: TI Keystone II SoC VLIW Trade-offs Advantages: Hardware is simplified, i.e. no dynamic scheduling required VLIW instructions contain independent sub-instructions, i.e. no dependency checking required i.e. simplified instruction issue unit Instruction alignment/distribution is not required after fetch to separate functional units, i.e. simplified hardware Disadvantages: Compiler complexity, i.e. independent operations need to be found for every cycle NOPs are inserted when suitable operations are not found for a cycle When functional units or instruction latencies change, i.e. when executing on another processor implementing the same architecture, recompilation is required Lock-step execution causes independent operations to stall until the longest-latency instruction completes Computer Systems (ANU) Parallel Software Design Feb 14, 2020 73 / 141

  58. System-on-chip Processors Case Study: TI Keystone II SoC C66x DSP Cache Coherency C66X DSPs are not cache-coherent with each other Between flush points, threads do not access the same data – that would be a data race! The runtime triggers software-managed cache operations at flush points Costs around 1350 clock cycles for each flush operation Computer Systems (ANU) Parallel Software Design Feb 14, 2020 74 / 141

  59. System-on-chip Processors Case Study: TI Keystone II SoC Memory Hierarchy Available to both ARM and DSP: 8 GB DDR3 RAM ∼ 100 cycle access time 6 MB Scratchpad RAM (SRAM) ∼ 20 cycle access time Available to DSP only: 1 MB L2 cache per core configurable as SRAM ∼ 7 cycle access time 32 KB L1 data and L1 instruction cache per core configurable as SRAM ∼ 2 cycle access time Computer Systems (ANU) Parallel Software Design Feb 14, 2020 75 / 141

  60. System-on-chip Processors Case Study: TI Keystone II SoC Sharing memory between ARM and DSP Obstacles: No shared MMU between ARM and DSP cores No MMU on DSP cores No shared virtual memory What elements are required in order for ARM and DSP programs to share memory? Linux virtual memory mapping to shared physical memory Shared heap between ARM and DSP cores Memory management library that provides malloc/free routines into the shared heap (TI’s Contiguous Memory (CMEM) package) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 76 / 141

  61. System-on-chip Processors Case Study: TI Keystone II SoC Executing a binary on a DSP core The TI Multiprocess Manager (MPM) package allows linux userspace programs to load and run binaries onto the DSP cores individually Has two major components: Daemon ( mpmsrv ) and CLI utility ( mpmcl ) Uses the remoteproc driver the DSP output (trace) is obtained using the rpmsg bus (Requires a resource table entry in the loaded binary ELF sections) Maintained at: git.ti.com/keystone-linux/multi-proc-manager.git
      root@k2hk-evm:~# mpmcl status dsp0
      dsp0 is in running state
      root@k2hk-evm:~# mpmcl reset dsp0
      reset succeeded
      root@k2hk-evm:~# mpmcl status dsp0
      dsp0 is in reset state
      root@k2hk-evm:~# mpmcl load dsp0 main.out
      load successful
      root@k2hk-evm:~# mpmcl run dsp0
      run succeeded
      root@k2hk-evm:~# cat /sys/kernel/debug/remoteproc/remoteproc0/trace0
      Main started on core 0
      ...
      root@k2hk-evm:~#
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 77 / 141

  62. System-on-chip Processors Bare-metal Runtime on DSP C66x DSP Runtime Support System Eight DSP cores What does bare-metal execution mean? No OS running on DSP cores Each core boots every time a binary is loaded onto it The executable binary loaded on the DSP cores must at least provide: Task execution Memory management File I/O Inter-process communication (IPC) Basic task scheduling Computer Systems (ANU) Parallel Software Design Feb 14, 2020 78 / 141

  63. System-on-chip Processors Bare-metal Runtime on DSP A bare-metal runtime system A runtime library or a runtime system is software intended to support the execution of a program by providing: API implementations of programming language features Type-checking, debugging, code generation and optimization, possibly garbage-collection Access to runtime environment data structures Interface to OS system calls Without the presence of an OS, a runtime system must also provide the basic services of an OS such as those provided by a typical micro-kernel: Memory management Thread management IPC Computer Systems (ANU) Parallel Software Design Feb 14, 2020 79 / 141

  64. System-on-chip Processors Bare-metal Runtime on DSP Memory management Defining a physical memory region to place the heap data structure Can be in shared MSMC SRAM or DDR3 RAM The exact location is determined by the linker Memory sections can be specified via a linker command file Initialization of the heap Provide mutually exclusive access to the heap from all cores – provide Locking mechanisms for critical sections, i.e. mutex/semaphore C API functions: malloc, free, calloc, realloc, memalign Computer Systems (ANU) Parallel Software Design Feb 14, 2020 80 / 141

  65. System-on-chip Processors Bare-metal Runtime on DSP Thread management The runtime system can be considered as a single process running on a DSP core with multiple threads of work running within it. It manages: The execution of units of work or tasks on each DSP core Pointer to a function in memory: void (*fn)(void*) Pointer to argument buffer in memory: void* args Multiplexing threads of execution Task/thread dispatcher Scheduling of threads (Pre-defined policy) Maintain thread state data structures in non-cacheable memory Sharing the same address space for each core Sharing locks on local memory (L2 SRAM) Teams of threads Possible pre-emption Computer Systems (ANU) Parallel Software Design Feb 14, 2020 81 / 141

  66. System-on-chip Processors Bare-metal Runtime on DSP Inter-process Communication Fundamental to process execution Exchanging data between threads on one or more DSP cores Three primary methods: Mutually exclusive access to objects in shared memory Mutually exclusive access to objects in local memory Atomic access to hardware queues We focus on the use of hardware queues present in the Keystone II SoC Computer Systems (ANU) Parallel Software Design Feb 14, 2020 82 / 141

  67. System-on-chip Processors Bare-metal Runtime on DSP Hardware Queues Part of the Multicore Navigator present on the K2 SoC Queue Manager Sub-System (QMSS) 16384 queues Can be used via: QMSS Low-level drivers (LLD) Open Event Machine (OpenEM): Abstraction above QMSS LLD LIFO and FIFO configurations available Computer Systems (ANU) Parallel Software Design Feb 14, 2020 83 / 141

  68. System-on-chip Processors Bare-metal Runtime on DSP Hardware Queues What can you push to a hardware queue? The address of a single message descriptor at a time A message descriptor can be any data structure created by the user. Typically, a C struct of fixed size. 20 available memory regions from which at least one must be mapped and configured for message descriptor storage The descriptor size must be a multiple of 16 bytes and must be a minimum 32 bytes Computer Systems (ANU) Parallel Software Design Feb 14, 2020 84 / 141
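  The hardware constrains only the descriptor's size and placement; its contents are up to the runtime. A hedged sketch of the kind of fixed-size C struct one might push (field names are illustrative, not a TI API, assuming 32-bit DSP addresses):

      /* Hedged sketch: a user-defined message descriptor for a hardware queue. */
      #include <stdint.h>

      typedef struct {
          uint32_t task_id;     /* which task this message refers to             */
          uint32_t src_core;    /* sending DSP core                              */
          uint32_t fn_addr;     /* address of the task function                  */
          uint32_t args_addr;   /* address of the argument buffer in shared mem  */
          uint8_t  pad[16];     /* pad to 32 bytes: a multiple of 16, >= minimum */
      } MsgDescriptor;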

  69. System-on-chip Processors Bare-metal Runtime on DSP Hardware Queues Using hardware queues (figure: deyisupport.com) Each push and pop operation is atomic. Computer Systems (ANU) Parallel Software Design Feb 14, 2020 85 / 141

  70. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenCL A parallel programming library API specification that provides: A consistent execution model: Host, Devices, Compute Units, Processing Elements A consistent memory model: Global, Constant, Local, Private Asynchronous execution on the device Data-parallel: NDRange Index Space, Work Groups, Work Items Task-parallel: In-order and out-of-order queues, asynchronous dispatch Architecture invariant kernel specification Computer Systems (ANU) Parallel Software Design Feb 14, 2020 86 / 141

  71. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenCL Platform Model (figure: developer.amd.com) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 87 / 141

  72. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenCL Memory Model (figure: developer.amd.com) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 88 / 141

  73. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenCL Example: Vector Addition
      const char *kernelStr =
          "kernel void VectorAdd(global const short4* a, "
          "                      global const short4* b, "
          "                      global short4* c) "
          "{"
          "    int id = get_global_id(0);"
          "    c[id] = a[id] + b[id];"
          "}";

      /* Create device context */
      Context context(CL_DEVICE_TYPE_ACCELERATOR);
      std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

      /* Declare device buffers */
      Buffer bufA  (context, CL_MEM_READ_ONLY,  bufsize);
      Buffer bufB  (context, CL_MEM_READ_ONLY,  bufsize);
      Buffer bufDst(context, CL_MEM_WRITE_ONLY, bufsize);

      /* Create program from kernel string, compile program, associate with context */
      Program::Sources source(1, std::make_pair(kernelStr, strlen(kernelStr)));
      Program program = Program(context, source);
      program.build(devices);

      /* Set kernel arguments */
      Kernel kernel(program, "VectorAdd");
      kernel.setArg(0, bufA); kernel.setArg(1, bufB); kernel.setArg(2, bufDst);

      /* Create command queue */
      CommandQueue Q(context, devices[d], CL_QUEUE_PROFILING_ENABLE);

      /* Write data to device */
      Q.enqueueWriteBuffer(bufA, CL_FALSE, 0, bufsize, srcA, NULL, &ev1);
      Q.enqueueWriteBuffer(bufB, CL_FALSE, 0, bufsize, srcB, NULL, &ev2);

      /* Enqueue kernel */
      Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(NumVecElements),
                             NDRange(WorkGroupSize), NULL, &ev3);

      /* Read result */
      Q.enqueueReadBuffer(bufDst, CL_TRUE, 0, bufsize, dst, NULL, &ev4);
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 89 / 141

  74. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenMP Shared memory parallel programming specification: Using compiler directives to partition work across cores Fork-join model - master thread creates a team of threads for each parallel region Memory model does not require hardware cache coherency Data and task parallel Mature programming model: Spec v1.0 out in 1997 Suited to multi-core systems with shared memory Widely used in HPC community
      int i;
      #pragma omp parallel for
      for (i = 0; i < size; i++)
          c[i] = a[i] + b[i];
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 90 / 141

  75. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenMP Fork-Join Model (figure: computing.llnl.gov) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 91 / 141

  76. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenMP for Accelerators Recent addition in OpenMP 4.0 (Feb, 2014) Notion of host and target device Use target constructs to offload work from host to target device Target regions contain OpenMP parallel regions Map clauses specify data synchronization
      #pragma omp target map(to: a[0:size], b[0:size], size) \
                         map(from: c[0:size])
      {
          int i;
          #pragma omp parallel for
          for (i = 0; i < size; i++)
              c[i] = a[i] + b[i];
      }
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 92 / 141

  77. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP OpenMPAcc Library and compiler clacc : Shell compiler omps2s : Source-to-source translator libOpenMPAcc : Thin layer on top of OpenCL ARM - DSP communication and synchronization using OpenCL over shared memory Source-to-source lowering generates separate ARM and DSP source code and libOpenMPAcc API calls Computer Systems (ANU) Parallel Software Design Feb 14, 2020 93 / 141

  78. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP CLACC compiler Computer Systems (ANU) Parallel Software Design Feb 14, 2020 94 / 141

  79. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP Source-to-source translation Computer Systems (ANU) Parallel Software Design Feb 14, 2020 95 / 141

  80. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP Leveraging CMEM CMEM: Contiguous (Shared) Memory Can be used to access DRAM outside linux memory space Buffer usage same as normal malloc'd buffers No memcpy required when mapping data to target regions CMEM cache operations performed in libOpenMPAcc to maintain data consistency CMEM wbInvAll() used when data size is greater than a threshold
      float* buf_in_ddr  = (float *) __malloc_ddr(size_bytes);
      float* buf_in_msmc = (float *) __malloc_msmc(size_bytes);

      __free_ddr(buf_in_ddr);
      __free_msmc(buf_in_msmc);
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 96 / 141

  81. System-on-chip Processors Programming DSP cores with OpenCL and OpenMP Leveraging fast local memory Each DSP has 1 MB L2 cache 896 KB out of the 1 MB is configured as SRAM OpenCL uses 128 KB leaving 768 KB free Using the local map-type
      float* local_buf = malloc(sizeof(float) * size);

      #pragma omp target map(to: a[0:size], b[0:size], size) \
                         map(from: c[0:size]) \
                         map(local: local_buf[0:size])
      {
          #pragma omp parallel
          {
              int i;
              for (i = 0; i < size; i++)
                  local_buf[i] = a[i] + b[i];

              for (i = 0; i < size; i++)
                  c[i] = local_buf[i];
          }
      }
      free(local_buf);
  Computer Systems (ANU) Parallel Software Design Feb 14, 2020 97 / 141

  82. Emerging Paradigms and Challenges in Parallel Computing Outline 1 Software Patterns 2 Algorithmic Structure Patterns 3 Program and Data Structure Patterns 4 Systems on chip: Introduction 5 System-on-chip Processors 6 Emerging Paradigms and Challenges in Parallel Computing: Directed Acyclic Task-Graph Execution Models, Accelerators and Energy Efficiency Computer Systems (ANU) Parallel Software Design Feb 14, 2020 98 / 141

  83. Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models DAG Execution Models: Motivations and Ideas In most programming models, serial or parallel, the algorithm is over-specified sequencing that is not necessary is often specified specifying what (sub-) tasks of a program can run in parallel is difficult and error-prone the model may constrain the program to run on a particular architecture (e.g. single memory image) Directed acyclic task graph programming models specify only the necessary semantic ordering constraints we express an instance of an executing program as a graph of tasks a node has an edge pointing to a second node if there is a (data) dependency between them The DAG run-time system can then determine when and where each task executes, with the potential to extract maximum concurrency Computer Systems (ANU) Parallel Software Design Feb 14, 2020 99 / 141

  84. Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models DAG Execution Models: Motivations and Ideas In DAG programming models, we version data (say by iteration count); data thus has declarative, write-once semantics a node in a DAG will have associated with it: the input data items (including version number) required the output data items produced (usually with an updated version number) the function which performs this task Running a task-DAG program involves: generating the graph allowing an execution engine to schedule tasks to processors a task may execute when all of its input data items are ready the task informs the engine that its output data items have been produced before exiting Potential advantages include: maximizing parallelism, transparent load balancing arguably simpler programming model: no race hazards / deadlocks! abstraction over the underlying architecture permitting fault-tolerance: tasks on a failed process may be re-executed (requires that data items are kept in a 'resilient store') Computer Systems (ANU) Parallel Software Design Feb 14, 2020 100 / 141
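  On a single shared-memory node, the same idea can be expressed with OpenMP task dependences, where the runtime derives the DAG edges from the declared in/out data items. A hedged sketch with three placeholder tasks (produce_x, produce_y and combine are assumed, not part of any particular DAG runtime):

      /* Hedged sketch: a three-node task DAG via OpenMP task dependences.
         Task C may only run after tasks A and B have produced x and y. */
      double produce_x(void);               /* assumed elsewhere */
      double produce_y(void);
      double combine(double x, double y);

      double run_dag(void)
      {
          double x, y, z;
          #pragma omp parallel
          #pragma omp single
          {
              #pragma omp task depend(out: x) shared(x)
              x = produce_x();              /* task A */

              #pragma omp task depend(out: y) shared(y)
              y = produce_y();              /* task B, independent of A */

              #pragma omp task depend(in: x, y) depend(out: z) shared(x, y, z)
              z = combine(x, y);            /* task C: scheduled only after A and B */
          }
          return z;                         /* all tasks complete at the end of the region */
      }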
