
Parallel Programs

Why Bother with Programs?
They’re what runs on the machines we design
• Helps make design decisions
• Helps evaluate systems tradeoffs
Led to the key advances in uniprocessor architecture
• Caches and instruction set design
More important in multiprocessors
• New degrees of freedom
• Greater penalties for mismatch between program and architecture

Important for Whom?
Algorithm designers
• Designing algorithms that will run well on real systems
Programmers
• Understanding key issues and obtaining best performance
Architects
• Understand workloads, interactions, important degrees of freedom
• Valuable for design and for evaluation

Next Three Sections of Class: Software
1. Parallel programs
• Process of parallelization
• What parallel programs look like in major programming models
2. Programming for performance
• Key performance issues and architectural interactions
3. Workload-driven architectural evaluation
• Beneficial for architects and for users in procuring machines
Unlike on sequential systems, can’t take workload for granted
• Software base not mature; evolves with architectures for performance
• So need to open the box
Let’s begin with parallel programs ...

Outline
Motivating Problems (application case studies)
Steps in creating a parallel program
What a simple parallel program looks like
• In the three major programming models
• What primitives must a system support?
Later: Performance issues and architectural interactions

Motivating Problems
Simulating Ocean Currents
• Regular structure, scientific computing
Simulating the Evolution of Galaxies
• Irregular structure, scientific computing
Rendering Scenes by Ray Tracing
• Irregular structure, computer graphics
Data Mining
• Irregular structure, information processing
• Not discussed here (read in book)

Simulating Ocean Currents
[Figure: (a) cross sections; (b) spatial discretization of a cross section]
• Model as two-dimensional grids
• Discretize in space and time
– finer spatial and temporal resolution => greater accuracy
• Many different computations per time step
– set up and solve equations
• Concurrency across and within grid computations

Simulating Galaxy Evolution
• Simulate the interactions of many stars evolving over time
• Computing forces is expensive
• $O(n^2)$ brute force approach
• Hierarchical methods take advantage of the force law: $G \frac{m_1 m_2}{r^2}$
[Figure: star on which forces are being computed; large group far enough away to approximate; small group far enough away to approximate to center of mass; star too close to approximate]
• Many time-steps, plenty of concurrency across stars within one time-step
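
As a rough illustration of the brute-force approach named above, the following sketch sums the force law over every pair of bodies, which is where the $O(n^2)$ cost comes from. The function name `pairwise_forces`, the 2-D positions, and the SI value of `G` are assumptions made for the example, not part of the slides, and the sketch assumes no two bodies share a position.

```python
import math

G = 6.674e-11  # gravitational constant in SI units (assumed for this example)

def pairwise_forces(masses, positions):
    """Return the net 2-D force on each body using the all-pairs O(n^2) sum."""
    n = len(masses)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            r2 = dx * dx + dy * dy          # squared distance between i and j
            r = math.sqrt(r2)
            f = G * masses[i] * masses[j] / r2   # magnitude from the force law
            fx, fy = f * dx / r, f * dy / r      # direction from i toward j
            forces[i][0] += fx
            forces[i][1] += fy
            forces[j][0] -= fx              # equal and opposite force on j
            forces[j][1] -= fy
    return forces
```

Each iteration of the outer loop touches only the force accumulators of bodies `i` and `j`, which hints at the concurrency across stars within one time-step; a parallel version would still have to coordinate the updates to shared accumulators.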

Rendering Scenes by Ray Tracing
• Shoot rays into scene through pixels in image plane
• Follow their paths
– they bounce around as they strike objects
– they generate new rays: ray tree per input ray
• Result is color and opacity for that pixel
• Parallelism across rays
All case studies have abundant concurrency

Creating a Parallel Program
Assumption: Sequential algorithm is given
• Sometimes need very different algorithm, but beyond scope
Pieces of the job:
• Identify work that can be done in parallel
• Partition work and perhaps data among processes
• Manage data access, communication and synchronization
• Note: work includes computation, data access and I/O
Main goal: Speedup (plus low programming effort and resource needs)
$\text{Speedup}(p) = \frac{\text{Performance}(p)}{\text{Performance}(1)}$
For a fixed problem:
$\text{Speedup}(p) = \frac{\text{Time}(1)}{\text{Time}(p)}$

Steps in Creating a Parallel Program
[Figure: sequential computation -> (decomposition) -> tasks -> (assignment) -> processes p0 to p3 -> (orchestration) -> parallel program -> (mapping) -> processors P0 to P3; decomposition and assignment together are labeled partitioning]
4 steps: Decomposition, Assignment, Orchestration, Mapping
• Done by programmer or system software (compiler, runtime, ...)
• Issues are the same, so assume programmer does it all explicitly

Some Important Concepts
Task:
• Arbitrary piece of undecomposed work in parallel computation
• Executed sequentially; concurrency is only across tasks
• E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
• Fine-grained versus coarse-grained tasks
Process (thread):
• Abstract entity that performs the tasks assigned to processes
• Processes communicate and synchronize to perform their tasks
Processor:
• Physical engine on which process executes
• Processes virtualize machine to programmer
– first write program in terms of processes, then map to processors

Decomposition
Break up computation into tasks to be divided among processes
• Tasks may become available dynamically
• No. of available tasks may vary with time
i.e. identify concurrency and decide level at which to exploit it
Goal: Enough tasks to keep processes busy, but not too many
• No. of tasks available at a time is upper bound on achievable speedup

Limited Concurrency: Amdahl’s Law
• Most fundamental limitation on parallel speedup
• If fraction $s$ of seq execution is inherently serial, speedup $\le 1/s$
• Example: 2-phase calculation
– sweep over $n$-by-$n$ grid and do some independent computation
– sweep again and add each value to global sum
• Time for first phase = $n^2/p$
• Second phase serialized at global variable, so time = $n^2$
• Speedup $\le \frac{2n^2}{n^2/p + n^2}$, or at most 2
• Trick: divide second phase into two
– accumulate into private sum during sweep
– add per-process private sum into global sum
• Parallel time is $n^2/p + n^2/p + p$, and speedup at best $\frac{2n^2}{2n^2/p + p} = \frac{2n^2 p}{2n^2 + p^2}$
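
The arithmetic of the two-phase example can be checked with a small sketch. It only evaluates the time expressions from the slide; the function names and the sample values of `n` and `p` are assumptions chosen for illustration.

```python
def two_phase_naive(n, p):
    """Speedup when phase 1 is parallel but phase 2 serializes on the global sum."""
    sequential_time = 2 * n * n        # two full sweeps on one processor
    parallel_time = n * n / p + n * n  # n^2/p for phase 1, n^2 for phase 2
    return sequential_time / parallel_time

def two_phase_private_sums(n, p):
    """Speedup when each process accumulates a private sum, then adds it globally."""
    sequential_time = 2 * n * n
    parallel_time = n * n / p + n * n / p + p  # two parallel sweeps plus p serial adds
    return sequential_time / parallel_time

if __name__ == "__main__":
    n, p = 1000, 100
    print(two_phase_naive(n, p))         # stays below 2 however large p gets
    print(two_phase_private_sums(n, p))  # close to p while p is much smaller than n
```

With `n = 1000` and `p = 100` the naive version gives a speedup of about 1.98, while the private-sum version gives about 99.5, which matches the bound of 2 and the near-linear behavior described above.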

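Pictorial Depiction
[Figure: work done concurrently (from 1 to p) plotted against time for the 2-phase example: (a) both phases serial, $n^2$ then $n^2$; (b) first phase parallel, $n^2/p$ then $n^2$; (c) both phases parallel, $n^2/p$, $n^2/p$, then $p$]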

Concurrency Profiles
• Cannot usually divide into serial and parallel part
[Figure: concurrency profile; x-axis: clock cycle number, y-axis: concurrency (0 to 1,400)]
• Area under curve is total work done, or time with 1 processor
• Horizontal extent is lower bound on time (infinite processors)
• Speedup is the ratio: $\frac{\sum_{k=1}^{\infty} f_k k}{\sum_{k=1}^{\infty} f_k \lceil k/p \rceil}$, base case: $\frac{1}{s + \frac{1-s}{p}}$
• Amdahl’s law applies to any overhead, not just limited concurrency
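
The ratio above can be evaluated directly from a profile. In this sketch the profile is assumed to be a dictionary mapping a concurrency level `k` to `f_k`, the amount of time spent at that level with unlimited processors; the name `profile_speedup` and the sample numbers are hypothetical.

```python
def profile_speedup(profile, p):
    """Speedup bound on p processors for a concurrency profile {k: f_k}."""
    # Area under the curve: total work, i.e. time on one processor.
    work = sum(f_k * k for k, f_k in profile.items())
    # With p processors, a step of concurrency k takes ceil(k / p) rounds.
    time_p = sum(f_k * ((k + p - 1) // p) for k, f_k in profile.items())
    return work / time_p

if __name__ == "__main__":
    # Base case: serial fraction s = 0.1, the rest perfectly parallel at k = 10.
    profile = {1: 0.1, 10: 0.09}
    print(profile_speedup(profile, 10))  # agrees with 1 / (s + (1 - s) / p)
```

The base-case profile reproduces the familiar Amdahl expression because the serial part contributes `s` to both sums while the parallel part contributes `1 - s` to the work and `(1 - s) / p` to the time.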

Assignment
Specifying mechanism to divide work up among processes
• E.g. which process computes forces on which stars, or which rays
• Together with decomposition, also called partitioning
• Balance workload, reduce communication and management cost
Structured approaches usually work well
• Code inspection (parallel loops) or understanding of application
• Well-known heuristics
• Static versus dynamic assignment
As programmers, we worry about partitioning first
• Usually independent of architecture or prog model
• But cost and complexity of using primitives may affect decisions
As architects, we assume program does reasonable job of it
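
To make the static versus dynamic distinction concrete, here is a minimal sketch using Python threads as stand-ins for processes. Block partitioning and a shared work queue are just two common choices, not necessarily the ones the course has in mind, and the names `static_assignment`, `dynamic_assignment`, and `work` are hypothetical.

```python
import queue
import threading

def static_assignment(num_tasks, num_procs):
    """Block partition: each process gets a contiguous range, fixed up front."""
    base, extra = divmod(num_tasks, num_procs)
    blocks, start = [], 0
    for pid in range(num_procs):
        size = base + (1 if pid < extra else 0)  # spread the remainder evenly
        blocks.append(range(start, start + size))
        start += size
    return blocks

def dynamic_assignment(tasks, num_procs, work):
    """Workers repeatedly pull the next task from a shared queue until it is empty."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    done = [[] for _ in range(num_procs)]  # results, grouped by worker

    def worker(pid):
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            done[pid].append(work(task))

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Static assignment has no run-time management cost but can leave processes idle when task sizes vary; the queue-based version balances load at the price of synchronizing on the shared queue, which is the cost of using primitives mentioned above.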

Orchestration
• Naming data
• Structuring communication
• Synchronization
• Organizing data structures and scheduling tasks temporally
Goals
• Reduce cost of communication and synch. as seen by processors
• Preserve locality of data reference (incl. data structure organization)
• Schedule tasks to satisfy dependences early
• Reduce overhead of parallelism management
Closest to architecture (and programming model & language)
• Choices depend a lot on comm. abstraction, efficiency of primitives
• Architects should provide appropriate primitives efficiently

Mapping
After orchestration, already have parallel program
Two aspects of mapping:
• Which processes will run on same processor, if necessary
• Which process runs on which particular processor
– mapping to a network topology
One extreme: space-sharing
• Machine divided into subsets, only one app at a time in a subset
• Processes can be pinned to processors, or left to OS
Another extreme: complete resource management control to OS
• OS uses the performance techniques we will discuss later
Real world is between the two
• User specifies desires in some aspects, system may ignore
Usually adopt the view: process <-> processor

Parallelizing Computation vs. Data
Above view is centered around computation
• Computation is decomposed and assigned (partitioned)
Partitioning data is often a natural view too
• Computation follows data: owner computes
• Grid example; data mining; High Performance Fortran (HPF)
But not general enough
• Distinction between comp. and data stronger in many applications
– Barnes-Hut, Raytrace (later)
• Retain computation-centric view
• Data access and communication is part of orchestration
