Dynamic Fine-Grain Scheduling of Pipeline Parallelism
Daniel Sanchez, David Lo, Richard M. Yoo, Jeremy Sugerman, Christos Kozyrakis
Stanford University
PACT-20, October 11th, 2011
Executive Summary

- Pipeline-parallel applications are hard to schedule
  - Existing techniques either ignore pipeline parallelism, cannot handle its dependences, or suffer from load imbalance
- Contributions:
  - Design a runtime that dynamically schedules pipeline-parallel applications efficiently
  - Show it outperforms typical scheduling techniques from multicore, GPGPU, and streaming programming models
Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation
High-Level Programming Models

- High-level parallel programming models provide:
  - Simple, safe constructs to express parallelism
  - Automatic resource management and scheduling
- Many aspects; we focus on scheduling
  - Model, scheduler, and architecture are often intimately related
- In terms of scheduling, three main types of models:
  - Task-parallel models, typical in multicore (Cilk, X10)
  - Data-parallel models, typical in GPUs (CUDA, OpenCL)
  - Streaming models, typical in streaming architectures (StreamIt, StreamC)
Pipeline-Parallel Applications

- Some models (e.g., streaming) define applications as a graph of stages that communicate explicitly through queues
  - Each stage can be sequential or data-parallel
  - Arbitrary graphs allowed (multiple inputs/outputs, loops)

[Figure: ray tracing pipeline with Camera, Tiler, Sampler, Intersect, Shade, Shadow Intersect, and Frame Buffer stages]

- Well suited to many algorithms
- Producer-consumer communication is explicit
  - Easier to exploit to improve locality
- Traditional scheduling techniques have issues dynamically scheduling pipeline-parallel applications
Task-Parallel – Task-Stealing

- Model: Task-parallel with fork-join dependences or independent tasks (Cilk, X10, TBB, OpenMP, ...)
- Task-Stealing Scheduler:
  - Worker threads enqueue/dequeue tasks from a local queue
  - Steal from another queue if out of tasks

[Figure: worker threads T0...Tn, each with a local task queue; owners enqueue/dequeue locally, and idle workers steal from other queues]

- Efficient load balancing
- Unable to handle the dependences of pipeline-parallel programs
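A minimal sketch of this task-stealing pattern (hypothetical `WorkQueue`/`Task` types; production runtimes such as Cilk use lock-free deques rather than a mutex per queue):

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

// One queue per worker. The owner pushes/pops at the back (LIFO);
// thieves steal from the front, grabbing older (typically larger) tasks.
struct WorkQueue {
    std::deque<Task> tasks;
    std::mutex lock;  // real runtimes use lock-free deques instead

    void enqueue(Task t) {
        std::lock_guard<std::mutex> g(lock);
        tasks.push_back(std::move(t));
    }
    std::optional<Task> dequeue() {  // owner side: LIFO
        std::lock_guard<std::mutex> g(lock);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
    std::optional<Task> steal() {    // thief side: FIFO
        std::lock_guard<std::mutex> g(lock);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

// Worker loop: run local tasks; when the local queue runs dry,
// try to steal from the other workers' queues.
void workerLoop(std::size_t self, std::vector<WorkQueue>& queues) {
    for (;;) {
        if (auto t = queues[self].dequeue()) { (*t)(); continue; }
        bool stole = false;
        for (std::size_t v = 0; v < queues.size(); ++v) {
            if (v == self) continue;
            if (auto t = queues[v].steal()) { (*t)(); stole = true; break; }
        }
        if (!stole) return;  // simplified: a real worker would back off and retry
    }
}
```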
Data-Parallel – Breadth-First

- Model: Sequence of data-parallel kernels (CUDA, OpenCL)
- Breadth-First Scheduler: Execute one stage at a time, in breadth-first order (source to sink)

[Figure: stages 1–3 executed one at a time across worker threads T0–T3]

- Very simple model
- Ignores pipeline parallelism → works poorly with sequential stages, worst-case memory footprint
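To make the footprint problem concrete, here is a minimal sketch of breadth-first execution over a linear pipeline (hypothetical `Kernel` type; real GPU runtimes launch kernels over device buffers):

```cpp
#include <functional>
#include <vector>

template <typename T>
using Kernel = std::function<std::vector<T>(const std::vector<T>&)>;

// Breadth-first execution: each kernel consumes the *entire* output of
// the previous one, with an implicit barrier in between. Every
// intermediate buffer is fully materialized (worst-case footprint),
// and a sequential stage serializes the whole step because no other
// stage can run concurrently with it.
template <typename T>
std::vector<T> runBreadthFirst(std::vector<T> data,
                               const std::vector<Kernel<T>>& stages) {
    for (const auto& stage : stages)
        data = stage(data);  // one stage at a time, source to sink
    return data;
}
```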
Streaming – Static Scheduling

- Model: Graph of stages communicating through streams
- Static Scheduler:
  - Assumes the app and architecture are regular and known in advance
  - Uses sophisticated compile-time analysis and scheduling to minimize inter-core communication and memory footprint
- Very efficient if the application and architecture are regular
- Load imbalance with irregular applications or non-predictable architectures (DVFS, multithreading, ...)
Summary of Scheduling Techniques

                  Supports pipeline-parallel apps   Supports irregular apps/archs
  Task-Stealing              No                                Yes
  Breadth-First              No                                Yes
  Static                     Yes                               No
  GRAMPS                     Yes                               Yes
Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation
GRAMPS Programming Model

- Programming model for dynamic scheduling of irregular pipeline-parallel workloads
  - Brief overview here; details in [Sugerman 2010]
- Shader (data-parallel) and Thread (sequential) stages
- Stages send packets through fixed-size data queues
  - Queues can be ordered or unordered
  - Can enqueue full packets or push elements (coalesced by the runtime)

[Figure: ray tracing pipeline annotated with Shader stages, Thread stages, queues, and push queues]
GRAMPS: Threads vs Shaders

- Threads are stateful, instanced by the programmer
  - Arbitrary number of input and output queues
  - Block on empty input / full output queue
  - Can be preempted by the scheduler
- Shaders are stateless, automatically instanced
  - Single input queue, one or more outputs
  - Each instance processes an input packet
  - Do not block
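A hypothetical C++ rendering of the two stage contracts, just to make the distinction concrete (the actual GRAMPS API differs):

```cpp
struct Packet { /* fixed-size payload */ };

// Thread stage: stateful and instanced by the programmer, with an
// arbitrary number of input and output queues. Its run() method is a
// long-lived computation that blocks on an empty input or a full
// output, and the scheduler may preempt it.
class ThreadStage {
public:
    virtual ~ThreadStage() = default;
    virtual void run() = 0;
};

// Shader stage: stateless and automatically instanced by the runtime,
// with a single input queue and one or more outputs. One instance is
// created per input packet; it runs to completion and never blocks.
class ShaderStage {
public:
    virtual ~ShaderStage() = default;
    virtual void execute(const Packet& input) = 0;
};
```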
GRAMPS Scheduling

- Similar model to streaming, but several features ease dynamic scheduling of irregular applications:
  - Packet granularity → reduced scheduling overheads
  - Stages can produce variable output (e.g., push queues)
  - Data-parallel stages and queue ordering are explicit
- Static scheduling requires applications to have a steady state; GRAMPS can schedule apps with no steady state
- GRAMPS was evaluated with an idealized scheduler when proposed; we implement a real multicore runtime
Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation
GRAMPS Runtime Overview

- Runtime = Scheduler + Buffer Manager
- Scheduler: Decides what to run where
  - Dynamic, low-overhead, keeps a bounded footprint
  - Based on task-stealing with multiple task queues per thread
- Buffer Manager: Provides dynamic allocation of packets
  - Generic memory allocators are too slow for communication-intensive applications
  - Low-overhead solution, based on packet-stealing
Scheduler Organization

- As many worker pthreads as hardware threads
- Work is represented with tasks
- Shader stages are function calls (stateless, non-preemptive)
  - One task per runnable shader instance
- Thread stages are user-level threads (stateful, preemptive)
  - User-level threads enable fast context switching (~100 cycles)
  - One task per runnable thread
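The two task kinds might be represented roughly as follows (hypothetical types; the fast user-level-thread implementation is assumed, not shown):

```cpp
// Hypothetical task representation. A shader task is a plain function
// call bound to one input packet; a thread task resumes a user-level
// thread context.
struct Packet;
struct UserLevelThread;

struct ShaderTask {
    void (*kernel)(const Packet*);  // stateless; runs to completion
    const Packet* input;            // one task per runnable shader instance
};

struct ThreadTask {
    UserLevelThread* ctx;           // stateful; can block and be resumed
};

struct Task {
    enum class Kind { Shader, Thread } kind;
    union {
        ShaderTask shader;
        ThreadTask thread;
    };
};
```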
Scheduler: Task Queues

- Load balancing via task stealing
- Each thread has one LIFO task queue per stage
- Stages sorted in breadth-first order (higher priority to consumers)
  - Dequeue from high-priority stages first; steal from low-priority stages first
  - Higher-priority tasks drain the pipeline and improve locality
  - Lower-priority tasks produce more work (less stealing)

[Figure: per-stage task queues across workers, showing dequeue order (consumers first) and steal order (producers first)]
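A simplified sketch of this policy, assuming stages are indexed in breadth-first order and leaving out synchronization:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

using Task = std::function<void()>;

// Per-worker scheduler state: one LIFO task queue per stage, with
// stages indexed in breadth-first order (larger index = closer to the
// sink = higher priority).
struct WorkerQueues {
    std::vector<std::vector<Task>> perStage;  // perStage[s] is a LIFO

    // Owner: dequeue from the highest-priority (consumer) stage first,
    // draining the pipeline and improving locality.
    std::optional<Task> dequeue() {
        for (std::size_t s = perStage.size(); s-- > 0; ) {
            if (!perStage[s].empty()) {
                Task t = std::move(perStage[s].back());
                perStage[s].pop_back();
                return t;
            }
        }
        return std::nullopt;
    }

    // Thief: steal from the lowest-priority (producer) stage first;
    // producer tasks generate more downstream work, so each steal
    // goes further and stealing stays rare.
    std::optional<Task> steal() {
        for (std::size_t s = 0; s < perStage.size(); ++s) {
            if (!perStage[s].empty()) {
                Task t = std::move(perStage[s].front());
                perStage[s].erase(perStage[s].begin());
                return t;
            }
        }
        return std::nullopt;
    }
};
```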
Scheduler: Data Queues

- Thread input queues are maintained as linked lists
- Shader input queues are implicitly maintained in the task queues
  - Each shader task includes a pointer to its input packet
- Queue occupancy is tracked for all queues
- Backpressure: when a queue fills up, disable dequeues and steals from the queue's producers
  - Producers remain stalled until packets are consumed; workers shift to other stages
  - Queues never exceed capacity → bounded footprint
- Queues are optionally ordered (see the paper for details)
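A minimal sketch of the backpressure rule (hypothetical types; the actual runtime's bookkeeping is more involved):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Each data queue tracks its occupancy against a fixed capacity.
struct DataQueue {
    std::atomic<std::size_t> occupancy{0};
    std::size_t capacity;
    explicit DataQueue(std::size_t cap) : capacity(cap) {}
    bool full() const { return occupancy.load() >= capacity; }
};

// Backpressure rule: a producer stage's tasks may be dequeued or
// stolen only while all of its output queues have room. Otherwise the
// worker skips that stage and runs tasks from other stages until
// consumers drain some packets.
bool producerRunnable(const std::vector<const DataQueue*>& outputs) {
    for (const DataQueue* q : outputs)
        if (q->full()) return false;  // stall this producer stage
    return true;
}
```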
Example

[Figure: scheduling example — a three-stage pipeline with Queue 1 and Queue 2, per-stage task queues on workers T0–T3, and shader/thread queue occupancies]
Example (cont.)

[Figure: continuation of the scheduling example, with Queue 2 now at capacity]

- Queue 2 full → disable dequeues and steals from Stage 2
Packet-Stealing Buffer Manager

- Packets are pre-allocated into a set of pools
  - Each pool has packets of a specific size
- Each worker thread maintains a LIFO queue per pool
  - Releases used input packets to the local queue
  - Allocates new output packets from the local queue; if empty, steals
- Due to bounded queue sizes, no need to dynamically allocate packets
- LIFO policy results in high locality and reuse
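A sketch of the packet-stealing idea (hypothetical `PacketPool` type), using one coarse lock for brevity where the real buffer manager would use finer-grained synchronization:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// One pool per packet size; each worker has its own LIFO free list.
// LIFO reuse keeps recently-touched packets hot in the local caches.
struct PacketPool {
    std::size_t packetSize;
    std::vector<std::vector<void*>> freeLists;  // one LIFO per worker
    std::mutex lock;  // coarse lock for brevity

    PacketPool(std::size_t size, std::size_t numWorkers)
        : packetSize(size), freeLists(numWorkers) {}

    void* allocate(std::size_t worker) {
        std::lock_guard<std::mutex> g(lock);
        // Fast path: pop from the worker's own free list.
        if (!freeLists[worker].empty()) {
            void* p = freeLists[worker].back();
            freeLists[worker].pop_back();
            return p;
        }
        // Slow path: steal a free packet from another worker's list.
        for (auto& fl : freeLists) {
            if (!fl.empty()) {
                void* p = fl.back();
                fl.pop_back();
                return p;
            }
        }
        return nullptr;  // unreachable if queue capacities bound total demand
    }

    void release(std::size_t worker, void* packet) {
        std::lock_guard<std::mutex> g(lock);
        freeLists[worker].push_back(packet);  // return to the local LIFO
    }
};
```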
Outline

- Introduction
- GRAMPS Programming Model
- GRAMPS Runtime
- Evaluation
Methodology

- Test system: 2-socket, 12-core, 24-thread Westmere
  - 32KB L1I+D, 256KB private L2, 12MB per-socket L3
  - 48GB 1333MHz DDR3 memory, 21GB/s peak BW
- Benchmarks from different programming models:
  - GRAMPS: raytracer
  - MapReduce: histogram, lr, pca (opt)
  - Cilk: mergesort
  - StreamIt: fm, tde, fft2, serpent
  - CUDA: srad, recursiveGaussian

[Figure: application graphs for selected benchmarks, including the MapReduce (Split/Map/Combine/Reduce) and mergesort (Part/Sort/Merge/Serial Combine) pipelines]
Alternative Schedulers

- The GRAMPS scheduler can be substituted with other implementations to compare scheduling approaches
- Task-Stealing: Single LIFO task queue per thread, no backpressure
- Breadth-First: One stage at a time; may do multiple passes due to loops; no backpressure
- Static: Application is profiled first, then partitioned using METIS and scheduled with a min-latency schedule, using per-thread data queues
GRAMPS Scheduler Scalability

[Figure: speedup vs. number of threads for all applications]

- All applications scale well
- Knee at 12 threads due to HW multithreading
- Sublinear scaling due to memory bandwidth (hist, CUDA apps)
Performance Comparison

[Figure: execution time breakdown per scheduler, across the GRAMPS, MapReduce, Cilk, StreamIt, and CUDA applications]
Performance Comparison

- Dynamic runtime overheads are small in GRAMPS
- Task-Stealing performs worse on complex graphs (fm, tde, fft2)
- Breadth-First does poorly when parallelism comes from pipelining
- Static has no overheads and better locality, but higher stalled time due to load imbalance
Footprint Comparison

- Task-Stealing fails to keep footprint bounded (tde)
- Breadth-First has worst-case footprints → much higher footprint and memory bandwidth requirements