Vectorization and Mapping of Software Defined Radio Applications on GPU Platforms

Shuvra S. Bhattacharyya
Professor, ECE and UMIACS
University of Maryland at College Park
ssb@umd.edu, http://www.ece.umd.edu/~ssb

With contributions from G. Zaki, W. Plishker, C. Clancy, and J. Kuykendall

GPU Summit at the UMD/NVIDIA CUDA Center for Excellence, College Park, MD, October 27, 2014

Outline
• Introduction
• Contribution: a novel vectorization and mapping workflow
• Evaluation
• Summary
Outline
• Introduction
• Contribution: a novel vectorization and mapping workflow
• Evaluation
• Summary

DSPCAD Methodologies: Computer-Aided Design (CAD) for Digital Signal Processing (DSP) Systems [Bhattacharyya 2013]

[Figure: platforms (programmable DSP, GPU, FPGA, microcontroller) paired with DSP application domains and their processing pipelines:
• Image (medical, computer vision, feature detection, etc.): imaging device → data preprocessing → image reconstruction → post-reconstruction image analysis → advanced visualization
• Video (coding, compression, etc.): color transformation and processing → prediction → quantization → entropy coding
• Audio (sample rate conversion, speech, etc.): audio device → data preprocessing → feature extraction → data postprocessing
• Wireless communication systems: source encoding → channel encoding → digital modulation → D/A conversion → RF back-end]
Motivation
• Diversity of platforms: ASICs, FPGAs, DSPs, GPUs, GPPs
• Complex application environments (e.g., GNU Radio)
• Exposing parallelism: task, data, and pipeline parallelism
• A difficult mapping problem
• Multi-objective optimization (throughput, latency)

Background: GNU Radio
• A software development framework that provides software defined radio (SDR) developers with a rich library and a customized runtime engine for designing and testing radio applications [Blossom 2004]
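For readers unfamiliar with GNU Radio, a minimal flowgraph sketch follows; the block choices are illustrative and not from the talk, and it assumes a working GNU Radio installation. Actors come from the block library, connect() creates FIFO edges, and the runtime engine schedules execution.

```python
# A minimal GNU Radio flowgraph: source -> multiplier -> sink.
from gnuradio import gr, blocks

class Scale(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self, "scale")
        src = blocks.vector_source_f([1.0, 2.0, 3.0], repeat=False)
        mul = blocks.multiply_const_ff(2.0)   # a simple DSP actor
        dst = blocks.vector_sink_f()
        self.connect(src, mul, dst)           # edges are FIFO connections
        self.sink = dst

tb = Scale()
tb.run()               # the runtime engine schedules the actors
print(tb.sink.data())  # (2.0, 4.0, 6.0)
```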
DSP-oriented Dataflow Models of Computation
• Application is modeled as a directed graph
  – Nodes (actors) represent functions of arbitrary complexity
  – Edges represent communication channels between functions
  – Nodes produce and consume data from edges
  – Edges buffer data (logically) in a FIFO (first-in, first-out) fashion
• Data-driven execution model
  – An actor can execute whenever it has sufficient data on its input edges.
  – The order in which actors execute is not part of the specification.
  – The order is typically determined by the compiler, the hardware, or both.
• Iterative execution
  – The body of the loop is iterated a large or infinite number of times.

DSP-oriented Dataflow Graphs
• Vertices (actors) represent computational modules
• Edges represent FIFO buffers
• Edges may have delays, implemented as initial tokens
• Tokens are produced and consumed on edges
• Different models have different rules for production (SDF → fixed, CSDF → periodic, BDF → dynamic)

[Figure: dataflow graph X → Y → Z with edges e1 (X to Y) and e2 (Y to Z); production rates p1,i, p2,i and consumption rates c1,i, c2,i annotate the edges, and a delay of 5 initial tokens appears on one edge]
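A minimal Python sketch of the data-driven execution rule described above, assuming fixed SDF rates; the graph, rates, and firing order below are hypothetical.

```python
from collections import deque

# Edges buffer tokens in FIFO order; an actor may fire whenever
# every input edge holds enough tokens (the data-driven rule).
graph = {
    # edge: (producer, consumer, production rate, consumption rate)
    "e1": ("X", "Y", 2, 1),
    "e2": ("Y", "Z", 1, 3),
}
buffers = {e: deque() for e in graph}

def fireable(actor):
    """Sufficient tokens on all input edges of the actor."""
    return all(len(buffers[e]) >= c
               for e, (_, dst, _, c) in graph.items() if dst == actor)

def fire(actor, token=0):
    for e, (src, dst, p, c) in graph.items():
        if dst == actor:                 # consume from input FIFOs
            for _ in range(c):
                buffers[e].popleft()
        if src == actor:                 # produce onto output FIFOs
            buffers[e].extend([token] * p)

# One possible execution order; the order itself is not part of
# the dataflow specification.
for actor in ["X", "Y", "Y", "X", "Y", "Z"]:
    if fireable(actor):
        fire(actor)
print({e: len(buf) for e, buf in buffers.items()})  # {'e1': 1, 'e2': 0}
```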
Dataflow Production and Consumption Rates

[Figure: the same X → Y → Z graph, highlighting the production rates p1,i, p2,i and consumption rates c1,i, c2,i on edges e1 and e2]

Dataflow Graph Scheduling
• Assigning actors to processors, and ordering actor subsets that share common processors
• Here, a "processor" means a hardware resource for actor execution on which assigned actors are time-multiplexed
• Scheduling objectives include:
  – Exploiting parallelism
  – Buffer management
  – Minimizing power/energy consumption
Background: Contemporary Architectures
• Vector operations in general purpose processors (GPPs)
• Graphics processing units (GPUs)

Primary Contribution
• A novel workflow for scheduling SDF graphs while taking into account:
  – Actor execution times
  – Efficient vectorization
  – Heterogeneous multiprocessors
• Demonstration system:
  – Applications described in a domain-specific language
  – Systematic integration of precompiled libraries
  – Targeted to architectures consisting of GPPs and GPUs
Previous Work
• Automatic SIMDization [Hormati 2010]: based on the StreamIt compiler
• Hierarchical models for SDR [Lin 2007]: targeted towards special architectures
• Multiprocessor scheduling [Stuijk 2007]: formulated towards special objectives
• Vectorization [Ritz 1992]: single-processor block processing optimization

Outline
• Introduction
• Contribution: a novel vectorization and mapping workflow
• Evaluation
• Summary
DIF-GR-GPU Workflow
• Start from a model-based application description.
• Use tools to optimize scheduling and assignment.
• Generate an accelerated GNU Radio system.

[Figure: workflow. A GNU Radio dataflow graph feeds the dataflow scheduler, which applies throughput/latency constraints to produce an application graph; the multiprocessor scheduler combines this graph with a platform description and actor profiles to produce a mapping and ordering schedule; the GNU Radio engine then uses a library of actor implementations to generate the final implementation.]

Workflow Goals
• Exploit all sources of parallelism in order to fully utilize the underlying architecture. Sources of parallelism:
  – Data parallelism (production and consumption rates in SDF)
  – Task parallelism (implicit in the dataflow graph)
  – Pipeline parallelism (looped schedules)

[Figure: small examples; an edge A → B with rates 100 and 1 illustrating data parallelism, and a graph with parallel branches A → B and A → C illustrating task parallelism]
SDF Scheduling Preliminaries
• An SDF graph G = (V, E) has a valid (periodic) schedule if it is deadlock-free and sample rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).
• For each actor v in a consistent SDF graph, there is a unique repetition count q(v), which gives the number of times that v must be executed in a minimal valid schedule.

[Figure: two-actor SDF graph A → B with q(A) = 3 and q(B) = 2; some possible schedules are (1) AABAB and (2) AAABB]

DIF-GR-GPU: Dataflow Scheduler
• Objective: optimize the exploitation of data and pipeline parallelism → higher throughput.
• Flat schedule (data parallelism): executes an SDF graph as a cascade of distinct loops with no inter-actor nesting of loops.
• Vectorization of a schedule S (pipeline parallelism): a unique positive integer B, called the blocking factor of S, such that S invokes each actor v exactly B × q(v) times.

[Figure: an original SDF graph and the corresponding BPDAG for blocking factor B = 10, with actor invocation counts (10, 20, 60, ...) scaled by B]
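To make the repetition counts and the blocking factor concrete, here is a minimal Python sketch, assuming a connected, sample rate consistent graph. The repetition_vector helper and the edge list are hypothetical; the rates use the A/B example above (with q(A) = 3 and q(B) = 2, the balance equation forces A to produce 2 tokens per firing while B consumes 3).

```python
from fractions import Fraction
from math import lcm

# Each edge is (src, dst, production rate, consumption rate).
edges = [("A", "B", 2, 3)]
actors = ["A", "B"]

def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src]*prod == q[dst]*cons,
    then scale to the smallest positive integer solution.
    Assumes a connected, sample rate consistent graph."""
    q = {actors[0]: Fraction(1)}
    while len(q) < len(actors):
        for src, dst, prod, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
    scale = lcm(*(f.denominator for f in q.values()))
    return {v: int(f * scale) for v, f in q.items()}

q = repetition_vector(actors, edges)  # {'A': 3, 'B': 2}

# Vectorization with blocking factor B: each actor v is invoked
# exactly B * q(v) times per vectorized schedule iteration.
B = 10
print({v: B * q[v] for v in actors})  # {'A': 30, 'B': 20}
```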
DIF-GR-GPU Workflow (revisited)
• Start from a model-based application description.
• Use tools to optimize scheduling and assignment.
• Generate an accelerated GNU Radio system.

[Figure: the same workflow diagram as before, from the GNU Radio dataflow graph through the dataflow and multiprocessor schedulers to the final implementation]

Heterogeneous Multiprocessor Scheduler
• Objective: utilize the available multiprocessors in the platform (task parallelism).
• Architecture description: the platform is described by a set P of processors and a set B of all-to-all communication buses.
• Execution times depend on the blocking factor.
• Every processor is assumed to have a shared memory.

[Figure: processors 0 (GPU0), 1 (GPU1), ..., N (GPUN) attached to a communication bus]
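As an illustration only, here is one way such an architecture description might be captured in code; the class and field names (Processor, Bus, Platform, bandwidth) are assumptions, not part of the published workflow.

```python
from dataclasses import dataclass, field

@dataclass
class Processor:
    name: str          # e.g. "GPP0", "GPU1"
    kind: str          # "GPP" or "GPU"

@dataclass
class Bus:
    name: str
    bandwidth: float   # hypothetical unit, e.g. bytes/second

@dataclass
class Platform:
    """Set P of processors and set B of all-to-all buses."""
    processors: list[Processor] = field(default_factory=list)
    buses: list[Bus] = field(default_factory=list)

platform = Platform(
    processors=[Processor("GPP0", "GPP"), Processor("GPU0", "GPU")],
    buses=[Bus("pcie0", 8e9)],
)
```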
Scheduler Inputs
• Architecture description: a set P of processors and a set B of communication buses.
• Application description: the application model (input BPDAG) consists of a set T of tasks and a set E of edges.
• Task and edge profiles: these profiles are described by two functions:
  – RTP(t ∈ T, p ∈ P) → ℝ defines the execution time of task t on processor p,
  – REB(e ∈ E, b ∈ B) → ℝ defines the communication time of edge e on bus b.
• Dependency analysis: task t1 is said to be dependent on task t2 if there is a path that starts at t1 and ends at t2. If no such path exists between t1 and t2, then they are called parallel tasks. A similar concept applies to edges.

Multiprocessor Scheduler
• The basic scheduler functionality is to:
  – Map every task to a given processor.
  – Order the execution of parallel actors assigned to the same processor.
  – "Zero out" the communication cost of collocated dependent actors.
• The scheduler objective: minimize the latency L_B of B graph iterations.
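The path-based dependency test described above is straightforward to sketch in Python; the four-task DAG below is hypothetical.

```python
# Tasks are dependent if a directed path connects them (in either
# direction); otherwise they are parallel and may be freely ordered
# on a shared processor.
dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def reachable(graph, src, dst):
    """Depth-first search: is there a directed path src -> dst?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

def parallel(graph, t1, t2):
    """Parallel tasks: no path in either direction."""
    return not (reachable(graph, t1, t2) or reachable(graph, t2, t1))

print(parallel(dag, "B", "C"))  # True:  ordering is the scheduler's choice
print(parallel(dag, "A", "D"))  # False: a dataflow dependency fixes the order
```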
MLP Formulation
• Why a mathematical (mixed linear) programming formulation?
  – Offline analysis of SDF graphs.
  – The coarse-grain nature of SDF graphs.
  – The solver gives a bound on the distance from the optimal solution.
• Basic variables:
  – Mapping: XT[t, p] = 1 if task t is assigned to processor p; XT[t, p] = 0 otherwise.
  – Ordering: for all parallel tasks t1, t2 assigned to the same processor, YT[t1, t2] = 1 if t1 is ordered before t2; YT[t1, t2] = 0 otherwise.
  – Running time: RT[t] = actual (platform-dependent) execution time of task t, depending on its mapping.
  – Start time: ST[t] = the start time for execution of task t.

MLP Formulation (continued)
• Constraints (sketched in the example below):
  – Assignment
  – Dataflow dependency
  – Zero-cost communication
• Objective: minimize M
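The constraint equations on the original slide are figures and are not reproduced here; below is a minimal sketch, using PuLP, of how the assignment and dataflow dependency constraints and the latency objective might be posed. The task names, rtp table, and two-task graph are hypothetical, and the ordering variables YT and the zero-cost communication constraint are omitted for brevity, so this illustrates the style of formulation rather than the authors' exact MLP.

```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary

tasks = ["A", "B"]
procs = ["GPP0", "GPU0"]
deps = [("A", "B")]                      # A must finish before B starts
rtp = {("A", "GPP0"): 4, ("A", "GPU0"): 2,
       ("B", "GPP0"): 3, ("B", "GPU0"): 1}

prob = LpProblem("mlp_schedule", LpMinimize)
XT = LpVariable.dicts("XT", (tasks, procs), cat=LpBinary)   # mapping
ST = {t: LpVariable(f"ST_{t}", lowBound=0) for t in tasks}  # start times
M = LpVariable("M", lowBound=0)                             # latency

# Assignment: each task is mapped to exactly one processor.
for t in tasks:
    prob += lpSum(XT[t][p] for p in procs) == 1

# Running time RT[t] is platform dependent, linear in the mapping.
RT = {t: lpSum(rtp[(t, p)] * XT[t][p] for p in procs) for t in tasks}

# Dataflow dependency: t2 starts only after t1 finishes
# (communication cost omitted; it would be zeroed when collocated).
for (t1, t2) in deps:
    prob += ST[t2] >= ST[t1] + RT[t1]

# M bounds every task's finish time, so minimizing M minimizes latency.
for t in tasks:
    prob += M >= ST[t] + RT[t]

prob += M         # objective: minimize M
prob.solve()
print(M.value())  # 3.0 with this rtp table (both tasks on GPU0)
```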