Scheduling complex streaming applications on the Cell processor
Mathias Jacquelin, joint work with Matthieu Gallet and Loris Marchal
INRIA ROMA project-team, LIP (ENS-Lyon, CNRS, INRIA), École Normale Supérieure de Lyon, France
Workshop on Multithreaded Architectures and Applications, Atlanta, April 23, 2010.
1/28
Outline
Introduction
  Steady-state scheduling
  CELL
Platform and Application Modeling
Mapping the Application
Practical Steady-State on CELL
  Preprocessing of the schedule
  State machine of the framework
Experimental results
Conclusion and future work
2/28
Motivation
◮ Multicore architectures: a new opportunity to test the scheduling strategies designed in the ROMA team.
◮ Our trademark: efficient scheduling on heterogeneous platforms
◮ Most multicore architectures are homogeneous and regular
◮ Need for tailored algorithms (linear algebra, . . . )
◮ Emerging heterogeneous multicore:
  ◮ Dedicated processing units on GPUs
  ◮ Mixed systems: processor + accelerator
◮ This study: steady-state scheduling on the CELL (bounded heterogeneity) to demonstrate the usefulness of complex (static) scheduling techniques
3/28
Introduction: Steady-state Scheduling
Rationale:
◮ A pipelined application:
  ◮ Simple chain
  ◮ More complex application (Directed Acyclic Graph)
◮ Objective: optimize the throughput of the application (number of input files treated per second)
◮ Today: simple case where each task has to be mapped on one single resource
[Figure: two example task graphs, a simple chain T1 → T2 → T3 and a DAG with tasks T1 to T9, whose tasks are mapped onto processors P1 to P4]
4/28
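To make the objective concrete, here is a minimal worked formulation, a sketch assuming a periodic steady-state schedule that processes one input file per period T (the notation t_i(k) for processing times is defined in the application model later in this deck):

\[
\rho = \frac{1}{T},
\qquad
T \;\ge\; \max_{i} \sum_{T_k \text{ mapped on } P_i} t_i(k),
\]

so maximizing the throughput ρ amounts to minimizing the period T, which is bounded below by the load of the busiest processing element.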
CELL brief introduction
◮ Multicore heterogeneous processor
◮ Accelerator extension to the Power architecture
[Figure: CELL block diagram, eight SPEs (SPE 0 to SPE 7) and one PPE connected to main memory through the Element Interconnect Bus (EIB)]
◮ 1 PPE core
  ◮ VMX unit
  ◮ L1 and L2 caches
  ◮ 2-way SMT
◮ 8 SPEs
  ◮ 128-bit SIMD instruction set
  ◮ 256 KB local store
  ◮ Dedicated asynchronous DMA engine
◮ Element Interconnect Bus (EIB)
  ◮ 200 GB/s aggregate bandwidth
  ◮ 25 GB/s bandwidth per element
5/28
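As an illustration of the SPE's asynchronous DMA engine, here is a minimal SPU-side sketch using the standard Cell SDK MFC intrinsics; the buffer size, tag number, and function name are illustrative assumptions, not part of the presented framework:

/* SPU-side sketch: fetch a buffer from main memory asynchronously,
 * overlap the transfer with computation, then wait for completion. */
#include <spu_mfcio.h>

#define TAG 3  /* DMA tag group, in 0..31 */

static char buf[16384] __attribute__((aligned(128)));

void fetch(unsigned long long ea)  /* ea: effective address in main memory */
{
    /* Issue the asynchronous DMA get; each SPE can keep
     * several such transfers in flight at once. */
    mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);

    /* ... compute on previously fetched data while the DMA proceeds ... */

    /* Block until every DMA in tag group TAG has completed. */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}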
Outline
Introduction
  Steady-state scheduling
  CELL
Platform and Application Modeling
Mapping the Application
Practical Steady-State on CELL
  Preprocessing of the schedule
  State machine of the framework
Experimental results
Conclusion and future work
6/28
Platform modeling
Simple CELL modeling:
◮ 1 PPE and 8 SPEs: 9 processing elements P1, . . . , P9, with unrelated speeds,
◮ Each processing element accesses the communication bus with a (bidirectional) bandwidth b = 25 GB/s,
◮ The bus is able to route all concurrent communications without contention (in a first step),
◮ Due to the limited size of the DMA stack on each SPE:
  ◮ Each SPE can perform at most 16 simultaneous DMA operations,
  ◮ The PPE can perform at most 8 simultaneous DMA operations to/from a given SPE.
◮ Linear cost communication model: a piece of data of size S is sent/received in time S/b (see the sketch below).
7/28
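A minimal C sketch of these platform parameters and the linear communication cost; the constant and function names are illustrative assumptions, not part of the actual framework:

#include <stddef.h>

/* Platform parameters from the model above. */
#define NB_PE              9     /* 1 PPE + 8 SPEs                            */
#define BANDWIDTH_BPS      25e9  /* bidirectional bus bandwidth b, bytes/s    */
#define MAX_DMA_PER_SPE    16    /* simultaneous DMA operations per SPE       */
#define MAX_DMA_PPE_TO_SPE 8     /* simultaneous PPE DMAs to/from a given SPE */

/* Linear cost model: a piece of data of size S bytes
 * is sent or received in time S / b seconds. */
static double comm_time(size_t size_bytes)
{
    return (double)size_bytes / BANDWIDTH_BPS;
}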
Application modeling
The application is described by a directed acyclic graph:
◮ Tasks T1, . . . , Tn
◮ The processing time of task Tk on Pi is t_i(k),
◮ If there is a dependency Tk → Tl, data_{k,l} is the size of the file produced by Tk and needed by Tl,
◮ If Tk is an input task, it reads read_k bytes from main memory,
◮ If Tk is an output task, it writes write_k bytes to main memory.
[Figure: example DAG with tasks T1 to T9]
8/28
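A minimal C sketch of these data structures; the type and field names are assumptions for illustration, not the framework's actual types:

#include <stddef.h>

#define NB_PE 9  /* processing elements P1, . . . , P9 */

/* One task T_k of the application DAG. */
typedef struct {
    double proc_time[NB_PE]; /* t_i(k): processing time of T_k on P_i            */
    size_t read_bytes;       /* read_k: bytes read from memory, 0 if not input   */
    size_t write_bytes;      /* write_k: bytes written to memory, 0 if not output */
} task_t;

/* One dependency T_k -> T_l of the DAG. */
typedef struct {
    int    src, dst;   /* task indices k and l                         */
    size_t data_bytes; /* data_{k,l}: size of the file from T_k to T_l */
} dep_t;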
Target application: any DAG
◮ Today, we will focus on three random task graphs:
[Figure: the three random task graphs]
9/28