A Component-Based Framework for the Cell Broadband Engine
Timothy D. R. Hartley, Umit V. Catalyurek
Department of Biomedical Informatics
Department of Electrical and Computer Engineering
The Ohio State University, Columbus, OH, USA
hartleyt@ece.osu.edu, umit@bmi.osu.edu

Outline
• Motivation
• Contributions
• CBE Intercore Messaging Library
• DataCutter-Lite
• Experimental Results
• Conclusions and Future Work

Motivation
• Programming the Cell requires expertise
  • Parallel programming
    • Data decomposition
    • Parallel algorithms
  • Streaming programming
    • Small scratch-pad memories
    • Double buffering
  • Cell peculiarities
    • DMA commands
    • SPE optimizations (not addressed here)
• Component-based streaming frameworks are natural fits for heterogeneous, parallel processors

Cell Broadband Engine
• Cell Broadband Engine
  • Designed jointly by Sony, Toshiba, and IBM
  • 9-core heterogeneous microprocessor
  • Integrated high-bandwidth ring bus for on-chip communication
• Quick Specs
  • >200 GFLOP/s floating-point arithmetic
  • >200 GB/s internal bus bandwidth

Cell Programming Peculiarities
• DMA commands
  • Simple in concept, difficult in practice
  • Fence, barrier, lists, alignments, etc.
• SPE code optimizations*
  • 25 GFLOP/s is reached on an SPE only when using SIMD FMA instructions
  • Static dual-issue scheduling
  • Branch hints
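The 25 GFLOP/s per-SPE figure can be checked with simple arithmetic. The sketch below is only an illustration and rests on assumptions that are not stated on the slides: a 3.2 GHz SPE clock, four single-precision lanes per 128-bit SIMD register, a fused multiply-add counted as two floating-point operations per lane, and eight SPEs per chip. The helper names `peak_gflops` and `nearly_equal` are hypothetical.

```c
#include <assert.h>

/* Peak GFLOP/s of one core = clock (GHz) x SIMD lanes x flops per lane per cycle.
 * All three inputs are assumptions supplied by the caller, not measured values. */
static double peak_gflops(double clock_ghz, int simd_lanes, int flops_per_lane)
{
    assert(clock_ghz > 0.0 && simd_lanes > 0 && flops_per_lane > 0);
    return clock_ghz * simd_lanes * flops_per_lane;
}

/* Returns 1 when a and b differ by less than 1e-9. */
static int nearly_equal(double a, double b)
{
    double d = a - b;
    return d < 1e-9 && d > -1e-9;
}
```

Under these assumptions, `peak_gflops(3.2, 4, 2)` evaluates to 25.6 for one SPE, and eight SPEs give 204.8, which is consistent with the ">200 GFLOP/s" quick spec. Without FMA (one flop per lane per cycle) the per-SPE figure halves to 12.8.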

Contributions
• Cell Broadband Engine Intercore Messaging Library (CIML)
  • Two-sided communication library
• DataCutter-Lite for Cell Broadband Engine
  • Filter-stream programming framework and runtime engine
  • Uses CIML for intercore communication

CBE Intercore Messaging Library (CIML)
• Two-sided communication library
  • Allows two-sided communication between all processors in Cell (PPU and SPU)
• Different from LANL's Cell Messaging Layer
  • CML uses a receiver-initiated protocol
  • Not suitable for streaming frameworks
    • Sender unknown
• Good performance for larger message sizes
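To illustrate why an unknown sender matters in a streaming setting, here is a minimal single-threaded sketch of a sender-initiated inbox: any core can post a message, and the receiver takes the oldest one without naming the sender in advance. This is not the CIML API. Every name here (`Inbox`, `inbox_send`, `inbox_recv_any`, `INBOX_SLOTS`, `MAX_PAYLOAD`) is hypothetical, and a real implementation on the Cell would move data with DMA and need synchronization between cores, both of which are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INBOX_SLOTS 4   /* hypothetical queue depth */
#define MAX_PAYLOAD 64  /* hypothetical maximum message size in bytes */

typedef struct {
    int sender;                       /* id of the core that posted the message */
    size_t len;                       /* payload length in bytes */
    unsigned char data[MAX_PAYLOAD];  /* copy of the payload */
} Message;

typedef struct {
    Message slots[INBOX_SLOTS];
    size_t head;   /* index of the oldest queued message */
    size_t count;  /* number of queued messages */
} Inbox;

/* Sender side: post a message. Returns 0 on success, -1 if the inbox is
 * full or the payload is too large. */
static int inbox_send(Inbox *in, int sender, const void *payload, size_t len)
{
    if (in->count == INBOX_SLOTS || len > MAX_PAYLOAD)
        return -1;
    Message *m = &in->slots[(in->head + in->count) % INBOX_SLOTS];
    m->sender = sender;
    m->len = len;
    memcpy(m->data, payload, len);
    in->count++;
    return 0;
}

/* Receiver side: take the oldest message from whichever core sent it.
 * Returns the payload length, or -1 if the inbox is empty or `out` is too small. */
static long inbox_recv_any(Inbox *in, int *sender, void *out, size_t cap)
{
    if (in->count == 0)
        return -1;
    Message *m = &in->slots[in->head];
    if (m->len > cap)
        return -1;
    memcpy(out, m->data, m->len);
    *sender = m->sender;
    in->head = (in->head + 1) % INBOX_SLOTS;
    in->count--;
    return (long)m->len;
}
```

The receiver learns the sender only after the message arrives, which is the situation a filter is in when several upstream filters may write to the same stream.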

CIML Performance
[Figures on two slides; content not recoverable from the transcript]

Component-based Programming Frameworks
• Application is decomposed into a natural task-graph
  • Task graph performs computation
  • Individual tasks perform a single function
  • Tasks are independent, with well-defined interfaces
• Higher-level programming abstraction

DataCutter
• DataCutter
  • Coarse-grained filter-stream framework
  • OSU/Maryland-bred component-based framework
  • Third-generation runtime uses MPI for high-bandwidth network support

DataCutter-Lite (DCL)
• Component-based, filter-stream programming framework
  • Define computation as task-graph
  • Tasks are filters, which are functions that compute
  • Data flows along streams to/from filters along pre-defined paths
• Automatic multi-buffering of buffers
• Automatic PPE-SPE, inter-SPE communication
• DCL is event-based
  • Arrival of a stream buffer (a quantum of data in the application) triggers filter execution
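The event-based model described above can be sketched as a small dispatch table: filters are bound to the stream they consume, and delivering a buffer to a stream runs every filter bound to it. This is an illustrative sketch, not the DCL runtime. The names `Runtime`, `Buffer`, `bind_filter`, `deliver`, and `add_halves` are hypothetical, and multi-buffering, placement on SPEs, and inter-core transfer are all omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FILTERS 8  /* hypothetical limit on bound filters */

typedef struct {
    float *data;  /* payload of the stream buffer */
    size_t n;     /* number of floats in the payload */
} Buffer;

typedef void (*FilterFn)(const Buffer *in, void *state);

typedef struct {
    const char *stream;  /* name of the stream this filter consumes */
    FilterFn fn;         /* function to run when a buffer arrives */
    void *state;         /* filter-private data, here the output array */
} Binding;

typedef struct {
    Binding b[MAX_FILTERS];
    size_t n;
} Runtime;

/* Register `fn` as a sink of `stream`. Returns 0, or -1 if the table is full. */
static int bind_filter(Runtime *rt, const char *stream, FilterFn fn, void *state)
{
    if (rt->n == MAX_FILTERS)
        return -1;
    rt->b[rt->n].stream = stream;
    rt->b[rt->n].fn = fn;
    rt->b[rt->n].state = state;
    rt->n++;
    return 0;
}

/* Event: a buffer arrived on `stream`. Runs each filter bound to that
 * stream and returns how many filters were triggered. */
static int deliver(Runtime *rt, const char *stream, const Buffer *buf)
{
    int fired = 0;
    for (size_t i = 0; i < rt->n; i++) {
        if (strcmp(rt->b[i].stream, stream) == 0) {
            rt->b[i].fn(buf, rt->b[i].state);
            fired++;
        }
    }
    return fired;
}

/* Example filter: the buffer holds row a followed by row b;
 * the element-wise sum a + b is written to the array passed as state. */
static void add_halves(const Buffer *in, void *state)
{
    float *out = state;
    size_t half = in->n / 2;
    for (size_t i = 0; i < half; i++)
        out[i] = in->data[i] + in->data[half + i];
}
```

A filter never polls for input in this model. It runs only when `deliver` is called for its stream, which mirrors the "arrival triggers execution" bullet.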

Sample DCL Application Layout
[Figure; content not recoverable from the transcript]

Experimental Results
• Use three applications
  • Variety of Communication-to-Computation Ratios (CCR)
• Matrix addition
  • High CCR
  • Compare with IBM's Accelerated Library Framework (ALF) example
• Image color-space transformation
  • Low CCR
  • Compare with custom-coded IBM SDK-based baseline
• Biomedical image analysis application
  • Medium CCR
  • Three-stage pipeline

Matrix Addition Performance
• Compare against IBM ALF example
• 1024 x 512 matrix
• DCL has 8–91% longer execution time

Color-Space Transform Performance
• Compare against custom IBM SDK version
• 32 1Kx1K image tiles
• DCL has 2–4% longer execution time

Biomedical Application Filter Layout
[Figure; content not recoverable from the transcript]

Biomedical Application Performance
• Compared against custom IBM SDK version
• 32 1Kx1K image tiles
• Overheads included: DCL takes 23–57% longer
• Overheads excluded: SDK takes 5–26% longer

Mixed Granularity DataCutter Example
• DataCutter for coarse-grained parallelism
• DCL for fine-grained parallelism

Biomedical Application Performance (2)
• 1024 1Kx1K image tiles
• DC+DCL has up to 42% shorter execution time

Conclusions and Future Work
• Contributions
  • Two-sided communication library (CIML)
  • Filter-stream programming framework and runtime engine (DataCutter-Lite)
• Conclusions
  • CIML and DCL give good performance with easier programming than raw IBM SDK
• Future work
  • Extend fine-grained filter-stream framework to CMP, GPU
  • Automate trial-and-error fine-tuning
  • Simplify placement/sizing of filter instances with performance modeling

Related Work
• MPI-like
  • MPI u-tasks - IBM Research
  • Cell Messaging Layer (CML) - LANL
• Block-based
  • BlockLib
  • Sequoia - Stanford
  • Charm++ - UIUC
  • Accelerated Library Framework (ALF) - IBM SDK
• Source compilers
  • CellSs - BSC
• Streaming frameworks
  • StreamIt - MIT

DCL Code Examples: PPE code, main()

    // Omitted: Set up Matrices A, B, pointers, a_ptr, b_ptr, constants
    int main(int argc, char ** argv) {
        init_dcl();
        // Send one buffer per matrix row: a row of A followed by the matching row of B
        for (int i = 0; i < NUM_ROWS; i++) {
            DCLBuffer * buffer = create_buffer("raw_data", BUF_SIZE);
            append_array(buffer, a_ptr, NUM_COLS * sizeof(float));
            append_array(buffer, b_ptr, NUM_COLS * sizeof(float));
            stream_write(buffer);
            // Omitted: increment pointers a_ptr, b_ptr
        }
        finish_dcl();
        return 0;
    }

DCL Code Examples: PPE code, setup_application()

    // PPE setup and filter code
    // Called by init_dcl()
    void setup_application(Placement * p) {
        Filter * console = get_console(p);
        Filter * fadded = place_ppu_filter(p, "added_data");
        Filter * fadder = place_filter(p, 0, "add_values");

        Stream * sraw = add_stream(p, "raw_data");
        add_source(p, sraw, console);
        add_sink(p, sraw, fadder);

        Stream * sadded = add_stream(p, "added_matrix");
        add_source(p, sadded, fadder);
        add_sink(p, sadded, fadded);
    }

DCL Code Examples: PPE code, filter function

    // When receiving a buffer from SPE
    void added_data(DCLBuffer * buffer) {
        // Omitted: Deal with added matrix data
    }
    EVENT_PROVIDE1(added_data);

DCL Code Examples: SPE code, filter function

    // Omitted: Set up constants
    void add_values(DCLBuffer * buffer) {
        DCLBuffer * out_buffer = create_buffer("added_matrix", BUF_SIZE);
        // The incoming buffer holds a row of A followed by a row of B
        float * a = get_float_data_pointer(buffer);
        increment_extract_pointer(buffer, num_values * sizeof(float));
        float * b = get_float_data_pointer(buffer);
        float * c = get_float_data_pointer(out_buffer);
        for (int i = 0; i < NUM_COLS; i++)
            c[i] = a[i] + b[i];
        stream_write(out_buffer);
    }
    EVENT_PROVIDE1(add_values);
