JIT renaming and lazy write-back on the Cell/B.E.


  1. JIT renaming and lazy write-back on the Cell/B.E. Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta. Barcelona Supercomputing Center (BSC-CNS). pieter.bellens@bsc.es

  2. Overview
     • Cell Broadband Engine (Cell/B.E.)
     • Cell Superscalar (CellSs)
     • Bypassing
       • Motivation
       • Implementation
       • Results
     • Lazy write-back
     • Just-In-Time renaming
     • Current status and ongoing work

  3. Cell Broadband Engine
     [architecture diagram]
     • asynchronous DMA transfers between main memory and the local stores
     • 256 KB of local store (LS) per SPE
     • 2 hardware threads on the PPE

  4. CellSs
     Runtime environment that automatically parallelizes sequential user applications for the Cell/B.E.
     [diagram: user application → CellSs compiler → parallel Cell/B.E. application, running on the CellSs PPE runtime and the CellSs SPE runtime (1 PPE, 8 SPEs)]

  5. CellSs: sample code (sparse LU)
     [figure: A is an NB x NB hypermatrix of B x B blocks]

     int main(int argc, char **argv)
     {
       int ii, jj, kk;
       ...
       for (kk=0; kk<NB; kk++) {
         lu0(A[kk][kk]);
         for (jj=kk+1; jj<NB; jj++)
           if (A[kk][jj] != NULL)
             fwd(A[kk][kk], A[kk][jj]);
         for (ii=kk+1; ii<NB; ii++)
           if (A[ii][kk] != NULL) {
             bdiv(A[kk][kk], A[ii][kk]);
             for (jj=kk+1; jj<NB; jj++)
               if (A[kk][jj] != NULL) {
                 if (A[ii][jj] == NULL)
                   A[ii][jj] = allocate_clean_block();
                 bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
               }
           }
       }
     }

     void lu0(float *diag);
     void bdiv(float *diag, float *row);
     void bmod(float *row, float *col, float *inner);
     void fwd(float *diag, float *col);

  6. CellSs: sample code (sparse LU)

     /* main() as on the previous slide */

     #pragma css task inout(diag[B][B])
     void lu0(float *diag);

     #pragma css task input(diag[B][B]) inout(row[B][B])
     void bdiv(float *diag, float *row);

     #pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])
     void bmod(float *row, float *col, float *inner);

     #pragma css task input(diag[B][B]) inout(col[B][B])
     void fwd(float *diag, float *col);

  7. CellSs: compiler
     [toolchain diagram]
     • annotated user application (app.c) → CellSs compiler → app_spe.c + app_ppe.c
     • app_spe.c → SPE compiler (SDK) → app_spe.o → SPE linker, with the CellSs SPE library (lib_css-spe.so) → SPE executable → SPE embedder → PPE object
     • app_ppe.c → PPE compiler → app_ppe.o → PPE linker, with the CellSs PPE library (lib_css-ppe.so) and the embedded SPE object → parallel Cell/B.E. application (Cell executable)

  8. CellSs: runtime libraries
     [diagram: user main program, renaming table, TDG, original user data, task code]
     1) task creation (CellSs main thread, on the PPE)
     2) dependence analysis and data renaming
     3) update of the TDG
     4) scheduling (CellSs helper thread)
     5) synchronisation with the SPEs
     6) stage in (SPE)
     7) execute (SPE)
     8) stage out and synchronisation (SPE)
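     In outline, the two PPE threads cooperate as in the following sketch. All names here (task_t, add_task, next_task_for_spe) are illustrative assumptions, not the CellSs API:

     #include <stddef.h>

     typedef struct task {
         int          func_id;        /* which annotated function to run   */
         void        *args[8];        /* (possibly renamed) argument addrs */
         int          missing_inputs; /* producers that have not finished  */
         struct task *next;           /* ready-queue link                  */
     } task_t;

     static task_t *ready_q;          /* tasks whose inputs are available  */

     /* Steps 1-3 (main thread): each call to an annotated function only
        creates a task, analyses its dependences and updates the TDG
        (reduced here to a dependence counter plus a ready queue). */
     void add_task(task_t *t)
     {
         if (t->missing_inputs == 0) {
             t->next = ready_q;
             ready_q = t;
         }
         /* else: the task waits in the TDG until its producers finish */
     }

     /* Steps 4-5 (helper thread): pop a ready task and hand it to an SPE;
        the SPE runtime performs stage in / execute / stage out (6-8). */
     task_t *next_task_for_spe(void)
     {
         task_t *t = ready_q;
         if (t != NULL)
             ready_q = t->next;
         return t;                    /* NULL if nothing is ready yet */
     }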

  9. CellSs: runtime behaviour (matrix multiply)
     • Visualization of the runtime phases as a function of time using Paraver
     • Each phase is assigned a different colour:
       • SPE task execution
       • SPE DMA wait
       • thread idling

  10. CellSs: runtime behaviour (matrix multiply)

  11. Bypassing: motivation
     A new architecture, but the song remains the same: improve performance.
     Let's take a closer look at code executing on the Cell/B.E.:
     • General computation pattern
       • PPE generates work for the SPEs
       • SPEs repeatedly fetch work and perform computation
       • traditional approach vs. bypassing approach
     • Cell/B.E. interconnect
       • Element Interconnect Bus (EIB)

  12. Bypassing: motivation: general computation pattern
     [diagram: traditional pattern. Each task stages in its operands from main memory, executes, and stages out its results to main memory: SPE1 does stage in (1) and stage out (2), then SPE2 does stage in (3) and stage out (4). Every stage in and stage out is a main memory access.]

  13. Bypassing: motivation: Cell/B.E. interconnect
     Element Interconnect Bus (EIB):
     “Another class of bottlenecks is contention. For instance, if four SPEs are trying to move data to or from the MIC at the same time, their aggregate bandwidth of 102.4GB/sec completely swamps the MIC's bandwidth of 25.6GB/sec. Similarly, while the SPEs are trying to interact with the MIC, the PPE may have degraded access to main memory. When a unit is overwhelmed, it might need to retry commands, which in turn slows traffic down even further.”
     David Krolak, “Unleashing the Cell Broadband Engine Processor: the Element Interconnect Bus”

  14. Bypassing: motivation
     How do contention and blocking influence the execution?
     • Countermeasures:
       • software cache in the LS of an SPE
       • double buffering (see the sketch below)
       • ???
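     As a reference point, double buffering on an SPE can look like the following minimal sketch, built on the SDK's spu_mfcio.h DMA intrinsics; the chunk size, tag choice and the compute() kernel are assumptions for illustration:

     #include <spu_mfcio.h>

     #define CHUNK 16384   /* bytes per DMA; one mfc_get moves at most 16 KB */

     static char buf[2][CHUNK] __attribute__((aligned(128)));

     extern void compute(char *chunk);   /* assumed user kernel */

     /* Process n consecutive chunks starting at effective address ea,
        overlapping the DMA for chunk i+1 with the computation on chunk i. */
     void process(unsigned long long ea, int n)
     {
         int cur = 0, nxt = 1, i;

         mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);          /* prime pipeline */
         for (i = 0; i < n; i++) {
             if (i + 1 < n)                                /* prefetch next  */
                 mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                         CHUNK, nxt, 0, 0);
             mfc_write_tag_mask(1 << cur);                 /* wait only for  */
             mfc_read_tag_status_all();                    /* current chunk  */
             compute(buf[cur]);
             cur ^= 1;                                     /* swap buffers   */
             nxt ^= 1;
         }
     }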

  15. Bypassing: motivation
     • General idea: transfer objects between the LSs of SPEs without going through main memory
     • Effect on the PPE threads?

  16. Bypassing: motivation
     [diagram: bypassing pattern. SPE1 stages in from main memory (1); when SPE2 needs the same object, it bypasses it directly from the LS of SPE1 (2), which frees up LS space; the stage out to main memory (3) becomes optional. The cycle turns into stage in or bypass / execute / stage out or bypass.]

  17. Bypassing: implementation
     • General solution
       • the SPE runtime autonomously decides whether to go to main memory or to bypass from another SPE
       • no need to tailor the bypassing mechanism to a specific application
     • Implemented using the SPE's Atomic Cache Unit (ACU)
       • the location of software objects in the system is updated using the ACU (see the sketch below)
     • Distributed solution
     • Makes good use of hardware features
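     The ACU gives each SPE the MFC's atomic commands, getllar/putllc, a load-with-reservation / store-conditional pair on 128-byte lines. Below is a minimal sketch of how an object's location could be updated with them; dir_entry_t and claim() are assumptions, not the actual CellSs protocol:

     #include <spu_mfcio.h>
     #include <stdint.h>

     /* One 128-byte directory line per software object. */
     typedef struct {
         uint32_t version;     /* current version of the object           */
         int32_t  owner_spe;   /* SPE whose LS holds it, -1 = main memory */
         uint32_t ls_addr;     /* location in the owner's LS              */
         uint8_t  pad[116];    /* pad to a full cache line                */
     } dir_entry_t;

     static volatile dir_entry_t line __attribute__((aligned(128)));

     /* Atomically record that this SPE now holds the newest copy:
        retry until the conditional store succeeds. */
     void claim(uint64_t dir_ea, int my_spe, uint32_t my_ls)
     {
         do {
             mfc_getllar((void *)&line, dir_ea, 0, 0);  /* load & reserve */
             mfc_read_atomic_status();
             line.owner_spe = my_spe;
             line.ls_addr   = my_ls;
             mfc_putllc((void *)&line, dir_ea, 0, 0);   /* store if still reserved */
         } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);
     }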

  18. Bypassing: results: opportunities for bypassing Are there opportunities to bypass data from one SPE to another?

  19. Bypassing: results: reduction in wait time Does the wait time actually decrease with bypassing?

  20. Lazy write-back: concept
     • Do not transfer objects back to main memory unless strictly necessary
     • Exploit the information available in the bypassing mechanism:
       • object versions
       • read count of a version
     • Token passing to avoid early stage-outs

     #pragma css task inout(a)
     void foo(int a[4096]);

     int a[4096];

     int main(int argc, char *argv[])
     {
       ...
       foo(a);
       ...
       foo(a);
       ...
       return 0;
     }

     [diagram: SPE1 stages in a (1) and runs the first foo; SPE2 bypasses a from SPE1's buffer (2) and runs the second foo; the intermediate stage out (3) and the final stage out (4) happen only if necessary.]
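     A minimal sketch of the bookkeeping these three ingredients suggest; the obj_version_t fields and must_stage_out() are assumptions, not the CellSs implementation:

     #include <stdbool.h>

     typedef struct {
         int  version;         /* Obj(a,version): bumped by every writer task */
         int  pending_reads;   /* tasks that still have to read this version  */
         bool in_main_memory;  /* has this version already been staged out?   */
     } obj_version_t;

     /* Called when a reader finishes. The token holder (the last SPE to
        use a version) decides whether a stage-out is really needed. */
     bool must_stage_out(obj_version_t *v, bool is_latest_version)
     {
         if (--v->pending_reads > 0)
             return false;              /* later readers will bypass it      */
         if (!is_latest_version)
             return false;              /* superseded version, never written */
         return !v->in_main_memory;     /* last reader of the newest version */
     }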

  21. Lazy write-back: example
     • Below is the perfect scenario; variations are possible depending on the relative ordering of the execution of tasks and on the schedule
       1. Task 1 reads and writes a → Obj(a,1)
       2. Task 2 reads a (Obj(a,1))
       3. Task 3 reads a (Obj(a,1))
       4. Task 4 reads and writes a → Obj(a,2)
       5. Task 5 reads a (Obj(a,2))
     [diagram: a[4096] is staged in once (1) and then bypassed between the buffers of SPE1, SPE2 and SPE3 (2, 4, 6); the intermediate stage outs (3, 5, 7) are skipped and only the final stage out (8) reaches main memory.]

  22. Lazy write-back: results Can we avoid a significant fraction of the transfers to main memory?

  23. Renaming: traditional concept

     #pragma css task inout(a)
     void foo(int a[4096]);

     #pragma css task out(a)
     void moo(int a[4096]);

     int a[4096];

     int main(int argc, char *argv[])
     {
       ...
       foo(a);
       ...
       moo(a);
       ...
       return 0;
     }

     [diagram: main memory holds the original object A[4096] (“user space”) and the renaming A_ren[4096] (“CellSs space”)]
     • Renaming improves parallelism at the cost of extra memory: since moo only writes a, giving it the fresh buffer A_ren removes its WAW/WAR dependences on foo's version of a.
     • Centralized
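     What the centralized scheme boils down to can be sketched as follows; rename_t and rename_out() are illustrative, not the CellSs internals:

     #include <stdlib.h>

     typedef struct {
         void  *user_addr;  /* the original object, "user space"    */
         void  *current;    /* newest instance, possibly a renaming */
         size_t size;
     } rename_t;

     /* Task creation on the PPE: an "out" parameter never needs the old
        contents, so a fresh buffer ("CellSs space") breaks the WAW/WAR
        dependences with tasks still using the previous instance. */
     void *rename_out(rename_t *r)
     {
         void *fresh = malloc(r->size);
         r->current = fresh;  /* later readers are linked to this version */
         return fresh;
     }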

  24. Renaming: traditional concept
     [diagram: SPE1 runs foo on buffer A and SPE2 runs moo on buffer B: explicit renaming in the LS. Main memory holds both A[4096] and A_ren[4096]: explicit renaming in main memory.]

  25. Renaming: JIT renaming
     [diagram: SPE1 runs foo on buffer A; SPE2 runs moo on buffer B and bypasses from SPE1: implicit renaming in the LS. Main memory holds only the original object A[4096].]

  26. Renaming: JIT renaming
     [diagram: a single SPE runs foo on buffer A and then moo on buffer B, bypassing from its own LS; main memory holds only the original object A[4096].]
     • JIT renaming sometimes requires an SPE to bypass from itself.

  27. Renaming: JIT renaming
     • The decision between stage-out and renaming is made at the SPE, at the very last moment (see the sketch below)
     • No synchronisation with the PPE unless the renaming pool is too small
     • Relation between scheduling and renaming
     [diagram: an SPE either stages a result out to the original user data in main memory or turns it into a renaming in the renaming pool]
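     The last-moment choice an SPE makes for a finished output buffer can be sketched as follows; the inputs to decide() are assumptions about what the runtime knows at that point:

     typedef enum { STAGE_OUT, KEEP_AS_RENAMING } wb_decision_t;

     wb_decision_t decide(int future_readers_on_spes, int renaming_pool_free)
     {
         /* If some scheduled task will consume this version, keep it in
            the LS as an implicit renaming and let that task bypass it. */
         if (future_readers_on_spes > 0 && renaming_pool_free > 0)
             return KEEP_AS_RENAMING;
         /* Otherwise fall back to a traditional stage-out; only when the
            renaming pool runs out does the SPE synchronise with the PPE. */
         return STAGE_OUT;
     }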

  28. Ongoing work
     • verification of the bypassing protocol
     • studying ways to incorporate scheduling
       • distributed scheduling
       • shared representation of the Task Dependence Graph (TDG)

  29. Questions?

  30. Task dependence graph (TDG)

  31. Speedup results • Very much work in progress • Linear algebra applications on 16x16 hypermatrices of 64x64 floats • Matrix multiplication, 2 variants of the Cholesky decomposition, a Jacobi computation and an LU decomposition.
