The OmpSs Programming Model
Jesus Labarta
Director, Computer Sciences Research Dept., BSC
Challenges on the way to Exascale
• Efficiency (…, power, …)
• Variability
• Memory
• Faults
• Scale (…, concurrency, strong scaling, …)
• Complexity (…, hierarchy/heterogeneity, …)
J. Labarta et al., "BSC Vision Towards Exascale", IJHPCA, vol. 23, no. 4, Nov. 2009
Supercomputer Development
Application – Algorithm – Progr. Model – Runtime – Architecture
Is any of them more important than the others? Which?
The sword to cut the "multicore" Gordian Knot
StarSs: a pragmatic approach
• Rationale
  • Runtime-managed, asynchronous data-flow execution models are key
  • Need to provide a natural migration path towards dataflow
  • Need to tolerate an "acceptable" relaxation of pure models
  • Focus on algorithmic structure and not so much on resources
• StarSs: a family of task-based programming models
  • Basic concept: write sequential code on a flat, single address space + directionality annotations
  • Order IS defined!!!
  • Dependence and data-access information (NOT specification) in a single mechanism
  • Think global, specify local
  • Power to the runtime!!!
StarSs: data-flow execution of sequential programs
Decouple how we write the code (Write) from how it is executed (Execute): the program below is sequential; the annotated kernels become tasks of a dependence graph built and scheduled by the runtime.

void Cholesky( float *A )
{
   int i, j, k;
   for (k=0; k<NT; k++) {
      spotrf (A[k*NT+k]);
      for (i=k+1; i<NT; i++)
         strsm (A[k*NT+k], A[k*NT+i]);
      // update trailing submatrix
      for (i=k+1; i<NT; i++) {
         for (j=k+1; j<i; j++)
            sgemm( A[k*NT+i], A[k*NT+j], A[j*NT+i]);
         ssyrk (A[k*NT+i], A[i*NT+i]);
      }
   }
}

#pragma omp task inout ([TS][TS]A)
void spotrf (float *A);
#pragma omp task input ([TS][TS]T) inout ([TS][TS]B)
void strsm (float *T, float *B);
#pragma omp task input ([TS][TS]A, [TS][TS]B) inout ([TS][TS]C)
void sgemm (float *A, float *B, float *C);
#pragma omp task input ([TS][TS]A) inout ([TS][TS]C)
void ssyrk (float *A, float *C);
StarSs vs OpenMP: three OpenMP parallelizations of the same Cholesky code

// 1) Worksharing (fork-join) version
void Cholesky( float *A ) {
   int i, j, k;
   for (k=0; k<NT; k++) {
      spotrf (A[k*NT+k]);
      #pragma omp parallel for
      for (i=k+1; i<NT; i++)
         strsm (A[k*NT+k], A[k*NT+i]);
      for (i=k+1; i<NT; i++) {
         #pragma omp parallel for
         for (j=k+1; j<i; j++)
            sgemm( A[k*NT+i], A[k*NT+j], A[j*NT+i]);
         ssyrk (A[k*NT+i], A[i*NT+i]);
      }
   }
}

// 2) Worksharing plus one task per sgemm/ssyrk call
void Cholesky( float *A ) {
   int i, j, k;
   for (k=0; k<NT; k++) {
      spotrf (A[k*NT+k]);
      #pragma omp parallel for
      for (i=k+1; i<NT; i++)
         strsm (A[k*NT+k], A[k*NT+i]);
      for (i=k+1; i<NT; i++) {
         for (j=k+1; j<i; j++) {
            #pragma omp task
            sgemm( A[k*NT+i], A[k*NT+j], A[j*NT+i]);
         }
         #pragma omp task
         ssyrk (A[k*NT+i], A[i*NT+i]);
      }
      #pragma omp taskwait
   }
}

// 3) Worksharing nested inside tasks for the trailing-submatrix update
void Cholesky( float *A ) {
   int i, j, k;
   for (k=0; k<NT; k++) {
      spotrf (A[k*NT+k]);
      #pragma omp parallel for
      for (i=k+1; i<NT; i++)
         strsm (A[k*NT+k], A[k*NT+i]);
      // update trailing submatrix
      for (i=k+1; i<NT; i++) {
         #pragma omp task
         {
            #pragma omp parallel for
            for (j=k+1; j<i; j++)
               sgemm( A[k*NT+i], A[k*NT+j], A[j*NT+i]);
         }
         #pragma omp task
         ssyrk (A[k*NT+i], A[i*NT+i]);
      }
      #pragma omp taskwait
   }
}
StarSs: the potential of data access information
• Flat global address space seen by the programmer
• Flexibility to dynamically traverse the dataflow graph, "optimizing"
  • Concurrency, critical path
  • Memory access: data transfers performed by the runtime
• Opportunities for the runtime to
  • Prefetch
  • Reuse
  • Eliminate antidependences (rename)
  • Manage replication
• Coherency/consistency handled by the runtime
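As a hedged illustration of the renaming opportunity (not taken from the slides), the sketch below uses the OmpSs-style annotations shown earlier; the block size N, the kernel names and their bodies are made up for the example. Because produce() only writes its argument, the runtime may give each task instance a fresh copy of v (renaming), so the produce() of iteration i+1 does not have to wait for the consume() of iteration i.

#include <stdio.h>

#define N 1024   /* illustrative block size, not from the slides */

/* Writes v: with an output-only annotation the runtime is free to rename
   (allocate a private copy of) v, eliminating the antidependence on the
   previous reader, much like register renaming in hardware. */
#pragma omp task output([N]v)
void produce(float *v, float seed)
{
    for (int i = 0; i < N; i++) v[i] = seed + i;
}

/* Reads v: only the true dependence on the matching produce() remains. */
#pragma omp task input([N]v)
void consume(float *v)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++) s += v[i];
    printf("sum = %f\n", s);
}

int main(void)
{
    float v[N];
    for (int it = 0; it < 4; it++) {
        produce(v, (float)it);   /* may run on a renamed copy of v */
        consume(v);
    }
    #pragma omp taskwait
    return 0;
}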
Hybrid MPI/StarSs
• Overlap communication and computation
• Extend asynchronous data-flow execution to the outer level (across MPI processes P0, P1, P2, …)
• Linpack example: automatic lookahead

…
for (k=0; k<N; k++) {
   if (mine) {
      Factor_panel(A[k]);
      send (A[k]);
   } else {
      receive (A[k]);
      if (necessary) resend (A[k]);
   }
   for (j=k+1; j<N; j++)
      update (A[k], A[j]);
}
…

#pragma css task inout(A[SIZE])
void Factor_panel(float *A);
#pragma css task input(A[SIZE])
void send(float *A);
#pragma css task output(A[SIZE])
void receive(float *A);
#pragma css task input(A[SIZE])
void resend(float *A);
#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

V. Marjanovic et al., "Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach", ICS 2010
All that easy/wonderful?
• Difficulties for adoption
  • Chicken-and-egg issue: users ↔ manufacturers
  • Availability
    • Runtime implementations chasing new platforms
    • Development as we go
    • Fairly stable, minimal application update cost
  • Maturity
    • Happens to all models, by all developers (companies, research, …)
  • Lack of program development support
    • Taskification
    • Performance prediction
    • Understanding application dependences
    • Debugging
    • Understanding the potential and the best direction
  • Difficulties of the models themselves
    • Simple concepts take time to mature
    • As clean/elegant as we claim?
    • Legacy sequential code less structured than ideal
• How these are being addressed
  • Early adopters and porting
  • Research support: Consolider (Spain); ENCORE, TEXT, Montblanc, DEEP (EC)
  • Standardization: OpenMP, …
  • New tools
  • New platforms: ARM + GPUs, MIC, …
  • Examples
  • Training and education
The TEXT project
• Towards EXaflop applicaTions (EC FP7 Grant 261580)
• Demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way
• Deploy at supercomputing centers: Jülich, EPCC, HLRS, BSC
• Port applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms
• Develop additional environment capabilities
  • Tools (debugging, performance)
  • Improvements in runtime systems (load balancing and GPUSs)
• Support other users
  • Identify users of TEXT applications
  • Identify and support interested application developers
• Contribute to standards (OpenMP ARB, PERI-XML)
Deployment
Codes being ported
• ScaLAPACK: Cholesky factorization (UJI)
  • Example of the issues in porting legacy code
  • Demonstration that it is feasible
  • The importance of scheduling
• LBC Boltzmann Equation Solver Tool (HLRS)
  • Solver for incompressible flows based on Lattice-Boltzmann methods (LBM)
  • LBM well suited for highly complex geometries; simplified implementation: lbc
  • Weak scaling experiment (stencil, subdomains)
[Figure: normalized walltime vs. cores (8 to 2048) for ideal, StarSs/MPI, MPI, and OpenMP/MPI]
StarSs: history/strategy/versions (evolving research since 2005)

Basic SMPSs
• Must provide directionality of arguments
• Contiguous, non-partially-overlapped arguments
• Renaming
• Several schedulers (priority, locality, …)
• No nesting
• C/Fortran
• MPI/SMPSs optimizations

SMPSs regions
• C, no Fortran
• Must provide directionality of arguments
• Overlapping & strided arguments
• Reshaping of strided accesses
• Priority- and locality-aware scheduling

OmpSs
• C, C++, Fortran
• OpenMP compatibility (~)
• Contiguous and strided arguments
• Separate dependences/transfers
• Inlined/outlined pragmas
• Nesting
• Heterogeneity: SMP/GPU/Cluster
• No renaming
• Several schedulers: "simple" locality-aware scheduler, …
OmpSs
• What: our long-term infrastructure
  • "Acceptable" relaxation of the basic StarSs concept
  • Reasonable merge/evolution with OpenMP
• Basic features (see the sketch after this list)
  • Inlined/outlined task specifications
  • Support for multiple implementations of outlined tasks
  • Separation of the information used to compute dependences from that used for data movement
  • Not necessary to specify directionality for every argument
  • Concurrent: breaking inout chains (for reduction implementations)
  • Nesting
  • Heterogeneity: CUDA, OpenCL (in the pipe)
  • Strided and partially aliased arguments
  • C, C++ and Fortran
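A minimal sketch of the concurrent clause mentioned above, assuming the array-section syntax used elsewhere in this deck; the block size BS, the function names and the atomic protection of the update are illustrative, not taken from the slides. concurrent() tells the runtime that the accumulation tasks commute, so they are not serialized by an inout chain on sum; the programmer still has to make the update itself safe.

#include <stdio.h>

#define BS 256          /* illustrative block size */
#define NBLOCKS 16      /* illustrative number of blocks */

/* Each task reads one block and adds its partial sum into *sum.
   concurrent(*sum) breaks the inout chain on sum: tasks may run in any
   order, so the update is protected here with an atomic. */
#pragma omp task input([BS]block) concurrent(*sum)
void accumulate(float *block, float *sum)
{
    float partial = 0.0f;
    for (int i = 0; i < BS; i++) partial += block[i];
    #pragma omp atomic
    *sum += partial;
}

int main(void)
{
    static float data[NBLOCKS * BS];
    for (int i = 0; i < NBLOCKS * BS; i++) data[i] = 1.0f;

    float sum = 0.0f;
    for (int b = 0; b < NBLOCKS; b++)
        accumulate(&data[b * BS], &sum);   /* commutative tasks, no serialization */

    #pragma omp taskwait
    printf("sum = %f\n", sum);             /* expected: NBLOCKS * BS */
    return 0;
}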
OmpSs: Directives

Support for multiple implementations of a task, including a task implementation for a GPU device (the compiler parses the CUDA kernel invocation syntax). The copy clauses ask the runtime to ensure that consistent data is accessible in the address space of the device.

#pragma omp target device ({ smp | cuda }) \
        [ implements ( function_name ) ] \
        { copy_deps | [ copy_in ( array_spec, ...) ] [ copy_out (...) ] [ copy_inout (...) ] }

The directionality clauses are used to compute dependences; concurrent allows concurrent execution of commutative tasks.

#pragma omp task [ input (...) ] [ output (...) ] [ inout (...) ] [ concurrent (...) ]
   { function or code block }

The master waits for its child tasks, or for the availability of specific data; noflush relaxes consistency with the main program.

#pragma omp taskwait [ on (...) ] [ noflush ]
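To show how these clauses combine, here is a hedged sketch with two implementations of the same block operation; the function names, the block size and the decision to leave the CUDA body out are illustrative, not from the slides. The runtime may dispatch each task instance to either device; copy_deps asks it to move the dependent blocks into the device's address space, and taskwait on(c) waits only until c is valid for the host.

#define BS 256   /* illustrative block size */

/* Default (SMP) implementation of a block multiply-accumulate. */
#pragma omp target device(smp)
#pragma omp task input([BS][BS]a, [BS][BS]b) inout([BS][BS]c)
void gemm_block(float *a, float *b, float *c)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c[i * BS + j] += a[i * BS + k] * b[k * BS + j];
}

/* Alternative implementation of the same task for a CUDA device.
   copy_deps asks the runtime to make the dependent blocks available
   in the GPU address space; the kernel body is omitted in this sketch. */
#pragma omp target device(cuda) implements(gemm_block) copy_deps
#pragma omp task input([BS][BS]a, [BS][BS]b) inout([BS][BS]c)
void gemm_block_cuda(float *a, float *b, float *c);

void update_block(float *a, float *b, float *c)
{
    gemm_block(a, b, c);        /* runtime picks the smp or cuda version per task */

    #pragma omp taskwait on(c)  /* wait only until c is consistent on the host */
}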