X-KAAPI: a Multi Paradigm Runtime for Multicore Architectures


slide-1
SLIDE 1

P2S2/ICPP 2013

X-KAAPI: a Multi Paradigm Runtime for Multicore Architectures

Thierry Gautier∗, Fabien Lementec∗, Vincent Faucher†, Bruno Raffin∗ ∗INRIA, Grenoble, France †CEA, DEN, DANS, DM2S, SEMT, DYN, Gif-sur Yvette, France Thierry Gautier thierry.gautier@inrialpes.fr

MOAIS, INRIA, Grenoble

slide-2
SLIDE 2

➡ High complexity
  • millions of components
  • heterogeneity of memory and processors
  • Parallel, complex architectures
  • Computing resources: CPU, GPU, ...
  • Memory: hierarchical (registers, L1, L2, L3, main memory), private / shared caches
  • Interconnection network between several cores and memory

2

[Diagram: examples of multicore architectures - cores with private/shared caches and hierarchical memories connected by a network; a GPU with its own memory; a dual Xeon X5650 node with a Tylersburg 36D chipset, PEX 8647 PCIe switches and Tesla C2050 GPUs linked by QPI, PCIe 16x and ESI (I/O)]
slide-3
SLIDE 3
  • Goal: Write Once, Run Anywhere
  • Provide application performance guarantees
  • on multiple parallel architectures
  • under dynamic variations (OS jitter, application load)
  • A two-step solution
  • Definition of a programming model
๏ task based: recursive tasks, adaptive tasks
๏ data flow dependencies, computed at runtime
  • Efficient scheduling algorithms
๏ work stealing with heuristics
๏ HEFT, DualApproximation, ...
๏ theoretical analysis of performance

[Diagram: the application, written against the APIs (#pragma, C, C++, Fortran), generates a data flow task graph (e.g. DGEMM tasks) that a distributed scheduler maps onto CPUs and GPUs]
slide-4
SLIDE 4
  • How to program such architectures?

4

[Timeline, 1994-2013: MPI-1.0 (1994) through MPI-1.1, MPI-1.3, MPI-2.0, MPI-2.1, MPI-2.2, MPI-3.0; Cilk (1996) and Athapascan (distributed work stealing, data flow dependencies); Kaapi (+ fault tolerance, + static scheduling, work stealing of independent tasks; later + adaptive tasks and parallel loops, + multi-GPUs); Cilk++ / Intel Cilk+ (+ parallel loops); OpenMP 1.0 (parallel loops), 2.0, 3.0 (+ tasks), 3.1, 4.0; TBB 1.0, 2.0, 3.0 (+ parallel loops), 4.1; StarPU (data flow + multi-GPUs); Quark (data flow); CellSs, SMPSs, GridSs, GPUSs (+ multi-GPUs), OmpSs (data flow + multi-GPUs + cluster); XKaapi (data flow + parallel loops + adaptive tasks + OpenMP runtime support)]

slide-5
SLIDE 5
  • Outline
  • Introduction
  • Overview of Kaapi parallel programming model
  • Scheduling tasks with data flow dependencies
  • XKaapi’s on-demand task creation
  • Evaluations
  • Micro benchmarks
  • EPX parallelization
  • Conclusions

5

slide-6
SLIDE 6
  • Using code annotations
  • Other APIs: C, C++, Fortran
  • Task ~ OpenMP structured block
  • assumption: no side effects; description of access modes
  • Related work
  • StarPU [Bordeaux, France], OmpSs [BSC, Spain], Quark [UTK]
  • and the new OpenMP 4.0 standard!

Sequential version:

int main() {
  /* data result is produced */
  compute( input, &result );
  /* data result is consumed */
  display( &result );
}

XKaapi version:

int main() {
  #pragma kaapi task read(input) write(result)
  compute( input, &result );
  #pragma kaapi task read(result)
  display( &result );
}

  • Data flow dependencies

6

[Graph: compute → result → display]

slide-7
SLIDE 7
  • XKaapi programming example

7

#include <cblas.h>
#include <clapack.h>

void Cholesky( double* A, int N, size_t NB ) {
  for (size_t k=0; k < N; k += NB) {
    #pragma kaapi task readwrite(&A[k*N+k]{ld=N; [NB][NB]})
    clapack_dpotrf( CblasRowMajor, CblasLower, NB, &A[k*N+k], N );
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma kaapi task read(&A[k*N+k]{ld=N; [NB][NB]}) \
                        readwrite(&A[m*N+k]{ld=N; [NB][NB]})
      cblas_dtrsm( CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                   NB, NB, 1., &A[k*N+k], N, &A[m*N+k], N );
    }
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma kaapi task read(&A[m*N+k]{ld=N; [NB][NB]}) \
                        readwrite(&A[m*N+m]{ld=N; [NB][NB]})
      cblas_dsyrk( CblasRowMajor, CblasLower, CblasNoTrans,
                   NB, NB, -1.0, &A[m*N+k], N, 1.0, &A[m*N+m], N );
      for (size_t n=k+NB; n < m; n += NB) {
        #pragma kaapi task read(&A[m*N+k]{ld=N; [NB][NB]}, &A[n*N+k]{ld=N; [NB][NB]}) \
                          readwrite(&A[m*N+n]{ld=N; [NB][NB]})
        cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasTrans,
                     NB, NB, NB, -1.0, &A[m*N+k], N, &A[n*N+k], N, 1.0, &A[m*N+n], N );
      }
    }
  }
}

slide-9
SLIDE 9
  • Main Characteristics of XKaapi
  • Parallelism is explicit, task based, with data flow dependencies
  • Task creation is a non-blocking operation
  • Dependencies between tasks = data flow dependencies
๏ computed at runtime, during work-stealing requests
๏ unlike StarPU, OmpSs and Quark, where they are computed at task creation
  • Scheduling
  • by work stealing
๏ Cilk-like performance guarantees
  • Tp = O(T1/p + T∞)
  • number of steal requests O(p T∞)
๏ plus heuristics for data locality
  • plus other schedulers: ETF, HETF, DualApproximation
  • Target architectures
  • heterogeneous architectures: multi-CPUs / multi-GPUs
  • many-core: Intel Xeon Phi

9

slide-10
SLIDE 10
  • The way XKaapi executes tasks
  • One "worker thread" per core
  • able to execute XKaapi fine-grain tasks
  • holds a queue of tasks
๏ related to the sequential C stack of activation frames
๏ T.H.E.-style low-overhead protocol; a lock only in rare cases
  • Task creation is cheap!
  • it reduces to pushing a C function pointer + its arguments into the worker thread's queue
๏ ~10 cycles / task on an AMD Magny-Cours processor
  • Recursive tasks are welcome
  • Work-stealing based scheduling
  • Cilk's work-first principle
  • the work-stealing algorithm is a plug-in
๏ default: steal a task from a randomly chosen queue

[Diagram: the owner core pushes/pops tasks at one end of its queue; an idle core steals from the other end]

10

slide-11
SLIDE 11
  • On-demand task creation with XKaapi
  • Example of use: the XKaapi for_each construct
  • A general-purpose parallel loop
๏ a task == a range of iterations to compute
  • Execution model
๏ initially, one task is in charge of the whole range
  • Adaptive tasks in XKaapi
  • Adaptive tasks can be split at run time to create new tasks
  • The programmer provides a "splitter" function, called when an idle core decides to steal some of the remaining computation of a task under execution

11

T1 : [0 - 15]


slide-15
SLIDE 15
  • On-demand task creation with XKaapi
  • Example of use: the XKaapi for_each construct
  • A general-purpose parallel loop
๏ a task == a range of iterations to compute
  • Execution model
๏ initially, one task is in charge of the whole range
๏ idle cores post steal requests and trigger the «split» operation to generate new tasks; concurrent steal requests are aggregated

T1 : [0 - 15]
split (T1, nb_stealers+1)

[Diagram: two idle cores post steal requests; the aggregated requests trigger a single split(T1, nb_stealers+1)]

15


slide-17
SLIDE 17
  • On-demand task creation with XKaapi
  • Example of use: the XKaapi for_each construct (after the split)
  • Idle cores posted steal requests; the aggregated «split» operation divided T1's remaining iterations
  • Result: the victim keeps T1 : [3 - 7]; the two thieves receive T2 : [8 - 11] and T3 : [12 - 15]

17

slide-18
SLIDE 18
  • Outline
  • Introduction
  • Overview of Kaapi parallel programming model
  • Scheduling tasks with data flow dependencies
  • XKaapi’s on-demand task creation
  • Evaluations
  • Micro benchmarks
  • EPX parallelization
  • Conclusions

18

slide-19
SLIDE 19
  • Overhead of task management [AMD48]
  • Benchmark: naive Fibonacci computation
  • creates billions of fine-grain tasks!

19

Cilk+:

long fib(long n) {
  if (n < 2) return (n);
  else {
    long x, y;
    x = cilk_spawn fib(n - 1);
    y = fib(n - 2);
    cilk_sync;
    return (x + y);
  }
}

OpenMP:

void fibonacci(long* result, const long n) {
  if (n < 2) *result = n;
  else {
    long r1, r2;
    #pragma omp task
    fibonacci( &r1, n-1 );
    fibonacci( &r2, n-2 );
    #pragma omp taskwait
    *result = r1 + r2;
  }
}

TBB:

struct FibContinuation: public tbb::task {
  long* const sum;
  long x, y;
  FibContinuation( long* sum_ ) : sum(sum_) {}
  tbb::task* execute() { *sum = x + y; return NULL; }
};

struct FibTask: public tbb::task {
  long n;
  long* sum;
  FibTask( const long n_, long* const sum_ ) : n(n_), sum(sum_) {}
  tbb::task* execute() {
    if (n < 2) { *sum = n; return NULL; }
    else {
      FibContinuation& c = *new( allocate_continuation() ) FibContinuation(sum);
      FibTask& b = *new( c.allocate_child() ) FibTask(n-1, &c.y);
      recycle_as_child_of(c);
      n -= 2;
      sum = &c.x;
      c.set_ref_count(2);  // set ref_count to "two children"
      c.spawn( b );
      return this;
    }
  }
};

Kaapi:

void fibonacci(long* result, const long n) {
  if (n < 2) *result = n;
  else {
    long r1, r2;
    #pragma kaapi task write(&r1)
    fibonacci( &r1, n-1 );
    fibonacci( &r2, n-2 );
    #pragma kaapi sync
    *result = r1 + r2;
  }
}

slide-20
SLIDE 20

  • Overhead of task management [AMD48]
  • Fibonacci (35) naive recursive computation
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

Time per run (seconds):

#Cores   Cilk+    TBB 4.0   XKaapi   OpenMP (gcc)
  1      1.063     2.356     0.728      2.43
  8      0.127     0.293     0.094     51.06
 16      0.065     0.146     0.047    104.14
 32      0.035     0.072     0.024    no time
 48      0.028     0.049     0.017    no time

Overhead on 1 core relative to the serial version:

Serial        0.0905 s  (x 1)
Cilk+         1.063 s   (x 11.7)
TBB (4.0)     2.356 s   (x 26)
OpenMP (gcc)  2.429 s   (x 27)
XKaapi        0.728 s   (x 8)

20

slide-21
SLIDE 21
  • PLASMA: a multicore library for dense linear algebra
  • Two versions of the algorithms
๏ "static": hand-coded scheduling of operations over threads
๏ "Quark": a library to schedule tasks with data flow dependencies
  • Comparison of PLASMA Dpotrf over:
๏ the static scheduler
๏ the Quark scheduler
๏ XKaapi, through the Quark API ported on top of XKaapi
  • PLASMA 2.4.6

21

Dense Cholesky factorization

[Plots: AMD Magny-Cours, 48 cores, block sizes BS = 128 and BS = 224]

slide-22
SLIDE 22
  • Dense Cholesky factorization

[Plot: 8 Fermi M2050 GPUs + 12 cores; Gflop/s versus matrix order (4096 to 40960) for the 4CPU+8GPU, 6CPU+6GPU, 8CPU+4GPU, 10CPU+2GPU and 11CPU+1GPU configurations]

  • Intel Xeon Phi 5110, matrix size 8192, BS = 256
  • Preliminary results / random work stealing

22
slide-23
SLIDE 23
  • Parallelization of EPX
  • Multicore parallelization of the EPX (EUROPLEXUS) code [CEA - IRC - EDF - ONERA]
  • ANR RepDyn funding
  • Fluid-structure systems subjected to fast transient dynamic loading

23

slide-24
SLIDE 24
  • Parallelization of EPX
  • Complex code
  • 600 000 lines of code (Fortran)
  • Two main sources of parallelization (~70% of the computation)
  • Sparse Cholesky factorization
๏ skyline representation
๏ XKaapi parallel program = dependent tasks with data flow dependencies
  • Two independent loops
๏ LOOPELM: iteration over finite elements to compute nodal internal forces
๏ REPERA: iteration for kinematic link detection
๏ parallelization = on-demand task creation through XKaapi's parallel for_each functor
  • Two instances: MEPPEN and MAXPLANE
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

24

slide-25
SLIDE 25
  • Case study: MEPPEN
  • Main characteristics
  • Most of the time is spent in the independent loops LOOPELM and REPERA
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

25

[Plots: MEPPEN execution time (seconds) versus number of cores (1 to 48), broken down into repera, loopelm, Cholesky and other; speedup (/T1) of LOOPELM and REPERA versus core count, against the ideal line]

slide-26
SLIDE 26
  • Case study: MAXPLANE
  • Main characteristics
  • Most of the time is spent in the sparse Cholesky factorization
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

26

[Plots: MAXPLANE execution time (seconds) versus number of cores (1 to 48), broken down into repera, loopelm, Cholesky and other; speedup (/T1) of LOOPELM and REPERA versus core count, against the ideal line]

slide-27
SLIDE 27
  • Conclusions
  • XKaapi
  • a low-overhead runtime for dependent tasks (data flow), with several scheduling algorithms
  • good performance on different architectures
  • multi-CPUs, multi-CPUs + multi-GPUs, Intel Xeon Phi
  • Improving the parallelization of EPX
  • analysis of the remaining sequential part
  • scalability on thousands of cores
  • Integration of our runtime under OpenMP 4.0 directives
  • Supercomputer = high-performance network + multicores + accelerators
  • Software stack
  • Network: MPI
  • Multicores: OpenMP
  • Accelerators: CUDA, OpenCL, OpenACC
  • SIMD units: compiler or extensions
  • More information: http://kaapi.gforge.inria.fr
  • [IWOMP2012]: replacement of libGOMP for OpenMP 3.1/GCC with better task management
  • [IWOMP2013]: extension of the loop scheduler based on on-demand task creation
  • [IPDPS2013]: multi-CPUs / multi-GPUs (12-core machine + 8 GPUs)
  • [SBAC-PAD13]: comparison of Intel Xeon Phi / Intel Sandy Bridge with OpenMP, Cilk+ and Kaapi

slide-28
SLIDE 28
  • Thank you for your attention!

http://kaapi.gforge.inria.fr

28

slide-29
SLIDE 29
  • Sparse Cholesky Factorization (EPX)
  • OpenMP / XKaapi
  • matrix of order 59462 with 3.59% non-zero elements

29

[Plot: speedup (Tp/Tseq) versus core count (up to 45) for OpenMP and XKaapi, against the ideal line]

slide-30
SLIDE 30
  • Performance evaluation: VTK filters
  • Parallel version of the VTK visualization toolkit [M. Ettinger, MOAIS]
  • A framework to develop parallel applications for scientific visualization
  • A VTK «filter» == a computation performed on a 2D/3D scene
  • parallel loop, static OpenMP schedule

30

[Plots: regular workload versus irregular workload]

slide-31
SLIDE 31
  • IFP 2012

Comparison with OmpSs, StarPU

  • DGEMM, matrix size 10240, block size 1024

31

slide-32
SLIDE 32
  • IFP 2012

Comparison with OmpSs, StarPU

  • DPOTRF, matrix size 10240, block size 1024

32

slide-33
SLIDE 33
  • IFP 2012

Comparison with OmpSs, StarPU

  • DPOTRF, matrix size 40960, BS 1024
  • 33

No OmpSs result due to a memory problem with big matrices (a bug for matrices larger than 10280)

slide-34
SLIDE 34
  • Improving the OpenMP task implementation
  • [IWOMP 2012]
  • Barcelona OpenMP Task Suite (BOTS)
  • using libKOMP = our libGOMP implementation (on top of XKaapi)
  • a set of representative benchmarks to evaluate OpenMP task implementations

Name       Arguments used  Domain                 Summary
Alignment  prot100.aa      Dynamic programming    Aligns sequences of proteins
FFT        n=33,554,432    Spectral method        Computes a Fast Fourier Transform
Floorplan  input.20        Optimization           Computes the optimal placement of cells in a floorplan
NQueens    n=14            Search                 Finds solutions of the N-Queens problem
MultiSort  n=33,554,432    Integer sorting        Uses a mixture of sorting algorithms to sort a vector
SparseLU   n=128 m=64      Sparse linear algebra  Computes the LU factorization of a sparse matrix
Strassen   n=8192          Dense linear algebra   Computes a matrix multiply with Strassen's method
UTS        medium.input    Search                 Computes the number of nodes in an Unbalanced Tree

  • Evaluation platforms
  • AMD48: 4x12 AMD Opteron (6174) cores
  • Intel32: 4x8 Intel Xeon (X7560) cores
  • Software
  • gcc 4.6.2 + libGOMP
  • gcc 4.6.2 + libKOMP
  • icc 12.1.2 + Intel OpenMP runtime (KMP)

34

slide-35
SLIDE 35
  • Running OpenMP BOTS with libKOMP

Speed-up of BOTS kernels on the AMD48 platform:

Kernel     libGOMP  libKOMP  Intel
Alignment    38.8     40.0    37.0
FFT           0.5     12.2    12.0
Floorplan    27.6     32.7    29.2
NQueens      43.7     47.8    39.0
MultiSort     0.6     13.2    11.3
SparseLU     44.1     44.4    35.0
Strassen     20.8     22.4    20.5
UTS           0.9     25.3    15.0

35

  • Evaluation platforms
  • AMD48: 4x12 AMD Opteron (6174) cores
  • Intel32: 4x8 Intel Xeon (X7560) cores
  • Software
  • gcc 4.6.2 + libGOMP
  • gcc 4.6.2 + libKOMP
  • icc 12.1.2 + Intel OpenMP runtime (KMP)
slide-36
SLIDE 36
  • Software stack

[Diagram: applications (Europlexus, VTK, PLASMA, Sofa) on top of the programming interfaces (C + #pragma, C API, C++ API, Fortran API, KaCC, Quark, OpenMP (gcc) via libKOMP); the XKaapi runtime (adaptive tasks with data flow) sits above Pthreads and atomics, on multicore nodes with GPUs and NICs]

36

slide-37
SLIDE 37
OpenMP version:

#include <cblas.h>
#include <clapack.h>

void Cholesky( double* A, int N, size_t NB ) {
  #pragma omp parallel
  #pragma omp single
  for (size_t k=0; k < N; k += NB) {
    #pragma omp task depend(inout: &A[k*N+k]{ld=N; [NB][NB]})
    clapack_dpotrf( CblasRowMajor, CblasLower, NB, &A[k*N+k], N );
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma omp task depend(in: &A[k*N+k]{ld=N; [NB][NB]}) \
                       depend(inout: &A[m*N+k]{ld=N; [NB][NB]})
      cblas_dtrsm( CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                   NB, NB, 1., &A[k*N+k], N, &A[m*N+k], N );
    }
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma omp task depend(in: &A[m*N+k]{ld=N; [NB][NB]}) \
                       depend(inout: &A[m*N+m]{ld=N; [NB][NB]})
      cblas_dsyrk( CblasRowMajor, CblasLower, CblasNoTrans,
                   NB, NB, -1.0, &A[m*N+k], N, 1.0, &A[m*N+m], N );
      for (size_t n=k+NB; n < m; n += NB) {
        #pragma omp task depend(in: &A[m*N+k]{ld=N; [NB][NB]}, &A[n*N+k]{ld=N; [NB][NB]}) \
                         depend(inout: &A[m*N+n]{ld=N; [NB][NB]})
        cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasTrans,
                     NB, NB, NB, -1.0, &A[m*N+k], N, &A[n*N+k], N, 1.0, &A[m*N+n], N );
      }
    }
  }
}

37

slide-38
SLIDE 38

  • Overhead of task management [Xeon Phi]
  • Fibonacci (38) naive recursive computation
  • 60 cores of an Intel Xeon Phi 5100

#Threads  Cilk+ (s)  OpenMP (gcc) (s)  XKaapi (s)
   1        33.21         65.64          15.52
  10         3.34         33.12           1.58
  20         1.66         17.54           0.79
  60         0.56          6.30           0.27
 120         0.38          3.86           0.18
 240         0.37          3.18           0.18

38

slide-39
SLIDE 39
  • Stealing a task
  • A thief thread:
  • iterates through the tasks in the victim's queue
๏ iteration order = creation-time order, from oldest to newest
  • computes data flow dependencies
๏ with previously visited tasks
  • detects ready tasks
  • Lazy computation of data flow dependencies
  • in work-stealing "theory", this is the work-first principle
๏ overhead is moved from the work of the program onto the critical path

[Diagram: the owner core pushes/pops at the newest end of its queue; an idle core steals from the oldest end]

39

slide-40
SLIDE 40
  • 1 GPU

[Diagram: per-GPU streams - Stream H2D, Stream Kernel, Stream D2H - synchronized with NVIDIA events]

40

slide-41
SLIDE 41
  • Memory region description

#include <cblas.h>
#include <clapack.h>

void Cholesky( double* A, int N, size_t NB ) {
  for (size_t k=0; k < N; k += NB) {
    #pragma kaapi task readwrite(&A[k*N+k]{ld=N; [NB][NB]})
    clapack_dpotrf( CblasRowMajor, CblasLower, NB, &A[k*N+k], N );
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma kaapi task read(&A[k*N+k]{ld=N; [NB][NB]}) \
                        readwrite(&A[m*N+k]{ld=N; [NB][NB]})
      cblas_dtrsm( CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                   NB, NB, 1., &A[k*N+k], N, &A[m*N+k], N );
    }
    ...
  }
}

41

[Diagram: the NB x NB submatrices of A accessed by clapack_dpotrf and cblas_dtrsm; N = leading dimension]

slide-42
SLIDE 42
  • False dependency resolution
  • Also called:
  • Write after Read
  • Write after Write
  • Here => Task1 and Task2 cannot run concurrently, unless 'a' is duplicated
  • also called "variable renaming"

42

[Diagrams: Task1 reads variable 'a' and Task2 later writes it; without renaming Task2 must wait, with renaming Task2 writes a new version vi+1 while Task1 still reads version vi, so both tasks can run concurrently]