Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach

Vinícius Garcia Pinto, Lucas Mello Schnorr, Luka Stanisic, Arnaud Legrand, Samuel Thibault, Vincent Danjean

WSPPD Workshop, Porto Alegre, Brazil, September 4th, 2017
Context

Current HPC architectures
- Moving from transistor scaling to heterogeneity
- Hybrid computing resources: CPUs, GPUs, MICs

Programming hybrid platforms
- Traditional, explicit programming models (MPI, CUDA, OpenMP, pthreads, ...)
  - Perfect control → maximal achievable performance
  - Monolithic codes → hard to develop and maintain
  - Hard to optimize → poor performance portability
  - Fixed scheduling → sensitive to variability
- Recent task-based programming models (PaRSEC, OmpSs, Charm++, StarPU, ...)
  - Single, abstract programming model based on a DAG
  - Runtime responsible for dynamic scheduling
  - Portability of code and performance
  - New challenge → choosing the scheduling heuristic
Visualization of Task Scheduling

"Parallel Simulation of Superscalar Scheduling", Haugen, Kurzak, YarKhan, Dongarra. ICPP 2014.

[Figure: QR factorization of a matrix (size: 3960; tile size: 180) with the QUARK scheduler on 48 cores (one node).]
[Figure: Cholesky factorization of a matrix (size: 47040; tile size: 960) with the "MPI-Aware" DMDAS scheduler of StarPU+MPI on 2 nodes with 4 cores and 4 GPUs each.]
Related Work: Classical Analysis Tools

Space/time view (resources may be hierarchically organized), plus tool-specific extras:
- Paraver (100K SLOC) - https://tools.bsc.es/paraver
- Projections (35K SLOC) - http://charm.cs.uiuc.edu/software
- FrameSoC (300K SLOC + LTTng) - https://soctrace-inria.github.io/framesoc/
- Ravel (19K SLOC) - https://github.com/LLNL/ravel
- Paje (31K SLOC, in Objective-C) - https://github.com/schnorr/Paje
- ViTE (27K SLOC) - http://vite.gforge.inria.fr/

[Figure: tiled Cholesky factorization from StarPU+MPI visualized with ViTE.]
Related Work: Emerging Alternatives

- Ad hoc visualization of task dependencies (??? SLOC) - see VPA 2015
- Exploiting DAG structure: DAGViz (??? SLOC) - see VPA 2015
- Entropy-aware aggregation: Ocelotl (3K + 300K SLOC) - https://github.com/soctrace-inria/ocelotl
Current Tools for Visual Performance Analysis

- Implemented in C/C++ to scale
- Interactive (depending on scale) and user friendly (mouse interaction)
- Large and complex source code, difficult to extend
- Generally not designed for hybrid platforms and dynamic runtimes
- Flexible filtering calls for scripting capabilities
- Lack custom views exploiting application and platform structure
Our (Agile, Scriptable, Flexible) 2-Phase Workflow

Adopt modern data analysis tools for scripting → pj_dump + R + tidyverse + ggplot2 + plotly (≈ 3.5K SLOC)
Workflow execution: screen (1st phase) + org-mode (2nd phase)

[Figure: simplified 2-phase workflow (see our forthcoming paper). Phase 1: (A) export of execution traces (FXT) and the DAG (DOT) from Chameleon/Cholesky; (B) conversion with starpu_fxt_tool, pjdump, and dot2csv; (C) reading of states, entities, links, and variables into CSV/Feather; (D) cleaning, filtering, and derivation (outliers, y-coordinates); (E) output. Phase 2: (A) reading the Feather files; (B) data visualization, static with ggplot2 (space/time, idleness, ABE, ready and submitted tasks, MPI and GPU transfers, GFlops, used memory) and interactive with plotly; (C) assembly; (D) analysis.]

- Fail fast if an idea does not work
- Workflow can be shared to reproduce (and change) the analysis
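To make the scripting idea concrete, here is a minimal first-phase sketch in R of steps C-E: reading the state records exported by pj_dump and drawing a basic space/time view. The column names and the file name "states.csv" are assumptions about the pj_dump CSV layout, not a fixed interface of the tool; the actual workflow also caches intermediate data in Feather files.

    # Read pj_dump state records; assumed columns: record nature, resource
    # (worker), state type, start/end timestamps, duration, nesting depth,
    # and the state value (kernel name).
    library(tidyverse)

    states <- read_csv("states.csv",
                       col_names = c("Nature", "ResourceId", "Type",
                                     "Start", "End", "Duration",
                                     "Depth", "Value")) %>%
      filter(Nature == "State", Duration > 0)   # keep non-empty states only

    # Space/time view: one horizontal segment per state, one row per worker.
    ggplot(states) +
      geom_segment(aes(x = Start, xend = End,
                       y = ResourceId, yend = ResourceId,
                       color = Value),
                   size = 4) +
      labs(x = "Time (ms)", y = "Worker", color = "State") +
      theme_bw()

Because the whole view is an ordinary ggplot2 object, a custom view (new facet, new filter, new derived metric) is a few lines of change rather than a patch to a C/C++ visualization tool.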
Experimental Validation: Application and Platform

MORSE - Matrices Over Runtime Systems @ Exascale - http://icl.cs.utk.edu/projectsdev/morse/

Tiled Cholesky factorization available in Chameleon:

    for (k = 0; k < N; k++) {
        DPOTRF(RW, A[k][k]);
        for (i = k+1; i < N; i++)
            DTRSM(RW, A[i][k], R, A[k][k]);
        for (i = k+1; i < N; i++) {
            DSYRK(RW, A[i][i], R, A[i][k]);
            for (j = k+1; j < i; j++)
                DGEMM(RW, A[i][j], R, A[i][k], R, A[j][k]);
        }
    }

[Figure: the resulting task DAG, with dpotrf, dtrsm, dsyrk, and dgemm nodes for each iteration k.]

StarPU runtime on this platform:
- idcin-2.grenoble.grid5000.fr (Digitalis, phased out in February 2017)
- Two 14-core Intel Xeon E5-2697v3 with three NVIDIA Titan X
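For illustration, a rough R sketch of the dot2csv step from the workflow above, applied to a DAG like this one once exported in DOT format: dependency edges are extracted into a tibble that can later be left_join'ed with the execution states. The quoted edge syntax and the file name "dag.dot" are assumptions about the generated file.

    library(tidyverse)

    # Match edge lines such as:  "task_42" -> "task_57"
    m <- str_match(read_lines("dag.dot"),
                   '"([^"]+)"\\s*->\\s*"([^"]+)"')

    edges <- tibble(from = m[, 2], to = m[, 3]) %>%
      filter(!is.na(from))   # drop non-edge lines (header, node decls, "}")

    # edges can now be joined with the states tibble to relate each task's
    # scheduled interval to those of its predecessors.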
Scheduler Comparison (input: 60×60 tiles of size 960×960)

[Figure: space/time views for the DMDA, DMDAS, and WS schedulers, unconstrained and constrained.]

Small matrix + interaction (12×12) → try it yourself at http://perf-ev-runtime.gforge.inria.fr/vpa2016/
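A hedged sketch of the kind of comparison behind this slide: with one states tibble per run (read as in the earlier sketch) tagged by scheduler, makespan and aggregate idleness fall out of a short dplyr chain. The per-run objects and the "Idle" state value are assumptions about the trace contents.

    library(tidyverse)

    # states_dmda, states_dmdas, states_ws: hypothetical per-run tibbles
    runs <- bind_rows(list(DMDA  = states_dmda,
                           DMDAS = states_dmdas,
                           WS    = states_ws),
                      .id = "Scheduler")

    runs %>%
      group_by(Scheduler) %>%
      summarize(Makespan = max(End),                          # last state end
                Idleness = sum(Duration[Value == "Idle"]) /
                           sum(Duration))                     # idle fraction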
Conclusion and Ongoing Work

Achievements
- Flexible analysis workflow in ≈ 3.5K SLOC
- Dynamic task-based applications
- Multi-node, multi-core, multi-GPU, ...

What's next?
- Suitable for scheduling specialists
- Immediate work: investigate data-dependency (scheduler) anomalies at scale
Thank you for your attention! Questions?

schnorr@inf.ufrgs.br - vgpinto@inf.ufrgs.br

Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach.
3rd Workshop on Visual Performance Analysis (VPA) - https://hal.inria.fr/hal-01353962