Automation of Determination of Optimal Intra-Compute Node - PowerPoint PPT Presentation

Automation of Determination of Optimal Intra-Compute Node Parallelism
Scalable Tools Workshop
PRESENTED BY: Antonio Gómez (agomez@tacc.utexas.edu), James C. Browne
8/1/16

Why?
• Many applications use MPI for intra-node parallelism
• Not all loops in the code are the same
• Improve resource utilization; get the highest intra-node parallelization
• But still make it as easy as possible for users

Using PerfExpert for this
• PerfExpert
  • Under development since 2008
  • Show users something simple
  • We don't look for the best performance, but for good performance
• Several different tools integrated into PerfExpert
  • Compilation, Measurement, Instrumentation, Analysis, Recommendation
• Continuous improvements
  • Analysis parallelization
  • Load imbalance
  • Vectorization reports
  • Support for KNL
https://github.com/TACC/perfexpert

What are we trying to do
• Help users characterize their codes
• Create a list of the most critical loops and code sections with:
  • Information about LCPI
  • Highest possible degree of parallelism of that loop/section
• Expect changes in the code by the user
• Rerun analysis
• Automate as much as possible
• And this is only intra-node

Find critical sections
• Use LCPI
• HPCToolkit/VTune under the covers (Measurement)
• LCPI metric is calculated for each code section (Analysis)
• Metrics are modified depending on the processor
• Still adding support for KNL
  • Consider MCDRAM
  • Detect memory mode

LCPI
• LCPI (Local Cycles Per Instruction)
• Several metrics associated with the main one
• Processor dependent
  • Sandy Bridge
    • Data
    • TLB
    • …
LCPI Data = (L1_HIT*L1_lat + L2_Hit*L2_lat + L2_Miss*Mem_lat) / TOT_INS
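The LCPI Data formula above can be sketched in a few lines of Python. This is only an illustration, not PerfExpert's implementation: the `Counters` class, the function name `lcpi_data`, and the latency values are hypothetical placeholders, and real latencies depend on the processor.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hardware counter totals for one code section (hypothetical container)."""
    l1_hit: int    # L1_HIT
    l2_hit: int    # L2_Hit
    l2_miss: int   # L2_Miss
    tot_ins: int   # TOT_INS


# Assumed latencies in cycles, for illustration only; not taken from the slides.
L1_LAT = 4.0
L2_LAT = 12.0
MEM_LAT = 200.0


def lcpi_data(c: Counters, l1_lat: float = L1_LAT,
              l2_lat: float = L2_LAT, mem_lat: float = MEM_LAT) -> float:
    """Data-access cycles per instruction, following the slide's formula."""
    if c.tot_ins <= 0:
        raise ValueError("tot_ins must be positive")
    cycles = c.l1_hit * l1_lat + c.l2_hit * l2_lat + c.l2_miss * mem_lat
    return cycles / c.tot_ins


section = Counters(l1_hit=1000, l2_hit=100, l2_miss=10, tot_ins=10000)
print(lcpi_data(section))
```

With the assumed latencies, the sample section spends 4000 + 1200 + 2000 = 7200 cycles on data accesses over 10000 instructions, so most of the cost comes from L1 hits and the few memory accesses.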

What's the idea?
• Start with MPI applications
• Find critical loops
• Optimize the code
• Annotate the highest degree of parallelism
• When no further optimization is possible, introduce OpenMP
• Reoptimize
• But do this considering the highest degree of parallelism possible (empirical value) and the overhead introduced by OpenMP

Automated workflows
• MPI Workflow
  • Many applications still use MPI for intra-node parallelization
  • Idea
    • Find critical sections
    • Identify scalability for those sections
    • Improve memory access pattern
    • Rerun scalability
    • Repeat if necessary

Estimation Workflow
• For the main loops in the code, identify their LCPI
• Get max. theoretical speedup and compare with achieved
• Decide whether to continue or not
[Figure: LCPI - Sandy Bridge]
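The continue-or-stop decision in the estimation workflow could look roughly like the sketch below. The slides do not say how the maximum theoretical speedup is obtained or what cutoff is used, so here it is simply passed in by the caller, and `target_fraction` is a hypothetical tuning knob.

```python
def achieved_speedup(t_baseline: float, t_current: float) -> float:
    """Speedup of the current run relative to a baseline run (both in seconds)."""
    if t_baseline <= 0 or t_current <= 0:
        raise ValueError("run times must be positive")
    return t_baseline / t_current


def should_continue(max_theoretical: float, achieved: float,
                    target_fraction: float = 0.8) -> bool:
    """Keep optimizing while the achieved speedup is below an assumed
    fraction of the maximum theoretical speedup."""
    return achieved < target_fraction * max_theoretical


speedup = achieved_speedup(t_baseline=120.0, t_current=20.0)
print(speedup, should_continue(max_theoretical=16.0, achieved=speedup))
```

In this example a loop that could in theory speed up 16x has only reached 6x, so another optimization pass is still worthwhile under the assumed 80% target.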

Hybrid Workflow
• Consider OpenMP overhead
• Identify a threshold that specifies whether adding OpenMP is beneficial or not
• Add OpenMP
• Calculate LCPI
• Modify memory access pattern
• Calculate LCPI
• Check whether there is a benefit and compare the difference with the threshold
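The threshold check at the end of the hybrid workflow might be expressed as follows. This is a sketch under stated assumptions: the slides do not define the units of the threshold, so it is treated here as a minimum LCPI reduction that must be exceeded to justify the OpenMP overhead, and the function name is hypothetical.

```python
def openmp_is_beneficial(lcpi_before: float, lcpi_after: float,
                         threshold: float) -> bool:
    """Return True when adding OpenMP lowered LCPI by more than the threshold.

    Lower LCPI is better, so the improvement is before minus after.
    The threshold is an assumed stand-in for the OpenMP overhead.
    """
    improvement = lcpi_before - lcpi_after
    return improvement > threshold


# LCPI measured before adding OpenMP and after adding it and
# modifying the memory access pattern (made-up values).
print(openmp_is_beneficial(lcpi_before=1.2, lcpi_after=0.9, threshold=0.1))
print(openmp_is_beneficial(lcpi_before=1.2, lcpi_after=1.15, threshold=0.1))
```

The first call reports a benefit because the reduction of about 0.3 exceeds the threshold; the second does not, since a reduction of about 0.05 would not cover the assumed overhead.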

Some Results (SPPARKS)
[Figures: Original Weak Scalability; Optimized Weak Scalability]

Future of PerfExpert
• Lustre counters (IO in general)
• Integration of MPI_T (MPI Advisor)
• Considering OMPT
• Software versioning control
• Extending user interface
• Instrumentation
  • Already doing something (MACPO: memory access pattern)
  • What else?
• Keep it simple
• Promotion!
https://github.com/TACC/perfexpert

Something different now

REMORA
• Monitoring/Profiling tool developed at TACC
• Very simple:
  • Background task on each node
  • Collects:
    • CPU utilization
    • NUMA stats
    • Memory utilization (free, virtual, …)
    • Lustre counters
• Fairly popular tool on TACC systems (XALT)
• Very easy to use, easy to understand
  $ remora ./myexe
  $ remora mpirun ./myexe
• Answers simple questions
https://github.com/TACC/remora

REMORA
=============================== REMORA SUMMARY ==============================
Max Memory Used Per Node : 7.65 GB
Total Elapsed Time       : 0d 0h 1m 9s 176ms
------------------------------------------------------------------------------
Max IO Load / home1      : 0 IOPS 0 RD(MB/S) 0 WR(MB/S)
Max IO Load / scratch    : 76 IOPS 3011 RD(MB/S) 425 WR(MB/S)
Max IO Load / work       : 0 IOPS 0 RD(MB/S) 0 WR(MB/S)
==============================================================================
Sampling Period          : 1 seconds
Complete Report Data     : /lbm_bench/bin/remora_7306879
==============================================================================
https://github.com/TACC/remora
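A summary like the one above is easy to post-process. The sketch below extracts the "Max IO Load" rows with a regular expression; it assumes the line layout of this one sample, which other REMORA versions may not follow, and the function names are hypothetical rather than part of REMORA.

```python
import re

# Pattern modeled on the "Max IO Load" rows of the sample summary (an assumption).
IO_LINE = re.compile(
    r"Max IO Load / (\S+)\s*:\s*(\d+) IOPS\s+(\d+) RD\(MB/S\)\s+(\d+) WR\(MB/S\)"
)

SAMPLE = """\
Max IO Load / home1 : 0 IOPS 0 RD(MB/S) 0 WR(MB/S)
Max IO Load / scratch : 76 IOPS 3011 RD(MB/S) 425 WR(MB/S)
Max IO Load / work : 0 IOPS 0 RD(MB/S) 0 WR(MB/S)
"""


def parse_io_load(summary: str) -> dict:
    """Map each filesystem name to its peak IOPS and read/write bandwidth."""
    loads = {}
    for fs, iops, rd, wr in IO_LINE.findall(summary):
        loads[fs] = {"iops": int(iops), "read_mb_s": int(rd), "write_mb_s": int(wr)}
    return loads


def busiest_filesystem(loads: dict) -> str:
    """Name of the filesystem with the highest peak IOPS."""
    return max(loads, key=lambda fs: loads[fs]["iops"])


loads = parse_io_load(SAMPLE)
print(busiest_filesystem(loads), loads[busiest_filesystem(loads)])
```

For the sample run this singles out scratch as the only filesystem under load, which is the kind of simple question the summary is meant to answer.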

Use Case: More IO
• Original code creating high IO load
• Improved IO: reduce frequency and how it is implemented
• New code: Improved performance. Improved stability of filesystem
[Figure: IO (requests/s) vs. Time (seconds), Original and Improved]
https://github.com/TACC/remora

