Germán Llort gllort@bsc.es

- >10k processes + long runs = large traces
- Blind tracing is not an option
- Profilers also start to run into problems
- Can you even store the data?
- How patient are you?

IPDPS - Atlanta, April 2010

- Past methodology: filters driven by the expert
  • Get the whole trace
  • Summarize it for a global view
  • Focus on a representative region
- Goal: transfer that expertise to the run-time

- Traces of “100 MB”
  • That best describe the application behavior
  • Trade-off: maximize the information / data ratio
- The challenge?
  • Intelligent selection of the information
- How?
  • On-line analysis framework
    – Decide at run-time what is most relevant

[Figure: MPItrace attaches to the application tasks T0, T1, …, Tn; trace data flows through the MRNet reduction network to the MRNet front-end, where the clustering analysis runs.]
- Data acquisition
  • MPItrace (BSC)
    – PMPI wrappers
- Data transmission
  • MRNet (U. of Wisconsin)
    – Scalable master / worker
    – Tree topology
- Data analysis
  • Clustering (BSC)
    – Find the structure of computing regions

[Figure: back-end threads with local trace buffers on tasks T0, T1, …, Tn; data is aggregated up the MRNet tree to the MRNet front-end, which runs the clustering analysis and broadcasts the results back down.]
- Back-end threads
  • Local trace buffers
  • BE threads blocked
- FE periodically collects data
  • Automatic / fixed interval
  • Reduction on the tree
- Global analysis
- Propagate results
- Locally emit trace events
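The collection cycle on this slide (aggregate up the tree, analyze at the front-end, broadcast down) can be mimicked in a few lines. This is a minimal Python sketch. It assumes per-task data can be merged by an associative `combine` function. `tree_reduce`, `broadcast` and the fan-out value are hypothetical names chosen for illustration and are not MRNet's actual API.

```python
"""Hypothetical sketch of an MRNet-style collection cycle: per-task data is
combined level by level up a k-ary tree, and the result is broadcast back.
tree_reduce/broadcast are illustrative names, not MRNet's API."""
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(leaves: List[T], fanout: int,
                combine: Callable[[List[T]], T]) -> Tuple[T, int]:
    """Reduce leaf values to a single root value; also return the tree depth."""
    if not leaves:
        raise ValueError("need at least one leaf")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(leaves)
    depth = 0
    while len(level) > 1:
        # Each internal node combines the packets of up to `fanout` children.
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        depth += 1
    return level[0], depth


def broadcast(result: T, n_leaves: int) -> List[T]:
    """The front-end sends the same analysis result to every back-end."""
    return [result] * n_leaves


if __name__ == "__main__":
    # Example: each of 16 back-ends reports its local computing time (s).
    local_times = [float(t) for t in range(16)]
    total, depth = tree_reduce(local_times, fanout=4, combine=sum)
    print(total, depth)  # 120.0 2
```

The number of levels grows only logarithmically with the task count. With a fan-out of 32, for example, 10,000 leaves need just 3 levels. This logarithmic growth is what makes a tree topology attractive at this scale.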

- Density-based clustering algorithm
  • J. Gonzalez, J. Gimenez, J. Labarta – IPDPS'09, “Automatic detection of parallel applications computation phases”
- Characterize the structure of computing regions
- Using hardware counter data
  • Instructions + IPC
    – Complexity & performance
  • Any other metric
    – e.g. L1, L2 cache misses
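The density-based idea can be illustrated with the textbook DBSCAN algorithm. The sketch below is a plain Python stand-in. It assumes each computing burst has already been reduced to a 2-D point such as (instructions, IPC) on normalized scales. The `eps` and `min_pts` values in the example are hypothetical. The BSC tool's actual implementation and preprocessing may differ.

```python
"""Textbook DBSCAN on 2-D points such as (instructions, IPC), as a stand-in
for the density-based algorithm cited above. O(n^2) neighbor search."""
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
UNVISITED = -2
NOISE = -1


def region_query(points: Sequence[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, p in enumerate(points) if math.dist(p, points[i]) <= eps]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also core: grow the cluster
        cluster += 1
    return labels


if __name__ == "__main__":
    # Two dense groups of bursts and one outlier (normalized units).
    bursts = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
              (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (10.0, 0.0)]
    print(dbscan(bursts, eps=0.5, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

The quadratic neighbor search in this naive version is one reason clustering time grows quickly with the number of points.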

[Figure: clustering results. Panels: scatter plot of clustering metrics; cluster distribution over time; cluster performance; code linking.]

- Trigger the clustering analysis periodically
  • Sequence of structure snapshots
- Compare subsequent clusterings
  • See changes in the application behavior
- Find a representative region
  • Most applications are highly iterative
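The periodic loop can be sketched as a streak counter over successive snapshots. This is a minimal Python sketch. It assumes that "N equivalent clusterings in a row" means N consecutive comparisons in which each snapshot matches its predecessor. `detect_stability` and the `equivalent` predicate are hypothetical names. The real comparison is the cluster-by-cluster test on the following slide.

```python
"""Hypothetical sketch of the on-line stability check: clusterings arrive
periodically and the region is declared stable after `n_required`
consecutive snapshots that each match their predecessor."""
from typing import Callable, Optional, Sequence, TypeVar

S = TypeVar("S")


def detect_stability(snapshots: Sequence[S],
                     equivalent: Callable[[S, S], bool],
                     n_required: int) -> Optional[int]:
    """Return the index of the snapshot at which stability is reached, or None."""
    streak = 0
    for t in range(1, len(snapshots)):
        if equivalent(snapshots[t - 1], snapshots[t]):
            streak += 1
            if streak >= n_required:
                return t
        else:
            streak = 0  # behavior changed: keep looking for differences
    return None


if __name__ == "__main__":
    # Toy snapshots: one number summarizing each clustering (e.g. #clusters).
    history = [3, 7, 7, 7, 7, 8]
    print(detect_stability(history, lambda a, b: a == b, n_required=3))  # 4
```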

- Compare 2 clusterings, cluster by cluster
  • Inscribe each cluster into a rectangle
  • Match those that overlap within a 5% variance
  • The matched clusters must cover 85% of the total computing time
  [Figure: examples of a matching (OK) and a non-matching (KO) comparison]
- Stability = N equivalent clusterings “in a row”
  • Keep looking for differences
- Gradually lower the requirements if they cannot be met
  • Best possible region based on the results “seen” so far
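The comparison rules on this slide can be sketched as follows. This Python sketch is hypothetical on several points, all of them assumptions:

- Each cluster is kept only as its bounding rectangle plus its share of computing time.
- "Overlap with a 5% variance" is read as: the two rectangles, each widened by 5% of its width and height, intersect.
- A new cluster counts as matched if it matches any old cluster. One-to-one pairing is not enforced.

```python
"""Hypothetical sketch of the cluster-by-cluster comparison: each cluster is
reduced to its bounding rectangle in the metric space plus its share of
computing time. The widening rule below is one reading of "5% variance"."""
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Cluster:
    x_min: float     # e.g. instructions (normalized)
    x_max: float
    y_min: float     # e.g. IPC (normalized)
    y_max: float
    time_pct: float  # share of the total computing time, in percent


def widened(c: Cluster, tol: float) -> Tuple[float, float, float, float]:
    """Grow the rectangle by `tol` of its width/height on every side."""
    dx = (c.x_max - c.x_min) * tol
    dy = (c.y_max - c.y_min) * tol
    return c.x_min - dx, c.x_max + dx, c.y_min - dy, c.y_max + dy


def rectangles_match(a: Cluster, b: Cluster, tol: float = 0.05) -> bool:
    """True if the widened rectangles of a and b overlap."""
    ax0, ax1, ay0, ay1 = widened(a, tol)
    bx0, bx1, by0, by1 = widened(b, tol)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def equivalent(old: Sequence[Cluster], new: Sequence[Cluster],
               tol: float = 0.05, min_coverage: float = 85.0) -> bool:
    """Equivalent if the new clusters that match some old cluster cover
    at least `min_coverage` percent of the computing time."""
    matched = sum(c.time_pct for c in new
                  if any(rectangles_match(c, o, tol) for o in old))
    return matched >= min_coverage


if __name__ == "__main__":
    before = [Cluster(0.0, 1.0, 0.0, 1.0, 60.0), Cluster(2.0, 3.0, 2.0, 3.0, 30.0)]
    similar = [Cluster(0.02, 1.02, 0.0, 1.0, 58.0), Cluster(2.0, 3.0, 2.0, 3.0, 31.0)]
    moved = [Cluster(0.02, 1.02, 0.0, 1.0, 58.0), Cluster(5.0, 6.0, 5.0, 6.0, 31.0)]
    print(equivalent(before, similar), equivalent(before, moved))  # True False
```

In this sketch, lowering the requirements would amount to retrying with a smaller `min_coverage` or a larger `tol`. The slides do not describe the actual relaxation policy.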

[Figure: 60 MB, 6 iterations]

- Clustering time grows with the number of points
  • 5k points → 10 s; 50k points → 10 min
- Sample a subset of the data to cluster (SDBScan)
  • Space: select a few processes, full time sequence
  • Time: random sampling, wide coverage
- Classify the remaining data
  • Nearest-neighbor algorithm
    – Reusing the clustering structures
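The two sampling dimensions and the nearest-neighbor step can be sketched as follows. This Python sketch rests on three assumptions:

- Bursts are grouped per task as 2-D points.
- The representative tasks are already chosen.
- Classification is a plain 1-nearest-neighbor rule with Euclidean distance.

`sample_points` and `classify_nearest` are hypothetical helpers and not the SDBScan implementation.

```python
"""Hypothetical sketch of sampled clustering: cluster only a subset, then
label the rest with a 1-nearest-neighbor rule. Data layout and parameter
values are assumptions, not the SDBScan implementation."""
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def sample_points(bursts_by_task: Dict[int, List[Point]],
                  representatives: Sequence[int],
                  fraction: float, seed: int = 0) -> List[Point]:
    """Space sampling (every burst of a few tasks) + time sampling (a random
    fraction of the bursts of all other tasks)."""
    rng = random.Random(seed)
    chosen = set(representatives)
    sample: List[Point] = []
    for task, bursts in bursts_by_task.items():
        if task in chosen:
            sample.extend(bursts)  # full time sequence of a representative
        else:
            sample.extend(p for p in bursts if rng.random() < fraction)
    return sample


def classify_nearest(points: Sequence[Point],
                     labeled: Sequence[Tuple[Point, int]]) -> List[int]:
    """Give each point the cluster label of its closest clustered point."""
    if not labeled:
        raise ValueError("no clustered points to classify against")
    return [min(labeled, key=lambda item: math.dist(p, item[0]))[1]
            for p in points]


if __name__ == "__main__":
    clustered = [((0.0, 0.0), 0), ((5.0, 5.0), 1)]  # output of the clustering step
    print(classify_nearest([(0.2, 0.1), (4.8, 5.3)], clustered))  # [0, 1]
```

Only the sampled points go through the expensive density-based step. Every remaining burst costs one nearest-neighbor lookup against the already-clustered set.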

[Figure: clustering results under different sampling strategies. Panels: all processes; 25% random records; 32 representatives; 15% random records; 16 representatives; 10% random records; 8 representatives + 15% random records.]
- 75% less data, good quality
- 6 s, down from 2 min: fast analysis

- Significant trace size reductions
- Results before the application finishes
- The final trace is representative

- Compared vs. profiles of the whole run
  • TAU Performance System (U. of Oregon)
- Same overall structure
  • Same relevant functions, average HWC values & time %
  • Most measurement differences under 1%

                        Full run profile (TAU)       Trace segment (MPItrace)
GROMACS user function   % Time  Kinstr  Kcycles      % Time  Kinstr  Kcycles
do_nonbonded            23.72%  24,709  22,349       23.94%  24,700  22,533
solve_pme               10.47%   6,795   9,913       10.52%   6,776   9,898
gather_f_bsplines        5.69%   5,286   5,387        5.64%   5,248   5,302
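The “under 1%” claim can be checked directly against the numbers on this slide. This small Python sketch takes the TAU full-run values as the reference, which is an assumption about which side to divide by. The dictionaries copy the table values, and `relative_differences` is an illustrative helper.

```python
"""Recompute relative differences between the full-run TAU profile and the
MPItrace trace segment, using the GROMACS values from the slide."""
from typing import Dict, Tuple

Row = Tuple[float, float, float]  # (% time, Kinstr, Kcycles)

TAU: Dict[str, Row] = {
    "do_nonbonded": (23.72, 24709, 22349),
    "solve_pme": (10.47, 6795, 9913),
    "gather_f_bsplines": (5.69, 5286, 5387),
}
MPITRACE: Dict[str, Row] = {
    "do_nonbonded": (23.94, 24700, 22533),
    "solve_pme": (10.52, 6776, 9898),
    "gather_f_bsplines": (5.64, 5248, 5302),
}
METRICS = ("% Time", "Kinstr", "Kcycles")


def relative_differences(full: Dict[str, Row],
                         segment: Dict[str, Row]) -> Dict[str, Tuple[float, ...]]:
    """|segment - full| / full, in percent, for every function and metric."""
    return {func: tuple(abs(s - r) / r * 100.0 for r, s in zip(ref, segment[func]))
            for func, ref in full.items()}


if __name__ == "__main__":
    diffs = relative_differences(TAU, MPITRACE)
    for func, values in diffs.items():
        print(func, ", ".join(f"{m}: {v:.2f}%" for m, v in zip(METRICS, values)))
    all_values = [v for values in diffs.values() for v in values]
    under = sum(v < 1.0 for v in all_values)
    print(f"{under} of {len(all_values)} differences are under 1%")  # 8 of 9 ...
```

The only value above 1% is the Kcycles count of `gather_f_bsplines`, at about 1.6%.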

[Figure: ∑ % time of the matched clusters]

[Figure: instructions imbalance; IPC imbalance]
- Study load balancing

- Initial development
  • All data centralized
  • Sampling, clustering & classification at the front-end
  • Bad scaling at large processor counts
- >10k tasks
  • Sampling at the leaves
  • Only the clustering set is put together
  • Broadcast the clustering results, classify at the leaves
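The >10k-task design moves sampling and classification out to the leaves. This simulated, single-process Python sketch shows one round of that design:

- Each leaf samples locally.
- Only the samples travel up to the front-end, where they are clustered.
- The resulting model is broadcast, and each leaf classifies its own data with a 1-nearest-neighbor rule.

`distributed_round` and the pluggable `cluster_fn` are hypothetical names. In the example, `cluster_fn` is a toy placeholder for the real clustering step.

```python
"""Hypothetical sketch of the >10k-task design: sample at the leaves, send
only the clustering set to the front-end, broadcast the result, and classify
locally. `cluster_fn` stands in for the real clustering step."""
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def distributed_round(leaves: Sequence[Sequence[Point]], fraction: float,
                      cluster_fn: Callable[[List[Point]], List[int]],
                      seed: int = 0) -> Tuple[List[List[int]], int]:
    """Return the per-leaf labels and the number of points sent upstream."""
    rng = random.Random(seed)
    # 1. Sampling at the leaves: each back-end keeps a random subset.
    samples = [[p for p in local if rng.random() < fraction] for local in leaves]
    # 2. Only the clustering set is put together at the front-end.
    clustering_set = [p for s in samples for p in s]
    if not clustering_set:
        raise ValueError("sample is empty; increase the fraction")
    model = list(zip(clustering_set, cluster_fn(clustering_set)))
    # 3. The model is broadcast; every leaf classifies all its own data (1-NN).
    labels = [[min(model, key=lambda m: math.dist(p, m[0]))[1] for p in local]
              for local in leaves]
    return labels, len(clustering_set)


if __name__ == "__main__":
    leaves = [[(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)], [(5.0, 5.1), (0.0, 0.2)]]
    by_x = lambda pts: [0 if x < 2.5 else 1 for x, _ in pts]  # toy "clustering"
    print(distributed_round(leaves, fraction=1.0, cluster_fn=by_x))
    # ([[0, 0, 1], [1, 0]], 5)
```

With a fraction below 1.0, the number of points sent upstream shrinks proportionally. A centralized design would always ship every point to the front-end.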

- On-line automatic analysis framework
- Identify the structure and see how it evolves
- Determine a representative region
- Detailed small trace + periodic reports
- Reductions in the time dimension
- Scalable infrastructure supports other analyses
- Current work
  • Spectral analysis (M. Casas): better delineate the traced region
  • Parallel clustering in the tree
  • Finer stability heuristic
