
SLIDE 1

Benchmarking the Performance of Scientific Applications with Irregular I/O at the Extreme Scale

Stephen Herbein, University of Delaware
Scott Klasky, Oak Ridge National Laboratory
Michela Taufer, University of Delaware

SLIDE 2

Motivation

Let's consider an application with irregular I/O such as QMCPack:

I/O overhead can take up to 30% regardless of I/O library

Full scale overhead is projected to be 50% or higher

[Figure: QMCPack time, execution vs. I/O, on Titan. Stacked bars of I/O time and execution time as Time (%) for ADIOS AGGR 2to1, ADIOS AGGR 4to1, ADIOS AGGR 8to1, and HDF5 at 8192, 16384, 32768, 65536, and 131072 cores (512, 1K, 2K, 4K, and 8K nodes).]

S. Herbein, et al. Performance Impact of I/O on QMCPack Simulations at the Petascale and Beyond. CSE 2013.

What are the factors impacting the I/O performance?
Could irregular I/O be causing the increasing I/O overhead?

SLIDE 3

Outline

QMCPack and ADIOS
– Overview of the codes

I/O kernels
– From real to synthetic

ADIOS phases
– In-depth overview of the I/O library and phases

Understanding I/O impact
– Does the irregular I/O indeed impact performance?

Conclusion and future work

SLIDE 5

QMCPack

QMCPack is a Quantum Monte Carlo application

Uses a loosely coupled algorithm:

§ Each process runs n independent walkers
§ Walkers are generated randomly and evolve over a series of steps towards a higher energy level
§ Walker evolution results in an unbalanced load across nodes

In theory, QMCPack is the perfect exascale application

[Figure: timeline of processes p0 and p1 running walkers w0, w1, w2 through steps grouped into blocks, with write-output events.]

Kim, et al. Hybrid algorithms in quantum Monte Carlo. J. of Physics: Conference Series, 2012.

SLIDE 6

Scientific Data: From Single Statistics To Traces

[Figure: two panels, "Old QMCPack – single statistics" and "New QMCPack – particle traces". Each panel shows Processes 1 to 3 with Walkers 1 to 3; labels include "avg" boxes and an "I/O Method" box.]

SLIDE 7

ADIOS

Includes a suite of simple, easy-to-use APIs and metadata stored in an external XML file

Allows flexible selection of the most effective I/O method
– Different I/O transport methods
– Different supercomputers
– Little code modification

Provides high performance, reliability, and usability

Lofstead et al. ADIOS: Adaptable, Metadata Rich I/O Methods for Portable High Performance I/O. PDSW 2008

[Figure: ADIOS components and supported methods, with QMCPack shown as the application.]

SLIDE 8

I/O Methods

HDF

ADIOS-POSIX

In POSIX each writer creates a file on the file system.

ADIOS-AGGR

Each process group is a collection of writers, one of which is the aggregator.
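The process-group idea can be sketched in a few lines of Python. This is a minimal illustration, not ADIOS code: it assumes writers are numbered consecutively, that groups are contiguous rank ranges, and that the lowest rank in each group acts as the aggregator. The function name `build_process_groups` and these choices are hypothetical.

```python
def build_process_groups(num_writers, ratio):
    """Split writer ranks 0..num_writers-1 into groups of `ratio` writers.

    Assumption (not taken from ADIOS): groups are contiguous rank ranges
    and the lowest rank in each group is the aggregator.
    """
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    groups = []
    for start in range(0, num_writers, ratio):
        members = list(range(start, min(start + ratio, num_writers)))
        groups.append({"aggregator": members[0], "writers": members})
    return groups


# Eight writers with a 4-to-1 aggregation ratio give two process groups.
groups = build_process_groups(num_writers=8, ratio=4)
for group in groups:
    print(group["aggregator"], group["writers"])
```

With a ratio of 1 every writer is its own aggregator, which in this toy model resembles the one-file-per-writer POSIX case.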

SLIDE 9

QMCPack I/O Performance

[Figure: stacked bars of I/O time and execution time as Time (%) for ADIOS AGGR 2to1, ADIOS AGGR 4to1, ADIOS AGGR 8to1, and HDF5 at 8192, 16384, 32768, 65536, and 131072 cores (512, 1K, 2K, 4K, and 8K nodes).]

SLIDE 10

Are we at the Peak I/O Performance?

Max I/O bandwidth achieved in QMCPack runs
– 16 GB/s = 29% I/O overhead

Theoretical peak of ORNL's filesystem at the time
– 80 GB/s = 7.5% I/O overhead
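The relation between the two overhead figures can be checked with a short calculation. The sketch below is an assumption-laden model, not the authors' method: it assumes compute time is fixed, the data volume is unchanged, and I/O time is inversely proportional to bandwidth. The function name `overhead_at_bandwidth` is hypothetical.

```python
def overhead_at_bandwidth(measured_overhead, measured_bw, target_bw):
    """Project the I/O overhead fraction at a different bandwidth.

    Assumptions: compute time is fixed and I/O time scales as 1/bandwidth.
    """
    # Ratio of I/O time to compute time at the measured bandwidth.
    io_to_compute = measured_overhead / (1.0 - measured_overhead)
    # Faster storage shrinks only the I/O part.
    scaled = io_to_compute * (measured_bw / target_bw)
    return scaled / (1.0 + scaled)


# 29% overhead observed at 16 GB/s, projected to the 80 GB/s peak.
projected = overhead_at_bandwidth(0.29, 16.0, 80.0)
print(f"{projected:.2%}")
```

Under these assumptions the projection lands close to the 7.5% quoted on the slide.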

SLIDE 11

Irregular I/O Pattern = Poor Performance?

We hypothesized that the irregular I/O pattern of QMCPack was causing I/O imbalance, which in turn was causing a slowdown

To test this hypothesis, we statistically modeled the I/O pattern of QMCPack and other scientific applications

We benchmarked and profiled I/O kernels generated from these models

What is the key factor affecting the performance of QMCPack?

SLIDE 13

I/O Kernels

I/O kernels used to mimic I/O of real applications
– Include a broader range of I/O patterns

Advantages of I/O kernels
– Faster to run and profile
– Easier to understand
– Safer to distribute

State of the practice
– Manual extraction of I/O routines from large applications
– Automatic generation from I/O patterns

SLIDE 15

Flavors of Real I/O Patterns

[Figure: QMCPack, 256 processes (2 per node). Number of walkers (5 to 45) vs. number of write steps (1 to 15).]
Profiles of the number of walkers per process for 15 steps of a QMCPack simulation studying a 4x4x2 graphite on 256 nodes on Titan

[Figure: ENZO, 256 processes (2 per node). Data size in KBytes (10000 to 70000) vs. number of write steps (1 to 15).]
Profiles of the data written per process for 15 steps of an ENZO simulation studying the NFW Cool-Core Cluster test on 256 nodes of Titan

SLIDE 16

QMCPack-like Applications: Normal Distribution

Mimic the I/O of applications with QMCPack-like behavior

[Figure: histogram of number of processes (0 to 300) vs. number of records, with bins <0, [0,10), [10,20), [20,30), [30,40), [40,50), [50,60), [60,70), ≥70.]
Statistical model of the I/O pattern mimicking a normal distribution

[Figure: example of I/O pattern generated with the statistical model.]

SLIDE 17

ENZO-like Applications: Exponential Distribution

Mimic the I/O of applications with ENZO-like behavior

[Figure: histogram of number of processes (0 to 300) vs. number of records, with bins <0, [0,20), [20,40), [40,60), [60,80), [80,100), [100,120), [120,140), [140,160), [160,180), [180,200), ≥200.]
Statistical model of the I/O pattern mimicking an exponential distribution

[Figure: example of I/O pattern generated with the statistical model.]

SLIDE 18

Regular Applications: Uniform Distribution

Mimic the I/O of applications with S3D-like behavior

[Figure: histogram of number of processes (0 to 300) vs. number of records, with bins <0, [0,5), [5,10), [10,15), [15,20), [20,25), [25,30), [30,35), ≥35.]
Statistical model of the I/O pattern mimicking a uniform distribution

[Figure: example of I/O pattern generated with the statistical model.]
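The three statistical models can be illustrated by drawing a per-process record count from each distribution. This is a hedged sketch using only Python's standard `random` module. The function name `sample_records`, the mean, the spread of the normal case, and the range of the uniform case are all hypothetical choices rather than the parameters fitted in this work. In particular, "uniform" is treated here as a uniform random draw over a range; the slides may instead mean an identical count on every process.

```python
import random


def sample_records(distribution, num_processes, mean, seed=0):
    """Draw a non-negative record count for each process.

    All parameters are illustrative assumptions, not fitted values.
    """
    rng = random.Random(seed)  # fixed seed makes the pattern reproducible
    counts = []
    for _ in range(num_processes):
        if distribution == "normal":
            value = rng.gauss(mean, mean / 4.0)  # arbitrary spread
        elif distribution == "exponential":
            value = rng.expovariate(1.0 / mean)  # rate = 1 / mean
        elif distribution == "uniform":
            value = rng.uniform(0.5 * mean, 1.5 * mean)  # arbitrary range
        else:
            raise ValueError(f"unknown distribution: {distribution}")
        # Record counts cannot be negative, so clamp the left tail at zero.
        counts.append(max(0, round(value)))
    return counts


patterns = {
    name: sample_records(name, num_processes=256, mean=30)
    for name in ("normal", "exponential", "uniform")
}
for name, counts in patterns.items():
    print(name, min(counts), max(counts))
```

Reusing the same seed reproduces the same pattern, which mirrors the idea of replaying identical traces across samples.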

SLIDE 19

Generating the I/O Kernels

[Figure: workflow diagram with boxes Statistical I/O Model, ADIOS XML, Skel, and Executable I/O Kernel.]

J. Logan, et al. Skel: Generative Software for Producing Skeletal I/O Applications. e-Science Workshops 2011.

SLIDE 21

ADIOS Times

Open, Write, Close

SLIDE 22

ADIOS Times: Open

[Figure: Nodes 1 to 4 with two writers each (Writers 1 to 8), organized in two process groups; object storage targets OST 1 to 6 under object storage servers OSS 1 and OSS 2; metadata server (MDS).]

SLIDE 23

ADIOS Times: Write

[Figure: Nodes 1 to 4 with two writers each (Writers 1 to 8), organized in two process groups; OST 1 to 6 under OSS 1 and OSS 2.]

SLIDE 24

ADIOS Times: Simple Close

[Figure: Nodes 1 to 4 with two writers each (Writers 1 to 8), organized in two process groups; OST 1 to 6 under OSS 1 and OSS 2.]

SLIDE 25

ADIOS Times: Brigade Close

[Figure: Nodes 1 to 4 with two writers each (Writers 1 to 8), organized in two process groups; OST 1 to 6 under OSS 1 and OSS 2.]

SLIDE 27

Test Setup

Oak Ridge National Lab's Titan
– 80 GB/s Lustre filesystem
– 2048 & 4096 writers
– 2 writers per node

Medium and large data sizes
– 2.4 MB and 24 MB per writer

3 distributions: uniform, normal, and exponential

15 iterations of I/O with unique values for each iteration

3 samples for each run (1 per filesystem)
– Same traces used for each sample
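The setup numbers imply a rough scale for each configuration. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes the per-writer size is the amount written in one I/O iteration (for the irregular patterns, an average), that 1 GB = 1000 MB, and that the 80 GB/s peak is fully achievable. The helper names are hypothetical.

```python
PEAK_GB_PER_S = 80.0   # filesystem peak quoted in the setup
WRITERS_PER_NODE = 2   # from the setup


def total_volume_gb(writers, mb_per_writer):
    """Aggregate data per I/O iteration, assuming decimal units."""
    return writers * mb_per_writer / 1000.0


def min_write_seconds(volume_gb, peak_gb_per_s=PEAK_GB_PER_S):
    """Lower bound on write time if the peak bandwidth were sustained."""
    return volume_gb / peak_gb_per_s


for writers in (2048, 4096):
    for mb_per_writer in (2.4, 24.0):
        volume = total_volume_gb(writers, mb_per_writer)
        print(
            f"{writers} writers on {writers // WRITERS_PER_NODE} nodes, "
            f"{mb_per_writer} MB each: {volume:.1f} GB per iteration, "
            f"at least {min_write_seconds(volume):.3f} s at peak"
        )
```

Any measured time above this bound reflects open, close, aggregation, or contention costs rather than raw transfer.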

SLIDE 28

List of Results

No threading vs. threading
– Threading is better because it eliminates the cost of the open phase

Type of aggregation: All-gather vs. brigade
– Similar performance but brigade has a smaller memory footprint

Data size and I/O pattern
– Dealing with irregular I/O at large sizes is a whole different ballgame

Variable number of aggregators and OSTs
– Varying the number of aggregators has a greater impact than varying the number of OSTs
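One way to reason about why the aggregation ratio interacts with the I/O pattern is to measure how unevenly data lands on the aggregators. The metric below (maximum aggregator load divided by mean aggregator load) is an illustrative choice, not one reported in these slides, and it assumes contiguous groups of writers feed each aggregator. The function name `aggregator_imbalance` and the sample sizes are hypothetical.

```python
def aggregator_imbalance(bytes_per_writer, ratio):
    """Return max/mean of the per-aggregator totals (1.0 means balanced).

    Assumption: each aggregator collects a contiguous block of `ratio`
    writers; this is an illustrative metric, not one from the slides.
    """
    totals = [
        sum(bytes_per_writer[i:i + ratio])
        for i in range(0, len(bytes_per_writer), ratio)
    ]
    mean_total = sum(totals) / len(totals)
    if mean_total == 0:
        return 1.0  # nothing written anywhere, so trivially balanced
    return max(totals) / mean_total


regular = [10] * 8                    # every writer holds the same amount
skewed = [1, 1, 1, 1, 1, 1, 1, 33]    # one writer holds most of the data

for ratio in (2, 4, 8):
    print(ratio, aggregator_imbalance(regular, ratio),
          aggregator_imbalance(skewed, ratio))
```

In this toy model the regular pattern is balanced at every ratio, while the skewed pattern's imbalance changes with the ratio, which is one reason parameters tuned for one pattern may not carry over to another.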

SLIDE 30

From Small to Large Data

[Figure: panels for Small (2.4 MB) and Large (24 MB) data, each for the Uniform and Exponential distributions.]

SLIDE 31

Variable Aggregation Ratio for Large Data

[Figure: panels for 4 to 1 and 16 to 1 aggregation ratios, each for the Uniform and Exponential distributions.]

SLIDE 32

Variable Aggregation Ratio for Large Data

[Figure: panels for 4 to 1 and 16 to 1 aggregation ratios, each for the Uniform and Exponential distributions.]

SLIDE 33

Lessons Learned

I/O kernels can be an effective tool in profiling I/O performance

Dealing with irregular I/O is a different ballgame
– I/O pattern must be considered when running simulations at large scale

Optimal I/O parameters for one I/O pattern do not perform similarly for other I/O patterns

Searching for optimal parameters is time-consuming and imperfect if done manually

SLIDE 34

Work in Progress

How do we select the optimal parameter values?

M. Matheny, S. Herbein et al. Using Surrogate-based Modeling to Predict Optimal I/O Parameters of Applications at the Extreme Scale. In Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2014). Hsinchu, Taiwan, December 16-19, 2014

SLIDE 35

Acknowledgments

GCLab members:
– Michael Matheny
– Matthew Wezowicz

Collaborators:
– Jeremy Logan
– Norbert Podhorszki
– Jaron Krogel
– Jeongnim Kim

Advisors:
– Michela Taufer
– Scott Klasky

Sponsors:
