A Foundation for Automated Placement of Data
Douglass Otstott, Sean Williams, Latchesar Ionkov, Michael Lang, Ming Zhao
LA-UR-17-22686
Managed by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Memory and Storage are Converging
• Persistent storage on the memory bus (NVDIMMs)
• Remote memory (Gen-Z)
• Which memory bus? (DRAM, HBM, GPU memory, …)
Los Alamos National Laboratory 10/22/2019
Data Layouts are Different
[Figure: the same dataset shown in a memory layout (N × M arrays labeled temp, pressure, and density) and in a storage layout (rows 1 … M holding values such as pressure=5.1 and density=0.4).]
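To make the difference concrete, here is a minimal sketch that reads the figure as "one array per variable in memory, one record per grid point in storage" (an interpretation, not the authors' code). The functions `to_rows` and `to_columns` are hypothetical names, and only the values 33.1, 5.1, and 0.4 come from the slide; the second grid point is made up.

```python
def to_rows(columns):
    """Memory layout -> storage layout: one record per grid point
    holding every variable."""
    names = list(columns)
    length = len(columns[names[0]])
    return [{name: columns[name][i] for name in names} for i in range(length)]


def to_columns(rows):
    """Storage layout -> memory layout: one flat array per variable."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


# Memory layout: a separate flattened array per variable (illustrative values).
memory = {
    "temp": [33.1, 30.2],
    "pressure": [5.1, 4.8],
    "density": [0.4, 0.5],
}

storage = to_rows(memory)
print(storage[0])  # {'temp': 33.1, 'pressure': 5.1, 'density': 0.4}
```

Converting back with `to_columns(storage)` recovers the original arrays, which is the kind of round trip a data service has to perform whenever producer and consumer disagree on layout.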
Data Sharing
• With less distinction between memory and storage, there is more confusion
• With more complicated workloads, there are many options
  • In situ, in transit, …
• No generic way to share data in memory between applications
  • ad hoc solutions
  • in-memory file systems
• What data format?
  • the data producer's
  • the data consumer's
Need for a Data Management Service
• Handles all data that applications share
• Moves data between the many memory and storage layers
• Allows data layout transformations
• This work
  • describes the foundations for building such a service
  • allows data movement and transformation
  • does not include support for global data optimizations
Components
• Name server
  • handles metadata
  • global
• Runtime
  • runs on every node
  • handles local data
  • talks to runtimes on other nodes
• Global/local placement services (not included)
  • optimize data locality and format
• Application (not included)
Data Model
• Dataset
  • types
    • primitive types (integer, floating point, string)
    • structs
    • (multidimensional) arrays
  • variables
• Fragments
  • subsets of a dataset
  • types: based on dataset types
  • variables: based on dataset variables
• Versions
  • provide a consistent view of a distributed dataset
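The data model can be pictured as a few plain records. The sketch below is an assumption about how the pieces relate, not the service's actual types: `Variable`, `Dataset`, `Fragment`, the `region` encoding, and `consistent_view` are all hypothetical. A version is modeled simply as an integer tag, so a consistent view of the distributed dataset is the set of fragments that share one tag.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    dtype: str           # primitive ("float64", "int32", "string") or a struct name
    dims: tuple = ()     # () for a scalar, e.g. (1024, 512) for a 2-D array


@dataclass
class Dataset:
    name: str
    variables: dict = field(default_factory=dict)  # variable name -> Variable

    def add(self, var: Variable) -> None:
        self.variables[var.name] = var


@dataclass
class Fragment:
    dataset: str
    variables: list      # subset of the dataset's variable names
    region: tuple        # (start, stop) per dimension covered by this fragment
    version: int         # fragments sharing a version form one consistent view


def consistent_view(fragments, version):
    """Return only the fragments that belong to the requested version."""
    return [f for f in fragments if f.version == version]


ds = Dataset("sim")
ds.add(Variable("temp", "float64", (1024, 512)))
frags = [
    Fragment("sim", ["temp"], ((0, 512), (0, 512)), version=1),
    Fragment("sim", ["temp"], ((512, 1024), (0, 512)), version=1),
    Fragment("sim", ["temp"], ((0, 512), (0, 512)), version=2),
]
print(len(consistent_view(frags, 1)))  # 2
```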
Declarative Data Language & Transformations
• For the computers: transformation rules that convert data between the dataset and its subsets
• For the user: define the abstract dataset and its subsets

    dataset {
        var p struct {
            a, b, c float64
        }
    }

    fragment default {
        var p = p
    }

    fragment viz {
        var pa { a } = p
        var pba { b, a } = p
    }

[Figure: transformation rules that map the fields of p (a, b, c) in the default fragment to their destination positions in pa and pba in the viz fragment.]
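One way to read "transformation rules" is as a list of byte copies derived from the two layouts. The sketch below assumes packed little-endian structs of float64 fields and derives one `(src_offset, dst_offset, size)` rule per destination field. `make_rules` and `transform` are hypothetical helpers, not the runtime's real rule format. It converts an array of `p` records from the default fragment into `pba` records for the viz fragment.

```python
import struct

FIELD_SIZE = 8  # every field of p is a float64 (8 bytes)


def field_offsets(fields):
    """Byte offset of each field in a packed struct of float64 fields."""
    return {name: i * FIELD_SIZE for i, name in enumerate(fields)}


def make_rules(src_fields, dst_fields):
    """Derive one copy rule (src_offset, dst_offset, size) per destination field."""
    src = field_offsets(src_fields)
    dst = field_offsets(dst_fields)
    return [(src[name], dst[name], FIELD_SIZE) for name in dst_fields]


def transform(rules, src_buf, src_fields, dst_fields):
    """Apply the rules to every struct in a packed array of records."""
    src_stride = len(src_fields) * FIELD_SIZE
    dst_stride = len(dst_fields) * FIELD_SIZE
    count = len(src_buf) // src_stride
    dst = bytearray(count * dst_stride)
    for i in range(count):
        for s, d, n in rules:
            so = i * src_stride + s   # source byte position of this field
            do = i * dst_stride + d   # destination byte position
            dst[do:do + n] = src_buf[so:so + n]
    return bytes(dst)


# Two records of p in the "default" layout: struct { a, b, c float64 }
src = struct.pack("<6d", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

# pba in the "viz" fragment keeps only fields b and a, in that order
rules = make_rules(["a", "b", "c"], ["b", "a"])
print(rules)  # [(8, 0, 8), (0, 8, 8)]
pba = transform(rules, src, ["a", "b", "c"], ["b", "a"])
print(struct.unpack("<4d", pba))  # (2.0, 1.0, 5.0, 4.0)
```

Because the rules depend only on the two type descriptions, they can be computed once and reused for every version of the fragment. The user writes declarations and never writes copy code.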
API
• create object
  • parameters: name, dataset description
  • the object is registered in the name server
• attach fragment
  • parameters: dataset name, fragment description, version
  • the runtime finds the locations of the fragments that contain the relevant data and version
  • brings the data and transforms it to the required format
• publish fragment
  • parameters: data pointer, version
  • the runtime registers the fragment version in the name server
  • keeps a copy of the data in memory or local storage
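The three operations can be sketched with an in-process stand-in for the name server and per-node runtimes. Everything here is hypothetical: the class and method names, the use of a dict as "memory or local storage", and the simplification that attach only selects the requested variables instead of performing a real layout transformation. The sketch is meant to show the division of labor (metadata in the name server, data in the runtimes), not the actual interface.

```python
class NameServer:
    """Global metadata: dataset descriptions and which runtime holds each fragment version."""

    def __init__(self):
        self.objects = {}    # dataset name -> dataset description
        self.fragments = {}  # (dataset name, version) -> [(runtime, fragment id), ...]

    def register_object(self, name, description):
        self.objects[name] = description

    def register_fragment(self, name, version, runtime, frag_id):
        self.fragments.setdefault((name, version), []).append((runtime, frag_id))

    def lookup(self, name, version):
        return self.fragments.get((name, version), [])


class Runtime:
    """Per-node service; a dict stands in for node memory or local storage."""

    def __init__(self, name_server):
        self.ns = name_server
        self.store = {}      # fragment id -> {variable name: value}
        self.next_id = 0

    def create_object(self, name, description):
        # The object is registered in the (global) name server.
        self.ns.register_object(name, description)

    def publish(self, name, data, version):
        # Keep a private copy of the data, then register this fragment version.
        frag_id = self.next_id
        self.next_id += 1
        self.store[frag_id] = dict(data)
        self.ns.register_fragment(name, version, self, frag_id)
        return frag_id

    def attach(self, name, variables, version):
        # Locate fragments of this version and copy out the requested variables.
        # A real runtime would also transform the layout; here it only selects.
        if name not in self.ns.objects:
            raise KeyError(f"unknown object {name!r}")
        result = {}
        for runtime, frag_id in self.ns.lookup(name, version):
            fragment = runtime.store[frag_id]
            for var in variables:
                if var in fragment:
                    result[var] = fragment[var]
        missing = set(variables) - set(result)
        if missing:
            raise KeyError(f"version {version} lacks {sorted(missing)}")
        return result


ns = NameServer()
producer, consumer = Runtime(ns), Runtime(ns)
producer.create_object("sim", {"temp": "float64", "pressure": "float64"})
producer.publish("sim", {"temp": 33.1, "pressure": 5.1}, version=1)
print(consumer.attach("sim", ["pressure"], version=1))  # {'pressure': 5.1}
```

Note that the consumer never talks to the producer directly. It learns from the name server which runtime holds version 1 and pulls the data from there. This is what lets producer and consumer run at different times or on different nodes.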
• Can be used for communication between ranks
• A fragment can have read-only and read-write parts of a complex geometry
[Figure: a domain decomposed into fragments F11, F12, F13, F21, F22, and F31, with a fragment A.]
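A simple way to picture a fragment with read-only and read-write parts is a 1-D block decomposition. Each rank owns its interior cells (read-write) and sees a halo of cells owned by its neighbors (read-only). This is a hypothetical simplification of the complex geometries on the slide. `rank_fragment` and its `halo` parameter are illustrative names, and tuples are used only to signal that the halo parts must not be modified.

```python
def rank_fragment(global_data, nranks, rank, halo=1):
    """Split a 1-D domain into contiguous blocks and return the fragment
    seen by `rank`: its own cells plus read-only neighbor cells."""
    n = len(global_data)
    start = rank * n // nranks
    stop = (rank + 1) * n // nranks
    lo = max(start - halo, 0)   # clamp the halo at the domain edges
    hi = min(stop + halo, n)
    return {
        "read_only_left": tuple(global_data[lo:start]),   # owned by rank - 1
        "read_write": list(global_data[start:stop]),      # owned by this rank
        "read_only_right": tuple(global_data[stop:hi]),   # owned by rank + 1
    }


data = list(range(12))
print(rank_fragment(data, nranks=3, rank=1))
# {'read_only_left': (3,), 'read_write': [4, 5, 6, 7], 'read_only_right': (8,)}
```

In terms of the API above, each rank would publish its read-write part for a version. Its neighbors would then attach only the cells they need as the read-only part, which turns a halo exchange into publish/attach traffic.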
Results
• Synthetic benchmark
• Evaluates the overhead of the operations
• Single name server
• 16 ranks per node
[Figure: operations/sec for create_object, attach, and publish vs. number of ranks (16 to 2048).]
Results: SNAP checkpoint
• Original SNAP (no checkpoints) vs. SNAP with the checkpoint code added
• Evaluates the overhead
[Figure: time (s) vs. number of ranks (16 to 2048) for SNAP and RT/NS SNAP.]
[Figure: time (s) vs. number of ranks (16 to 2048) for RT/NS SNAP and MPI-IO SNAP.]
[Figure: time (s) vs. number of ranks (4 to 2048) for N to N restart, N to N over 2 restart, and N to N over 4 restart.]
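The restart plot compares several restart configurations. The sketch below works under a loud assumption: that "N to N over 2" means a checkpoint written by N ranks is restarted on N/2 ranks, so each restarting rank must attach pieces of several checkpoint fragments. `restart_plan` is a hypothetical helper that computes those overlaps for a 1-D block decomposition. It illustrates the kind of lookup attach has to perform, not the measured code.

```python
def block(rank, nranks, n):
    """Half-open [start, stop) range owned by `rank` in a block decomposition."""
    return rank * n // nranks, (rank + 1) * n // nranks


def restart_plan(n, writers, readers):
    """For each reader rank, list the (writer rank, start, stop) pieces it must attach."""
    plan = {}
    for r in range(readers):
        r_start, r_stop = block(r, readers, n)
        pieces = []
        for w in range(writers):
            w_start, w_stop = block(w, writers, n)
            s, e = max(r_start, w_start), min(r_stop, w_stop)
            if s < e:  # the two blocks overlap
                pieces.append((w, s, e))
        plan[r] = pieces
    return plan


print(restart_plan(n=16, writers=4, readers=2))
# {0: [(0, 0, 4), (1, 4, 8)], 1: [(2, 8, 12), (3, 12, 16)]}
```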
Results: VPIC
[Figure: percent overhead vs. number of ranks (16 to 1024) for VPIC I/O, RT/NS I/O, and RT/NS No I/O.]
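The VPIC slide reports "percent overhead" without stating the baseline. Assuming it is the usual relative increase in run time over a baseline run (for example, VPIC without I/O), it can be computed as below. `percent_overhead` is a hypothetical helper, and the timings are made-up numbers, not values read from the plot.

```python
def percent_overhead(t_measured, t_baseline):
    """Extra run time of t_measured relative to t_baseline, in percent."""
    if t_baseline <= 0:
        raise ValueError("baseline time must be positive")
    return (t_measured - t_baseline) * 100.0 / t_baseline


# Hypothetical timings in seconds.
print(percent_overhead(120.0, 100.0))  # 20.0
```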
Conclusions
• Scalable data service
• Easy-to-use API
• Future
  • Integration with data placement services
  • Additional applications (E3SM)
  • Scalable name server