Understanding Data Motion in the Modern HPC Data Center


  1. Understanding Data Motion in the Modern HPC Data Center
     Glenn K. Lockwood, Shane Snyder, Suren Byna, Philip Carns, Nicholas J. Wright

  2. Scientific computing is more than compute!
     [Architecture diagram showing compute nodes (CN), I/O nodes (ION), storage nodes (SN), gateways (GW), a tape archive, science gateways, center-wide storage, data transfer nodes, a storage fabric, a data transfer fabric, a router, and the WAN (ESnet)]

  3. Goal: Understand data motion everywhere
     [Same center-wide architecture diagram as slide 2]

  4. Our simplified model for data motion
     [Model diagram: External Facilities, Compute Systems, and Storage Systems connected by data-motion vectors, including Compute-Storage, Storage-Compute, Storage-Storage, and transfers to and from external facilities]

  5. Mapping this model to NERSC
     [Diagram: the simplified model of slide 4 instantiated with NERSC's external facilities, compute systems, and storage systems]

  6. Relevant logs kicking around at NERSC
     • Globus logs – no remote storage system info
     • HPSS logs – some remote storage system info missing
     • Darshan – data volumes come with caveats
     [Overlaid on the model diagram of external facilities, compute systems, and storage systems]

  7. Normalizing data transfer records
     Parameter                                   Example
     Source site, host, storage system           NERSC, Cori, System Memory
     Destination site, host, storage system      NERSC, Cori, cscratch1 (Lustre)
     Time of transfer start and finish           June 4 @ 12:28 – June 4 @ 12:32
     Volume of data transferred                  34,359,738,368 bytes
     Tool that logged transfer                   Darshan, POSIX I/O module
     Owner of data transferred                   uname=glock, uid=69615
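A minimal sketch of what one of these normalized records could look like in code; the TransferRecord class and its field names are hypothetical and simply mirror the parameters in the table above (the year in the example timestamps is assumed from the study window).

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TransferRecord:
    """Hypothetical normalized transfer record; fields mirror the slide's table."""
    src_site: str        # e.g. "NERSC"
    src_host: str        # e.g. "Cori"
    src_storage: str     # e.g. "System Memory"
    dst_site: str
    dst_host: str
    dst_storage: str     # e.g. "cscratch1 (Lustre)"
    start: datetime      # time of transfer start
    end: datetime        # time of transfer finish
    bytes_moved: int     # volume of data transferred
    tool: str            # tool that logged the transfer
    owner: str           # e.g. "uname=glock, uid=69615"

# The example row from the slide expressed as one record (year assumed):
example = TransferRecord(
    "NERSC", "Cori", "System Memory",
    "NERSC", "Cori", "cscratch1 (Lustre)",
    datetime(2019, 6, 4, 12, 28), datetime(2019, 6, 4, 12, 32),
    34_359_738_368, "Darshan, POSIX I/O module", "uname=glock, uid=69615",
)
```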

  8. What is possible with this approach?
     May 1 – August 1, 2019:
     • 194 million transfers
     • 78.6 PiB data moved
     [Flow graph: Outside World, HPSS, science gateways, DTNs, and Cori with its burst buffer, cscratch, project, and home file systems and the archive; edge widths from ≤ 8 TiB/day to > 128 TiB/day]

  9. Visualizing data motion as a graph
     • Job I/O is most voluminous
     • Home file system usage is least voluminous
     • Burst buffer is read-heavy
     • Users prefer to access the archive directly from Cori rather than use DTNs
     [Same flow graph as slide 8, with edge widths from ≤ 8 TiB/day to > 128 TiB/day]
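A flow graph like this can be assembled from the normalized records with networkx; the sketch below uses placeholder endpoint names and made-up daily volumes, not the measured NERSC numbers.

```python
import networkx as nx

# Hypothetical daily volumes (TiB/day) between endpoints, in the spirit of the
# slide's graph; the numbers are placeholders.
edges = [
    ("Cori", "cscratch", 130.0),       # job I/O: most voluminous
    ("cscratch", "Cori", 70.0),
    ("burst buffer", "Cori", 40.0),    # burst buffer is read-heavy
    ("Cori", "burst buffer", 12.0),
    ("Cori", "archive", 9.0),          # archive accessed directly from Cori
    ("DTNs", "archive", 3.0),
    ("home", "Cori", 0.5),             # home file system: least voluminous
]

G = nx.DiGraph()
for src, dst, tib_per_day in edges:
    G.add_edge(src, dst, volume=tib_per_day)

# Rank edges by volume to reproduce the kind of observations on this slide.
for src, dst, data in sorted(G.edges(data=True), key=lambda e: -e[2]["volume"]):
    print(f"{src:>12} -> {dst:<12} {data['volume']:7.1f} TiB/day")
```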

  10. Mapping this data to our model
      [The flow graph of slide 8 annotated with the model's components (External Facilities, Compute Systems, Storage Systems) and vectors such as Storage-External, Storage-Storage, and Compute-Storage]

  11. Adding up data moved along each vector
      • Job I/O is significant
      • Inter-tier is significant
        – I/O outside of jobs ~ job write traffic
        – Fewer tiers, fewer tears
      • HPC I/O is not just checkpoint-restart!
      [Bar chart: data transferred (TiB) along each data-motion vector, log scale from 512 GiB to 512 TiB]
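Once each record is tagged with its vector in the model, totaling the volume moved along each vector reduces to a groupby; a sketch assuming a pandas DataFrame with hypothetical column names and placeholder byte counts.

```python
import pandas as pd

# Hypothetical normalized transfers; 'vector' classifies each record into the
# model of slide 4 and the byte counts are placeholders.
transfers = pd.DataFrame({
    "vector": ["Compute-Storage", "Storage-Compute", "Storage-Storage",
               "External-Storage", "Compute-Storage"],
    "bytes_moved": [2**45, 2**44, 2**43, 2**41, 2**44],
})

TIB = 2**40
per_vector = (transfers.groupby("vector")["bytes_moved"].sum() / TIB
              ).sort_values(ascending=False)
print(per_vector.round(1))  # TiB moved along each vector
```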

  12. Examining non-job I/O patterns
      • Hypothesis: non-job I/O is poorly formed
        – Job I/O: optimized
        – Others: fire-and-forget
      • Users transfer larger files than they store (good)
      • Archive transfers are largest (good)
      • WAN transfers are smaller than job I/O files (less good)
      [CDF plot: cumulative fraction of total transfers vs. size of transfer (1 byte to 32 PiB) for Globus transfers, Darshan transfers, HPSS transfers, and files at rest; annotated sizes of 31 KiB, 720 KiB, 1,800 KiB, and 47,000 KiB]
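The per-source size distributions compared on this slide are empirical CDFs of transfer sizes; a small numpy sketch of that computation, using synthetic sizes in place of the real Globus/Darshan/HPSS records.

```python
import numpy as np

def empirical_cdf(sizes_bytes):
    """Sorted sizes and the cumulative fraction of transfers at or below each size."""
    s = np.sort(np.asarray(sizes_bytes, dtype=float))
    frac = np.arange(1, len(s) + 1) / len(s)
    return s, frac

# Synthetic stand-ins for the real per-tool transfer sizes.
rng = np.random.default_rng(0)
samples = {
    "Globus transfers": rng.lognormal(mean=18, sigma=3, size=10_000),
    "Darshan transfers": rng.lognormal(mean=14, sigma=3, size=10_000),
}

for name, sizes in samples.items():
    s, frac = empirical_cdf(sizes)
    median = s[np.searchsorted(frac, 0.5)]
    print(f"{name}: median transfer size ~ {median / 2**10:,.0f} KiB")
```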

  13. Few users resulted in the most transfers
      • 1,562 unique users
      • Top 4 users = 66% of total volume transferred
      • Users 5–8 = 5.8%
        – All used multiple transfer vectors
        – Henry is a storage-only user
      [CDF plot: cumulative fraction of total volume transferred vs. size of transfer (1 MiB to 32 PiB) per anonymized user: Amy/Darshan, Bob/Darshan, Carol/Darshan, Dan/Darshan, Eve/Darshan,Globus,HPSS, Frank/Darshan,Globus, Gail/Darshan,Globus, Henry/HPSS]
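Statements like "top 4 users = 66% of the volume" come from ranking users by total bytes moved and accumulating the fractions; a sketch with invented per-user totals for the anonymized users named on the slide (the real analysis covers all 1,562 users).

```python
import numpy as np

# Invented bytes moved per anonymized user; placeholders only.
user_volumes = {"Amy": 3.0e16, "Bob": 1.4e16, "Carol": 0.5e16, "Dan": 0.3e16,
                "Eve": 1.5e15, "Frank": 1.2e15, "Gail": 1.0e15, "Henry": 0.9e15}

total = sum(user_volumes.values())
ranked = sorted(user_volumes.items(), key=lambda kv: -kv[1])
cumulative = np.cumsum([vol for _, vol in ranked]) / total

for rank, ((user, _), frac) in enumerate(zip(ranked, cumulative), start=1):
    print(f"top {rank} users account for {frac:6.1%} of the total volume")
```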

  14. Examining transfers along many dimensions
      • Break down transfers by read/write and by file system
      • Top users are read-heavy
        – Rereading the same files
        – Targeting cscratch (Lustre)
      [Stacked bar chart, 0 bytes to 768 TiB: reads by file system, reads by user, writes by file system, writes by user; file systems include tmpfs, burst buffer, cscratch, homes, and project; users Amy, Bob, Carol, Dan, and other users]

  15. Tracing using users, volumes, and directions
      • Correlating reveals workflow coupling
        – S-S precedes C-S/S-C
        – 2:1 R/W ratio during job
        – Data reduction of archived data
      • This was admittedly an exceptional case
      [Time series, Apr 29 – Jul 29, 2019: daily Compute-Storage, Storage-Compute, and Storage-Storage volumes, each 0–2 TiB]
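One way to surface the "storage-storage precedes compute-storage" coupling is to bin each vector's volume by day and check correlations at different lags; a rough sketch under that assumption, using synthetic daily series rather than the measured ones.

```python
import numpy as np

def lagged_correlation(a, b, max_lag=7):
    """Pearson correlation of series a against b shifted forward by each lag (days)."""
    results = {}
    for lag in range(1, max_lag + 1):
        results[lag] = np.corrcoef(a[:-lag], b[lag:])[0, 1]
    return results

# Placeholder daily volumes (TiB) for two vectors over roughly 13 weeks; the
# compute-storage series is constructed with a toy 2-day lag for illustration.
rng = np.random.default_rng(1)
storage_storage = rng.gamma(shape=2.0, scale=0.5, size=92)
compute_storage = np.roll(storage_storage, 2) + rng.normal(0, 0.1, 92)

corr = lagged_correlation(storage_storage, compute_storage)
best = max(corr, key=corr.get)
print(f"compute-storage volume best tracks storage-storage volume {best} days later")
```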

  16. Is this the full story? Quantify the amount of transfers not captured
      • Compare volume transferred to system monitoring (storage systems)
      • Compare bytes in to bytes out (transfer nodes)
      [Overlaid on the flow graph of slide 8]

  17. Not every data transfer was captured
      • 100% of true data volume should be captured by transfers
      • Missing lots of data – why?
        – Darshan logs not generated; cp missing
        – Globus-HPSS adapter logs absent
        – Only Globus logged; rsync/bbcp absent
      [Bar chart: % of true data volume captured by transfers (In/Write vs. Out/Read) for cscratch, project, burst buffer, archive, and the outside world]
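The comparison behind this chart amounts to dividing the volume seen in transfer logs by the volume each storage system's own monitoring reports, per system and per direction; a sketch with placeholder counters.

```python
# Hypothetical byte totals over the study window: 'monitored' comes from the
# storage system's own counters, 'captured' from the collected transfer logs.
write_volumes = {
    "cscratch":     {"monitored": 40e15, "captured": 30e15},
    "burst buffer": {"monitored": 12e15, "captured": 11e15},
    "archive":      {"monitored":  6e15, "captured":  2e15},
}

for system, v in write_volumes.items():
    pct = 100.0 * v["captured"] / v["monitored"]
    print(f"{system:>12}: {pct:5.1f}% of true write volume captured by transfer logs")
```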

  18. Identifying leaky transfer nodes
      • Incongruency (Δ) – data in vs. data out
        – A figure of merit (FOM) for how "leaky" a node is
        – Δ = 0 means all bytes in = all bytes out
      • Cori: expect Δ >> 0 because jobs generate data
      • Science gateways: Δ > 0 because ???
      [Bar chart of incongruency: 1.27, 0.613, 0.137, and 0.018 for Cori, HPSS, DTNs, and science gateways]
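The slide does not spell out how Δ is computed, so the definition below (the gap between bytes in and bytes out, normalized by the smaller of the two) is only an assumed stand-in that satisfies the stated property that Δ = 0 when bytes in equal bytes out; the per-node byte totals are placeholders chosen so the printout echoes the bar heights above.

```python
def incongruency(bytes_in: float, bytes_out: float) -> float:
    """Assumed figure of merit: 0 when bytes in == bytes out, growing as they diverge."""
    smaller = min(bytes_in, bytes_out)
    if smaller == 0:
        return float("inf")  # all observed flow was in one direction
    return abs(bytes_in - bytes_out) / smaller

# Placeholder in/out totals (bytes) per node class; chosen only for illustration.
nodes = {
    "Cori":             (10.0e15, 4.405e15),
    "HPSS":             (5.0e15, 3.100e15),
    "DTNs":             (2.0e15, 1.759e15),
    "science gateways": (1.0e15, 0.982e15),
}

for name, (b_in, b_out) in nodes.items():
    print(f"{name:>16}: incongruency = {incongruency(b_in, b_out):.3f}")
```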

  19. Towards Total Knowledge of I/O
      • New profiling tools to capture I/O from other transfer tools (bbcp, scp, etc.)
      • Better insight into what is happening inside Docker containers
      • Improve analysis process to handle complex transfers
      • More robust collection of job I/O data; cache-aware I/O data (LDMS)
      [Annotations placed around the flow graph of slide 8]

  20. There’s more to HPC I/O than job I/O
      • Inter-tier I/O is too significant to ignore
        – Need better monitoring of data transfer tools
        – Users benefit from fewer tiers and strong connectivity between tiers
        – Need to optimize non-job I/O patterns
      • Transfer-centric approaches yield new holistic insight into workflow I/O behavior
        – Possible to trace user workflows across a center
        – Humans in the loop motivate more sophisticated methods

  21. We gratefully acknowledge the support of:
      • Damian Hazen (NERSC)
      • Ravi Cheema (NERSC)
      • Kristy Kallback-Rose (NERSC)
      • Jon Dugan (ESnet)
      • Nick Balthaser (NERSC)
      • Eli Dart (ESnet)
      • Lisa Gerhardt (NERSC)
      We’re hiring!
      This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contracts DE-AC02-05CH11231 and DE-AC02-06CH11357. This research used resources and data generated from resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

  22. Few users result in the most transfers
      [Two panels: (left) cumulative fraction of total volume transferred vs. size of transfer (1 MiB to 32 PiB) per anonymized user (Amy/Darshan, Bob/Darshan, Carol/Darshan, Dan/Darshan, Eve/Darshan,Globus,HPSS, Frank/Darshan,Globus, Gail/Darshan,Globus, Henry/HPSS); (right) number of users (10 to 350) vs. percent of total volume transferred (10% down to 0.0001%)]
