
Initial Characterization of I/O in Large-Scale Deep Learning Applications



  1. Initial Characterization of I/O in Large-Scale Deep Learning Applications
     Fahim Chowdhury, Jialin Liu, Quincey Koziol, Thorsten Kurth, Steven Farrell, Suren Byna, Prabhat, and Weikuan Yu
     November 12, 2018

  2. Outline: Objectives, DL Benchmarks at NERSC, Profiling Approaches, Experimental Results, Future Work

  3. Outline: Objectives, DL Benchmarks at NERSC, Profiling Approaches, Experimental Results, Future Work

  4. Objectives
     - Deep Learning (DL) applications demand large-scale computing facilities.
     - DL applications require efficient I/O support in the data processing pipeline to accelerate the training phase.
     - The goals of this project are:
       - Exploring the I/O patterns invoked by multiple DL applications running on HPC systems
       - Addressing possible I/O bottlenecks in the training phase
       - Developing optimization strategies to overcome these possible I/O bottlenecks

  5. Objectives
     - Deep Learning (DL) applications demand large-scale computing facilities.
     - DL applications require efficient I/O support in the data processing pipeline to accelerate the training phase.
     - The goals of this project are:
       - Exploring the I/O patterns invoked by multiple DL applications running on HPC systems
       - Addressing possible I/O bottlenecks in the training phase
       - Developing optimization strategies to overcome these possible I/O bottlenecks

  6. Outline: Objectives, DL Benchmarks at NERSC, Profiling Approaches, Experimental Results, Future Work

  7. HEPCNNB Overview
     - High Energy Physics Deep Learning Convolutional Neural Network Benchmark (HEPCNNB)
     - Runs on distributed TensorFlow using Horovod
     - Can generate particle events described by Standard Model physics as well as events with R-parity-violating supersymmetry
     - Uses a 496 GB dataset of 2048 HDF5 files representing particle collisions generated by the fast Monte-Carlo generator Delphes at CERN
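
As a rough illustration of how a Horovod-distributed HDF5 input stage like HEPCNNB's can be organized, the sketch below shards the event files across ranks and does one bulk read per file. The directory name and the dataset keys "data" and "label" are hypothetical, not taken from the benchmark's source.

```python
# A minimal sketch, assuming a Horovod + h5py input stage similar to HEPCNNB's;
# the directory "hepcnn_data/" and HDF5 keys "data"/"label" are hypothetical.
import glob

import h5py
import horovod.tensorflow as hvd

hvd.init()

# Round-robin shard of the 2048 event files across Horovod ranks.
files = sorted(glob.glob("hepcnn_data/*.h5"))
my_files = files[hvd.rank()::hvd.size()]

def read_shard(paths):
    # Yield (image, label) pairs from this rank's shard of HDF5 files,
    # doing one bulk read per file rather than per-sample reads.
    for path in paths:
        with h5py.File(path, "r") as f:
            images = f["data"][:]
            labels = f["label"][:]
        for x, y in zip(images, labels):
            yield x, y
```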

  8. CDB Overview
     - Climate Data Benchmark (CDB)
     - Runs on distributed TensorFlow using Horovod
     - Acts as an image recognition model that detects extreme-weather patterns
     - Uses a 3.5 TB dataset of 62,738 HDF5 images representing climate data
     - Leverages the TensorFlow Dataset API and Python's multiprocessing package for input pipelining
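
The last bullet can be made concrete with a small sketch of a tf.data pipeline fed by a multiprocessing pool; the HDF5 key "climate", the directory name, and the batch size are assumptions rather than the benchmark's actual settings.

```python
# A minimal sketch (not the benchmark's actual code) of pairing the TensorFlow
# Dataset API with Python's multiprocessing package for input pipelining;
# the HDF5 key "climate" and directory "climate_data/" are assumptions.
import glob
from multiprocessing import Pool

import h5py
import numpy as np
import tensorflow as tf

def load_image(path):
    # Read one climate image from an HDF5 file inside a worker process.
    with h5py.File(path, "r") as f:
        return f["climate"][:].astype(np.float32)

def image_generator(paths, workers=4):
    # Decode files in a process pool so HDF5 reads overlap with training.
    with Pool(workers) as pool:
        for image in pool.imap(load_image, paths):
            yield image

paths = sorted(glob.glob("climate_data/*.h5"))
dataset = (tf.data.Dataset
           .from_generator(lambda: image_generator(paths),
                           output_types=tf.float32,
                           output_shapes=tf.TensorShape([None, None, None]))
           .batch(8)
           .prefetch(1))
```

Prefetching the next batch while the pool reads ahead is what lets the I/O time hide behind compute, which is the pipelining effect the slide refers to.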

  9. Outline: Objectives, DL Benchmarks at NERSC, Profiling Approaches, Experimental Results, Future Work

  10. Profiling Approaches
      - Develop a Python-based TimeLogger tool to profile the application layer
      - Determine the total latency of each training component from a merged interval list
      - Explore the TensorFlow Runtime Tracing Metadata Visualization (TRTMV) tool developed at Google and extract I/O-specific metadata
      - Work on integrating runtime metadata from the application and framework layers
      - Work available at: https://github.com/NERSC/DL-Parallel-IO
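
The merged-interval computation can be illustrated with a short sketch: overlapping (start, end) timestamps recorded for parallel I/O calls are merged before summing, so concurrent reads are not double-counted. The class and method names below are illustrative, not the actual TimeLogger API.

```python
# A sketch of the merged-interval idea behind a TimeLogger-style profiler;
# class and method names are illustrative, not the tool's actual API.
import time
from collections import defaultdict

class TimeLogger:
    def __init__(self):
        # component name -> list of (start, end) timestamps
        self.intervals = defaultdict(list)

    def log(self, component, start, end):
        self.intervals[component].append((start, end))

    def total_latency(self, component):
        # Merge overlapping intervals first so concurrent (parallel) I/O
        # calls are not double-counted, then sum the merged lengths.
        merged = []
        for start, end in sorted(self.intervals[component]):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return sum(e - s for s, e in merged)

# Usage: wrap each read (or preprocessing step) with timestamps.
logger = TimeLogger()
start = time.time()
time.sleep(0.01)                       # stand-in for an HDF5 read
logger.log("read", start, time.time())
print(logger.total_latency("read"))
```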

  11. Outline: Objectives, DL Benchmarks at NERSC, Profiling Approaches, Experimental Results, Future Work

  12. HEPCNNB Latency Breakdown
      [Chart: latency breakdown of HEPCNNB training, comparing Local Shuffle and Global Shuffle]
      - I/O takes more time when Global Shuffling is introduced
      - Global Shuffling affects I/O even for a small dataset and only 5 training epochs
      - The I/O bottleneck can become more severe with increasing epochs

  13. HEPCNNB Read Bandwidth
      [Chart: HEPCNNB read bandwidth, comparing Local Shuffle and Global Shuffle]
      - I/O takes more time when Global Shuffling is introduced
      - Global Shuffling affects I/O even for a small dataset and only 5 training epochs
      - The I/O bottleneck can become more severe with increasing epochs
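
A hedged sketch of the two shuffling modes being compared may clarify why global shuffling costs more I/O: local shuffling permutes samples within each HDF5 file after one sequential read, while global shuffling permutes (file, sample) pairs across the whole dataset and therefore issues many small random reads. The key "data" and the directory name are assumptions, not the benchmark's implementation.

```python
# An illustrative contrast between local and global shuffling of an HDF5
# dataset; the key "data" and directory "hepcnn_data/" are assumptions.
import glob
import random

import h5py

def local_shuffle(paths):
    # One sequential bulk read per file, then shuffle only within that file.
    for path in paths:
        with h5py.File(path, "r") as f:
            samples = list(f["data"][:])
        random.shuffle(samples)
        yield from samples

def global_shuffle(paths):
    # Build a global (file, sample) index, permute it across all files,
    # then issue one small random read per sample.
    index = []
    for path in paths:
        with h5py.File(path, "r") as f:
            index.extend((path, i) for i in range(f["data"].shape[0]))
    random.shuffle(index)
    for path, i in index:
        with h5py.File(path, "r") as f:
            yield f["data"][i]

# Usage: iterate either generator over the dataset's HDF5 files.
paths = sorted(glob.glob("hepcnn_data/*.h5"))
```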

  14. CDB Latency and Read Bandwidth
      [Charts: I/O percentage of training time and read bandwidth for CDB across node counts]
      - The percentage of I/O in the training process is higher when the dataset is larger
      - The I/O percentage increases with the number of nodes
      - Training benefits more from scaling than I/O does

  15. Outline: Objectives, DL Benchmarks at NERSC, Profiling Approaches, Experimental Results, Future Work

  16. Future Work
      - Integrate TRTMV results with TimeLogger data for better profiling of the highly parallelized I/O pipeline
      - Explore the I/O patterns and determine possible I/O bottlenecks in distributed TensorFlow
      - Develop an optimized cross-framework I/O strategy to overcome the possible I/O bottlenecks

  17. Thank You
