Initial Characterization of I/O in Large-Scale Deep Learning Applications
Fahim Chowdhury, Jialin Liu, Quincey Koziol, Thorsten Kurth, Steven Farrell, Suren Byna, Prabhat, and Weikuan Yu
November 12, 2018
Outline
- Objectives
- DL Benchmarks at NERSC
- Profiling Approaches
- Experimental Results
- Future Work
Objectives
- Deep Learning (DL) applications demand large-scale computing facilities.
- DL applications require efficient I/O support in the data processing pipeline to accelerate the training phase.
- The goals of this project are:
  - Exploring the I/O patterns of multiple DL applications running on HPC systems
  - Identifying possible I/O bottlenecks in the training phase
  - Developing optimization strategies to overcome those bottlenecks
HEPCNNB Overview
- High Energy Physics Deep Learning Convolutional Neural Network Benchmark (HEPCNNB)
- Runs on distributed TensorFlow using Horovod
- Distinguishes particle events described by standard model physics from events with R-parity violating supersymmetry
- Uses a 496 GB dataset of 2048 HDF5 files representing particle collisions generated by the Delphes fast Monte Carlo generator at CERN
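The slides do not show the benchmark's input code; the following is a minimal, hypothetical sketch of how an HDF5-backed, Horovod-sharded input step for a benchmark like HEPCNNB could look. The file glob, the HDF5 dataset names ("data", "label"), and the local-vs-global shuffle handling are assumptions for illustration, not the actual HEPCNNB implementation.

```python
# Minimal sketch (not from the slides) of an HDF5-backed, Horovod-distributed
# input step. File layout and dataset names ("data", "label") are hypothetical.
import glob
import h5py
import numpy as np
import horovod.tensorflow as hvd

hvd.init()

# Shard the HDF5 file list across Horovod ranks so each worker reads a subset.
files = sorted(glob.glob("hep_data/*.h5"))
my_files = files[hvd.rank()::hvd.size()]

def load_file(path):
    """Read one HDF5 file into memory (images and labels)."""
    with h5py.File(path, "r") as f:
        images = f["data"][...].astype(np.float32)   # hypothetical dataset name
        labels = f["label"][...].astype(np.int32)    # hypothetical dataset name
    return images, labels

# Local shuffle: permute files within this rank's shard only.
# A global shuffle would instead re-permute the full file list across all ranks
# every epoch, which spreads reads over many more files and stresses the file system.
for path in np.random.permutation(my_files):
    images, labels = load_file(path)
    # ... feed (images, labels) into the TensorFlow training step ...
```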
CDB Overview
- Climate Data Benchmark (CDB)
- Runs on distributed TensorFlow using Horovod
- Acts as an image recognition model to detect extreme weather patterns
- Uses a 3.5 TB dataset of 62,738 HDF5 images representing climate data
- Leverages the TensorFlow Dataset API and Python's multiprocessing package for input pipelining
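To illustrate the input-pipelining point above, here is a minimal sketch built on the current tf.data API. It is not the actual CDB code: the file paths, HDF5 dataset names, and batch size are hypothetical, and the multiprocessing-based portion of the real pipeline is omitted.

```python
# Minimal sketch (not the actual CDB code) of a TensorFlow Dataset API input
# pipeline over HDF5 files. Paths and dataset names are hypothetical.
import glob
import h5py
import numpy as np
import tensorflow as tf

def read_sample(path):
    """Read one climate image and its label from an HDF5 file (hypothetical layout)."""
    with h5py.File(path.decode(), "r") as f:
        image = f["climate/data"][...].astype(np.float32)
        label = f["climate/labels"][...].astype(np.int32)
    return image, label

files = sorted(glob.glob("climate_data/*.h5"))
dataset = (
    tf.data.Dataset.from_tensor_slices(files)
    .shuffle(len(files))                                    # shuffle file order each epoch
    .map(lambda p: tf.numpy_function(read_sample, [p], [tf.float32, tf.int32]),
         num_parallel_calls=tf.data.AUTOTUNE)               # parallelize HDF5 reads
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)                             # overlap I/O with training
)
```

The prefetch stage is what lets the next batch's reads overlap with the current training step, which is the kind of pipelining the slide refers to.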
Profiling Approaches
- Develop a Python-based TimeLogger tool to profile the application layer
  - Determine the total latency of each training component from its merged interval list
- Explore the TensorFlow Runtime Tracing Metadata Visualization (TRTMV) tool developed at Google and extract I/O-specific metadata
- Ongoing: integrating runtime metadata from the application and framework layers
- Code available at: https://github.com/NERSC/DL-Parallel-IO
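A minimal sketch of the interval-merging idea described above, under the assumption that each training component logs (start, end) intervals; this is not the actual TimeLogger code from the repository.

```python
# Sketch (assumptions, not the actual TimeLogger): log (start, end) intervals per
# training component, merge overlapping intervals, and report total covered time.
from collections import defaultdict

class TimeLogger:
    def __init__(self):
        self.intervals = defaultdict(list)   # component name -> list of (start, end)

    def log(self, component, start, end):
        self.intervals[component].append((start, end))

    @staticmethod
    def _merge(intervals):
        """Merge overlapping intervals so concurrent operations are not double-counted."""
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    def total_latency(self, component):
        """Total wall-clock time covered by a component's merged intervals."""
        return sum(end - start for start, end in self._merge(self.intervals[component]))

# Example: two overlapping reads count once over their combined span.
logger = TimeLogger()
logger.log("io", 0.0, 2.0)
logger.log("io", 1.5, 3.0)
logger.log("compute", 3.0, 5.0)
print(logger.total_latency("io"))       # 3.0
print(logger.total_latency("compute"))  # 2.0
```

Merging intervals matters because I/O calls issued by parallel reader threads overlap in time; summing raw durations would overstate the I/O share of training.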
HEPCNNB Latency Breakdown
[Figure: per-component latency breakdown, Local Shuffle vs. Global Shuffle]
- I/O takes more time when global shuffling is introduced
- Global shuffling affects I/O even for a small dataset and only 5 training epochs
- The I/O bottleneck can become more severe as the number of epochs increases
HEPCNNB Read Bandwidth
[Figure: read bandwidth, Local Shuffle vs. Global Shuffle]
- I/O takes more time when global shuffling is introduced
- Global shuffling affects I/O even for a small dataset and only 5 training epochs
- The I/O bottleneck can become more severe as the number of epochs increases
CDB Latency and Read Bandwidth
[Figure: I/O fraction of training time and read bandwidth across node counts]
- The fraction of training time spent on I/O is higher when the dataset is larger
- The I/O percentage increases with the number of nodes
- Training computation benefits more from scaling than I/O does
Future Work
- Integrate TRTMV results with TimeLogger data for better profiling of the highly parallelized I/O pipeline
- Explore the I/O patterns and determine possible I/O bottlenecks in distributed TensorFlow
- Develop an optimized cross-framework I/O strategy to overcome the possible I/O bottlenecks
Thank You