Deep Learning on HPC: Performance Factors and Lessons Learned
Weijia Xu
Scalable Computational Intelligence Group
Texas Advanced Computing Center, The University of Texas at Austin
BenchCouncil'19, Denver, CO
Outline
• Motivating applications at TACC
• Challenges for running deep learning on HPC
• Scalability and accuracy
• Scalability and I/O
• Memory error impact
• Conclusions and discussion
Motivating Applications
• Traffic camera video analysis
• In collaboration with the City of Austin
• ~540 MB/hour of MPEG-4 video from one camera
• ~100 GB for a typical study from a single camera
[1] "Deep Learning Methods to Leverage Traffic Monitoring Cameras for Pedestrian Data Applications", Weijia Xu, Natalia Ruiz, Ruizhu Huang, Joel Meyer, Jen Duthie, and John Clary, 26th ITS World Congress (Best Technical Paper)
[2] "Detecting Pedestrian Crossing Events in Large Video Data from Traffic Monitoring Cameras", Weijia Xu, Natalia Ruiz, Kelly Pierce, Ruizhu Huang, Joel Meyer, and Jen Duthie, to appear in IEEE BigData 2019
Motivating Applications
• There are over 400 CCTV IP cameras within the city limits of Austin
• Mostly used only for manual monitoring
• With deep learning, we can
  • Learn more about traffic patterns
  • Understand how roads are used
  • Improve pedestrian safety
• A lot of unexpected…
Motivating Applications
• Neural image resolution enhancement with a super-resolution generative adversarial network
• In collaboration with the Salk Institute
• ~600 GB neural image dataset
• PyTorch + fastai; each run of the early version takes ~24 hours on 16 NVIDIA V100 GPUs
[bioRxiv'19] Fang, L., Monroe, F., Novak, S.W., Kirk, L., Schiavon, C.R., Seungyoon, B.Y., Zhang, T., Wu, M., Kastner, K., Kubota, Y., and Zhang, Z., 2019. "Deep Learning-Based Point-Scanning Super-Resolution Imaging." bioRxiv, p. 740548.
Motivating Applications
• Face recognition
• In collaboration with NASA JPL
• ~100 GB image data
• TensorFlow + Horovod; each run takes ~12 hours on 16 NVIDIA GTX 1080 Ti GPUs
[DLS'19-1] Mattmann, Chris A., and Zhang, Z. "Deep Facial Recognition with TensorFlow", The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO
[2] Courtesy image from: https://megapixels.cc/datasets/msceleb/
More DL Applications at TACC
• Deep learning is both compute-intensive and data-intensive.
Scale Up vs. Scale Out
• Scale up
  • Better and faster GPU cards / specialized hardware, e.g., TPUs
  • High acquisition cost to build a large cluster
• Scale out
  • Using more computing nodes (see the data-parallel sketch below)
  • Consistent with traditional HPC operations
• Specific challenges
  • Accuracy vs. scalability
  • I/O issues
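Several of the applications above (e.g., the JPL face recognition work) scale out with TensorFlow + Horovod. Below is a minimal, hedged sketch of that data-parallel pattern; the model, learning rate, and dataset are placeholders rather than the actual TACC configurations, and exact APIs vary slightly across TF/Horovod versions.

```python
# Minimal data-parallel scale-out sketch with Horovod + tf.keras.
# Model, learning-rate schedule, and dataset are illustrative placeholders.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                     # one process per GPU, launched via mpirun/horovodrun
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Common large-batch heuristic: scale the learning rate with the number of workers.
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)            # all-reduce of gradients across workers

model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
# model.fit(train_dataset, epochs=90, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```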
The Race of ResNet50
• 90-epoch ResNet-50 training finished in 20 minutes on 2,048 KNL nodes with 74.9% top-1 accuracy
• Against an 8-GPU baseline
[Figure: ResNet-50 ImageNet training acceleration, 2016–2019, comparing He et al. (Microsoft), Goyal et al. (Facebook), Codreanu et al. (SURF & Intel), You et al. (UC Berkeley & TACC), Preferred Networks, Tencent & HKBU, Sony Research, and Google (1,024 TPUv3)]
Accuracy vs. Scalability
• To yield high utilization at scale, we need to feed enough data (computation), which results in a large batch size
• Validation (test) accuracy is sensitive to batch size
• A large batch size can result in degraded validation accuracy
• Layer-wise Adaptive Rate Scaling (LARS) algorithm [1]
  • Intuition: the learning rate should be adjusted according to the norm of the weights in each layer (see the sketch below)
[1] You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. "ImageNet Training in Minutes." In Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM, 2018. (Best Paper)
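A minimal NumPy sketch of that layer-wise trust-ratio idea follows; it is not the paper's implementation, and the trust coefficient and weight-decay values are illustrative assumptions.

```python
# Sketch of the LARS intuition: per-layer learning rate scaled by ||w|| / (||g|| + wd * ||w||).
import numpy as np

def lars_update(weights, grads, base_lr, weight_decay=5e-4, trust_coef=0.001):
    """Apply one SGD step with a layer-wise adaptive rate (illustrative values)."""
    new_weights = []
    for w, g in zip(weights, grads):          # one (w, g) pair per layer
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Trust ratio: layers with large weights and small gradients take larger local steps.
        local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
        new_weights.append(w - base_lr * local_lr * (g + weight_decay * w))
    return new_weights
```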
Scalable Training Algorithm
• Using a batch size of 32K while preserving validation accuracy
Scalable Training Algorithm
• Using a batch size of 32K on Intel Xeon Phi 7250 (KNL) and Intel Xeon Platinum 8160 (SKX) nodes
You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. "ImageNet Training in Minutes." In Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM, 2018. (Best Paper)
Scalability vs. Data I/O
• ResNet-50 with ImageNet on 16 NVIDIA 1080 Ti GPUs, mini-batch of 64 per GPU
[Figure: throughput (imgs/sec) vs. number of nodes (4 GPUs per node) for 1, 4, 8, and 16 nodes; ideal scaling is 544, 2,176, 4,352, and 8,704 imgs/sec, while Lustre-backed training peaks below ~2,800 imgs/sec]
I/O on Lustre
Dataset          # files        # dirs   Total size   File size
ImageNet         1.3 million    2,002    140 GB       KB–MB
Neural Image     0.6 million    6        500 GB       MB
Reactor Status   0.17 million   1        65 GB        KB
Deep Learning I/O
• DL's long-lasting, repeated, high-volume, and highly concurrent file access can easily saturate the metadata and data services of a traditional shared file system.
• ResNet-50 with Keras, TensorFlow, and Horovod on 16 nodes, each with 4 GPUs:
  • 128K stat() and readdir() operations with 64-way concurrent access
  • 117M stat(), open(), and close() operations with 256-way concurrent access
  • ~180M read() operations with the same concurrency
  • ~8 hour duration
[DLS'19-2] Zhang, Z., Huang, L., Pauloski, J. G., Foster, Ian T., "Aggregating Local Storage for Scalable Deep Learning I/O", The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO
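A back-of-envelope sketch of where counts of that magnitude come from: only the file count is taken from the "I/O on Lustre" table above, and the 90-epoch schedule is an assumed typical ResNet-50 configuration.

```python
# Rough accounting of metadata/read pressure from per-file dataset access.
n_files  = 1_300_000          # ImageNet training images (from the table above)
n_epochs = 90                 # assumed typical ResNet-50 schedule

opens = n_files * n_epochs    # across all workers, every image file is opened once per epoch
print(f"~{opens / 1e6:.0f}M open()/close() calls over a full run")   # ~117M, roughly the figure measured above
```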
FanStore Design
• FanStore is a transient runtime file system that optimizes I/O for distributed DL training [1]
• Data is partitioned (optionally compressed) and spread across local storage
• File access functions are intercepted and handled in user space
• Remote file access takes the form of an MPI round-trip message (see the conceptual sketch below)
[1] Zhang, Z., Huang, L., Pauloski, J. G., Foster, Ian T., "Aggregating Local Storage for Scalable Deep Learning I/O", The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO
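A conceptual mpi4py sketch of the partition-and-round-trip idea is shown below. The real FanStore intercepts POSIX calls in user space and is not implemented this way; the local path, hashing scheme, and message tags here are illustrative assumptions, and the serving loop on the owner rank is omitted.

```python
# Conceptual sketch only: partition files across node-local storage and
# serve remote reads with an MPI round trip (not the actual FanStore code).
from mpi4py import MPI
import os
import zlib

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
LOCAL_DIR = f"/tmp/fanstore_demo/{rank}"      # assumed node-local partition path

def owner_of(path):
    """Deterministically map a file path to the rank that stores it."""
    return zlib.crc32(path.encode()) % size

def read_file(path):
    if owner_of(path) == rank:
        # Local hit: read the node-local copy directly.
        with open(os.path.join(LOCAL_DIR, os.path.basename(path)), "rb") as f:
            return f.read()
    # Remote hit: one request/response round trip to the owning rank
    # (that rank would run a small service loop answering tag-1 requests).
    comm.send(path, dest=owner_of(path), tag=1)
    return comm.recv(source=owner_of(path), tag=2)
```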
ResNet-50 GPU Results
• 4 NVIDIA 1080 Ti GPUs per node, mini-batch of 64 per GPU
[Figure: throughput (imgs/sec) on 1, 4, 8, and 16 nodes for Lustre, FanStore, and ideal scaling; FanStore reaches 544, 1,902, 4,050, and 7,867 imgs/sec against an ideal of 544, 2,176, 4,352, and 8,704, while Lustre peaks below ~2,800 imgs/sec]
ResNet-50 CPU Results
• Intel Xeon Platinum 8160 nodes on Stampede2, mini-batch size of 256 per node
[Figure: throughput (imgs/sec) on 1, 64, 128, 256, and 512 nodes; FanStore reaches 32, 1,968, 3,901, 7,710, and 15,109 imgs/sec against an ideal of 32, 2,048, 4,096, 8,192, and 16,384]
Memory Error in Deep Learning
• The impact of memory errors on deep learning training is unclear, due to its stochastic nature and mathematical properties
  • Difficult for computing centers or individual researchers to make hardware procurement decisions
  • Difficult for users to estimate their confidence in training correctness on ECC-free processors
  • Potential performance gain from not using ECC
• Goal: quantify the impact of memory errors on deep learning training and investigate alternative solutions for memory error detection
[Cluster'19] Zhang, Z., Huang, L., Huang, R., Xu, W., Katz, D. S., "Quantifying the Impact of Memory Errors in Deep Learning", IEEE Cluster 2019, Albuquerque, NM
Technical Approach
• Focus on the impact of silent data corruption (SDC)
• P(Failure) ≈ P(Failure, SDC) = P(Failure | SDC) × P(SDC)
• To evaluate P(Failure | SDC):
  • Sample the experiment design space
  • Manually flip the selected bit (see the sketch below)
  • Observe validation accuracy and training loss
  • Estimate P(Failure | SDC) via marginal probability
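A minimal sketch of the bit-flip injection step, assuming the weights live in a NumPy float32 array; the tensor, index, and bit position are arbitrary examples, not the study's actual sampling procedure.

```python
# Flip one bit of one float32 element in place to emulate silent data corruption.
import numpy as np

def flip_bit(weights, flat_index, bit):
    """Flip bit `bit` (0 = LSB, 31 = sign) of element `flat_index` in a float32 array."""
    view = weights.reshape(-1).view(np.uint32)    # reinterpret the same memory as uint32
    view[flat_index] ^= np.uint32(1) << np.uint32(bit)

w = np.random.randn(4, 4).astype(np.float32)
before = w[0, 0]
flip_bit(w, flat_index=0, bit=30)                 # bit 30 is an exponent bit: magnitude usually changes drastically
print(before, "->", w[0, 0])
```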
Testing Applications
App        SW            Version   Nodes   Device        Memory   Mem Usage   Run Time
ConvNet    nvcaffe       0.16.5    1       2x 1080 Ti    11 GB    0.45 GB     4.5 mins
LRCN       caffe         1.0.0     1       1x 1080 Ti    11 GB    3.9 GB      16 mins
ResNet50   Intel-Caffe   1.1.0     512     KNL           96 GB    18.4 GB     8 mins
Example: ConvNet with CIFAR-10
• ConvNet with the CIFAR-10 dataset, baseline:
  • 50,000 training items / 10,000 validation items
  • 60,000 iterations / 120 epochs, batch size 100
  • Top-1 test accuracy acceptable range: 76.52% – 80.83%
  • Training loss acceptable range: 0.2594 – 0.4975
• Fault-injection design space:
  Parameter          Values
  Iteration          200, 10200, 20200, 30200, 40200, 50200, 60000
  Phase              forward, backward
  Place              data, model
  Layers             1, 2, …, 15
  Parameter layers   1, 2, …, 7
  Data position      0, mid, last
  Bit position       31, 30, 29, 28, 27, 22
  Repetition         3
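A small sketch that enumerates this design space with itertools; the parameter lists are copied from the table above, the layer dimension is left out because it depends on the injection place, and whether the study sampled or fully enumerated the space is not shown here.

```python
# Enumerate the ConvNet fault-injection design space from the table above
# (excluding the place-dependent layer dimension).
from itertools import product

iterations     = [200, 10200, 20200, 30200, 40200, 50200, 60000]
phases         = ["forward", "backward"]
places         = ["data", "model"]
data_positions = ["0", "mid", "last"]
bit_positions  = [31, 30, 29, 28, 27, 22]
repetitions    = range(3)

space = list(product(iterations, phases, places, data_positions, bit_positions, repetitions))
print(len(space), "injection configurations before the layer dimension is added")
```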
Key Observations
• Training failure is independent of the iteration number
  • This lets us use part of the training process instead of complete runs
• Errors in less significant bits lead to fewer training failures
• Convolution layers have the most training failures, so we estimate the worst-case failure rate assuming every layer is a convolution layer
• Training loss in the immediately following iteration is an effective signal for detecting catastrophic SDCs (a detector sketch follows below)
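A hedged sketch of such a detector, comparing each iteration's loss against a recent moving average; the window size and threshold factor are illustrative choices, not values from the study.

```python
# Use a sudden spike in next-iteration training loss as an SDC alarm (illustrative thresholds).
from collections import deque

class LossSpikeDetector:
    def __init__(self, window=50, factor=3.0):
        self.history = deque(maxlen=window)   # recent per-iteration losses
        self.factor = factor

    def update(self, loss):
        """Return True if this iteration's loss is anomalously high."""
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if loss > self.factor * baseline:
                return True                   # likely catastrophic corruption: roll back / restart
        self.history.append(loss)
        return False
```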
Memory Error Impact on DL Training
App        P(SDC)        P(F|SDC)   Scaling Factor   P(F)          Expected Runs per Failure
ConvNet    3.07 x 10^-6  1.76%      1                5.4 x 10^-8   18.5 M
ResNet50   5.89 x 10^-2  1.22%      9                7.18 x 10^-4  1,610
LRCN       5.19 x 10^-3  0.61%      110              3.17 x 10^-5  31,500
[Cluster'19] Zhang, Z., Huang, L., Huang, R., Xu, W., Katz, D. S., "Quantifying the Impact of Memory Errors in Deep Learning", IEEE Cluster 2019, Albuquerque, NM
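For reference, the ConvNet row follows directly from the product P(F) = P(SDC) x P(F | SDC); the short check below only reproduces that arithmetic.

```python
# Reproducing the ConvNet row of the table above.
p_sdc   = 3.07e-6            # probability of an SDC during one run
p_f_sdc = 0.0176             # P(failure | SDC) estimated from the bit-flip injections
p_f     = p_sdc * p_f_sdc    # ~5.4e-8
print(f"P(F) = {p_f:.2e}, expected runs per failure ~ {1 / p_f:.3g}")   # ~1.85e7, i.e. ~18.5 M
```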