Accelerate Deep Learning Training at Scale on GPUs
Maggie Zhang (张雪萌), maggiez@nvidia.com
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
DL Training: from single GPU to multi-node
ResNet-50 v1.5 training time:
● 2015: 1x K80 (CUDA) - 36,000 minutes (25 days)
● 2016: DGX-1P (NVLink) - 1,200 minutes (20 hours)
● 2017: DGX-1V (Tensor Core) - 480 minutes (8 hours)
● 2018: DGX-2H (NVSwitch) - 70 minutes on MLPerf
● 2018: DGX cluster, at scale - 6.3 minutes on MLPerf
● 2019: DGX-2H (NVSwitch) - 52.7 minutes on MLPerf
● 2019: DGX SuperPOD, at scale - 1.33 minutes on MLPerf
The whole stack must be considered
● Compute
● Network
● Storage
● Frameworks & Libraries
● Numerical methods
● Training recipes
MLPerf: NVIDIA advancing AI training
Time to train: from 8 hours to 80 seconds
2019 MLPerf IDs (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23
Largest TensorFlow model at scale
Oak Ridge National Lab scales a TensorFlow climate analytics model up to 27,360 V100 GPUs
2018 Gordon Bell Prize Winner
Source: https://arxiv.org/pdf/1810.01993.pdf
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
Datasets getting larger
● Unlabeled data:
○ Language models: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
○ GANs: unlabeled images and videos
○ Reinforcement learning: unsupervised self-play generates unlimited data
● Labeled data:
○ ImageNet (2012): 1.3M images, 1,000 categories
○ Open Images (2019): 9M images, 6,000 categories
○ Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 hours of driving
DL models increasing in complexity
Next-level use cases require gigantic models: NLP generative tasks (chatbots, e-mail auto-completion, document summarization, Q&A, sentiment, translation), speech recognition, image recognition, object detection, visual search, social tagging, autonomous vehicles
Model sizes have grown from 26M to 340M to 1.5B parameters, and now to 8.3B parameters with Project Megatron
Project Megatron: 8.3B parameters, 8-way model parallel, 64-way data parallel, 24x larger than BERT
https://github.com/NVIDIA/Megatron-LM
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
Scaling == whack-a-mole?
Solve one bottleneck, and another one pops up
Multi-node infrastructure requirements
Multi-node success depends on: system design, data center management, SW stack
Challenges of multi-node DL training
● Hardware GPU cluster design:
○ Compute: significant CPU-to-GPU ratio, interconnect with GPU
○ Storage: high-speed NFS, multi-tier caching
○ Networking: topology and bandwidth, NVLINK, GPUDirect RDMA
● GPU cluster management:
○ Scheduler: Slurm vs. Kubernetes
○ Container technologies: Docker, Enroot, Singularity, etc.
● Integrated software stack:
○ NVIDIA libraries: CUDA, cuDNN, NCCL
○ DL framework scale-out optimization
○ Model scale-out implementation & optimization
A basic recipe for deep learning scaling
Step 1: Optimize your single-GPU model
Step 2: Scale to multiple GPUs on one node
Step 3: Scale to multiple nodes
Case study: BERT (Bidirectional Encoder Representations from Transformers)
BERT model scripts:
• https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
• https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
• Configurations for convergence, from 8 to 1,500 GPUs, multi-node ready
• Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task (super-human question answering)
NVIDIA Deep Learning Examples include many model scripts with best practices for accuracy and performance
Why multi-node BERT training
• Pre-training on unlabelled data opens up opportunities to use massive amounts of data:
  • BooksCorpus (800 million words)
  • English Wikipedia (2.5 billion words), multi-language Wikipedia
  • WebText (OpenAI, 8M documents, 40 GB of text)
• More data tends to lead to better accuracy
• BERT pre-training is computationally intensive and takes days even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs
BERT multi-node pre-training performance
Metric: time to train

Nodes | DGX-1 (16 GB) GPUs | Time to train (hrs) | DGX-2H (32 GB) GPUs | Time to train (hrs)
1     | 8                  | 153.6 (6.3 days)    | 16                  | 58.4 (2.4 days)
4     | 32                 | 39.3                | 64                  | 15.4
16    | 128                | 10.4                | 256                 | 3.9
64    | -                  | -                   | 1024                | 1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
* Time to train is measured for mixed precision to a training loss of 1.3 in PyTorch, with the LAMB optimizer
** Gradient accumulation is applied to the DGX-2H 1-, 4-, and 16-node configurations
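As a quick worked example (not from the slides), the table above implies how efficiently pre-training scales. The helper below is a hypothetical calculation that compares achieved speedup against ideal linear speedup; note that the gradient-accumulation footnote means the DGX-2H numbers are not perfectly comparable across node counts.

def scaling_efficiency(base_gpus, base_hours, scaled_gpus, scaled_hours):
    """Achieved speedup divided by ideal linear speedup."""
    speedup = base_hours / scaled_hours
    ideal = scaled_gpus / base_gpus
    return speedup / ideal

# DGX-1 (16 GB): 8 GPUs at 153.6 hrs vs. 128 GPUs at 10.4 hrs
print(scaling_efficiency(8, 153.6, 128, 10.4))   # ~0.92, i.e. about 92% efficient
# DGX-2H (32 GB): 16 GPUs at 58.4 hrs vs. 1,024 GPUs at 1.2 hrs
print(scaling_efficiency(16, 58.4, 1024, 1.2))   # ~0.76, i.e. about 76% efficient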
Step 1: Optimize model
• Create efficient data pipeline
• Enable mixed precision training
• Enable XLA
• Ensure latest GPU libraries
• Develop model in container to facilitate scaling out
Step 1: Optimize model
Data pipeline
• Use tf.data to create performant input pipelines
• Test for I/O bottlenecks with a trivial model
• NVIDIA DALI accelerates image-based input pipelines (see the DALI sketch after the BERT pipeline code on the next slide)
BERT data pipeline (TFRecord: fast binary format; parallel read, map, & batch; fused map & batch op)

d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))

# Parallel read: interleave records from several TFRecord files at once.
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)

# Fused map & batch op: decode and batch records in one parallel step.
d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py
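The previous slide also mentions NVIDIA DALI for image-based pipelines (not used for BERT, which is text). Below is a minimal sketch using DALI's older class-based API, assuming a directory of JPEGs; the data directory, batch size, and image size are placeholders, not from the slides.

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class ImagePipeline(Pipeline):
    """Reads JPEGs, decodes them on the GPU, and resizes them for training."""
    def __init__(self, batch_size, num_threads, device_id, data_dir):
        super(ImagePipeline, self).__init__(batch_size, num_threads, device_id)
        self.reader = ops.FileReader(file_root=data_dir, random_shuffle=True)
        # "mixed" decodes on the GPU with nvJPEG, offloading work from the CPU.
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.resize = ops.Resize(device="gpu", resize_x=224, resize_y=224)

    def define_graph(self):
        jpegs, labels = self.reader()
        images = self.decode(jpegs)
        images = self.resize(images)
        return images, labels

pipe = ImagePipeline(batch_size=64, num_threads=4, device_id=0, data_dir="/data/images")
pipe.build()
images, labels = pipe.run()  # one batch of decoded, resized images on the GPU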
Step 1: Optimize model
Automatic Mixed Precision (AMP)
• 1-line optimizer wrapper:
  opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
• Up to 3x speedup in training on Tensor Cores, with:
  • Same accuracy
  • No change in hyperparameters
  • ½ memory bandwidth & footprint
• Optimal on Volta and Turing GPUs
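A minimal sketch of where the AMP graph-rewrite wrapper fits in a TF1-style training script; the tiny dense model and hyperparameters are placeholders, not taken from the BERT scripts.

import tensorflow as tf  # TF 1.14+ graph mode

x = tf.placeholder(tf.float32, shape=[None, 1024])
y = tf.placeholder(tf.float32, shape=[None, 10])
logits = tf.layers.dense(x, 10)  # trivial stand-in model
loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits)

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
# The one-line AMP rewrite: eligible ops run in FP16 on Tensor Cores, and
# dynamic loss scaling is inserted automatically.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)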
Step 1: Optimize model
Automatic Mixed Precision (AMP)
• Robust speedup across different TensorFlow workloads
• https://arxiv.org/abs/1710.03740
Step 1: Optimize model
XLA (Accelerated Linear Algebra)
• TensorFlow XLA can accelerate models with minimal code changes
• XLA optimizes the graph, mostly by fusing compatible kernels
• Set the XLA optimization level:
  config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531
System config: Xeon E5-2698 v4 CPU with 256 GB system RAM, single V100 Tensor Core GPU 32 GB. Tests run using the NVIDIA 18.11 TensorFlow container.
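A small sketch of how that session-config flag is used in a TF1-style script; the matmul/relu graph is only an illustration of the kind of compatible ops XLA can fuse, not code from the BERT repository.

import tensorflow as tf  # TF 1.x graph mode

config = tf.ConfigProto()
# Enable XLA JIT compilation for the whole graph, as in the BERT script.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    # Compatible ops (here a matmul followed by a relu) can be fused by XLA
    # into fewer, larger kernels.
    c = tf.nn.relu(tf.matmul(a, b))
    sess.run(c)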
Step 1: Optimize model
Latest GPU optimizations
• Latest compatible features and tuning from the CUDA toolkit and deep learning libraries (cuDNN, cuBLAS, NCCL)
Step 1: Optimize model
Latest GPU optimizations
• NGC containers: fully featured DL containers
• DL frameworks compiled with the latest GPU libraries
• Portability of application libraries facilitates multi-node scale-out
Step 2: Scale to multiple GPUs
• Understand data parallel training concepts
• Ensure optimal inter-GPU communication
• Apply a high-level API for multi-GPU training
Step 2: Scale to multiple GPUs
Under the hood
• Single GPU (diagram)
Step 2: Scale to multiple GPUs
Under the hood
• Multiple GPUs
• Data parallel training
• Allreduce algorithm
• NCCL: NVIDIA Collective Communication Library
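To make the data-parallel + allreduce idea concrete, here is a small NumPy sketch (a single-process illustration, not the NCCL implementation): each replica computes gradients on its own shard of the global batch, an allreduce averages them, and every replica applies the identical update so weights stay in sync.

import numpy as np

np.random.seed(0)
num_replicas = 4
weights = np.zeros(3)  # every replica starts from the same weights
lr = 0.1

# Each replica sees a different shard of the global batch, so it computes a
# different local gradient (random values stand in for real gradients here).
local_grads = [np.random.randn(3) for _ in range(num_replicas)]

# Allreduce: sum the gradients across replicas and divide by the replica count.
# In real training, NCCL performs this reduction over NVLink/InfiniBand.
avg_grad = sum(local_grads) / num_replicas

# Every replica applies the same averaged gradient, keeping weights identical.
weights -= lr * avg_grad
print(weights)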
Step 2: Scale to multiple GPUs
Under the hood
• Inter-GPU communication: effective bandwidth in GB/s (chart)
Step 2: Scale to multiple GPUs
Under the hood
• Full non-blocking bandwidth (diagram)
Step 2: Scale to multiple GPUs
Approach 1: Horovod
• Popular approach to enable multi-GPU/multi-node training in TensorFlow/Keras
• Strong NCCL integration
• Sample commands:
  • Single node (4 GPUs):
    horovodrun -np 4 -H localhost:4 python train.py
  • Multi-node (4 nodes with 4 GPUs each):
    horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
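As a hedged sketch of what the train.py above might contain, here is the standard Horovod TensorFlow (TF1-style) integration pattern; the tiny model and step count are placeholders, while hvd.init(), hvd.DistributedOptimizer, and the broadcast hook are the usual integration points.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched by horovodrun

# Pin this process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model: any TF1-style loss works here.
x = tf.random_normal([32, 1024])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(tf.square(logits))

# Scale the learning rate by the number of workers, then wrap the optimizer so
# gradients are averaged with NCCL allreduce before every update.
opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so all workers start identically,
# and stop after a fixed number of steps.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)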