
Distributed Deep Learning, by Mathew Salvaris (PowerPoint presentation)



  1. Distributed Deep Learning Mathew Salvaris

  2. What will be covered • Overview of distributed training • What affects distributed training: network, model, data location, data format

  3. Deep Learning Model (CNN) [diagram: RGB channels of input image; convolution layer with kernels; pooling layer; fully connected layer; penultimate layer; output classes Cat, Dog, Mouse]

  4. Distributed training mode: Data parallelism [diagram labels: Job manager; Dataset split into Subset 1 and Subset 2; Worker 1 and Worker 2, each with the CNN model]
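The data-parallel arrangement on this slide can be sketched in a few lines of plain Python. This is a toy illustration, not the setup used in the talk: the one-parameter linear model, the `gradient` and `data_parallel_step` helpers, the learning rate and the four-sample dataset are all assumptions made for the example. It also assumes the dataset divides evenly among the workers.

```python
# Toy data parallelism: each worker holds a full copy of the model (here a
# single weight w) and computes a gradient on its own subset of the dataset.

def gradient(w, subset):
    """Mean gradient of 0.5 * (w * x - y)**2 over the (x, y) pairs in subset."""
    return sum((w * x - y) * x for x, y in subset) / len(subset)

def data_parallel_step(w, dataset, num_workers, lr):
    # The job manager splits the dataset into equal-sized subsets
    # (assumes len(dataset) is divisible by num_workers).
    size = len(dataset) // num_workers
    subsets = [dataset[i * size:(i + 1) * size] for i in range(num_workers)]
    # Each worker computes a gradient with its own copy of w.
    grads = [gradient(w, s) for s in subsets]
    # Gradients are averaged so that every copy applies the same update.
    avg_grad = sum(grads) / num_workers
    return w - lr * avg_grad

dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
first_step = data_parallel_step(0.0, dataset, num_workers=2, lr=0.05)

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, dataset, num_workers=2, lr=0.05)
```

With equal-sized subsets, the averaged gradient matches the gradient over the whole dataset, so the workers stay in sync while each processes only part of the data.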

  5. Distributed training mode: Model parallelism [diagram labels: Job manager; Dataset; Subset 1 (shown for both workers); Worker 1 and Worker 2 with the CNN model]

  6. Data parallelism vs model parallelism • Data parallelism: easier implementation; stronger fault tolerance; higher cluster utilization • Model parallelism: better scalability of large models; less memory on each GPU

  7. Horovod: Ring All Reduce
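The ring all-reduce pattern named on this slide can be simulated in plain Python. The sketch below is not Horovod's API or implementation; `ring_allreduce` is a hypothetical helper that only mimics the communication pattern (a scatter-reduce phase followed by an all-gather phase), and it assumes the vector length is divisible by the number of workers.

```python
def ring_allreduce(worker_data):
    """Sum equal-length lists across workers using a simulated ring."""
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    chunk = length // n
    # Copy so the inputs are untouched; data[r] is worker r's buffer.
    data = [list(v) for v in worker_data]

    def sl(c):
        # Slice covering chunk c of a buffer.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: scatter-reduce. In step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][sl((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][sl(c)] = [a + b for a, b in zip(data[dst][sl(c)], payload)]

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2: all-gather. In step s, worker r forwards chunk (r + 1 - s) % n
    # and the right neighbour overwrites its copy with the reduced values.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][sl((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][sl(c)] = payload
    return data

grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_allreduce(grads)
```

Each worker sends and receives only one chunk per step, and the whole exchange takes 2 × (n − 1) steps, so the data sent per worker stays roughly constant as workers are added.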

  8. Effects of Network, Model and Precision

  9. Setup • Clusters of 8 nodes using K80, P40, P100 and V100 (4 GPUs per node + InfiniBand) • Two MPI configurations: OpenMPI+NCCL and IntelMPI

  10. Experiments • 345 experiments across many different models, including ResNet50, MobileNet V2 etc. • Using synthetic data • Batch size remains 64 across all models and GPUs • Use the benchmarking scripts from TensorFlow

  11. Distributed training with synthetic data [diagram: compute pool]

  12. Single GPU Mathew Salvaris @msalvaris

  13. 32 GPUs

  14. 32 GPUs

  15. MobileNet

  16. MobileNet

  17. [diagram across GPUs K80, P40, P100 and V100] • Batch execution: time it takes to process a batch on the GPU • Data transfer: time it takes to transfer weights between GPUs
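One way to read the two quantities on this slide is as the parts of a single training step. The sketch below assumes, as a simplification not stated in the slides, that batch execution and data transfer do not overlap; `step_efficiency` and all timings are hypothetical.

```python
def step_efficiency(batch_time_s, transfer_time_s):
    """Fraction of a training step spent computing, assuming no overlap
    between batch execution and data transfer."""
    return batch_time_s / (batch_time_s + transfer_time_s)

# Hypothetical timings: the transfer time is the same in both cases,
# but the faster GPU finishes its batch four times sooner.
slower_gpu = step_efficiency(batch_time_s=0.20, transfer_time_s=0.05)
faster_gpu = step_efficiency(batch_time_s=0.05, transfer_time_s=0.05)
```

Under this model, the fixed transfer time takes a larger share of each step as the GPU gets faster or the model gets cheaper to compute, so the faster GPU spends less of its time computing.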

  18. ResNet50 Full Precision vs Mixed Precision [32 V100s] [bar chart: images/second and scaling efficiency for Full precision [64] and Mixed precision [256]; data labels: 6,629 and 23,436 images/second; 54 and 82 scaling efficiency]
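The slides do not define scaling efficiency. A common definition, assumed here, is measured throughput divided by the throughput of a perfectly linear scale-up from one GPU. The function name and the numbers below are hypothetical and are not taken from the chart.

```python
def scaling_efficiency(throughput_n, throughput_1, num_gpus):
    """Percentage of ideal linear scaling achieved on num_gpus GPUs.

    throughput_n: images/second measured on num_gpus GPUs
    throughput_1: images/second measured on a single GPU
    """
    ideal = num_gpus * throughput_1
    return 100.0 * throughput_n / ideal

# Hypothetical example: 4 GPUs reach 320 images/second where one GPU
# reaches 100 images/second.
example = scaling_efficiency(throughput_n=320.0, throughput_1=100.0, num_gpus=4)
```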

  19. Effects of Storage

  20. Experiments • Using ResNet50 across three frameworks [PyTorch, TensorFlow, Keras] • Using real and synthetic data; real data on local, NFS and Blob storage • Batch size remains 64 across all configurations • Uses V100 GPUs

  21. Distributed training with NFS [diagram: copy data; NFS share; compute pool; mounted fileshare]

  22. Distributed training with blob storage [diagram: copy data; mounted blob; compute pool; mounted fileshare]

  23. Distributed training with local storage [diagram: copy data; compute pool; mounted fileshare]

  24. ResNet50 - Relative performance across storage [bar chart: relative performance (0 to 1) for TensorFlow, Keras and PyTorch across Synthetic, Local (SSD), NFS, Premium Blob and Blob]

  25. Data Loaders and Preprocessors • Keras Data Loader: simple, with no parameters for buffering and parallelizing • PyTorch Data Loader: specify number of workers with num_workers
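To illustrate what a `num_workers`-style setting buys, here is a standard-library sketch that preprocesses the samples of each batch in parallel. It is not PyTorch code: PyTorch's DataLoader uses worker processes, whereas this stand-in uses a thread pool, and `preprocess` and `load_batches` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(sample):
    # Stand-in for decoding and augmenting one image (hypothetical).
    return sample * 2

def load_batches(samples, batch_size, num_workers):
    """Yield preprocessed batches, preprocessing samples in parallel."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # pool.map returns results in input order.
            yield list(pool.map(preprocess, batch))

batches = list(load_batches(list(range(10)), batch_size=4, num_workers=2))
```

A loader with no such parameter runs this work on a single thread, which can leave the GPU waiting for input.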

  26. TensorFlow • Highly configurable • Many options: buffer, shuffle, cache and shard • Daunting and easy to get wrong https://www.tensorflow.org/guide/performance/datasets
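What some of these options do can be shown with plain Python generators. The functions below borrow the names `shard`, `shuffle` and `batch` for readability but are hypothetical stand-ins, not the TensorFlow API. Sharding before shuffling is an assumption of this sketch, chosen so that each worker sees a disjoint part of the data.

```python
import random

def shard(items, num_shards, index):
    """Keep every num_shards-th item, starting at position index."""
    for i, item in enumerate(items):
        if i % num_shards == index:
            yield item

def shuffle(items, buffer_size, seed):
    """Shuffle with a bounded buffer: at most buffer_size items are held."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Emit a random item from the buffer to make room for the next.
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        # Drain what is left once the input is exhausted.
        yield buffer.pop(rng.randrange(len(buffer)))

def batch(items, batch_size):
    """Group items into lists of batch_size; the last one may be shorter."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current

# Worker 0 of 2: take its shard first, then shuffle, then batch.
worker0_batches = list(batch(shuffle(shard(range(12), 2, 0), 4, seed=0), 3))
```

A small shuffle buffer only mixes nearby items, and a buffer size of 1 leaves the order unchanged, which is one way such a pipeline is easy to get wrong.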

  27. Effects of Data Type

  28. TensorFlow Records • Binary data format created for TensorFlow – recommended format for TensorFlow • Can aggregate a number of examples into a smaller number of TFRecords – efficient for transferring and reading in the cloud • Have to export data to the format – has to be tailored to the use case
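The idea of packing many examples into fewer files can be shown with a simplified length-prefixed record stream. This is not the actual TFRecord layout, which also stores checksums alongside each record; `write_records` and `read_records` are hypothetical helpers, and an in-memory buffer stands in for a file.

```python
import io
import struct

def write_records(stream, examples):
    """Append each bytes payload to stream, prefixed by its length."""
    for payload in examples:
        stream.write(struct.pack("<Q", len(payload)))  # 8-byte little-endian length
        stream.write(payload)

def read_records(stream):
    """Yield payloads back from a stream written by write_records."""
    while True:
        header = stream.read(8)
        if not header:
            return  # end of stream
        (length,) = struct.unpack("<Q", header)
        yield stream.read(length)

buffer = io.BytesIO()
write_records(buffer, [b"cat", b"", b"mouse"])
buffer.seek(0)
records = list(read_records(buffer))
```

Reading one large sequential stream avoids opening a separate small file per example, which is the property that makes such formats efficient to transfer and read from remote storage.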

  29. ResNet50 – Data Type Performance [Average] [bar chart: average images/second (0 to 7,000) at 8, 16 and 32 for Synthetic, Images and TFRecords]

  30. ResNet50 – Data Format Performance [Maximum] [bar chart: maximum images/second (0 to 7,000) at 8, 16 and 32 for Synthetic, Images and TFRecords]

  31. Things not discussed • Asynchronous distributed training • Tradeoff between batch size and other parameters • Optimization of TensorFlow pipeline • Other data formats such as Parquet (Petastorm) • Transform libraries [albumentations] • Distributed file systems and other storage: BeeGFS, GlusterFS, Lustre etc. • Models other than CNN

  32. Summary • Do try to use enhanced networking wherever possible, especially for the latest GPUs • Training small models using distributed training is not recommended • Do use TFRecords or other columnar or row-based data formats • Not all data loaders are equal

  33. Thanks & Questions?
