S9500 - Deep Learning Framework Container Optimizations
Joey Conway, Senior Product Manager of Deep Learning Software
Michael O’Connor, Director of Software, Optimized Frameworks
Cliff Woolley, Director of Engineering, Optimized Frameworks
AGENDA
● Deep Learning Frameworks Team Overview
● Deep Learning Framework Container Highlights
  ○ Best Performance
  ○ Latest Features
  ○ Best Practices
● Additional Resources
NVIDIA Deep Learning Frameworks Team
Overview of community interactions
[Diagram: the Deep Learning Frameworks Team sits between the open-source framework community, to which it contributes upstream work, and DGX and NGC customers, to whom it delivers containerized frameworks.]
Challenges with Deep Learning Performance
● Version Compatibility
● Resources & Best Practices
Deep Learning Frameworks Highlights
● Best NVIDIA Performance: Deep Learning Framework optimizations for NVIDIA hardware; Volta Tensor Core support for mixed precision (FP16) across TensorFlow, MXNet, PyTorch, and NVCaffe
● Latest NVIDIA Features: latest NVIDIA Deep Learning libraries (cuDNN, cuBLAS, and NCCL) incorporated; Automatic Mixed Precision for TensorFlow, PyTorch, and MXNet; updated multi-node support
● Best Practices & QA Verified: improved documentation with best practices and monthly release notes; thorough monthly quality assurance testing
TensorFlow Performance on ResNet-50 with DGX
[Chart: Automatic Mixed-Precision (AMP) performance improvements]
DGX Mixed-Precision Led MLPerf
World’s Fastest Industry-Wide AI Benchmark Achieved on NVIDIA GPUs
Time to accuracy on a single node:
● Image Classification - ResNet-50 v1.5 (MXNet): 70 minutes
● Object Detection - Mask R-CNN (PyTorch): 167 minutes
● Object Detection - SSD (PyTorch): 14 minutes
● Translation (recurrent) - GNMT (PyTorch): 10 minutes
● Translation (non-recurrent) - Transformer (PyTorch): 19 minutes
● Recommendation - NCF (PyTorch): 0.4 minutes
Test platform: DGX-2H - dual-socket Xeon Platinum 8174, 1.5 TB system RAM, 16 x 32 GB Tesla V100 SXM-3 GPUs connected via NVSwitch
BERT: Performance Improvements from the MLPerf Transformer Carry Over to BERT
● State-of-the-art model for NLP tasks
● Compute-intensive, Transformer-like workload
● Optimizations from MLPerf carry over in both PyTorch and TF
● TF training scripts released here: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT
  ○ Pretraining (Wikipedia)
  ○ Q&A fine-tuning (SQuAD)
● Mixed precision using Tensor Cores
Tensor Core Examples: Developer Page
https://developer.nvidia.com/deep-learning-examples
New deep learning training scripts:
● Tensor Core optimized performance
● State-of-the-art accuracy using Tensor Cores
Serve as a quick start guide:
● How we implemented mixed precision
● Exposing hyperparameters for further adjustment
Code examples available on:
● GitHub: https://www.github.com/NVIDIA/deeplearningexamples
● NGC DL Framework containers
● NGC Model Scripts registry
Tensor Core Examples: Developer Page
https://developer.nvidia.com/deep-learning-examples
Available model training scripts:
● Image Classification: ResNet-50 v1.5
● Object Detection: SSD with RN50; Mask R-CNN with RN50
● Translation: GNMT; Transformer
● Recommender: NCF
● Text-to-Speech: Tacotron 2 and WaveGlow
PyTorch GNMT Performance
https://developer.nvidia.com/deep-learning-examples
DGX-1V 16G
● Time to accuracy: 46 minutes
● BLEU score (accuracy): 24.45
● Tokens per second: 387,282
● Data set: WMT16 English to German
● NGC 19.01 PyTorch container
Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/GNMT
PyTorch GNMT Performance
https://developer.nvidia.com/deep-learning-examples
DGX-2 32G
● Time to accuracy: 26.3 minutes
● BLEU score (accuracy): 24.22
● Tokens per second: 738,521
● Data set: WMT16 English to German
● NGC 19.01 PyTorch container
Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/GNMT
MXNet RN50 Performance
https://developer.nvidia.com/deep-learning-examples
DGX-1V 16G
● Time to train: 3.3 hours
● Top-1 accuracy: 76.49%
● Images per second: 10,263
● Data set: ImageNet
● NGC 18.12 MXNet container
Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5
PyTorch NCF Performance
https://developer.nvidia.com/deep-learning-examples
DGX-1V 16G
● Time to accuracy: < 1 minute
● Hit rate at 10 (accuracy): 0.96
● Samples per second: 99,332,230
● Data set: MovieLens 20M
● NGC 18.12 PyTorch container
Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF
Tensor Core Examples: Coming Next
● Adding existing models to more frameworks
● Optimizing more models for Tensor Cores
● Releasing externally and maintaining
● More efforts in progress!
Top use cases: classification; object detection; segmentation (medical imaging); segmentation (manufacturing); audio speech recognition (ASR); text to speech (TTS); natural language processing (NLP); recommendation systems
Latest NVIDIA Features
MXNet:
● Multi-node support w/ Horovod
● NHWC support
● MXNet-AMP
● Mixed precision support for TensorRT
PyTorch:
● PyTorch-AMP: unified mixed precision interface (see the sketch after this list)
● Automatic fusion for elementwise ops
TensorFlow:
● TensorFlow-AMP
● More TensorRT op coverage
● Added cuDNN RNN features
Overall:
● Mixed precision tools
● Tensor Core optimized examples with trained models
● TensorRT integration
● Added Jupyter & JupyterLab
● Jetson releases
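To make the PyTorch-AMP item concrete, here is a minimal sketch of the unified mixed-precision interface as exposed through NVIDIA Apex, which ships in the NGC PyTorch container of this period; the tiny model, data, and hyperparameters are placeholders for illustration, not part of the slides.

```python
import torch
from apex import amp  # Apex ships preinstalled in the NGC PyTorch container

# A tiny stand-in model and optimizer; in practice these come from the user's
# own training script.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One call wraps model and optimizer for mixed precision; opt_level="O1"
# casts selected ops to FP16 and keeps numerically sensitive ops in FP32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(32, 128, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
# Scale the loss so that small FP16 gradients do not underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```

The same `amp.initialize` call can be dropped into an existing training loop; only the backward pass changes, going through `amp.scale_loss`.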
AUTOMATIC MIXED PRECISION (AMP)
Utilize Tensor Cores for Mixed Precision Training
Insert two lines of code to introduce Automatic Mixed Precision into your training script for up to a 3x performance improvement. The Automatic Mixed Precision feature uses a graph optimization technique to determine which operations should run in FP16 and which in FP32.
Available in TensorFlow, PyTorch, and MXNet via our NGC Deep Learning Framework Containers; a minimal TensorFlow example is sketched below.
More details: https://developer.nvidia.com/automatic-mixed-precision
Unleash next-generation AI performance and get to market faster!
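A minimal sketch of the "two lines of code" for TensorFlow 1.x as shipped in NGC containers of this period, assuming either the documented `TF_ENABLE_AUTO_MIXED_PRECISION` environment variable or the experimental graph-rewrite API is available; the optimizer below is a placeholder, not from the slides.

```python
import os
import tensorflow as tf

# Option 1: enable the AMP graph rewrite for the whole process via the
# environment variable documented for NGC TensorFlow containers.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

# Option 2 (TF 1.14+): wrap an existing optimizer so the graph rewrite inserts
# FP16 casts and automatic loss scaling around it.
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

# The rest of the training script is unchanged: build the model, compute the
# loss, and call optimizer.minimize(loss) as usual.
```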
Automatic Mixed-Precision Performance for Common Workloads
[Chart: TensorFlow performance improvements on 1x V100 on DGX-1V w/ XLA]
All models can be found at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow, except for ssd-rn50-fpn-640. All performance collected on 1x V100-16GB, except bert-squadqa on 1x V100-32GB. Batch sizes measured as follows: rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; ncf: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; gnmt: 128 for FP32, 192 for AMP.
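The chart above measures AMP together with XLA. As a hedged sketch of that configuration in TensorFlow 1.x, XLA JIT compilation can be turned on through the session config, combined with the AMP environment variable from the previous sketch; this is an added illustration, not taken from the slides.

```python
import os
import tensorflow as tf

# AMP via the environment variable documented for NGC TensorFlow containers.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

# XLA JIT compilation is enabled per session through the graph options.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    # Build and run the training graph as usual; XLA fuses eligible ops into
    # larger kernels while AMP casts them to FP16 where it is safe to do so.
    pass
```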
CUDA COMPATIBILITY - CUDA 9.x
A newer CUDA version did NOT run on an older display driver.
● The CUDA driver API is backward compatible but not forward compatible: applications compiled against a particular version of the CUDA API work on later driver releases.
● Each CUDA release has a minimum driver requirement, e.g. CUDA 8.0 needs >= R375 driver; CUDA 9.0 needs >= R384 driver.
[Diagram: CUDA 9.0 apps, libs & plugins are compatible with the R384 driver but incompatible with the R375 driver.]
CUDA COMPATIBILITY - CUDA 10.x
NEW forward-compatibility option starting with CUDA 10.0: upgrade only the user-mode CUDA components*.
● New compatibility platform upgrade path available: use newer CUDA toolkits on older driver installs.
● Compatibility only with specific older driver versions.
System requirements:
● Tesla GPU support only - no Quadro or GeForce
● Only available on Linux
[Diagram: the CUDA 9.0 toolkit/runtime and R384 user-mode driver (libcuda.so) are upgraded to CUDA 10.0 and the R410 user-mode driver, while the GPU kernel-mode driver (nvidia.ko) stays at R384.]
*requires the new ‘cuda-compat-10-0’ package
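To check these compatibility rules on a given machine, the following sketch (an added illustration, not from the slides) queries the highest CUDA version the installed driver supports and the CUDA runtime version via ctypes; on systems without the toolkit's dev symlink, the `libcudart.so` name may need to be the versioned file (e.g. `libcudart.so.10.0`).

```python
import ctypes

def cuda_versions():
    """Return (driver_supported, runtime) CUDA versions as (major, minor)."""
    version = ctypes.c_int()

    # Driver API: the highest CUDA version the installed display driver supports.
    libcuda = ctypes.CDLL("libcuda.so.1")
    libcuda.cuDriverGetVersion(ctypes.byref(version))
    driver = (version.value // 1000, (version.value % 1000) // 10)

    # Runtime API: the CUDA runtime version the application was built against.
    libcudart = ctypes.CDLL("libcudart.so")
    libcudart.cudaRuntimeGetVersion(ctypes.byref(version))
    runtime = (version.value // 1000, (version.value % 1000) // 10)

    return driver, runtime

if __name__ == "__main__":
    driver, runtime = cuda_versions()
    print(f"Driver supports CUDA {driver[0]}.{driver[1]}, runtime is CUDA {runtime[0]}.{runtime[1]}")
    if runtime > driver:
        print("Runtime is newer than the driver supports: needs the "
              "forward-compatibility package (e.g. cuda-compat-10-0) or a driver upgrade.")
```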