DISTRIBUTED TRAINING OF DEEP LEARNING MODELS
Mathew Salvaris @msalvaris – Ilia Karmanov @ikdeepl – Miguel Fierro @miguelgfierro
Rosetta Stone of Deep Learning
More info: https://github.com/ilkarman/DeepLearningFrameworks
ImageNet Competition
[Chart: ImageNet top-5 error (%) falling over time, from AlexNet (2012) at 15.3%, through VGG (2014) at 7.3% and Inception (2015) at 6.7%, past human performance at 5.1%, down through ResNet (2015), Inception-ResNet (2016), ResNext (2017), NASNet (2017) and AmoebaNet (2018), to 2.4% for Instagram-pretrained models]
Distributed training mode: Data parallelism
[Diagram: a job manager splits the dataset into subsets; Worker 1 and Worker 2 each hold a full copy of the CNN model and train it on their own subset]
Distributed training mode: Model parallelism
[Diagram: the CNN model is split into Submodel 1 on Worker 1 and Submodel 2 on Worker 2; each worker sees the full dataset but computes only its part of the model]
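To make the model-parallel split concrete, here is a minimal PyTorch sketch (our own illustration, not tied to any platform discussed later) that places the convolutional trunk on one GPU and the classifier on a second, copying activations between them:

    import torch
    import torch.nn as nn

    # Hypothetical two-GPU split: conv trunk on cuda:0, classifier on cuda:1.
    class SplitCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            ).to("cuda:0")
            self.classifier = nn.Linear(64, num_classes).to("cuda:1")

        def forward(self, x):
            x = self.features(x.to("cuda:0"))
            x = x.flatten(1).to("cuda:1")   # hop to the second GPU
            return self.classifier(x)

The .to("cuda:1") hop is the communication cost that model parallelism trades for the reduced per-GPU memory.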
Data parallelism vs model parallelism
Data parallelism:
▪ Easier implementation
▪ Stronger fault tolerance
▪ Higher cluster utilization
Model parallelism:
▪ Better scalability of large models
▪ Less memory on each GPU
Why not both? Data parallelism for the CNN layers and model parallelism for the FC layers.
Source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997
Training strategies: parameter averaging
[Diagram: Worker 1 and Worker 2 each train the CNN model on their own subset; the workers' weights are averaged to form the new global model]
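A minimal sketch of the averaging step in PyTorch (a hypothetical helper of our own, assuming all workers share the same architecture):

    import torch

    def average_state_dicts(state_dicts):
        # element-wise mean of the workers' weights; the cast lets
        # integer buffers be averaged too
        return {name: torch.stack([sd[name].float() for sd in state_dicts]).mean(0)
                for name in state_dicts[0]}

    # after a round of local training, every worker reloads the average:
    # model.load_state_dict(average_state_dicts([m.state_dict() for m in workers]))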
Training strategies: distributed gradient based
[Diagram: Worker 1 and Worker 2 each compute gradients on their own subset; the gradients are combined across workers, either synchronously or asynchronously]
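For the synchronous variant, a minimal sketch with torch.distributed (it assumes init_process_group has already been called in each worker process):

    import torch
    import torch.distributed as dist

    # Every worker computes gradients on its own subset, then all-reduces
    # them so each worker applies the same averaged update.
    def synchronous_step(model, optimizer, loss):
        optimizer.zero_grad()
        loss.backward()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= dist.get_world_size()
        optimizer.step()

The asynchronous variant instead has each worker push its gradients without waiting for the others, trading gradient staleness for throughput.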
Overview of distributed training
▪ Provision clusters of VMs
▪ Install software and containers
▪ Distribute data
▪ Schedule jobs
▪ Scale resources
▪ Handle failures
▪ Share results
Azure Distributed Platforms
▪ Batch AI
▪ Batch Shipyard
▪ DL Workspace
Horovod
Batch Shipyard
• Supports Docker and Singularity: run Docker and Singularity containers within the same job, side by side or even concurrently
• Moves data easily between locally accessible storage systems, remote filesystems, Azure Blob or File Storage, and compute nodes
• Supports local storage, Azure Blob or File Storage, and NFS
• Low-priority nodes
https://github.com/Azure/batch-shipyard
Batch AI
• Supports running in Docker containers as well as on the Data Science Virtual Machine
• Supports local storage, Azure Blob or File Storage, and NFS
• Low-priority nodes
https://github.com/Azure/BatchAI
DL Workspace
• Runs jobs inside Docker
• Uses Kubernetes
• Can be deployed anywhere, not just Azure
• Supports local storage and NFS
https://github.com/Microsoft/DLWorkspace
Training with Batch AI
1) Create the scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the Docker containers for each DL framework and transfer them to a container registry
1) Create a Batch AI pool
2) Each job pulls in the appropriate container and script and loads data from the chosen storage (see the job-submission sketch below)
3) Once the job is completed, all results are written to the file share
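Submitting a job against the pool looked roughly like this; a sketch only: distributed_job and job.json are hypothetical names, and the flags are from the Batch AI CLI of the time, so they may differ across CLI versions.

    az batchai job create
      --name distributed_job
      --cluster-name nc24r
      --config job.json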
Batch AI Interface
Interfaces: CLI and Python SDK (sketched below).
az batchai cluster create
  --name nc24r
  --image UbuntuLTS
  --vm-size Standard_NC24rs_v3
  --min 8 --max 8
  --afs-name $FILESHARE_NAME
  --afs-mount-path extfs
  --storage-account-name $STORAGE_ACCOUNT_NAME
  --storage-account-key $storage_account_key
  --nfs $NFS_NAME
  --nfs-mount-path nfs
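The Python SDK exposes the same operation. A hedged sketch, assuming the azure-mgmt-batchai 1.x package (model and method names changed across SDK versions, and the service has since been retired); all credentials and resource names are placeholders:

    from azure.common.credentials import ServicePrincipalCredentials
    from azure.mgmt.batchai import BatchAIManagementClient
    import azure.mgmt.batchai.models as models

    # placeholder credentials and subscription
    credentials = ServicePrincipalCredentials(client_id="...", secret="...", tenant="...")
    client = BatchAIManagementClient(credentials, "<subscription-id>")

    parameters = models.ClusterCreateParameters(
        location="eastus",
        vm_size="Standard_NC24rs_v3",
        scale_settings=models.ScaleSettings(
            manual=models.ManualScaleSettings(target_node_count=8)),
        user_account_settings=models.UserAccountSettings(
            admin_user_name="demo", admin_user_password="..."))

    # long-running operation: create the cluster and wait for it
    client.clusters.create("my-resource-group", "nc24r", parameters).result()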
Distributed training with NFS
▪ Batch AI cluster configuration with an NFS share
[Diagram: data is copied to the NFS share; the Batch AI pool mounts the NFS share and a file share]
az batchai cluster create
  --name nc24r
  --image UbuntuLTS
  --vm-size Standard_NC24rs_v3
  --min 8 --max 8
  --afs-name $FILESHARE_NAME
  --afs-mount-path extfs
  --storage-account-name $STORAGE_ACCOUNT_NAME
  --storage-account-key $storage_account_key
  --nfs $NFS_NAME
  --nfs-mount-path nfs
Distributed training with blob storage
▪ Batch AI cluster configuration with a mounted blob container
[Diagram: data is copied to blob storage; the Batch AI pool mounts the blob container and a file share]
az batchai cluster create
  --name nc24r
  --image UbuntuLTS
  --vm-size Standard_NC24rs_v3
  --min 8 --max 8
  --afs-name $FILESHARE_NAME
  --afs-mount-path extfs
  --container-name $CONTAINER_NAME
  --container-mount-path extcn
  --storage-account-name $STORAGE_ACCOUNT_NAME
  --storage-account-key $storage_account_key
Distributed training with local storage
▪ Batch AI cluster configuration that copies the data to the nodes
[Diagram: data is copied to each node's local storage; the Batch AI pool also mounts a file share]
az batchai cluster create
  --name nc24r
  --image UbuntuLTS
  --vm-size Standard_NC24r
  --min 8 --max 8
  --afs-name $FILESHARE_NAME
  --afs-mount-path extfs
  --container-name $CONTAINER_NAME
  --container-mount-path extcn
  --storage-account-name $STORAGE_ACCOUNT_NAME
  --storage-account-key $storage_account_key
  -c cluster.json   (node preparation configuration)
Distributed training results
[Charts (three slides): distributed training throughput in images/second for the benchmarked configurations]
Distributed training with Horovod
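The gist of the Horovod changes to a single-GPU PyTorch script is small; a minimal sketch (MyModel stands in for your own network):

    import torch
    import horovod.torch as hvd

    hvd.init()                                  # one process per GPU, launched via MPI
    torch.cuda.set_device(hvd.local_rank())     # pin each process to its local GPU

    model = MyModel().cuda()                    # MyModel: your own network (assumed)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # wrap the optimizer so gradients are averaged across workers with ring all-reduce
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())
    # start all workers from the same weights
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

Launched with something like mpirun -np 8 python train.py, the rest of the training loop stays unchanged.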
Distributed training with PyTorch
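PyTorch's own torch.distributed package covers the same ground without Horovod; a minimal sketch (local_rank, MyModel and dataset are assumed to come from your script and launcher):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # one process per GPU; rank and world size come from the launcher's env vars
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)           # local_rank: passed in by the launcher

    # DDP all-reduces gradients during backward; the sampler shards the data
    model = DistributedDataParallel(MyModel().cuda(), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)       # each worker gets its own subset
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)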
Distributed training with Chainer
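For Chainer, the distributed pieces come from the ChainerMN extension; a minimal sketch (MyNet and full_dataset are placeholders):

    import chainer
    import chainermn

    # the communicator handles the MPI/NCCL all-reduce; the multi-node
    # optimizer averages gradients across workers on every update
    comm = chainermn.create_communicator("pure_nccl")
    device = comm.intra_rank                    # one GPU per MPI process

    model = chainer.links.Classifier(MyNet())   # MyNet: your own chain (assumed)
    model.to_gpu(device)

    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.01), comm)
    optimizer.setup(model)

    # each worker trains on its own shard of the data
    train = chainermn.scatter_dataset(full_dataset, comm, shuffle=True)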
Distributed training with CNTK
▪ 1-bit SGD with MPI
▪ Blocked momentum with MPI
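Both strategies are exposed as distributed learners in CNTK; a hedged sketch (z, lr_schedule and momentum_schedule are assumed to come from your network definition, and 1-bit SGD requires a CNTK build that includes it):

    import cntk as C
    from cntk.train.distributed import (Communicator,
                                        data_parallel_distributed_learner,
                                        block_momentum_distributed_learner)

    local_learner = C.learners.momentum_sgd(z.parameters, lr_schedule,
                                            momentum_schedule)

    # 1-bit SGD: gradients are quantized to one bit per value before the MPI
    # exchange, and the quantization error is carried over to the next minibatch
    learner = data_parallel_distributed_learner(local_learner,
                                                num_quantization_bits=1)

    # Blocked momentum: workers train independently on blocks of samples and
    # periodically merge their models with block-level momentum
    # learner = block_momentum_distributed_learner(local_learner, block_size=32000)

    # ... train as usual with one MPI process per GPU, then shut MPI down:
    Communicator.finalize()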
Demo
Acknowledgements
Hongzhi Li, Alex Sutton, Alex Yukhanov
Attribution of some images: http://morguefile.com/
Thanks! Mathew Salvaris @msalvaris Ilia Karmanov @ikdeepl Miguel Fierro @miguelgfierro