S8906: FAST DATA PIPELINES FOR DEEP LEARNING TRAINING
Przemek Tredak, Simon Layton
THE PROBLEM
CPU BOTTLENECK OF DL TRAINING
CPU : GPU ratio
• Multi-GPU, dense systems are more common (DGX-1V, DGX-2)
• Using more cores / sockets is very expensive
• CPU to GPU ratio becomes lower:
  • DGX-1V: 40 cores / 8 GPUs = 5 cores per GPU
  • DGX-2: 48 cores / 16 GPUs = 3 cores per GPU
CPU BOTTLENECK OF DL TRAINING
Complexity of the I/O pipeline
• AlexNet: 256x256 image → 224x224 crop and mirror → Training
• ResNet-50: 480p image → random resize → 224x224 crop, color augment and mirror → Training
CPU BOTTLENECK OF DL TRAINING
• Increased complexity of the CPU-based I/O pipeline
• Higher GPU to CPU ratio
[Chart: GPU throughput vs. CPU time]
LOTS OF FRAMEWORKS
Lots of effort
• MXNet: ImageRecordIter, ImageIO, Python
• Caffe2: ImageInputOp, Python
• TensorFlow: Dataset, manual graph construction, Python
Frameworks have their own I/O pipelines (often more than one!)
Lots of duplicated effort to optimize them all
The training process is not portable even if the model is (e.g. via ONNX)
LOTS OF FRAMEWORKS
Lots of effort
Optimized I/O pipelines are not flexible and often unsuitable for research

Inflexible but fast:

    train = mx.io.ImageRecordIter(
        path_imgrec        = args.data_train,
        path_imgidx        = args.data_train_idx,
        label_width        = 1,
        mean_r             = rgb_mean[0],
        mean_g             = rgb_mean[1],
        mean_b             = rgb_mean[2],
        data_name          = 'data',
        label_name         = 'softmax_label',
        data_shape         = image_shape,
        batch_size         = 128,
        rand_crop          = True,
        max_random_scale   = 1,
        pad                = 0,
        fill_value         = 127,
        min_random_scale   = 0.533,
        max_aspect_ratio   = args.max_random_aspect_ratio,
        random_h           = args.max_random_h,
        random_s           = args.max_random_s,
        random_l           = args.max_random_l,
        max_rotate_angle   = args.max_random_rotate_angle,
        max_shear_ratio    = args.max_random_shear_ratio,
        rand_mirror        = args.random_mirror,
        preprocess_threads = args.data_nthreads,
        shuffle            = True,
        num_parts          = 1,
        part_index         = 0)

vs. flexible but slow:

    # inside the per-sample transform function
    image, _ = mx.image.random_size_crop(image, (data_shape, data_shape),
                                         0.08, (3/4., 4/3.))
    image = mx.nd.image.random_flip_left_right(image)
    image = mx.nd.image.to_tensor(image)
    image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406),
                                  std=(0.229, 0.224, 0.225))
    return mx.nd.cast(image, dtype), label
SOLUTION: ONE LIBRARY
• Centralize the effort
• Integrate into all frameworks
• Provide both flexibility and performance
[Diagram: MXNet, Caffe2, PyTorch, TF, etc. all sitting on top of DALI]
DALI: OVERVIEW
DALI
• Flexible, high-performance image data pipeline
• Python / C++ frontends with a C++ / CUDA backend
• Minimal (or no) changes to the frameworks required
• Full pipeline - from disk to GPU, ready to train
• OSS (soon)
[Diagram: Framework → Plugin → DALI]
GRAPH WITHIN A GRAPH
I/O in frameworks today
The data pipeline is just a (simple) graph
[Diagram: JPEG → Loader → Decode → Resize → Augment → Training, producing Images and Labels; the I/O stages run on the CPU, Training on the GPU]
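For concreteness, this is roughly what such a graph looks like when written by hand in a framework today, here as a TensorFlow 2.x-style tf.data sketch (the file pattern, image sizes and batch size are illustrative, and label handling is omitted):

    import tensorflow as tf

    def parse_and_augment(path):
        # Decode -> Resize -> Augment, all executed on the CPU
        image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        image = tf.image.resize(image, [256, 256])
        image = tf.image.random_crop(image, [224, 224, 3])
        image = tf.image.random_flip_left_right(image)
        return image  # labels omitted for brevity

    # "/data/train/*.jpg" is a placeholder path
    dataset = (tf.data.Dataset.list_files("/data/train/*.jpg")
               .map(parse_and_augment,
                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
               .batch(128)
               .prefetch(1))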
GPU OPTIMIZED PRIMITIVES
DALI
High performance, GPU optimized implementations
[Diagram: the same pipeline graph, with Resize and Augment now running on the GPU]
GPU ACCELERATED JPEG DECODE
DALI with nvJPEG
Hybrid approach to JPEG decoding – can move fully to the GPU in the future
[Diagram: the same pipeline graph, with Decode split between the CPU and the GPU]
SET YOUR DATA FREE
DALI
• List of JPEGs (PyTorch, others)
• LMDB (Caffe, Caffe2)
• RecordIO (MXNet)
• TFRecord (TensorFlow)
Use any file format in any framework
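A hedged sketch of how that plays out in a pipeline definition: only the reader operator changes with the storage format, the rest of the graph stays identical. Caffe2Reader is the operator used later in this deck; the other reader names and all paths are illustrative assumptions.

    import dali.ops as ops

    # Pick the reader matching the dataset on disk; downstream ops are unchanged.
    reader = ops.Caffe2Reader(path="/data/train_lmdb")           # LMDB (Caffe, Caffe2)
    # reader = ops.MXNetReader(path="/data/train.rec",
    #                          index_path="/data/train.idx")     # RecordIO (MXNet)
    # reader = ops.TFRecordReader(path="/data/train.tfrecord",
    #                             index_path="/data/train.idx")  # TFRecord (TensorFlow)
    # reader = ops.FileReader(file_root="/data/train_jpegs")     # list of JPEGs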
BEHIND THE SCENES: PIPELINE
PIPELINE
Overview
• One pipeline per GPU
• The same logic for multithreaded and multiprocess frameworks
[Diagram: one DALI pipeline instance per GPU inside the framework]
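A minimal sketch of the "one pipeline per GPU" setup, reusing the HybridRN50Pipe class defined later in this deck (batch size and thread count are illustrative):

    num_gpus = 8  # e.g. a DGX-1V

    # One pipeline instance per GPU: device_id selects the GPU, and the reader
    # inside the pipeline uses (shard_id, num_shards) to read a disjoint shard.
    pipes = [HybridRN50Pipe(batch_size=128, num_threads=2,
                            device_id=gpu, num_devices=num_gpus)
             for gpu in range(num_gpus)]
    for pipe in pipes:
        pipe.build()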
PIPELINE
Overview
• Single direction
• 3 stages: CPU -> Mixed -> GPU
PIPELINE
Overview
Simple scheduling of operations
[Diagram: the numbered operators of the graph assigned to the CPU, Mixed and GPU stages]
PIPELINE
CPU
Operations processed per-sample in a thread pool
[Diagram: worker threads each running the chain of CPU operators on individual samples]
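Conceptually (a plain-Python illustration, not DALI's C++ implementation), per-sample CPU processing in a thread pool amounts to:

    from concurrent.futures import ThreadPoolExecutor

    def cpu_stage(sample):
        # Placeholder for the per-sample CPU work (parsing, augmentation, ...)
        return sample

    def run_cpu_stage(batch, num_threads=4):
        # Samples are independent, so one batch is spread across worker threads;
        # map() returns the processed samples in their original order.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return list(pool.map(cpu_stage, batch))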
PIPELINE
GPU
Batched processing of data
[Diagram: GPU operators each processing the whole batch at once]
PIPELINE
Mixed
• A bridge between CPU and GPU
• Per-sample input, batched output
• Also used for batching CPU data (for CPU outputs of the pipeline)
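In spirit (a numpy sketch assuming equally shaped samples, not DALI's actual mixed operators), batching the per-sample outputs looks like:

    import numpy as np

    def mixed_stage(samples):
        # Per-sample input, batched output: pack individually processed samples
        # into one contiguous buffer that the GPU stage (or the framework, for
        # CPU outputs) can consume as a whole batch.
        return np.stack(samples, axis=0)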
EXECUTOR
Pipelining the pipeline
• The CPU, Mixed and GPU stages need to be executed serially
• But each batch of data is independent…
[Diagram: timeline with the CPU, Mixed and GPU work of consecutive batches laid out one after another]
EXECUTOR
Pipelining the pipeline
• Each stage is asynchronous
• Stages of a given batch are synchronized via events
[Diagram: timeline where CPU 1, CPU 2, CPU 3, … overlap with Mixed 1, Mixed 2, Mixed 3, … and GPU 1, GPU 2, GPU 3, …]
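The scheduling idea as a toy Python sketch: while the Mixed and GPU stages work on batch N, the CPU stage of batch N + 1 already runs in the background. DALI itself does this with asynchronous stages synchronized by CUDA events; the thread-based version below only illustrates the overlap.

    from concurrent.futures import ThreadPoolExecutor

    def pipelined_run(batches, cpu_stage, mixed_stage, gpu_stage):
        if not batches:
            return
        with ThreadPoolExecutor(max_workers=1) as cpu_pool:
            pending = cpu_pool.submit(cpu_stage, batches[0])
            for nxt in batches[1:]:
                ready = pending.result()                   # CPU output of batch N
                pending = cpu_pool.submit(cpu_stage, nxt)  # start batch N + 1 on the CPU
                gpu_stage(mixed_stage(ready))              # Mixed + GPU stages of batch N
            gpu_stage(mixed_stage(pending.result()))       # drain the last batch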
OPERATORS
Gallery
USING DALI
EXAMPLE: RESNET-50 PIPELINE
Pipeline class

    import dali
    import dali.ops as ops

    class HybridRN50Pipe(dali.Pipeline):
        def __init__(self, batch_size, num_threads, device_id, num_devices):
            super(HybridRN50Pipe, self).__init__(batch_size, num_threads, device_id)
            # define used operators

        def define_graph(self):
            # define graph of operations
EXAMPLE: RESNET-50 PIPELINE
Defining operators

    def __init__(self, batch_size, num_threads, device_id, num_devices):
        super(HybridRN50Pipe, self).__init__(batch_size, num_threads, device_id)
        self.loader = ops.Caffe2Reader(path=lmdb_path,
                                       shard_id=device_id,
                                       num_shards=num_devices)
        self.decode = ops.HybridDecode(output_type=dali.types.RGB)
        self.resize = ops.Resize(device="gpu",
                                 resize_a=256, resize_b=480,
                                 random_resize=True,
                                 image_type=dali.types.RGB)
        self.crop = ops.CropMirrorNormalize(device="gpu",
                                            random_crop=True,
                                            crop=(224, 224),
                                            mirror_prob=0.5,
                                            mean=[128., 128., 128.],
                                            std=[1., 1., 1.],
                                            output_layout=dali.types.NCHW)
EXAMPLE: RESNET-50 PIPELINE
Defining the graph

    def define_graph(self):
        jpeg, labels = self.loader(name="Reader")
        images = self.decode(jpeg)
        resized_images = self.resize(images)
        cropped_images = self.crop(resized_images)
        return [cropped_images, labels]

[Diagram: Loader → jpeg → Decode → Resize → Crop → Data; Loader → labels → MakeContiguous → Label]
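The pipeline can also be exercised on its own before plugging it into a framework. The build() call and constructor arguments mirror the framework examples that follow; run() returning one batch of the outputs listed in define_graph() is our reading of the Pipeline API, so treat this as a sketch:

    pipe = HybridRN50Pipe(128, 2, 0, 1)  # batch 128, 2 threads, GPU 0, 1 device
    pipe.build()
    images, labels = pipe.run()  # one batch per call, as returned by define_graph()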
EXAMPLE: RESNET-50 PIPELINE
Usage: MXNet

    import mxnet as mx
    from dali.plugin.mxnet import DALIIterator

    pipe = HybridRN50Pipe(128, 2, 0, 1)
    pipe.build()
    train = DALIIterator(pipe, pipe.epoch_size("Reader"))

    model.fit(train,
              # other parameters
              )
EXAMPLE: RESNET-50 PIPELINE
Usage: TensorFlow

    import tensorflow as tf
    from dali.plugin.tf import DALIIterator

    pipe = HybridRN50Pipe(128, 2, 0, 1)
    serialized_pipe = pipe.serialize()
    train = DALIIterator()

    with tf.Session() as sess:
        images, labels = train(serialized_pipe)
        # rest of the model using images and labels
        sess.run(...)
EXAMPLE: RESNET-50 PIPELINE
Usage: Caffe2

    from caffe2.python import brew

    pipe = HybridRN50Pipe(128, 2, 0, 1)
    serialized_pipe = pipe.serialize()
    data, label = brew.dali_input(model, ["data", "label"],
                                  serialized_pipe=serialized_pipe)

    # Add the rest of your network as normal
    conv1 = brew.conv(model, data, "conv1", ...)
PERFORMANCE
PERFORMANCE
I/O pipeline
[Chart: I/O pipeline throughput (images / second), DGX-2, RN50 pipeline, batch 128, NCHW; the compared pipelines reach roughly 5,150, 5,450, 8,000, 14,350 and 23,000 images / second]
PERFORMANCE
End-to-end training
[Chart: end-to-end DGX-2 RN50 training throughput (images / second) - MXNet, batch 192 / GPU]
• Native: ~8,000 images / second
• DALI: ~15,500 images / second
• Synthetic: ~17,000 images / second
NEXT STEPS
NEXT: MORE WORKLOADS
Segmentation

    def define_graph(self):
        images, masks = self.loader(name="Reader")
        images = self.decode(images)
        masks = self.decode(masks)
        # Apply identical transformations to images and masks
        resized_images, resized_masks = self.resize([images, masks], ...)
        cropped_images, cropped_masks = self.crop([resized_images, resized_masks], ...)
        return [cropped_images, cropped_masks]
NEXT: MORE FORMATS
What would be useful to you?
• PNG
• Video frames
NEXT++: MORE OFFLOADING
• Fully GPU-based decode
• HW-based decoding via NVDEC
• Transcode to video
SOON: EARLY ACCESS
Looking for:
• General feedback
• New workloads
• New transformations
Contact: Milind Kukanur, mkukanur@nvidia.com
ACKNOWLEDGEMENTS
Trevor Gale, Andrei Ivanov, Serge Panev, Cliff Woolley
DL Frameworks team @ NVIDIA