

  1. S8906: FAST DATA PIPELINES FOR DEEP LEARNING TRAINING. Przemek Tredak, Simon Layton

  2. THE PROBLEM

  3. CPU BOTTLENECK OF DL TRAINING: CPU : GPU ratio. Multi-GPU, dense systems are more common (DGX-1V, DGX-2), and using more cores / sockets is very expensive, so the CPU-to-GPU ratio keeps falling: DGX-1V has 40 cores for 8 GPUs (5 cores / GPU), DGX-2 has 48 cores for 16 GPUs (3 cores / GPU).

  4. CPU BOTTLENECK OF DL TRAINING: complexity of the I/O pipeline. AlexNet: 256x256 image, 224x224 crop and mirror, training. ResNet-50: 480p image, random resize, color augment, 224x224 crop and mirror, training.

  5. CPU BOTTLENECK OF DL TRAINING: increased complexity of the CPU-based I/O pipeline combined with a higher GPU-to-CPU ratio. [Chart: GPU throughput vs. CPU time.]

  6. LOTS OF FRAMEWORKS, lots of effort. Frameworks have their own I/O pipelines (often more than one): MXNet (ImageRecordIter, ImageIO, Python), Caffe2 (ImageInputOp, Python), TensorFlow (Dataset, manual graph construction, Python). Lots of duplicated effort to optimize them all, and the training process is not portable even if the model is (e.g. via ONNX).

  7. LOTS OF FRAMEWORKS, lots of effort. Optimized I/O pipelines are not flexible and often unsuitable for research. Inflexible but fast:

     train = mx.io.ImageRecordIter(
         path_imgrec = args.data_train,
         path_imgidx = args.data_train_idx,
         label_width = 1,
         mean_r = rgb_mean[0],
         mean_g = rgb_mean[1],
         mean_b = rgb_mean[2],
         data_name = 'data',
         label_name = 'softmax_label',
         data_shape = image_shape,
         batch_size = 128,
         rand_crop = True,
         max_random_scale = 1,
         pad = 0,
         fill_value = 127,
         min_random_scale = 0.533,
         max_aspect_ratio = args.max_random_aspect_ratio,
         random_h = args.max_random_h,
         random_s = args.max_random_s,
         random_l = args.max_random_l,
         max_rotate_angle = args.max_random_rotate_angle,
         max_shear_ratio = args.max_random_shear_ratio,
         rand_mirror = args.random_mirror,
         preprocess_threads = args.data_nthreads,
         shuffle = True,
         num_parts = 0,
         part_index = 1)

     vs. flexible but slow:

     image, _ = mx.image.random_size_crop(image, (data_shape, data_shape), 0.08, (3/4., 4/3.))
     image = mx.nd.image.random_flip_left_right(image)
     image = mx.nd.image.to_tensor(image)
     image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
     return mx.nd.cast(image, dtype), label

  8. SOLUTION: ONE LIBRARY. One library, DALI, underneath MXNet, Caffe2, PyTorch, TF, etc. Centralize the effort, integrate into all frameworks, and provide both flexibility and performance.

  9. DALI: OVERVIEW

  10. DALI: a flexible, high-performance image data pipeline. Python / C++ frontends with a C++ / CUDA backend; integrates with frameworks as a plugin, so minimal (or no) changes to the frameworks are required; covers the full pipeline, from disk to GPU, ready to train; OSS (soon).

  11. GRAPH WITHIN A GRAPH: I/O in frameworks today. The data pipeline is just a (simple) graph running on the CPU: a Loader reads JPEGs and labels, then Decode, Resize and Augment produce the images fed to Training on the GPU.

  12. GPU OPTIMIZED PRIMITIVES: DALI provides high-performance, GPU-optimized implementations of the same pipeline stages (Decode, Resize, Augment), moving them from the CPU to the GPU.

  13. GPU ACCELERATED JPEG DECODE: DALI with nvJPEG. A hybrid approach to JPEG decoding that splits work between CPU and GPU; decoding can move fully to the GPU in the future.

  14. SET YOUR DATA FREE: DALI reads a list of JPEGs (PyTorch, others), LMDB (Caffe, Caffe2), RecordIO (MXNet) and TFRecord (TensorFlow). Use any file format in any framework; a sketch of the idea follows below.
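
     A minimal sketch of why this works: the data source is isolated in a single reader operator, so the rest of the pipeline does not depend on the storage format. The class name and the reader_op argument below are illustrative assumptions, not names from this deck (only Caffe2Reader appears in the slides); readers for RecordIO, TFRecord and plain file lists would slot in the same way.

     import dali
     import dali.ops as ops

     # Sketch: the reader op is passed in as an argument; decode (and the rest of
     # the pipeline) stays the same regardless of the storage format being read.
     class RN50PipeAnyFormat(dali.Pipeline):
         def __init__(self, reader_op, batch_size, num_threads, device_id):
             super(RN50PipeAnyFormat, self).__init__(batch_size, num_threads, device_id)
             self.loader = reader_op   # e.g. ops.Caffe2Reader(...) over LMDB, as on the later slides
             self.decode = ops.HybridDecode(output_type=dali.types.RGB)
             # resize, crop, etc. exactly as in the ResNet-50 example later in this deck

         def define_graph(self):
             data, labels = self.loader(name="Reader")
             images = self.decode(data)
             return [images, labels]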

  15. BEHIND THE SCENES: PIPELINE

  16. PIPELINE: overview. One pipeline per GPU; the same logic works for multithreaded and multiprocess frameworks.

  17. PIPELINE: overview. Three stages executed in a single direction: CPU -> Mixed -> GPU.

  18. PIPELINE: overview. Simple scheduling of operations across the CPU, Mixed and GPU stages. [Diagram: numbered operators of the graph assigned to the three stages.]

  19. PIPELINE: CPU stage. Operations are processed per-sample in a thread pool (see the sketch below). [Diagram: samples of a batch flowing through the CPU operators on different worker threads.]
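
     The per-sample scheduling can be pictured with a toy model (a simplified sketch in plain Python, not DALI's actual implementation): each sample runs through the chain of CPU operators independently, and samples are spread across a pool of worker threads.

     from concurrent.futures import ThreadPoolExecutor

     # Toy model of the CPU stage: per-sample processing in a thread pool.
     def run_cpu_stage(samples, cpu_ops, num_threads=4):
         def process(sample):
             for op in cpu_ops:    # e.g. CPU-side decode step, augmentation parameters, ...
                 sample = op(sample)
             return sample
         with ThreadPoolExecutor(max_workers=num_threads) as pool:
             return list(pool.map(process, samples))

     Because every sample is independent at this point, per-sample scheduling keeps all worker threads busy even when individual images take different amounts of time to process.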

  20. PIPELINE: GPU stage. Batched processing of data: each GPU operator runs on a whole batch at once.

  21. PIPELINE: Mixed stage. A bridge between CPU and GPU: per-sample input, batched output (sketched below). Also used for batching CPU data, i.e. for CPU outputs of the pipeline.
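
     What "per-sample input, batched output" means, as a minimal NumPy sketch (illustrative only; the real Mixed operators produce device-side batches rather than NumPy arrays):

     import numpy as np

     # Toy model of the Mixed stage: gather per-sample arrays of equal shape
     # into one contiguous batch, ready for batched GPU processing.
     def batch_samples(samples):
         batch = np.stack(samples, axis=0)   # (N, H, W, C)
         return np.ascontiguousarray(batch)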

  22. EXECUTOR: pipelining the pipeline. Within one batch, the CPU, Mixed and GPU stages need to be executed serially, but each batch of data is independent… [Diagram: timeline of CPU, Mixed and GPU work for consecutive batches.]

  23. EXECUTOR: pipelining the pipeline. Each stage is asynchronous, so the CPU, Mixed and GPU work of different batches overlaps in time (CPU 1, CPU 2, CPU 3 … while Mixed 1, Mixed 2, … and GPU 1, GPU 2, … run behind them). The stages of a given batch are synchronized via events. A toy sketch of this scheduling follows below.
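
     A sketch of the scheduling idea only (Python threads and queues stand in for the real executor, which overlaps work with CUDA streams and synchronizes batches with events): one worker per stage and bounded queues between them, so the CPU can already work on batch N+1 while the Mixed and GPU stages are still busy with batch N.

     import queue
     import threading

     # Toy three-stage pipelined executor: CPU -> Mixed -> GPU, one worker per stage.
     def pipelined_run(batches, cpu_stage, mixed_stage, gpu_stage):
         q_cpu, q_mix, results = queue.Queue(maxsize=2), queue.Queue(maxsize=2), []

         def cpu_worker():
             for b in batches:
                 q_cpu.put(cpu_stage(b))
             q_cpu.put(None)                 # end-of-stream marker

         def mixed_worker():
             while True:
                 b = q_cpu.get()
                 if b is None:
                     q_mix.put(None)
                     break
                 q_mix.put(mixed_stage(b))

         def gpu_worker():
             while True:
                 b = q_mix.get()
                 if b is None:
                     break
                 results.append(gpu_stage(b))

         workers = [threading.Thread(target=f) for f in (cpu_worker, mixed_worker, gpu_worker)]
         for w in workers:
             w.start()
         for w in workers:
             w.join()
         return results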

  24. OPERATORS: Gallery.

  25. USING DALI

  26. EXAMPLE: RESNET-50 PIPELINE. The Pipeline class:

     import dali
     import dali.ops as ops

     class HybridRN50Pipe(dali.Pipeline):
         def __init__(self, batch_size, num_threads, device_id, num_devices):
             super(HybridRN50Pipe, self).__init__(batch_size, num_threads, device_id)
             # define used operators (next slide)
             pass

         def define_graph(self):
             # define graph of operations (slide 28)
             pass

  27. EXAMPLE: RESNET-50 PIPELINE. Defining the operators:

     def __init__(self, batch_size, num_threads, device_id, num_devices):
         super(HybridRN50Pipe, self).__init__(batch_size, num_threads, device_id)
         self.loader = ops.Caffe2Reader(path=lmdb_path,
                                        shard_id=device_id,
                                        num_shards=num_devices)
         self.decode = ops.HybridDecode(output_type=dali.types.RGB)
         self.resize = ops.Resize(device="gpu",
                                  resize_a=256, resize_b=480,
                                  random_resize=True,
                                  image_type=dali.types.RGB)
         self.crop = ops.CropMirrorNormalize(device="gpu",
                                             random_crop=True,
                                             crop=(224, 224),
                                             mirror_prob=0.5,
                                             mean=[128., 128., 128.],
                                             std=[1., 1., 1.],
                                             output_layout=dali.types.NCHW)

  28. EXAMPLE: RESNET-50 PIPELINE. Defining the graph:

     def define_graph(self):
         jpeg, labels = self.loader(name="Reader")
         images = self.decode(jpeg)
         resized_images = self.resize(images)
         cropped_images = self.crop(resized_images)
         return [cropped_images, labels]

     [Diagram: Loader -> Decode -> Resize -> Crop for the images; labels pass through MakeContiguous.]
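
     Before wiring the pipeline into a framework, it can be exercised on its own as a quick smoke test. This is a hypothetical sketch: of the pipeline methods, only build(), serialize() and epoch_size() appear elsewhere in this deck, so the run() call used here to pull a single batch is an assumption.

     pipe = HybridRN50Pipe(batch_size=128, num_threads=2, device_id=0, num_devices=1)
     pipe.build()                   # build() is used the same way on the MXNet slide
     images, labels = pipe.run()    # run() assumed: produce one batch of outputs
     # images should hold a batch of 224x224 NCHW crops, labels the corresponding labels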

  29. EXAMPLE: RESNET-50 PIPELINE. Usage with MXNet:

     import mxnet as mx
     from dali.plugin.mxnet import DALIIterator

     pipe = HybridRN50Pipe(128, 2, 0, 1)
     pipe.build()
     train = DALIIterator(pipe, pipe.epoch_size("Reader"))

     model.fit(train,
               # other parameters
               )

  30. EXAMPLE: RESNET-50 PIPELINE. Usage with TensorFlow:

     import tensorflow as tf
     from dali.plugin.tf import DALIIterator

     pipe = HybridRN50Pipe(128, 2, 0, 1)
     serialized_pipe = pipe.serialize()
     train = DALIIterator()

     with tf.Session() as sess:
         images, labels = train(serialized_pipe)
         # rest of the model using images and labels
         sess.run(...)

  31. EXAMPLE: RESNET-50 PIPELINE. Usage with Caffe2:

     from caffe2.python import brew

     pipe = HybridRN50Pipe(128, 2, 0, 1)
     serialized_pipe = pipe.serialize()

     data, label = brew.dali_input(model, ["data", "label"],
                                   serialized_pipe=serialized_pipe)

     # Add the rest of your network as normal
     conv1 = brew.conv(model, data, "conv1", ...)

  32. PERFORMANCE

  33. PERFORMANCE: I/O pipeline throughput, DGX-2, RN50 pipeline, batch 128, NCHW. [Bar chart, images / second: measured configurations at roughly 5,150 and 5,450 and 8,000 and 14,350, up to 23,000.]

  34. PERFORMANCE: end-to-end training. End-to-end DGX-2, RN50 training in MXNet, batch 192 / GPU. [Bar chart, images / second: Native 8,000; DALI 15,500; Synthetic 17,000.]

  35. NEXT STEPS

  36. NEXT: MORE WORKLOADS. Segmentation:

     def define_graph(self):
         images, masks = self.loader(name="Reader")
         images = self.decode(images)
         masks = self.decode(masks)
         # Apply identical transformations to images and masks
         resized_images, resized_masks = self.resize([images, masks], ...)
         cropped_images, cropped_masks = self.crop([resized_images, resized_masks], ...)
         return [cropped_images, cropped_masks]

  37. NEXT: MORE FORMATS. What would be useful to you? PNG, video frames.

  38. NEXT++: MORE OFFLOADING. Fully GPU-based decode; HW-based decode via NVDEC (transcode the dataset to video).

  39. SOON: EARLY ACCESS. Looking for: general feedback, new workloads, new transformations. Contact: Milind Kukanur, mkukanur@nvidia.com.

  40. ACKNOWLEDGEMENTS: Trevor Gale, Andrei Ivanov, Serge Panev, Cliff Woolley, and the DL Frameworks team @ NVIDIA.
