Toward Large-Scale Image Segmentation On Summit


  1. Toward Large-Scale Image Segmentation On Summit
     Sudip K. Seal, Seung-Hwan Lim, Dali Wang, Jacob Hinkle, Dalton Lunga, Aristeidis Tsaris
     Oak Ridge National Laboratory, USA
     August 19, 2020, International Conference on Parallel Processing, Alberta, Canada
     ORNL is managed by UT-Battelle, LLC for the US Department of Energy

  2. Introduction: Semantic Segmentation of Images
     - Given an image with N×N pixels and a set of l distinct classes, label each of the N² pixels with one of the l classes.
     - For example, given a 256×256 image of a car, road, buildings and people, a semantic segmentation classifies each of the 256×256 = 2^16 pixels into one of l = 4 classes {car, road, building, people}.
     [Figure: an input image and its segmented image, produced by a U-Net-style network (3×3 conv + ReLU, 2×2 max pool, 2×2 up-conv, 1×1 conv, copy-and-crop). Image credit: https://mc.ai/how-to-do-semantic-segmentation-using-deep-learning/]
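     A minimal sketch (PyTorch, not from the paper) of what "label each pixel" means in practice: a segmentation network maps an image to per-pixel class logits, and the predicted label map is the argmax over classes.

     ```python
     import torch

     l = 4                                    # classes: {car, road, building, people}
     image = torch.randn(1, 3, 256, 256)      # one 256x256 RGB image
     logits = torch.randn(1, l, 256, 256)     # stand-in for network(image)

     labels = logits.argmax(dim=1)            # shape (1, 256, 256): one class id per pixel
     assert labels.shape == (1, 256, 256)
     assert labels.numel() == 2 ** 16         # 256 x 256 = 2^16 pixels classified
     ```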

  3. The U-Net Model
     - A U-Net maps an N×N input tile to an (N − 2ϑ)×(N − 2ϑ) output; the unlabeled border of width ϑ on each side is the ϑ-region (halo).
     - Halo width ϑ is a function of the U-Net architecture (depth, channel width, filter sizes, etc.): ϑ = (3·2^(L−1) − 2)·n_c·(d−1)/2 for L levels, n_c convolutions per level and filter size d.
     - Halo width ϑ determines the receptive field of the model.
     - The larger the receptive field, the wider the length-scales of identifiable objects.
     Reference: O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol. 9351: 234-241, 2015.
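     The halo relation as reconstructed here can be checked directly against the halo widths quoted for the four benchmark models later in the deck; a short sketch:

     ```python
     # Each 3x3 convolution trims (d - 1)/2 = 1 pixel per side; summing the
     # trims over encoder and decoder levels gives the per-side halo width.
     def halo_width(L, n_c, d=3):
         """Per-side halo width of a U-Net with L levels and n_c convs per level."""
         return (3 * 2 ** (L - 1) - 2) * n_c * (d - 1) // 2

     assert halo_width(5, 2) == 92    # Small (standard U-Net)
     assert halo_width(5, 5) == 230   # Medium-1
     assert halo_width(6, 2) == 188   # Medium-2
     assert halo_width(7, 2) == 380   # Large
     ```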

  4. Why Is It a Summit-Scale Problem?
     - Satellite images collected at high resolution (30-50 cm) yield very large 10,000×10,000 images.
     - Most computer-vision workloads deal with images of O(10²×10²) resolution (for example, ImageNet); this work targets ultra-wide-extent images of O(10⁴×10⁴) resolution, i.e., data samples with 10,000-fold more pixels.
     - Larger receptive fields require larger models: large U-Net models are needed to resolve multi-scale objects (buildings, solar panels, land-cover details).
     - Advanced DAQ systems generate vast amounts of high-resolution imagery, i.e., multi-TB training sets.
     - At present, training a single model takes many days, even on special-purpose DL platforms like DGX boxes, and hyperparameter tuning of these models takes much longer.
     - Hence the need for an accurate, scalable, high-speed training framework.

  5. Sample Parallelism: Taming Large Image Size by Leveraging Summit's Vast GPU Farm
     - Given an N×N input, U-Net segments the central (N − 2ϑ)×(N − 2ϑ) inset square (the blue dashed square in the figure).
     - Partition each N×N = 10,000×10,000 image sample into non-overlapping tiles.
     - Append an extra halo region of width ϑ along each side of each tile.
     - Assign each appended tile to a Summit GPU and segment it with a standard U-Net; the tile size is chosen so that the appended tile plus the model parameters fit on a single Summit GPU.
     - Each GPU thus segments an area equal to that of its original non-overlapping tile. (A sketch of this decomposition follows below.)
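     A minimal sketch of the decomposition (illustrative, not the paper's code): pad the image by the halo width, cut it into a q×q grid of non-overlapping T′×T′ tiles, and append the ϑ-wide halo on every side of each tile.

     ```python
     import torch
     import torch.nn.functional as F

     def haloed_tiles(image, q, halo):
         """image: (C, N, N) tensor; returns q*q tiles of size (C, T'+2*halo, T'+2*halo)."""
         C, N, _ = image.shape
         Tp = N // q                                  # T' = N / q; assumes q divides N
         padded = F.pad(image, (halo,) * 4)           # zero-pad the image border
         tiles = []
         for i in range(q):
             for j in range(q):
                 r, c = i * Tp, j * Tp                # top-left corner of the inset tile
                 tiles.append(padded[:, r:r + Tp + 2 * halo, c:c + Tp + 2 * halo])
         return tiles                                 # one appended tile per GPU

     image = torch.randn(4, 10000, 10000)             # 4-channel satellite sample
     tiles = haloed_tiles(image, q=8, halo=92)        # 64 tiles of 1434 x 1434
     assert tiles[0].shape == (4, 1434, 1434)
     ```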

  6. Performance of Sample-Parallel U-Net Training
     - The optimal tiling for each 10,000×10,000 sample image was found to be 8×8.
     - Each 1250×1250 tile was appended with a halo of width ϑ = 92 and assigned to a single Summit GPU, i.e., 10-11 Summit nodes to train each 10,000×10,000 image sample.
     - A U-Net model was trained on a data set of 100 satellite images of size 10,000×10,000×4, collected at 30-50 cm resolution.
     - Training time per epoch was ~12 seconds using 1,200 Summit GPUs, compared to ~1,740 seconds on a DGX-1: more than 100× faster U-Net training.
     - Initial testing revealed no appreciable loss of training/validation accuracy using the new parallel framework.

  7. Limitations of Sample Parallelism
     Notation:
     - d → filter size
     - s → stride length
     - p → padding size
     - n_c → no. of convolutions per level
     - L → no. of U-Net levels
     - ϑ = (3·2^(L−1) − 2)·n_c·(d−1)/2 → halo width
     Analysis:
     - An image of size N×N is partitioned into a q×q array of T′×T′ tiles, so N×N = q²(T′×T′) with N = q·T′ (here N = 10,000); each appended tile has size T = T′ + 2ϑ.
     - Compute overhead E = (total volume of computation per appended T×T tile) / (volume of useful computation per T′×T′ tile) = (1 + 2qϑ/N)². Ideally, E = 1. (A sketch quantifying E follows below.)
     - Decreasing q (increasing tile size) drives E toward 1, but increases the memory requirement, which quickly overtakes the memory available per GPU.
     - Decreasing ϑ also drives E toward 1, but decreases the receptive field of the model, steering away from target receptive fields.
     - The goal, however, is to decrease q and increase ϑ simultaneously. Satisfying both requires U-Net models larger than can fit on a single GPU.
     - Hence the need for model-parallel execution.
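     A sketch of the overhead factor as reconstructed above, evaluated for the tilings and halos used in the deck:

     ```python
     # E = ((T' + 2*halo) / T')^2 = (1 + 2*q*halo/N)^2: the ratio of total to
     # useful per-tile computation. It grows with q and with the halo width.
     def overhead(N, q, halo):
         return (1 + 2 * q * halo / N) ** 2

     N = 10000
     print(overhead(N, q=8, halo=92))    # ~1.32: ~32% redundant work for the 8x8 tiling
     print(overhead(N, q=8, halo=380))   # ~2.59: the Large model's halo is far costlier
     print(overhead(N, q=2, halo=380))   # ~1.33: fewer, larger tiles tame the overhead
     ```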

  8. Model Parallelism: Taming Large Model Size with Node-Level Pipeline-Parallel Execution
     - The run of consecutive layers mapped to a GPU is called a partition.
     - The number of layers in each partition is called the balance.
     - Each mini-batch of tiles is subdivided into smaller micro-batches that are streamed through the partitions, so the six GPUs of a Summit node work concurrently; the number of micro-batches per mini-batch is a tunable parameter.
     - Memory needed per GPU = size(micro-batch) + size(partition).
     - Without load balancing, partitions finish at different times and the pipeline stalls. [Schedule figure: update steps per partition across GPU 1 through GPU 6; skip connections omitted for ease of presentation.]
     - Implementation: TorchGPipe, a PyTorch implementation of the GPipe* framework. (A usage sketch follows below.)
     * Huang, Yanping, et al. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS (2019).
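     A minimal TorchGPipe sketch (illustrative, not the paper's configuration; the balance and layer stack below are placeholders): the model must be an nn.Sequential so that runs of consecutive layers can be placed on different devices.

     ```python
     import torch
     from torch import nn
     from torchgpipe import GPipe

     stack = nn.Sequential(*[nn.Conv2d(64, 64, 3, padding=1) for _ in range(16)])

     model = GPipe(
         stack,
         balance=[3, 3, 3, 3, 2, 2],       # layers per partition (hypothetical)
         devices=['cpu'] * 6,              # on Summit: the node's 6 GPUs
         chunks=4,                         # micro-batches per mini-batch
     )

     x = torch.randn(8, 64, 64, 64).to(model.devices[0])
     y = model(x)                          # emerges on the last partition's device
     ```

     TorchGPipe also offers a skippable-module API for routing the U-Net's copy-and-crop skip connections across partitions, which the schedule figure omits.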

  9. Model-Parallel Experiments: Single-Node Execution
     [Figure: N×N image samples are split into T′×T′ padded tiles (sample parallel); each padded tile is fed to a model-parallel U-Net spread across one 96 GB Summit node.]
     Benchmark U-Net models:

     | Model            | Levels | Conv. Layers per Level | Trainable Parameters | ϑ   |
     |------------------|--------|------------------------|----------------------|-----|
     | Small (Standard) | 5      | 2                      | 72,301,856           | 92  |
     | Medium-1         | 5      | 5                      | 232,687,904          | 230 |
     | Medium-2         | 6      | 2                      | 289,357,088          | 188 |
     | Large            | 7      | 2                      | 1,157,578,016        | 380 |

     - Relative to the Small model, the Large model has a 16× larger number of trainable parameters and a roughly 4× larger receptive field.
     - Medium-1 achieves 3.9× (tile size 192), 3.6× (512) and 3× (1024) speedup using 6 pipeline stages.
     - Speedup roughly doubles (Small: 1.97×; Medium-2: 2.01×) as the number of pipeline stages increases from 1 to 6.
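     A quick arithmetic check on the table's ratios (values copied from above):

     ```python
     params = {"small": 72_301_856, "medium1": 232_687_904,
               "medium2": 289_357_088, "large": 1_157_578_016}
     halos = {"small": 92, "medium1": 230, "medium2": 188, "large": 380}

     print(params["large"] / params["small"])   # ~16.0x more trainable parameters
     print(halos["large"] / halos["small"])     # ~4.1x larger receptive field
     ```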

  10. Need for Performance Improvement: Single-Node Execution
     - The Small, Medium-2 and Large models have 109, 129 and 149 layers, respectively.
     - Balances used: Small {14, 24, 30, 22, 12, 7}; Medium-2 {16, 26, 38, 26, 12, 11}; Large {18, 30, 44, 30, 14, 13}.
     - Need load-balanced pipelined execution.
     - Encoder memory at level ℓ: E_ℓ = O( I_ℓ² + 2^ℓ · n_f · Σ_{i=1..n_c} (I_ℓ − i·d)² )
     - Decoder memory at level ℓ′ = L − ℓ: D_ℓ′ = O( 2·I_ℓ′² + 2^ℓ′ · n_f · Σ_{i=1..n_c} (I_ℓ′ − i·d)² )
     - Memory profile: E_ℓ + D_ℓ′ vs. ℓ, where I_ℓ is the feature-map size entering level ℓ and n_f the base channel count. (A balance sketch based on this profile follows below.)
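     One way such a memory profile could drive a balance, sketched here under stated assumptions: estimate per-level cost from the formulas above, then split the layer sequence into contiguous chunks of roughly equal load. The symbols, the greedy split, and the toy numbers are all illustrative, not the paper's heuristic.

     ```python
     def level_mem(I, level, n_c=2, n_f=64, d=2):
         """O-estimate of activation memory at one encoder/decoder level."""
         return I * I + (2 ** level) * n_f * sum((I - i * d) ** 2
                                                 for i in range(1, n_c + 1))

     def greedy_balance(costs, stages):
         """Split per-layer costs into `stages` contiguous chunks of ~equal load."""
         balance, remaining = [], list(costs)
         for s in range(stages, 0, -1):
             target = sum(remaining) / s      # rebalanced target for stages left
             n, acc = 0, 0.0
             while remaining and (acc < target or s == 1):
                 acc += remaining.pop(0)
                 n += 1
             balance.append(n)
         return balance

     costs = [level_mem(1250 // 2 ** l, l) for l in range(5)]   # encoder levels
     print(greedy_balance(costs * 2, stages=6))                  # toy 6-stage balance
     ```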

  11. Wrapping Up
     This paper: a prototype sample + model parallel framework.
     - Training image-segmentation neural network models becomes extremely challenging when image sizes are very large, desired receptive fields are large, and the volume of training data is large.
     - Fast training/inference is needed for geo-sensing applications: satellite imagery, disaster assessment, precision agriculture, etc.
     - This work is a first step: it can train 10× larger U-Net models (16× more trainable parameters for the Large model) with 4× larger receptive fields on 10,000× larger images.
     Ongoing work: a sample + model + data parallel framework.
     - Ongoing efforts integrate load-balancing heuristics and data-parallel execution across model-parallel U-Net replicas to handle large volumes of training data efficiently.
     [Figure: the prototype stacks sample parallelism over padded tiles with one model-parallel U-Net per 96 GB Summit node; the ongoing design adds a data-parallel dimension across model-parallel U-Net replicas.]

  12. Thank You
