Toward Large-Scale Image Segmentation on Summit
Sudip K. Seal, Seung-Hwan Lim, Dali Wang, Jacob Hinkle, Dalton Lunga, Aristeidis Tsaris
Oak Ridge National Laboratory, USA
August 19, 2020
International Conference on Parallel Processing, Alberta, Canada
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Introduction: Semantic Segmentation of Images
• Given an image with N×N pixels and a set of k distinct classes, label each of the N² pixels with one of the k distinct classes.
• For example, given a 256×256 image of a car, road, buildings and people, a semantic segmentation of the image classifies each of the 256×256 = 2^16 pixels into one of k = 4 classes {car, road, building, people}.
[Figure: input image, U-Net-style segmentation network (3×3 conv + ReLU, 2×2 max pool, 2×2 up-conv, 1×1 conv, copy and crop), segmented image]
Image credit: https://mc.ai/how-to-do-semantic-segmentation-using-deep-learning/
The U-Net Model
• Given a T′×T′ input tile, a valid-padded U-Net outputs a segmented (T′ - 2h)×(T′ - 2h) inset; refer to the width-h margin as the halo region.
• Halo width (h) is a function of the U-Net architecture (depth, channel width, filter sizes, etc.); for a depth-d U-Net with n_c 3×3 convolutions per level, h = (3·2^(d-1) - 2)·n_c.
• Halo width (h) determines the receptive field of the model.
• The larger the receptive field, the wider the length-scales of identifiable objects.
Reference: O. Ronneberger, P. Fischer, T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol. 9351, pp. 234-241, 2015.
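As a quick check, the halo width of a valid-padded U-Net can be computed in a few lines. The closed form below is a reconstruction (ours, not taken verbatim from the paper), chosen because it reproduces all four halo widths reported later in the deck (92, 230, 188 and 380); the function name is hypothetical.

```python
def unet_halo(depth: int, convs_per_level: int) -> int:
    """Halo width h of a valid-padded U-Net with 3x3 convolutions.

    Each 3x3 'valid' convolution trims one pixel per side at its own
    pyramid level, i.e. 2**i input-resolution pixels at level i.  The
    encoder and decoder each traverse levels 0..depth-2 once, and the
    bottleneck level once; summing those contributions gives
    h = n_c * (3 * 2**(depth - 1) - 2).
    """
    return convs_per_level * (3 * 2 ** (depth - 1) - 2)
```

For the standard 5-level, 2-convs-per-level U-Net this gives h = 92, matching the (572 - 388)/2 margin of the original Ronneberger et al. architecture.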
Why Is It a Summit-Scale Problem?
Model size:
• Large U-Net models are needed to resolve multi-scale objects (buildings, solar panels, land cover details).
• Larger receptive fields require larger models.
Sample size:
• Most computer vision workloads deal with images of O(10^3 × 10^3) resolution (for example, ImageNet).
• Satellite images collected at high resolutions (30-50 cm) yield very large 10,000 × 10,000 images.
• This work targets ultra-wide extent images of O(10^4 × 10^4) resolution: 10,000-fold larger data samples!
Data size:
• Advanced DAQ systems generate vast amounts of high-resolution images (multi-TB of data): large data volume.
• At present, it requires many days to train a single model, even on special-purpose DL platforms like DGX boxes.
• Hyperparameter tuning of these models takes much longer.
• Need an accurate, scalable, high-speed training framework.
Sample Parallelism: Taming Large Image Size
Leveraging Summit's vast GPU farm.
• Given a T′×T′ appended tile, the U-Net segments the inner (T′ - 2h)×(T′ - 2h) square (in the figure, the blue dashed square is segmented for each appended tile).
• Partition each N×N = 10,000×10,000 image sample into non-overlapping tiles.
• Append an extra halo region of width h along each side of each tile.
• Assign each appended tile to a Summit GPU; use a standard U-Net to segment the appended tile.
• The tile size is chosen such that the appended tile plus the model parameters fit on a single Summit GPU.
• Each GPU segments an area equal to that of the original non-overlapping tile.
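The tiling step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code; it simply clamps the halo read window at the image boundary, whereas the paper's treatment of border tiles (e.g. mirror padding) may differ.

```python
def tile_bounds(image_size: int, tiles_per_side: int, halo: int):
    """For an N x N image, return (read_window, write_window) per tile.

    write_window: the non-overlapping T x T region this tile owns;
    read_window:  the tile extended by `halo` on every side, clamped
                  at the image boundary (an assumption of this sketch).
    Windows are (row_start, row_end, col_start, col_end) half-open spans.
    """
    N, T = image_size, image_size // tiles_per_side
    windows = []
    for r in range(tiles_per_side):
        for c in range(tiles_per_side):
            y0, x0 = r * T, c * T
            write = (y0, y0 + T, x0, x0 + T)
            read = (max(0, y0 - halo), min(N, y0 + T + halo),
                    max(0, x0 - halo), min(N, x0 + T + halo))
            windows.append((read, write))
    return windows
```

With the 8×8 tiling used later in the deck (T = 1250, h = 92), each interior GPU reads a 1434×1434 appended tile but is responsible for writing only its own 1250×1250 region.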
Performance of Sample-Parallel U-Net Training
• The optimal tiling for each 10,000×10,000 sample image was found to be 8×8.
• Each 1250×1250 tile was appended with a halo of width h = 92 and assigned to a single Summit GPU.
• 10-11 Summit nodes are used to train on each 10,000×10,000 image sample.
• A U-Net model was trained on a data set of 100 10,000×10,000×4 satellite images, collected at 30-50 cm resolution.
• The training time per epoch was shown to be ~12 seconds using 1200 Summit GPUs, compared to ~1,740 seconds on a DGX-1: over 100× faster U-Net training.
• Initial testing revealed no appreciable loss of training/validation accuracy using the new parallel framework.
Limitations of Sample Parallelism
Notation:
• I: image size (an I×I image; here I = 10,000)
• T: tile size; T′ = T + 2h: appended (padded) tile size
• h: halo width, h = (3·2^(d-1) - 2)·n_c
• d: U-Net depth
• n_c: no. of convolutions per level
• p: no. of tiles per side; an I×I image is partitioned into a p×p array of T×T tiles, so I = p·T
• F ~ (total volume of computations per appended tile) / (total volume of useful computations per tile) = (T′/T)² ~ (1 + 2h/T)²
• Ideally, F = 1.
• Decreasing p (increasing tile size) increases the memory requirement and quickly overtakes the memory available per GPU.
• Decreasing h decreases the receptive field of the model.
• On the other hand, the goal is to decrease F and increase h.
• Decreasing F by increasing the tile size T′ while decreasing h steers away from target receptive fields.
• To satisfy both, U-Net models larger than can fit on a single GPU are needed: model-parallel execution.
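The redundant-work ratio F is easy to evaluate for the tilings discussed here; a small sketch (ours), assuming T′ = T + 2h as defined above:

```python
def overhead_ratio(tile: int, halo: int) -> float:
    """Redundant-to-useful computation ratio F for one appended tile.

    A T x T tile is appended to T' = T + 2h before segmentation, so the
    U-Net processes (T'/T)**2 = (1 + 2h/T)**2 times the useful area.
    Ideally F = 1 (no halo overhead).
    """
    return (1 + 2 * halo / tile) ** 2
```

For the 8×8 tiling of the previous slide (T = 1250, h = 92), F ≈ 1.32, i.e. roughly a third of the convolution work is redundant halo computation, and F grows quickly as tiles shrink or halos widen.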
Model-Parallelism: Taming Large Model Size
Node-level pipeline-parallel execution.
• The set of consecutive layers mapped to a GPU is called a partition.
• The number of layers in each partition is called the balance.
• Subdivide each mini-batch of tiles into smaller micro-batches that are assigned to each partition; micro-batches per partition ≡ chunks.
• Memory needed per GPU = size(micro-batch) + size(partition).
• With a naive split of the U-Net across the six GPUs of a single Summit node (GPU 1 to GPU 6), there is no load balance across partitions (skip connections omitted for ease of presentation).
• Implementation: TorchGPipe, a PyTorch implementation of the GPipe* framework.
[Figure: GPipe schedule of forward and backward micro-batch passes and parameter updates across four partitions]
* Huang, Yanping, Youlong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le and Zhifeng Chen. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS (2019).
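For intuition on why more pipeline stages and more micro-batches help, the ideal GPipe-style speedup has a simple closed form. This toy model is ours (it assumes equal partitions, counts one pass only, and ignores communication); measured speedups fall well short of it when partitions are imbalanced.

```python
def ideal_pipeline_speedup(stages: int, microbatches: int) -> float:
    """Ideal speedup of a GPipe-style pipeline over a single device.

    With the model split into `stages` equal partitions and each
    mini-batch cut into `microbatches` chunks, one pass completes in
    (stages + microbatches - 1) chunk-steps, versus
    stages * microbatches chunk-steps of work executed sequentially.
    """
    return stages * microbatches / (stages + microbatches - 1)
```

With a single micro-batch there is no overlap at all (speedup 1.0 regardless of the number of stages); as the micro-batch count grows, the speedup approaches the number of stages from below.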
Model-Parallel Experiments
Single-node execution.
Benchmark U-Net models:

Model            | Levels | Conv. Layers Per Level | Trainable Parameters | Halo h
Small (Standard) | 5      | 2                      | 72,301,856           | 92
Medium-1         | 5      | 5                      | 232,687,904          | 230
Medium-2         | 6      | 2                      | 289,357,088          | 188
Large            | 7      | 2                      | 1,157,578,016        | 380

• Up to 10× larger number of trainable parameters and 4× larger receptive field than the standard U-Net.
• Speedup roughly doubles (small: 1.97×; medium-2: 2.01×) as the number of pipeline stages increases from 1 to 6.
• Medium-1: speedups reported at input sizes 192, 512 and 1024 using 6 pipeline stages.
[Figure: N×N image samples split into sample-parallel padded tiles, each fed to a model-parallel U-Net on a 96 GB Summit node]
Need for Performance Improvement
Single-node execution.
• The Small, Medium-2 and Large models have 109, 129 and 149 layers, respectively.
• Balances used: small {14, 24, 30, 22, 12, 7}; medium-2 {16, 26, 38, 26, 12, 11}; large {18, 30, 44, 30, 14, 13}.
• Need load-balanced pipelined execution.
• Encoder memory at level ℓ: E_ℓ = O( I_ℓ² + 2^ℓ · n_f · Σ_{i=1..n_c} (I_ℓ - i·d)² )
• Decoder memory at level ℓ′: D_ℓ′ = O( 2^ℓ′ · n_f · ( 2·I_ℓ′² + Σ_{i=1..n_c} (I_ℓ′ - i·d)² ) ), with ℓ′ = L - ℓ
• Memory profile: E_ℓ and D_ℓ′ vs. level ℓ.
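Given per-layer cost estimates such as E_ℓ and D_ℓ′ above, a balanced `balance` list can be generated automatically. The sketch below is one standard heuristic (binary search over per-stage capacity with a greedy feasibility check); it is our illustration, not necessarily the load-balancing method the authors use.

```python
def balance_layers(costs, num_stages):
    """Split per-layer `costs` into `num_stages` contiguous partitions,
    approximately minimizing the maximum per-stage total cost.

    Binary-searches the per-stage capacity; a greedy left-to-right fill
    checks how many stages a candidate capacity requires.  Returns the
    number of layers per stage (a GPipe-style `balance` list).
    """
    total = sum(costs)

    def stages_needed(cap):
        n, cur = 1, 0.0
        for c in costs:
            if cur + c > cap:       # current stage is full: open a new one
                n, cur = n + 1, 0.0
            cur += c
        return n

    lo, hi = max(costs), total      # capacity must cover the largest layer
    while hi - lo > 1e-9 * total:
        mid = (lo + hi) / 2
        if stages_needed(mid) <= num_stages:
            hi = mid                # feasible: try a tighter capacity
        else:
            lo = mid
    # Emit the balance list at the converged (feasible) capacity.
    balance, cur = [0], 0.0
    for c in costs:
        if cur + c > hi and balance[-1] > 0:
            balance.append(0)
            cur = 0.0
        balance[-1] += 1
        cur += c
    balance += [0] * (num_stages - len(balance))  # pad unused stages
    return balance
```

Feeding per-layer memory or compute estimates into `costs` yields contiguous partitions whose peak per-GPU load is near-minimal, which is exactly what the non-uniform memory profile above calls for.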
Wrapping Up
This paper: a prototype sample + model parallel framework.
• Training image segmentation neural network models becomes extremely challenging when: image sizes are very large; desired receptive fields are large; the volume of training data is large.
• Fast training/inference is needed for geo-sensing applications: satellite imagery, disaster assessment, precision agriculture, etc.
• This work is a first step: it can train 10× larger U-Net models with 4× larger receptive fields on 10,000× larger images.
Ongoing work: a sample + model + data parallel framework.
• Efforts are underway to integrate load-balancing heuristics and data-parallel execution to handle large volumes of training data efficiently.
[Figure: sample-parallel padded tiles dispatched to multiple model-parallel U-Net replicas (data parallel) across 96 GB Summit nodes]
THANK YOU