Once-for-All: Train One Network and Specialize It for Efficient Deployment (presentation transcript)

  1. Once for All: Train One Network and Specialize It for Efficient Deployment. Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han. Massachusetts Institute of Technology. Once-for-All, ICLR’20.

  2. Challenge: Efficient Inference on Diverse Hardware Platforms. From Cloud AI to Mobile AI to Tiny AI (AIoT), there is less and less resource: Cloud AI has 32 GB of memory and ~10^12 FLOPS of computation; Mobile AI has 4 GB of memory and ~10^9 FLOPS; Tiny AI has 100 KB of memory and < 10^6 FLOPS. • Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

  3. Challenge: Efficient Inference on Diverse Hardware Platforms. Designing a specialized model for a single hardware platform costs about 40K GPU hours. The design cost is calculated under the assumption of using MnasNet [1]. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

  4. Challenge: Efficient Inference on Diverse Hardware Platforms. New hardware generations keep arriving (2013, 2015, 2017, 2019), so supporting each of them pushes the design cost from 40K to 160K GPU hours. The design cost is calculated under the assumption of using MnasNet [1].

  5. Challenge: Efficient Inference on Diverse Hardware Platforms. With diverse platforms spanning Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), the design cost grows from 40K to 160K to 1600K GPU hours. The design cost is calculated under the assumption of using MnasNet [1].

  6.-8. Challenge: Efficient Inference on Diverse Hardware Platforms. The design cost in carbon terms: 40K GPU hours → 11.4k lbs of CO2 emission; 160K GPU hours → 45.4k lbs; 1600K GPU hours → 454.4k lbs; and it keeps growing with every new platform → ? The answer proposed here: the Once-for-All Network. (1 GPU hour translates to 0.284 lbs of CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.)

  9.-12. Once-for-All Network: Decouple Model Training and Architecture Design. [Animation: a single once-for-all network is trained once; specialized sub-networks are then derived from it for different deployment scenarios.]

  13. Progressive Shrinking for Training OFA Networks. • More than 10^19 different sub-networks are contained in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width. • Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
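As a rough sanity check on the 10^19 count (a sketch; the choice sets are assumptions taken from the search space described in the OFA paper: 5 units, per-unit depth in {2, 3, 4}, and per-layer kernel size in {3, 5, 7} and width expand ratio in {3, 4, 6}):

```python
# Rough sub-network count for the assumed OFA search space:
# 5 units; each unit picks a depth d in {2, 3, 4}; each of its d layers
# independently picks one of 3 kernel sizes and one of 3 width ratios.
choices_per_layer = 3 * 3                                  # kernel x width
per_unit = sum(choices_per_layer ** d for d in (2, 3, 4))  # 81 + 729 + 6561
total = per_unit ** 5                                      # 5 independent units
print(f"{total:.1e}")  # ~2.2e19, before also counting input resolutions
```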

  14. Progressive Shrinking. Train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network. • Small sub-networks are nested in large sub-networks. • Cast the training process of the once-for-all network as a progressive shrinking and joint fine-tuning process.
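A minimal, self-contained sketch of the staged schedule under the same assumed search space (the stage order follows the slides; the sampling helper and its granularity are illustrative, not the authors' code):

```python
import random

# Progressive-shrinking stages: unlock one elastic dimension at a time.
# Resolution is elastic from the start (sampled per batch; see slide 16).
STAGES = [
    ("full model",     {"kernel": [7],       "depth": [4],       "width": [6]}),
    ("elastic kernel", {"kernel": [3, 5, 7], "depth": [4],       "width": [6]}),
    ("elastic depth",  {"kernel": [3, 5, 7], "depth": [2, 3, 4], "width": [6]}),
    ("elastic width",  {"kernel": [3, 5, 7], "depth": [2, 3, 4], "width": [3, 4, 6]}),
]

def sample_subnet(space, num_units=5):
    """Sample one sub-network configuration from the currently unlocked space."""
    units = []
    for _ in range(num_units):
        depth = random.choice(space["depth"])
        units.append({
            "depth": depth,
            "kernel": [random.choice(space["kernel"]) for _ in range(depth)],
            "width": [random.choice(space["width"]) for _ in range(depth)],
        })
    return units

for name, space in STAGES:
    # In real training each stage runs for many steps, training randomly
    # sampled sub-networks (with the full network as distillation teacher),
    # so large and small sub-networks are fine-tuned jointly.
    print(name, "->", sample_subnet(space)[0])
```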

  15. Connection to Network Pruning. Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → a single pruned network. Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network. • Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.

  16. Progressive Shrinking: Elastic Resolution. [Diagram: progress bars over the four dimensions (Resolution, Kernel Size, Depth, Width); at this stage only Resolution is elastic, the rest are still full.] Randomly sample the input image size for each batch.
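A minimal sketch of per-batch resolution sampling in PyTorch (the candidate size set, 128 to 224 with stride 4, follows the paper; the function name is illustrative):

```python
import random
import torch.nn.functional as F

# Candidate input resolutions (assumed, per the paper: 128..224, stride 4).
CANDIDATE_SIZES = list(range(128, 225, 4))

def random_resize(images):
    """images: (N, C, H, W) float tensor; rescale the whole batch to a
    randomly chosen square resolution."""
    size = random.choice(CANDIDATE_SIZES)
    return F.interpolate(images, size=(size, size),
                         mode="bilinear", align_corners=False)
```

Only the input resolution changes per batch; all weights are fully shared, so this dimension adds no extra parameters.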

  17. Progressive Shrinking: Elastic Kernel Size. [Diagram: Resolution and Kernel Size elastic; Depth and Width still full.] Kernel sizes 7x7 → 5x5 → 3x3: start with the full kernel size; a smaller kernel takes the centered weights of the larger one, passed through a kernel transformation matrix (25x25 for the 5x5 kernel, 9x9 for the 3x3 kernel).
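A sketch of the centered-weights-plus-transformation scheme (a minimal module, assuming one transformation matrix per kernel-size transition, shared across channels and initialized to identity; the class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class ElasticKernel(nn.Module):
    """Holds full 7x7 weights; derives 5x5 and 3x3 weights from the center."""
    def __init__(self, out_ch, in_ch):
        super().__init__()
        self.weight7 = nn.Parameter(torch.randn(out_ch, in_ch, 7, 7) * 0.01)
        self.transform5 = nn.Parameter(torch.eye(25))  # 25x25, for 7x7 -> 5x5
        self.transform3 = nn.Parameter(torch.eye(9))   # 9x9,  for 5x5 -> 3x3

    def get_weight(self, kernel_size):
        o, i = self.weight7.shape[:2]
        w = self.weight7
        if kernel_size == 7:
            return w
        # centered 5x5 weights, linearly transformed
        w = w[:, :, 1:6, 1:6].reshape(o, i, 25) @ self.transform5
        if kernel_size == 5:
            return w.reshape(o, i, 5, 5)
        # centered 3x3 weights from the 5x5, transformed again
        w = w.reshape(o, i, 5, 5)[:, :, 1:4, 1:4].reshape(o, i, 9) @ self.transform3
        return w.reshape(o, i, 3, 3)
```

The transformation matrix lets the small kernel deviate from a plain center crop of the large one without giving each sub-network its own separate weights.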

  18. Progressive Shrinking: Elastic Depth. [Diagram: Resolution, Kernel Size, and Depth elastic; Width still full. Within unit i, training starts with full depth, then the depth is shrunk step by step.] Train with full depth first, then shrink the depth: gradually allow later layers in each unit to be skipped to reduce the depth.
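A minimal sketch of depth elasticity (the class is illustrative; each unit keeps its first active_depth layers and skips the rest, so shallow sub-networks reuse the early layers of deep ones):

```python
import torch.nn as nn

class ElasticUnit(nn.Module):
    """A unit whose later layers can be skipped to reduce depth."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # e.g. 4 bottleneck blocks
        self.active_depth = len(layers)       # full depth by default

    def forward(self, x):
        for layer in self.layers[: self.active_depth]:
            x = layer(x)                      # layers beyond active_depth are skipped
        return x
```

During a progressive-shrinking step, setting active_depth = 2 trains the 2-layer sub-network while sharing all of its weights with the full-depth one.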

  19. Progressive Shrinking: Elastic Width. [Diagram: all four dimensions elastic. Channel importance scores are computed, the channels are reorganized via channel sorting, and the width is progressively shrunk.] Train with full width first, then gradually shrink the width; keep the most important channels when shrinking, via channel sorting.
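A sketch of channel sorting (the L1-norm importance measure follows the paper; the function names are illustrative, and a real implementation must also permute the following layer's input channels to match):

```python
import torch

def channel_importance(conv_weight):
    """conv_weight: (out_ch, in_ch, k, k). L1 norm of each output channel."""
    return conv_weight.abs().sum(dim=(1, 2, 3))

def shrink_width(conv_weight, keep):
    """Keep the `keep` most important output channels, best first.
    NB: the next layer's input channels (and any BatchNorm parameters)
    must be reordered with the same permutation."""
    order = torch.argsort(channel_importance(conv_weight), descending=True)
    return conv_weight[order[:keep]]
```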

  20. Performances of Sub-networks on ImageNet. [Chart: ImageNet top-1 accuracy of sub-networks trained with vs. without progressive shrinking (PS), under various architecture configurations (D: depth, W: width, K: kernel size). PS improves top-1 accuracy by 2.5% to 3.7% across all configurations.] • Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.

  21. OFA: 80% Top-1 Accuracy on ImageNet. [Chart: ImageNet top-1 accuracy vs. MACs (billions), with marker size indicating model size, for handcrafted and AutoML models (MobileNetV1/V2/V3, ShuffleNet, ResNet-50/101, ResNeXt-50/101, DenseNet-121/169/264, InceptionV2/V3, Xception, NASNet-A, AmoebaNet, PNASNet, DARTS, DPN-92, IGCV3-D, ProxylessNAS, EfficientNet). Once-for-All (ours) reaches 80.0% top-1 at 595M MACs, with up to 14x less computation than models of comparable accuracy.] • Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).

  22. Comparison with EfficientNet and MobileNetV3. [Charts: ImageNet top-1 accuracy vs. Google Pixel1 latency. Left: OFA vs. EfficientNet; right: OFA vs. MobileNetV3. At matched accuracy OFA is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3; at matched latency it is up to 3.8% and 4% more accurate, respectively.] • Once-for-all is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3 on Google Pixel1 without loss of accuracy.

  23. OFA for Fast Specialization on Diverse Hardware Platforms. [Charts: ImageNet top-1 accuracy vs. measured latency for OFA, MobileNetV3, and MobileNetV2 on six platforms: LG G8, Samsung S7 Edge, Google Pixel2, NVIDIA 1080Ti (batch size 64), Xilinx ZU3EG FPGA (batch size 1, quantized), and Intel Xeon CPU (batch size 1). OFA gives the best accuracy-latency trade-off on every platform.]

  24. OFA Saves Orders of Magnitude in Design Cost. • Green AI is important. The computation cost of OFA stays constant with the number of hardware platforms, reducing the carbon footprint by 1,335x compared to MnasNet under 40 platforms.
