Once-for-All: Train One Network and Specialize It for Efficient Deployment (presentation transcript)

  1. Once for All: Train One Network and Specialize It for Efficient Deployment. Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han. Massachusetts Institute of Technology. Once-for-All, ICLR’20.

  2. Challenge: Efficient Inference on Diverse Hardware Platforms. From Cloud AI to Mobile AI to Tiny AI (AIoT), there is less and less resource: Cloud AI has 32 GB of memory and ~10^12 FLOPS of computation; Mobile AI has 4 GB of memory and ~10^9 FLOPS; Tiny AI has 100 KB of memory and < 10^6 FLOPS. • Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

  3. Challenge: Efficient Inference on Diverse Hardware Platforms. Designing a specialized model for a single hardware platform costs about 40K GPU hours. The design cost is calculated under the assumption of using MnasNet [1]. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

  4. Challenge: Efficient Inference on Diverse Hardware Platforms. New hardware generations keep arriving (2013, 2015, 2017, 2019), so supporting each of them pushes the design cost from 40K to 160K GPU hours. The design cost is calculated under the assumption of using MnasNet [1].

  5. Challenge: Efficient Inference on Diverse Hardware Platforms. With diverse platforms spanning Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), the design cost grows from 40K to 160K to 1600K GPU hours. The design cost is calculated under the assumption of using MnasNet [1].

  6.-8. Challenge: Efficient Inference on Diverse Hardware Platforms. The design cost in carbon terms: 40K GPU hours → 11.4k lbs of CO2 emission; 160K GPU hours → 45.4k lbs; 1600K GPU hours → 454.4k lbs; and it keeps growing with every new platform → ? The answer proposed here: the Once-for-All Network. (1 GPU hour translates to 0.284 lbs of CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.)

  9.-12. Once-for-All Network: Decouple Model Training and Architecture Design. [Animation: a single once-for-all network is trained once; specialized sub-networks are then derived from it for different deployment scenarios.]

  13. Progressive Shrinking for Training OFA Networks. • More than 10^19 different sub-networks are contained in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width. • Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
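As a rough sanity check on the 10^19 count (a sketch; the choice sets are assumptions taken from the search space described in the OFA paper: 5 units, per-unit depth in {2, 3, 4}, and per-layer kernel size in {3, 5, 7} and width expand ratio in {3, 4, 6}):

```python
# Rough sub-network count for the assumed OFA search space:
# 5 units; each unit picks a depth d in {2, 3, 4}; each of its d layers
# independently picks one of 3 kernel sizes and one of 3 width ratios.
choices_per_layer = 3 * 3                                  # kernel x width
per_unit = sum(choices_per_layer ** d for d in (2, 3, 4))  # 81 + 729 + 6561
total = per_unit ** 5                                      # 5 independent units
print(f"{total:.1e}")  # ~2.2e19, before also counting input resolutions
```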

  14. Progressive Shrinking. Train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network. • Small sub-networks are nested in large sub-networks. • Cast the training process of the once-for-all network as a progressive shrinking and joint fine-tuning process.
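A minimal, self-contained sketch of the staged schedule under the same assumed search space (the stage order follows the slides; the sampling helper and its granularity are illustrative, not the authors' code):

```python
import random

# Progressive-shrinking stages: unlock one elastic dimension at a time.
# Resolution is elastic from the start (sampled per batch; see slide 16).
STAGES = [
    ("full model",     {"kernel": [7],       "depth": [4],       "width": [6]}),
    ("elastic kernel", {"kernel": [3, 5, 7], "depth": [4],       "width": [6]}),
    ("elastic depth",  {"kernel": [3, 5, 7], "depth": [2, 3, 4], "width": [6]}),
    ("elastic width",  {"kernel": [3, 5, 7], "depth": [2, 3, 4], "width": [3, 4, 6]}),
]

def sample_subnet(space, num_units=5):
    """Sample one sub-network configuration from the currently unlocked space."""
    units = []
    for _ in range(num_units):
        depth = random.choice(space["depth"])
        units.append({
            "depth": depth,
            "kernel": [random.choice(space["kernel"]) for _ in range(depth)],
            "width": [random.choice(space["width"]) for _ in range(depth)],
        })
    return units

for name, space in STAGES:
    # In real training each stage runs for many steps, training randomly
    # sampled sub-networks (with the full network as distillation teacher),
    # so large and small sub-networks are fine-tuned jointly.
    print(name, "->", sample_subnet(space)[0])
```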

  15. Connection to Network Pruning. Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → a single pruned network. Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network. • Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.

  16. Progressive Shrinking: Elastic Resolution. [Diagram: progress bars over the four dimensions (Resolution, Kernel Size, Depth, Width); at this stage only Resolution is elastic, the rest are still full.] Randomly sample the input image size for each batch.
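A minimal sketch of per-batch resolution sampling in PyTorch (the candidate size set, 128 to 224 with stride 4, follows the paper; the function name is illustrative):

```python
import random
import torch.nn.functional as F

# Candidate input resolutions (assumed, per the paper: 128..224, stride 4).
CANDIDATE_SIZES = list(range(128, 225, 4))

def random_resize(images):
    """images: (N, C, H, W) float tensor; rescale the whole batch to a
    randomly chosen square resolution."""
    size = random.choice(CANDIDATE_SIZES)
    return F.interpolate(images, size=(size, size),
                         mode="bilinear", align_corners=False)
```

Only the input resolution changes per batch; all weights are fully shared, so this dimension adds no extra parameters.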

  17. Progressive Shrinking: Elastic Kernel Size. [Diagram: Resolution and Kernel Size elastic; Depth and Width still full.] Kernel sizes 7x7 → 5x5 → 3x3: start with the full kernel size; a smaller kernel takes the centered weights of the larger one, passed through a kernel transformation matrix (25x25 for the 5x5 kernel, 9x9 for the 3x3 kernel).
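A sketch of the centered-weights-plus-transformation scheme (a minimal module, assuming one transformation matrix per kernel-size transition, shared across channels and initialized to identity; the class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class ElasticKernel(nn.Module):
    """Holds full 7x7 weights; derives 5x5 and 3x3 weights from the center."""
    def __init__(self, out_ch, in_ch):
        super().__init__()
        self.weight7 = nn.Parameter(torch.randn(out_ch, in_ch, 7, 7) * 0.01)
        self.transform5 = nn.Parameter(torch.eye(25))  # 25x25, for 7x7 -> 5x5
        self.transform3 = nn.Parameter(torch.eye(9))   # 9x9,  for 5x5 -> 3x3

    def get_weight(self, kernel_size):
        o, i = self.weight7.shape[:2]
        w = self.weight7
        if kernel_size == 7:
            return w
        # centered 5x5 weights, linearly transformed
        w = w[:, :, 1:6, 1:6].reshape(o, i, 25) @ self.transform5
        if kernel_size == 5:
            return w.reshape(o, i, 5, 5)
        # centered 3x3 weights from the 5x5, transformed again
        w = w.reshape(o, i, 5, 5)[:, :, 1:4, 1:4].reshape(o, i, 9) @ self.transform3
        return w.reshape(o, i, 3, 3)
```

The transformation matrix lets the small kernel deviate from a plain center crop of the large one without giving each sub-network its own separate weights.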

  18. Progressive Shrinking: Elastic Depth. [Diagram: Resolution, Kernel Size, and Depth elastic; Width still full. Within unit i, training starts with full depth, then the depth is shrunk step by step.] Train with full depth first, then shrink the depth: gradually allow later layers in each unit to be skipped to reduce the depth.
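A minimal sketch of depth elasticity (the class is illustrative; each unit keeps its first active_depth layers and skips the rest, so shallow sub-networks reuse the early layers of deep ones):

```python
import torch.nn as nn

class ElasticUnit(nn.Module):
    """A unit whose later layers can be skipped to reduce depth."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # e.g. 4 bottleneck blocks
        self.active_depth = len(layers)       # full depth by default

    def forward(self, x):
        for layer in self.layers[: self.active_depth]:
            x = layer(x)                      # layers beyond active_depth are skipped
        return x
```

During a progressive-shrinking step, setting active_depth = 2 trains the 2-layer sub-network while sharing all of its weights with the full-depth one.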

  19. Progressive Shrinking: Elastic Width. [Diagram: all four dimensions elastic. Channel importance scores are computed, the channels are reorganized via channel sorting, and the width is progressively shrunk.] Train with full width first, then gradually shrink the width; keep the most important channels when shrinking, via channel sorting.
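A sketch of channel sorting (the L1-norm importance measure follows the paper; the function names are illustrative, and a real implementation must also permute the following layer's input channels to match):

```python
import torch

def channel_importance(conv_weight):
    """conv_weight: (out_ch, in_ch, k, k). L1 norm of each output channel."""
    return conv_weight.abs().sum(dim=(1, 2, 3))

def shrink_width(conv_weight, keep):
    """Keep the `keep` most important output channels, best first.
    NB: the next layer's input channels (and any BatchNorm parameters)
    must be reordered with the same permutation."""
    order = torch.argsort(channel_importance(conv_weight), descending=True)
    return conv_weight[order[:keep]]
```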

  20. Performances of Sub-networks on ImageNet. [Chart: ImageNet top-1 accuracy of sub-networks trained with vs. without progressive shrinking (PS), under various architecture configurations (D: depth, W: width, K: kernel size). PS improves top-1 accuracy by 2.5% to 3.7% across all configurations.] • Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.

  21. OFA: 80% Top-1 Accuracy on ImageNet. [Chart: ImageNet top-1 accuracy vs. MACs (billions), with marker size indicating model size, for handcrafted and AutoML models (MobileNetV1/V2/V3, ShuffleNet, ResNet-50/101, ResNeXt-50/101, DenseNet-121/169/264, InceptionV2/V3, Xception, NASNet-A, AmoebaNet, PNASNet, DARTS, DPN-92, IGCV3-D, ProxylessNAS, EfficientNet). Once-for-All (ours) reaches 80.0% top-1 at 595M MACs, with up to 14x less computation than models of comparable accuracy.] • Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).

  22. Comparison with EfficientNet and MobileNetV3. [Charts: ImageNet top-1 accuracy vs. Google Pixel1 latency. Left: OFA vs. EfficientNet; right: OFA vs. MobileNetV3. At matched accuracy OFA is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3; at matched latency it is up to 3.8% and 4% more accurate, respectively.] • Once-for-all is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3 on Google Pixel1 without loss of accuracy.

  23. OFA for Fast Specialization on Diverse Hardware Platforms. [Charts: ImageNet top-1 accuracy vs. measured latency for OFA, MobileNetV3, and MobileNetV2 on six platforms: LG G8, Samsung S7 Edge, Google Pixel2, NVIDIA 1080Ti (batch size 64), Xilinx ZU3EG FPGA (batch size 1, quantized), and Intel Xeon CPU (batch size 1). OFA gives the best accuracy-latency trade-off on every platform.]

  24. OFA Saves Orders of Magnitude in Design Cost. • Green AI is important. The computation cost of OFA stays constant with the number of hardware platforms, reducing the carbon footprint by 1,335x compared to MnasNet under 40 platforms.
