AutoML for TinyML with Once-for-All Network
Song Han, Massachusetts Institute of Technology
Once-for-All, ICLR'20
AutoML for TinyML with Once-for-All Network
- Less engineer resources (AutoML): from many engineers to fewer engineers.
- Less computational resources (TinyML): from a lot of computation and a large model to less computation and a small model.
Challenge: Efficient Inference on Diverse Hardware Platforms
- Cloud AI: memory 32 GB, computation on the order of TFLOPS.
- Mobile AI: memory 4 GB, computation on the order of GFLOPS.
- Tiny AI (AIoT): memory <100 KB, computation <MFLOPS.
- Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.
Design cost of training a single model (assuming MobileNetV2): ~200 GPU hours.

    for training iterations:
        forward-backward();
Adding neural architecture search (assuming MnasNet [1]) raises the design cost to ~40K GPU hours:

    for search episodes:                  # (1)
        for training iterations:
            forward-backward();
        if good_model: break;
    for post-search training iterations:
        forward-backward();

[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Hardware platforms have diversified over the years (2013, 2015, 2017, 2019), and each platform needs its own search. Repeating the MnasNet search across platforms raises the design cost from 40K to ~160K GPU hours:

    for devices:                          # (2)
        for search episodes:              # (1)
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();
With many devices, spanning cloud AI (~10^12 FLOPS), mobile AI (~10^9 FLOPS), and tiny AI (~10^6 FLOPS), the cost keeps multiplying: ~40K GPU hours for 1 platform, ~160K for 4 platforms, ~1600K for 40 platforms.
The design cost translates to CO2 emission (1 GPU hour ≈ 0.284 lbs CO2 according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019):
- 40K GPU hours → 11.4k lbs CO2
- 160K GPU hours → 45.4k lbs CO2
- 1600K GPU hours → 454.4k lbs CO2
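The arithmetic above can be checked with a short sketch. The 40K-GPU-hours-per-platform figure follows the MnasNet assumption on the slides, the 0.284 lbs/GPU-hour conversion follows Strubell et al.; the function name and structure are my own illustration.

```python
# Back-of-envelope design cost, assuming a MnasNet-style search (~40K GPU hours
# per platform, per the slides) and 0.284 lbs CO2 per GPU hour (Strubell et al.).
LBS_CO2_PER_GPU_HOUR = 0.284

def design_cost(num_platforms, gpu_hours_per_platform=40_000):
    """Total GPU hours and CO2 emission when every platform needs its own search."""
    hours = num_platforms * gpu_hours_per_platform
    return hours, hours * LBS_CO2_PER_GPU_HOUR

for n in (1, 4, 40):
    hours, co2 = design_cost(n)
    # 1 -> ~11.4k lbs, 4 -> ~45.4k lbs, 40 -> ~454.4k lbs
    print(f"{n} platforms: {hours} GPU hours, {co2:,.0f} lbs CO2")
```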
Problem: TinyML (inference) comes at the cost of BigML (training/search). We need Green AI: solve the environmental problem of NAS.
[Figure: CO2 emission of the Evolved Transformer (ICML'19; cost figures per Strubell et al., ACL'19) vs. ours, the Hardware-Aware Transformer (ACL'20) — a reduction of 4 orders of magnitude.]
OFA: Decouple Training and Search

Conventional NAS:

    for devices:                          # (2)
        for search episodes:              # (1)
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Once-for-All decouples training from search:

    # training
    for OFA training iterations:
        forward-backward();
    # search
    for devices:
        for search episodes:
            sample from OFA;
            if good_model: break;
        direct deploy without training;
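The decoupling can be sketched as a toy cost model that counts forward-backward calls; the loop structure mirrors the pseudocode above, but the iteration counts here are illustrative stand-ins, not the paper's real budgets.

```python
# Toy comparison of training cost (counted in forward-backward calls) between
# conventional per-device NAS and the OFA train-once/search-many recipe.
# The iteration counts are made up for illustration.

def conventional_nas(num_devices, search_episodes=3, train_iters=5):
    cost = 0
    for _ in range(num_devices):
        for _ in range(search_episodes):
            cost += train_iters      # train each candidate during search
        cost += train_iters          # retrain the chosen model from scratch
    return cost

def ofa(num_devices, ofa_train_iters=20):
    cost = ofa_train_iters           # train the once-for-all network once
    for _ in range(num_devices):
        pass                         # search only samples sub-nets: no training,
                                     # direct deploy
    return cost

print(conventional_nas(40), ofa(40))  # 800 20
```

The point of the sketch: conventional NAS cost grows linearly with the number of devices, while the OFA training cost is a constant paid once.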
With the once-for-all network, the ~40K GPU hours (11.4k lbs CO2) of training are paid once; for each device, search only samples sub-networks from the OFA network and the chosen model is deployed directly without training. The design cost therefore no longer grows to 160K or 1600K GPU hours (45.4k or 454.4k lbs CO2) as devices multiply across cloud AI (~10^12 FLOPS), mobile AI (~10^9 FLOPS), and tiny AI (~10^6 FLOPS).
Once-for-All Network: Decouple Model Training and Architecture Design
[Figure: specialized sub-networks are derived from a single once-for-all network for different deployment scenarios.]
Challenge: how do we prevent different sub-networks from interfering with each other?
Solution: Progressive Shrinking
- More than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width.
- Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
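The 10^19 figure can be roughly reproduced from the OFA paper's design space, assuming 5 units with per-unit depth in {2, 3, 4} and, per layer, a kernel size in {3, 5, 7} and a width expansion ratio in {3, 4, 6}; the counting below is a sketch under those assumptions.

```python
# Rough size of the OFA sub-network space (architectures only; elastic
# resolution adds input-size choices on top of this).
choices_per_layer = 3 * 3                                   # kernel size x width
per_unit = sum(choices_per_layer ** d for d in (2, 3, 4))   # depth 2, 3, or 4
total = per_unit ** 5                                       # 5 units
print(f"{total:.1e}")  # ~2.2e+19, i.e. more than 10^19
```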
Progressive Shrinking: train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network.
- Small sub-networks are nested in large sub-networks.
- Cast the training process of the once-for-all network as a progressive shrinking and joint fine-tuning process.
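The shrinking schedule can be sketched as a staged sampling space: each stage unlocks one more elastic dimension, and training jointly fine-tunes sub-networks sampled from the current space. The stage ordering follows the slides; the concrete value sets and sampling code are illustrative.

```python
# Sketch of the progressive-shrinking schedule: the sampling space grows one
# elastic dimension at a time (kernel size, then depth, then width), while
# resolution is elastic throughout. Value sets are illustrative assumptions.
import random

RESOLUTIONS = [128, 160, 192, 224]
STAGES = [
    {"kernel": [7],       "depth": [4],       "width": [6]},        # full model
    {"kernel": [7, 5, 3], "depth": [4],       "width": [6]},        # elastic kernel
    {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6]},        # + elastic depth
    {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4, 3]},  # + elastic width
]

def sample_subnet(stage):
    """Draw one sub-network configuration from the given shrinking stage."""
    space = STAGES[stage]
    return {
        "resolution": random.choice(RESOLUTIONS),
        "kernel": random.choice(space["kernel"]),
        "depth": random.choice(space["depth"]),
        "width": random.choice(space["width"]),
    }

print(sample_subnet(3))  # a random config from the fully elastic stage
```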
Connection to Network Pruning
- Network Pruning: train the full model → shrink the model (only width) → fine-tune the single pruned network.
- Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network.
- Progressive shrinking can be viewed as a generalized network pruning with much higher flexibility across 4 dimensions.
Progressive Shrinking
[Figure: each of the four dimensions — Resolution, Kernel Size, Depth, Width — starts Full and is progressively made Elastic, down to partial settings.]
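Elastic kernel size is the first shrinking step: in OFA, a smaller kernel shares the center of the largest kernel's weights. The sketch below shows only the center-crop weight sharing; the paper additionally applies a learned linear transform when deriving the smaller kernels, which is omitted here.

```python
# Sketch of elastic kernel size via center-cropped weight sharing: the k x k
# kernel is taken from the center of the largest (7x7) kernel.
def center_crop(kernel, k):
    """Take the central k x k patch of a square kernel (list of lists)."""
    n = len(kernel)
    start = (n - k) // 2
    return [row[start:start + k] for row in kernel[start:start + k]]

# a 7x7 kernel whose entries encode their (row, col) position
w7 = [[(r, c) for c in range(7)] for r in range(7)]
w5 = center_crop(w7, 5)  # shares the center 5x5 of w7
w3 = center_crop(w7, 3)  # shares the center 3x3 of w7
print(w3[0][0], w3[2][2])  # (2, 2) (4, 4)
```

Because every kernel size reads from the same underlying 7x7 weights, training the small kernels also updates the large one, which is what makes the sub-networks nested rather than independent.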