MCUNet: Tiny Deep Learning on IoT Devices
Ji Lin (1), Song Han (1), Wei-Ming Chen (1,2), Yujun Lin (1), John Cohn (3), Chuang Gan (3)
(1) MIT  (2) National Taiwan University  (3) MIT-IBM Watson AI Lab
NeurIPS 2020 (spotlight)
Background: The Era of AIoT on Microcontrollers (MCUs)
• Low-cost, low-power
• Rapid growth
  [Chart: annual MCU shipments, billions of units, 2012–2019 (F = forecast), growing steadily toward ~40B units/year]
• Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home, …
Challenge: Memory Too Small to Hold DNN

                      Cloud AI    Mobile AI    Tiny AI
Memory (Activation)   16 GB       4 GB         320 kB  (13,000× smaller)
Storage (Weights)     ~TB/PB      256 GB       1 MB    (50,000× smaller)

We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.
Existing efficient networks only reduce model size but NOT activation size!
[Chart, models at ~70% ImageNet top-1: MobileNetV2-0.75 vs. ResNet-18 cuts parameters by 4.6× but peak activation by only 1.8×; MCUNet reduces both.]
Challenge: Memory Too Small to Hold DNN
Peak memory vs. the 320 kB SRAM constraint:
• ResNet-50: 23× over budget
• MobileNetV2: 22× over budget
• MobileNetV2 (int8): 5× over budget
• MCUNet: fits within 320 kB
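The "peak memory" being compared can be estimated layer by layer. Below is a hypothetical sketch (not MCUNet's actual profiler): it assumes, as in TinyEngine-style sequential inference, that only the current layer's input and output activations must be resident at once, and that activations are int8. The layer list and shapes are made up for illustration.

```python
# Toy peak-activation estimator for a chain of stride-s conv layers.
# Assumption: memory at each step = input activation + output activation.

def conv_out_hw(h, w, stride):
    # output spatial size with "same" padding
    return (h + stride - 1) // stride, (w + stride - 1) // stride

def peak_activation_kb(layers, in_shape, bytes_per_elem=1):
    """layers: list of (out_channels, stride); in_shape: (C, H, W).
    Returns the peak (input + output) activation size in kB (int8 by default)."""
    c, h, w = in_shape
    peak = 0
    for out_c, stride in layers:
        oh, ow = conv_out_hw(h, w, stride)
        in_size = c * h * w
        out_size = out_c * oh * ow
        peak = max(peak, (in_size + out_size) * bytes_per_elem)
        c, h, w = out_c, oh, ow
    return peak / 1024

# Early layers dominate the peak because the resolution is still large:
layers = [(16, 2), (24, 2), (40, 2), (80, 2)]
print(peak_activation_kb(layers, (3, 224, 224)))  # 343.0 kB, already above 320 kB
```

This also illustrates why shrinking the input resolution R is so effective on MCUs: the first layers' activations scale with R².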
MCUNet: System-Algorithm Co-design
(a) Search an NN model for an existing library: NAS → Library (e.g., ProxylessNAS, MnasNet)
(b) Tune the deep learning library for a given NN model: NN Model → Library (e.g., TVM)
(c) MCUNet: system-algorithm co-design — TinyNAS (efficient neural architecture) ⇄ TinyEngine (efficient compiler/runtime), designed jointly.
TinyNAS: Two-Stage NAS for Tiny Memory Constraints
• Search space design is crucial for NAS performance, but there is no prior expertise on MCU model design.
• Stage 1: Automated search space optimization — memory/storage constraints shrink the full network space to an optimized search space.
• Stage 2: Model specialization within the optimized search space.
TinyNAS: (1) Automated Search Space Optimization
Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth
• kernel size k of the depthwise conv (dw): k ∈ {3, 5, 7}
• expansion ratio e of the pointwise convs (pw1, pw2): e ∈ {2, 4, 6}
• depth d per stage: d ∈ {2, 3, 4}
But models from this space run out of memory on MCUs!
TinyNAS: (1) Automated Search Space Optimization
Extend the search space to cover a wide range of hardware capacity:
S′ = kernel size × expansion ratio × depth × input resolution R × width multiplier W
Different R and W for different hardware capacity (i.e., a different optimized sub-space):
• e.g., R = 224, W = 1.0 for one mobile device; R = 260, W = 1.4 for a larger one*
• MCUs (F412 / F743 / H746, with 256 kB / 320 kB / 512 kB SRAM): R = ?, W = ?
* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20
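The idea behind picking (R, W) per device can be sketched as a grid scan that keeps only configurations whose estimated footprint fits the MCU. The cost model here is a deliberately crude assumption (activation ∝ W·R², weights ∝ W²) with made-up base constants, not MCUNet's actual profiler:

```python
# Hypothetical (R, W) pre-filter for a given SRAM/flash budget.

BASE_ACT_KB = 1700.0    # assumed peak activation at R=224, W=1.0 (int8)
BASE_PARAM_KB = 3400.0  # assumed model size at W=1.0 (int8)

def fits(r, w, sram_kb, flash_kb):
    act = BASE_ACT_KB * w * (r / 224) ** 2   # activations scale with W * R^2
    params = BASE_PARAM_KB * w ** 2          # weights scale roughly with W^2
    return act <= sram_kb and params <= flash_kb

def candidate_spaces(sram_kb, flash_kb):
    grid = [(r, w) for r in range(48, 225, 16)
                   for w in (0.3, 0.4, 0.5, 0.6, 0.7, 1.0)]
    return [(r, w) for r, w in grid if fits(r, w, sram_kb, flash_kb)]

# e.g. a 320 kB SRAM / 1 MB flash MCU: R=224, W=1.0 is immediately excluded
print(candidate_spaces(320, 1024)[:5])
```

Each surviving (R, W) pair still defines an entire sub-space of (k, e, d) models; the next slide shows how TinyNAS ranks those sub-spaces.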
TinyNAS: (1) Automated Search Space Optimization
Analyze the FLOPs distribution (CDF) of the models that satisfy the memory constraint in each candidate search space:
Larger FLOPs → larger model capacity → more likely to give higher accuracy.
A good design space is one likely to achieve high FLOPs under the memory constraint.
[Plot: cumulative probability vs. FLOPs (M) under the 320 kB constraint for candidate spaces (width–resolution | mean FLOPs): w0.3-r160 | 32.5, w0.4-r112 | 32.4, w0.4-r128 | 39.3, w0.4-r144 | 46.9, w0.5-r112 | 38.3, w0.5-r128 | 46.9, w0.5-r144 | 52.0, w0.6-r112 | 41.3, w0.7-r96 | 31.4, w0.7-r112 | 38.4. At p = 80%, a bad design space reaches only 32.3M FLOPs, while the good design space reaches 50.3M FLOPs; best accuracies of the searched models range from 74.2% (bad space) to 78.7% (good space).]
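The selection heuristic above can be sketched as follows. This is a hypothetical illustration: `sample_flops` is a toy stand-in that should really build each random (k, e, d) model, reject those violating the memory budget, and profile its FLOPs; here a fake FLOPs model proportional to W·R² is used, with the (W, R) pairs named after the slide's sub-spaces.

```python
# Rank candidate (W, R) sub-spaces by the FLOPs their satisfying models
# reach at a target percentile of the CDF (here p = 0.8, as on the slide).

import random

def sample_flops(space, n=1000):
    """Toy stand-in for profiling random configs from a sub-space."""
    w, r = space
    random.seed(0)  # same configs across spaces, for a fair toy comparison
    return [w * (r / 224) ** 2 * random.uniform(200, 400) for _ in range(n)]

def percentile_flops(flops, p=0.8):
    s = sorted(flops)
    return s[int(p * (len(s) - 1))]

def best_space(spaces, p=0.8):
    # pick the space whose p-th percentile FLOPs is highest
    return max(spaces, key=lambda sp: percentile_flops(sample_flops(sp), p))

spaces = [(0.3, 160), (0.4, 144), (0.5, 112)]
print(best_space(spaces))  # (0.4, 144) — i.e. the w0.4-r144 space
```

Note the design choice: only the *distribution* of FLOPs is needed, so no model is ever trained while choosing the space — training happens once, in the specialization stage.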
TinyNAS: (2) Resource-Constrained Model Specialization
• One-shot NAS through weight sharing: train a super network, randomly sample multiple sub-networks (kernel size, expansion, depth), and jointly fine-tune them.
• Small sub-networks are nested in large sub-networks.
• Directly evaluate the accuracy of sub-networks, without retraining each one.
* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20
TinyNAS: (2) Resource-Constrained Model Specialization
The super network is trained with elastic kernel size, elastic depth, and elastic width.
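A minimal sketch of the weight sharing behind "elastic kernel size": smaller kernels reuse the center crop of the largest kernel, so every sub-network shares the super-network weights. This is a simplification — Once-for-All additionally applies a learned linear transformation to the cropped kernel, which is omitted here.

```python
# Share one 7x7 super-kernel across 3x3 / 5x5 / 7x7 sub-networks
# by taking a centered slice of the weight tensor.

import numpy as np

def crop_kernel(weight, k):
    """weight: (out_c, in_c, K, K) super-kernel; return the centered k x k slice."""
    K = weight.shape[-1]
    start = (K - k) // 2
    return weight[:, :, start:start + k, start:start + k]

# Dummy 7x7 super-kernel with one input and one output channel:
super_kernel = np.arange(7 * 7, dtype=np.float32).reshape(1, 1, 7, 7)
print(crop_kernel(super_kernel, 3).shape)  # (1, 1, 3, 3)
```

Because the 3×3 and 5×5 kernels are literally views into the 7×7 weights, updating any sampled sub-network updates the shared super network — which is what makes small sub-networks "nested" in large ones.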