MCUNet: Tiny Deep Learning on IoT Devices - PowerPoint PPT Presentation


  1. MCUNet: Tiny Deep Learning on IoT Devices. Ji Lin (MIT), Song Han (MIT), Wei-Ming Chen (MIT / National Taiwan University), Yujun Lin (MIT), John Cohn (MIT-IBM Watson AI Lab), Chuang Gan (MIT-IBM Watson AI Lab). NeurIPS 2020 (spotlight)

  2. Background: The Era of AIoT on Microcontrollers (MCUs) • Low-cost, low-power

  3. Background: The Era of AIoT on Microcontrollers (MCUs) • Low-cost, low-power • Rapid growth [Chart: #Units (Billion) of MCUs shipped per year, 2012-2019, later years forecast]

  4. Background: The Era of AIoT on Microcontrollers (MCUs) • Low-cost, low-power • Rapid growth [Chart: #Units (Billion) of MCUs shipped per year, 2012-2019, later years forecast] • Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home, …

  5.-9. Challenge: Memory Too Small to Hold DNN (table built up across slides 5-9):

                             Cloud AI    Mobile AI    Tiny AI
      Memory (Activation)    16GB        4GB          320kB  (13,000x smaller)
      Storage (Weights)      ~TB/PB      256GB        1MB    (50,000x smaller)

  10. Challenge: Memory Too Small to Hold DNN. We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.
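The table above implies two independent constraints: peak activation must fit SRAM, and weights must fit Flash. A minimal sketch of that deployability check (the Tiny AI budgets come from the slide; the example model numbers are purely illustrative):

```python
# Tiny AI budgets from the slide: 320 kB SRAM for activations, 1 MB Flash
# for weights.
SRAM_BUDGET_KB = 320      # peak activation memory (read/write)
FLASH_BUDGET_KB = 1024    # model weights (read-only)

def fits_mcu(peak_activation_kb: float, weight_kb: float) -> bool:
    """A model deploys only if BOTH constraints hold, which is why
    shrinking weights alone (classic model compression) is not enough."""
    return peak_activation_kb <= SRAM_BUDGET_KB and weight_kb <= FLASH_BUDGET_KB

# Illustrative numbers: small weights but activation-bound -> still fails.
print(fits_mcu(peak_activation_kb=1600, weight_kb=860))   # False: SRAM bound
print(fits_mcu(peak_activation_kb=310, weight_kb=740))    # True: fits both
```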

  11. Existing efficient networks only reduce model size but NOT activation size! [Chart: at ~70% ImageNet top-1, MobileNetV2-0.75 cuts Param (MB) by 4.6x vs. ResNet-18 but Peak Activation (MB) by only 1.8x; MCUNet shown for comparison]

  12.-13. Challenge: Memory Too Small to Hold DNN [Chart: peak memory (kB) vs. the 320kB constraint: ResNet-50 is 23x over, MobileNetV2 22x, MobileNetV2 (int8) 5x]

  14. Challenge: Memory Too Small to Hold DNN [Same chart, adding MCUNet, which fits under the 320kB constraint]

  15. MCUNet: System-Algorithm Co-design

  16. MCUNet: System-Algorithm Co-design. (a) Search a NN model on an existing library via NAS (e.g., ProxylessNAS, MnasNet)

  17. MCUNet: System-Algorithm Co-design. (a) Search a NN model on an existing library (e.g., ProxylessNAS, MnasNet); (b) tune the deep learning library given a NN model (e.g., TVM)

  18.-19. MCUNet: System-Algorithm Co-design. (c) MCUNet: system-algorithm co-design, combining TinyNAS (efficient neural architecture) with TinyEngine (efficient compiler / runtime)

  20.-22. TinyNAS: Two-Stage NAS for Tiny Memory Constraints. Search space design is crucial for NAS performance, yet there is no prior expertise on MCU model design. [Diagram: Full Network Space -> (Memory/Storage Constraints) -> Optimized Search Space -> (Model Specialization)]

  23. TinyNAS: (1) Automated search space optimization Revisit ProxylessNAS search space: S = kernel size × expansion ratio × depth

  24.-26. TinyNAS: (1) Automated search space optimization. Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth. [Diagram: inverted residual block (dw / pw1 / pw2) with kernel size k ∈ {3, 5, 7}, expansion ratio e ∈ {2, 4, 6}, depth d ∈ {2, 3, 4}]

  27. TinyNAS: (1) Automated search space optimization. Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth. Out of memory!
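To see why such blocks run out of memory, here is a rough, hypothetical peak-activation estimate for one inverted residual block at int8. The tensor-liveness model (input + expanded map, then expanded map + output) is a simplification for illustration, not TinyEngine's actual memory scheduler:

```python
def mbconv_peak_kb(h: int, w: int, c_in: int, c_out: int, e: int) -> float:
    """Crude peak-memory model of an MBConv block (pw1 -> dw -> pw2)."""
    inp = h * w * c_in              # block input
    mid = h * w * c_in * e          # expanded feature map after pw1 / dw
    out = h * w * c_out             # block output
    # Peak = largest set of tensors alive at once: input + expanded map
    # while pw1 runs, expanded map + output while pw2 runs.
    peak = max(inp + mid, mid + out)
    return peak / 1024              # int8: 1 byte per element -> kB

# 112x112x16 input with expansion e=6: far beyond a 320 kB SRAM budget.
print(mbconv_peak_kb(112, 112, 16, 24, e=6))  # 1470.0 kB: out of memory
print(mbconv_peak_kb(56, 56, 16, 24, e=3))    # 220.5 kB: fits
```

This is why the search must jointly consider resolution, width, and expansion ratio rather than weights alone.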

  28. TinyNAS: (1) Automated search space optimization Extended search space to cover wide range of hardware capacity: S’ = kernel size × expansion ratio × depth × input resolution R × width multiplier W

  29.-31. TinyNAS: (1) Automated search space optimization. Extended search space to cover a wide range of hardware capacity: S' = kernel size × expansion ratio × depth × input resolution R × width multiplier W. Different R and W suit different hardware capacities (i.e., different optimized sub-spaces): e.g., R=224, W=1.0 vs. R=260, W=1.4.* Which R and W for each MCU (F412 / F743 / H746 / … with 256kB / 320kB / 512kB / … SRAM)? (* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20)
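The extended space can be sketched as enumerating (W, R) pairs, then filtering by a memory proxy per device. The filter below is a deliberately crude stand-in that merely grows with both R and W; the candidate values mirror the slide's examples, everything else is assumed:

```python
# Candidate sub-spaces: each (width multiplier W, resolution R) pair.
candidate_subspaces = [(w, r) for w in (0.3, 0.4, 0.5, 0.6, 0.7, 1.0, 1.4)
                              for r in (96, 112, 128, 144, 160, 224, 260)]

def plausible(w: float, r: int, sram_kb: int) -> bool:
    """Toy memory proxy (NOT the paper's method): an int8 activation-size
    stand-in that scales with resolution and width must fit SRAM.
    A real system would profile every layer of every candidate model."""
    return w * r * r * 3 / 1024 <= sram_kb

# Different MCUs (e.g. 256kB / 320kB / 512kB SRAM) admit different sub-spaces.
for sram in (256, 320, 512):
    ok = [(w, r) for (w, r) in candidate_subspaces if plausible(w, r, sram)]
    print(sram, len(ok))
```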

  32. TinyNAS: (1) Automated search space optimization Analyzing FLOPs distribution of satisfying models in each search space: Larger FLOPs -> Larger model capacity -> More likely to give higher accuracy

  33.-35. TinyNAS: (1) Automated search space optimization. Analyzing the FLOPs distribution of satisfying models in each search space: larger FLOPs -> larger model capacity -> more likely to give higher accuracy. [Chart: cumulative probability vs. FLOPs (M) of models satisfying the 320kB constraint, per search space labeled "width-res. | mFLOPs", e.g. w0.3-r160 | 32.5 and w0.4-r144 | 46.9; at p=0.8 these reach 32.3M and 45.4M FLOPs]

  36.-37. Same analysis over more search spaces (w0.4-r112 | 32.4, w0.4-r128 | 39.3, w0.4-r144 | 46.9, w0.5-r112 | 38.3, w0.5-r128 | 46.9, w0.5-r144 | 52.0, w0.6-r112 | 41.3, w0.7-r96 | 31.4, w0.7-r112 | 38.4). A bad design space (low FLOPs at p=0.8) yields best acc 74.2%; a good design space, i.e. one likely to achieve high FLOPs under the memory constraint (50.3M at p=0.8), yields best acc 78.7%, with 76.4% for an intermediate space.
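The selection heuristic above can be sketched as follows: sample models from each candidate space, keep those satisfying the memory constraint, and prefer the space whose satisfying models have the largest FLOPs at a chosen percentile of the CDF. The FLOPs populations below are synthetic stand-ins (means match the slide's labels; everything else is assumed):

```python
import random

def flops_at_percentile(samples, p=0.8):
    """FLOPs value at cumulative probability p for one search space."""
    s = sorted(samples)
    return s[int(p * (len(s) - 1))]

random.seed(0)
# Synthetic FLOPs (M) of memory-satisfying models in two spaces:
bad_space  = [random.gauss(32.5, 3.0) for _ in range(1000)]  # like w0.3-r160
good_space = [random.gauss(46.9, 3.0) for _ in range(1000)]  # like w0.4-r144

# Prefer the space with higher FLOPs at p=0.8: more model capacity,
# hence more likely to contain a high-accuracy architecture.
best = max([bad_space, good_space], key=flops_at_percentile)
print(best is good_space)  # True
```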

  38. TinyNAS: (2) Resource-constrained model specialization • One-shot NAS through weight sharing: train a super network, randomly sample multiple sub-networks (kernel size, expansion, depth), and jointly fine-tune them • Small sub-networks are nested in large sub-networks. (* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20)

  39. TinyNAS: (2) Resource-constrained model specialization • One-shot NAS through weight sharing: directly evaluate the accuracy of sub-nets sampled from the super network

  40. TinyNAS: (2) Resource-constrained model specialization • Elastic kernel size, elastic depth, elastic width
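A minimal sketch of sampling sub-networks from the elastic super network (assumed structure for illustration, not the authors' code):

```python
import random

# Elastic dimensions from the slides: kernel size, depth, width (expansion).
KERNEL_SIZES = [3, 5, 7]
EXPANSIONS = [2, 4, 6]
DEPTHS = [2, 3, 4]

def sample_subnet(n_stages: int = 5, rng=random):
    """One random sub-network config: per stage, pick a depth, then per
    block pick (kernel, expansion). Small nets are nested in large ones
    because e.g. a 3x3 kernel reuses the center of the shared 7x7 weights,
    so every sampled config shares the super network's parameters."""
    cfg = []
    for _ in range(n_stages):
        d = rng.choice(DEPTHS)
        cfg.append([(rng.choice(KERNEL_SIZES), rng.choice(EXPANSIONS))
                    for _ in range(d)])
    return cfg

random.seed(0)
subnet = sample_subnet()
print(len(subnet))                            # 5 stages
print(all(2 <= len(s) <= 4 for s in subnet))  # True
```

Evaluating many such configs directly (without retraining) is what makes the specialization stage cheap enough to repeat per MCU.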
