  1. Efficient and Scalable Deep Learning: Automated and Federated. Ligeng Zhu

  2. Brief Bio ● Born in Taizhou, Zhejiang Province, China. ● Entered Zhejiang University to study CS. ● Dual-degree program at Simon Fraser University, also majoring in CS. ● Interned at TuSimple in the summer of 2017; loved the weather in San Diego. ● Visited MIT (host: Song Han) in 2018-19. ● Data Scientist at Intel AI Labs.

  3. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. Han Cai, Ligeng Zhu, Song Han

  4. History of CNN Architectures

  5. Generalization vs. Specialization • Previously, people tended to design a single efficient CNN for all platforms and all datasets (ResNet, Inception, DenseNet, MobileNet, ShuffleNet). • But different datasets in fact have different features, e.g., object size, scale, rotation. • And different platforms in fact have different properties, e.g., degree of parallelism, cache size, number of PEs, memory bandwidth. • Machine learning wants generalization; hardware efficiency needs specialization. A single generalized model for specialized hardware is not ideal!

  6. Case-by-Case Design Is Expensive! Different platforms, different datasets.

  7. From Manual Design to Automatic Design. Manual architecture design uses human expertise: ResNet / DenseNet / Inception / … Automatic architecture search uses machine learning: reinforcement learning / Monte Carlo / …

  8. From General Design to Specialized CNN. Previous paradigm: one CNN for all platforms (ResNet, Inception, DenseNet, MobileNet, ShuffleNet). Our work: ProxylessNAS customizes a CNN for each platform.

  9. Design Automation for Hardware-Efficient Nets + ProxylessNAS. Pipeline: design efficient neural networks, train, deploy. Instead of requiring a machine learning expert and a hardware expert, Hardware-Centric AutoML allows non-experts to efficiently design neural network architectures with a push-button solution that runs fast on specific hardware.

  10. Conventional NAS: Computationally Expensive. A learner proposes an architecture, a child network is trained to measure its accuracy, and the accuracy is used to update the learner. This is VERY EXPENSIVE: • NASNet: 48,000 GPU hours ≈ 5 years on a single GPU. • DARTS: 100 GB of GPU memory ≈ 9 times that of a modern GPU.
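To make the loop concrete, here is a minimal runnable sketch of the conventional NAS procedure described above. It is not the actual NASNet or ENAS code: the search space is a toy per-layer choice, train_and_evaluate() is a stub, and the learner update is simplified to keeping the best architecture seen so far (effectively random search). The structure is the point: every proposed architecture must be trained from scratch to obtain its reward, which is where the tens of thousands of GPU hours go.

```python
import random

# Toy stand-in for the conventional NAS loop (illustrative placeholders only).

SEARCH_SPACE = ["conv3x3", "conv5x5", "conv7x7", "skip"]  # per-layer choices
NUM_LAYERS = 4

def sample_architecture():
    # The learner proposes a candidate architecture (here: uniformly at random).
    return tuple(random.choice(SEARCH_SPACE) for _ in range(NUM_LAYERS))

def train_and_evaluate(arch):
    # Placeholder for the expensive step: in real NAS this trains a full child
    # network from scratch and returns its validation accuracy.
    return random.random()

best_arch, best_acc = None, -1.0
for _ in range(100):                    # each round = one full child training
    arch = sample_architecture()
    acc = train_and_evaluate(arch)      # VERY expensive in practice
    if acc > best_acc:                  # simplified "update the learner"
        best_arch, best_acc = arch, acc
print(best_arch, best_acc)
```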

  11. Conventional NAS: Proxy-Based. A learner searches an architecture on a proxy task, and the resulting architecture is then transferred to the target task and hardware. Previous work therefore has to rely on proxy tasks: • CIFAR-10 -> ImageNet • small architecture space (e.g., low depth) -> large architecture space • fewer training epochs -> full training. Limitations of proxies: • suboptimal for the target task • blocks are forced to share the same structure • cannot optimize for specific hardware.

  12. Our Work: Proxyless, Save GPU Hours by 200x. Instead of searching on a proxy task and then transferring, the learner updates the architecture directly on the target task and hardware. Goal: directly learn architectures on the target task and hardware, while allowing all blocks to have different structures. We achieve this by 1. reducing the cost of NAS (GPU hours and memory) to the same level as regular training, and 2. incorporating hardware feedback (e.g., latency) into the search process.
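As a hedged illustration of point 2, the sketch below shows one way a per-op latency lookup table, measured on the target hardware, can make expected latency a differentiable term in the search objective. The table values, lambda_lat, and function names are assumptions for illustration, not the exact formulation or constants from the paper.

```python
import torch
import torch.nn.functional as F

# Sketch: latency feedback as a differentiable loss term via a lookup table.
# latency_table holds one measured latency per candidate op (illustrative).

def expected_latency(arch_params, latency_table):
    # Probability of selecting each candidate op in one block.
    probs = F.softmax(arch_params, dim=-1)
    # Expected block latency = probability-weighted sum of measured latencies.
    return (probs * latency_table).sum()

def search_loss(logits, targets, arch_params, latency_table, lambda_lat=0.1):
    ce = F.cross_entropy(logits, targets)
    lat = expected_latency(arch_params, latency_table)
    # Latency regularization term (the "LL" loss mentioned in the results slide).
    return ce + lambda_lat * lat
```

Summing the expected latencies of all blocks gives a differentiable estimate of the whole model's latency on the target device.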

  13. Model Compression ideas applied to Neural Architecture Search: pruning (save GPU hours) and binarization (save GPU memory).

  14. Save GPU Hours. Stand on the shoulders of giants: build a cumbersome over-parameterized network containing all candidate paths, then prune redundant paths based on the architecture parameters. This simplifies NAS to a single training run of the over-parameterized network, with no meta-controller.
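A minimal PyTorch-style sketch of one such over-parameterized block, assuming a toy candidate set (three convolutions and a skip connection, not the exact ProxylessNAS search space): every candidate path is instantiated, one architecture parameter is attached per path, and after the search only the strongest path is kept.

```python
import torch
import torch.nn as nn

# One over-parameterized block: all candidate paths plus per-path architecture
# parameters (alpha). Candidate set is illustrative only.

class OverParameterizedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Conv2d(channels, channels, 7, padding=3),
            nn.Identity(),                              # "skip" candidate
        ])
        # One learnable architecture parameter per candidate path.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def prune(self):
        # After the search converges, keep only the path with the largest alpha.
        best = int(self.alpha.argmax())
        return self.candidates[best]
```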

  15. Save GPU Memory. Binarize the architecture parameters and allow only one path of activations to be active in memory at run time. We propose gradient-based and RL-based methods to update the binarized parameters. The memory footprint is thereby reduced from O(N) to O(1).
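Continuing the sketch of the over-parameterized block above, the forward pass below binarizes the path choice: it samples a single candidate from softmax(alpha), so only one path's activations are materialized at a time. The machinery for updating the binarized gates (the gradient-based and RL variants mentioned on this slide) is deliberately omitted; this only illustrates why activation memory drops from O(N) to O(1).

```python
import torch
import torch.nn.functional as F

# Binarized forward pass for the OverParameterizedBlock sketched earlier:
# instead of executing all N candidates and summing them (O(N) activation
# memory), sample one path according to softmax(alpha) and run only that one
# (O(1) activation memory). Gradient handling for the binary gate is omitted.

def forward_one_path(block, x):
    probs = F.softmax(block.alpha, dim=-1)              # path probabilities
    idx = int(torch.multinomial(probs.detach(), 1))     # binarize: pick one path
    return block.candidates[idx](x)                     # only this path is run
```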

  16. Results: ProxylessNAS on CIFAR-10 • Directly explores a huge space: 54 distinct blocks and a correspondingly large number of possible architectures. • State-of-the-art test error with 6× fewer parameters (compared to AmoebaNet-B).

  17. Results: ProxylessNAS on ImageNet, Mobile Platform • With >74.5% top-1 accuracy, ProxylessNAS is 1.8× faster than MobileNetV2, the current industry standard.

  18. Results: ProxylessNAS on ImageNet, Mobile Platform

  | Model | Top-1 (%) | Latency | Hardware Aware | No Proxy | No Repeat | Search Cost (GPU hours) |
  |---|---|---|---|---|---|---|
  | MobileNetV1 (manually designed) | 70.6 | 113 ms | - | - | x | - |
  | MobileNetV2 (manually designed) | 72.0 | 75 ms | - | - | x | - |
  | NASNet-A (NAS) | 74.0 | 183 ms | x | x | x | 48,000 |
  | AmoebaNet-A (NAS) | 74.4 | 190 ms | x | x | x | 75,600 |
  | MnasNet (NAS) | 74.0 | 76 ms | yes | x | x | 40,000 |
  | ProxylessNAS-G | 71.8 | 83 ms | yes | yes | yes | 200 |
  | ProxylessNAS-G + LL | 74.2 | 79 ms | yes | yes | yes | 200 |
  | ProxylessNAS-R | 74.6 | 78 ms | yes | yes | yes | 200 |
  | ProxylessNAS-R + mixup | 75.1 | 78 ms | yes | yes | yes | 200 |

  ProxylessNAS achieves state-of-the-art accuracy (%) on ImageNet (under a mobile latency constraint of ≤ 80 ms) with 200× lower search cost in GPU hours. "LL" indicates the latency regularization loss.

  19. Results: ProxylessNAS on ImageNet, GPU Platform. When targeting the GPU platform, accuracy is further improved to 75.1%, 3.1% higher than MobileNetV2.

  20. The History of Architectures: (1) the history of finding efficient mobile models; (2) the history of finding efficient CPU models; (3) the history of finding efficient GPU models.

  21. Detailed Architectures [figure: layer-by-layer diagrams of the searched networks] (1) Efficient mobile architecture found by ProxylessNAS. (2) Efficient CPU architecture found by ProxylessNAS. (3) Efficient GPU architecture found by ProxylessNAS.

  22. ProxylessNAS for Hardware Specialization

  23. Achievements of Design Automation • First place in the Visual Wake-up Word Challenge @ CVPR'19, with <250 KB model size, <250 KB peak memory usage, and <60M MACs. • Third place in the classification track of LPIRC @ CVPR: image classification within 30 ms latency on a Pixel 2 phone. Both powered by design automation!

  24. Embrace Open Source. All code is now public at https://github.com/MIT-HAN-LAB ● HAQ: Hardware-Aware Automated Quantization [CVPR 2019, Oral] ● AMC: AutoML for Model Compression [ECCV 2018] ● Proxyless Neural Architecture Search [ICLR 2019]

  25. AI hardware landscape (cloud vs. edge, training vs. inference): Nvidia P4, Nvidia V100, Google TPU v1, Google TPU v2/v3, Microsoft Brainwave, Intel Nervana NNP, Xilinx Deephi Descartes, Baidu Kunlun, Alibaba Ali-NPU, Nvidia DLA, Google Edge TPU, Apple Bionic, Huawei Kirin, Xilinx Deephi Aristotle.

  26. Distributed Training Across the World. Ligeng Zhu, Yao Lu, Hangzhou Lin, Yujun Lin, Song Han

  27. Conventional Distributed Training ● [2011] Niu et al. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. ● [2012] Google. Large Scale Distributed Deep Networks. ● [2012] Ahmed et al. Scalable Inference in Latent Variable Models. ● [2014] Li et al. Scaling Distributed Machine Learning with the Parameter Server. ● [2017] Facebook. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Almost all of them are performed within a single cluster.

  28. Why distribute within a cluster? ● Scalability: network bandwidth > 10 Gbps, network latency < 1 ms. ● Easy to manage: hardware failures, system upgrades.

  29. Why distribute between clusters? ● Customization: e.g., different users have different tones of voice for speech recognition (Amazon Alexa, Apple HomePod, Google Home). ● Security: data cannot leave the device because of security and regulation.

  30. Limitation on Scalability (across clusters). Bandwidth: • InfiniBand: up to 100 Gb/s • normal Ethernet: up to 10 Gb/s • mobile network: 100 Mb/s (4G), 1 Gb/s (5G). Latency: • InfiniBand: < 0.002 ms • normal Ethernet: ~0.200 ms • mobile network: ~50 ms (4G) / ~10 ms (5G). What we need: • ResNet-50: 24.37 MB, 0.3 s/iter (V100) • at least 600 Mb/s bandwidth and 1 ms latency.
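A quick back-of-the-envelope check of the bandwidth requirement above, using only the figures quoted on the slide (24.37 MB exchanged once per 0.3 s iteration):

```python
# Required bandwidth if ~24.37 MB must be exchanged every 0.3 s iteration.
model_mb = 24.37          # MB per iteration (from the slide)
iter_time_s = 0.3         # seconds per iteration on a V100 (from the slide)
required_mbps = model_mb * 8 / iter_time_s
print(f"required bandwidth ~ {required_mbps:.0f} Mb/s")  # ~650 Mb/s, hence "at least 600 Mb/s"
```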

  31. Limitation on Scalability (across clusters) • Bandwidth can always be improved by hardware upgrades (wired: fiber; wireless: 5G) or gradient sparsification (e.g., DGC, one-bit). • Latency is hard to reduce because of physical laws: Shanghai to Boston is 11,725 km; even at the speed of light, it still takes 78 ms.
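The 78 ms figure is consistent with a round trip at the speed of light in vacuum over the quoted distance; a quick check:

```python
# Propagation delay Shanghai <-> Boston at the speed of light in vacuum.
distance_km = 11_725
speed_of_light_km_s = 299_792
one_way_ms = distance_km / speed_of_light_km_s * 1000
print(f"one way ~ {one_way_ms:.0f} ms, round trip ~ {2 * one_way_ms:.0f} ms")  # ~39 ms / ~78 ms
```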

  32. Conventional algorithms suffer from high latency.

  33. Scalability degrades quickly with latency. [Plots: what we have vs. what we need]
