Neural Architecture
Ligeng Zhu, May 4th
The Blooming of CNNs
Bypass Connection

x_{\ell+1} = F_\ell(x_\ell) + x_\ell
           = F_\ell(x_\ell) + F_{\ell-1}(x_{\ell-1}) + x_{\ell-1}
           = F_\ell(x_\ell) + F_{\ell-1}(x_{\ell-1}) + ... + F_1(x_1)
           = y_{\ell-1} + y_{\ell-2} + ... + y_1.

Direct gradient flow between any two layers makes the network easy to optimize.
Cons of the Residual Connection

• Information loss during summation (especially in very deep networks):
  3 + 10 + 15 = 28 is easy, but recovering 28 = ? + ? + ? is difficult.

CIFAR-10 results:
  Model      Params   Error (%)
  Res-32     0.46M    7.51
  Res-44     0.66M    7.17
  Res-56     0.85M    6.97
  Res-110    1.7M     6.43
  Res-1202   19.4M    7.93
Improvements over the Residual Connection

• Avoid information loss by replacing the sum with concatenation:
  Sum:    3 + 10 + 15 = 28 (easy), but 28 = ? + ? + ? (difficult)
  Concat: concat(3, 10, 15) = [3, 10, 15], and [3, 10, 15] = concat(3, 10, 15) (nothing is lost)

# ResNet pre-activation
def ResidualBlock(x):
    x1 = BN_ReLU_Conv(x)
    x2 = BN_ReLU_Conv(x1)
    return x + x2

for i in range(N):
    model.add(ResidualBlock)

# DenseNet-BC structure
def DenseBlock(x):
    x1 = BN_ReLU_Conv(x)
    x2 = BN_ReLU_Conv(x1)
    return Concat([x, x2])

for i in range(N):
    model.add(DenseBlock)

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700-4708).
DenseNet

• Concat is more parameter-efficient than sum.

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700-4708).
Cons of Concatenation

• Parameters explode in deep networks: O(n^2) growth with depth
• Inputs to deeper layers become highly redundant

  Model          Params
  Dense-40-12    1.0M
  Dense-100-12   7.0M
  Dense-100-24   27.2M
  Dense-200-12   OOM
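To make the O(n^2) claim concrete, here is a back-of-the-envelope sketch (my own illustration, not from the slides): if every layer is a single 3x3 convolution with growth rate k and takes all previous outputs as input, the input width grows linearly with depth, so the total parameter count grows roughly quadratically.

# Rough parameter count for a densely concatenated block.
# Assumptions (hypothetical, for illustration): one 3x3 conv per layer,
# growth rate k, no bottleneck or compression layers.
def dense_conv_params(num_layers, k=12, in_channels=16, ksize=3):
    total, width = 0, in_channels
    for _ in range(num_layers):
        total += width * k * ksize * ksize  # this layer's conv weights
        width += k                          # concat adds k input channels for later layers
    return total

for n in (40, 100, 200):
    print(n, dense_conv_params(n))          # grows roughly as O(n^2) with depth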
Rethinking ResNet and DenseNet

• Features are densely aggregated in both ResNet and DenseNet.

ResNet (sum):
x_{\ell+1} = F_\ell(x_\ell) + x_\ell
           = F_\ell(x_\ell) + F_{\ell-1}(x_{\ell-1}) + x_{\ell-1}
           = F_\ell(x_\ell) + F_{\ell-1}(x_{\ell-1}) + ... + F_1(x_1)
           = y_{\ell-1} + y_{\ell-2} + ... + y_1.

DenseNet (concat, ⊕):
x_{\ell+1} = F_\ell(x_\ell) ⊕ x_\ell
           = F_\ell(x_\ell) ⊕ F_{\ell-1}(x_{\ell-1}) ⊕ x_{\ell-1}
           = F_\ell(x_\ell) ⊕ F_{\ell-1}(x_{\ell-1}) ⊕ ... ⊕ F_1(x_1)
           = y_{\ell-1} ⊕ y_{\ell-2} ⊕ ... ⊕ y_1.
Variations of dense aggregation (how to aggregate): ResNet, DenseNet, Mixed Link, Dual Path.
Sum and Concat

• ResNet and DenseNet are both dense aggregation structures.
• Summation is powerful for gradient flow, BUT
  • information loss leads to parameter inefficiency.
• Concatenation is a better way to aggregate, BUT
  • parameters blow up and inputs become redundant.
• Is there a way to get both advantages without introducing new problems?
Sparsely Aggregated Convolutional Networks

• Instead of "how to aggregate", consider "what to aggregate".
• Only gather layers at exponential offsets.

Zhu, L., Deng, R., Maire, M., Deng, Z., Mori, G., & Tan, P. (2018). Sparsely aggregated convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 186-201).
Params and Gradient Flow Analysis

• Total number of skip connections (params):
  log_c 1 + log_c 2 + ... + log_c N = log_c N! ≈ log_c N^N = O(N lg N)
• Gradient flow between any two layers:
  an offset of N layers takes at most log_c N × (c − 1) steps
• For example, with base c = 2:
  an offset of 23 -> 10111 in binary -> 4 steps
  an offset of 14 -> 1110 in binary -> 3 steps

Zhu, L., Deng, R., Maire, M., Deng, Z., Mori, G., & Tan, P. (2018). Sparsely aggregated convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 186-201).
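A minimal sketch (my own code, with hypothetical helper names) of the two quantities above: the exponentially offset predecessors that layer i aggregates, and the number of skip-connection hops needed to cover an offset of n layers, which equals the digit sum of n in base c (so 23 = 10111 in binary needs 4 hops, 14 = 1110 needs 3).

def sparse_predecessors(i, c=2):
    # Layers aggregated by layer i: offsets 1, c, c^2, ... (exponential).
    preds, offset = [], 1
    while i - offset >= 0:
        preds.append(i - offset)
        offset *= c
    return preds

def gradient_hops(n, c=2):
    # Shortest chain of skip connections spanning an offset of n layers:
    # equal to the sum of the digits of n written in base c.
    steps = 0
    while n:
        steps += n % c
        n //= c
    return steps

print(sparse_predecessors(9))                # [8, 7, 5, 1]
print(gradient_hops(23), gradient_hops(14))  # 4 3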
Dense Aggregation vs. Sparse Aggregation

(a) Dense aggregation (ResNet & DenseNet): each layer takes all previous outputs.
(b) Sparse aggregation (our proposed topology, SparseNet): each layer takes only the outputs at exponential offsets (e.g., i-1, i-2, i-4, i-8, ...).

Zhu, L., Deng, R., Maire, M., Deng, Z., Mori, G., & Tan, P. (2018). Sparsely aggregated convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 186-201).
Better parameter utilization

Zhu, L., Deng, R., Maire, M., Deng, Z., Mori, G., & Tan, P. (2018). Sparsely aggregated convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 186-201).
Better Parameter-Performance Curve

Zhu, L., Deng, R., Maire, M., Deng, Z., Mori, G., & Tan, P. (2018). Sparsely aggregated convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 186-201).
Remaining Question

• What if we let the network choose for itself what to aggregate?
From Manual Design to Architecture Search

Manual Architecture Design (driven by human expertise):
• VGGNets
• Inception models
• ResNets
• DenseNets
• ...

Automatic Architecture Search (driven by machine learning and computational resources):
• Reinforcement Learning
• Neuro-evolution
• Bayesian Optimization
• Monte Carlo Tree Search
• ...
NASNet
Everything is good, except the cost.

Learning Transferable Architectures for Scalable Image Recognition:
4 days × 24 hours × 500 GPUs = 48,000 GPU hours
Common Workaround: Proxies

• Search on a small dataset, then transfer to larger one(s), e.g., CIFAR -> ImageNet.
• Search for a single block (or a few blocks), then repeat it to build the full network.
• Train for only a few epochs instead of fully training each model.

Proxies lead to sub-optimal architectures!
Exploration on Efficient NAS
Efficient Architecture Search by Network Transformation

• Net2Wider
• Net2Deeper
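As a hedged illustration of the Net2Deeper idea (my own sketch, not the authors' code): inserting a new layer initialized to the identity leaves the network's function unchanged, so training can continue from the current weights. The sketch assumes plain fully connected ReLU layers.

import numpy as np

def forward(layers, x):
    # layers is a list of (W, b); each layer computes relu(W @ x + b).
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)
    return x

def net2deeper(layers):
    # Append an identity-initialized layer: relu(I @ h + 0) == h for h >= 0,
    # so the deeper network computes exactly the same function.
    d = layers[-1][0].shape[0]
    return layers + [(np.eye(d), np.zeros(d))]

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 4)), np.zeros(8))]
x = rng.standard_normal(4)
assert np.allclose(forward(layers, x), forward(net2deeper(layers), x))  # function preserved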
Efficient Architecture Search by Network Transformation

• Instead of sampling a random layer, sample an equivalent (function-preserving) transformation of the current network.
Exploration on Efficient NAS
Understanding and Simplifying One-Shot Architecture Search

1. Train a large network containing all candidate operations.
2. Sample a path and validate its performance.
3. Repeat step 2.
4. Choose the path with the highest performance.
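A sketch of that loop with placeholder pieces (the candidate operation names and the validate stub are my own; in practice step 1 trains a weight-sharing supernet and validation runs the sampled sub-network on held-out data):

import random

CANDIDATE_OPS = ["conv3x3", "conv5x5", "identity", "maxpool"]
NUM_LAYERS = 6

def validate(path):
    # Placeholder for evaluating the sampled sub-network with shared weights.
    return random.random()

# Step 1 (not shown): train the over-parameterized network with all candidates.
best_path, best_score = None, float("-inf")
for _ in range(200):                                        # steps 2-3: sample paths
    path = [random.choice(CANDIDATE_OPS) for _ in range(NUM_LAYERS)]
    score = validate(path)
    if score > best_score:                                  # step 4: keep the best path
        best_path, best_score = path, score
print(best_path, best_score)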
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
Han Cai, Ligeng Zhu, Song Han
Massachusetts Institute of Technology
From General Design to Specialized CNN

Previous paradigm: one CNN for all datasets (ResNet, Inception, DenseNet, MobileNet, ShuffleNet).
Our work: customize a CNN for each dataset (ProxylessNAS).
From General Design to Specialized CNN

Previous paradigm: one CNN for all platforms (ResNet, Inception, DenseNet, MobileNet, ShuffleNet).
Our work: customize a CNN for each platform (ProxylessNAS).
Conventional NAS: Computationally Expensive

[Pipeline: Architecture Learner -> Proxy Task -> transfer -> Target Task & Hardware, with architecture updates]

Current neural architecture search (NAS) is VERY EXPENSIVE:
• NASNet: 48,000 GPU hours ≈ 5 years on a single GPU
• DARTS: 100 GB of GPU memory* ≈ 9× the memory of a modern GPU
• ...
*if searching directly on ImageNet, as we do

Therefore, previous work has to rely on proxy tasks:
• CIFAR-10 -> ImageNet
• Small architecture space (e.g., low depth) -> large architecture space
• Training for a few epochs -> full training
Conventional NAS: Proxy-Based

[Pipeline: Architecture Learner -> Proxy Task -> transfer -> Target Task & Hardware, with architecture updates]

Proxies:
• CIFAR-10 -> ImageNet
• Small architecture space (e.g., low depth) -> large architecture space
• Training for a few epochs -> full training

Limitations of proxies:
• Sub-optimal for the target task
• Blocks are forced to share the same structure
• Cannot optimize for specific hardware
Our Work: Proxyless, Saving GPU Hours by 200×

[Pipeline: Architecture Learner -> Target Task & Hardware directly, no proxy, with architecture updates]

Goal: directly learn architectures on the target task and hardware, while allowing all blocks to have different structures.

We achieve this by:
1. Reducing the cost of NAS (GPU hours and memory) to the same level as regular training.
2. Incorporating hardware feedback (e.g., latency) into the search process.
To Make NAS 200× More Efficient

• Google, Facebook, NVIDIA: good weapons, i.e., high-end GPU clusters and many engineers.
• AI research institutes: poorer weapons (fewer GPUs) but smart students; we rely on a more efficient algorithm.
Model compression ideas applied to neural architecture search:
• Pruning -> save GPU hours
• Binarization -> save GPU memory
Save GPU Hours

• Stand on the shoulders of giants: build a cumbersome, over-parameterized network containing all candidate paths.
• Prune redundant paths based on the architecture parameters.
• This simplifies NAS to a single training run of the over-parameterized network; no meta-controller is needed.
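A hedged PyTorch-style sketch of the over-parameterized block (my own naming, not the released implementation): all candidate operations run in parallel, each weighted by a learnable architecture parameter, and after the search only the strongest path is kept.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # All candidate paths in parallel, weighted by architecture parameters.
    def __init__(self, candidates):
        super().__init__()
        self.paths = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # architecture params

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.paths))

    def prune(self):
        # Keep only the path with the largest architecture weight.
        return self.paths[int(self.alpha.argmax())]

candidates = [nn.Conv2d(16, 16, k, padding=k // 2) for k in (1, 3, 5)] + [nn.Identity()]
block = MixedOp(candidates)
y = block(torch.randn(2, 16, 32, 32))   # weighted sum over all candidate paths
best_op = block.prune()                 # after search: a single remaining path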
Save GPU Memory

• Binarize the architecture parameters and allow only one path of activations to be active in memory at run time.
• We propose gradient-based and RL-based methods to update the binarized parameters.
• The memory footprint is thereby reduced from O(N) to O(1).
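A minimal sketch of the memory-saving trick (assumed names; the gradient estimation for the binary gates is omitted): sample a single path according to the softmax over architecture parameters, so only that path's activations are materialized at run time.

import torch
import torch.nn.functional as F

def sample_binary_gates(alpha):
    # One-hot gate over candidate paths, sampled from softmax(alpha).
    probs = F.softmax(alpha, dim=0)
    index = torch.multinomial(probs, num_samples=1)
    gates = torch.zeros_like(probs)
    gates[index] = 1.0                      # exactly one path is active
    return gates

def forward_sampled(x, paths, alpha):
    gates = sample_binary_gates(alpha)
    active = int(gates.argmax())
    return paths[active](x)                 # only this path runs; memory is O(1) in N

# e.g. forward_sampled(torch.randn(2, 16, 32, 32), candidates, torch.zeros(4)),
# using the candidate list from the earlier MixedOp sketch.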
Search Cost
FLOPs != Latency

• Models with only a ~10% difference in FLOPs can show a ~60% difference in latency.
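One way such hardware feedback can enter the search (a sketch under my own assumptions; the latency numbers and helper names below are hypothetical): measure each candidate op once on the target device, store the results in a lookup table, and add the expected latency of the architecture distribution as a differentiable penalty.

import torch
import torch.nn.functional as F

# Hypothetical per-op latencies (ms) measured once on the target device.
LATENCY_MS = {"conv3x3": 4.1, "conv5x5": 7.8, "identity": 0.1, "maxpool": 1.2}

def expected_latency(alpha, op_names):
    # Expected latency of one block: sum_i softmax(alpha)_i * latency_i.
    probs = F.softmax(alpha, dim=0)
    latencies = torch.tensor([LATENCY_MS[name] for name in op_names])
    return (probs * latencies).sum()        # differentiable w.r.t. alpha

alpha = torch.zeros(4, requires_grad=True)
penalty = expected_latency(alpha, ["conv3x3", "conv5x5", "identity", "maxpool"])
penalty.backward()                          # latency gradients flow into alpha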