Soft Threshold Weight Reparameterization for Learnable Sparsity


  1. Soft Threshold Weight Reparameterization for Learnable Sparsity. Aditya Kusupati, Vivek Ramanujan*, Raghav Somani*, Mitchell Wortsman*, Prateek Jain, Sham Kakade and Ali Farhadi

  2. Motivation
  • Deep Neural Networks
    • Highly accurate
    • Millions of parameters & billions of FLOPs
    • Expensive to deploy
  • Sparsity
    • Reduces model size & inference cost
    • Maintains accuracy
    • Enables deployment on CPUs & weak single-core devices
  • Privacy preserving; billions of mobile & smart glasses devices

  3. Motivation
  • Existing sparsification methods
    • Focus on model size vs accuracy; very little on inference FLOPs
    • Global, uniform or heuristic sparsity budgets across layers
  • Example: same total parameter budget, very different inference FLOPs

                              Layer 1   Layer 2   Layer 3   Total
    Dense          # Params      20       100      1000     1120
                   FLOPs        100K      100K      50K     250K
    Sparsity -     # Params      20       100       100      220
    Method 1       FLOPs        100K      100K       5K     205K
    Sparsity -     # Params      10        10       200      220
    Method 2       FLOPs         50K       10K      10K      70K

  • Both sparse methods keep 220 parameters, yet Method 2 needs only 70K FLOPs vs Method 1's 205K: the layer-wise allocation, not the total budget alone, determines inference cost

  4. Motivation
  • Non-uniform, layer-wise sparsity budgets
    • Very hard to search in deep networks
    • Sweet spot: accuracy vs FLOPs vs sparsity
  • Existing techniques
    • Heuristics: increase FLOPs
    • RL-based search: expensive to train
  • "Can we design a robust, efficient method to learn non-uniform sparsity budgets across layers?"

  5. Overview
  • STR: Soft Threshold Reparameterization
    $\mathrm{STR}(\mathbf{W}_l, \alpha_l) = \mathrm{sign}(\mathbf{W}_l) \odot \mathrm{ReLU}(|\mathbf{W}_l| - \alpha_l)$, where $\alpha_l$ is a learnable per-layer threshold
  • Learns layer-wise non-uniform sparsity budgets
  • Same model size; better accuracy; lower inference FLOPs
  • SOTA on ResNet50 & MobileNetV1 for ImageNet-1K
  • Boosts accuracy by up to 10% in the ultra-sparse (98-99%) regime
  • Extends to structured, global & per-weight (mask-learning) sparsity

  6. Existing Methods
  • Dense-to-sparse training: Gradual Magnitude Pruning (GMP), Global Pruning/Sparsity, STR
    • Sparsity SOTA; hard to train; dense training cost
    • Uniform (GMP) or non-uniform (Global, STR) sparsity budgets
  • Sparse-to-sparse training: DSR, SNFS, RigL, etc.
    • Lower training cost
    • Uniform or heuristic non-uniform budgets (ERK); re-allocation using magnitude/gradient
  • Hybrid training: DNW & DPF
    • Non-uniform sparsity; some gains from sparse-to-sparse training

  7. STR - Method
  • Hard thresholding (HT) vs soft thresholding (ST) of a weight $x$ with threshold $\alpha$:
    $$\mathrm{HT}(x, \alpha) = \begin{cases} x & |x| > \alpha \\ 0 & |x| \le \alpha \end{cases} \qquad \mathrm{ST}(x, \alpha) = \begin{cases} x - \alpha & x > \alpha \\ 0 & |x| \le \alpha \\ x + \alpha & x < -\alpha \end{cases}$$
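A minimal NumPy sketch of the two operators above (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def hard_threshold(x, alpha):
    # HT(x, alpha): keep entries with |x| > alpha, zero out the rest
    return np.where(np.abs(x) > alpha, x, 0.0)

def soft_threshold(x, alpha):
    # ST(x, alpha) = sign(x) * ReLU(|x| - alpha): zero small entries, shrink the survivors
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

x = np.array([-1.5, -0.3, 0.0, 0.4, 2.0])
print(hard_threshold(x, 0.5))  # [-1.5  0.   0.   0.   2. ]
print(soft_threshold(x, 0.5))  # [-1.   0.   0.   0.   1.5]
```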

  8. STR - Method
  • $\mathrm{ST}(x, \alpha) = \mathrm{sign}(x) \cdot \mathrm{ReLU}(|x| - \alpha) = \mathrm{sign}(x) \cdot \mathrm{ReLU}(|x| - g(s))$
  • For an $L$-layer DNN with weights $\mathcal{W} = \{\mathbf{W}_l\}_{l=1}^{L}$, learnable parameters $\mathbf{s} = \{s_l\}_{l=1}^{L}$ and a function $g(\cdot)$:
    $$\mathcal{S}_g(\mathbf{W}_l, s_l) = \mathrm{sign}(\mathbf{W}_l) \odot \mathrm{ReLU}(|\mathbf{W}_l| - g(s_l)), \qquad \mathcal{W} \leftarrow \mathcal{S}_g(\mathcal{W}, \mathbf{s})$$
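A hedged PyTorch sketch of this reparameterization for one layer, assuming the sigmoid $g(\cdot)$ used for unstructured sparsity (names are illustrative):

```python
import torch

def str_reparam(weight: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """S_g(W_l, s_l) = sign(W_l) * ReLU(|W_l| - g(s_l)) with g = sigmoid."""
    threshold = torch.sigmoid(s)                      # g(s_l): always positive, learnable
    return torch.sign(weight) * torch.relu(weight.abs() - threshold)

# Both W_l and s_l receive gradients through the forward pass,
# so each layer's threshold (and hence its sparsity) is learnt end to end.
W = torch.randn(64, 128, requires_grad=True)
s = torch.tensor(-10.0, requires_grad=True)           # sigmoid(-10) ~ 0: start nearly dense
W_sparse = str_reparam(W, s)
```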

  9. STR - Training
    $$\min_{\mathcal{W}, \mathbf{s}} \; \mathcal{L}\big(\mathcal{S}_g(\mathcal{W}, \mathbf{s}), \mathcal{D}\big) + \lambda \sum_{l=1}^{L} \big(\|\mathbf{W}_l\|_2^2 + s_l^2\big)$$
  • Regular training with the reparameterized weights $\mathcal{S}_g(\mathcal{W}, \mathbf{s})$
  • Same weight-decay parameter ($\lambda$) for both $\mathcal{W}$ and $\mathbf{s}$
    • Controls the overall sparsity
  • Initialize $\mathbf{s}$ so that $g(s_l) \approx 0$
    • Finer sparsity and dense-training control
  • Choice of $g(\cdot)$
    • Unstructured sparsity: sigmoid
    • Structured sparsity: exponential
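A sketch of how this objective maps onto a standard training setup: one optimizer, one weight-decay value covering both the weights and the threshold parameters. STRLinear is a hypothetical stand-in module, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STRLinear(nn.Module):
    """Linear layer whose effective weight is S_g(W, s) with g = sigmoid (illustrative)."""
    def __init__(self, d_in, d_out, s_init=-10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_out))
        self.s = nn.Parameter(torch.tensor(s_init))    # g(s) ~ 0 at init: training starts dense

    def forward(self, x):
        w = torch.sign(self.weight) * F.relu(self.weight.abs() - torch.sigmoid(self.s))
        return F.linear(x, w, self.bias)

model = nn.Sequential(STRLinear(784, 256), nn.ReLU(), STRLinear(256, 10))
# The same weight-decay value (lambda) is applied to W_l and s_l alike; decaying s_l toward 0
# raises the threshold sigmoid(s_l) from ~0, which is what drives the overall sparsity.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
```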

  10. STR - Training
  • STR learns the SOTA hand-crafted heuristic of GMP
    [Figure: overall sparsity vs epochs for 90% sparse ResNet50 on ImageNet-1K]
  • STR learns diverse non-uniform layer-wise sparsities
    [Figure: layer-wise sparsity for 90% sparse ResNet50 on ImageNet-1K]

  11. STR - Experiments
  • Unstructured sparsity - CNNs
    • Dataset: ImageNet-1K
    • Models: ResNet50 & MobileNetV1
    • Sparsity range: 80-99%; ultra-sparse regime: 98-99%
  • Structured sparsity - low rank in RNNs
    • Datasets: Google-12 (keyword spotting), HAR-2 (activity recognition)
    • Model: FastGRNN
  • Additional
    • Transfer of learnt budgets to other sparsification techniques
    • STR for global, per-weight sparsity & filter/kernel pruning

  12. Unstructured vs Structured Sparsity
  • Unstructured sparsity
    • Typically magnitude-based pruning with global or layer-wise thresholds
  • Structured sparsity
    • Low-rank & neuron/filter/kernel pruning

  13. STR Unstructured Sparsity: ResNet50
  • STR requires 20% fewer FLOPs at the same accuracy for 80-95% sparsity
  • STR achieves up to 10% higher accuracy than the baselines in the 98-99% regime

  14. STR Unstructured Sparsity: MobileNetV1
  • STR maintains accuracy at 75% sparsity with 62M fewer FLOPs
  • STR uses ~50% fewer FLOPs at 90% sparsity with the same accuracy

  15. STR Sparsity Budget: ResNet50
    [Figure: layer-wise sparsity and FLOPs budgets for 90% sparse ResNet50 on ImageNet-1K]
  • STR learns sparser initial layers than the non-uniform sparsity baselines
  • STR makes the last layers denser than all baselines
  • STR produces sparser backbones for transfer learning
  • STR adjusts FLOPs across layers so that total inference cost is lower than the baselines'

  16. STR Sparsity Budget: MobileNetV1
    [Figure: layer-wise sparsity and FLOPs budgets for 90% sparse MobileNetV1 on ImageNet-1K]
  • STR automatically keeps the depth-wise separable conv layers denser than the rest of the layers
  • STR's budget results in ~50% fewer FLOPs than GMP

  17. STRConv
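The slide presents the STRConv layer. Below is a minimal PyTorch sketch of such a convolution, assuming one learnable threshold per layer and the sigmoid $g(\cdot)$; it approximates the idea only and is not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STRConv(nn.Conv2d):
    """Conv2d whose effective weight is sign(W) * ReLU(|W| - sigmoid(s)) (sketch)."""
    def __init__(self, *args, s_init=-10.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.s = nn.Parameter(torch.tensor(s_init))    # learnable layer-wise threshold

    def sparse_weight(self):
        return torch.sign(self.weight) * F.relu(self.weight.abs() - torch.sigmoid(self.s))

    def forward(self, x):
        return F.conv2d(x, self.sparse_weight(), self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

conv = STRConv(64, 128, kernel_size=3, padding=1, bias=False)
y = conv(torch.randn(1, 64, 32, 32))
sparsity = (conv.sparse_weight() == 0).float().mean()   # fraction of zeroed-out weights
```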

  18. STR Structured Sparsity: Low rank
  • Typical low-rank parameterization: $\mathbf{W} = \mathbf{W}_1 \, \Sigma \, \mathbf{W}_2$
  • Train with STR on $\Sigma$: small diagonal entries are soft-thresholded to zero, so the effective rank is learnt along with the weights
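A hedged sketch of this idea, assuming an SVD-like factorization and the exponential $g(\cdot)$ listed earlier for structured sparsity (module name, shapes and initialization are illustrative):

```python
import torch
import torch.nn as nn

class LowRankSTR(nn.Module):
    """W is represented as W1 @ diag(sigma) @ W2; STR on sigma learns the rank (sketch)."""
    def __init__(self, d_out, d_in, r, s_init=-10.0):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(d_out, r) * 0.01)
        self.W2 = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.sigma = nn.Parameter(torch.ones(r))
        self.s = nn.Parameter(torch.tensor(s_init))    # threshold parameter for the diagonal

    def effective_weight(self):
        # ST applied to the diagonal entries; g = exp keeps the threshold positive.
        sigma_sparse = torch.sign(self.sigma) * torch.relu(self.sigma.abs() - torch.exp(self.s))
        return self.W1 @ torch.diag(sigma_sparse) @ self.W2

    def forward(self, x):
        return x @ self.effective_weight().T

layer = LowRankSTR(d_out=128, d_in=64, r=32)
# The learnt rank is the number of diagonal entries of sigma that survive the threshold.
```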

  19. STR - Critical Design Choices
  • Weight decay $\lambda$
    • Controls the overall sparsity
    • Larger $\lambda$ → higher sparsity, at the cost of some instability
  • Initialization of $s_l$
    • Controls finer sparsity exploration
    • Controls the duration of dense training
  • Careful choice of $g(\cdot)$
    • Drives the training dynamics
    • Better functions could consistently revive dead weights
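For intuition on the $s_l$ initialization bullet: with $g = \mathrm{sigmoid}$, a sufficiently negative init keeps every threshold near zero, so training starts dense and sparsification only kicks in as weight decay pulls $s_l$ upward. The values below are illustrative, not the paper's hyperparameters.

```python
import torch

for s_init in (-5.0, -10.0, -20.0):
    g = torch.sigmoid(torch.tensor(s_init)).item()
    print(f"s_init = {s_init:>6}: initial threshold g(s) = {g:.2e}")
# s_init =   -5.0: initial threshold g(s) = 6.69e-03
# s_init =  -10.0: initial threshold g(s) = 4.54e-05
# s_init =  -20.0: initial threshold g(s) = 2.06e-09
```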

  20. STR - Conclusions
  • STR enables stable end-to-end training (with no additional cost) to obtain sparse & accurate DNNs
  • STR efficiently learns per-layer sparsity budgets
    • Reduces FLOPs by up to 50% for 80-95% sparsity
    • Up to 10% more accurate than baselines for 98-99% sparsity
    • Transferable to other sparsification techniques
  • Future work
    • Formulations that explicitly minimize FLOPs
    • Stronger guarantees in the standard sparse regression setting
  • Code, pretrained models and sparsity budgets: https://github.com/RAIVNLab/STR

  21. Thank You! Aditya, Vivek*, Raghav*, Mitchell*, Prateek, Sham and Ali
