Faster Machine Learning via Low-Precision Communication & Computation
Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
How many bits do you need to represent a single number in machine learning systems?
Takeaways:
• Training neural networks: 4 bits is enough for communication.
• Training linear models: 4 bits is enough end-to-end.
• Beyond empirical: rigorous theoretical guarantees.
[Chart: bits per number (log scale), comparing 3-bit and 4-bit quantization with 32-bit floating point.]
First Example: GPUs
What happens in practice?
• GPUs have plenty of compute; yet bandwidth is relatively limited (PCIe or, newer, NVLINK).
• General trend towards large models and datasets:
  • Vision: ImageNet (1.8M images); ResNet-152 [He+15]: 60M parameters (~240 MB).
  • Speech: NIST2000 (2000 hours); LACE [Yu+16]: 65M parameters (~300 MB).
Per minibatch: compute gradient, exchange gradient, update parameters.
Gradient transmission is expensive.
First Example: GPUs
Gradient transmission is expensive, so compress the exchanged gradients: Compression [Seide et al., Microsoft CNTK].
The Key Question
Can lossy compression provide speedup, while preserving convergence?
Yes. Quantized SGD (QSGD) can converge as fast as SGD, with considerably less bandwidth: > 2x faster.
[Figure: Top-1 accuracy vs. time for AlexNet on ImageNet.]
Why does QSGD work?
Notation in One Slide
Task (e.g., image classification), data ($M$ examples $d_1, \dots, d_M$), model $y$.
Notion of "quality": $g(y) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{loss}(y, d_j)$.
Goal: $\arg\min_{y} g(y)$, solved via an optimization procedure.
Background on Stochastic Gradient Descent
▪ Goal: find $\arg\min_{y} g(y)$.
▪ Let $\tilde{g}(y)$ be the stochastic gradient of $g$ at $y$, computed at a randomly chosen data point, with $\mathbb{E}[\tilde{g}(y)] = \nabla g(y)$.
▪ Iteration: $y_{t+1} = y_t - \eta_t \, \tilde{g}(y_t)$.
▪ Variance bound: $\mathbb{E}\big[\|\tilde{g}(y) - \nabla g(y)\|^2\big] \le \sigma^2$.
Theorem [informal]: Given $g$ nice (e.g., convex and smooth), and $R^2 = \|y_0 - y^*\|^2$, to converge within $\varepsilon$ of optimal it is sufficient to run $T = \mathcal{O}(R^2 \sigma^2 / \varepsilon^2)$ iterations.
Higher variance = more iterations to convergence.
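As a concrete reference point, here is a minimal Python/NumPy sketch of the plain SGD iteration above for a least-squares loss; the loss, the step-size schedule, and all names are illustrative assumptions, not the exact setup from the talk.

```python
import numpy as np

def sgd(data, labels, steps=1000, eta=0.1, seed=0):
    """Plain SGD: y_{t+1} = y_t - eta_t * g~(y_t), with g~ a one-sample gradient."""
    rng = np.random.default_rng(seed)
    y = np.zeros(data.shape[1])              # model, initialized at y_0 = 0
    for t in range(steps):
        j = rng.integers(len(data))          # randomly chosen data point
        a, b = data[j], labels[j]
        grad = 2.0 * a * (a @ y - b)         # stochastic gradient of (a.y - b)^2
        y = y - eta / np.sqrt(t + 1) * grad  # step with a decaying rate eta_t
    return y
```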
Data Flow: Data-Parallel Training (e.g., GPUs)
Standard SGD step at step $t$: $y_{t+1} = y_t - \eta_t \, \tilde{g}(y_t)$.
Data-parallel step (GPU 1 and GPU 2, each on its own data): $y_{t+1} = y_t - \eta_t \,\big(\tilde{g}_1(y_t) + \tilde{g}_2(y_t)\big)$.
Quantized SGD step at step $t$: $y_{t+1} = y_t - \eta_t \, Q(\tilde{g}(y_t))$, where each GPU sends $Q(\tilde{g}_i(y_t))$ instead of $\tilde{g}_i(y_t)$.
[Figure: histogram of gradient values.]
How Do We Quantize?
▪ Gradient = vector $v$ of dimension $n$, normalized; e.g., $v_j = 0.7$.
▪ Quantization function: $Q(v_j) = \xi_j(v) \cdot \mathrm{sgn}(v_j)$, where $\xi_j(v) = 1$ with probability $|v_j|$, and $0$ otherwise. E.g., for $v_j = 0.7$: $\Pr[1] = 0.7$, $\Pr[0] = 1 - 0.7 = 0.3$.
▪ Quantization is an unbiased estimator: $\mathbb{E}[Q(v)] = v$.
▪ Why do this? Instead of $n$ floats $(v_1, \dots, v_n)$, send $n$ bits and signs (e.g., $+1\ 0\ 0\ {-1}\ {-1}\ 0\ 1\ 1\ 1\ 0\ 0\ 0\ {-1}$) plus one float scaling factor. Compression rate > 15x.
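A minimal sketch of the quantizer above, with the scaling factor (the gradient norm) folded back in at the receiver; this is my reconstruction in the spirit of QSGD, not the authors' code, and the function name is an assumption.

```python
import numpy as np

def quantize(v, rng=None):
    """Stochastic 1-bit quantization: Q(v)_j = ||v|| * sgn(v_j) * b_j,
    with b_j = 1 w.p. |v_j| / ||v||, else 0, so that E[Q(v)] = v."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    p = np.abs(v) / norm                 # normalized magnitudes, in [0, 1]
    bits = rng.random(v.shape) < p       # Bernoulli(p) rounding to 0/1
    return norm * np.sign(v) * bits      # reconstruct from (norm, signs, bits)
```

Only the scalar norm, the signs, and the bit vector need to be transmitted, which is where the >15x compression over 32-bit floats comes from.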
Gradient Compression
• We apply stochastic rounding to gradients.
• The SGD iteration becomes $y_{t+1} = y_t - \eta_t \, Q(\tilde{g}(y_t))$ at step $t$.
Theorem [QSGD: Alistarh, Grubic, Li, Tomioka, Vojnovic, 2016]: Given dimension $n$, QSGD guarantees the following:
1. Convergence: if SGD converges, then QSGD converges.
2. Convergence speed: if SGD converges in $T$ iterations, QSGD converges in $\le \sqrt{n}\, T$ iterations.
3. Bandwidth cost: each gradient can be coded using $\le 2\sqrt{n} \log n$ bits.
The gamble: the benefit of reduced communication will outweigh the performance hit from extra iterations/variance and coding/decoding.
Generalizes to arbitrarily many quantization levels.
[Figure: evenly spaced quantization levels between 0 and 1.]
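To connect the theorem to the training loop, here is a sketch of one quantized data-parallel step: each worker computes a local stochastic gradient, quantizes it with the `quantize` function from the previous sketch, and only the quantized gradients are aggregated into the update. The least-squares loss, the sharding scheme, and all names are assumptions for illustration.

```python
import numpy as np  # reuses quantize() from the sketch above

def quantized_parallel_step(y, shards, eta, rng):
    """One QSGD step: y_{t+1} = y_t - eta_t * sum over workers of Q(g~_w(y_t))."""
    update = np.zeros_like(y)
    for data, labels in shards:           # one (data, labels) shard per worker/GPU
        j = rng.integers(len(data))       # each worker samples its own data point
        a, b = data[j], labels[j]
        grad = 2.0 * a * (a @ y - b)      # local stochastic gradient (least squares)
        update += quantize(grad, rng)     # only Q(grad) crosses the interconnect
    return y - eta * update
```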
Does it actually work?
Experimental Setup
Where?
▪ Amazon p2.16xlarge (16 x NVIDIA K80 GPUs)
▪ Microsoft CNTK v2.0, with MPI-based communication (no NVIDIA NCCL)
What?
▪ Tasks: image classification (ImageNet) and speech recognition (CMU AN4)
▪ Nets: ResNet, VGG, Inception, AlexNet, and LSTM, respectively
▪ With default parameters
Why?
▪ Accuracy vs. speed/scalability
Open-source implementation, as well as Docker containers.
Experiments: Communication Cost
▪ AlexNet x ImageNet-1K x 2 GPUs
SGD: 60% compute, 40% communication. QSGD: 95% compute, 5% communication.
[Figure: SGD vs. QSGD time breakdown on AlexNet.]
Experiments: "Strong" Scaling
[Figure: QSGD speedups over SGD of 1.3x, 1.6x, 2.3x, and 3.5x across the benchmark networks.]
Experiments: A Closer Look at Accuracy
[Figures: ResNet-50 on ImageNet and a 3-layer LSTM on CMU AN4 (speech); callouts: 4-bit: -0.2%, 8-bit: +0.3%, 2.5x speedup.]
Across all networks we tried, 4 bits are sufficient for full accuracy. (The QSGD arXiv tech report contains full numbers and comparisons.)
How many bits do you need to represent a single number in machine learning systems?
Takeaways:
• Training neural networks: 4 bits is enough for communication.
• Training linear models: 4 bits is enough end-to-end.
[Chart: bits per number (log scale), comparing 3-bit and 4-bit quantization with 32-bit floating point.]
Data Flow in Machine Learning Systems
(1) Data source (sensor, database) → (2) storage device (DRAM, CPU cache) → (3) computation device (GPU, CPU, FPGA).
Data $A_r$, model $x$; gradient: $\mathrm{dot}(A_r, x)\, A_r$.
ZipML
Idea: quantize the data $A_r$ itself on its way from (1) the data source through (2) storage to (3) the computation device, where the gradient $\mathrm{dot}(A_r, x)\, A_r$ is computed. Example: a stored data value 0.7.
• Naive solution: nearest rounding (always round to 1) => converges to a different solution.
• Stochastic rounding: 0 with prob. 0.3, 1 with prob. 0.7 => expectation matches => OK!
(Over-simplified; one needs to be careful about variance!)
[NIPS'15]
ZipML
Loss: $(ax - b)^2$. Gradient: $2a(ax - b)$.
Quantize the data value $a$ (e.g., 0.7) stochastically to 0 (p = 0.3) or 1 (p = 0.7): expectation matches => OK!
ZipML
Expectation matches => OK? NO!
Why? The gradient $2a(ax - b)$ is not linear in $a$ (it depends on $a^2$), so plugging the quantized $a$ into it gives a biased gradient.
ZipML: "Double Sampling"
How to generate samples of $a$ to get an unbiased estimator of the gradient $2a(ax - b)$? Use TWO independent samples: $2a_1(a_2 x - b)$.
How many more bits do we need to store the second sample? Not a 2x overhead!
• 3 bits store the first sample.
• The independent 2nd sample has only 3 choices relative to the first (up, down, same) => 2 bits to store.
• We can do even better, since the samples are symmetric: 15 distinct possibilities => 4 bits to store both samples. Only 1 bit of overhead.
[arXiv'16]
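A small Monte Carlo sketch of why double sampling matters: with a single stochastically rounded copy of the data value, the estimator $2\tilde{a}(\tilde{a}x - b)$ is biased because $\mathbb{E}[\tilde{a}^2] \ne a^2$, while two independent samples give an unbiased estimate of $2a(ax - b)$. The one-dimensional example and the {0, 1} rounding grid are simplifying assumptions.

```python
import numpy as np

def stochastic_round(a, rng):
    """Round a in [0, 1] to 0 or 1, rounding up with probability a (so E = a)."""
    return float(rng.random() < a)

rng = np.random.default_rng(0)
a, x, b = 0.7, 2.0, 0.5                    # data value, model, label (1-D toy case)
true_grad = 2 * a * (a * x - b)            # = 1.26

single, double = [], []
for _ in range(200_000):
    a1 = stochastic_round(a, rng)
    a2 = stochastic_round(a, rng)          # second, independent sample
    single.append(2 * a1 * (a1 * x - b))   # reuses one sample: biased, since a1**2 == a1
    double.append(2 * a1 * (a2 * x - b))   # double sampling: E[a1 * a2] = a**2

print(true_grad, np.mean(single), np.mean(double))
# approximately: 1.26  2.10  1.26 -- the single-sample estimator is biased upward
```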
It works!
Experiments
Tomographic reconstruction: linear regression with fancy regularization (but a 240 GB model), 32-bit floating point vs. 12-bit fixed point.
It works, but is what we are doing optimal?
Not Really: Data-Optimal Quantization Strategy
Consider a point at distance $a$ from marker A and distance $b$ from marker B:
• Probability of quantizing to A: $P_A = b / (a + b)$.
• Probability of quantizing to B: $P_B = a / (a + b)$.
• Expected quantization error: $a P_A + b P_B = 2ab / (a + b)$.
Intuitively, shouldn't we put more markers where the data is dense?
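The expected-error formula above can be turned into a tiny experiment. The sketch below (an illustration under assumed marker placements and an assumed data distribution, not the ZipML algorithm) computes the average stochastic-rounding error $2ab/(a+b)$ over a dataset for two marker grids.

```python
import numpy as np

def expected_round_error(points, markers):
    """Expected |error| of stochastically rounding each point to its two
    neighbouring markers: for gaps a, b this is a*P_A + b*P_B = 2ab/(a+b)."""
    markers = np.sort(markers)
    idx = np.clip(np.searchsorted(markers, points) - 1, 0, len(markers) - 2)
    a = points - markers[idx]          # distance down to the lower marker
    b = markers[idx + 1] - points      # distance up to the upper marker
    gaps = a + b
    return np.mean(np.where(gaps > 0, 2 * a * b / np.maximum(gaps, 1e-12), 0.0))

data = np.random.default_rng(0).beta(8, 2, size=10_000)   # mass concentrated near 1
uniform = np.linspace(0, 1, 5)                             # evenly spaced markers
skewed = np.array([0.0, 0.6, 0.8, 0.9, 1.0])               # more markers where data is dense
print(expected_round_error(data, uniform), expected_round_error(data, skewed))
```

On this skewed Beta(8, 2) sample, the data-aware grid should give a noticeably lower average error than the uniform one, which is the intuition behind a data-optimal marker placement.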