Faster Machine Learning via Low-Precision Communication & Computation
Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
How many bits do you need to represent a single number in machine learning systems?
Takeaways:
• Training neural networks: 4 bits is enough for communication.
• Training linear models: 4 bits is enough end-to-end.
• Beyond empirical: rigorous theoretical guarantees.
[Chart: bits per number (log scale), comparing 3-bit and 4-bit quantization with 32-bit floating point.]
First Example: GPUs
What happens in practice?
• GPUs have plenty of compute; yet bandwidth is relatively limited (PCIe or, newer, NVLINK).
• General trend towards large models and datasets:
  • Vision: ImageNet (1.8M images); ResNet-152 [He+15]: 60M parameters (~240 MB).
  • Speech: NIST2000 (2000 hours); LACE [Yu+16]: 65M parameters (~300 MB).
Per minibatch: compute gradient, exchange gradient, update parameters.
Gradient transmission is expensive.
First Example: GPUs
Gradient transmission is expensive, so compress the exchanged gradients: Compression [Seide et al., Microsoft CNTK].
The Key Question
Can lossy compression provide speedup, while preserving convergence?
Yes. Quantized SGD (QSGD) can converge as fast as SGD, with considerably less bandwidth: > 2x faster.
[Figure: Top-1 accuracy vs. time for AlexNet on ImageNet.]
Why does QSGD work?
Notation in One Slide
Task (e.g., image classification), data ($M$ examples $d_1, \dots, d_M$), model $y$.
Notion of "quality": $g(y) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{loss}(y, d_j)$.
Goal: $\arg\min_{y} g(y)$, solved via an optimization procedure.
Background on Stochastic Gradient Descent
▪ Goal: find $\arg\min_{y} g(y)$.
▪ Let $\tilde{g}(y)$ be the stochastic gradient of $g$ at $y$, computed at a randomly chosen data point, with $\mathbb{E}[\tilde{g}(y)] = \nabla g(y)$.
▪ Iteration: $y_{t+1} = y_t - \eta_t \, \tilde{g}(y_t)$.
▪ Variance bound: $\mathbb{E}\big[\|\tilde{g}(y) - \nabla g(y)\|^2\big] \le \sigma^2$.
Theorem [informal]: Given $g$ nice (e.g., convex and smooth), and $R^2 = \|y_0 - y^*\|^2$, to converge within $\varepsilon$ of optimal it is sufficient to run $T = \mathcal{O}(R^2 \sigma^2 / \varepsilon^2)$ iterations.
Higher variance = more iterations to convergence.
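As a concrete reference point, here is a minimal Python/NumPy sketch of the plain SGD iteration above for a least-squares loss; the loss, the step-size schedule, and all names are illustrative assumptions, not the exact setup from the talk.

```python
import numpy as np

def sgd(data, labels, steps=1000, eta=0.1, seed=0):
    """Plain SGD: y_{t+1} = y_t - eta_t * g~(y_t), with g~ a one-sample gradient."""
    rng = np.random.default_rng(seed)
    y = np.zeros(data.shape[1])              # model, initialized at y_0 = 0
    for t in range(steps):
        j = rng.integers(len(data))          # randomly chosen data point
        a, b = data[j], labels[j]
        grad = 2.0 * a * (a @ y - b)         # stochastic gradient of (a.y - b)^2
        y = y - eta / np.sqrt(t + 1) * grad  # step with a decaying rate eta_t
    return y
```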
Data Flow: Data-Parallel Training (e.g., GPUs)
Standard SGD step at step $t$: $y_{t+1} = y_t - \eta_t \, \tilde{g}(y_t)$.
Data-parallel step (GPU 1 and GPU 2, each on its own data): $y_{t+1} = y_t - \eta_t \,\big(\tilde{g}_1(y_t) + \tilde{g}_2(y_t)\big)$.
Quantized SGD step at step $t$: $y_{t+1} = y_t - \eta_t \, Q(\tilde{g}(y_t))$, where each GPU sends $Q(\tilde{g}_i(y_t))$ instead of $\tilde{g}_i(y_t)$.
[Figure: histogram of gradient values.]
How Do We Quantize?
▪ Gradient = vector $v$ of dimension $n$, normalized; e.g., $v_j = 0.7$.
▪ Quantization function: $Q(v_j) = \xi_j(v) \cdot \mathrm{sgn}(v_j)$, where $\xi_j(v) = 1$ with probability $|v_j|$, and $0$ otherwise. E.g., for $v_j = 0.7$: $\Pr[1] = 0.7$, $\Pr[0] = 1 - 0.7 = 0.3$.
▪ Quantization is an unbiased estimator: $\mathbb{E}[Q(v)] = v$.
▪ Why do this? Instead of $n$ floats $(v_1, \dots, v_n)$, send $n$ bits and signs (e.g., $+1\ 0\ 0\ {-1}\ {-1}\ 0\ 1\ 1\ 1\ 0\ 0\ 0\ {-1}$) plus one float scaling factor. Compression rate > 15x.
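A minimal sketch of the quantizer above, with the scaling factor (the gradient norm) folded back in at the receiver; this is my reconstruction in the spirit of QSGD, not the authors' code, and the function name is an assumption.

```python
import numpy as np

def quantize(v, rng=None):
    """Stochastic 1-bit quantization: Q(v)_j = ||v|| * sgn(v_j) * b_j,
    with b_j = 1 w.p. |v_j| / ||v||, else 0, so that E[Q(v)] = v."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    p = np.abs(v) / norm                 # normalized magnitudes, in [0, 1]
    bits = rng.random(v.shape) < p       # Bernoulli(p) rounding to 0/1
    return norm * np.sign(v) * bits      # reconstruct from (norm, signs, bits)
```

Only the scalar norm, the signs, and the bit vector need to be transmitted, which is where the >15x compression over 32-bit floats comes from.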
Gradient Compression
• We apply stochastic rounding to gradients.
• The SGD iteration becomes $y_{t+1} = y_t - \eta_t \, Q(\tilde{g}(y_t))$ at step $t$.
Theorem [QSGD: Alistarh, Grubic, Li, Tomioka, Vojnovic, 2016]: Given dimension $n$, QSGD guarantees the following:
1. Convergence: if SGD converges, then QSGD converges.
2. Convergence speed: if SGD converges in $T$ iterations, QSGD converges in $\le \sqrt{n}\, T$ iterations.
3. Bandwidth cost: each gradient can be coded using $\le 2\sqrt{n} \log n$ bits.
The gamble: the benefit of reduced communication will outweigh the performance hit from extra iterations/variance and coding/decoding.
Generalizes to arbitrarily many quantization levels.
[Figure: evenly spaced quantization levels between 0 and 1.]
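To connect the theorem to the training loop, here is a sketch of one quantized data-parallel step: each worker computes a local stochastic gradient, quantizes it with the `quantize` function from the previous sketch, and only the quantized gradients are aggregated into the update. The least-squares loss, the sharding scheme, and all names are assumptions for illustration.

```python
import numpy as np  # reuses quantize() from the sketch above

def quantized_parallel_step(y, shards, eta, rng):
    """One QSGD step: y_{t+1} = y_t - eta_t * sum over workers of Q(g~_w(y_t))."""
    update = np.zeros_like(y)
    for data, labels in shards:           # one (data, labels) shard per worker/GPU
        j = rng.integers(len(data))       # each worker samples its own data point
        a, b = data[j], labels[j]
        grad = 2.0 * a * (a @ y - b)      # local stochastic gradient (least squares)
        update += quantize(grad, rng)     # only Q(grad) crosses the interconnect
    return y - eta * update
```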
Does it actually work?
Experimental Setup
Where?
▪ Amazon p2.16xlarge (16 x NVIDIA K80 GPUs)
▪ Microsoft CNTK v2.0, with MPI-based communication (no NVIDIA NCCL)
What?
▪ Tasks: image classification (ImageNet) and speech recognition (CMU AN4)
▪ Nets: ResNet, VGG, Inception, AlexNet, and LSTM, respectively
▪ With default parameters
Why?
▪ Accuracy vs. speed/scalability
Open-source implementation, as well as Docker containers.
Experiments: Communication Cost
▪ AlexNet x ImageNet-1K x 2 GPUs
SGD: 60% compute, 40% communication. QSGD: 95% compute, 5% communication.
[Figure: SGD vs. QSGD time breakdown on AlexNet.]
Experiments: "Strong" Scaling
[Figure: QSGD speedups over SGD of 1.3x, 1.6x, 2.3x, and 3.5x across the benchmark networks.]
Experiments: A Closer Look at Accuracy
[Figures: ResNet-50 on ImageNet and a 3-layer LSTM on CMU AN4 (speech); callouts: 4-bit: -0.2%, 8-bit: +0.3%, 2.5x speedup.]
Across all networks we tried, 4 bits are sufficient for full accuracy. (The QSGD arXiv tech report contains full numbers and comparisons.)
How many bits do you need to represent a single number in machine learning systems?
Takeaways:
• Training neural networks: 4 bits is enough for communication.
• Training linear models: 4 bits is enough end-to-end.
[Chart: bits per number (log scale), comparing 3-bit and 4-bit quantization with 32-bit floating point.]
Data Flow in Machine Learning Systems
(1) Data source (sensor, database) → (2) storage device (DRAM, CPU cache) → (3) computation device (GPU, CPU, FPGA).
Data $A_r$, model $x$; gradient: $\mathrm{dot}(A_r, x)\, A_r$.
ZipML
Idea: quantize the data $A_r$ itself on its way from (1) the data source through (2) storage to (3) the computation device, where the gradient $\mathrm{dot}(A_r, x)\, A_r$ is computed. Example: a stored data value 0.7.
• Naive solution: nearest rounding (always round to 1) => converges to a different solution.
• Stochastic rounding: 0 with prob. 0.3, 1 with prob. 0.7 => expectation matches => OK!
(Over-simplified; one needs to be careful about variance!)
[NIPS'15]
ZipML
Loss: $(ax - b)^2$. Gradient: $2a(ax - b)$.
Quantize the data value $a$ (e.g., 0.7) stochastically to 0 (p = 0.3) or 1 (p = 0.7): expectation matches => OK!
ZipML
Expectation matches => OK? NO!
Why? The gradient $2a(ax - b)$ is not linear in $a$ (it depends on $a^2$), so plugging the quantized $a$ into it gives a biased gradient.
ZipML: "Double Sampling"
How to generate samples of $a$ to get an unbiased estimator of the gradient $2a(ax - b)$? Use TWO independent samples: $2a_1(a_2 x - b)$.
How many more bits do we need to store the second sample? Not a 2x overhead!
• 3 bits store the first sample.
• The independent 2nd sample has only 3 choices relative to the first (up, down, same) => 2 bits to store.
• We can do even better, since the samples are symmetric: 15 distinct possibilities => 4 bits to store both samples. Only 1 bit of overhead.
[arXiv'16]
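A small Monte Carlo sketch of why double sampling matters: with a single stochastically rounded copy of the data value, the estimator $2\tilde{a}(\tilde{a}x - b)$ is biased because $\mathbb{E}[\tilde{a}^2] \ne a^2$, while two independent samples give an unbiased estimate of $2a(ax - b)$. The one-dimensional example and the {0, 1} rounding grid are simplifying assumptions.

```python
import numpy as np

def stochastic_round(a, rng):
    """Round a in [0, 1] to 0 or 1, rounding up with probability a (so E = a)."""
    return float(rng.random() < a)

rng = np.random.default_rng(0)
a, x, b = 0.7, 2.0, 0.5                    # data value, model, label (1-D toy case)
true_grad = 2 * a * (a * x - b)            # = 1.26

single, double = [], []
for _ in range(200_000):
    a1 = stochastic_round(a, rng)
    a2 = stochastic_round(a, rng)          # second, independent sample
    single.append(2 * a1 * (a1 * x - b))   # reuses one sample: biased, since a1**2 == a1
    double.append(2 * a1 * (a2 * x - b))   # double sampling: E[a1 * a2] = a**2

print(true_grad, np.mean(single), np.mean(double))
# approximately: 1.26  2.10  1.26 -- the single-sample estimator is biased upward
```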
It works!
Experiments
Tomographic reconstruction: linear regression with fancy regularization (but a 240 GB model), 32-bit floating point vs. 12-bit fixed point.
It works, but is what we are doing optimal?
Not Really: Data-Optimal Quantization Strategy
Consider a point at distance $a$ from marker A and distance $b$ from marker B:
• Probability of quantizing to A: $P_A = b / (a + b)$.
• Probability of quantizing to B: $P_B = a / (a + b)$.
• Expected quantization error: $a P_A + b P_B = 2ab / (a + b)$.
Intuitively, shouldn't we put more markers where the data is dense?
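The expected-error formula above can be turned into a tiny experiment. The sketch below (an illustration under assumed marker placements and an assumed data distribution, not the ZipML algorithm) computes the average stochastic-rounding error $2ab/(a+b)$ over a dataset for two marker grids.

```python
import numpy as np

def expected_round_error(points, markers):
    """Expected |error| of stochastically rounding each point to its two
    neighbouring markers: for gaps a, b this is a*P_A + b*P_B = 2ab/(a+b)."""
    markers = np.sort(markers)
    idx = np.clip(np.searchsorted(markers, points) - 1, 0, len(markers) - 2)
    a = points - markers[idx]          # distance down to the lower marker
    b = markers[idx + 1] - points      # distance up to the upper marker
    gaps = a + b
    return np.mean(np.where(gaps > 0, 2 * a * b / np.maximum(gaps, 1e-12), 0.0))

data = np.random.default_rng(0).beta(8, 2, size=10_000)   # mass concentrated near 1
uniform = np.linspace(0, 1, 5)                             # evenly spaced markers
skewed = np.array([0.0, 0.6, 0.8, 0.9, 1.0])               # more markers where data is dense
print(expected_round_error(data, uniform), expected_round_error(data, skewed))
```

On this skewed Beta(8, 2) sample, the data-aware grid should give a noticeably lower average error than the uniform one, which is the intuition behind a data-optimal marker placement.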