SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Beidi Chen
Collaborators: Tharun Medini*, James Farwell†, Sameh Gobriel†, Charlie Tai†, Anshumali Shrivastava*
*Rice University, †Intel
MLSys 2020
Our SLIDE system (C++ from scratch) on a 44-core CPU beats TensorFlow on a V100 GPU (1 hour vs. 3.5 hours) on networks with 100+ million parameters. TensorFlow on the same CPU takes 16 hours with all HPC optimizations (Intel MKL-DNN). 3.5x faster on CPU than TF on V100. [Plot: log scale in time.]
The Age of Large Networks
• More Data
• Large Models
• Tons of Engineering
• Backpropagation (a.k.a. simple gradient descent)
Fully Connected NN
Giant matrix multiplication for every data point in each epoch (forward + backward): activations are f(W^T x).
[Figure: fully connected network with Input, Hidden 1, Hidden 2, ..., Output layers.]
Challenges
Do we really need all the computations? No!!
Good news: only the high activations are important.
• Sampling a few neurons in proportion to their activations is enough (Adaptive Dropout; Ba et al., NeurIPS 2013; Makhzani et al., NeurIPS 2015)
• ReLU filters out negative activations (50% sparsity by design)
• Softmax: the output is typically dominated by a few classes
Bad news: we need to compute all activations to identify (or sample) the high-activation neurons. NO SAVINGS.
The Fundamental Sampling Puzzle
Given N fixed sampling weights w_1, w_2, ..., w_N:
• Task: sample x_i with probability w_i
• Cost of 1 sample: O(N)
• Cost of K samples: O(N)
Given N time-varying sampling weights (activations) w_1^t, w_2^t, ..., w_N^t:
• Task: at time t, sample x_i with probability w_i^t
• Cost of sampling: O(N), at every time t
• Last few years of work in Locality Sensitive Hashing: if w_i^t = f(sim(θ_t, x_i)) for a specific set of f and sim, then O(1) every time, after an initial preprocessing cost of O(N).
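To make the easy fixed-weight case concrete, here is a minimal sketch (not part of SLIDE; names and sizes are illustrative) using prefix sums: one O(N) preprocessing pass, then each sample costs O(log N). The comment notes why time-varying weights break this.

```cpp
// Sketch: sample index i with probability proportional to a *fixed* weight w[i].
// Preprocessing: O(N) prefix sums. Each sample: O(log N) binary search.
// If the weights changed every step (like activations), the prefix sums
// would have to be rebuilt each time, costing O(N) per step.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int sample(const std::vector<double>& prefix, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, prefix.back());
    double r = u(rng);
    // First index whose cumulative weight exceeds r.
    return std::upper_bound(prefix.begin(), prefix.end(), r) - prefix.begin();
}

int main() {
    std::vector<double> w = {0.1, 0.5, 0.2, 0.2};           // fixed sampling weights
    std::vector<double> prefix(w.size());
    std::partial_sum(w.begin(), w.end(), prefix.begin());   // O(N) preprocessing

    std::mt19937 rng(42);
    std::vector<int> counts(w.size(), 0);
    for (int t = 0; t < 100000; ++t) counts[sample(prefix, rng)]++;
    for (std::size_t i = 0; i < w.size(); ++i)
        std::printf("i=%zu  empirical=%.3f  target=%.3f\n", i, counts[i] / 1e5, w[i]);
    return 0;
}
```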
Textbook Hashing (Dictionary)
Hashing: a function h that maps a given data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint.
Property (ideal hash functions):
• If x = y, then h(x) = h(y)
• If x ≠ y, then h(x) ≠ h(y)
Probabilistic Fingerprinting (Hashing) (late 90s)
Hashing: a (randomized) function h that maps a given data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint.
Locality Sensitive Property:
• If sim(x, y) is high, then Pr(h(x) = h(y)) is high
• If sim(x, y) is low, then Pr(h(x) = h(y)) is low
Example 1: Signed Random Projection (SRP)
[Figure: points x and y relative to random hyperplanes H1 and H2, each side labeled + or −.]
Pr(h(x) = h(y)) = 1 − cos⁻¹(θ)/π, monotonic in θ (the cosine similarity of x and y).
A classical result from Goemans-Williamson (1995).
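A minimal sketch of an SRP hasher (the class name and layout are mine, not SLIDE's code): K random Gaussian hyperplanes each contribute one sign bit, and two vectors agree on a bit with the probability given above.

```cpp
// Sketch of Signed Random Projection (SimHash): one hash bit per random
// Gaussian hyperplane; Pr[bits agree] = 1 - acos(cosine similarity)/pi.
// Assumes K <= 32 so the fingerprint fits in a uint32_t.
#include <cstdint>
#include <random>
#include <vector>

struct SignedRandomProjection {
    std::vector<std::vector<double>> planes;   // K random hyperplanes in R^d

    SignedRandomProjection(int K, int d, unsigned seed = 0) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> gauss(0.0, 1.0);
        planes.assign(K, std::vector<double>(d));
        for (auto& p : planes)
            for (double& v : p) v = gauss(rng);
    }

    // K-bit fingerprint: bit k is the sign of <plane_k, x>.
    uint32_t hash(const std::vector<double>& x) const {
        uint32_t code = 0;
        for (std::size_t k = 0; k < planes.size(); ++k) {
            double dot = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) dot += planes[k][i] * x[i];
            if (dot >= 0) code |= (1u << k);
        }
        return code;
    }
};
```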
Example 2: (Densified) Winner-Take-All (WTA)
[Figure: original vectors with K = 3, their WTA hash codes (ICCV 2011), and their DWTA hash codes (UAI 2018).]
Yagnik (ICCV 2011), Chen and Shrivastava (UAI 2018)
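For intuition, here is a simplified sketch of a single WTA hash code: randomly permute the coordinates and report which of the first K permuted coordinates is largest, so the code depends only on the ordering of the values. DWTA's densification step for sparse inputs is only noted in a comment, not implemented.

```cpp
// Sketch of one Winner-Take-All (WTA) hash code. Repeating with independent
// permutations gives more codes. DWTA ("densified" WTA) additionally fills
// bins that see only zeros when the input is sparse; that step is omitted.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int wta_hash(const std::vector<double>& x, int K, unsigned seed) {
    // Random permutation of coordinate indices (one permutation per hash).
    std::vector<int> perm(x.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::mt19937 rng(seed);
    std::shuffle(perm.begin(), perm.end(), rng);

    // Position (0..K-1) of the maximum among the first K permuted coordinates.
    int best = 0;
    for (int k = 1; k < K; ++k)
        if (x[perm[k]] > x[perm[best]]) best = k;
    return best;   // only the rank order of x matters, not its magnitudes
}
```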
Probabilistic Hash Tables
Given: Pr[h(x) = h(y)] = f(sim(x, y)), where f is monotonic.
• Given a query, if h_1(q) = 11 and h_2(q) = 01, then probe the bucket with index 1101. It is a good bucket!!
• (Locality sensitive) h_i(q) = h_i(x) is a noisy indicator of high similarity.
• Doing better than random!!
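A sketch of how these fingerprints index buckets across L tables, reusing the SignedRandomProjection sketch from the SRP slide (the structure and names are illustrative, not SLIDE's actual data layout): each table concatenates K sign bits into one bucket id, and a query probes exactly one bucket per table.

```cpp
// Sketch: L hash tables, each keyed by a K-bit SRP fingerprint.
// Depends on the SignedRandomProjection sketch shown earlier.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct LSHTables {
    int K, L;
    std::vector<SignedRandomProjection> hashers;                       // one K-bit hasher per table
    std::vector<std::unordered_map<uint32_t, std::vector<int>>> tables;

    LSHTables(int K, int L, int d) : K(K), L(L) {
        for (int l = 0; l < L; ++l)
            hashers.emplace_back(K, d, static_cast<unsigned>(l));
        tables.resize(L);
    }

    // Insert item id under its bucket in every table.
    void insert(int id, const std::vector<double>& x) {
        for (int l = 0; l < L; ++l) tables[l][hashers[l].hash(x)].push_back(id);
    }

    // Probe one bucket per table and return the union of candidates.
    std::vector<int> query(const std::vector<double>& q) const {
        std::vector<int> candidates;
        for (int l = 0; l < L; ++l) {
            auto it = tables[l].find(hashers[l].hash(q));
            if (it != tables[l].end())
                candidates.insert(candidates.end(), it->second.begin(), it->second.end());
        }
        return candidates;   // may contain duplicates across tables
    }
};
```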
LSH for Search (Known)
Theory
• Super-linear O(N^(1+ρ)) memory
• Sub-linear query time, O(N^ρ)
• ρ < 1, but generally large (close to 1) and often hard to determine
Practical Issues
• Needs a lot of hash tables and distance computations for good accuracy on near neighbors
• Buckets can be quite heavy: poor randomness, or unfavorable data distributions
New View: Data Structures for Efficient Sampling!
Is LSH really a search algorithm?
• Given the query θ_t, LSH samples x_i from the dataset with probability w_i^t = 1 − (1 − p(x_i, θ_t)^K)^L
• w_i^t is monotonically increasing in p(x_i, θ_t), and hence in the similarity of x_i and θ_t
• LSH is considered a black box for nearest-neighbor search. It is not!!
LSH as Samplers
We can pre-process the dataset D such that:
• Given any query q, we can sample x ∈ D with probability Const × (1 − (1 − p(q, x)^K)^L), in KL hash computations and L bucket probes.
• Even K = 1, L = 1 is adaptive. So O(1)-time adaptive sampling.
• Adaptive: x is sampled with higher probability than y if and only if sim(q, x) > sim(q, y).
We can exactly compute the sampling probability:
• Const = (number of elements sampled) / (number of elements in the probed buckets)
(Chen et al., NeurIPS 2019) Sufficient for importance sampling estimations. Sampling cost O(1).
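Continuing the sketch above (still illustrative, reusing LSHTables and assuming SRP bits): the union of the L probed buckets is the sample, and for each retrieved item the retrieval probability 1 − (1 − p^K)^L can be computed exactly from its similarity to the query, which is what an importance-sampling estimator needs for reweighting. Here every bucket element is kept, so the Const factor from the slide is 1.

```cpp
// Sketch: the LSH tables viewed as a sampler. Item x_i is retrieved with
// probability 1 - (1 - p_i^K)^L, where p_i = Pr[one SRP bit of q and x_i
// agree] = 1 - acos(cos_sim(q, x_i))/pi (independence across bits/tables).
#include <algorithm>
#include <cmath>
#include <unordered_set>
#include <utility>
#include <vector>

double srp_collision_prob(const std::vector<double>& q, const std::vector<double>& x) {
    double dot = 0.0, nq = 0.0, nx = 0.0;
    for (std::size_t i = 0; i < q.size(); ++i) {
        dot += q[i] * x[i];  nq += q[i] * q[i];  nx += x[i] * x[i];
    }
    double cosine = dot / (std::sqrt(nq) * std::sqrt(nx) + 1e-12);
    cosine = std::max(-1.0, std::min(1.0, cosine));
    const double pi = std::acos(-1.0);
    return 1.0 - std::acos(cosine) / pi;
}

// Returns each distinct retrieved id together with its retrieval probability.
std::vector<std::pair<int, double>> sample_with_prob(
        const LSHTables& lsh, const std::vector<double>& q,
        const std::vector<std::vector<double>>& data) {
    std::unordered_set<int> seen;
    std::vector<std::pair<int, double>> out;
    for (int id : lsh.query(q)) {
        if (!seen.insert(id).second) continue;        // deduplicate across tables
        double p = srp_collision_prob(q, data[id]);   // per-bit collision probability
        double retrieval = 1.0 - std::pow(1.0 - std::pow(p, lsh.K), lsh.L);
        out.push_back({id, retrieval});               // weight estimates by 1/retrieval
    }
    return out;
}
```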
SLIDE: Sub-LInear Deep learning Engine
Step 1 – Build the hash tables by processing the weights of the hidden layers (initialization).
Subtlety: the vectors in the hash tables are neurons, not data vectors; we are reorganizing neurons.
[Figure: network with hash tables H1 (over Hidden 1) and H2 (over Hidden 2); buckets map fingerprints to neuron ids.]
SLIDE: Sub-LInear Deep learning Engine
Step 2 – Hash the input to any given layer using its randomized hash function.
[Figure: the layer input is hashed against hash tables H1 and H2.]
SLIDE: Sub-LInear Deep learning Engine
Step 3 – Query the hidden layer's hash table(s) for the active set using the integer fingerprint. This samples neurons in proportion to their activations.
[Figure: the matching buckets in H1 and H2 yield the active neurons.]
SLIDE: Sub-LInear Deep learning Engine
Step 4 – Perform forward and back propagation only on the nodes in the active set. Computation is on the order of the number of active neurons.
[Figure: forward/backward pass restricted to the active neurons.]
SLIDE: Sub-LInear Deep learning Engine
Step 5 – Update the hash tables by rehashing the updated node weights. Computation is on the order of the number of active neurons.
[Figure: updated neurons are rehashed into H1 and H2.]
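Putting the five steps together, here is a deliberately simplified single-layer sketch that reuses the LSHTables sketch above. It is not SLIDE's actual code: the real system processes batches, keeps per-neuron gradient buffers, and updates tables on a schedule. The upstream gradient g[j] for each neuron is assumed to be given.

```cpp
// Steps 1-5 on one toy layer: build tables from neuron weights, hash the
// input, query the active set, run forward/backward only on active neurons,
// and periodically rebuild the tables from the updated weights.
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

struct SparseLayer {
    std::vector<std::vector<double>> W;   // one weight vector per neuron
    LSHTables tables;

    SparseLayer(int n_neurons, int d, int K, int L, unsigned seed = 1)
        : W(n_neurons, std::vector<double>(d)), tables(K, L, d) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> init(0.0, 0.1);
        for (auto& w : W) for (double& v : w) v = init(rng);
        rebuild_tables();                 // Step 1: hash every neuron's weights
    }

    void rebuild_tables() {               // Step 5 (done periodically in practice)
        tables = LSHTables(tables.K, tables.L, static_cast<int>(W[0].size()));
        for (std::size_t j = 0; j < W.size(); ++j)
            tables.insert(static_cast<int>(j), W[j]);
    }

    // One training example x with an (assumed given) upstream gradient g[j].
    std::vector<std::pair<int, double>> train_step(
            const std::vector<double>& x, const std::vector<double>& g, double lr) {
        // Step 2 + 3: hash the input and collect the (deduplicated) active set.
        std::vector<int> candidates = tables.query(x);
        std::unordered_set<int> active(candidates.begin(), candidates.end());

        std::vector<std::pair<int, double>> activations;
        for (int j : active) {
            // Step 4a: forward pass only on active neurons (ReLU of dot product).
            double z = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) z += W[j][i] * x[i];
            activations.push_back({j, z > 0 ? z : 0.0});
            // Step 4b: SGD update only on active neurons.
            if (z > 0)
                for (std::size_t i = 0; i < x.size(); ++i) W[j][i] -= lr * g[j] * x[i];
        }
        return activations;   // caller invokes rebuild_tables() every few batches
    }
};
```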
We can go very sparse if Adaptive
• Reduce both training and inference cost by 95%!
• Significantly more for larger networks (the wider, the better)
• 2 hidden layers
• 1000 nodes per layer
Sparsity + Randomness → Asynchronous Updates
• 3 hidden layers
• 1000 nodes per layer
Less Computation + Asynchronous Parallelism
• Each update is computationally very small (100x+ reduction in computation and energy)
• Updates are near-independent, with very low chance of conflict. Hence, parallel SGD!
SLIDE: Sub-LInear Deep learning Engine
[Figure: SLIDE's data structures. Each network layer keeps L hash tables over its neurons, with buckets indexed by fingerprints h_1...h_K. Each neuron stores, per input in the batch, its active-input ids, the activation for each input, and the accumulated gradients, plus its weight vector (of previous-layer size).]
Parallelism with OpenMP
Parallel across training samples in a batch (extreme sparsity and randomness in gradient updates).
Thanks to the theory of HOGWILD! (Recht et al., NeurIPS 2011)
[Figure: one neuron's per-batch arrays: active inputs, activation for each input, accumulated gradients, and weights.]
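A sketch of this batch-level parallelism, assuming the SparseLayer sketch above (compile with -fopenmp): each sample in the batch is handled by one OpenMP thread, and weight updates go to shared memory without locks, in the HOGWILD spirit. Because each update touches only a few near-random neurons, conflicts are rare. A real implementation must also make the hash-table updates thread-safe or defer them; this sketch defers them by rebuilding the tables outside the parallel region.

```cpp
// HOGWILD-style batch parallelism with OpenMP: lock-free gradient updates
// across threads; hash-table rebuild deferred until after the parallel loop.
#include <vector>

void train_batch(SparseLayer& layer,
                 const std::vector<std::vector<double>>& batch_x,
                 const std::vector<std::vector<double>>& batch_g,
                 double lr) {
    #pragma omp parallel for
    for (int b = 0; b < static_cast<int>(batch_x.size()); ++b) {
        // Each thread reads/writes layer.W directly, without locks
        // (Recht et al. 2011: sparse, rarely-conflicting updates keep SGD convergent).
        layer.train_step(batch_x[b], batch_g[b], lr);
    }
    layer.rebuild_tables();   // Step 5, done once per batch here for simplicity
}
```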
Flexible Choices of Hash Functions
SLIDE supports four different LSH hash functions:
• SimHash (cosine similarity)
• Winner-Take-All hashing (order)
• Densified Winner-Take-All hashing (for sparse data)
• MinHash (Jaccard similarity)
Easily add more!
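One way such hash families could plug in behind a common interface (a sketch with made-up names, not SLIDE's actual API): the tables only need a function from a vector to a bucket fingerprint, so SimHash, (D)WTA, and MinHash become interchangeable.

```cpp
// Sketch of a pluggable LSH-family interface; only the SimHash adapter is
// shown, wrapping the SignedRandomProjection sketch from earlier.
#include <cstdint>
#include <vector>

struct LSHFamily {
    virtual ~LSHFamily() = default;
    // Bucket fingerprint of x for one table (K concatenated hash symbols).
    virtual uint32_t fingerprint(const std::vector<double>& x) const = 0;
};

struct SimHashFamily : LSHFamily {
    SignedRandomProjection srp;
    SimHashFamily(int K, int d, unsigned seed) : srp(K, d, seed) {}
    uint32_t fingerprint(const std::vector<double>& x) const override {
        return srp.hash(x);
    }
};
// A DWTA or MinHash family would implement the same interface for
// order-based or Jaccard similarity, respectively.
```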