SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Beidi Chen
Collaborators: Tharun Medini*, James Farwell†, Sameh Gobriel†, Charlie Tai†, Anshumali Shrivastava*
*Rice University, †Intel
MLSys 2020
Our SLIDE system (C++ from scratch) on a 44-core CPU beats TensorFlow on a V100 GPU (1 hour vs. 3.5 hours) on networks with 100+ million parameters. TensorFlow on the same CPU takes 16 hours with all HPC optimizations (Intel MKL-DNN). 3.5x faster on CPU than TF on V100. [Plot: log scale in time.]
The Age of Large Networks
• More Data
• Large Models
• Tons of Engineering
• Backpropagation (a.k.a. simple gradient descent)
Fully Connected NN
Giant matrix multiplication for every data point in each epoch (forward + backward): activations are f(W^T x).
[Figure: fully connected network with Input, Hidden 1, Hidden 2, ..., Output layers.]
Challenges
Do we really need all the computations? No!!
Good news: only the high activations are important.
• Sampling a few neurons in proportion to their activations is enough (Adaptive Dropout; Ba et al., NeurIPS 2013; Makhzani et al., NeurIPS 2015)
• ReLU filters out negative activations (50% sparsity by design)
• Softmax: the output is typically dominated by a few classes
Bad news: we need to compute all activations to identify (or sample) the high-activation neurons. NO SAVINGS.
The Fundamental Sampling Puzzle
Given N fixed sampling weights w_1, w_2, ..., w_N:
• Task: sample x_i with probability w_i
• Cost of 1 sample: O(N)
• Cost of K samples: O(N)
Given N time-varying sampling weights (activations) w_1^t, w_2^t, ..., w_N^t:
• Task: at time t, sample x_i with probability w_i^t
• Cost of sampling: O(N), at every time t
• Last few years of work in Locality Sensitive Hashing: if w_i^t = f(sim(θ_t, x_i)) for a specific set of f and sim, then O(1) every time, after an initial preprocessing cost of O(N).
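To make the easy fixed-weight case concrete, here is a minimal sketch (not part of SLIDE; names and sizes are illustrative) using prefix sums: one O(N) preprocessing pass, then each sample costs O(log N). The comment notes why time-varying weights break this.

```cpp
// Sketch: sample index i with probability proportional to a *fixed* weight w[i].
// Preprocessing: O(N) prefix sums. Each sample: O(log N) binary search.
// If the weights changed every step (like activations), the prefix sums
// would have to be rebuilt each time, costing O(N) per step.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int sample(const std::vector<double>& prefix, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, prefix.back());
    double r = u(rng);
    // First index whose cumulative weight exceeds r.
    return std::upper_bound(prefix.begin(), prefix.end(), r) - prefix.begin();
}

int main() {
    std::vector<double> w = {0.1, 0.5, 0.2, 0.2};           // fixed sampling weights
    std::vector<double> prefix(w.size());
    std::partial_sum(w.begin(), w.end(), prefix.begin());   // O(N) preprocessing

    std::mt19937 rng(42);
    std::vector<int> counts(w.size(), 0);
    for (int t = 0; t < 100000; ++t) counts[sample(prefix, rng)]++;
    for (std::size_t i = 0; i < w.size(); ++i)
        std::printf("i=%zu  empirical=%.3f  target=%.3f\n", i, counts[i] / 1e5, w[i]);
    return 0;
}
```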
Textbook Hashing (Dictionary)
Hashing: a function h that maps a given data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint.
Property (ideal hash functions):
• If x = y, then h(x) = h(y)
• If x ≠ y, then h(x) ≠ h(y)
Probabilistic Fingerprinting (Hashing) (late 90s)
Hashing: a (randomized) function h that maps a given data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint.
Locality Sensitive Property:
• If sim(x, y) is high, then Pr(h(x) = h(y)) is high
• If sim(x, y) is low, then Pr(h(x) = h(y)) is low
Example 1: Signed Random Projection (SRP)
[Figure: points x and y relative to random hyperplanes H1 and H2, each side labeled + or −.]
Pr(h(x) = h(y)) = 1 − cos⁻¹(θ)/π, monotonic in θ (the cosine similarity of x and y).
A classical result from Goemans-Williamson (1995).
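A minimal sketch of an SRP hasher (the class name and layout are mine, not SLIDE's code): K random Gaussian hyperplanes each contribute one sign bit, and two vectors agree on a bit with the probability given above.

```cpp
// Sketch of Signed Random Projection (SimHash): one hash bit per random
// Gaussian hyperplane; Pr[bits agree] = 1 - acos(cosine similarity)/pi.
// Assumes K <= 32 so the fingerprint fits in a uint32_t.
#include <cstdint>
#include <random>
#include <vector>

struct SignedRandomProjection {
    std::vector<std::vector<double>> planes;   // K random hyperplanes in R^d

    SignedRandomProjection(int K, int d, unsigned seed = 0) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> gauss(0.0, 1.0);
        planes.assign(K, std::vector<double>(d));
        for (auto& p : planes)
            for (double& v : p) v = gauss(rng);
    }

    // K-bit fingerprint: bit k is the sign of <plane_k, x>.
    uint32_t hash(const std::vector<double>& x) const {
        uint32_t code = 0;
        for (std::size_t k = 0; k < planes.size(); ++k) {
            double dot = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) dot += planes[k][i] * x[i];
            if (dot >= 0) code |= (1u << k);
        }
        return code;
    }
};
```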
Example 2: (Densified) Winner-Take-All (WTA)
[Figure: original vectors with K = 3, their WTA hash codes (ICCV 2011), and their DWTA hash codes (UAI 2018).]
Yagnik (ICCV 2011), Chen and Shrivastava (UAI 2018)
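For intuition, here is a simplified sketch of a single WTA hash code: randomly permute the coordinates and report which of the first K permuted coordinates is largest, so the code depends only on the ordering of the values. DWTA's densification step for sparse inputs is only noted in a comment, not implemented.

```cpp
// Sketch of one Winner-Take-All (WTA) hash code. Repeating with independent
// permutations gives more codes. DWTA ("densified" WTA) additionally fills
// bins that see only zeros when the input is sparse; that step is omitted.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int wta_hash(const std::vector<double>& x, int K, unsigned seed) {
    // Random permutation of coordinate indices (one permutation per hash).
    std::vector<int> perm(x.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::mt19937 rng(seed);
    std::shuffle(perm.begin(), perm.end(), rng);

    // Position (0..K-1) of the maximum among the first K permuted coordinates.
    int best = 0;
    for (int k = 1; k < K; ++k)
        if (x[perm[k]] > x[perm[best]]) best = k;
    return best;   // only the rank order of x matters, not its magnitudes
}
```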
Probabilistic Hash Tables
Given: Pr[h(x) = h(y)] = f(sim(x, y)), where f is monotonic.
• Given a query, if h_1(q) = 11 and h_2(q) = 01, then probe the bucket with index 1101. It is a good bucket!!
• (Locality sensitive) h_i(q) = h_i(x) is a noisy indicator of high similarity.
• Doing better than random!!
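A sketch of how these fingerprints index buckets across L tables, reusing the SignedRandomProjection sketch from the SRP slide (the structure and names are illustrative, not SLIDE's actual data layout): each table concatenates K sign bits into one bucket id, and a query probes exactly one bucket per table.

```cpp
// Sketch: L hash tables, each keyed by a K-bit SRP fingerprint.
// Depends on the SignedRandomProjection sketch shown earlier.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct LSHTables {
    int K, L;
    std::vector<SignedRandomProjection> hashers;                       // one K-bit hasher per table
    std::vector<std::unordered_map<uint32_t, std::vector<int>>> tables;

    LSHTables(int K, int L, int d) : K(K), L(L) {
        for (int l = 0; l < L; ++l)
            hashers.emplace_back(K, d, static_cast<unsigned>(l));
        tables.resize(L);
    }

    // Insert item id under its bucket in every table.
    void insert(int id, const std::vector<double>& x) {
        for (int l = 0; l < L; ++l) tables[l][hashers[l].hash(x)].push_back(id);
    }

    // Probe one bucket per table and return the union of candidates.
    std::vector<int> query(const std::vector<double>& q) const {
        std::vector<int> candidates;
        for (int l = 0; l < L; ++l) {
            auto it = tables[l].find(hashers[l].hash(q));
            if (it != tables[l].end())
                candidates.insert(candidates.end(), it->second.begin(), it->second.end());
        }
        return candidates;   // may contain duplicates across tables
    }
};
```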
LSH for Search (Known)
Theory
• Super-linear O(N^(1+ρ)) memory
• Sub-linear query time, O(N^ρ)
• ρ < 1, but generally large (close to 1) and often hard to determine
Practical Issues
• Needs a lot of hash tables and distance computations for good accuracy on near neighbors
• Buckets can be quite heavy: poor randomness, or unfavorable data distributions
New View: Data Structures for Efficient Sampling!
Is LSH really a search algorithm?
• Given the query θ_t, LSH samples x_i from the dataset with probability w_i^t = 1 − (1 − p(x_i, θ_t)^K)^L
• w_i^t is monotonically increasing in p(x_i, θ_t), and hence in the similarity of x_i and θ_t
• LSH is considered a black box for nearest-neighbor search. It is not!!
LSH as Samplers
We can pre-process the dataset D such that:
• Given any query q, we can sample x ∈ D with probability Const × (1 − (1 − p(q, x)^K)^L), in KL hash computations and L bucket probes.
• Even K = 1, L = 1 is adaptive. So O(1)-time adaptive sampling.
• Adaptive: x is sampled with higher probability than y if and only if sim(q, x) > sim(q, y).
We can exactly compute the sampling probability:
• Const = (number of elements sampled) / (number of elements in the probed buckets)
(Chen et al., NeurIPS 2019) Sufficient for importance sampling estimations. Sampling cost O(1).
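Continuing the sketch above (still illustrative, reusing LSHTables and assuming SRP bits): the union of the L probed buckets is the sample, and for each retrieved item the retrieval probability 1 − (1 − p^K)^L can be computed exactly from its similarity to the query, which is what an importance-sampling estimator needs for reweighting. Here every bucket element is kept, so the Const factor from the slide is 1.

```cpp
// Sketch: the LSH tables viewed as a sampler. Item x_i is retrieved with
// probability 1 - (1 - p_i^K)^L, where p_i = Pr[one SRP bit of q and x_i
// agree] = 1 - acos(cos_sim(q, x_i))/pi (independence across bits/tables).
#include <algorithm>
#include <cmath>
#include <unordered_set>
#include <utility>
#include <vector>

double srp_collision_prob(const std::vector<double>& q, const std::vector<double>& x) {
    double dot = 0.0, nq = 0.0, nx = 0.0;
    for (std::size_t i = 0; i < q.size(); ++i) {
        dot += q[i] * x[i];  nq += q[i] * q[i];  nx += x[i] * x[i];
    }
    double cosine = dot / (std::sqrt(nq) * std::sqrt(nx) + 1e-12);
    cosine = std::max(-1.0, std::min(1.0, cosine));
    const double pi = std::acos(-1.0);
    return 1.0 - std::acos(cosine) / pi;
}

// Returns each distinct retrieved id together with its retrieval probability.
std::vector<std::pair<int, double>> sample_with_prob(
        const LSHTables& lsh, const std::vector<double>& q,
        const std::vector<std::vector<double>>& data) {
    std::unordered_set<int> seen;
    std::vector<std::pair<int, double>> out;
    for (int id : lsh.query(q)) {
        if (!seen.insert(id).second) continue;        // deduplicate across tables
        double p = srp_collision_prob(q, data[id]);   // per-bit collision probability
        double retrieval = 1.0 - std::pow(1.0 - std::pow(p, lsh.K), lsh.L);
        out.push_back({id, retrieval});               // weight estimates by 1/retrieval
    }
    return out;
}
```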
SLIDE: Sub-LInear Deep learning Engine
Step 1 – Build the hash tables by processing the weights of the hidden layers (initialization).
Subtlety: the vectors in the hash tables are neurons, not data vectors; we are reorganizing neurons.
[Figure: network with hash tables H1 (over Hidden 1) and H2 (over Hidden 2); buckets map fingerprints to neuron ids.]
SLIDE: Sub-LInear Deep learning Engine
Step 2 – Hash the input to any given layer using its randomized hash function.
[Figure: the layer input is hashed against hash tables H1 and H2.]
SLIDE: Sub-LInear Deep learning Engine
Step 3 – Query the hidden layer's hash table(s) for the active set using the integer fingerprint. This samples neurons in proportion to their activations.
[Figure: the matching buckets in H1 and H2 yield the active neurons.]
SLIDE: Sub-LInear Deep learning Engine
Step 4 – Perform forward and back propagation only on the nodes in the active set. Computation is on the order of the number of active neurons.
[Figure: forward/backward pass restricted to the active neurons.]
SLIDE: Sub-LInear Deep learning Engine
Step 5 – Update the hash tables by rehashing the updated node weights. Computation is on the order of the number of active neurons.
[Figure: updated neurons are rehashed into H1 and H2.]
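Putting the five steps together, here is a deliberately simplified single-layer sketch that reuses the LSHTables sketch above. It is not SLIDE's actual code: the real system processes batches, keeps per-neuron gradient buffers, and updates tables on a schedule. The upstream gradient g[j] for each neuron is assumed to be given.

```cpp
// Steps 1-5 on one toy layer: build tables from neuron weights, hash the
// input, query the active set, run forward/backward only on active neurons,
// and periodically rebuild the tables from the updated weights.
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

struct SparseLayer {
    std::vector<std::vector<double>> W;   // one weight vector per neuron
    LSHTables tables;

    SparseLayer(int n_neurons, int d, int K, int L, unsigned seed = 1)
        : W(n_neurons, std::vector<double>(d)), tables(K, L, d) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> init(0.0, 0.1);
        for (auto& w : W) for (double& v : w) v = init(rng);
        rebuild_tables();                 // Step 1: hash every neuron's weights
    }

    void rebuild_tables() {               // Step 5 (done periodically in practice)
        tables = LSHTables(tables.K, tables.L, static_cast<int>(W[0].size()));
        for (std::size_t j = 0; j < W.size(); ++j)
            tables.insert(static_cast<int>(j), W[j]);
    }

    // One training example x with an (assumed given) upstream gradient g[j].
    std::vector<std::pair<int, double>> train_step(
            const std::vector<double>& x, const std::vector<double>& g, double lr) {
        // Step 2 + 3: hash the input and collect the (deduplicated) active set.
        std::vector<int> candidates = tables.query(x);
        std::unordered_set<int> active(candidates.begin(), candidates.end());

        std::vector<std::pair<int, double>> activations;
        for (int j : active) {
            // Step 4a: forward pass only on active neurons (ReLU of dot product).
            double z = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) z += W[j][i] * x[i];
            activations.push_back({j, z > 0 ? z : 0.0});
            // Step 4b: SGD update only on active neurons.
            if (z > 0)
                for (std::size_t i = 0; i < x.size(); ++i) W[j][i] -= lr * g[j] * x[i];
        }
        return activations;   // caller invokes rebuild_tables() every few batches
    }
};
```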
We can go very sparse if Adaptive
• Reduce both training and inference cost by 95%!
• Significantly more for larger networks (the wider, the better)
• 2 hidden layers
• 1000 nodes per layer
Sparsity + Randomness → Asynchronous Updates
• 3 hidden layers
• 1000 nodes per layer
Less Computation + Asynchronous Parallelism
• Each update is computationally very small (100x+ reduction in computation and energy)
• Updates are near-independent, with very low chance of conflict. Hence, parallel SGD!
SLIDE: Sub-LInear Deep learning Engine
[Figure: SLIDE's data structures. Each network layer keeps L hash tables over its neurons, with buckets indexed by fingerprints h_1...h_K. Each neuron stores, per input in the batch, its active-input ids, the activation for each input, and the accumulated gradients, plus its weight vector (of previous-layer size).]
Parallelism with OpenMP
Parallel across training samples in a batch (extreme sparsity and randomness in gradient updates).
Thanks to the theory of HOGWILD! (Recht et al., NeurIPS 2011)
[Figure: one neuron's per-batch arrays: active inputs, activation for each input, accumulated gradients, and weights.]
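A sketch of this batch-level parallelism, assuming the SparseLayer sketch above (compile with -fopenmp): each sample in the batch is handled by one OpenMP thread, and weight updates go to shared memory without locks, in the HOGWILD spirit. Because each update touches only a few near-random neurons, conflicts are rare. A real implementation must also make the hash-table updates thread-safe or defer them; this sketch defers them by rebuilding the tables outside the parallel region.

```cpp
// HOGWILD-style batch parallelism with OpenMP: lock-free gradient updates
// across threads; hash-table rebuild deferred until after the parallel loop.
#include <vector>

void train_batch(SparseLayer& layer,
                 const std::vector<std::vector<double>>& batch_x,
                 const std::vector<std::vector<double>>& batch_g,
                 double lr) {
    #pragma omp parallel for
    for (int b = 0; b < static_cast<int>(batch_x.size()); ++b) {
        // Each thread reads/writes layer.W directly, without locks
        // (Recht et al. 2011: sparse, rarely-conflicting updates keep SGD convergent).
        layer.train_step(batch_x[b], batch_g[b], lr);
    }
    layer.rebuild_tables();   // Step 5, done once per batch here for simplicity
}
```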
Flexible Choices of Hash Functions
SLIDE supports four different LSH hash functions:
• SimHash (cosine similarity)
• Winner-Take-All hashing (order)
• Densified Winner-Take-All hashing (for sparse data)
• MinHash (Jaccard similarity)
Easily add more!
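One way such hash families could plug in behind a common interface (a sketch with made-up names, not SLIDE's actual API): the tables only need a function from a vector to a bucket fingerprint, so SimHash, (D)WTA, and MinHash become interchangeable.

```cpp
// Sketch of a pluggable LSH-family interface; only the SimHash adapter is
// shown, wrapping the SignedRandomProjection sketch from earlier.
#include <cstdint>
#include <vector>

struct LSHFamily {
    virtual ~LSHFamily() = default;
    // Bucket fingerprint of x for one table (K concatenated hash symbols).
    virtual uint32_t fingerprint(const std::vector<double>& x) const = 0;
};

struct SimHashFamily : LSHFamily {
    SignedRandomProjection srp;
    SimHashFamily(int K, int d, unsigned seed) : srp(K, d, seed) {}
    uint32_t fingerprint(const std::vector<double>& x) const override {
        return srp.hash(x);
    }
};
// A DWTA or MinHash family would implement the same interface for
// order-based or Jaccard similarity, respectively.
```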