
slide-1
SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

slide-2
SLIDE 2

¡ Would like to do prediction:

estimate a function f(x) so that y = f(x)

¡ Where y can be:

§ Real number: Regression
§ Categorical: Classification
§ Complex object:

§ Ranking of items, Parse tree, etc.

¡ Data is labeled:

§ Have many pairs {(x, y)}

§ x … vector of binary, categorical, real valued features
§ y … class: {+1, -1}, or a real number

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 2/27/20

slide-3
SLIDE 3

¡ Task: Given data (X,Y) build a model f to predict Y’ based on X’

¡ Strategy: Estimate y = f(x) on training data (X, Y).

Hope that the same f(x) also works to predict unknown Y’

§ The “hope” is called generalization

§ Overfitting: If f(x) predicts Y well but is unable to predict Y’

§ We want to build a model that generalizes well to unseen data

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3

[Figure: training data (X, Y) and test data (X’, Y’)]

2/27/20

slide-4
SLIDE 4

¡ 1) Training data is drawn independently at random according to unknown probability distribution P(x, y)

¡ 2) The learning algorithm analyzes the examples and produces a classifier f

¡ Given new data (x, y) drawn from P, the classifier is given x and predicts ŷ = f(x)

¡ The loss L(ŷ, y) is then measured

¡ Goal of the learning algorithm: Find f that minimizes expected loss E_P[L]

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 4


2/27/20

slide-5
SLIDE 5

5

[Diagram: the distribution P(x, y) generates the training set T; the learning algorithm produces f; on test data x the prediction ŷ = f(x) is scored against y by the loss function L(ŷ, y)]

Why is it hard? We estimate f on training data but want f to work well on unseen future (i.e., test) data

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2/27/20

slide-6
SLIDE 6

¡ Goal: Minimize the expected loss

min_f E_P[L]

¡ But we don’t have access to P -- we only know the training sample D:

min_f E_D[L]

¡ So, we minimize the average loss on the training data:

min_f J(f) = min_f (1/N) ∑_{i=1..N} L(f(x_i), y_i)

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 6

Problem: Just memorizing the training data gives us a perfect model (with zero loss)

2/27/20

slide-7
SLIDE 7

¡ Given:

§ A set of N training examples

§ {(x_1, y_1), (x_2, y_2), … , (x_N, y_N)}

§ A loss function L

¡ Choose the model: f_w(x) = w ⋅ x + b

¡ Find:

§ The weight vector w that minimizes the expected loss on the training data:

J(f) = (1/N) ∑_{i=1..N} L(w ⋅ x_i + b, y_i)

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 7 2/27/20

slide-8
SLIDE 8

¡ Problem: Step-wise Constant Loss function

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 8

[Plot: the 0/1 loss as a function of f_w(x) is a step-wise constant function]

Derivative is either 0 or ∞

2/27/20

slide-9
SLIDE 9

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 9

¡ Approximating the expected loss by a smooth function

§ Replace the original objective function by a surrogate loss function. E.g., hinge loss:

Ĵ(w) = (1/N) ∑_{i=1..N} max(0, 1 − y_i f(x_i))

[Plot: hinge loss as a function of y·f(x), shown for y = 1]

2/27/20
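Below is a minimal NumPy sketch (my own illustration, not from the slides) of the hinge-loss surrogate above; the function name hinge_loss and the toy data are made up for the example.

import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y_i * f(x_i)) for the linear model f(x) = w.x + b."""
    margins = y * (X @ w + b)                 # y_i * f(x_i) for every example
    return np.mean(np.maximum(0.0, 1.0 - margins))

# Toy data: three 2-d examples with labels in {-1, +1}
X = np.array([[2.0, 1.0], [-1.0, -2.0], [0.5, -0.5]])
y = np.array([+1.0, -1.0, +1.0])
w = np.array([1.0, 0.5])
print(hinge_loss(w, b=0.0, X=X, y=y))         # 0.25 on this toy set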

slide-10
SLIDE 10
slide-11
SLIDE 11

¡ Want to separate “+” from “-” using a line

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 11

Data:

¡ Training examples:

§ (x1, y1) … (xn, yn)

¡ Each example i:

§ xi = ( xi(1),… , xi(d) )

§ xi(j) is real valued

§ yi ∈ { -1, +1 }

¡ Inner product:

w ⋅ x = ∑_{j=1..d} w(j) ⋅ x(j)

  • Which is best linear separator (defined by w,b)?
slide-12
SLIDE 12

[Figure: positively and negatively labeled points with a separating line; points A, B, C lie at different distances from the line]

¡ Distance from the separating hyperplane corresponds to the “confidence” of prediction

¡ Example:

§ We are more sure about the class of A and B than of C

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 12

slide-13
SLIDE 13

¡ Margin γ: Distance of closest example from the decision line/hyperplane

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 13

The reason we define margin this way is due to theoretical convenience and existence of generalization error bounds that depend on the value of margin.

slide-14
SLIDE 14

¡ Remember: The Dot product

A ⋅ B = ‖A‖ ⋅ ‖B‖ ⋅ cos θ

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 14

‖A‖ = √( ∑_{j=1..d} A(j)² )

[Figure: the projection of A onto B has length ‖A‖ cos θ]

slide-15
SLIDE 15

¡ Dot product

A ⋅ B = ‖A‖ ‖B‖ cos θ

¡ What is w ⋅ x1, w ⋅ x2?

¡ So, γ roughly corresponds to the margin

§ Bottom line: Bigger γ, bigger the separation

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 15

[Figure: separating line w ⋅ x + b = 0 with two positive points x1 and x2; their projections onto w (i.e., w ⋅ x1 and w ⋅ x2) indicate how far each lies from the line; ‖w‖ = √( ∑_j w(j)² )]

slide-16
SLIDE 16

Distance from a point to a line

¡ Let:

§ Line L: w ⋅ x + b = w(1) x(1) + w(2) x(2) + b = 0
§ w = (w(1), w(2))
§ Point A = (xA(1), xA(2))
§ Point M on the line = (xM(1), xM(2))

d(A, L) = |AH| = |(A − M) ∙ w|
= |(xA(1) – xM(1)) w(1) + (xA(2) – xM(2)) w(2)|
= |xA(1) w(1) + xA(2) w(2) + b|
= |w ∙ A + b|

Remember xM(1) w(1) + xM(2) w(2) = −b since M belongs to line L

Note we assume ‖w‖ = 1

[Figure: point A and its projection H onto line L; M is a point on L; d(A, L) = |AH|]

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 16
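As a small sketch of the computation above (my code, not the slides'): in general one divides by ‖w‖, and under the slide's assumption ‖w‖ = 1 this reduces to |w ∙ A + b|.

import numpy as np

def distance_to_hyperplane(A, w, b):
    """Distance from point A to the hyperplane w.x + b = 0 (equals |w.A + b| when ||w|| = 1)."""
    return abs(np.dot(w, A) + b) / np.linalg.norm(w)

A = np.array([3.0, 2.0])
w = np.array([0.6, 0.8])                      # unit-length w, so the division changes nothing
print(distance_to_hyperplane(A, w, b=-1.0))   # |0.6*3 + 0.8*2 - 1| = 2.4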

slide-17
SLIDE 17

¡ Prediction = sign(w ⋅ x + b)

¡ “Confidence” = (w ⋅ x + b) y

¡ For i-th datapoint:

γ_i = (w ⋅ x_i + b) y_i

¡ Want to solve:

max_{w,b} min_i γ_i

¡ Can rewrite as

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 17

max_{w,b} γ
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ γ

[Figure: separating line w ⋅ x + b = 0 with margin γ on either side]

slide-18
SLIDE 18

¡ Maximize the margin:

§ Good according to intuition, theory (c.f. “VC dimension”) and practice
§ γ is margin … distance from the separating hyperplane

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 18

[Figure: maximizing the margin γ around the separating line w ⋅ x + b = 0]

max_{w,b} γ
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ γ

slide-19
SLIDE 19
slide-20
SLIDE 20

¡ Separating hyperplane is defined by the support vectors

§ Points on the +/− planes from the solution
§ If you knew these points, you could ignore the rest
§ Generally, d+1 support vectors (for d dim. data)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 20

slide-21
SLIDE 21

¡ Problem:

§ Let (w ⋅ x + b) y = γ, then (2w ⋅ x + 2b) y = 2γ

§ Scaling w increases margin!

¡ Solution:

§ Work with normalized w: γ = (w/‖w‖ ⋅ x + b) y

§ Let’s also require support vectors x_j to be on the planes defined by: w ⋅ x_j + b = ±1

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 21

[Figure: planes w ⋅ x + b = −1, 0, +1 with support vectors x1, x2 on the ±1 planes; ‖w‖ = √( ∑_j w(j)² )]

slide-22
SLIDE 22

¡ Want to maximize margin!

¡ What is the relation between x1 and x2?

§ x1 = x2 + 2γ (w/‖w‖)

§ We also know:

§ w ⋅ x1 + b = +1
§ w ⋅ x2 + b = −1

¡ So:

§ w ⋅ x1 + b = +1
§ w ⋅ (x2 + 2γ w/‖w‖) + b = +1
§ w ⋅ x2 + b + 2γ (w ⋅ w)/‖w‖ = +1

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 22

⟹ γ = 1/‖w‖     (Note: w ⋅ w = ‖w‖²)

[Figure: planes w ⋅ x + b = −1, 0, +1 separated by a gap of 2γ]

slide-23
SLIDE 23

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 23

¡ We started with:

max_{w,b} γ
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ γ

But w can be arbitrarily large!

¡ We normalized and...

arg max γ = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½‖w‖²

¡ Then:

min_w ½‖w‖²
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1

This is called SVM with “hard” constraints

[Figure: planes w ⋅ x + b = −1, 0, +1 with support vectors x1, x2 and margin 2γ]

slide-24
SLIDE 24

¡ If data is not separable introduce penalty:

§ Minimize ‖w‖² plus the number of training mistakes
§ Set C using cross validation

¡ How to penalize mistakes?

§ All mistakes are not equally bad!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 24

min_w ½‖w‖² + C ⋅ (# of training mistakes)
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1

[Figure: non-separable data around the line w ⋅ x + b = 0]

slide-25
SLIDE 25

¡ Introduce slack variables ξ_i

¡ If point x_i is on the wrong side of the margin then get penalty ξ_i

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 25

min_{w,b,ξ_i≥0} ½‖w‖² + C ∑_{i=1..n} ξ_i
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1 − ξ_i

For each data point: If margin ≥ 1, don’t care. If margin < 1, pay linear penalty ξ_i

[Figure: points x_i and x_j on the wrong side of the margin around w ⋅ x + b = 0, each incurring a slack penalty]
slide-26
SLIDE 26

¡ What is the role of slack penalty C:

§ C=∞: Only want w, b that separate the data
§ C=0: Can set ξ_i to anything, then w=0 (basically ignores the data)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 26

min_w ½‖w‖² + C ⋅ (# of training mistakes)
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1

[Figure: decision boundaries obtained with big C, “good” C, and small C]

slide-27
SLIDE 27

¡ SVM in the “natural” form:

arg min_{w,b} ½ w ⋅ w + C ∑_{i=1..n} max{0, 1 − y_i (w ⋅ x_i + b)}

Here ½ w ⋅ w is the margin term, C is the regularization parameter, and the sum is the empirical loss L (how well we fit training data)

Equivalently, as the constrained problem:

min_{w,b} ½‖w‖² + C ∑_{i=1..n} ξ_i
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0

¡ SVM uses “Hinge Loss”: max{0, 1 − z}, where z = y_i (w ⋅ x_i + b)

[Plot: 0/1 loss vs. hinge-loss penalty as a function of z]

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 27
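A minimal NumPy sketch (mine, not from the slides) of the “natural form” objective above; svm_objective and the toy data are assumed names and values.

import numpy as np

def svm_objective(w, b, X, y, C):
    """J(w, b) = 1/2 w.w + C * sum_i max{0, 1 - y_i (w.x_i + b)}."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))   # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), b=0.0, X=X, y=y, C=1.0))   # 0.75 here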

slide-28
SLIDE 28
slide-29
SLIDE 29

¡ Want to estimate w and b!

§ Standard way: Use a solver!

§ Solver: software for finding solutions to “common” optimization problems

¡ Use a quadratic solver:

§ Minimize quadratic function
§ Subject to linear constraints

¡ Problem: Solvers are inefficient for big data!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 29

min_{w,b} ½ w ⋅ w + C ∑_{i=1..n} ξ_i
s.t. ∀i, y_i (x_i ⋅ w + b) ≥ 1 − ξ_i, ξ_i ≥ 0

slide-30
SLIDE 30

¡ Want to minimize J(w,b):

¡ Compute the gradient ∇J(j) w.r.t. w(j)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 30

J(w, b) = ½ ∑_{j=1..d} (w(j))² + C ∑_{i=1..n} max{0, 1 − y_i (∑_{j=1..d} w(j) x_i(j) + b)}

where the second term is the empirical loss L(x_i, y_i)

∇J(j) = ∂J(w, b)/∂w(j) = w(j) + C ∑_{i=1..n} ∂L(x_i, y_i)/∂w(j)

∂L(x_i, y_i)/∂w(j) = 0             if y_i (w ⋅ x_i + b) ≥ 1
∂L(x_i, y_i)/∂w(j) = −y_i x_i(j)   else
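A sketch of the gradient formulas above in NumPy (my code; the function name is an assumption). Examples with y_i (w ⋅ x_i + b) ≥ 1 contribute nothing; the rest contribute −y_i x_i(j).

import numpy as np

def svm_gradient(w, b, X, y, C):
    """Return (dJ/dw, dJ/db) for J(w,b) = 1/2 ||w||^2 + C * sum_i max{0, 1 - y_i(w.x_i + b)}."""
    margins = y * (X @ w + b)
    viol = margins < 1.0                       # examples whose hinge term is active
    grad_w = w - C * (X[viol].T @ y[viol])     # w(j) + C * sum_i dL(x_i, y_i)/dw(j)
    grad_b = -C * np.sum(y[viol])              # b is not regularized
    return grad_w, grad_b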

slide-31
SLIDE 31

¡ Gradient descent:

Iterate until convergence:
• For j = 1 … d
  • Evaluate: ∇J(j)
  • Update: w’(j) ← w(j) − η ∇J(j)
• w ← w’

¡ Problem:

§ Computing ∇J(j) takes O(n) time!

§ n … size of the training dataset

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 31

∇J(j) = ∂J(w, b)/∂w(j) = w(j) + C ∑_{i=1..n} ∂L(x_i, y_i)/∂w(j)

η … learning rate parameter
C … regularization parameter
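A full-batch gradient-descent sketch (my code, not the slides'; hyperparameter values are arbitrary): every update recomputes the gradient over all n examples, which is exactly the O(n) cost noted above.

import numpy as np

def train_svm_batch_gd(X, y, C=1.0, eta=0.01, iters=500):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):                        # each iteration touches all n examples
        viol = y * (X @ w + b) < 1.0
        grad_w = w - C * (X[viol].T @ y[viol])
        grad_b = -C * np.sum(y[viol])
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b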

slide-32
SLIDE 32

¡ Stochastic Gradient Descent

§ Instead of evaluating gradient over all examples evaluate it for each individual training example

¡ Stochastic gradient descent:

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 32

We just had (full gradient):

∇J(j) = w(j) + C ∑_{i=1..n} ∂L(x_i, y_i)/∂w(j)

With SGD, evaluate the gradient at a single example:

∇J(j)(x_i) = w(j) + C ⋅ ∂L(x_i, y_i)/∂w(j)

Iterate until convergence:
• For i = 1 … n
  • For j = 1 … d
    • Compute: ∇J(j)(x_i)
    • Update: w(j) ← w(j) − η ∇J(j)(x_i)

Notice: no summation over i anymore
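A per-example SGD sketch of the loop above (my code; hyperparameter values are arbitrary): each update costs O(d) instead of O(n).

import numpy as np

def train_svm_sgd(X, y, C=1.0, eta=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):              # visit examples in random order
            if y[i] * (X[i] @ w + b) < 1.0:       # hinge term active for this example
                w -= eta * (w - C * y[i] * X[i])
                b -= eta * (-C * y[i])
            else:                                 # only the regularizer contributes
                w -= eta * w
    return w, b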
slide-33
SLIDE 33

¡ Batch Gradient Descent

§ Calculates the error for each example in the training dataset, but updates the model only after all examples have been evaluated (i.e., at the end of a training epoch)
§ PROS: fewer updates, more stable error gradient
§ CONS: usually requires the whole dataset in memory, slower than SGD

¡ Mini-Batch Gradient Descent

§ Like BGD, but using smaller batches of training data: a balance between the robustness of BGD and the efficiency of SGD

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 33
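A mini-batch sketch (my code; the batch size B and other hyperparameters are assumptions): each update uses the subgradient summed over a random batch of B examples.

import numpy as np

def train_svm_minibatch(X, y, C=1.0, eta=0.01, B=32, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        idx = rng.choice(n, size=min(B, n), replace=False)   # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        viol = yb * (Xb @ w + b) < 1.0
        w -= eta * (w - C * (Xb[viol].T @ yb[viol]))
        b -= eta * (-C * np.sum(yb[viol]))
    return w, b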

slide-34
SLIDE 34
slide-35
SLIDE 35

¡ Dataset:

§ Reuters RCV1 document corpus

§ Predict a category of a document

§ One vs. the rest classification

§ n = 781,000 training examples (documents)
§ 23,000 test examples
§ d = 50,000 features

§ One feature per word
§ Remove stop-words
§ Remove low-frequency words

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 35

slide-36
SLIDE 36

¡ Questions:

§ (1) Is SGD successful at minimizing J(w,b)?
§ (2) How quickly does SGD find the min of J(w,b)?
§ (3) What is the error on a test set?

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 36

[Table: training time, value of J(w,b), and test error for Standard SVM, “Fast SVM”, and SGD-SVM]

(1) SGD-SVM is successful at minimizing the value of J(w,b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable

slide-37
SLIDE 37

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 37

Optimization quality: | J(w,b) – J (wopt,bopt) |

[Plot: optimization quality vs. training time for Conventional SVM and SGD-SVM]

For optimizing J(w,b) within reasonable quality, SGD-SVM is super fast

slide-38
SLIDE 38

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 38

slide-39
SLIDE 39

¡ Need to choose learning rate η and t₀

¡ Tricks:

§ Choose t₀ so that the expected initial updates are comparable with the expected size of the weights
§ Choose η:

§ Select a small subsample
§ Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
§ Pick the one that most reduces the cost
§ Use η for the next 100k iterations on the full dataset

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 39

w_{t+1} ← w_t − η_t ⋅ (w_t + C ∂L(x_i, y_i)/∂w),   with η_t = η / (t + t₀)
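A sketch of the rate-selection trick above (my code; the candidate rates, subsample size, and t₀ are assumptions): run a few SGD steps with the decaying step η/(t + t₀) on a small subsample and keep the rate that most reduces the cost.

import numpy as np

def pick_learning_rate(X, y, C=1.0, t0=10, sub=1000, steps=200, seed=0,
                       candidates=(10.0, 1.0, 0.1, 0.01)):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sub, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]                      # small subsample for the trial runs

    def cost(w, b):
        return 0.5 * w @ w + C * np.sum(np.maximum(0, 1 - ys * (Xs @ w + b)))

    def run(eta):
        w, b = np.zeros(X.shape[1]), 0.0
        for t in range(steps):
            i = rng.integers(len(Xs))
            step = eta / (t + t0)                # decaying step size from the slide
            if ys[i] * (Xs[i] @ w + b) < 1:
                w -= step * (w - C * ys[i] * Xs[i])
                b += step * C * ys[i]
            else:
                w -= step * w
        return cost(w, b)

    return min(candidates, key=run)              # rate that most reduces the cost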

slide-40
SLIDE 40

¡ Idea 1:

One against all: Learn 3 classifiers

§ + vs. {o, -}
§ - vs. {o, +}
§ o vs. {+, -}

Obtain: w+ b+, w- b-, wo bo

¡ How to classify?

¡ Return class c = arg max_c (w_c ⋅ x + b_c)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 40
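A one-against-all prediction sketch (my code; the class labels and weights are made up): return the class c maximizing w_c ⋅ x + b_c, as on the slide.

import numpy as np

def predict_one_vs_rest(x, classifiers):
    """classifiers: dict mapping class label c -> (w_c, b_c)."""
    return max(classifiers, key=lambda c: classifiers[c][0] @ x + classifiers[c][1])

clfs = {"+": (np.array([1.0, 0.0]), 0.0),
        "-": (np.array([-1.0, 0.0]), 0.0),
        "o": (np.array([0.0, 1.0]), -0.5)}
print(predict_one_vs_rest(np.array([2.0, 0.3]), clfs))   # "+" has the largest score here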

slide-41
SLIDE 41

¡ Idea 2: Learn 3 sets of weights simultaneously!

§ For each class c estimate w_c, b_c

§ Want the correct class y_i to have the highest margin:

w_{y_i} ⋅ x_i + b_{y_i} ≥ 1 + w_c ⋅ x_i + b_c     ∀c ≠ y_i, ∀i

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 41


slide-42
SLIDE 42

¡ Optimization problem:

§ To obtain parameters w_c, b_c (for each class c) we can use similar techniques as for the 2-class SVM

¡ SVM is widely perceived as a very powerful learning algorithm

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 42

min_{w,b} ½ ∑_c ‖w_c‖² + C ∑_{i=1..n} ξ_i
s.t. w_{y_i} ⋅ x_i + b_{y_i} ≥ w_c ⋅ x_i + b_c + 1 − ξ_i,   ∀c ≠ y_i, ∀i
     ξ_i ≥ 0, ∀i

slide-43
SLIDE 43
slide-44
SLIDE 44

¡ The Unreasonable Effectiveness of Data

§ In 2017, Google revisited a 15-year-old experiment on the effect of data and model size in ML, focusing on the latest Deep Learning models in computer vision

¡ Findings:

§ Performance increases logarithmically based on volume of training data
§ Complexity of modern ML models (i.e., deep neural nets) allows for even further performance gains

¡ Large datasets + large ML models => amazing results!!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 44

“Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”: https://arxiv.org/abs/1707.02968

slide-45
SLIDE 45

¡ Last lecture: Decision Trees (and PLANET) as a prime example of Data Parallelism in ML

¡ Today’s lecture: Multiclass SVMs, Statistical models, Neural Networks, etc. can leverage both Data Parallelism and Model Parallelism

§ State-of-the-art models can have more than 100 million parameters!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 45

slide-46
SLIDE 46

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 46

[Figure: a model partitioned across machines in stages — M2 and M4 must wait for the 1st stage to complete!]

slide-47
SLIDE 47

[Diagram legend: Model; Machine (Model Partition); Core; Training Data]

¡ Unsupervised or Supervised Objective

¡ Minibatch Stochastic Gradient Descent (SGD)

¡ Model parameters sharded by partition

¡ 10s, 100s, or 1000s of cores per model

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 47

slide-48
SLIDE 48

[Diagram: model workers compute updates ∆p on their data and send them to the Parameter Server, which applies p’ = p + ∆p, then p’’ = p’ + ∆p’]

¡ Parameter Server: Key/Value store

¡ Keys index the model parameters (e.g., weights)

¡ Values are the parameters of the ML model (e.g., a neural network)

¡ Systems challenges:

§ Bandwidth limits § Synchronization § Fault tolerance

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 48
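A conceptual key/value parameter-server sketch (my toy code, not Google's system): keys index parameter shards, values hold the parameters, and workers push deltas that the server folds in as p’ = p + ∆p.

import numpy as np

class ParameterServer:
    def __init__(self, shapes):
        # one entry per parameter shard, e.g. {"w": (50000,), "b": ()}
        self.params = {key: np.zeros(shape) for key, shape in shapes.items()}

    def pull(self, key):
        return self.params[key].copy()            # worker fetches the current value

    def push(self, key, delta):
        self.params[key] += delta                 # apply p' = p + delta_p

server = ParameterServer({"w": (3,), "b": ()})
server.push("w", np.array([0.1, -0.2, 0.0]))      # a worker's update
print(server.pull("w"))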

slide-49
SLIDE 49

[Diagram: model workers on data shards send updates ∆p to the Parameter Server, which applies p’ = p + ∆p]

Asynchronous Distributed Stochastic Gradient Descent

Why do parallel updates work?

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 49

slide-50
SLIDE 50

¡ Key idea: don’t synchronize, just overwrite parameters opportunistically from multiple workers (i.e., servers)

§ Same implementation as SGD, just without locking!

¡ In theory, Async SGD converges, but at a slower rate than the serial version.

¡ In practice, when gradient updates are sparse (i.e., high-dimensional data), same convergence!

¡ Recht et al. “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, 2011

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 50

RR is a super optimized version of online Gradient Descent
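A toy lock-free (HOGWILD!-style) sketch (my code, far from the optimized original): several threads update one shared weight vector without any synchronization; when gradient updates are sparse, such collisions rarely hurt convergence.

import threading
import numpy as np

def async_sgd(X, y, C=1.0, eta=0.01, epochs=5, n_workers=4, seed=0):
    n, d = X.shape
    w = np.zeros(d)                               # shared parameters, no lock

    def worker(worker_id):
        rng = np.random.default_rng(seed + worker_id)
        for _ in range(epochs * n // n_workers):
            i = rng.integers(n)
            if y[i] * (X[i] @ w) < 1.0:           # bias term omitted for brevity
                w[:] = w - eta * (w - C * y[i] * X[i])   # unsynchronized write
            else:
                w[:] = w - eta * w

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w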

slide-51
SLIDE 51

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 51

[Figure: HOGWILD! pseudocode (compared with SGD) — P is the number of partitions/processors; component-wise gradient updates rely on sparsity]

slide-52
SLIDE 52

Asynchronous Distributed Stochastic Gradient Descent

[Diagram: multiple model replicas (model workers) on data shards communicating with a Parameter Server]

From an engineering standpoint, this is much better than a single model with the same number of total machines:

¡ Synchronization boundaries involve fewer machines
¡ Better robustness to individual slow machines
¡ Makes forward progress even during evictions/restarts

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 52

¡ Google, Large Scale Distributed Deep Networks [2012]

¡ All ingredients together:

§ Model and Data parallelism
§ Async SGD

¡ Dawn of modern Deep Learning