  1. Machine Learning 2 (DS 4420, Spring 2020): Neural Networks & Backprop. Byron C. Wallace

  2-4. Neural Networks! • In 2020, neural networks are the dominant technology in machine learning (for better or worse)! • Today, we’ll go over some of the fundamentals of NNs and modern libraries (we saw a preview last week, with auto-diff)! • This will also serve as a refresher on gradient descent

  5-7. Gradient Descent in Linear Models. Last time we thought in probabilistic terms and discussed maximum likelihood estimation for “generative” models. Today we’ll take the view of learning as search/optimization. We’ll start with linear models, review gradient descent, and then talk about neural nets + backprop.

  8. Loss. The simplest loss is probably 0/1 loss: 0 if we’re correct, 1 if we’re wrong. What’s an algorithm that minimizes this?
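
As a concrete reference point, here is a minimal NumPy sketch of the 0/1 loss for labels in {-1, +1}; the variable names and the convention that yŷ ≤ 0 counts as an error are assumptions, consistent with the loss definitions later in the deck.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Average 0/1 loss: 1 whenever the true label y (in {-1, +1})
    and the raw score y_hat disagree in sign (ties count as errors)."""
    return np.mean(y * y_hat <= 0)

# Two correct predictions and one mistake -> loss of 1/3
y = np.array([1, -1, 1])
y_hat = np.array([0.7, -2.3, -0.1])
print(zero_one_loss(y, y_hat))  # 0.333...
```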

  9. The Perceptron!

  10-13. Training data: features x with labels y. Consider a simple linear model with parameters w that predicts ŷ = +1 if w · x ≥ 0 and −1 otherwise (assumes the bias term is moved into x or omitted). The learning problem is to estimate w. What is our criterion for a good w? Minimal loss.
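
A minimal sketch of this decision rule in NumPy (the function name is mine; the bias is folded into x, as the slide assumes):

```python
import numpy as np

def predict(w, x):
    """Linear decision rule: +1 if w . x >= 0, else -1."""
    return 1 if np.dot(w, x) >= 0 else -1
```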

  14. Perceptron!

Algorithm 5 PerceptronTrain(D, MaxIter)
1: w_d ← 0, for all d = 1 ... D   // initialize weights
2: b ← 0   // initialize bias
3: for iter = 1 ... MaxIter do
4:   for all (x, y) ∈ D do
5:     a ← Σ_{d=1..D} w_d x_d + b   // compute activation for this example
6:     if ya ≤ 0 then
7:       w_d ← w_d + y x_d, for all d = 1 ... D   // update weights
8:       b ← b + y   // update bias
9:     end if
10:   end for
11: end for
12: return w_0, w_1, ..., w_D, b

Fig and Alg from CIML [Daume]
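
A runnable NumPy transcription of the training loop above; the data format (a list of (x, y) pairs with y ∈ {-1, +1}) is an assumption.

```python
import numpy as np

def perceptron_train(D, max_iter):
    """Perceptron training. D is a list of (x, y) pairs, with x a
    1-D feature array and y in {-1, +1}."""
    dim = len(D[0][0])
    w = np.zeros(dim)              # initialize weights
    b = 0.0                        # initialize bias
    for _ in range(max_iter):
        for x, y in D:
            a = np.dot(w, x) + b   # activation for this example
            if y * a <= 0:         # mistake (or on the boundary)
                w = w + y * x      # update weights
                b = b + y          # update bias
    return w, b

# Tiny linearly separable example (AND-like data)
D = [(np.array([1.0, 1.0]),  1), (np.array([1.0, 0.0]), -1),
     (np.array([0.0, 1.0]), -1), (np.array([0.0, 0.0]), -1)]
w, b = perceptron_train(D, max_iter=10)
```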

  15. Problems with 0/1 loss • If we’re wrong by .0001 it is “as bad” as being wrong by .9999 • Because it is discrete, optimization is hard if the instances are not linearly separable

  16. Smooth loss. Idea: Introduce a “smooth” loss function to make optimization easier. Example: hinge loss, ℓ^(hin)(y, ŷ) = max{0, 1 − yŷ}, where ŷ = w · x is the raw (signed) output. [Hand-drawn sketch: hinge loss plotted against the signed margin yŷ; zero on the correct side once the margin reaches 1, growing linearly on the wrong side.]

  17. Losses
Zero/one: ℓ^(0/1)(y, ŷ) = 1[yŷ ≤ 0]
Hinge: ℓ^(hin)(y, ŷ) = max{0, 1 − yŷ}
Logistic: ℓ^(log)(y, ŷ) = (1/log 2) · log(1 + exp[−yŷ])
Exponential: ℓ^(exp)(y, ŷ) = exp[−yŷ]
Squared: ℓ^(sqr)(y, ŷ) = (y − ŷ)²
Fig and Eq’s from CIML [Daume]
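
For concreteness, here are these surrogate losses as NumPy functions (my own transcription of the formulas above; y is the true label in {-1, +1} and y_hat is the raw model output):

```python
import numpy as np

def zero_one(y, y_hat):    return (y * y_hat <= 0).astype(float)
def hinge(y, y_hat):       return np.maximum(0.0, 1.0 - y * y_hat)
def logistic(y, y_hat):    return np.log1p(np.exp(-y * y_hat)) / np.log(2)
def exponential(y, y_hat): return np.exp(-y * y_hat)
def squared(y, y_hat):     return (y - y_hat) ** 2
```

Apart from 0/1, all of these are continuous (and convex) in y_hat, which is what makes gradient-based optimization tractable.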

  18-19. Regularization

min_{w,b} Σ_n ℓ(y_n, w · x_n + b) + λ R(w, b)

Prevent w from “getting too crazy”
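
A sketch of this objective in code, using the hinge loss and the L2 regularizer R(w, b) = ½||w||² as one concrete choice (that choice is mine, for illustration):

```python
import numpy as np

def regularized_objective(w, b, X, y, lam):
    """Sum of hinge losses over the data plus an L2 penalty on w.
    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}."""
    margins = y * (X @ w + b)
    data_loss = np.maximum(0.0, 1.0 - margins).sum()
    penalty = lam * 0.5 * np.dot(w, w)   # R(w, b) = (1/2) ||w||^2
    return data_loss + penalty
```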

  20. Gradient descent. [Figure: illustration of gradient descent. Image: Gradient_descent.png, original uploader Olegalexandrov at English Wikipedia; derivative work by Zerodamage. Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355]

  21. Algorithm 21 GradientDescent(F, K, η^(1), ...)
1: z^(0) ← ⟨0, 0, ..., 0⟩   // initialize variable we are optimizing
2: for k = 1 ... K do
3:   g^(k) ← ∇_z F |_{z^(k−1)}   // compute gradient at current location
4:   z^(k) ← z^(k−1) − η^(k) g^(k)   // take a step down the gradient
5: end for
6: return z^(K)
Alg from CIML [Daume]
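
A direct NumPy transcription of this loop; the gradient function grad_F and the fixed step size eta are assumptions for the sketch (the algorithm above allows a per-step η^(k)).

```python
import numpy as np

def gradient_descent(grad_F, dim, K, eta):
    """Minimize F by following its gradient for K steps.
    grad_F(z) returns the gradient of F at z; eta is the step size."""
    z = np.zeros(dim)        # initialize variable we are optimizing
    for _ in range(K):
        g = grad_F(z)        # compute gradient at current location
        z = z - eta * g      # take a step down the gradient
    return z

# Example: F(z) = ||z - 3||^2 has gradient 2 (z - 3); the minimizer is z = [3, 3]
z_star = gradient_descent(lambda z: 2 * (z - 3.0), dim=2, K=100, eta=0.1)
```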

  22-24. Gradient of the exponential loss with L2 regularization:

∇_w L = ∇_w [ Σ_n exp(−y_n(w · x_n + b)) + (λ/2)||w||² ]
      = Σ_n (∇_w −y_n(w · x_n + b)) exp(−y_n(w · x_n + b)) + λw
      = −Σ_n y_n x_n exp(−y_n(w · x_n + b)) + λw
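
A NumPy sketch of this gradient and a single descent step on w (the vectorized form and step size are mine; the bias b is held fixed, as in the derivation above):

```python
import numpy as np

def grad_w(w, b, X, y, lam):
    """Gradient w.r.t. w of sum_n exp(-y_n (w . x_n + b)) + (lam/2) ||w||^2."""
    e = np.exp(-y * (X @ w + b))         # per-example exp(-y_n (w . x_n + b))
    return -(X.T @ (y * e)) + lam * w    # -sum_n y_n x_n e_n + lam w

# One gradient descent step on the weights:
# w = w - eta * grad_w(w, b, X, y, lam)
```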

  25. Limitations of linear models

  26-28. Neural networks. Idea: Basically stack together a bunch of linear models. This introduces hidden units, which are neither observations (x) nor outputs (y). [Hand-drawn sketch: inputs x connected to hidden units h by weights W1, and h connected to the output y by weights W2, with (non-linear) activation functions applied at each layer.] The challenge: How do we update weights associated with each node in this multi-layer regime?

  29. back-propagation = gradient descent + chain rule
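
The deck previewed auto-diff earlier; as a tiny illustration of “gradient descent + chain rule”, here is a sketch using PyTorch autograd (PyTorch is my choice for the example, not necessarily the library used in class):

```python
import torch

# f(w) = tanh(w * x); by the chain rule, df/dw = (1 - tanh(w * x)**2) * x
w = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(2.0)
f = torch.tanh(w * x)
f.backward()      # back-propagation: apply the chain rule through the graph
print(w.grad)     # equals (1 - torch.tanh(w * x)**2) * x
```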

  30. Algorithm 27 ForwardPropagation(x)
1: for all input nodes u do
2:   h_u ← corresponding feature of x
3: end for
4: for all nodes v in the network whose parents are computed do
5:   a_v ← Σ_{u ∈ par(v)} w_(u,v) h_u
6:   h_v ← tanh(a_v)
7: end for
8: return a_y
Tanh is another common activation function
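
For a standard layered network, the same computation in NumPy might look like the sketch below, assuming fully connected layers, tanh hidden units, and a linear output (one common reading of the algorithm above; all names are mine):

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    """One hidden layer: h = tanh(W1 x + b1), output a_y = w2 . h + b2."""
    a = W1 @ x + b1       # pre-activations of the hidden units
    h = np.tanh(a)        # hidden unit values
    a_y = w2 @ h + b2     # output activation (left linear here)
    return a_y, h, a      # keep h and a around for back-propagation
```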

  31. Algorithm 28 BackPropagation(x, y)
1: run ForwardPropagation(x) to compute activations
2: e_y ← y − a_y   // compute overall network error
3: for all nodes v in the network whose error e_v is computed do
4:   for all u ∈ par(v) do
5:     g_(u,v) ← −e_v h_u   // compute gradient of this edge
6:     e_u ← e_u + e_v w_(u,v) (1 − tanh²(a_u))   // compute the “error” of the parent node
7:   end for
8: end for
9: return all gradients g_e
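
Continuing the one-hidden-layer forward() sketch above, the same updates in NumPy (error e_y = y − a_y as in the algorithm; bias gradients omitted for brevity; all names are mine):

```python
import numpy as np

def backward(x, y, W1, b1, w2, b2):
    """Gradients for the one-hidden-layer network defined in forward()."""
    a_y, h, a = forward(x, W1, b1, w2, b2)
    e_y = y - a_y                              # overall network error
    g_w2 = -e_y * h                            # gradient for output-layer weights
    e_h = e_y * w2 * (1.0 - np.tanh(a) ** 2)   # "error" of each hidden unit
    g_W1 = -np.outer(e_h, x)                   # gradient for hidden-layer weights
    return g_W1, g_w2

# Gradient descent then updates, e.g., W1 = W1 - eta * g_W1
```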

  32. What are we doing with these gradients again?

  33. Gradient descent. [Figure: illustration of gradient descent, repeated from slide 20. Image: Gradient_descent.png, original uploader Olegalexandrov at English Wikipedia; derivative work by Zerodamage. Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355]

  34. Neural Networks! If you’re interested in learning more…
