Machine Learning 2 (DS 4420), Spring 2020
Neural Networks & Backprop
Byron C. Wallace
Neural Networks!
• In 2020, neural networks are the dominant technology in machine learning (for better or worse)!
• Today, we'll go over some of the fundamentals of NNs and modern libraries (we saw a preview last week, with auto-diff)!
• This will also serve as a refresher on gradient descent
Gradient Descent in Linear Models
Last time we thought in probabilistic terms and discussed maximum likelihood estimation for "generative" models.
Today we'll take the view of learning as search/optimization.
We'll start with linear models, review gradient descent, and then talk about neural nets + backprop.
Loss
The simplest loss is probably 0/1 loss:
0 if we're correct
1 if we're wrong
What's an algorithm that minimizes this?
The Perceptron!
Training data: (x, y) pairs
Consider a simple linear model with parameters w:
ŷ = sign(w·x), i.e., +1 if w·x ≥ 0 and −1 otherwise
(assumes bias term moved into x or omitted)
The learning problem is to estimate w.
What is our criterion for a good w? Minimal loss
Perceptron!
Algorithm 5 PerceptronTrain(D, MaxIter)
1: w_d ← 0, for all d = 1 … D // initialize weights
2: b ← 0 // initialize bias
3: for iter = 1 … MaxIter do
4:   for all (x, y) ∈ D do
5:     a ← ∑_{d=1}^{D} w_d x_d + b // compute activation for this example
6:     if ya ≤ 0 then
7:       w_d ← w_d + y x_d, for all d = 1 … D // update weights
8:       b ← b + y // update bias
9:     end if
10:   end for
11: end for
12: return w_0, w_1, …, w_D, b
Fig and Alg from CIML [Daume]
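A minimal NumPy sketch of this training loop; the toy dataset at the end is made up for illustration:

import numpy as np

def perceptron_train(X, y, max_iter=100):
    """Train a perceptron. X: (n, d) features; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)   # initialize weights
    b = 0.0           # initialize bias
    for _ in range(max_iter):
        for x_i, y_i in zip(X, y):
            a = np.dot(w, x_i) + b   # activation for this example
            if y_i * a <= 0:         # mistake (or on the boundary)
                w += y_i * x_i       # update weights
                b += y_i             # update bias
    return w, b

# Toy usage: two linearly separable points
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1, -1])
w, b = perceptron_train(X, y)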
Problems with 0/1 loss
• If we're wrong by .0001 it is "as bad" as being wrong by .9999
• Because it is discrete, optimization is hard if the instances are not linearly separable
Smooth loss
Idea: Introduce a "smooth" loss function to make optimization easier.
Example: Hinge loss
ℓ(hin)(y, ŷ) = max{0, 1 − yŷ}
Here ŷ = w·x is the raw (signed) model output: the loss is 0 when we are correct with margin at least 1, and grows linearly as we get more wrong.
[Figure: hinge loss plotted against the signed margin yŷ]
Losses
Zero/one: ℓ(0/1)(y, ŷ) = 1[yŷ ≤ 0]
Hinge: ℓ(hin)(y, ŷ) = max{0, 1 − yŷ}
Logistic: ℓ(log)(y, ŷ) = (1/log 2) log(1 + exp[−yŷ])
Exponential: ℓ(exp)(y, ŷ) = exp[−yŷ]
Squared: ℓ(sqr)(y, ŷ) = (y − ŷ)²
Fig and Eq's from CIML [Daume]
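Written out in NumPy for a quick sanity check (here ŷ is the raw signed output w·x + b, not a thresholded label):

import numpy as np

def zero_one(y, yhat):    return (y * yhat <= 0).astype(float)
def hinge(y, yhat):       return np.maximum(0.0, 1.0 - y * yhat)
def logistic(y, yhat):    return np.log1p(np.exp(-y * yhat)) / np.log(2.0)
def exponential(y, yhat): return np.exp(-y * yhat)
def squared(y, yhat):     return (y - yhat) ** 2

# All of these upper-bound (or smoothly approximate) 0/1 loss near the margin:
y, yhat = 1.0, np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for loss in (zero_one, hinge, logistic, exponential, squared):
    print(loss.__name__, loss(y, yhat))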
Regularization
min_{w,b} ∑_n ℓ(y_n, w·x_n + b) + λ R(w, b)
Prevent w from "getting too crazy"
Gradient descent
[Figure: gradient descent on a loss surface. Gradient_descent.png, original uploader Olegalexandrov at English Wikipedia; derivative work by Zerodamage. Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355]
Algorithm 21 GradientDescent(F, K, η1, …)
1: z(0) ← ⟨0, 0, …, 0⟩ // initialize variable we are optimizing
2: for k = 1 … K do
3:   g(k) ← ∇_z F |_{z(k−1)} // compute gradient at current location
4:   z(k) ← z(k−1) − η(k) g(k) // take a step down the gradient
5: end for
6: return z(K)
Alg from CIML [Daume]
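A direct NumPy translation of this loop; the quadratic objective and step size in the toy example below are illustrative assumptions:

import numpy as np

def gradient_descent(grad_F, z0, K=100, eta=0.1):
    """Minimize F by taking K steps down its gradient."""
    z = z0.copy()          # z^(0): starting point
    for k in range(K):
        g = grad_F(z)      # g^(k): gradient at current location
        z = z - eta * g    # z^(k): take a step down the gradient
    return z

# Toy example: F(z) = ||z - c||^2 has gradient 2(z - c); minimum at c
c = np.array([3.0, -1.0])
z_star = gradient_descent(lambda z: 2 * (z - c), z0=np.zeros(2))
print(z_star)  # close to [3, -1]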
Gradient of the exponential loss with L2 regularization:
∇_w L = ∇_w [ ∑_n exp(−y_n(w·x_n + b)) + (λ/2)‖w‖² ]
      = ∑_n (∇_w −y_n(w·x_n + b)) exp(−y_n(w·x_n + b)) + λw
      = −∑_n y_n x_n exp(−y_n(w·x_n + b)) + λw
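That final expression is easy to check numerically; a sketch assuming a small random dataset, verified against finite differences:

import numpy as np

def exp_loss_grad_w(w, b, X, y, lam):
    """Gradient of sum_n exp(-y_n (w.x_n + b)) + (lam/2)||w||^2 w.r.t. w."""
    margins = y * (X @ w + b)          # y_n (w.x_n + b)
    coefs = np.exp(-margins)           # exp[-y_n (w.x_n + b)]
    return -(y * coefs) @ X + lam * w  # -sum_n y_n x_n exp[...] + lam w

# Check against finite differences on made-up data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), np.array([1, -1, 1, 1, -1])
w, b, lam = rng.normal(size=3), 0.5, 0.1
g = exp_loss_grad_w(w, b, X, y, lam)

def L(w):
    return np.exp(-y * (X @ w + b)).sum() + lam / 2 * (w @ w)

eps = 1e-6
num = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(g, num, atol=1e-5)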
Limitations of linear models
A linear model can only represent linearly separable decision boundaries (the classic example: it cannot compute XOR).
Neural networks
Idea: Basically stack together a bunch of linear models. This introduces hidden units which are neither observations (x) nor outputs (y).
[Figure: inputs x feed hidden units h through weights W1; hidden units feed the output y through weights w2]
(Non-linear) activation functions
The challenge: How do we update weights associated with each node in this multi-layer regime?
back-propagation = gradient descent + chain rule
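As a concrete (assumed) example using PyTorch, an auto-diff library like the one previewed last week: calling .backward() runs back-propagation through the computation graph, and we then take a gradient descent step by hand.

import torch

# A tiny two-layer network, written out by hand
x = torch.tensor([1.0, -2.0])
W1 = torch.randn(3, 2, requires_grad=True)  # input -> hidden weights
w2 = torch.randn(3, requires_grad=True)     # hidden -> output weights
y = torch.tensor(1.0)

h = torch.tanh(W1 @ x)    # hidden activations
y_hat = w2 @ h            # network output
loss = (y - y_hat) ** 2   # squared loss

loss.backward()           # chain rule: fills in W1.grad and w2.grad
with torch.no_grad():     # one gradient descent step on each parameter
    W1 -= 0.1 * W1.grad
    w2 -= 0.1 * w2.grad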
Algorithm 27 ForwardPropagation(x)
1: for all input nodes u do
2:   h_u ← corresponding feature of x
3: end for
4: for all nodes v in the network whose parents are computed do
5:   a_v ← ∑_{u ∈ par(v)} w_{(u,v)} h_u
6:   h_v ← tanh(a_v)
7: end for
8: return a_y
Tanh is another common activation function
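A NumPy sketch of the same computation, specialized to fully connected layers (the general algorithm works on any feed-forward graph; the layer sizes here are made up):

import numpy as np

def forward_propagation(x, Ws):
    """Ws: list of weight matrices, one per layer; tanh hidden activations."""
    h = x
    for W in Ws[:-1]:
        h = np.tanh(W @ h)  # h_v = tanh(a_v), where a_v = sum_u w_(u,v) h_u
    return Ws[-1] @ h       # a_y: linear output activation

x = np.array([0.5, -1.0])
Ws = [np.random.randn(3, 2), np.random.randn(1, 3)]
print(forward_propagation(x, Ws))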
Algorithm 28 BackPropagation(x, y)
1: run ForwardPropagation(x) to compute activations
2: e_y ← y − a_y // compute overall network error
3: for all nodes v in the network whose error e_v is computed do
4:   for all u ∈ par(v) do
5:     g_{u,v} ← −e_v h_u // compute gradient of this edge
6:     e_u ← e_u + e_v w_{u,v} (1 − tanh²(a_u)) // compute the "error" of the parent node
7:   end for
8: end for
9: return all gradients g_e
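Putting the two algorithms together for a two-layer network; a hedged end-to-end sketch with made-up sizes and data, computing the same edge gradients by hand (the error e_y = y − a_y corresponds to squared loss):

import numpy as np

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0]), 1.0
W1, w2 = rng.normal(size=(3, 2)), rng.normal(size=3)

# Forward propagation: compute activations
a1 = W1 @ x      # hidden pre-activations
h1 = np.tanh(a1) # hidden activations
a_y = w2 @ h1    # network output

# Back propagation
e_y = y - a_y                # overall network error
g_w2 = -e_y * h1             # gradient of each hidden -> output edge
e1 = e_y * w2 * (1 - h1**2)  # "error" at each hidden node; tanh'(a) = 1 - tanh(a)^2
g_W1 = -np.outer(e1, x)      # gradient of each input -> hidden edge

# One gradient descent step on all weights
eta = 0.1
w2 -= eta * g_w2
W1 -= eta * g_W1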
What are we doing with these gradients again?
Gradient descent (same figure as before)
Neural Networks! If you’re interested in learning more…