Lecture 5: Linear Models


  1. Lecture 5 Linear Models Lin ZHANG, PhD School of Software Engineering Tongji University Fall 2020 Lin ZHANG, SSE, 2020

  2. Outline • Linear model – Linear regression – Logistic regression – Softmax regression

  3. Linear regression
     • Our goal in linear regression is to predict a continuous target value $y$ from a vector of input values $x \in \mathbb{R}^d$; we use a linear function $h$ as the model.
     • At the training stage, we aim to find $h(x)$ so that $h(x_i) \approx y_i$ for each training sample $(x_i, y_i)$.
     • We suppose that $h$ is a linear function, so
       $$h_{(\theta, b)}(x) = \theta^T x + b, \quad \theta \in \mathbb{R}^{d \times 1}$$
       Rewriting it with $\theta' = \begin{bmatrix} \theta \\ b \end{bmatrix}$ and $x' = \begin{bmatrix} x \\ 1 \end{bmatrix}$ gives
       $$\theta^T x + b = {\theta'}^T x' \equiv h_{\theta'}(x')$$
       Later, we simply write
       $$h_\theta(x) = \theta^T x, \quad \theta \in \mathbb{R}^{(d+1) \times 1}, \; x \in \mathbb{R}^{(d+1) \times 1}$$
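The rewriting step on this slide (folding the bias $b$ into $\theta$ by appending a constant 1 to $x$) can be checked with a few lines of Python. This is only a sketch: the numeric values and the helper names `augment` and `dot` are illustrative assumptions, not part of the lecture.

```python
def augment(x):
    """Append a constant 1 so the bias b can be folded into theta."""
    return list(x) + [1.0]

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Illustrative values only (not from the lecture).
theta, b = [2.0, -1.0], 0.5
x = [3.0, 4.0]

theta_aug = theta + [b]              # theta' = [theta; b]
x_aug = augment(x)                   # x' = [x; 1]

h_explicit = dot(theta, x) + b       # theta^T x + b
h_augmented = dot(theta_aug, x_aug)  # theta'^T x'
```

Both expressions evaluate to the same number, which is why the later slides can drop $b$ and work with the $(d+1)$-dimensional vectors only.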

  4. Linear regression
     • Then, our task is to find a choice of $\theta$ so that $h_\theta(x_i)$ is as close as possible to $y_i$.
       The cost function can be written as
       $$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x_i - y_i \right)^2$$
       Then, the task at the training stage is to find
       $$\theta^* = \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x_i - y_i \right)^2$$
       For this special case, there is a closed-form optimal solution. Here we use a more general method, gradient descent.
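The slide mentions a closed-form solution without writing it out. The standard least-squares result is the normal equation $X^T X \theta = X^T y$; for a single scalar input plus a bias it reduces to the familiar slope and intercept formulas. The sketch below assumes that one-dimensional case, and the function name `fit_line_closed_form` and the data are hypothetical.

```python
def fit_line_closed_form(xs, ys):
    """Least-squares fit of y ~ w*x + b for scalar inputs.

    This is the normal-equation solution specialised to one input feature
    plus a bias term (an assumption made to keep the example small).
    """
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Noise-free samples of y = 2x + 1 (illustrative data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line_closed_form(xs, ys)
```

On this data the fit recovers the slope 2 and intercept 1 that generated the samples, which gives a reference value to compare gradient descent against.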

  5. Linear regression
     • Gradient descent
       – It is a first-order optimization algorithm.
       – To find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function at the current point.
       – One starts with a guess $\theta_0$ for a local minimum of $J(\theta)$ and considers the sequence
         $$\theta_{n+1} := \theta_n - \alpha \nabla_\theta J(\theta) \big|_{\theta = \theta_n}$$
         where $\alpha$ is called the learning rate.

  6. Linear regression • Gradient descent

  7. Linear regression • Gradient descent

  8. Linear regression
     • Gradient descent
       Repeat until convergence ($J(\theta)$ no longer decreases) {
           $\theta_{n+1} := \theta_n - \alpha \nabla_\theta J(\theta) \big|_{\theta = \theta_n}$
       }
       GD is a general optimization method; for a specific problem, the key step is computing the gradient.
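The loop on this slide can be sketched generically in Python. The stopping rule here (the largest parameter change falls below `tol`) is an assumption standing in for "J no longer decreases", and the toy objective, the function names, and the default values are all illustrative.

```python
def gradient_descent(grad, theta0, alpha=0.1, tol=1e-10, max_iters=10_000):
    """Iterate theta <- theta - alpha * grad(theta) until the step is tiny.

    grad  : function returning the gradient as a list, given theta as a list
    alpha : learning rate
    tol   : stop when no parameter moves by more than this amount
    """
    theta = list(theta0)
    for _ in range(max_iters):
        g = grad(theta)
        new_theta = [t - alpha * gi for t, gi in zip(theta, g)]
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta

# Toy objective J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2, minimised at (3, -1).
def grad_J(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

theta_star = gradient_descent(grad_J, [0.0, 0.0])
```

Only `grad_J` is problem-specific, which matches the slide's remark that computing the gradient is the key step for each new problem.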

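  9. Linear regression
     • Gradient of the cost function of linear regression
       $$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x_i - y_i \right)^2$$
       The gradient is
       $$\nabla_\theta J(\theta) = \begin{bmatrix} \dfrac{\partial J(\theta)}{\partial \theta_1} \\ \dfrac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \dfrac{\partial J(\theta)}{\partial \theta_{d+1}} \end{bmatrix}, \quad \text{where} \quad \frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_{ij}$$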
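One way to gain confidence in the gradient formula is to compare it with central finite differences of $J(\theta)$. The following sketch does that on a tiny made-up data set; the helper names and the numbers are assumptions for illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
    return 0.5 * sum((dot(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))

def gradient(theta, X, y):
    """dJ/dtheta_j = sum_i (theta^T x_i - y_i) * x_ij."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = dot(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return grad

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grad = []
    for j in range(len(theta)):
        plus = theta[:j] + [theta[j] + eps] + theta[j + 1:]
        minus = theta[:j] + [theta[j] - eps] + theta[j + 1:]
        grad.append((cost(plus, X, y) - cost(minus, X, y)) / (2 * eps))
    return grad

# Each row of X already carries the trailing 1 for the bias (illustrative data).
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [2.0, 3.0, 5.0]
theta = [0.5, -0.2]

analytic = gradient(theta, X, y)
numeric = numerical_gradient(theta, X, y)
```

The two lists should agree to several decimal places; a mismatch would point to an indexing or sign error in `gradient`.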

  10. Linear regression
      • Some variants of gradient descent
        – The ordinary gradient descent algorithm looks at every sample in the entire training set on every step; it is also called batch gradient descent.
        – Stochastic gradient descent (SGD) repeatedly runs through the training set, and each time we encounter a training sample, we update the parameters according to the gradient of the error with respect to that single training sample only.
          Repeat until convergence {
              for i = 1 to m (m is the number of training samples) {
                  $\theta_{n+1} := \theta_n - \alpha \left( \theta_n^T x_i - y_i \right) x_i$
              }
          }
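A minimal Python sketch of the SGD loop for linear regression follows. It visits the samples in their stored order, as in the pseudocode; the fixed number of passes (instead of a convergence test), the learning rate, and the data are assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_linear_regression(X, y, alpha=0.05, passes=500):
    """Per-sample updates: theta <- theta - alpha * (theta^T x_i - y_i) * x_i.

    A fixed number of passes over the data replaces the slide's
    'repeat until convergence' (an assumption for simplicity).
    """
    theta = [0.0] * len(X[0])
    for _ in range(passes):
        for xi, yi in zip(X, y):
            err = dot(theta, xi) - yi
            theta = [t - alpha * err * xij for t, xij in zip(theta, xi)]
    return theta

# Noise-free samples of y = 2x + 1; each row carries the trailing 1 for the bias.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = sgd_linear_regression(X, y)
```

Because the data here are exactly linear, the per-sample updates settle on the generating parameters. With noisy data and a constant learning rate, SGD typically keeps fluctuating around the minimum instead, so a decaying learning rate is commonly used.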

  11. Linear regression
      • Some variants of gradient descent
        – Minibatch SGD: it works identically to SGD, except that it uses more than one training sample to make each estimate of the gradient.

  12. Linear regression
      • More concepts
        – The $m$ training samples can be divided into $N$ minibatches.
        – When the training has swept through all the minibatches, we say that one epoch of the training process is complete; a typical training process usually requires several epochs.

        epochs = 10;
        numMiniBatches = N;
        epochIndex = 0;
        while epochIndex < epochs && not convergent
        {
            for minibatchIndex = 1 to numMiniBatches
            {
                update the model parameters based on this minibatch
            }
            epochIndex = epochIndex + 1;
        }
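The epoch and minibatch structure can be sketched in Python for the linear regression cost used earlier. The convergence test (change in cost between epochs below `tol`), the batch size, the learning rate, and the data are assumptions; the slides do not specify them.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cost(theta, X, y):
    return 0.5 * sum((dot(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))

def train_minibatch(X, y, alpha=0.02, batch_size=2, epochs=2000, tol=1e-12):
    """Minibatch gradient descent with an epoch limit and a convergence test."""
    theta = [0.0] * len(X[0])
    history = []            # cost after each completed epoch
    epoch_index = 0
    converged = False
    while epoch_index < epochs and not converged:
        for start in range(0, len(X), batch_size):
            batch = range(start, min(start + batch_size, len(X)))
            # Gradient estimated from the samples in this minibatch only.
            grad = [0.0] * len(theta)
            for i in batch:
                err = dot(theta, X[i]) - y[i]
                for j, xij in enumerate(X[i]):
                    grad[j] += err * xij
            theta = [t - alpha * g for t, g in zip(theta, grad)]
        history.append(cost(theta, X, y))
        converged = len(history) > 1 and abs(history[-2] - history[-1]) < tol
        epoch_index += 1
    return theta, history

# Noise-free samples of y = 2x + 1; N = 2 minibatches of 2 samples each.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta, history = train_minibatch(X, y)
```

`len(history)` reports how many epochs were actually run before the convergence test fired.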

  13. Outline • Linear model – Linear regression – Logistic regression – Softmax regression

  14. Logistic regression
      • Logistic regression is used for binary classification.
      • It squeezes the linear regression output $\theta^T x$ into the range $(0, 1)$; thus the prediction result can be interpreted as a probability.
      • At the testing stage:
        The probability that the testing sample $x$ is positive is represented as
        $$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}$$
        The probability that the testing sample $x$ is negative is represented as $1 - h_\theta(x)$.
        The function $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$ is called the sigmoid or logistic function.
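As a small illustration, the sigmoid and the resulting class probabilities can be computed as below. The two-branch form of `sigmoid` is a common trick to avoid overflow in `exp` and is an implementation choice, not something stated on the slide; the parameter values are made up.

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """h_theta(x): probability that x is positive (x includes the trailing 1)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

theta = [1.5, -0.5]              # illustrative parameters
x = [2.0, 1.0]                   # theta^T x = 2.5
p_pos = predict_proba(theta, x)  # probability of the positive class
p_neg = 1.0 - p_pos              # probability of the negative class
```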

  15. Logistic regression
      One property of the sigmoid function:
      $$\sigma'(z) = \sigma(z) \left( 1 - \sigma(z) \right)$$
      Can you verify?
      [Figure: the shape of the sigmoid function]
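One possible way to verify the property is a direct application of the chain rule to $\sigma(z) = (1 + e^{-z})^{-1}$, using the fact that $1 - \sigma(z) = e^{-z} / (1 + e^{-z})$:

```latex
\begin{aligned}
\sigma'(z) &= \frac{d}{dz}\left(1 + e^{-z}\right)^{-1}
            = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} \\
           &= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
            = \sigma(z)\left(1 - \sigma(z)\right)
\end{aligned}
```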

  16. Logistic regression
      • The hypothesis model can be written neatly as
        $$P(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}$$
      • Our goal is to search for a value $\theta$ so that $h_\theta(x)$ is large when $x$ belongs to the "1" class and small when $x$ belongs to the "0" class.
        Thus, given a training set with binary labels $\{ (x_i, y_i) : i = 1, \ldots, m \}$, we want to maximize
        $$\prod_{i=1}^{m} \left( h_\theta(x_i) \right)^{y_i} \left( 1 - h_\theta(x_i) \right)^{1 - y_i}$$
        which is equivalent to maximizing
        $$\sum_{i=1}^{m} y_i \log\left( h_\theta(x_i) \right) + (1 - y_i) \log\left( 1 - h_\theta(x_i) \right)$$
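The equivalence between the product form and the sum-of-logs form can be seen numerically: the exponential of the log-likelihood equals the likelihood. The sketch below uses hypothetical model outputs `probs` in place of $h_\theta(x_i)$.

```python
import math

def likelihood(probs, labels):
    """Product over samples of h^y * (1 - h)^(1 - y)."""
    result = 1.0
    for h, y in zip(probs, labels):
        result *= (h ** y) * ((1.0 - h) ** (1 - y))
    return result

def log_likelihood(probs, labels):
    """Sum over samples of y*log(h) + (1 - y)*log(1 - h)."""
    return sum(y * math.log(h) + (1 - y) * math.log(1.0 - h)
               for h, y in zip(probs, labels))

probs = [0.9, 0.2, 0.7]   # hypothetical values of h_theta(x_i)
labels = [1, 0, 1]

L = likelihood(probs, labels)       # 0.9 * 0.8 * 0.7
ll = log_likelihood(probs, labels)  # log of the same quantity
```

Since the logarithm is monotonically increasing, the same $\theta$ maximizes both; the sum form is usually preferred because products of many probabilities underflow quickly.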

  17. Logistic regression
      • Thus, the cost function for logistic regression (which we want to minimize) is
        $$J(\theta) = -\sum_{i=1}^{m} \left[ y_i \log\left( h_\theta(x_i) \right) + (1 - y_i) \log\left( 1 - h_\theta(x_i) \right) \right]$$
        To solve it with gradient descent, the gradient needs to be computed:
        $$\nabla_\theta J(\theta) = \sum_{i=1}^{m} x_i \left( h_\theta(x_i) - y_i \right)$$
        Assignment!
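Putting the cost and gradient together, a batch gradient descent trainer for logistic regression might look like the following sketch. The data set, learning rate, and fixed iteration count are assumptions chosen so that the example runs quickly; it is not the lecture's reference solution.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def h(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x includes the trailing 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    """J(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(theta, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def gradient(theta, X, y):
    """grad J(theta) = sum_i x_i * (h(x_i) - y_i)."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = h(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return grad

def train(X, y, alpha=0.05, iters=5000):
    """Batch gradient descent with a fixed iteration budget (an assumption)."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(theta, X, y)
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta

# One feature plus bias; the classes overlap, so the optimum is finite.
X = [[0.5, 1.0], [1.0, 1.0], [1.5, 1.0], [3.0, 1.0], [3.5, 1.0], [4.0, 1.0]]
y = [0, 0, 1, 0, 1, 1]
theta0 = [0.0, 0.0]
theta = train(X, y)
```

Comparing `cost(theta, X, y)` with `cost(theta0, X, y)` shows how much training reduced the cost.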

  18. Logistic regression • Exercise – Use logistic regression to perform digit classification

  19. Outline • Linear model – Linear regression – Logistic regression – Softmax regression

  20. Softmax regression
      • Softmax operation
        – It squashes a $K$-dimensional vector $z$ of arbitrary real values to a $K$-dimensional vector $\sigma(z)$ of real values in the range $(0, 1)$. The function is given by
          $$\sigma(z)_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}$$
        – Since the components of the vector $\sigma(z)$ sum to one and are all strictly between 0 and 1, they represent a categorical probability distribution.
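A short Python sketch of the softmax operation is shown below. Subtracting `max(z)` before exponentiating is a standard numerical-stability trick that leaves the result mathematically unchanged; it is an implementation choice, not part of the slide.

```python
import math

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), computed with a max shift."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])   # illustrative input
```

The returned values are positive and sum to one, and the largest input receives the largest probability.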

  21. Softmax regression
      • For multiclass classification, given a test input $x$, we want our hypothesis to estimate $p(y = k \mid x)$ for each value $k = 1, 2, \ldots, K$.

  22. Softmax regression
      • The hypothesis should output a $K$-dimensional vector giving us $K$ estimated probabilities. It takes the form
        $$h_\phi(x) = \begin{bmatrix} p(y = 1 \mid x; \phi) \\ p(y = 2 \mid x; \phi) \\ \vdots \\ p(y = K \mid x; \phi) \end{bmatrix} = \frac{1}{\sum_{j=1}^{K} \exp\left( \theta_j^T x \right)} \begin{bmatrix} \exp\left( \theta_1^T x \right) \\ \exp\left( \theta_2^T x \right) \\ \vdots \\ \exp\left( \theta_K^T x \right) \end{bmatrix}$$
        where $\phi = [\theta_1, \theta_2, \ldots, \theta_K] \in \mathbb{R}^{(d+1) \times K}$.
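The hypothesis can be sketched in Python by computing one score $\theta_k^T x$ per class and passing the scores through softmax. Here `phi` is stored as a list of the $K$ vectors $\theta_k$ (the columns of the slide's matrix), which is a representation chosen for this example; the parameter values are made up.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def hypothesis(phi, x):
    """h_phi(x): list of K probabilities p(y = k | x; phi).

    phi is a list of K parameter vectors theta_k; x includes the trailing 1.
    """
    scores = [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in phi]
    return softmax(scores)

# K = 3 classes, d + 1 = 3 parameters per class (illustrative values).
phi = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0]]
x = [2.0, 1.0, 1.0]               # scores are [2, 1, 0]

probs = hypothesis(phi, x)
# Classes are numbered 1..K as on the slides, hence the + 1.
predicted = max(range(len(probs)), key=probs.__getitem__) + 1
```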

  23. Softmax regression
      • In softmax regression, for each training sample we have
        $$p(y_i = k \mid x_i; \phi) = \frac{\exp\left( \theta_k^T x_i \right)}{\sum_{j=1}^{K} \exp\left( \theta_j^T x_i \right)}$$
        At the training stage, we want to maximize $p(y_i = k \mid x_i; \phi)$ for each training sample, where $k$ is the correct label.

  24. Softmax regression
      • Cost function for softmax regression
        $$J(\phi) = -\sum_{i=1}^{m} \sum_{k=1}^{K} 1\{ y_i = k \} \log \frac{\exp\left( \theta_k^T x_i \right)}{\sum_{j=1}^{K} \exp\left( \theta_j^T x_i \right)}$$
        where $1\{\cdot\}$ is an indicator function.
      • Gradient of the cost function
        $$\nabla_{\theta_k} J(\phi) = -\sum_{i=1}^{m} \left[ x_i \left( 1\{ y_i = k \} - p(y_i = k \mid x_i; \phi) \right) \right]$$
        Can you verify?
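One way to approach "Can you verify?" numerically is to compare the stated gradient with central finite differences of $J(\phi)$. The sketch below does so on a small made-up data set; `phi` is stored as a list of the $K$ vectors $\theta_k$, labels run from 1 to $K$, and all names and values are assumptions for illustration.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def probabilities(phi, x):
    return softmax([sum(t * xi for t, xi in zip(th, x)) for th in phi])

def cost(phi, X, y):
    """J(phi) = -sum_i log p(y_i | x_i; phi); only the true class contributes."""
    return -sum(math.log(probabilities(phi, xi)[yi - 1]) for xi, yi in zip(X, y))

def gradient(phi, X, y):
    """Row k holds grad wrt theta_k = -sum_i x_i (1{y_i = k} - p(y_i = k | x_i))."""
    K, d = len(phi), len(phi[0])
    grad = [[0.0] * d for _ in range(K)]
    for xi, yi in zip(X, y):
        p = probabilities(phi, xi)
        for k in range(K):
            coeff = -((1.0 if yi == k + 1 else 0.0) - p[k])
            for j in range(d):
                grad[k][j] += coeff * xi[j]
    return grad

def numerical_gradient(phi, X, y, eps=1e-6):
    """Central finite differences over every entry of phi (sanity check only)."""
    grad = [[0.0] * len(phi[0]) for _ in phi]
    for k in range(len(phi)):
        for j in range(len(phi[0])):
            plus = [row[:] for row in phi]
            minus = [row[:] for row in phi]
            plus[k][j] += eps
            minus[k][j] -= eps
            grad[k][j] = (cost(plus, X, y) - cost(minus, X, y)) / (2 * eps)
    return grad

# K = 3 classes, two features plus the trailing 1 (illustrative data).
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0], [2.0, 0.5, 1.0]]
y = [1, 2, 3, 1]
phi = [[0.2, -0.1, 0.0], [-0.3, 0.4, 0.1], [0.05, 0.05, -0.2]]

analytic = gradient(phi, X, y)
numeric = numerical_gradient(phi, X, y)
```

A useful side check is that, for each parameter index, the gradient entries summed over the $K$ classes are zero, because the indicator terms and the probabilities each sum to one over $k$.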
