Lecture 5: Linear Models
Lin ZHANG, PhD
School of Software Engineering, Tongji University
Fall 2020
Outline
• Linear model
  – Linear regression
  – Logistic regression
  – Softmax regression
Linear regression
• Our goal in linear regression is to predict a continuous target value y from a vector of input values x ∈ R^d; we use a linear function h as the model
• At the training stage, we aim to find h(x) so that we have h(x_i) ≈ y_i for each training sample (x_i, y_i)
• We suppose that h is a linear function, so
  h(x; θ', b) = (θ')^T x + b, where θ' ∈ R^d and b ∈ R
  Rewrite it by absorbing the bias into the parameters: let θ = [θ'; b] and append a constant 1 to x, i.e., x ← [x; 1]; then
  h_θ(x) = θ^T x, θ ∈ R^{(d+1)×1}, x ∈ R^{(d+1)×1}
  Later, we simply use θ to denote this full parameter vector
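To make the bias-absorption step concrete, a minimal NumPy sketch (the data and parameter values are illustrative, not from the lecture):

import numpy as np

def augment(X):
    # append a constant 1 to every input so that the bias b is absorbed into theta
    return np.hstack([X, np.ones((X.shape[0], 1))])

def h(theta, X_aug):
    # linear hypothesis h_theta(x) = theta^T x, applied to every row of X_aug
    return X_aug @ theta

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # two samples with d = 2 features
theta = np.array([0.5, -0.2, 1.0])  # last entry plays the role of b
print(h(theta, augment(X)))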
Linear regression
• Then, our task is to find a choice of θ so that h_θ(x_i) is as close as possible to y_i
  The cost function can be written as
  J(θ) = (1/2) Σ_{i=1}^{m} (θ^T x_i − y_i)^2
  Then, the task at the training stage is to find
  θ* = argmin_θ (1/2) Σ_{i=1}^{m} (θ^T x_i − y_i)^2
  For this special case, there is a closed-form optimal solution
  Here we use a more general method, the gradient descent method
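The closed-form solution mentioned above is the ordinary least-squares (normal-equation) solution θ* = (X^T X)^{-1} X^T y, where X stacks the samples x_i^T as rows; a minimal sketch with illustrative toy data, using a least-squares routine instead of an explicit matrix inverse:

import numpy as np

# design matrix with the bias column already appended (one row per sample)
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([2.1, 3.9, 6.2])

# closed-form least-squares solution; lstsq avoids explicitly inverting X^T X
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_star)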
Linear regression
• Gradient descent
  – It is a first-order optimization algorithm
  – To find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function at the current point
  – One starts with a guess θ_0 for a local minimum of J(θ) and considers the sequence θ_0, θ_1, θ_2, … such that
    θ_{n+1} := θ_n − α ∇_θ J(θ)|_{θ=θ_n}
    where α is called the learning rate
Linear regression
• Gradient descent (illustrative figures)
Linear regression
• Gradient descent
  Repeat until convergence (i.e., J(θ) no longer decreases) {
    θ_{n+1} := θ_n − α ∇_θ J(θ)|_{θ=θ_n}
  }
  GD is a general optimization approach; for a specific problem, the key step is how to compute the gradient
Linear regression
• Gradient of the cost function of linear regression
  J(θ) = (1/2) Σ_{i=1}^{m} (θ^T x_i − y_i)^2
  The gradient is
  ∇_θ J(θ) = [∂J(θ)/∂θ_1, ∂J(θ)/∂θ_2, …, ∂J(θ)/∂θ_{d+1}]^T
  where ∂J(θ)/∂θ_j = Σ_{i=1}^{m} (h_θ(x_i) − y_i) x_{ij}
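Putting the update rule and this gradient together, batch gradient descent for linear regression can be sketched as follows (a minimal NumPy version; the learning rate, tolerance, and toy data are illustrative):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, max_iters=1000, tol=1e-8):
    # Batch GD for J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2
    # X: (m, d+1) design matrix with bias column, y: (m,) targets
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for _ in range(max_iters):
        residual = X @ theta - y            # h_theta(x_i) - y_i for all i
        grad = X.T @ residual               # gradient from the formula above
        theta -= alpha * grad
        cost = 0.5 * np.sum(residual ** 2)
        if prev_cost - cost < tol:          # J(theta) no longer decreases
            break
        prev_cost = cost
    return theta

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([2.0, 4.0, 6.0])
print(batch_gradient_descent(X, y))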
Linear regression
• Some variants of gradient descent
  – The ordinary gradient descent algorithm looks at every sample in the entire training set on every step; it is also called batch gradient descent
  – Stochastic gradient descent (SGD) repeatedly runs through the training set, and each time we encounter a training sample, we update the parameters according to the gradient of the error w.r.t. that single training sample only
  Repeat until convergence {
    for i = 1 to m (m is the number of training samples) {
      θ_{n+1} := θ_n − α (θ_n^T x_i − y_i) x_i
    }
  }
Linear regression
• Some variants of gradient descent (continued)
  – Minibatch SGD: it works identically to SGD, except that it uses more than one training sample to make each estimate of the gradient
Linear regression
• More concepts
  – The m training samples can be divided into N minibatches
  – When the training sweeps all the batches, we say we complete one epoch of the training process; for a typical training process, several epochs are usually required
  epochs = 10; numMiniBatches = N;
  while epochIndex < epochs && not convergent {
    for minibatchIndex = 1 to numMiniBatches {
      update the model parameters based on this minibatch
    }
  }
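A minimal sketch of the epoch/minibatch loop above for linear regression (shuffling the samples each epoch is a common choice rather than part of the pseudocode; names and toy data are illustrative):

import numpy as np

def minibatch_sgd(X, y, alpha=0.01, batch_size=2, epochs=10):
    # Minibatch SGD; one epoch = one sweep over all minibatches
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)              # shuffle samples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb)     # gradient on this minibatch only
            theta -= alpha * grad
    return theta

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(minibatch_sgd(X, y))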
Outline
• Linear model
  – Linear regression
  – Logistic regression
  – Softmax regression
Logistic regression
• Logistic regression is used for binary classification
• It squeezes the linear regression output θ^T x into the range (0, 1); thus the prediction result can be interpreted as a probability
• At the testing stage
  The probability that the testing sample x is positive is represented as h_θ(x) = 1 / (1 + exp(−θ^T x))
  The probability that the testing sample x is negative is represented as 1 − h_θ(x)
  The function σ(z) = 1 / (1 + exp(−z)) is called the sigmoid or logistic function
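A minimal sketch of the sigmoid and of the positive/negative-class probabilities at the testing stage (the parameter values and the sample are illustrative):

import numpy as np

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # probability that sample x (with the constant 1 appended) is positive
    return sigmoid(theta @ x)

theta = np.array([1.5, -0.5, 0.2])     # illustrative parameters
x = np.array([2.0, 1.0, 1.0])          # last entry is the appended 1
p_pos = predict_proba(theta, x)
print(p_pos, 1.0 - p_pos)              # P(positive), P(negative)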
Logistic regression
• One property of the sigmoid function: σ'(z) = σ(z)(1 − σ(z)). Can you verify?
• The shape of the sigmoid function (figure)
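One way to verify the property, for reference:
  σ'(z) = d/dz [ (1 + exp(−z))^{−1} ]
        = exp(−z) / (1 + exp(−z))^2
        = [ 1 / (1 + exp(−z)) ] × [ exp(−z) / (1 + exp(−z)) ]
        = σ(z) (1 − σ(z)),
since exp(−z) / (1 + exp(−z)) = 1 − 1 / (1 + exp(−z)) = 1 − σ(z).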
Logistic regression
• The hypothesis model can be written neatly as
  P(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y}
• Our goal is to search for a value of θ so that h_θ(x) is large when x belongs to the "1" class and small when x belongs to the "0" class
  Thus, given a training set {(x_i, y_i): i = 1, …, m} with binary labels, we want to maximize
  ∏_{i=1}^{m} (h_θ(x_i))^{y_i} (1 − h_θ(x_i))^{1−y_i}
  which is equivalent to maximizing
  Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
Logistic regression
• Thus, the cost function for logistic regression (which we want to minimize) is
  J(θ) = − Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
  To solve it with gradient descent, the gradient needs to be computed:
  ∇_θ J(θ) = Σ_{i=1}^{m} x_i (h_θ(x_i) − y_i)
  Assignment!
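Since deriving this gradient is left as an assignment, a practical way to check a derived gradient is to compare it against central finite differences; a minimal sketch (all names and toy data are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # logistic-regression cost J(theta) from the slide
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    # gradient formula from the slide: sum_i x_i (h_theta(x_i) - y_i)
    return X.T @ (sigmoid(X @ theta) - y)

def numeric_grad(theta, X, y, eps=1e-6):
    # central finite differences, useful for checking a derived gradient
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (cost(theta + e, X, y) - cost(theta - e, X, y)) / (2 * eps)
    return g

X = np.array([[0.5, 1.0], [1.5, 1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0])
theta = np.array([0.1, -0.2])
print(np.max(np.abs(analytic_grad(theta, X, y) - numeric_grad(theta, X, y))))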
Logistic regression
• Exercise
  – Use logistic regression to perform digit classification
Outline
• Linear model
  – Linear regression
  – Logistic regression
  – Softmax regression
Softmax regression
• Softmax operation
  – It squashes a K-dimensional vector z of arbitrary real values into a K-dimensional vector σ(z) of real values in the range (0, 1). The function is given by
    σ(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k)
  – Since the components of the vector σ(z) sum to one and are all strictly between 0 and 1, they represent a categorical probability distribution
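A minimal sketch of the softmax operation (subtracting max(z) before exponentiating is a common numerical-stability trick that does not change the result; it is not part of the definition above):

import numpy as np

def softmax(z):
    # map a K-dimensional vector to a categorical distribution
    z = z - np.max(z)          # stability shift, result unchanged
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())              # components lie in (0, 1) and sum to 1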
Softmax regression
• For multiclass classification, given a test input x, we want our hypothesis to estimate p(y = k | x) for each value k = 1, 2, …, K
Softmax regression
• The hypothesis should output a K-dimensional vector giving us K estimated probabilities. It takes the form
  h_φ(x) = [ p(y = 1 | x; φ), p(y = 2 | x; φ), …, p(y = K | x; φ) ]^T
         = (1 / Σ_{j=1}^{K} exp(θ_j^T x)) [ exp(θ_1^T x), exp(θ_2^T x), …, exp(θ_K^T x) ]^T
  where φ = [θ_1, θ_2, …, θ_K] ∈ R^{(d+1)×K}
Softmax regression
• In softmax regression, for each training sample we have
  p(y_i = k | x_i; φ) = exp(θ_k^T x_i) / Σ_{j=1}^{K} exp(θ_j^T x_i)
  At the training stage, for each training sample we want to maximize p(y_i = k | x_i; φ) for the correct label k
Softmax regression
• Cost function for softmax regression
  J(φ) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y_i = k} log( exp(θ_k^T x_i) / Σ_{j=1}^{K} exp(θ_j^T x_i) )
  where 1{·} is an indicator function
• Gradient of the cost function
  ∇_{θ_k} J(φ) = − Σ_{i=1}^{m} x_i ( 1{y_i = k} − p(y_i = k | x_i; φ) )
  Can you verify?
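A minimal sketch of this cost and gradient (labels are taken as 0-indexed integers here, whereas the slides use k = 1, …, K; the parameter layout and toy data are illustrative):

import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # stability shift, result unchanged
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost_and_grad(Phi, X, y, K):
    # J(phi) and its gradient for softmax regression
    # Phi: (d+1, K) parameters, X: (m, d+1) inputs with bias column,
    # y: (m,) integer labels in {0, ..., K-1}
    P = softmax_rows(X @ Phi)                # P[i, k] = p(y_i = k | x_i; phi)
    m = X.shape[0]
    Y = np.zeros((m, K))
    Y[np.arange(m), y] = 1.0                 # indicator 1{y_i = k}
    J = -np.sum(Y * np.log(P))
    grad = -X.T @ (Y - P)                    # column k is the gradient w.r.t. theta_k
    return J, grad

X = np.array([[0.2, 1.0], [1.5, 1.0], [-0.7, 1.0]])
y = np.array([0, 2, 1])
Phi = np.zeros((2, 3))
print(cost_and_grad(Phi, X, y, K=3))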