Lecture 5: Linear Models
Lin ZHANG, PhD
School of Software Engineering, Tongji University
Fall 2020
Outline
• Linear model
  – Linear regression
  – Logistic regression
  – Softmax regression
Linear regression
• Our goal in linear regression is to predict a continuous target value y from a vector of input values x ∈ R^d; we use a linear function h as the model
• At the training stage, we aim to find h(x) so that we have h(x_i) ≈ y_i for each training sample (x_i, y_i)
• We suppose that h is a linear function, so
  h(x; θ', b) = (θ')^T x + b, where θ' ∈ R^d and b ∈ R
  Rewrite it by absorbing the bias into the parameters: let θ = [θ'; b] and append a constant 1 to x, i.e., x ← [x; 1]; then
  h_θ(x) = θ^T x, θ ∈ R^{(d+1)×1}, x ∈ R^{(d+1)×1}
  Later, we simply use θ to denote this full parameter vector
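To make the bias-absorption step concrete, a minimal NumPy sketch (the data and parameter values are illustrative, not from the lecture):

import numpy as np

def augment(X):
    # append a constant 1 to every input so that the bias b is absorbed into theta
    return np.hstack([X, np.ones((X.shape[0], 1))])

def h(theta, X_aug):
    # linear hypothesis h_theta(x) = theta^T x, applied to every row of X_aug
    return X_aug @ theta

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # two samples with d = 2 features
theta = np.array([0.5, -0.2, 1.0])  # last entry plays the role of b
print(h(theta, augment(X)))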
Linear regression
• Then, our task is to find a choice of θ so that h_θ(x_i) is as close as possible to y_i
  The cost function can be written as
  J(θ) = (1/2) Σ_{i=1}^{m} (θ^T x_i − y_i)^2
  Then, the task at the training stage is to find
  θ* = argmin_θ (1/2) Σ_{i=1}^{m} (θ^T x_i − y_i)^2
  For this special case, there is a closed-form optimal solution
  Here we use a more general method, the gradient descent method
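The closed-form solution mentioned above is the ordinary least-squares (normal-equation) solution θ* = (X^T X)^{-1} X^T y, where X stacks the samples x_i^T as rows; a minimal sketch with illustrative toy data, using a least-squares routine instead of an explicit matrix inverse:

import numpy as np

# design matrix with the bias column already appended (one row per sample)
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([2.1, 3.9, 6.2])

# closed-form least-squares solution; lstsq avoids explicitly inverting X^T X
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_star)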
Linear regression
• Gradient descent
  – It is a first-order optimization algorithm
  – To find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function at the current point
  – One starts with a guess θ_0 for a local minimum of J(θ) and considers the sequence θ_0, θ_1, θ_2, … such that
    θ_{n+1} := θ_n − α ∇_θ J(θ)|_{θ=θ_n}
    where α is called the learning rate
Linear regression
• Gradient descent (illustrative figures)
Linear regression
• Gradient descent
  Repeat until convergence (i.e., J(θ) no longer decreases) {
    θ_{n+1} := θ_n − α ∇_θ J(θ)|_{θ=θ_n}
  }
  GD is a general optimization approach; for a specific problem, the key step is how to compute the gradient
Linear regression
• Gradient of the cost function of linear regression
  J(θ) = (1/2) Σ_{i=1}^{m} (θ^T x_i − y_i)^2
  The gradient is
  ∇_θ J(θ) = [∂J(θ)/∂θ_1, ∂J(θ)/∂θ_2, …, ∂J(θ)/∂θ_{d+1}]^T
  where ∂J(θ)/∂θ_j = Σ_{i=1}^{m} (h_θ(x_i) − y_i) x_{ij}
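Putting the update rule and this gradient together, batch gradient descent for linear regression can be sketched as follows (a minimal NumPy version; the learning rate, tolerance, and toy data are illustrative):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, max_iters=1000, tol=1e-8):
    # Batch GD for J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2
    # X: (m, d+1) design matrix with bias column, y: (m,) targets
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for _ in range(max_iters):
        residual = X @ theta - y            # h_theta(x_i) - y_i for all i
        grad = X.T @ residual               # gradient from the formula above
        theta -= alpha * grad
        cost = 0.5 * np.sum(residual ** 2)
        if prev_cost - cost < tol:          # J(theta) no longer decreases
            break
        prev_cost = cost
    return theta

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([2.0, 4.0, 6.0])
print(batch_gradient_descent(X, y))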
Linear regression
• Some variants of gradient descent
  – The ordinary gradient descent algorithm looks at every sample in the entire training set on every step; it is also called batch gradient descent
  – Stochastic gradient descent (SGD) repeatedly runs through the training set, and each time we encounter a training sample, we update the parameters according to the gradient of the error w.r.t. that single training sample only
  Repeat until convergence {
    for i = 1 to m (m is the number of training samples) {
      θ_{n+1} := θ_n − α (θ_n^T x_i − y_i) x_i
    }
  }
Linear regression
• Some variants of gradient descent (continued)
  – Minibatch SGD: it works identically to SGD, except that it uses more than one training sample to make each estimate of the gradient
Linear regression
• More concepts
  – The m training samples can be divided into N minibatches
  – When the training sweeps all the batches, we say we complete one epoch of the training process; for a typical training process, several epochs are usually required
  epochs = 10; numMiniBatches = N;
  while epochIndex < epochs && not convergent {
    for minibatchIndex = 1 to numMiniBatches {
      update the model parameters based on this minibatch
    }
  }
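A minimal sketch of the epoch/minibatch loop above for linear regression (shuffling the samples each epoch is a common choice rather than part of the pseudocode; names and toy data are illustrative):

import numpy as np

def minibatch_sgd(X, y, alpha=0.01, batch_size=2, epochs=10):
    # Minibatch SGD; one epoch = one sweep over all minibatches
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)              # shuffle samples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb)     # gradient on this minibatch only
            theta -= alpha * grad
    return theta

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(minibatch_sgd(X, y))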
Outline
• Linear model
  – Linear regression
  – Logistic regression
  – Softmax regression
Logistic regression
• Logistic regression is used for binary classification
• It squeezes the linear regression output θ^T x into the range (0, 1); thus the prediction result can be interpreted as a probability
• At the testing stage
  The probability that the testing sample x is positive is represented as h_θ(x) = 1 / (1 + exp(−θ^T x))
  The probability that the testing sample x is negative is represented as 1 − h_θ(x)
  The function σ(z) = 1 / (1 + exp(−z)) is called the sigmoid or logistic function
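A minimal sketch of the sigmoid and of the positive/negative-class probabilities at the testing stage (the parameter values and the sample are illustrative):

import numpy as np

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # probability that sample x (with the constant 1 appended) is positive
    return sigmoid(theta @ x)

theta = np.array([1.5, -0.5, 0.2])     # illustrative parameters
x = np.array([2.0, 1.0, 1.0])          # last entry is the appended 1
p_pos = predict_proba(theta, x)
print(p_pos, 1.0 - p_pos)              # P(positive), P(negative)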
Logistic regression
• One property of the sigmoid function: σ'(z) = σ(z)(1 − σ(z)). Can you verify?
• The shape of the sigmoid function (figure)
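One way to verify the property, for reference:
  σ'(z) = d/dz [ (1 + exp(−z))^{−1} ]
        = exp(−z) / (1 + exp(−z))^2
        = [ 1 / (1 + exp(−z)) ] × [ exp(−z) / (1 + exp(−z)) ]
        = σ(z) (1 − σ(z)),
since exp(−z) / (1 + exp(−z)) = 1 − 1 / (1 + exp(−z)) = 1 − σ(z).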
Logistic regression
• The hypothesis model can be written neatly as
  P(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y}
• Our goal is to search for a value of θ so that h_θ(x) is large when x belongs to the "1" class and small when x belongs to the "0" class
  Thus, given a training set {(x_i, y_i): i = 1, …, m} with binary labels, we want to maximize
  ∏_{i=1}^{m} (h_θ(x_i))^{y_i} (1 − h_θ(x_i))^{1−y_i}
  which is equivalent to maximizing
  Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
Logistic regression
• Thus, the cost function for logistic regression (which we want to minimize) is
  J(θ) = − Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
  To solve it with gradient descent, the gradient needs to be computed:
  ∇_θ J(θ) = Σ_{i=1}^{m} x_i (h_θ(x_i) − y_i)
  Assignment!
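Since deriving this gradient is left as an assignment, a practical way to check a derived gradient is to compare it against central finite differences; a minimal sketch (all names and toy data are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # logistic-regression cost J(theta) from the slide
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    # gradient formula from the slide: sum_i x_i (h_theta(x_i) - y_i)
    return X.T @ (sigmoid(X @ theta) - y)

def numeric_grad(theta, X, y, eps=1e-6):
    # central finite differences, useful for checking a derived gradient
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (cost(theta + e, X, y) - cost(theta - e, X, y)) / (2 * eps)
    return g

X = np.array([[0.5, 1.0], [1.5, 1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0])
theta = np.array([0.1, -0.2])
print(np.max(np.abs(analytic_grad(theta, X, y) - numeric_grad(theta, X, y))))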
Logistic regression
• Exercise
  – Use logistic regression to perform digit classification
Outline
• Linear model
  – Linear regression
  – Logistic regression
  – Softmax regression
Softmax regression
• Softmax operation
  – It squashes a K-dimensional vector z of arbitrary real values into a K-dimensional vector σ(z) of real values in the range (0, 1). The function is given by
    σ(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k)
  – Since the components of the vector σ(z) sum to one and are all strictly between 0 and 1, they represent a categorical probability distribution
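A minimal sketch of the softmax operation (subtracting max(z) before exponentiating is a common numerical-stability trick that does not change the result; it is not part of the definition above):

import numpy as np

def softmax(z):
    # map a K-dimensional vector to a categorical distribution
    z = z - np.max(z)          # stability shift, result unchanged
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())              # components lie in (0, 1) and sum to 1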
Softmax regression
• For multiclass classification, given a test input x, we want our hypothesis to estimate p(y = k | x) for each value k = 1, 2, …, K
Softmax regression
• The hypothesis should output a K-dimensional vector giving us K estimated probabilities. It takes the form
  h_φ(x) = [ p(y = 1 | x; φ), p(y = 2 | x; φ), …, p(y = K | x; φ) ]^T
         = (1 / Σ_{j=1}^{K} exp(θ_j^T x)) [ exp(θ_1^T x), exp(θ_2^T x), …, exp(θ_K^T x) ]^T
  where φ = [θ_1, θ_2, …, θ_K] ∈ R^{(d+1)×K}
Softmax regression
• In softmax regression, for each training sample we have
  p(y_i = k | x_i; φ) = exp(θ_k^T x_i) / Σ_{j=1}^{K} exp(θ_j^T x_i)
  At the training stage, for each training sample we want to maximize p(y_i = k | x_i; φ) for the correct label k
Softmax regression
• Cost function for softmax regression
  J(φ) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y_i = k} log( exp(θ_k^T x_i) / Σ_{j=1}^{K} exp(θ_j^T x_i) )
  where 1{·} is an indicator function
• Gradient of the cost function
  ∇_{θ_k} J(φ) = − Σ_{i=1}^{m} x_i ( 1{y_i = k} − p(y_i = k | x_i; φ) )
  Can you verify?
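A minimal sketch of this cost and gradient (labels are taken as 0-indexed integers here, whereas the slides use k = 1, …, K; the parameter layout and toy data are illustrative):

import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # stability shift, result unchanged
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost_and_grad(Phi, X, y, K):
    # J(phi) and its gradient for softmax regression
    # Phi: (d+1, K) parameters, X: (m, d+1) inputs with bias column,
    # y: (m,) integer labels in {0, ..., K-1}
    P = softmax_rows(X @ Phi)                # P[i, k] = p(y_i = k | x_i; phi)
    m = X.shape[0]
    Y = np.zeros((m, K))
    Y[np.arange(m), y] = 1.0                 # indicator 1{y_i = k}
    J = -np.sum(Y * np.log(P))
    grad = -X.T @ (Y - P)                    # column k is the gradient w.r.t. theta_k
    return J, grad

X = np.array([[0.2, 1.0], [1.5, 1.0], [-0.7, 1.0]])
y = np.array([0, 2, 1])
Phi = np.zeros((2, 3))
print(cost_and_grad(Phi, X, y, K=3))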