Foundations of Machine Learning CentraleSupélec — Fall 2017 6. Linear & logistic regressions Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
Learning objectives ● Density estimation: – Define parametric methods. – Define the maximum likelihood estimator and compute it for Bernoulli, multinomial and Gaussian densities. – Define the Bayes estimator and compute it for normal priors. ● Supervised learning: – Compute the maximum likelihood estimator / least-squares fit solution for linear regression. – Compute the maximum likelihood estimator for logistic regression.
Density estimation
Parametric methods ● Parametric estimation: – assume a parametric form for p(x|θ), e.g. a Gaussian with θ = (μ, σ²) – Goal: estimate θ using the sample X – usually assume that the samples are independent and identically distributed (iid)
Maximum likelihood estimation ● Find θ such that X is the most likely to have been drawn. ● Likelihood of θ given the i.i.d. sample X = {x₁, …, x_N}: L(θ | X) = ∏ᵢ p(xᵢ | θ) ● Log likelihood: ℓ(θ | X) = ∑ᵢ log p(xᵢ | θ) ● Maximum likelihood estimation (MLE): θ_MLE = argmax_θ ℓ(θ | X)
Bernoulli density ● Two states: failure (x = 0) / success (x = 1), P(x) = p₀ˣ (1 − p₀)^(1−x) ● MLE estimate of p₀?
Bernoulli density ● Two states: failure / success ● Log likelihood: ℓ(p₀ | X) = ∑ᵢ [xᵢ log p₀ + (1 − xᵢ) log(1 − p₀)] ● Maximize the likelihood: set its gradient to 0, which gives the MLE estimate p̂₀ = (1/N) ∑ᵢ xᵢ (the proportion of successes in the sample).
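A minimal numpy sketch of this estimate (the function name and the toy sample are illustrative, not from the course material):

import numpy as np

def bernoulli_mle(x):
    # MLE of the success probability p0 for 0/1 samples: the sample mean
    x = np.asarray(x)
    return x.mean()

# toy sample: 7 successes out of 10 draws -> p0_hat = 0.7
print(bernoulli_mle([1, 1, 0, 1, 1, 1, 0, 1, 0, 1]))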
Multinomial density ● Consider K mutually exclusive and exhaustive classes – Each class occurs with probability p_k – x₁, x₂, …, x_K indicator variables: x_k = 1 if the outcome is class k and 0 otherwise ● The MLE of p_k is the proportion of samples in class k: p̂_k = N_k / N, where N_k is the number of samples falling in class k.
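A short sketch under the same assumptions (the function name and toy labels are mine; class counts come from numpy's bincount):

import numpy as np

def multinomial_mle(labels, K):
    # MLE of p_k for K mutually exclusive classes: the class proportions N_k / N
    counts = np.bincount(np.asarray(labels), minlength=K)
    return counts / counts.sum()

# toy sample with K = 3 classes
print(multinomial_mle([0, 2, 1, 1, 2, 2, 0, 2], K=3))  # [0.25 0.25 0.5]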
Gaussian distribution ● Gaussian distribution = normal distribution: p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)) ● Compute the MLE estimates of μ and σ.
Gaussian distribution ● MLE estimates: μ̂ = (1/N) ∑ᵢ xᵢ (the sample mean) and σ̂² = (1/N) ∑ᵢ (xᵢ − μ̂)² (the sample variance, normalized by N).
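A numpy sketch of these two estimates (function and variable names are mine, not the course's):

import numpy as np

def gaussian_mle(x):
    # MLE of (mu, sigma^2): sample mean and the biased variance (normalized by N, not N-1)
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    var_hat = np.mean((x - mu_hat) ** 2)  # same as x.var(ddof=0)
    return mu_hat, var_hat

rng = np.random.default_rng(0)
print(gaussian_mle(rng.normal(loc=2.0, scale=1.5, size=1000)))  # close to (2.0, 2.25)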
Bias-variance tradeoff ● Mean squared error of an estimator θ̂ of θ: MSE(θ̂) = E[(θ̂ − θ)²] = (E[θ̂] − θ)² + Var(θ̂) = bias² + variance ● A biased estimator may achieve better MSE than an unbiased one.
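A small simulation (illustrative, not from the slides) of the classic example: the biased MLE of the variance (divide by N) can have a lower MSE than the unbiased estimator (divide by N − 1) on Gaussian data.

import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, n_trials = 10, 100_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(n_trials, n))
biased = samples.var(axis=1, ddof=0)    # MLE, divides by N
unbiased = samples.var(axis=1, ddof=1)  # divides by N - 1

mse = lambda est: np.mean((est - true_var) ** 2)
print(mse(biased), mse(unbiased))  # the biased estimator typically has the lower MSE here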
Bayes estimator ● Treat θ as a random variable with prior p(θ) ● Bayes rule: p(θ | X) = p(X | θ) p(θ) / p(X) (posterior = likelihood × prior / evidence) ● Density estimation at x: p(x | X) = ∫ p(x | θ) p(θ | X) dθ ● Maximum likelihood estimate (MLE): θ_MLE = argmax_θ p(X | θ) ● Bayes estimate: θ_Bayes = E[θ | X] = ∫ θ p(θ | X) dθ
Bayes estimator: Normal prior ● n data points (iid), xᵢ | θ ~ N(θ, σ₀²), with prior θ ~ N(μ, σ²) ● MLE of θ: the sample mean m = (1/n) ∑ᵢ xᵢ ● Compute the Bayes estimator of θ. Hint: compute p(θ | X) and show that it follows a normal distribution.
Bayes estimator: Normal prior ● p(θ | X) follows a normal distribution with – mean E[θ | X] = (n σ² m + σ₀² μ) / (n σ² + σ₀²) – variance σ² σ₀² / (n σ² + σ₀²)
Bayes estimator: Normal prior ● MLE of θ: the sample mean m ● Bayes estimator: θ_Bayes = (n σ²)/(n σ² + σ₀²) · m + σ₀²/(n σ² + σ₀²) · μ, a weighted average of the sample mean and the prior mean. The weight on the sample mean is large when n is large or when σ is large.
Bayes estimator: Normal prior ● When n ↗, θ_Bayes gets closer to the sample average (uses information from the sample). ● When σ is small, θ_Bayes gets closer to μ (little uncertainty about the prior).
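A sketch of this estimator in numpy, under the notation assumed above (xᵢ | θ ~ N(θ, σ₀²), prior θ ~ N(μ, σ²)); the names are mine:

import numpy as np

def bayes_estimate_normal(x, mu_prior, var_prior, var_noise):
    # Posterior mean of theta: a weighted average of the sample mean and the prior mean
    x = np.asarray(x, dtype=float)
    n, m = len(x), x.mean()
    w = n * var_prior / (n * var_prior + var_noise)
    return w * m + (1.0 - w) * mu_prior

# with few points the estimate stays close to the prior mean,
# with many points it approaches the sample mean
print(bayes_estimate_normal([2.1, 1.8, 2.4], mu_prior=0.0, var_prior=1.0, var_noise=1.0))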
Linear regression
Linear regression: MLE ● Assume the error is Gaussian distributed: y = βx + β₀ + ε with ε ~ N(0, σ²), so E[y|x] = βx + β₀ ● Replace g with its estimator f. [Figure: the conditional density p(y|x*) centred on the regression line at x*.]
MLE under Gaussian noise ● Maximize the (log) likelihood of the observations; the Gaussian normalization term is independent of β, so only the sum of squared errors matters.
MLE under Gaussian noise ● Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the sum of squared residuals.
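In my notation (assuming yᵢ = βᵀxᵢ + β₀ + εᵢ with εᵢ ~ N(0, σ²) i.i.d., which matches the Gaussian-error assumption above), the log likelihood reads:

\log L(\beta, \beta_0) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \beta^\top x_i - \beta_0 \right)^2

The first term does not depend on β, so maximizing log L over (β, β₀) is exactly minimizing the residual sum of squares.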
Linear regression least-squares fit ● Minimize the residual sum of squares RSS(β) = ∑ᵢ (yᵢ − βᵀxᵢ)² = ‖y − Xβ‖² (with a constant column in X to absorb the intercept β₀).
Linear regression least-squares fit ● Minimize the residual sum of squares ● Historically: – Carl Friedrich Gauss (to predict the location of Ceres) – Adrien-Marie Legendre
Linear regression least-squares fit ● Minimize the residual sum of squares ● Estimate β. What condition do you need to verify?
Linear regression least-squares fit ● Minimize the residual sum of squares ● Assuming X has full column rank (and hence XᵀX invertible): β̂ = (XᵀX)⁻¹ Xᵀ y ● If X is rank-deficient, use a pseudo-inverse. A pseudo-inverse of A is a matrix G s.t. AGA = A.
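A numpy sketch of this solution (the helper name and toy data are mine); the pseudo-inverse route also covers the rank-deficient case:

import numpy as np

def fit_least_squares(X, y):
    # Least-squares coefficients; pinv equals (X^T X)^{-1} X^T when X has full column rank
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend a column of 1s for the intercept
    return np.linalg.pinv(X1) @ y

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(fit_least_squares(X, y))  # approx. [1.0, 2.0, -3.0]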
Gauss-Markov Theorem ● Under the assumption that the errors have zero mean, are uncorrelated and have constant variance σ², the least-squares estimator of β is its (unique) best linear unbiased estimator. ● Best Linear Unbiased Estimator (BLUE): Var(β̂) ≤ Var(β*) for any β* that is a linear unbiased estimator of β. ● Proof sketch: write any linear unbiased estimator as β* = ((XᵀX)⁻¹Xᵀ + D) y; unbiasedness for all β forces DX = 0, and then Var(β*) − Var(β̂) = σ² D Dᵀ, which is positive semidefinite and minimal for D = 0.
Correlated variables ● If the variables are decorrelated: – each coefficient can be estimated separately; – interpretation is easy: "a change of 1 in x_j is associated with a change of β_j in Y, while everything else stays the same." ● Correlations between variables cause problems: – the variance of all coefficients tends to increase; – interpretation is much harder: when x_j changes, so does everything else.
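A small simulation (illustrative only; the function and variable names are mine) of the variance-inflation point: the same model is fit repeatedly on decorrelated and on strongly correlated features, and the spread of the estimated coefficients is compared.

import numpy as np

rng = np.random.default_rng(0)

def coef_std(correlation, n=50, n_trials=2000):
    # Standard deviation of the least-squares coefficients over repeated samples
    cov = np.array([[1.0, correlation], [correlation, 1.0]])
    betas = []
    for _ in range(n_trials):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
        betas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.std(betas, axis=0)

print(coef_std(0.0))   # lower spread of the coefficient estimates
print(coef_std(0.95))  # much higher spread when the two features are correlated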
Logistic regression
What about classification?
What about classification? ● Model P(Y=1|x) as a linear function?
What about classification? ● Model P(Y=1|x) as a linear function? – Problem: P(Y=1|x) must be between 0 and 1. – Non-linearity: ● if P(Y=1|x) is close to 1 or 0, x must change a lot for y to change; ● if P(Y=1|x) is close to 0.5, that's not the case. – Hence: model the logit transformation of p = P(Y=1|x) as a linear function, log(p / (1 − p)) = f(x) → Logistic regression.
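A short sketch of the resulting model, assuming a linear score f(x) = βᵀx + β₀ (the names are mine):

import numpy as np

def sigmoid(t):
    # Inverse of the logit transformation: maps the linear score to (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X, beta, beta0):
    # Logistic regression model: P(Y=1 | x) = sigmoid(beta^T x + beta_0)
    return sigmoid(X @ beta + beta0)

X = np.array([[0.5, -1.2], [2.0, 0.3]])
print(predict_proba(X, beta=np.array([1.0, -0.5]), beta0=0.1))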
Maximum likelihood estimation of logistic regression coefficients ● Log likelihood for n observations?
Maximum likelihood estimation of logistic regression coefficients ● Log likelihood for n observations: L(β) = ∑ᵢ [yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)], with pᵢ = P(Y=1 | xᵢ).
Maximum likelihood estimation of logistic regression coefficients ● Gradient of the log likelihood?
Maximum likelihood estimation of logistic regression coefficients ● Gradient of the log likelihood: ∇_β L = ∑ᵢ (yᵢ − pᵢ) xᵢ ● To maximize the likelihood: – set the gradient to 0 – cannot be solved analytically – −L is convex, so we can use gradient descent (no local minima)
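A minimal sketch of such a gradient-based fit (learning rate, iteration count and toy data are my choices; this is gradient ascent on L, equivalently gradient descent on −L):

import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    # Maximize the log likelihood by gradient ascent; gradient: sum_i (y_i - p_i) x_i
    X1 = np.column_stack([np.ones(len(X)), X])  # intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        beta += lr * X1.T @ (y - p) / len(y)    # averaged gradient step
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y))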
Summary ● MAP estimate: θ_MAP = argmax_θ p(θ | X) ● MLE: θ_MLE = argmax_θ p(X | θ) ● Bayes estimate: θ_Bayes = E[θ | X] ● Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the RSS. ● Linear regression MLE / least-squares fit: β̂ = (XᵀX)⁻¹ Xᵀ y ● Logistic regression MLE: no closed form, solve with gradient descent.
References ● A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf – Least-squares regression: Chap 7.6 ● The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/ – Least-squares regression: Chap 2.2.1, 3.1, 3.2.1 – Gauss-Markov theorem: Chap 3.2.3
class GradientDescentOptimizer():
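The body of this class is not included in this extract; a minimal sketch of what such an optimizer could look like (the interface and names are my assumption, not the course's skeleton):

import numpy as np

class GradientDescentOptimizer:
    # Generic fixed-step gradient descent on a differentiable objective
    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr
        self.n_iter = n_iter

    def minimize(self, grad_fn, theta0):
        # Repeatedly step against the gradient of the objective
        theta = np.asarray(theta0, dtype=float).copy()
        for _ in range(self.n_iter):
            theta -= self.lr * grad_fn(theta)
        return theta

# usage: minimize f(theta) = ||theta - 3||^2, whose gradient is 2 (theta - 3)
opt = GradientDescentOptimizer(lr=0.1, n_iter=200)
print(opt.minimize(lambda t: 2 * (t - 3.0), np.zeros(2)))  # approx. [3., 3.]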
class LeastSquaresRegr()
class seq_LeastSquaresRegr()
class LogisticRegr()