Foundations of Machine Learning CentraleSupélec — Fall 2017 6. Linear & logistic regressions Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
Learning objectives ● Density estimation: – Define parametric methods. – Define the maximum likelihood estimator and compute it for Bernoulli, multinomial and Gaussian densities. – Define the Bayes estimator and compute it for normal priors. ● Supervised learning: – Compute the maximum likelihood estimator / least-squares fit solution for linear regression. – Compute the maximum likelihood estimator for logistic regression.
Density estimation
Parametric methods ● Parametric estimation: – assume a parametric form for p(x|θ), e.g. a Gaussian with θ = (μ, σ²) – Goal: estimate θ using the sample X – usually assume that the samples are independent and identically distributed (iid)
Maximum likelihood estimation ● Find θ such that X is the most likely to have been drawn. ● Likelihood of θ given the i.i.d. sample X = {x₁, …, x_N}: L(θ | X) = ∏ᵢ p(xᵢ | θ) ● Log likelihood: ℓ(θ | X) = ∑ᵢ log p(xᵢ | θ) ● Maximum likelihood estimation (MLE): θ_MLE = argmax_θ ℓ(θ | X)
Bernoulli density ● Two states: failure (x = 0) / success (x = 1), P(x) = p₀ˣ (1 − p₀)^(1−x) ● MLE estimate of p₀?
Bernoulli density ● Two states: failure / success ● Log likelihood: ℓ(p₀ | X) = ∑ᵢ [xᵢ log p₀ + (1 − xᵢ) log(1 − p₀)] ● Maximize the likelihood: set its gradient to 0, which gives the MLE estimate p̂₀ = (1/N) ∑ᵢ xᵢ (the proportion of successes in the sample).
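A minimal numpy sketch of this estimate (the function name and the toy sample are illustrative, not from the course material):

import numpy as np

def bernoulli_mle(x):
    # MLE of the success probability p0 for 0/1 samples: the sample mean
    x = np.asarray(x)
    return x.mean()

# toy sample: 7 successes out of 10 draws -> p0_hat = 0.7
print(bernoulli_mle([1, 1, 0, 1, 1, 1, 0, 1, 0, 1]))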
Multinomial density ● Consider K mutually exclusive and exhaustive classes – Each class occurs with probability p_k – x₁, x₂, …, x_K indicator variables: x_k = 1 if the outcome is class k and 0 otherwise ● The MLE of p_k is the proportion of samples in class k: p̂_k = N_k / N, where N_k is the number of samples falling in class k.
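A short sketch under the same assumptions (the function name and toy labels are mine; class counts come from numpy's bincount):

import numpy as np

def multinomial_mle(labels, K):
    # MLE of p_k for K mutually exclusive classes: the class proportions N_k / N
    counts = np.bincount(np.asarray(labels), minlength=K)
    return counts / counts.sum()

# toy sample with K = 3 classes
print(multinomial_mle([0, 2, 1, 1, 2, 2, 0, 2], K=3))  # [0.25 0.25 0.5]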
Gaussian distribution ● Gaussian distribution = normal distribution: p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)) ● Compute the MLE estimates of μ and σ.
Gaussian distribution ● MLE estimates: μ̂ = (1/N) ∑ᵢ xᵢ (the sample mean) and σ̂² = (1/N) ∑ᵢ (xᵢ − μ̂)² (the sample variance, normalized by N).
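A numpy sketch of these two estimates (function and variable names are mine, not the course's):

import numpy as np

def gaussian_mle(x):
    # MLE of (mu, sigma^2): sample mean and the biased variance (normalized by N, not N-1)
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    var_hat = np.mean((x - mu_hat) ** 2)  # same as x.var(ddof=0)
    return mu_hat, var_hat

rng = np.random.default_rng(0)
print(gaussian_mle(rng.normal(loc=2.0, scale=1.5, size=1000)))  # close to (2.0, 2.25)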
Bias-variance tradeoff ● Mean squared error of an estimator θ̂ of θ: MSE(θ̂) = E[(θ̂ − θ)²] = (E[θ̂] − θ)² + Var(θ̂) = bias² + variance ● A biased estimator may achieve better MSE than an unbiased one.
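A small simulation (illustrative, not from the slides) of the classic example: the biased MLE of the variance (divide by N) can have a lower MSE than the unbiased estimator (divide by N − 1) on Gaussian data.

import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, n_trials = 10, 100_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(n_trials, n))
biased = samples.var(axis=1, ddof=0)    # MLE, divides by N
unbiased = samples.var(axis=1, ddof=1)  # divides by N - 1

mse = lambda est: np.mean((est - true_var) ** 2)
print(mse(biased), mse(unbiased))  # the biased estimator typically has the lower MSE here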
Bayes estimator ● Treat θ as a random variable with prior p(θ) ● Bayes rule: p(θ | X) = p(X | θ) p(θ) / p(X) (posterior = likelihood × prior / evidence) ● Density estimation at x: p(x | X) = ∫ p(x | θ) p(θ | X) dθ ● Maximum likelihood estimate (MLE): θ_MLE = argmax_θ p(X | θ) ● Bayes estimate: θ_Bayes = E[θ | X] = ∫ θ p(θ | X) dθ
Bayes estimator: Normal prior ● n data points (iid), xᵢ | θ ~ N(θ, σ₀²), with prior θ ~ N(μ, σ²) ● MLE of θ: the sample mean m = (1/n) ∑ᵢ xᵢ ● Compute the Bayes estimator of θ. Hint: compute p(θ | X) and show that it follows a normal distribution.
Bayes estimator: Normal prior ● p(θ | X) follows a normal distribution with – mean E[θ | X] = (n σ² m + σ₀² μ) / (n σ² + σ₀²) – variance σ² σ₀² / (n σ² + σ₀²)
Bayes estimator: Normal prior ● MLE of θ: the sample mean m ● Bayes estimator: θ_Bayes = (n σ²)/(n σ² + σ₀²) · m + σ₀²/(n σ² + σ₀²) · μ, a weighted average of the sample mean and the prior mean. The weight on the sample mean is large when n is large or when σ is large.
Bayes estimator: Normal prior ● When n ↗, θ_Bayes gets closer to the sample average (uses information from the sample). ● When σ is small, θ_Bayes gets closer to μ (little uncertainty about the prior).
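A sketch of this estimator in numpy, under the notation assumed above (xᵢ | θ ~ N(θ, σ₀²), prior θ ~ N(μ, σ²)); the names are mine:

import numpy as np

def bayes_estimate_normal(x, mu_prior, var_prior, var_noise):
    # Posterior mean of theta: a weighted average of the sample mean and the prior mean
    x = np.asarray(x, dtype=float)
    n, m = len(x), x.mean()
    w = n * var_prior / (n * var_prior + var_noise)
    return w * m + (1.0 - w) * mu_prior

# with few points the estimate stays close to the prior mean,
# with many points it approaches the sample mean
print(bayes_estimate_normal([2.1, 1.8, 2.4], mu_prior=0.0, var_prior=1.0, var_noise=1.0))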
Linear regression
Linear regression: MLE ● Assume the error is Gaussian distributed: y = βx + β₀ + ε with ε ~ N(0, σ²), so E[y|x] = βx + β₀ ● Replace g with its estimator f. [Figure: the conditional density p(y|x*) centred on the regression line at x*.]
MLE under Gaussian noise ● Maximize the (log) likelihood of the observations; the Gaussian normalization term is independent of β, so only the sum of squared errors matters.
MLE under Gaussian noise ● Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the sum of squared residuals.
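In my notation (assuming yᵢ = βᵀxᵢ + β₀ + εᵢ with εᵢ ~ N(0, σ²) i.i.d., which matches the Gaussian-error assumption above), the log likelihood reads:

\log L(\beta, \beta_0) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \beta^\top x_i - \beta_0 \right)^2

The first term does not depend on β, so maximizing log L over (β, β₀) is exactly minimizing the residual sum of squares.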
Linear regression least-squares fit ● Minimize the residual sum of squares RSS(β) = ∑ᵢ (yᵢ − βᵀxᵢ)² = ‖y − Xβ‖² (with a constant column in X to absorb the intercept β₀).
Linear regression least-squares fit ● Minimize the residual sum of squares ● Historically: – Carl Friedrich Gauss (to predict the location of Ceres) – Adrien-Marie Legendre
Linear regression least-squares fit ● Minimize the residual sum of squares ● Estimate β. What condition do you need to verify?
Linear regression least-squares fit ● Minimize the residual sum of squares ● Assuming X has full column rank (and hence XᵀX invertible): β̂ = (XᵀX)⁻¹ Xᵀ y ● If X is rank-deficient, use a pseudo-inverse. A pseudo-inverse of A is a matrix G s.t. AGA = A.
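A numpy sketch of this solution (the helper name and toy data are mine); the pseudo-inverse route also covers the rank-deficient case:

import numpy as np

def fit_least_squares(X, y):
    # Least-squares coefficients; pinv equals (X^T X)^{-1} X^T when X has full column rank
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend a column of 1s for the intercept
    return np.linalg.pinv(X1) @ y

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(fit_least_squares(X, y))  # approx. [1.0, 2.0, -3.0]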
Gauss-Markov Theorem ● Under the assumption that the errors have zero mean, are uncorrelated and have constant variance σ², the least-squares estimator of β is its (unique) best linear unbiased estimator. ● Best Linear Unbiased Estimator (BLUE): Var(β̂) ≤ Var(β*) for any β* that is a linear unbiased estimator of β. ● Proof sketch: write any linear unbiased estimator as β* = ((XᵀX)⁻¹Xᵀ + D) y; unbiasedness for all β forces DX = 0, and then Var(β*) − Var(β̂) = σ² D Dᵀ, which is positive semidefinite and minimal for D = 0.
Correlated variables ● If the variables are decorrelated: – each coefficient can be estimated separately; – interpretation is easy: "a change of 1 in x_j is associated with a change of β_j in Y, while everything else stays the same." ● Correlations between variables cause problems: – the variance of all coefficients tends to increase; – interpretation is much harder: when x_j changes, so does everything else.
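A small simulation (illustrative only; the function and variable names are mine) of the variance-inflation point: the same model is fit repeatedly on decorrelated and on strongly correlated features, and the spread of the estimated coefficients is compared.

import numpy as np

rng = np.random.default_rng(0)

def coef_std(correlation, n=50, n_trials=2000):
    # Standard deviation of the least-squares coefficients over repeated samples
    cov = np.array([[1.0, correlation], [correlation, 1.0]])
    betas = []
    for _ in range(n_trials):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
        betas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.std(betas, axis=0)

print(coef_std(0.0))   # lower spread of the coefficient estimates
print(coef_std(0.95))  # much higher spread when the two features are correlated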
Logistic regression
What about classification?
What about classification? ● Model P(Y=1|x) as a linear function?
What about classification? ● Model P(Y=1|x) as a linear function? – Problem: P(Y=1|x) must be between 0 and 1. – Non-linearity: ● if P(Y=1|x) is close to 1 or 0, x must change a lot for y to change; ● if P(Y=1|x) is close to 0.5, that's not the case. – Hence: model the logit transformation of p = P(Y=1|x) as a linear function, log(p / (1 − p)) = f(x) → Logistic regression.
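A short sketch of the resulting model, assuming a linear score f(x) = βᵀx + β₀ (the names are mine):

import numpy as np

def sigmoid(t):
    # Inverse of the logit transformation: maps the linear score to (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X, beta, beta0):
    # Logistic regression model: P(Y=1 | x) = sigmoid(beta^T x + beta_0)
    return sigmoid(X @ beta + beta0)

X = np.array([[0.5, -1.2], [2.0, 0.3]])
print(predict_proba(X, beta=np.array([1.0, -0.5]), beta0=0.1))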
Maximum likelihood estimation of logistic regression coefficients ● Log likelihood for n observations?
Maximum likelihood estimation of logistic regression coefficients ● Log likelihood for n observations: L(β) = ∑ᵢ [yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)], with pᵢ = P(Y=1 | xᵢ).
Maximum likelihood estimation of logistic regression coefficients ● Gradient of the log likelihood?
Maximum likelihood estimation of logistic regression coefficients ● Gradient of the log likelihood: ∇_β L = ∑ᵢ (yᵢ − pᵢ) xᵢ ● To maximize the likelihood: – set the gradient to 0 – cannot be solved analytically – −L is convex, so we can use gradient descent (no local minima)
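A minimal sketch of such a gradient-based fit (learning rate, iteration count and toy data are my choices; this is gradient ascent on L, equivalently gradient descent on −L):

import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    # Maximize the log likelihood by gradient ascent; gradient: sum_i (y_i - p_i) x_i
    X1 = np.column_stack([np.ones(len(X)), X])  # intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        beta += lr * X1.T @ (y - p) / len(y)    # averaged gradient step
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y))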
Summary ● MAP estimate: θ_MAP = argmax_θ p(θ | X) ● MLE: θ_MLE = argmax_θ p(X | θ) ● Bayes estimate: θ_Bayes = E[θ | X] ● Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the RSS. ● Linear regression MLE / least-squares fit: β̂ = (XᵀX)⁻¹ Xᵀ y ● Logistic regression MLE: no closed form, solve with gradient descent.
References ● A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf – Least-squares regression: Chap 7.6 ● The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/ – Least-squares regression: Chap 2.2.1, 3.1, 3.2.1 – Gauss-Markov theorem: Chap 3.2.3
class GradientDescentOptimizer():
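The body of this class is not included in this extract; a minimal sketch of what such an optimizer could look like (the interface and names are my assumption, not the course's skeleton):

import numpy as np

class GradientDescentOptimizer:
    # Generic fixed-step gradient descent on a differentiable objective
    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr
        self.n_iter = n_iter

    def minimize(self, grad_fn, theta0):
        # Repeatedly step against the gradient of the objective
        theta = np.asarray(theta0, dtype=float).copy()
        for _ in range(self.n_iter):
            theta -= self.lr * grad_fn(theta)
        return theta

# usage: minimize f(theta) = ||theta - 3||^2, whose gradient is 2 (theta - 3)
opt = GradientDescentOptimizer(lr=0.1, n_iter=200)
print(opt.minimize(lambda t: 2 * (t - 3.0), np.zeros(2)))  # approx. [3., 3.]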
class LeastSquaresRegr()
class seq_LeastSquaresRegr()
class LogisticRegr()