6. Linear & logistic regressions


  1. Foundations of Machine Learning — CentraleSupélec, Fall 2017. 6. Linear & logistic regressions. Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech. chloe-agathe.azencott@mines-paristech.fr

  2. Learning objectives
  ● Density estimation:
    – Define parametric methods.
    – Define the maximum likelihood estimator and compute it for Bernoulli, multinomial and Gaussian densities.
    – Define the Bayes estimator and compute it for normal priors.
  ● Supervised learning:
    – Compute the maximum likelihood estimator / least-squares fit solution for linear regression.
    – Compute the maximum likelihood estimator for logistic regression.

  3. Density estimation

  4. Parametric methods
  ● Parametric estimation:
    – Assume a form for p(x|θ), e.g. a Gaussian.
    – Goal: estimate θ using the sample X.
    – Usually assume that the observations are independent and identically distributed (iid).

  5. Maximum likelihood estimation
  ● Find θ such that X is the most likely to have been drawn.
  ● Likelihood of θ given the i.i.d. sample X: L(θ | X) = ∏_t p(x^t | θ)
  ● Log likelihood: ℓ(θ | X) = log L(θ | X) = ∑_t log p(x^t | θ)
  ● Maximum likelihood estimation (MLE): θ̂ = argmax_θ ℓ(θ | X)

  6. Bernoulli density
  ● Two states: failure / success; P(x) = p₀^x (1 − p₀)^(1−x), x ∈ {0, 1}
  ● MLE estimate of p₀:

  7. Bernoulli density
  ● Two states: failure / success; P(x) = p₀^x (1 − p₀)^(1−x)
  ● MLE estimate of p₀:
  ● Log likelihood: ?

  8. Bernoulli density
  ● Two states: failure / success; P(x) = p₀^x (1 − p₀)^(1−x)
  ● MLE estimate of p₀:
  ● Log likelihood: ℓ(p₀ | X) = ∑_t [x^t log p₀ + (1 − x^t) log(1 − p₀)]
  ● Maximize the likelihood: ?

  9. Bernoulli density
  ● Two states: failure / success; P(x) = p₀^x (1 − p₀)^(1−x)
  ● MLE estimate of p₀:
  ● Log likelihood: ℓ(p₀ | X) = ∑_t [x^t log p₀ + (1 − x^t) log(1 − p₀)]
  ● Maximize the likelihood: set its gradient to 0. ?


  11. Bernoulli density
  ● Two states: failure / success; P(x) = p₀^x (1 − p₀)^(1−x)
  ● Log likelihood: ℓ(p₀ | X) = ∑_t [x^t log p₀ + (1 − x^t) log(1 − p₀)]
  ● Maximize the likelihood: set its gradient to 0.
  ● MLE estimate of p₀: p̂₀ = (1/N) ∑_t x^t, the sample proportion of successes.
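A minimal numpy sketch (not from the slides; sample values made up) checking the result above: the closed-form MLE p̂₀, the sample proportion of successes, coincides with a grid search over the Bernoulli log-likelihood.

import numpy as np

# Toy sample of failures/successes (x^t in {0, 1}).
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Closed-form MLE: the sample proportion of successes.
p_mle = x.mean()

# Bernoulli log-likelihood as a function of p0.
def log_likelihood(p0, x):
    return np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))

# Check numerically that the closed form maximizes the log-likelihood.
grid = np.linspace(0.01, 0.99, 99)
p_grid = grid[np.argmax([log_likelihood(p, x) for p in grid])]

print(p_mle, p_grid)  # both close to 0.7 for this sample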

  12. Multinomial density
  ● Consider K mutually exclusive and exhaustive classes:
    – Each class occurs with probability p_k.
    – x_1, x_2, …, x_K are indicator variables: x_k = 1 if the outcome is class k and 0 otherwise.
  ● The MLE of p_k is the empirical frequency of class k: p̂_k = (1/N) ∑_t x_k^t
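A short sketch of the multinomial MLE, with a made-up toy sample: the estimate of each p_k is simply the empirical class frequency.

import numpy as np

# Toy outcomes drawn from K = 3 classes, encoded as labels 0..K-1.
outcomes = np.array([0, 2, 1, 0, 0, 2, 1, 0, 2, 2])

# MLE of each p_k: the empirical frequency of class k.
counts = np.bincount(outcomes, minlength=3)
p_hat = counts / counts.sum()

print(p_hat)  # [0.4, 0.2, 0.4] for this sample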

  13. Gaussian distribution
  ● Gaussian distribution = normal distribution: p(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
  ● Compute the MLE estimates of μ and σ.


  15. Gaussian distribution
  ● Gaussian distribution = normal distribution: p(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
  ● MLE estimates: μ̂ = (1/N) ∑_t x^t and σ̂² = (1/N) ∑_t (x^t − μ̂)²
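A numpy sketch of the Gaussian MLE on simulated data (values and seed are arbitrary); note that the MLE of the variance divides by N, not N − 1.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # toy Gaussian sample

# MLE of the mean: the sample mean.
mu_hat = x.mean()
# MLE of the variance: the *biased* sample variance (divide by N, not N-1).
sigma2_hat = np.mean((x - mu_hat) ** 2)  # same as x.var(ddof=0)

print(mu_hat, np.sqrt(sigma2_hat))  # close to the true (2.0, 1.5)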

  16. Bias-variance tradeoff
  ● Mean squared error of the estimator: MSE(θ̃) = E[(θ̃ − θ₀)²] = Var(θ̃) + (E[θ̃] − θ₀)²  (variance + squared bias)
  ● A biased estimator may achieve better MSE than an unbiased one.
  [Figure: the bias is the gap between E[θ̃] and the true θ₀; the variance is the spread of θ̃ around E[θ̃].]
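A Monte Carlo illustration (not from the slides) of this point: on small Gaussian samples the biased MLE variance estimator (divide by N) typically achieves a lower MSE than the unbiased estimator (divide by N − 1).

import numpy as np

rng = np.random.default_rng(0)
true_var = 1.0
N, n_repeats = 5, 100_000

# Many small Gaussian samples with known variance.
samples = rng.normal(0.0, np.sqrt(true_var), size=(n_repeats, N))
var_mle = samples.var(axis=1, ddof=0)       # biased (MLE)
var_unbiased = samples.var(axis=1, ddof=1)  # unbiased

mse = lambda est: np.mean((est - true_var) ** 2)
print(mse(var_mle), mse(var_unbiased))  # the biased MLE has the lower MSE here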

  17. Bayes estimator
  ● Treat θ as a random variable with prior p(θ).
  ● Bayes rule: p(θ | X) = p(X | θ) p(θ) / p(X)   (posterior = likelihood × prior / evidence)
  ● Density estimation at x: p(x | X) = ∫ p(x | θ) p(θ | X) dθ

  18. Bayes estimator
  ● Treat θ as a random variable with prior p(θ).
  ● Bayes rule: p(θ | X) = p(X | θ) p(θ) / p(X)
  ● Density estimation: p(x | X) = ∫ p(x | θ) p(θ | X) dθ
  ● Maximum likelihood estimate (MLE): θ̂_MLE = argmax_θ p(X | θ)
  ● Bayes estimate: ?


  20. Bayes estimator: Normal prior
  ● n data points (iid), drawn from a normal distribution with mean θ; normal prior on θ with mean μ.
  ● MLE of θ: the sample mean m.
  ● Compute the Bayes estimator of θ.
    Hint: compute p(θ | X) and show that it follows a normal distribution.


  24. Bayes estimator: Normal prior
  ● n data points (iid); MLE of θ: the sample mean m.
  ● p(θ | X) follows a normal distribution with
    – mean: a weighted average of the prior mean μ and the sample mean m
    – variance: smaller than both the prior variance and the variance of the sample mean


  26. Bayes estimator: Normal prior
  ● n data points (iid); MLE of θ: the sample mean m.
  ● Bayes estimator: θ_Bayes is a weighted average of the prior mean μ and the sample mean m.

  27. Bayes estimator: Normal prior
  ● n data points (iid); MLE of θ: the sample mean m.
  ● Bayes estimator: θ_Bayes is a weighted average of the prior mean μ and the sample mean m.
  ● Which weight is large when n is large? Which is large when σ is small? ?

  28. Bayes estimator: Normal prior
  ● n data points (iid); MLE of θ: the sample mean m.
  ● Bayes estimator: θ_Bayes is a weighted average of the prior mean μ and the sample mean m.
  ● When n ↗, θ_Bayes gets closer to the sample average (uses information from the sample).
  ● When σ is small, θ_Bayes gets closer to μ (little uncertainty about the prior).
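A sketch of this estimator under the usual conjugate-Gaussian setup; the notation (known noise variance s2, prior N(mu, sigma2)) is an assumption consistent with the slides rather than copied from them.

import numpy as np

# x^t ~ N(theta, s2) with known s2, prior theta ~ N(mu, sigma2).
# The posterior is normal; its mean blends the prior mean and the sample mean.
def bayes_estimate(x, mu, sigma2, s2):
    n, m = len(x), np.mean(x)             # sample size and sample mean (MLE)
    w = (n / s2) / (n / s2 + 1 / sigma2)  # weight on the sample mean
    return w * m + (1 - w) * mu           # posterior mean = Bayes estimator

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=10)         # toy data
print(np.mean(x), bayes_estimate(x, mu=0.0, sigma2=0.5, s2=1.0))
# With few points and a tight prior, the estimate is pulled towards mu = 0.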

  29. Linear regression

  30. Linear regression

  31. Linear regression: MLE
  ● Assume the error is Gaussian distributed: y = g(x) + ε, ε ~ N(0, σ²).
  ● Replace g with its estimator f.
  [Figure: the regression line E[y|x] = βx + β₀; at a given x*, p(y|x*) is a Gaussian centred on E[y|x*].]

  32. MLE under Gaussian noise
  ● Maximize the (log) likelihood: log L(β) = −(N/2) log(2πσ²) − (1/2σ²) ∑_t (y^t − f(x^t))², where the first term is independent of β.

  33. MLE under Gaussian noise
  ● Maximize the (log) likelihood: log L(β) = −(N/2) log(2πσ²) − (1/2σ²) ∑_t (y^t − f(x^t))², where the first term is independent of β.
  ● What is maximizing the likelihood equivalent to? ?

  34. MLE under Gaussian noise
  ● Maximize the (log) likelihood: log L(β) = −(N/2) log(2πσ²) − (1/2σ²) ∑_t (y^t − f(x^t))², where the first term is independent of β.
  ● Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the sum of squared residuals.

  35. Linear regression: least-squares fit
  ● Minimize the residual sum of squares: RSS(β) = ∑_t (y^t − βᵀx^t)² = ‖y − Xβ‖²

  36. Linear regression: least-squares fit
  ● Minimize the residual sum of squares: RSS(β) = ‖y − Xβ‖²
  ● Historically:
    – Carl Friedrich Gauss (to predict the location of Ceres)
    – Adrien-Marie Legendre

  37. Linear regression: least-squares fit
  ● Minimize the residual sum of squares: RSS(β) = ‖y − Xβ‖²
  ● Estimate β. What condition do you need to verify?

  38. Linear regression: least-squares fit
  ● Minimize the residual sum of squares: RSS(β) = ‖y − Xβ‖²
  ● Assuming X has full column rank (and hence XᵀX is invertible): β̂ = (XᵀX)⁻¹ Xᵀ y

  39. Linear regression: least-squares fit
  ● Minimize the residual sum of squares: RSS(β) = ‖y − Xβ‖²
  ● Assuming X has full column rank (and hence XᵀX is invertible): β̂ = (XᵀX)⁻¹ Xᵀ y
  ● If X is rank-deficient, use a pseudo-inverse. A pseudo-inverse of A is a matrix G s.t. AGA = A.
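A numpy sketch of the least-squares fit on simulated data: the normal-equations solution when X has full column rank, and the pseudo-inverse fallback otherwise. Data and coefficients are made up.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # add intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)           # Gaussian noise

# Normal equations, valid when X has full column rank.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Rank-deficient (or ill-conditioned) X: fall back on the pseudo-inverse,
# which picks the minimum-norm solution among all least-squares solutions.
beta_pinv = np.linalg.pinv(X) @ y

print(beta_hat, beta_pinv)  # both close to beta_true here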

  40. Gauss-Markov Theorem
  ● Under the assumption that the errors have zero mean, constant variance and are uncorrelated, the least-squares estimator of β is its (unique) best linear unbiased estimator.

  41. Gauss-Markov Theorem
  ● Under the assumption that the errors have zero mean, constant variance and are uncorrelated, the least-squares estimator of β is its (unique) best linear unbiased estimator.
  ● Best Linear Unbiased Estimator (BLUE): Var(β̂) < Var(β*) for any β* that is a linear unbiased estimator of β.


  44. Gauss-Markov Theorem
  ● Best Linear Unbiased Estimator (BLUE): Var(β̂) < Var(β*) for any β* that is a linear unbiased estimator of β.
  ● Proof sketch: write any linear unbiased estimator as β* = ((XᵀX)⁻¹Xᵀ + D) y; its variance exceeds that of β̂ by a positive semi-definite term, which is minimal for D = 0.
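An informal simulation (not part of the deck) of the Gauss-Markov statement: OLS on the full sample versus another linear unbiased estimator of β, here OLS restricted to half the sample, which is linear in y and unbiased but has larger variance.

import numpy as np

rng = np.random.default_rng(0)
n, beta_true = 200, np.array([2.0, -1.0])
X = rng.normal(size=(n, 2))  # fixed design

ols_full, ols_half = [], []
for _ in range(2000):
    y = X @ beta_true + rng.normal(size=n)
    ols_full.append(np.linalg.lstsq(X, y, rcond=None)[0])
    ols_half.append(np.linalg.lstsq(X[: n // 2], y[: n // 2], rcond=None)[0])

print(np.var(ols_full, axis=0))  # smaller variance for each coefficient
print(np.var(ols_half, axis=0))  # roughly twice as large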


  49. Correlated variables
  ● If the variables are uncorrelated:
    – Each coefficient can be estimated separately;
    – Interpretation is easy: "a change of 1 in x_j is associated with a change of β_j in Y, while everything else stays the same."
  ● Correlations between variables cause problems:
    – The variance of all coefficients tends to increase;
    – Interpretation is much harder: when x_j changes, so does everything else.
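A toy simulation (not from the slides) of the variance inflation mentioned above: the OLS coefficient variance grows sharply when the two features are strongly correlated, even though the true β is fixed.

import numpy as np

rng = np.random.default_rng(0)
n, beta_true = 200, np.array([1.0, 1.0])

def coef_variance(rho, n_repeats=2000):
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal([0, 0], cov, size=n)  # correlated features
    betas = []
    for _ in range(n_repeats):
        y = X @ beta_true + rng.normal(size=n)
        betas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.var(betas, axis=0)

print(coef_variance(rho=0.0))   # baseline coefficient variance
print(coef_variance(rho=0.95))  # much larger variance for both coefficients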

  50. Logistic regression

  51. What about classification?

  52. What about classification?
  ● Model P(Y=1|x) as a linear function?

  53. What about classification?
  ● Model P(Y=1|x) as a linear function?
    – Problem: P(Y=1|x) must be between 0 and 1.
    – Non-linearity:
      ● If P(Y=1|x) is close to 1 or 0, x must change a lot for y to change;
      ● If P(Y=1|x) is close to 0.5, that is not the case.
    – Hence: apply a logit transformation, i.e. model log[p / (1 − p)] as a linear function f(x) → logistic regression.
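A small sketch of the model this leads to: logit(p) = βᵀx + β₀, i.e. P(Y=1|x) = sigmoid(βᵀx + β₀). Coefficients below are made up for illustration.

import numpy as np

# Logistic model implied by the logit transform.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta, beta0):
    """P(Y=1|x) for each row of X, always between 0 and 1."""
    return sigmoid(X @ beta + beta0)

X = np.linspace(-5, 5, 5).reshape(-1, 1)
print(predict_proba(X, beta=np.array([2.0]), beta0=0.0))
# Probabilities saturate towards 0 and 1 at the extremes.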

  54. Maximum likelihood estimation of logistic regression coefficients
  ● Log likelihood for n observations: ?


  56. Maximum likelihood estimation of logistic regression coefficients
  ● Log likelihood for n observations: ℓ(β) = ∑_t [ y^t log p(x^t) + (1 − y^t) log(1 − p(x^t)) ], with p(x) = P(Y=1|x).

  57. Maximum likelihood estimation of logistic regression coefficients
  ● Gradient of the log likelihood: ?


  59. Maximum likelihood estimation of logistic regression coefficients
  ● Gradient of the log likelihood: ∇ℓ(β) = ∑_t (y^t − p(x^t)) x^t
  ● To maximize the likelihood:
    – set the gradient to 0;
    – this cannot be solved analytically;
    – −ℓ is convex, so we can use gradient descent (no local minima).
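A minimal gradient-descent sketch for the logistic regression MLE (toy data and a fixed step size chosen for illustration; the lab's own GradientDescentOptimizer / LogisticRegr classes may be organised differently).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    X1 = np.column_stack([np.ones(len(X)), X])  # absorb the intercept into X
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X1 @ beta)
        grad = X1.T @ (y - p)        # gradient of the log likelihood
        beta += lr * grad / len(y)   # ascend the log likelihood (= descend -l)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y))  # intercept near 0, coefficients of opposite signs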


  61. Summary
  ● MAP estimate: θ̂_MAP = argmax_θ p(θ | X)
  ● MLE: θ̂_MLE = argmax_θ p(X | θ)
  ● Bayes estimate: θ̂_Bayes = E[θ | X]
  ● Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the RSS.
  ● Linear regression MLE: β̂ = (XᵀX)⁻¹ Xᵀ y
  ● Logistic regression MLE: no closed form; solve with gradient descent.

  62. References
  ● A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
    – Least-squares regression: Chap. 7.6
  ● The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
    – Least-squares regression: Chap. 2.2.1, 3.1, 3.2.1
    – Gauss-Markov theorem: Chap. 3.2.3

  63. class GradientDescentOptimizer():

  64. class LeastSquaresRegr()

  65. class seq_LeastSquaresRegr()

  66. class seq_LeastSquaresRegr()

  67. class seq_LeastSquaresRegr()

  68. class LogisticRegr()

  69. class LogisticRegr()
