Linear Classifiers and Regressors

  1. Linear Classifiers and Regressors “Borrowed” with permission from Andrew Moore (CMU)

  2. Single-Parameter Linear Regression

  3. Regression vs Classification: a classifier takes input attributes and predicts a categorical output; a regressor takes input attributes and predicts a real-valued output; a density estimator takes input attributes and outputs a probability.

  4. Linear Regression. DATASET (inputs → outputs): x₁ = 1, y₁ = 1; x₂ = 3, y₂ = 2.2; x₃ = 2, y₃ = 2; x₄ = 1.5, y₄ = 1.9; x₅ = 4, y₅ = 3.1. Linear regression assumes that the expected value of the output y given an input x, E[y|x], is linear in x. Simplest case: Out(x) = w·x for some unknown w. Challenge: given the dataset, estimate w.

  5. 1-Parameter Linear Regression. Assume the data are generated by y_i = w·x_i + noise_i, where the noise terms are independent and normally distributed with mean 0 and unknown variance σ². Then P(y | w, x) is a normal distribution with mean w·x and variance σ².
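
As an aside not on the original slides, the sketch below simply simulates data from this assumed model so the later estimators have something concrete to recover; w_true and sigma are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = 0.7   # hypothetical "unknown" slope, for illustration only
sigma = 0.5    # standard deviation of the additive Gaussian noise

x = rng.uniform(0.0, 4.0, size=50)                 # inputs
y = w_true * x + rng.normal(0.0, sigma, size=50)   # y_i = w * x_i + noise_i
```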

  6. Bayesian Linear Regression. P(y | w, x) = Normal(mean w·x, variance σ²). The datapoints (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) are EVIDENCE about w. We want to infer w from the data: P(w | x₁, …, xₙ, y₁, …, yₙ). Should we use Bayes' rule to work out a posterior distribution for w given the data, or use maximum likelihood estimation?

  7. Maximum Likelihood Estimation of w. Question: for what value of w is this data most likely to have happened? Equivalently, what value of w maximizes $P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n, w) = \prod_{i=1}^{n} P(y_i \mid w, x_i)$?

  8. The maximum likelihood estimate is
$$w^* = \arg\max_w \prod_{i=1}^{n} P(y_i \mid w, x_i)
      = \arg\max_w \prod_{i=1}^{n} \exp\!\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2\right)
      = \arg\max_w \sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2
      = \arg\min_w \sum_{i=1}^{n} (y_i - w x_i)^2.$$

  9. Linear Regression. The maximum likelihood w minimizes E(w), the sum of squares of the residuals:
$$E(w) = \sum_i (y_i - w x_i)^2 = \sum_i y_i^2 - 2w \sum_i x_i y_i + w^2 \sum_i x_i^2,$$
so we need to minimize a quadratic function of w.

  10. Linear Regression. The sum of squares is minimized when
$$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.$$
The maximum likelihood model is Out(x) = w·x, which can be used for prediction. Note: Bayesian statistics would instead provide a probability distribution over w, and predictions would then give a probability distribution over the expected output; it is often useful to know your confidence, and maximum likelihood also provides a kind of confidence.
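
A minimal sketch (not from the deck) of this closed-form estimate, applied to the five-point dataset from slide 4:

```python
import numpy as np

# Dataset from slide 4.
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum likelihood slope for the no-intercept model Out(x) = w * x.
w = np.sum(x * y) / np.sum(x ** 2)
print(w)        # estimated slope
print(w * 2.5)  # prediction Out(2.5)
```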

  11. Multi-variate Linear Regression

  12. Multivariate Regression. What if the inputs are vectors? (Figure: a scatter of points in a 2-d input plane, each point labelled with its real-valued output, its “height”.) The dataset has the form (x₁, y₁), (x₂, y₂), (x₃, y₃), …, (x_R, y_R), where each x_k is an input vector and each y_k is a real-valued output.

  13. Multivariate Regression. With R datapoints, each input having m components, write the data as matrices:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}.$$
The linear regression model assumes there exists a vector w such that Out(x) = wᵀx = w₁x[1] + w₂x[2] + … + w_m x[m]. The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY). IMPORTANT EXERCISE: prove it!
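
An illustrative sketch (not from the deck) of the normal-equation solution on made-up data; np.linalg.lstsq is the numerically safer equivalent and is shown alongside the direct formula:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: R = 50 datapoints, m = 3 input components.
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])             # made-up "true" weights
Y = X @ w_true + rng.normal(0.0, 0.1, size=50)  # outputs with a little noise

# Maximum likelihood weights via the normal equations: w = (X^T X)^{-1} (X^T Y).
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically preferable equivalent.
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_normal_eq, w_lstsq)
```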

  14. Multivariate Regression (cont'd). The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY), where XᵀX is the m × m matrix whose (i, j)'th element is $\sum_{k=1}^{R} x_{ki} x_{kj}$, and XᵀY is the m-element vector whose i'th element is $\sum_{k=1}^{R} x_{ki} y_k$.

  15. Constant Term in Linear Regression

  16. What about a constant term? What if the linear data does not go through the origin (0, 0, …, 0)? Statisticians and neural net folks all agree on a simple, obvious hack. Can you guess it?

  17. The Constant Term. Trick: create a fake input X₀ that always takes the value 1. Before, with columns (X₁, X₂, Y), the data are (2, 4, 16), (3, 4, 17), (5, 5, 20), and Y = w₁X₁ + w₂X₂ is a poor model. After, with columns (X₀, X₁, X₂, Y), the data are (1, 2, 4, 16), (1, 3, 4, 17), (1, 5, 5, 20), and Y = w₀X₀ + w₁X₁ + w₂X₂ = w₀ + w₁X₁ + w₂X₂ is a good model! Here you should be able to see the MLE w₀, w₁, w₂ by inspection.
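
A minimal sketch of the trick on the slide's three-row table (the lstsq call and printed check are additions, not from the deck); by inspection the fit should be w₀ = 10, w₁ = 1, w₂ = 1:

```python
import numpy as np

# The slide's tiny dataset: columns X1, X2 and output Y.
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# The trick: prepend a fake input X0 that is always 1.
X_aug = np.column_stack([np.ones(len(X)), X])

# Fit Y = w0 + w1*X1 + w2*X2.
w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
print(w)   # expect approximately [10, 1, 1]
```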

  18. Heteroscedasticity… Linear Regression with varying noise

  19. Regression with Varying Noise. Suppose you know the variance of the noise that was added to each datapoint. Example data (xᵢ, yᵢ, σᵢ²): (½, ½, 4), (1, 1, 1), (2, 1, ¼), (2, 3, 4), (3, 2, ¼). Assume yᵢ ~ N(w·xᵢ, σᵢ²). What's the MLE estimate of w?

  20. MLE Estimation with Varying Noise. Assuming i.i.d. data, plugging in the Gaussian and simplifying,
$$\arg\max_w \log p(y_1, \ldots, y_R \mid x_1, \ldots, x_R, \sigma_1^2, \ldots, \sigma_R^2, w)
 = \arg\min_w \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}.$$
Setting dLL/dw equal to zero gives the w such that $\sum_{i=1}^{R} \frac{x_i (y_i - w x_i)}{\sigma_i^2} = 0$, and trivial algebra yields
$$w = \frac{\sum_{i=1}^{R} x_i y_i / \sigma_i^2}{\sum_{i=1}^{R} x_i^2 / \sigma_i^2}.$$
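
A minimal sketch (not from the deck) of this closed-form weighted estimate, applied to the slide-19 data:

```python
import numpy as np

# Data from slide 19: inputs, outputs, and known noise variances sigma_i^2.
x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])

# MLE under y_i ~ N(w * x_i, sigma_i^2).
w = np.sum(x * y / var) / np.sum(x ** 2 / var)
print(w)
```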

  21. This Is Weighted Regression. We want to minimize the weighted sum of squares, $\arg\min_w \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$, where the weight on the i'th datapoint is 1/σᵢ².

  22. Weighted Multivariate Regression. The maximum likelihood w is the weighted least-squares solution w = (XᵀWX)⁻¹(XᵀWY), where W is the diagonal matrix of weights 1/σₖ². Here XᵀWX is an m × m matrix whose (i, j)'th element is $\sum_{k=1}^{R} \frac{x_{ki} x_{kj}}{\sigma_k^2}$, and XᵀWY is an m-element vector whose i'th element is $\sum_{k=1}^{R} \frac{x_{ki} y_k}{\sigma_k^2}$.
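
An illustrative sketch of this formula on made-up data (the data shapes, noise variances, and names are assumptions, not from the deck):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: R = 40 datapoints, m = 2 components, known per-point noise variances.
X = rng.normal(size=(40, 2))
w_true = np.array([2.0, -1.0])
var = rng.uniform(0.05, 2.0, size=40)          # sigma_k^2 for each datapoint
Y = X @ w_true + rng.normal(0.0, np.sqrt(var))

# Weighted least squares: w = (X^T W X)^{-1} (X^T W Y), with W = diag(1 / sigma_k^2).
W = np.diag(1.0 / var)
w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
print(w)
```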

  23. Non-linear Regression (Digression…)

  24. Non-linear Regression. Suppose y is related to a function of x such that the predicted values depend non-linearly on w. Example data (xᵢ, yᵢ): (½, ½), (1, 2.5), (2, 3), (3, 2), (3, 3). Assume yᵢ ~ N(√(w + xᵢ), σ²). What's the MLE estimate of w?

  25. Non-linear MLE Estimation. Assuming i.i.d. data, plugging in the Gaussian and simplifying,
$$\arg\max_w \log p(y_1, \ldots, y_R \mid x_1, \ldots, x_R, \sigma, w)
 = \arg\min_w \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^2.$$
Setting dLL/dw equal to zero gives the w such that $\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$. We're down the algebraic toilet, so guess what we do? The common (but not only) approach is a numerical solution: line search, simulated annealing, gradient descent, conjugate gradient, Levenberg-Marquardt, Newton's method. There are also special-purpose statistical-optimization tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).
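
To make the "numerical solution" option concrete, here is a minimal gradient-descent sketch for this particular model on the slide-24 data; the starting point, learning rate, and iteration count are arbitrary choices, not from the deck:

```python
import numpy as np

# Data from slide 24.
x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):
    """Sum of squared residuals for the model y ~ sqrt(w + x)."""
    return np.sum((y - np.sqrt(w + x)) ** 2)

def grad(w):
    """d(SSE)/dw = -sum((y_i - sqrt(w + x_i)) / sqrt(w + x_i))."""
    r = np.sqrt(w + x)
    return -np.sum((y - r) / r)

w, eta = 1.0, 0.05          # arbitrary start and learning rate
for _ in range(1000):
    w -= eta * grad(w)
print(w, sse(w))
```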

  26. Gradient Descent. Goal: find a local minimum of f: ℝ → ℝ. Approach: (1) start with some value for w; (2) apply the gradient descent update w ← w − η ∂f(w)/∂w; (3) iterate… until bored… Here η is the LEARNING RATE, a small positive number, e.g. η = 0.05 (a good default value for anything!). QUESTION: justify the gradient descent rule.
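
A minimal sketch of the rule on a toy function (the function f(w) = (w − 3)², whose minimum is at w = 3, is made up for illustration):

```python
def df(w):
    # Derivative of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w = 0.0      # start with some value for w
eta = 0.05   # learning rate suggested on the slide
for _ in range(200):
    w = w - eta * df(w)   # gradient descent update
print(w)     # close to 3.0
```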

  27. Gradient Descent in m Dimensions. Given f(w): ℝᵐ → ℝ, the gradient $\nabla f(w) = \left(\frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_m}\right)^{T}$ points in the direction of steepest ascent, and |∇f(w)| is the gradient in that direction. GRADIENT DESCENT RULE: w ← w − η∇f(w); equivalently, w_j ← w_j − η ∂f(w)/∂w_j, where w_j is the j'th weight. “Just like a linear feedback system.”

  28. Linear Perceptron

  29. Linear Perceptrons. Multivariate linear models: Out(x) = wᵀx. “Training” means minimizing the sum of squared residuals,
$$E = \sum_k \left(\mathrm{Out}(x_k) - y_k\right)^2 = \sum_k \left(w^T x_k - y_k\right)^2,$$
by gradient descent, which gives the perceptron training rule.

  30. Linear Perceptron Training Rule. Gradient descent to minimize E updates each weight as w_j ← w_j − η ∂E/∂w_j. Writing δₖ = yₖ − wᵀxₖ, the gradient is
$$\frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j} \sum_{k=1}^{R} (y_k - w^T x_k)^2
 = \sum_{k=1}^{R} 2\,(y_k - w^T x_k)\,\frac{\partial}{\partial w_j}(y_k - w^T x_k)
 = -2 \sum_{k=1}^{R} \delta_k \frac{\partial}{\partial w_j} \sum_{i=1}^{m} w_i x_{ki}
 = -2 \sum_{k=1}^{R} \delta_k x_{kj}.$$
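
A minimal sketch of the resulting batch update on made-up data (the dataset, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: R = 100 datapoints, m = 3 input components.
X = rng.normal(size=(100, 3))
w_true = np.array([0.5, -1.5, 2.0])
Y = X @ w_true + rng.normal(0.0, 0.1, size=100)

w = np.zeros(3)   # start with some value for w
eta = 0.001       # learning rate
for _ in range(2000):
    delta = Y - X @ w            # delta_k = y_k - w^T x_k
    grad = -2.0 * X.T @ delta    # dE/dw_j = -2 * sum_k delta_k * x_kj
    w = w - eta * grad           # perceptron training rule
print(w)          # approaches the least-squares solution
```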
