linear regression


  1. Linear regression DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17 Carlos Fernandez-Granda

  2. Linear models; Least-squares estimation; Overfitting; Example: Global warming

  3. Regression: The aim is to learn a function $h$ that relates a response or dependent variable $y$ to several observed variables $x_1, x_2, \ldots, x_p$, known as covariates, features or independent variables. The response is assumed to be of the form $y = h(\vec{x}) + z$, where $\vec{x} \in \mathbb{R}^p$ contains the features and $z$ is noise.

  4. Linear regression: The regression function $h$ is assumed to be linear: $y^{(i)} = \vec{x}^{(i)\,T} \vec{\beta}^* + z^{(i)}$, $1 \le i \le n$. Our aim is to estimate $\vec{\beta}^* \in \mathbb{R}^p$ from the data.

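  5. Linear regression: In matrix form,
     $$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} = \begin{bmatrix} \vec{x}^{(1)}_1 & \vec{x}^{(1)}_2 & \cdots & \vec{x}^{(1)}_p \\ \vec{x}^{(2)}_1 & \vec{x}^{(2)}_2 & \cdots & \vec{x}^{(2)}_p \\ \vdots & \vdots & & \vdots \\ \vec{x}^{(n)}_1 & \vec{x}^{(n)}_2 & \cdots & \vec{x}^{(n)}_p \end{bmatrix} \begin{bmatrix} \vec{\beta}^*_1 \\ \vec{\beta}^*_2 \\ \vdots \\ \vec{\beta}^*_p \end{bmatrix} + \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}$$
     Equivalently, $\vec{y} = X \vec{\beta}^* + \vec{z}$.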

  6. Linear model for GDP

     | State | GDP (millions) | Population | Unemployment rate (%) |
     |---|---|---|---|
     | North Dakota | 52 089 | 757 952 | 2.4 |
     | Alabama | 204 861 | 4 863 300 | 3.8 |
     | Mississippi | 107 680 | 2 988 726 | 5.2 |
     | Arkansas | 120 689 | 2 988 248 | 3.5 |
     | Kansas | 153 258 | 2 907 289 | 3.8 |
     | Georgia | 525 360 | 10 310 371 | 4.5 |
     | Iowa | 178 766 | 3 134 693 | 3.2 |
     | West Virginia | 73 374 | 1 831 102 | 5.1 |
     | Kentucky | 197 043 | 4 436 974 | 5.2 |
     | Tennessee | ??? | 6 651 194 | 3.0 |

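  7. Centering (rows follow the order of the nine states with known GDP, North Dakota to Kentucky)
     $$\vec{y}_{\text{cent}} = \begin{bmatrix} -127\,147 \\ 25\,625 \\ -71\,556 \\ -58\,547 \\ -25\,978 \\ 346\,124 \\ -470 \\ -105\,862 \\ 17\,807 \end{bmatrix}, \qquad X_{\text{cent}} = \begin{bmatrix} -3\,044\,121 & -1.7 \\ 1\,061\,227 & -0.28 \\ -813\,346 & 1.1 \\ -813\,825 & -0.58 \\ -894\,784 & -0.28 \\ 6\,508\,298 & 0.42 \\ -667\,379 & -0.88 \\ -1\,970\,971 & 1.0 \\ 634\,901 & 1.1 \end{bmatrix}$$
     $$\operatorname{av}(\vec{y}) = 179\,236, \qquad \operatorname{av}(X) = \begin{bmatrix} 3\,802\,073 & 4.1 \end{bmatrix}$$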

  8. Normalizing
     $$\vec{y}_{\text{norm}} = \begin{bmatrix} -0.321 \\ 0.065 \\ -0.180 \\ -0.148 \\ -0.065 \\ 0.872 \\ -0.001 \\ -0.267 \\ 0.045 \end{bmatrix}, \qquad X_{\text{norm}} = \begin{bmatrix} -0.394 & -0.600 \\ 0.137 & -0.099 \\ -0.105 & 0.401 \\ -0.105 & -0.207 \\ -0.116 & -0.099 \\ 0.843 & 0.151 \\ -0.086 & -0.314 \\ -0.255 & 0.366 \\ 0.082 & 0.401 \end{bmatrix}$$
     $$\operatorname{std}(\vec{y}) = 396\,701, \qquad \operatorname{std}(X) = \begin{bmatrix} 7\,720\,656 & 2.80 \end{bmatrix}$$
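The centering and normalizing steps can be sketched in NumPy as follows. This is a minimal illustration rather than the course's own code. It assumes that "std" on the slides means the Euclidean norm of each centered column, which reproduces the quoted values. The helper name `center_and_normalize` is hypothetical.

```python
import numpy as np

# Population and unemployment rate (%) for the nine states with known GDP.
X = np.array([
    [757_952, 2.4],     # North Dakota
    [4_863_300, 3.8],   # Alabama
    [2_988_726, 5.2],   # Mississippi
    [2_988_248, 3.5],   # Arkansas
    [2_907_289, 3.8],   # Kansas
    [10_310_371, 4.5],  # Georgia
    [3_134_693, 3.2],   # Iowa
    [1_831_102, 5.1],   # West Virginia
    [4_436_974, 5.2],   # Kentucky
])
# GDP in millions for the same states, in the same order.
y = np.array([52_089, 204_861, 107_680, 120_689, 153_258,
              525_360, 178_766, 73_374, 197_043], dtype=float)


def center_and_normalize(a):
    """Subtract the column averages, then divide each centered column by its l2 norm."""
    av = a.mean(axis=0)
    centered = a - av
    # Assumption: "std" is the Euclidean norm of the centered column.
    std = np.linalg.norm(centered, axis=0)
    return centered / std, av, std


X_norm, av_X, std_X = center_and_normalize(X)
y_norm, av_y, std_y = center_and_normalize(y)

print("av(y) =", round(float(av_y)), " std(y) =", round(float(std_y)))
print("av(X) =", av_X, " std(X) =", std_X)
```

With this convention every normalized column has zero average and unit Euclidean norm, so the coefficients fitted later are comparable across features measured in very different units.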

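  9. Linear model for GDP: Aim: find $\vec{\beta} \in \mathbb{R}^2$ such that $\vec{y}_{\text{norm}} \approx X_{\text{norm}} \vec{\beta}$. The estimate for the GDP of Tennessee will be $y^{\text{Ten}} = \operatorname{av}(\vec{y}) + \operatorname{std}(\vec{y}) \, \langle \vec{x}^{\text{Ten}}_{\text{norm}}, \vec{\beta} \rangle$, where $\vec{x}^{\text{Ten}}_{\text{norm}}$ is centered using $\operatorname{av}(X)$ and normalized using $\operatorname{std}(X)$.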

  10. Linear models; Least-squares estimation; Overfitting; Example: Global warming

  11. Least squares: For fixed $\vec{\beta}$ we can evaluate the error using $\sum_{i=1}^{n} \left( y^{(i)} - \vec{x}^{(i)\,T} \vec{\beta} \right)^2 = \left\| \vec{y} - X \vec{\beta} \right\|_2^2$. The least-squares estimate $\vec{\beta}_{\text{LS}}$ minimizes this cost function: $\vec{\beta}_{\text{LS}} := \arg\min_{\vec{\beta}} \left\| \vec{y} - X \vec{\beta} \right\|_2$.
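As an illustration of the cost function and its minimizer, the sketch below fits synthetic data (the sizes, coefficients and noise level are made up for the example) with `np.linalg.lstsq`. It also solves the normal equations $X^T X \vec{\beta} = X^T \vec{y}$, a standard characterization of the minimizer when $X$ has full column rank that is not stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration only: n examples, p features.
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)


def cost(beta):
    """Sum of squared residuals, i.e. the squared l2 norm of y - X beta."""
    return float(np.sum((y - X @ beta) ** 2))


# Least-squares estimate computed by NumPy's solver.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same estimate from the normal equations (valid because X has full column rank).
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print("beta_LS:", beta_ls)
print("cost at beta_LS:  ", cost(beta_ls))
print("cost at beta_true:", cost(beta_true))
```

The cost at the least-squares estimate is never larger than the cost at any other vector, including the coefficients that generated the data.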

  12. Least-squares fit [Figure: scatter plot of the data together with the least-squares fit; horizontal axis x, vertical axis y]

  13. Linear model for GDP: The least-squares estimate is $\vec{\beta}_{\text{LS}} = \begin{bmatrix} 1.019 \\ -0.111 \end{bmatrix}$. GDP is roughly proportional to the population. Unemployment has a negative (linear) effect.

  14. Linear model for GDP

     | State | GDP (millions) | Estimate (millions) |
     |---|---|---|
     | North Dakota | 52 089 | 46 241 |
     | Alabama | 204 861 | 239 165 |
     | Mississippi | 107 680 | 119 005 |
     | Arkansas | 120 689 | 145 712 |
     | Kansas | 153 258 | 136 756 |
     | Georgia | 525 360 | 513 343 |
     | Iowa | 178 766 | 158 097 |
     | West Virginia | 73 374 | 59 969 |
     | Kentucky | 197 043 | 194 829 |
     | Tennessee | 328 770 | 345 352 |
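The GDP example can be reproduced end to end with a short NumPy sketch: center and normalize, fit by least squares, then map the Tennessee prediction back to the original scale. As before, it assumes that "std" denotes the Euclidean norm of each centered column. The variable names are illustrative.

```python
import numpy as np

# Population and unemployment rate (%) for the nine training states.
X = np.array([
    [757_952, 2.4], [4_863_300, 3.8], [2_988_726, 5.2],
    [2_988_248, 3.5], [2_907_289, 3.8], [10_310_371, 4.5],
    [3_134_693, 3.2], [1_831_102, 5.1], [4_436_974, 5.2],
])
# GDP in millions, same order (North Dakota to Kentucky).
y = np.array([52_089, 204_861, 107_680, 120_689, 153_258,
              525_360, 178_766, 73_374, 197_043], dtype=float)
x_ten = np.array([6_651_194, 3.0])  # Tennessee features

# Center, then normalize by the l2 norm of each centered column (assumption).
av_X, av_y = X.mean(axis=0), y.mean()
std_X = np.linalg.norm(X - av_X, axis=0)
std_y = np.linalg.norm(y - av_y)
X_norm = (X - av_X) / std_X
y_norm = (y - av_y) / std_y

# Least-squares fit on the normalized data.
beta_ls, *_ = np.linalg.lstsq(X_norm, y_norm, rcond=None)

# Undo the normalization to obtain estimates in millions.
estimates = av_y + std_y * (X_norm @ beta_ls)
x_ten_norm = (x_ten - av_X) / std_X
y_ten = av_y + std_y * (x_ten_norm @ beta_ls)

print("beta_LS:", beta_ls)
print("Tennessee estimate (millions):", round(float(y_ten)))
```

The Tennessee features are centered and scaled with the averages and norms of the training states only, so the held-out state does not influence the fit.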

  15. Geometric interpretation: Any vector $X \vec{\beta}$ is in the span of the columns of $X$. The least-squares fit $X \vec{\beta}_{\text{LS}}$ is the closest vector to $\vec{y}$ that can be represented in this way. This is the projection of $\vec{y}$ onto the column space of $X$.
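The projection view can be checked numerically. In the sketch below, on synthetic data chosen only for illustration, the residual of the least-squares fit is orthogonal to every column of $X$, and the fitted vector matches the one obtained from the orthogonal-projection matrix $X (X^T X)^{-1} X^T$, assuming $X$ has full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example (assumed sizes): n = 20 observations, p = 3 features.
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_ls          # closest vector to y in the column space of X
residual = y - fitted

# Orthogonal projection onto the column space (X has full column rank here).
P = X @ np.linalg.solve(X.T @ X, X.T)

print("X^T residual:", X.T @ residual)                        # numerically zero
print("max |P y - X beta_LS|:", np.max(np.abs(P @ y - fitted)))
```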

  16. Geometric interpretation

  17. Probabilistic interpretation: We model the noise as an iid Gaussian random vector $\vec{Z}$. Entries have zero mean and variance $\sigma^2$. The data are a realization of the random vector $\vec{Y} := X \vec{\beta} + \vec{Z}$. $\vec{Y}$ is Gaussian with mean $X \vec{\beta}$ and covariance matrix $\sigma^2 I$.

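  18. Likelihood: The joint pdf of $\vec{Y}$ is
     $$f_{\vec{Y}}(\vec{a}) := \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( \vec{a}_i - (X \vec{\beta})_i \right)^2 \right) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| \vec{a} - X \vec{\beta} \right\|_2^2 \right)$$
     The likelihood is the joint pdf evaluated at the data $\vec{y}$ and viewed as a function of $\vec{\beta}$:
     $$\mathcal{L}_{\vec{y}}(\vec{\beta}) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| \vec{y} - X \vec{\beta} \right\|_2^2 \right)$$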

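  19. Maximum-likelihood estimate: The maximum-likelihood estimate is
     $$\vec{\beta}_{\text{ML}} = \arg\max_{\vec{\beta}} \mathcal{L}_{\vec{y}}(\vec{\beta}) = \arg\max_{\vec{\beta}} \log \mathcal{L}_{\vec{y}}(\vec{\beta}) = \arg\min_{\vec{\beta}} \left\| \vec{y} - X \vec{\beta} \right\|_2^2 = \vec{\beta}_{\text{LS}}$$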
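The step from the log-likelihood to the least-squares cost can be written out explicitly. The derivation below keeps a general noise variance $\sigma^2 > 0$. It relies only on the Gaussian pdf and on the fact that the logarithm is increasing.

```latex
\begin{align*}
\log \mathcal{L}_{\vec{y}}(\vec{\beta})
  &= \log \left[ \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n}
     \exp\left( -\frac{1}{2\sigma^2} \left\| \vec{y} - X\vec{\beta} \right\|_2^2 \right) \right] \\
  &= -\frac{n}{2} \log (2\pi) - n \log \sigma
     - \frac{1}{2\sigma^2} \left\| \vec{y} - X\vec{\beta} \right\|_2^2 .
\end{align*}
% The first two terms do not depend on \vec{\beta} and 1/(2\sigma^2) > 0, so
\begin{equation*}
\arg\max_{\vec{\beta}} \log \mathcal{L}_{\vec{y}}(\vec{\beta})
  = \arg\min_{\vec{\beta}} \left\| \vec{y} - X\vec{\beta} \right\|_2^2 .
\end{equation*}
```

The value of $\sigma$ therefore does not affect the estimate: maximum likelihood and least squares coincide for any noise variance.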

  20. Linear models; Least-squares estimation; Overfitting; Example: Global warming

  21. Temperature predictor A friend tells you: I found a cool way to predict the temperature in New York: It’s just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it’s perfect!

  22. Overfitting: If a model is very complex, it may overfit the data. To evaluate a model we separate the data into a training and a test set: (1) we fit the model using the training set; (2) we evaluate the error on the test set.

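  23. Experiment: $X_{\text{train}}$, $X_{\text{test}}$, $\vec{z}_{\text{train}}$ and $\vec{\beta}^*$ are iid Gaussian with mean 0 and variance 1. The data are $\vec{y}_{\text{train}} = X_{\text{train}} \vec{\beta}^* + \vec{z}_{\text{train}}$ and $\vec{y}_{\text{test}} = X_{\text{test}} \vec{\beta}^*$. We use $\vec{y}_{\text{train}}$ and $X_{\text{train}}$ to compute $\vec{\beta}_{\text{LS}}$.
     $$\text{error}_{\text{train}} = \frac{\left\| X_{\text{train}} \vec{\beta}_{\text{LS}} - \vec{y}_{\text{train}} \right\|_2}{\left\| \vec{y}_{\text{train}} \right\|_2}, \qquad \text{error}_{\text{test}} = \frac{\left\| X_{\text{test}} \vec{\beta}_{\text{LS}} - \vec{y}_{\text{test}} \right\|_2}{\left\| \vec{y}_{\text{test}} \right\|_2}$$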
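A minimal simulation of this experiment might look as follows. The slide does not give the number of features or the size of the test set, so `p = 50` and `n_test = 1000` are assumptions, as is the function name `relative_errors`.

```python
import numpy as np


def relative_errors(n, p=50, n_test=1000, seed=0):
    """Return (error_train, error_test) for n training examples and p features."""
    rng = np.random.default_rng(seed)
    beta_star = rng.standard_normal(p)
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n_test, p))
    z_train = rng.standard_normal(n)

    y_train = X_train @ beta_star + z_train   # noisy training responses
    y_test = X_test @ beta_star               # test responses as defined on the slide

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    error_train = (np.linalg.norm(X_train @ beta_ls - y_train)
                   / np.linalg.norm(y_train))
    error_test = (np.linalg.norm(X_test @ beta_ls - y_test)
                  / np.linalg.norm(y_test))
    return error_train, error_test


results = {n: relative_errors(n) for n in (50, 100, 200, 300, 400, 500)}
for n, (err_train, err_test) in results.items():
    print(f"n = {n:3d}  train = {err_train:.3f}  test = {err_test:.3f}")
```

When `n` equals `p` the model has enough freedom to fit the noisy training data exactly, so the training error is essentially zero while the test error is large. As `n` grows, the training error rises and the test error falls.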

  24. Experiment [Figure: relative error (l2 norm) versus n, for n between 50 and 500; curves: error (training), error (test), noise level (training)]

  25. Linear models; Least-squares estimation; Overfitting; Example: Global warming

  26. Maximum temperatures in Oxford, UK [Figure: temperature (Celsius) versus year, 1860 to 2000]

  27. Maximum temperatures in Oxford, UK [Figure: temperature (Celsius) versus year, 1900 to 1905]

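  28. Linear model: $\vec{y}_t \approx \vec{\beta}_0 + \vec{\beta}_1 \cos\left( \frac{2\pi t}{12} \right) + \vec{\beta}_2 \sin\left( \frac{2\pi t}{12} \right) + \vec{\beta}_3 \, t$, where $1 \le t \le n$ is the time in months ($n = 12 \cdot 150$).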
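The model is linear in the coefficients, so it can be fitted by least squares once the four features (constant, cosine, sine, time) are stacked into a design matrix. The sketch below uses synthetic temperatures, because the Oxford measurements are not included here. The coefficients in `beta_true` and the noise level are made up for illustration, and `design_matrix` is a hypothetical helper.

```python
import numpy as np


def design_matrix(t):
    """Columns: intercept, yearly cosine, yearly sine, linear trend (t in months)."""
    return np.column_stack([
        np.ones_like(t),
        np.cos(2 * np.pi * t / 12),
        np.sin(2 * np.pi * t / 12),
        t,
    ])


n = 12 * 150                          # 150 years of monthly data
t = np.arange(1, n + 1, dtype=float)  # time in months

# Hypothetical coefficients; the last one is a trend of 0.75 degrees C per century.
beta_true = np.array([13.0, -6.0, -3.0, 0.75 / 1200])
rng = np.random.default_rng(0)
y = design_matrix(t) @ beta_true + rng.standard_normal(n)  # synthetic temperatures

beta_hat, *_ = np.linalg.lstsq(design_matrix(t), y, rcond=None)

# Convert the slope from degrees per month to degrees per 100 years.
trend_per_century = beta_hat[3] * 12 * 100
print("fitted coefficients:", beta_hat)
print("trend (degrees C per 100 years):", trend_per_century)
```

The cosine and sine terms capture the yearly cycle with arbitrary phase, while the coefficient of `t` isolates the long-term trend.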

  29. Model fitted by least squares [Figure: data and model, temperature (Celsius) versus year, 1860 to 2000]

  30. Model fitted by least squares [Figure: data and model, temperature (Celsius) versus year, 1900 to 1905]

  31. Model fitted by least squares [Figure: data and model, temperature (Celsius) versus year, 1960 to 1965]

  32. Trend: Increase of 0.75 °C / 100 years (1.35 °F) [Figure: data and trend, temperature (Celsius) versus year, 1860 to 2000]

  33. Model for minimum temperatures [Figure: data and model, temperature (Celsius) versus year, 1860 to 2000]

  34. Model for minimum temperatures [Figure: data and model, temperature (Celsius) versus year, 1900 to 1905]

  35. Model for minimum temperatures [Figure: data and model, temperature (Celsius) versus year, 1960 to 1965]

  36. Trend: Increase of 0.88 °C / 100 years (1.58 °F) [Figure: data and trend, temperature (Celsius) versus year, 1860 to 2000]
