Linear regression
DS-GA 1002: Probability and Statistics for Data Science
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17
Carlos Fernandez-Granda

◮ Linear models
◮ Least-squares estimation
◮ Overfitting
◮ Example: Global warming

Regression
The aim is to learn a function $h$ that relates
◮ a response or dependent variable $y$
◮ to several observed variables $x_1, x_2, \ldots, x_p$, known as covariates, features or independent variables
The response is assumed to be of the form
$$y = h(\vec{x}) + z$$
where $\vec{x} \in \mathbb{R}^p$ contains the features and $z$ is noise

Linear regression
The regression function $h$ is assumed to be linear
$$y^{(i)} = \vec{x}^{(i)\,T} \vec{\beta}^* + z^{(i)}, \qquad 1 \le i \le n$$
Our aim is to estimate $\vec{\beta}^* \in \mathbb{R}^p$ from the data

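Linear regression
In matrix form
$$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \cdots \\ y^{(n)} \end{bmatrix} = \begin{bmatrix} \vec{x}^{(1)}_1 & \vec{x}^{(1)}_2 & \cdots & \vec{x}^{(1)}_p \\ \vec{x}^{(2)}_1 & \vec{x}^{(2)}_2 & \cdots & \vec{x}^{(2)}_p \\ \cdots & \cdots & \cdots & \cdots \\ \vec{x}^{(n)}_1 & \vec{x}^{(n)}_2 & \cdots & \vec{x}^{(n)}_p \end{bmatrix} \begin{bmatrix} \vec{\beta}^*_1 \\ \vec{\beta}^*_2 \\ \cdots \\ \vec{\beta}^*_p \end{bmatrix} + \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \cdots \\ z^{(n)} \end{bmatrix}$$
Equivalently,
$$\vec{y} = X \vec{\beta}^* + \vec{z}$$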
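A minimal NumPy sketch of the matrix form. The sizes (n = 5, p = 3), the coefficient vector and the noise scale are made up for illustration and do not come from the slides. It checks that stacking the feature vectors as the rows of X reproduces the per-example equation.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility
n, p = 5, 3                      # illustrative sizes, not from the slides

X = rng.standard_normal((n, p))          # row i holds the feature vector x^(i)
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical coefficient vector
z = 0.1 * rng.standard_normal(n)         # noise

# Matrix form: y = X beta* + z
y = X @ beta_true + z

# Per-example form: y^(i) = <x^(i), beta*> + z^(i)
y_rowwise = np.array([X[i] @ beta_true + z[i] for i in range(n)])

print(np.allclose(y, y_rowwise))
```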

Linear model for GDP
State            GDP (millions)    Population    Unemployment rate (%)
North Dakota             52 089       757 952    2.4
Alabama                 204 861     4 863 300    3.8
Mississippi             107 680     2 988 726    5.2
Arkansas                120 689     2 988 248    3.5
Kansas                  153 258     2 907 289    3.8
Georgia                 525 360    10 310 371    4.5
Iowa                    178 766     3 134 693    3.2
West Virginia            73 374     1 831 102    5.1
Kentucky                197 043     4 436 974    5.2
Tennessee                   ???     6 651 194    3.0

Centering
$$\vec{y}_{\text{cent}} = \begin{bmatrix} -127\,147 \\ 25\,625 \\ -71\,556 \\ -58\,547 \\ -25\,978 \\ 346\,124 \\ -470 \\ -105\,862 \\ 17\,807 \end{bmatrix} \qquad X_{\text{cent}} = \begin{bmatrix} -3\,044\,121 & -1.7 \\ 1\,061\,227 & -2.8 \cdot 10^{-1} \\ -813\,346 & 1.1 \\ -813\,825 & -5.8 \cdot 10^{-1} \\ -894\,784 & -2.8 \cdot 10^{-1} \\ 6\,508\,298 & 4.2 \cdot 10^{-1} \\ -667\,379 & -8.8 \cdot 10^{-1} \\ -1\,970\,971 & 1.0 \\ 634\,901 & 1.1 \end{bmatrix}$$
$$\operatorname{av}(\vec{y}) = 179\,236 \qquad \operatorname{av}(X) = \begin{bmatrix} 3\,802\,073 & 4.1 \end{bmatrix}$$

Normalizing
$$\vec{y}_{\text{norm}} = \begin{bmatrix} -0.321 \\ 0.065 \\ -0.180 \\ -0.148 \\ -0.065 \\ 0.872 \\ -0.001 \\ -0.267 \\ 0.045 \end{bmatrix} \qquad X_{\text{norm}} = \begin{bmatrix} -0.394 & -0.600 \\ 0.137 & -0.099 \\ -0.105 & 0.401 \\ -0.105 & -0.207 \\ -0.116 & -0.099 \\ 0.843 & 0.151 \\ -0.086 & -0.314 \\ -0.255 & 0.366 \\ 0.082 & 0.401 \end{bmatrix}$$
$$\operatorname{std}(\vec{y}) = 396\,701 \qquad \operatorname{std}(X) = \begin{bmatrix} 7\,720\,656 & 2.80 \end{bmatrix}$$
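The centering and normalizing steps can be sketched in NumPy using the nine training states from the GDP table. One assumption is worth flagging: the "std" values quoted on the slide appear to equal the Euclidean norm of each centered column rather than the usual 1/n or 1/(n-1) standard deviation, so the sketch uses the norm.

```python
import numpy as np

# GDP (millions), then [population, unemployment rate] for the nine training
# states (North Dakota through Kentucky), copied from the table in the slides.
y = np.array([52089, 204861, 107680, 120689, 153258,
              525360, 178766, 73374, 197043], dtype=float)
X = np.array([[757952, 2.4], [4863300, 3.8], [2988726, 5.2],
              [2988248, 3.5], [2907289, 3.8], [10310371, 4.5],
              [3134693, 3.2], [1831102, 5.1], [4436974, 5.2]])

# Centering: subtract the average of the response and of each feature column.
av_y, av_X = y.mean(), X.mean(axis=0)
y_cent, X_cent = y - av_y, X - av_X

# Normalizing: divide by the Euclidean norm of each centered column
# (assumption: this is what "std" denotes on the slide).
std_y = np.linalg.norm(y_cent)
std_X = np.linalg.norm(X_cent, axis=0)
y_norm, X_norm = y_cent / std_y, X_cent / std_X

print(av_y, av_X)
print(std_y, std_X)
```

With this convention every normalized column has unit Euclidean norm, which makes the fitted coefficients comparable across features measured in very different units.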

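Linear model for GDP
Aim: find $\vec{\beta} \in \mathbb{R}^2$ such that $\vec{y}_{\text{norm}} \approx X_{\text{norm}} \vec{\beta}$
The estimate for the GDP of Tennessee will be
$$y^{\text{Ten}} = \operatorname{av}(\vec{y}) + \operatorname{std}(\vec{y}) \left\langle \vec{x}^{\text{Ten}}_{\text{norm}}, \vec{\beta} \right\rangle$$
where $\vec{x}^{\text{Ten}}_{\text{norm}}$ is centered using $\operatorname{av}(X)$ and normalized using $\operatorname{std}(X)$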

Least squares
For fixed $\vec{\beta}$ we can evaluate the error using
$$\sum_{i=1}^{n} \left( y^{(i)} - \vec{x}^{(i)\,T} \vec{\beta} \right)^2 = \left\| \vec{y} - X\vec{\beta} \right\|_2^2$$
The least-squares estimate $\vec{\beta}_{\text{LS}}$ minimizes this cost function
$$\vec{\beta}_{\text{LS}} := \arg\min_{\vec{\beta}} \left\| \vec{y} - X\vec{\beta} \right\|_2$$
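A minimal sketch of computing the least-squares estimate on synthetic data (the sizes, coefficients and noise level are hypothetical). It uses `np.linalg.lstsq` and, as a cross-check, the normal equations $X^T X \vec{\beta} = X^T \vec{y}$. The normal equations are a standard characterization of the minimizer when X has full column rank; they are not stated on the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3                               # illustrative sizes
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.5])      # hypothetical ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def cost(beta):
    """Least-squares cost ||y - X beta||_2^2."""
    r = y - X @ beta
    return r @ r

# Least-squares estimate from NumPy's SVD-based solver.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same estimate from the normal equations (requires full column rank).
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Moving away from the minimizer can only increase the cost.
beta_other = beta_ls + 0.05 * rng.standard_normal(p)
print(cost(beta_ls), cost(beta_other))
```

In practice `lstsq` is usually preferred over forming $X^T X$ explicitly, since the latter squares the condition number of the problem.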

Least-squares fit
[Figure: plot of y versus x; legend: Data, Least-squares fit]

Linear model for GDP
The least-squares estimate is
$$\vec{\beta}_{\text{LS}} = \begin{bmatrix} 1.019 \\ -0.111 \end{bmatrix}$$
GDP roughly proportional to the population
Unemployment has a negative (linear) effect

Linear model for GDP
State            GDP        Estimate
North Dakota      52 089      46 241
Alabama          204 861     239 165
Mississippi      107 680     119 005
Arkansas         120 689     145 712
Kansas           153 258     136 756
Georgia          525 360     513 343
Iowa             178 766     158 097
West Virginia     73 374      59 969
Kentucky         197 043     194 829
Tennessee        328 770     345 352
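The whole GDP example can be sketched end to end in NumPy: center, normalize, fit by least squares, then map the prediction back to the original scale. The helper name `estimate_gdp` is hypothetical, and the sketch assumes that "std" on the slides means the Euclidean norm of the centered column, which is what the quoted numbers suggest.

```python
import numpy as np

# Training data from the slides: GDP (millions), [population, unemployment rate].
y = np.array([52089, 204861, 107680, 120689, 153258,
              525360, 178766, 73374, 197043], dtype=float)
X = np.array([[757952, 2.4], [4863300, 3.8], [2988726, 5.2],
              [2988248, 3.5], [2907289, 3.8], [10310371, 4.5],
              [3134693, 3.2], [1831102, 5.1], [4436974, 5.2]])
x_ten = np.array([6651194, 3.0])        # Tennessee features

# Centering and normalizing (norm-based "std" is an assumption).
av_y, av_X = y.mean(), X.mean(axis=0)
std_y = np.linalg.norm(y - av_y)
std_X = np.linalg.norm(X - av_X, axis=0)
y_norm = (y - av_y) / std_y
X_norm = (X - av_X) / std_X

# Least-squares fit on the normalized data.
beta_ls, *_ = np.linalg.lstsq(X_norm, y_norm, rcond=None)

def estimate_gdp(x):
    """Hypothetical helper: av(y) + std(y) * <x_norm, beta_LS>."""
    x_norm = (x - av_X) / std_X
    return av_y + std_y * (x_norm @ beta_ls)

print(beta_ls)               # compare with the estimate quoted on the slides
print(estimate_gdp(x_ten))   # compare with the Tennessee row of the table
```

Because the helper broadcasts over rows, `estimate_gdp(X)` returns the fitted values for all nine training states at once.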

Geometric interpretation
◮ Any vector $X\vec{\beta}$ is in the span of the columns of $X$
◮ The least-squares estimate is the closest vector to $\vec{y}$ that can be represented in this way
◮ This is the projection of $\vec{y}$ onto the column space of $X$
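The projection property can be checked numerically. This is a sketch on random data with arbitrary sizes: the fitted vector $X\vec{\beta}_{\text{LS}}$ should coincide with the orthogonal projection of $\vec{y}$ onto the column space of $X$, and the residual should be orthogonal to every column. The projection is computed here from a reduced QR factorization, which is one of several equivalent ways to obtain it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3                        # illustrative sizes
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_ls                 # least-squares approximation of y

# Orthogonal projection onto the column space of X,
# using an orthonormal basis Q from the reduced QR factorization.
Q, _ = np.linalg.qr(X)              # Q has shape (n, p)
y_proj = Q @ (Q.T @ y)

residual = y - y_hat
print(np.allclose(y_hat, y_proj))
print(np.abs(X.T @ residual).max())  # numerically zero: residual is orthogonal to the columns
```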

Probabilistic interpretation
We model the noise as an iid Gaussian random vector $\vec{Z}$
Entries have zero mean and variance $\sigma^2$
The data are a realization of the random vector
$$\vec{Y} := X\vec{\beta} + \vec{Z}$$
$\vec{Y}$ is Gaussian with mean $X\vec{\beta}$ and covariance matrix $\sigma^2 I$
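A small simulation sketch of this model, with made-up values for n, p, sigma and the coefficients. It draws many realizations of $\vec{Y}$ and compares the empirical mean and covariance with $X\vec{\beta}$ and $\sigma^2 I$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 4, 2, 0.5          # illustrative values, not from the slides
X = rng.standard_normal((n, p))
beta = np.array([1.0, -1.0])     # hypothetical coefficients

# Draw many realizations of Y = X beta + Z with Z ~ N(0, sigma^2 I).
num_samples = 200_000
Z = sigma * rng.standard_normal((num_samples, n))
Y = X @ beta + Z                 # broadcasting adds the mean to every row

emp_mean = Y.mean(axis=0)
emp_cov = np.cov(Y, rowvar=False)

# Both deviations should be small and shrink as num_samples grows.
print(np.abs(emp_mean - X @ beta).max())
print(np.abs(emp_cov - sigma**2 * np.eye(n)).max())
```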

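Likelihood
The joint pdf of $\vec{Y}$ is
$$f_{\vec{Y}}(\vec{a}) := \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( \vec{a}_i - (X\vec{\beta})_i \right)^2 \right) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| \vec{a} - X\vec{\beta} \right\|_2^2 \right)$$
The likelihood is
$$\mathcal{L}_{\vec{y}}(\vec{\beta}) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| \vec{y} - X\vec{\beta} \right\|_2^2 \right)$$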

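Maximum-likelihood estimate
The maximum-likelihood estimate is
$$\vec{\beta}_{\text{ML}} = \arg\max_{\vec{\beta}} \mathcal{L}_{\vec{y}}(\vec{\beta}) = \arg\max_{\vec{\beta}} \log \mathcal{L}_{\vec{y}}(\vec{\beta}) = \arg\min_{\vec{\beta}} \left\| \vec{y} - X\vec{\beta} \right\|_2^2 = \vec{\beta}_{\text{LS}}$$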
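The step from maximizing the likelihood to minimizing the squared error can be spelled out by taking the logarithm of the Gaussian likelihood. This sketch assumes the noise variance $\sigma^2$ is known and fixed, so that it only contributes a constant and a positive scale factor:

```latex
\begin{align*}
\log \mathcal{L}_{\vec{y}}(\vec{\beta})
  &= -\frac{n}{2}\log(2\pi) - n\log\sigma
     - \frac{1}{2\sigma^2}\left\| \vec{y} - X\vec{\beta} \right\|_2^2
\end{align*}
```

The first two terms do not depend on $\vec{\beta}$ and the factor $1/(2\sigma^2)$ is positive, so maximizing the log-likelihood over $\vec{\beta}$ is the same as minimizing $\| \vec{y} - X\vec{\beta} \|_2^2$. The logarithm is increasing, which is why it does not change the maximizer.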

Temperature predictor
A friend tells you:
"I found a cool way to predict the temperature in New York: It’s just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it’s perfect!"
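Why the friend's fit is "perfect" can be illustrated with a sketch that uses pure noise in place of real temperatures. The numbers are assumptions: 45 days for "a month and a half" and 49 features for the other states. With at least as many coefficients as data points, least squares can typically reproduce the training data exactly even when there is nothing to learn.

```python
import numpy as np

rng = np.random.default_rng(4)
n_days, n_states = 45, 49        # assumed: ~1.5 months of days, 49 other states

# Stand-in data: independent noise, so the features carry no information.
X = rng.standard_normal((n_days, n_states))
y = rng.standard_normal(n_days)

# With more coefficients than data points, lstsq returns the minimum-norm
# solution, which interpolates the training data.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.linalg.norm(X @ beta - y) / np.linalg.norm(y)

# Fresh noise from the same distribution plays the role of new days.
X_new = rng.standard_normal((n_days, n_states))
y_new = rng.standard_normal(n_days)
new_error = np.linalg.norm(X_new @ beta - y_new) / np.linalg.norm(y_new)

print(train_error, new_error)
```

The training error is numerically zero while the error on new data is large, which is the signature of overfitting.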

Overfitting
If a model is very complex, it may overfit the data
To evaluate a model we separate the data into a training and a test set
1. We fit the model using the training set
2. We evaluate the error on the test set

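Experiment
$X_{\text{train}}$, $X_{\text{test}}$, $\vec{z}_{\text{train}}$ and $\vec{\beta}^*$ are iid Gaussian with mean 0 and variance 1
$$\vec{y}_{\text{train}} = X_{\text{train}} \vec{\beta}^* + \vec{z}_{\text{train}}$$
$$\vec{y}_{\text{test}} = X_{\text{test}} \vec{\beta}^*$$
We use $\vec{y}_{\text{train}}$ and $X_{\text{train}}$ to compute $\vec{\beta}_{\text{LS}}$
$$\text{error}_{\text{train}} = \frac{\left\| X_{\text{train}} \vec{\beta}_{\text{LS}} - \vec{y}_{\text{train}} \right\|_2}{\left\| \vec{y}_{\text{train}} \right\|_2}$$
$$\text{error}_{\text{test}} = \frac{\left\| X_{\text{test}} \vec{\beta}_{\text{LS}} - \vec{y}_{\text{test}} \right\|_2}{\left\| \vec{y}_{\text{test}} \right\|_2}$$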
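The experiment can be sketched as follows. The slide does not state the number of features or the test-set size, so `p = 50` and `n_test = 1000` are assumptions chosen for illustration, as is the helper name `run_experiment`.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 50            # number of features: an assumption, not stated on the slide
n_test = 1000     # test-set size: also an assumption
beta_true = rng.standard_normal(p)

def run_experiment(n):
    """Return (relative training error, relative test error) for n training points."""
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n_test, p))
    z_train = rng.standard_normal(n)

    y_train = X_train @ beta_true + z_train
    y_test = X_test @ beta_true              # test responses are noiseless

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    error_train = (np.linalg.norm(X_train @ beta_ls - y_train)
                   / np.linalg.norm(y_train))
    error_test = (np.linalg.norm(X_test @ beta_ls - y_test)
                  / np.linalg.norm(y_test))
    return error_train, error_test

results = {n: run_experiment(n) for n in (50, 100, 200, 300, 400, 500)}
for n, (e_train, e_test) in results.items():
    print(f"n={n:4d}  train={e_train:.3f}  test={e_test:.3f}")
```

Under these assumptions, when n equals p the model fits the training set essentially exactly, noise included. As n grows the training error rises towards the noise level while the test error decreases.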

Experiment
[Figure: relative error (l2 norm) versus n, for n from 50 to 500; curves: Error (training), Error (test), Noise level (training)]

Maximum temperatures in Oxford, UK
[Figure: temperature (Celsius) versus year, 1860 to 2000]

Maximum temperatures in Oxford, UK
[Figure: temperature (Celsius) versus year, 1900 to 1905]

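Linear model
$$\vec{y}_t \approx \vec{\beta}_0 + \vec{\beta}_1 \cos\left( \frac{2\pi t}{12} \right) + \vec{\beta}_2 \sin\left( \frac{2\pi t}{12} \right) + \vec{\beta}_3 \, t$$
$1 \le t \le n$ is the time in months ($n = 12 \cdot 150$)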
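A sketch of fitting this seasonal-plus-trend model by least squares. The Oxford data are not included in the slides, so the series below is synthetic: the coefficients in `beta_synth` and the unit noise level are hypothetical stand-ins chosen only to exercise the code.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 12 * 150                      # 150 years of monthly data, as on the slide
t = np.arange(1, n + 1)           # time in months

# Design matrix with columns [1, cos(2 pi t / 12), sin(2 pi t / 12), t].
X = np.column_stack([np.ones(n),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])

# Synthetic stand-in for the temperature series (hypothetical coefficients;
# the last entry is a trend in degrees Celsius per month).
beta_synth = np.array([13.0, -6.0, -3.0, 0.75 / 1200])
y = X @ beta_synth + rng.standard_normal(n)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Convert the slope from degrees per month to degrees per 100 years.
trend_per_century = beta_ls[3] * 12 * 100
print(beta_ls)
print(trend_per_century)
```

The model is nonlinear in t but linear in the coefficients, which is why ordinary least squares applies: the cosine and sine columns capture the yearly cycle and the last column captures the long-term trend.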

Model fitted by least squares
[Figure: temperature (Celsius) versus year, 1860 to 2000; curves: Data, Model]

Model fitted by least squares
[Figure: temperature (Celsius) versus year, 1900 to 1905; curves: Data, Model]

Model fitted by least squares
[Figure: temperature (Celsius) versus year, 1960 to 1965; curves: Data, Model]

Trend: Increase of 0.75 °C / 100 years (1.35 °F)
[Figure: temperature (Celsius) versus year, 1860 to 2000; curves: Data, Trend]

Model for minimum temperatures
[Figure: temperature (Celsius) versus year, 1860 to 2000; curves: Data, Model]

Model for minimum temperatures
[Figure: temperature (Celsius) versus year, 1900 to 1905; curves: Data, Model]

Model for minimum temperatures
[Figure: temperature (Celsius) versus year, 1960 to 1965; curves: Data, Model]

Trend: Increase of 0.88 °C / 100 years (1.58 °F)
[Figure: temperature (Celsius) versus year, 1860 to 2000; curves: Data, Trend]
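The Fahrenheit figures quoted with the two trends follow from scaling a temperature difference by 9/5. A short sketch of that arithmetic, where the function name is hypothetical:

```python
def celsius_diff_to_fahrenheit(delta_c):
    """Convert a temperature *difference* from Celsius to Fahrenheit.

    Differences scale by 9/5; the +32 offset applies only to absolute
    temperatures, not to increments such as a trend per century.
    """
    return delta_c * 9 / 5

# Trends for the maximum and minimum temperatures, in degrees C per 100 years.
for trend_c in (0.75, 0.88):
    print(trend_c, round(celsius_diff_to_fahrenheit(trend_c), 2))
```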
