

  1. Linear Regression. Machine Learning. Hamid Beigy, Sharif University of Technology, Fall 1393.

  2. Introduction
  1. Linear regression
  2. Model selection
  3. Sample size
  4. Maximum likelihood and least squares
  5. Over-fitting
  6. Regularization
  7. Maximum a posteriori and regularization
  8. Geometric interpretation
  9.
  10. Sequential learning
  11. Multiple outputs regression
  12. Bias-variance trade-off

  3. Introduction
  In regression, $c(x)$ is a continuous function. Hence the training set has the form $S = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$ with $t_k \in \mathbb{R}$.
  If there is no noise, the task is interpolation and our goal is to find a function $f(x)$ that passes through these points, so that $t_k = f(x_k)$ for all $k = 1, 2, \ldots, N$.
  In polynomial interpolation, given $N$ points, we find an $(N-1)$st-degree polynomial to predict the output for any $x$. If $x$ is outside the range of the training set, the task is called extrapolation.
  In regression, noise is added to the output of the unknown function: $t_k = f(x_k) + \epsilon$ for all $k = 1, 2, \ldots, N$, where $f(x_k) \in \mathbb{R}$ is the unknown function and $\epsilon$ is random noise.
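
  The interpolation/regression distinction can be illustrated in a few lines of NumPy; this is a hedged sketch on assumed toy data (the sine target and the sample size are illustrative, not from the slides):

```python
import numpy as np

# With N noise-free points, an (N-1)st-degree polynomial passes through all of them.
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 1.0, size=5))        # N = 5 inputs
t = np.sin(2 * np.pi * x)                         # noise-free targets t_k = f(x_k)

coeffs = np.polyfit(x, t, deg=len(x) - 1)         # degree N - 1 = 4 polynomial
print(np.allclose(np.polyval(coeffs, x), t))      # True: exact fit at the training points

# With noise, t_k = f(x_k) + eps, and an exact fit is no longer the goal (regression).
t_noisy = t + rng.normal(scale=0.1, size=t.shape)
```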

  4. Introduction (cont.)
  In regression, noise is added to the output of the unknown function: $t_k = f(x_k) + \epsilon$ for all $k = 1, 2, \ldots, N$.
  The explanation for the noise is that there are extra hidden variables that we cannot observe: $t_k = f^*(x_k, z_k) + \epsilon$ for all $k = 1, 2, \ldots, N$, where $z_k$ denotes the hidden variables.

  5. Linear regression
  Our goal is to approximate the output by a function $g(x)$. The empirical error on the training set $S$ is measured using a loss/error/cost function:
  Squared error from target: $E_E(g(x_i) \mid S) = (t_i - g(x_i))^2$.
  Linear (absolute) error from target: $E_E(g(x_i) \mid S) = |t_i - g(x_i)|$.
  Mean square error from target: $E_E(g(x) \mid S) = \frac{1}{N} \sum_{i=1}^{N} (t_i - g(x_i))^2$.
  Sum-of-squares error from target: $E_E(g(x) \mid S) = \frac{1}{2} \sum_{i=1}^{N} (t_i - g(x_i))^2$.
  The aim is to find $g(\cdot)$ that minimizes the empirical error. We assume a hypothesis class for $g(\cdot)$ with a small set of parameters, and take $g(x)$ to be linear: $g(x) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$.
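
  For concreteness, a minimal sketch of the four error measures above on assumed toy data (the arrays and weights below are illustrative only):

```python
import numpy as np

# Assumed toy data and a candidate linear model g(x) = w0 + w1*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 0.9, 2.2, 2.8])
w0, w1 = 0.0, 1.0
g = w0 + w1 * x                          # model predictions g(x_i)

squared = (t - g) ** 2                   # squared error per example
absolute = np.abs(t - g)                 # linear (absolute) error per example
mse = np.mean((t - g) ** 2)              # mean square error: (1/N) * sum
sse = 0.5 * np.sum((t - g) ** 2)         # sum-of-squares error: (1/2) * sum
```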

  6. Linear regression (cont.)
  In linear regression with $D = 1$, $g(x) = w_0 + w_1 x$. The parameters $w_0$ and $w_1$ should minimize the empirical error
  $E_E(w_0, w_1 \mid S) = E_E(g(x) \mid S) = \frac{1}{2} \sum_{k=1}^{N} [t_k - (w_0 + w_1 x_k)]^2$.
  This error function is a quadratic function of $W$ and its derivative is linear in $W$, so its minimization has a unique solution, denoted by $W^*$. Taking the derivative of the error with respect to $w_0$ and $w_1$ and setting it equal to zero gives
  $w_0 = \bar{t} - w_1 \bar{x}$, $\quad w_1 = \dfrac{\sum_k t_k x_k - N \bar{x} \bar{t}}{\sum_k x_k^2 - N \bar{x}^2}$, $\quad \bar{t} = \dfrac{\sum_{k=1}^{N} t_k}{N}$, $\quad \bar{x} = \dfrac{\sum_{k=1}^{N} x_k}{N}$.
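
  A minimal NumPy sketch of this closed-form $D = 1$ solution; the helper name fit_simple_linear and the toy data are assumptions for illustration:

```python
import numpy as np

# Closed-form least-squares fit for g(x) = w0 + w1*x, following the formulas above.
def fit_simple_linear(x, t):
    n = len(x)
    x_bar, t_bar = x.mean(), t.mean()
    w1 = (np.sum(t * x) - n * x_bar * t_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
    w0 = t_bar - w1 * x_bar
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.1, 4.9, 7.2])
w0, w1 = fit_simple_linear(x, t)          # roughly w0 ~ 1.0, w1 ~ 2.0
```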

  7. Linear Regression (cont.)
  When the input variables form a $D$-dimensional vector, the linear regression model is $g(x) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$. The parameters $w_0, w_1, \ldots, w_D$ should minimize the empirical error
  $E_E(g(x) \mid S) = \frac{1}{2} \sum_{i=1}^{N} (t_i - g(x_i))^2$.
  Take the derivative of the error with respect to each $w_j$ and set it equal to zero.

  8. Linear Regression (cont.)
  Taking the derivative of the error with respect to each $w_j$ and setting it equal to zero gives
  $\sum_{k=1}^{N} t_k = N w_0 + w_1 \sum_{k=1}^{N} x_{k1} + w_2 \sum_{k=1}^{N} x_{k2} + \ldots + w_D \sum_{k=1}^{N} x_{kD}$
  $\sum_{k=1}^{N} x_{k1} t_k = w_0 \sum_{k=1}^{N} x_{k1} + w_1 \sum_{k=1}^{N} x_{k1}^2 + w_2 \sum_{k=1}^{N} x_{k1} x_{k2} + \ldots + w_D \sum_{k=1}^{N} x_{k1} x_{kD}$
  $\sum_{k=1}^{N} x_{k2} t_k = w_0 \sum_{k=1}^{N} x_{k2} + w_1 \sum_{k=1}^{N} x_{k1} x_{k2} + w_2 \sum_{k=1}^{N} x_{k2}^2 + \ldots + w_D \sum_{k=1}^{N} x_{k2} x_{kD}$
  $\vdots$
  $\sum_{k=1}^{N} x_{kD} t_k = w_0 \sum_{k=1}^{N} x_{kD} + w_1 \sum_{k=1}^{N} x_{k1} x_{kD} + w_2 \sum_{k=1}^{N} x_{k2} x_{kD} + \ldots + w_D \sum_{k=1}^{N} x_{kD}^2$
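
  This system can be assembled directly from the sums; a sketch in NumPy, where the variable names and toy data are assumptions for illustration:

```python
import numpy as np

# Accumulate the sums above and solve the resulting (D+1) x (D+1) system.
rng = np.random.default_rng(0)
N, D = 200, 2
x = rng.normal(size=(N, D))
t = 1.5 + x @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=N)

# Augment each input with a leading 1 so the sums over x_k0 = 1 give the w0 equation.
xa = np.hstack([np.ones((N, 1)), x])               # shape (N, D+1)

A = np.zeros((D + 1, D + 1))
b = np.zeros(D + 1)
for k in range(N):
    A += np.outer(xa[k], xa[k])                    # entries like sum_k x_ki * x_kj
    b += t[k] * xa[k]                              # entries like sum_k x_ki * t_k

w = np.linalg.solve(A, b)                          # (w0, w1, ..., wD), roughly (1.5, 2.0, -1.0)
```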

  9. Linear Regression (cont.)
  Define the following vectors and matrix.
  Data matrix:
  $X = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1D} \\ 1 & x_{21} & x_{22} & \ldots & x_{2D} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \ldots & x_{ND} \end{pmatrix}$
  The $k$th input vector: $X_k = (1, x_{k1}, x_{k2}, \ldots, x_{kD})^T$.
  The weight vector: $W = (w_0, w_1, w_2, \ldots, w_D)^T$.
  The target vector: $t = (t_1, t_2, t_3, \ldots, t_N)^T$.

  10. Linear Regression (cont.)
  The empirical error is
  $E_E(g(x) \mid S) = \frac{1}{2} \sum_{k=1}^{N} \left( t_k - W^T X_k \right)^2$.
  The gradient of $E_E(g(x) \mid S)$ is
  $\nabla_W E_E(g(x) \mid S) = \sum_{k=1}^{N} \left( t_k - W^T X_k \right) X_k^T = \sum_{k=1}^{N} t_k X_k^T - W^T \sum_{k=1}^{N} X_k X_k^T = 0$.
  Solving for $W$, we obtain $W^* = \left( X^T X \right)^{-1} X^T t$.
  If $X^T X$ is invertible, the problem has a unique solution. If $X^T X$ is not invertible, the pseudo-inverse is used and the problem has several solutions.
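
  The same solution in matrix form, as a sketch; the toy data are assumed, and the bias $w_0$ is handled by a column of ones in $X$:

```python
import numpy as np

# Least-squares solution W* = (X^T X)^{-1} X^T t in matrix form.
rng = np.random.default_rng(1)
N, D = 100, 3
X_raw = rng.normal(size=(N, D))
t = 2.0 + X_raw @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=N)

X = np.hstack([np.ones((N, 1)), X_raw])       # first column of ones corresponds to w0
W_star = np.linalg.solve(X.T @ X, X.T @ t)    # unique solution when X^T X is invertible
W_pinv = np.linalg.pinv(X) @ t                # pseudo-inverse: usable even when X^T X is singular
```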

  11. Regression (cont.)
  If the linear model is too simple, the model can be a polynomial (a more complex hypothesis set): $g(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$.
  $M$ is the order of the polynomial, and choosing the right value of $M$ is called model selection.
  For $M = 1$ we have a too general model; for $M = 9$ we have a too specific model.
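
  Because the polynomial model is still linear in the weights, it can be fitted with the same least-squares machinery by treating the powers of $x$ as inputs; a sketch on assumed data:

```python
import numpy as np

# Polynomial regression as linear regression on powers of x.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

M = 3                                               # order of the polynomial (model selection picks M)
X = np.vander(x, M + 1, increasing=True)            # columns 1, x, x^2, ..., x^M
w = np.linalg.lstsq(X, t, rcond=None)[0]            # (w0, w1, ..., wM)
```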

  12. Regression (model selection)
  The goal of model selection is to achieve good generalization by making accurate predictions for new data. The generalization ability of a model is measured on a separate test set generated using exactly the same process that generated the training data. The model itself is chosen using a validation data set.
  Two models are sometimes compared using the root-mean-square (RMS) error
  $E_{RMS} = \sqrt{2 E_E(W^* \mid S) / N}$,
  which allows comparison across data sets of different sizes.
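
  A model-selection sketch: fit polynomials of a few orders $M$ and compare $E_{RMS}$ on the training set and on a separately generated test set. The sine-plus-noise generating process and the sample sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

def e_rms(w, x, t):
    sse = 0.5 * np.sum((t - np.polyval(w, x)) ** 2)   # sum-of-squares error E(W|S)
    return np.sqrt(2.0 * sse / len(x))                # E_RMS = sqrt(2 E(W|S) / N)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in (1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)
    print(M, e_rms(w, x_train, t_train), e_rms(w, x_test, t_test))
```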

  13. Regression (sample size)
  For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.

  14. Linear Regression (cont.)
  We can extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form
  $g(x) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$.
  The $\phi_j(x)$ are known as basis functions, $M$ is the total number of parameters, and $w_0$ is called the bias parameter. Usually a dummy basis function $\phi_0(x) = 1$ is used, so that
  $g(x) = \sum_{j=0}^{M-1} w_j \phi_j(x) = W^T \Phi(x)$,
  where $W = (w_0, w_1, \ldots, w_{M-1})^T$ and $\Phi = (\phi_0, \phi_1, \ldots, \phi_{M-1})^T$.

  15. Linear Regression (cont.)
  In a pre-processing phase, the features can be expressed in terms of the basis functions $\{\phi_j(x)\}$.
  Examples of basis functions: polynomial basis functions $\phi_j(x) = x^j$.

  16. Linear Regression (cont.)
  Examples of basis functions: Gaussian basis functions
  $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$,
  where $\mu_j$ is the location of the basis function and $s$ is its spatial scale.
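
  A small sketch of Gaussian basis functions in NumPy; the centers and scale are assumed values for illustration:

```python
import numpy as np

# Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).
def gaussian_basis(x, centers, s):
    # x: shape (N,), centers: shape (M-1,)  ->  returns shape (N, M-1)
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2))

x = np.linspace(0.0, 1.0, 5)
centers = np.linspace(0.0, 1.0, 4)    # the locations mu_j
phi = gaussian_basis(x, centers, s=0.2)
```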

  17. Linear Regression (cont.)
  Examples of basis functions: logistic basis functions
  $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, with $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
  Other examples are Fourier basis functions and wavelet basis functions.

  18. Linear Regression (cont.)
  The empirical error is
  $E_E(g(x) \mid S) = \frac{1}{2} \sum_{k=1}^{N} \left( t_k - W^T \Phi(X_k) \right)^2$.
  The gradient of $E_E(g(x) \mid S)$ is
  $\nabla_W E_E(g(x) \mid S) = \sum_{k=1}^{N} \left( t_k - W^T \Phi(X_k) \right) \Phi(X_k)^T = \sum_{k=1}^{N} t_k \Phi(X_k)^T - W^T \sum_{k=1}^{N} \Phi(X_k) \Phi(X_k)^T = 0$.
  Solving for $W$, we obtain $W^* = \left( \Phi^T \Phi \right)^{-1} \Phi^T t$, where
  $\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \phi_2(x_1) & \ldots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \phi_2(x_2) & \ldots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \phi_2(x_N) & \ldots & \phi_{M-1}(x_N) \end{pmatrix}$
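
  Putting the pieces together, a sketch that builds the design matrix $\Phi$ with the dummy basis $\phi_0(x) = 1$ plus Gaussian basis functions and solves for $W^*$; the data and basis settings are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)

centers, s = np.linspace(0.0, 1.0, 9), 0.2
Phi = np.hstack([
    np.ones((len(x), 1)),                                               # phi_0(x) = 1 column
    np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2)),   # Gaussian basis columns
])

W_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # least-squares weights
g = Phi @ W_star                                   # g(x_k) = W^T Phi(x_k) at the training inputs
```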

  19. Linear Regression (cont.)
  The quantity $\Phi^{\dagger} = \left( \Phi^T \Phi \right)^{-1} \Phi^T$ is known as the Moore-Penrose pseudo-inverse.
  The bias value $w_0$ is
  $w_0 = \frac{1}{N} \sum_{k=1}^{N} \left( t_k - \sum_{j=1}^{M-1} w_j \phi_j(X_k) \right)$,
  so the bias compensates for the difference between the average target value and the weighted sum of the averages of the basis function values.
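
  A sketch of the pseudo-inverse solution together with a numerical check of the bias identity above; it rebuilds the same assumed $\Phi$ and $t$ as the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)
centers, s = np.linspace(0.0, 1.0, 9), 0.2
Phi = np.hstack([
    np.ones((len(x), 1)),
    np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2)),
])

W_star = np.linalg.pinv(Phi) @ t                     # W* = Phi^dagger t
w0_from_formula = np.mean(t - Phi[:, 1:] @ W_star[1:])
print(np.isclose(w0_from_formula, W_star[0]))        # True at the least-squares optimum
```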
