IAML: Linear Regression


1. IAML: Linear Regression
Nigel Goddard, School of Informatics, Semester 1

Overview
◮ The linear model
◮ Fitting the linear model to data
◮ Probabilistic interpretation of the error function
◮ Examples of regression problems
◮ Dealing with multiple outputs
◮ Generalized linear regression
◮ Radial basis function (RBF) models

The Regression Problem
◮ Classification and regression problems:
  ◮ Classification: the target of prediction is discrete
  ◮ Regression: the target of prediction is continuous
◮ Training data: a set D of pairs (x_i, y_i) for i = 1, ..., n, where x_i ∈ R^D and y_i ∈ R (see the code sketch after this section)
◮ Today: linear regression, i.e., the relationship between x and y is linear
◮ Although this is simple (and limited), it is:
  ◮ More powerful than you would expect
  ◮ The basis for more complex nonlinear methods
  ◮ A model that teaches a lot about regression and classification

Examples of regression problems
◮ Robot inverse dynamics: predicting what torques are needed to drive a robot arm along a given trajectory
◮ Electricity load forecasting: generating hourly forecasts two days in advance (see W & F, § 1.3)
◮ Predicting staffing requirements at help desks based on historical data and product and sales information
◮ Predicting the time to failure of equipment based on utilization and environmental conditions
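To make this data layout concrete, here is a minimal NumPy sketch of such a training set (everything below is illustrative: the sizes, weights, and noise level are made up, not taken from the slides). The n inputs are stacked into an n × D array with one row per x_i, alongside a length-n vector of continuous targets.

import numpy as np

# Hypothetical training set D = {(x_i, y_i)}: n inputs with D features each,
# stacked row-wise, plus n continuous targets.
rng = np.random.default_rng(0)
n, D = 50, 3
X = rng.normal(size=(n, D))                       # each row is one x_i in R^D
true_w = np.array([0.5, -1.0, 2.0])               # made-up weights for the simulation
y = X @ true_w + rng.normal(scale=0.1, size=n)    # each y_i in R is a noisy linear response
print(X.shape, y.shape)                           # (50, 3) (50,)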

2. The Linear Model
◮ Linear model (a code sketch follows at the end of this section):

    f(x; w) = w_0 + w_1 x_1 + ... + w_D x_D = φ(x) w

  where φ(x) = (1, x_1, ..., x_D) = (1, x^T) and w = (w_0, w_1, ..., w_D)^T.   (1)
◮ The maths of fitting linear models to data is easy. We use the notation φ(x) to make generalisation easy later.

Toy example: Data
Figure: scatter plot of the toy data, a single input x plotted against the target y.

With two features
Instead of a line, a plane. With more features, a hyperplane.
Figure: regression plane over two inputs X_1 and X_2 (Hastie, Tibshirani, and Friedman).
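A minimal NumPy sketch of this model (the weight and input values are illustrative, not from the slides): φ(x) prepends a constant 1 so that the bias w_0 is handled by the same dot product as the other weights.

import numpy as np

def phi(x):
    # Basis function phi(x) = (1, x_1, ..., x_D): a leading 1 carries the bias term w_0.
    return np.concatenate(([1.0], x))

def f(x, w):
    # Linear model f(x; w) = w_0 + w_1 x_1 + ... + w_D x_D = phi(x) . w
    return phi(x) @ w

w = np.array([0.5, 2.0, -1.0])   # illustrative weights (w_0, w_1, w_2)
x = np.array([1.5, 3.0])         # one input with D = 2 features
print(f(x, w))                   # 0.5 + 2.0*1.5 - 1.0*3.0 = 0.5

Writing the model as φ(x) w rather than w_0 + w^T x is what makes the later generalisation easy: only phi changes, while the fitting machinery stays the same.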

3. With more features: CPU Performance Data Set
◮ Predict PRP: published relative performance
◮ MYCT: machine cycle time in nanoseconds (integer)
◮ MMIN: minimum main memory in kilobytes (integer)
◮ MMAX: maximum main memory in kilobytes (integer)
◮ CACH: cache memory in kilobytes (integer)
◮ CHMIN: minimum channels in units (integer)
◮ CHMAX: maximum channels in units (integer)

The fitted linear model:

    PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

In matrix notation
◮ The design matrix is n × (D + 1):

    Φ = [ 1  x_11  x_12  ...  x_1D
          1  x_21  x_22  ...  x_2D
          ...
          1  x_n1  x_n2  ...  x_nD ]

◮ x_ij is the jth component of the training input x_i
◮ Let y = (y_1, ..., y_n)^T
◮ Then ŷ = Φ w is ...? (see the code sketch after this section)

Linear Algebra: The 1-Slide Version
What is matrix multiplication? Let

    A = [ a_11  a_12  a_13 ]        b = [ b_1 ]
        [ a_21  a_22  a_23 ]            [ b_2 ]
        [ a_31  a_32  a_33 ]            [ b_3 ]

First consider matrix times vector, i.e., A b. Two answers:
1. A b is a linear combination of the columns of A:

    A b = b_1 (a_11, a_21, a_31)^T + b_2 (a_12, a_22, a_32)^T + b_3 (a_13, a_23, a_33)^T

2. A b is a vector: each element is the dot product between one row of A and b:

    A b = ( (a_11, a_12, a_13) · b,  (a_21, a_22, a_23) · b,  (a_31, a_32, a_33) · b )^T
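The two views of A b, and the prediction ŷ = Φ w on the training inputs, in a short NumPy sketch (the matrices and weights below are made up for illustration):

import numpy as np

# The two views of A b on an illustrative 3 x 3 matrix.
A = np.arange(1.0, 10.0).reshape(3, 3)
b = np.array([1.0, 2.0, 3.0])

col_view = sum(b[j] * A[:, j] for j in range(3))      # b_1*col_1 + b_2*col_2 + b_3*col_3
row_view = np.array([A[i, :] @ b for i in range(3)])  # each entry is (row i of A) . b
assert np.allclose(col_view, A @ b) and np.allclose(row_view, A @ b)

# The design matrix: a column of ones prepended to the n x D inputs, giving n x (D + 1).
X = np.array([[2.0, 1.0],
              [0.0, 3.0],
              [1.0, 1.0]])
Phi = np.column_stack([np.ones(len(X)), X])
w = np.array([0.5, 2.0, -1.0])   # illustrative weights (w_0, w_1, w_2)
y_hat = Phi @ w                  # the model's predicted values on the training inputs
print(y_hat)                     # [ 3.5 -2.5  1.5]

By the row view, each entry of Φ w is the dot product of one row φ(x_i) with w, i.e., the model's prediction for training input x_i.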

4. Linear model (part 2)
◮ As above, the design matrix Φ is n × (D + 1), x_ij is the jth component of the training input x_i, and y = (y_1, ..., y_n)^T
◮ Then ŷ = Φ w is the model's predicted values on the training inputs.

Solving for Model Parameters
This looks like what we've seen in linear algebra:

    y = Φ w

We know y and Φ but not w. So why not take w = Φ^{-1} y? (You can't, but why?)

Three reasons:
◮ Φ is not square: it is n × (D + 1)
◮ In other words, the system is overconstrained (n equations for D + 1 parameters)
◮ The data has noise

Loss function
We want a loss function O(w) such that
◮ we can minimize it with respect to w, and
◮ at the minimum, ŷ looks like y
◮ (recall that ŷ = Φ w depends on w)
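A sketch of such a loss in code (the toy Φ and y below are made up). Note that np.linalg.inv(Phi) is not even defined here, because Φ has more rows than columns:

import numpy as np

def loss(w, Phi, y):
    # Squared error O(w) = sum_i (y_i - w^T x_i)^2 = (y - Phi w)^T (y - Phi w)
    r = y - Phi @ w        # vector of residuals
    return r @ r

# Phi is 4 x 2 (n > D + 1): the system is overconstrained and, with noisy
# targets, y = Phi w has no exact solution for w.
Phi = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([0.1, 1.1, 1.9, 3.2])
print(loss(np.array([0.0, 1.0]), Phi, y))   # about 0.07 for the guess w = (0, 1)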

5. Fitting a linear model to data
◮ A common choice is the squared error (it makes the maths easy):

    O(w) = Σ_{i=1}^n (y_i - w^T x_i)^2 = (y - Φ w)^T (y - Φ w)

◮ In the picture: this is the sum of the squared lengths of the black sticks
◮ (Each one is called a residual, i.e., each y_i - w^T x_i)
◮ We want to minimize this with respect to w
◮ The error surface is a parabolic bowl
◮ How do we do this?
Figure: residuals drawn as vertical sticks from the data points to the fitted plane, and the parabolic error surface E[w] over (w_0, w_1).

The Solution
◮ Answer: to minimize O(w) = Σ_{i=1}^n (y_i - w^T x_i)^2, set the partial derivatives to 0.
◮ This has an analytical solution (see the code sketch after this section):

    w = (Φ^T Φ)^{-1} Φ^T y

◮ (Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of Φ
◮ First check: does this make sense? Do the matrix dimensions line up?
◮ Then: why is this called a pseudo-inverse?
◮ Finally: what happens if there are no features?

Probabilistic interpretation of O(w)
◮ Assume that y = w^T x + ε, where ε ∼ N(0, σ²)
◮ (This is an exact linear relationship plus Gaussian noise.)
◮ This implies that y | x_i ∼ N(w^T x_i, σ²), i.e.

    -log p(y_i | x_i) = log √(2π) + log σ + (y_i - w^T x_i)² / (2σ²)

◮ So minimising O(w) is equivalent to maximising the likelihood!
◮ We can view w^T x as E[y | x]
◮ The squared residuals allow estimation of σ²:

    σ̂² = (1/n) Σ_{i=1}^n (y_i - w^T x_i)²
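A sketch of the analytic solution and of the noise-variance estimate on simulated data (the true weights 0.5 and 2.0 and the noise level are assumptions made for the illustration, not values from the slides):

import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-3, 3, size=n)
y = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=n)   # simulated data with assumed true weights

Phi = np.column_stack([np.ones(n), x])              # n x (D + 1) design matrix, D = 1
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)         # normal equations: w = (Phi^T Phi)^{-1} Phi^T y
                                                    # (np.linalg.lstsq(Phi, y) is the numerically safer route)
residuals = y - Phi @ w
sigma2_hat = residuals @ residuals / n              # maximum-likelihood estimate of sigma^2
print(w, sigma2_hat)                                # w close to (0.5, 2.0), sigma2_hat close to 0.09

Solving the normal equations directly keeps the correspondence with the formula on the slide; forming the explicit inverse (Φ^T Φ)^{-1} is rarely done in practice.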

6. Sensitivity to Outliers
◮ Linear regression is sensitive to outliers
◮ Example: suppose y = 0.5 x + ε with ε ∼ N(0, 0.25), and then add a point at (2.5, 3) (a code sketch of this experiment follows at the end of this section)
Figure: the fitted line is pulled noticeably towards the added point at (2.5, 3).

Fitting this into the general structure for learning algorithms:
◮ Define the task: regression
◮ Decide on the model structure: linear regression model
◮ Decide on the score function: squared error (likelihood)
◮ Decide on the optimization/search method to optimize the score function: calculus (analytic solution)

Diagnostics
Graphical diagnostics can be useful for checking:
◮ Is the relationship obviously nonlinear? Look for structure in the residuals.
◮ Are there obvious outliers?
The goal isn't to find all problems. You can't. The goal is to find obvious, embarrassing problems.
Example: plot the residuals against the fitted values. Stats packages will do this for you.

Dealing with multiple outputs
◮ Suppose there are q different targets for each input x
◮ We introduce a different w_i for each target dimension, and do the regression separately for each one
◮ This is called multiple regression
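A sketch of the outlier experiment (the particular sample and random seed are made up; the setup mirrors the slide's example): fit the line with and without the added point and compare the estimated weights.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2, 6)
y = 0.5 * x + rng.normal(scale=0.25, size=x.size)    # assumed noise scale, for illustration

def fit(x, y):
    # Least-squares fit of (w_0, w_1) via the pseudo-inverse of the design matrix.
    Phi = np.column_stack([np.ones(x.size), x])
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

print("without outlier:", fit(x, y))
print("with outlier:   ", fit(np.append(x, 2.5), np.append(y, 3.0)))

The same fit covers the multiple-output case: np.linalg.lstsq also accepts an n × q matrix of targets and solves for each of the q columns separately, which is exactly the separate-regressions view of multiple regression above.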
