Linear Regression
Yijun Zhao
Northeastern University
Fall 2016
Regression Examples
Attributes x ⇒ Continuous value y:
{age, major, gender, race} ⇒ GPA
{income, credit score, profession} ⇒ loan
{college, major, GPA} ⇒ future income
...
Regression Examples
Data often has / can be converted into matrix form:

Age  Gender  Race  Major     GPA
20   0       A     Art       3.85
22   0       C     Engineer  3.90
25   1       A     Engineer  3.50
24   0       AA    Art       3.60
19   1       H     Art       3.70
18   1       C     Engineer  3.00
30   0       AA    Engineer  3.80
25   0       C     Engineer  3.95
28   1       A     Art       4.00
26   0       C     Engineer  3.20
Formal Problem Setup
Given N observations {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, a regression problem tries to uncover the function f with y_i = f(x_i) for all i = 1, 2, ..., N, such that for a new input value x*, we can accurately predict the corresponding value y* = f(x*).
Linear Regression
Assume the function f is a linear combination of the components of x.
Formally, let x = (1, x_1, x_2, ..., x_d)^T; then
y = ω_0 + ω_1 x_1 + ω_2 x_2 + ... + ω_d x_d = w^T x
where w = (ω_0, ω_1, ω_2, ..., ω_d)^T.
w is the parameter to estimate!
Prediction: y* = w^T x*
Visual Illustration
Figure: 1D and 2D linear regression
Error Measure
Mean Squared Error (MSE):
E(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)² = (1/N) ‖Xw − y‖²
where X is the matrix whose rows are x_1^T, x_2^T, ..., x_N^T and y = (y_1, y_2, ..., y_N)^T.
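A minimal sketch of this error measure in NumPy, assuming X already includes the x_0 = 1 column and that w and y are 1-D arrays (the function name is illustrative, not from the slides):

```python
import numpy as np

def mse(X, w, y):
    """Mean squared error E(w) = (1/N) * ||X w - y||^2."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))
```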
Minimizing Error Measure
E(w) = (1/N) ‖Xw − y‖²
∇E(w) = (2/N) X^T (Xw − y) = 0
X^T X w = X^T y
w = X† y
where X† = (X^T X)^{-1} X^T is the 'pseudo-inverse' of X
LR Algorithm Summary
Ordinary Least Squares (OLS) Algorithm
Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} (each x includes x_0 = 1): the rows of X are x_1^T, x_2^T, ..., x_N^T and y = (y_1, y_2, ..., y_N)^T.
Compute X† = (X^T X)^{-1} X^T
Return w = X† y
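A NumPy sketch of the three steps above; np.linalg.pinv is used in place of an explicit (X^T X)^{-1} X^T as an implementation choice, and the usage lines with X_raw and x_star are hypothetical:

```python
import numpy as np

def ols_fit(X, y):
    """Return w = X-dagger y, the least-squares weights."""
    # np.linalg.pinv gives the pseudo-inverse; it equals (X^T X)^{-1} X^T
    # whenever X^T X is invertible, and stays well-defined otherwise.
    return np.linalg.pinv(X) @ y

# Hypothetical usage: X_raw is an (N, d) feature matrix, y an (N,) target vector.
# X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # prepend x_0 = 1
# w = ols_fit(X, y)
# y_star = w @ np.concatenate(([1.0], x_star))            # predict for a new x_star
```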
Gradient Descent
Why? Minimize our target function E(w) by moving down in the steepest direction.
Gradient Descent
Gradient Descent Algorithm
Initialize the weights w(0) for time t = 0
for t = 0, 1, 2, ... do
    Compute the gradient g_t = ∇E(w(t))
    Set the direction to move: v_t = −g_t
    Update w(t+1) = w(t) + η v_t
    Iterate until it is time to stop
Return the final weights w
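A minimal NumPy sketch of this loop for the linear-regression error E(w); the default step size eta = 0.1 and the fixed iteration count stand in for the stopping rule, which the slides leave open:

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize E(w) = (1/N) * ||X w - y||^2 by fixed-step gradient descent."""
    N, d = X.shape
    w = np.zeros(d)                          # w(0): initial weights
    for _ in range(n_iters):
        g = (2.0 / N) * (X.T @ (X @ w - y))  # gradient g_t at w(t)
        w = w - eta * g                      # step along v_t = -g_t
    return w
```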
Gradient Descent
How does η affect the algorithm?
Use η = 0.1 (practical observation)
Use a variable step size: η_t = η ‖∇E‖
OLS or Gradient Descent?
Computational Complexity: OLS vs. Gradient Descent
OLS is expensive when D is large!
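The slide's comparison table did not survive extraction; as a worked estimate under the usual operation counts (an assumption, with N examples and D features):

```latex
\underbrace{O(ND^2)}_{\text{form } X^T X}
\;+\;
\underbrace{O(D^3)}_{\text{invert } X^T X}
\quad\text{(OLS)}
\qquad\text{vs.}\qquad
\underbrace{O(TND)}_{T \text{ iterations}}
\quad\text{(gradient descent)}
```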
Linear Regression
What is the Probabilistic Interpretation?
Normal Distribution
(Figure: examples of right skewed, left skewed, and random distributions)
Normal Distribution
mean = median = mode
symmetry about the center
x ∼ N(µ, σ²) ⇒ f(x) = 1/(σ√(2π)) · e^{−(x−µ)²/(2σ²)}
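A small sketch evaluating this density directly (the function name is illustrative, not from the slides):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = 1/(sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))."""
    coef = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coef * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
```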
Central Limit Theorem
All things bell shaped!
Random occurrences over a large population tend to wash out the asymmetry and uniformity of individual events. A more 'natural' distribution ensues. Its name is the Normal distribution (the bell curve).
Formal definition: If (y_1, ..., y_n) are i.i.d. and 0 < σ²_y < ∞, then when n is large the distribution of ȳ is well approximated by a normal distribution N(µ_y, σ²_y / n).
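A hedged simulation sketch of this statement: averages of n i.i.d. draws concentrate around µ_y with variance roughly σ²_y / n. The exponential distribution, sample size, and repetition count are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 10_000
# Exponential(1) draws: mean 1, variance 1, clearly not bell shaped.
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)       # reps independent copies of the sample mean
print(means.mean())                # close to mu_y = 1
print(means.var())                 # close to sigma_y^2 / n = 0.01
```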
Central Limit Theorem
Example: (figure)
LR: Probabilistic Interpretation
LR: Probabilistic Interpretation 2 πσ e − 1 2 σ 2 ( w T x i − y i ) 2 1 prob ( y i | x i ) = √ Yijun Zhao Linear Regression
LR: Probabilistic Interpretation
Likelihood of the entire dataset:
L ∝ ∏_i e^{−(w^T x_i − y_i)²/(2σ²)} = e^{−(1/(2σ²)) Σ_i (w^T x_i − y_i)²}
Maximize L ⇐⇒ Minimize Σ_i (w^T x_i − y_i)²
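Taking logs makes the equivalence explicit (a one-line step using the likelihood written above):

```latex
\log L = \text{const} \;-\; \frac{1}{2\sigma^2}\sum_i \left(w^T x_i - y_i\right)^2
\quad\Longrightarrow\quad
\arg\max_w L \;=\; \arg\min_w \sum_i \left(w^T x_i - y_i\right)^2
```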
Non-linear Transformation
Linear is limited:
Linear models become powerful when we consider non-linear feature transformations:
X_i = (1, x_i, x_i²) ⇒ y_i = ω_0 + ω_1 x_i + ω_2 x_i²
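A minimal sketch of this quadratic expansion followed by an ordinary least-squares fit; the use of np.linalg.lstsq is an implementation choice, and the usage lines are hypothetical:

```python
import numpy as np

def quadratic_features(x):
    """Map scalar inputs x_i to feature rows (1, x_i, x_i^2)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Hypothetical usage with 1-D arrays x and y:
# X = quadratic_features(x)
# w, *_ = np.linalg.lstsq(X, y, rcond=None)   # w = (omega_0, omega_1, omega_2)
```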
Overfitting
Overfitting
How do we know we have overfitted?
E_in: error on the training data
E_out: error on the test data
Example:
Overfitting
How to avoid overfitting?
Use more data
Evaluate on a parameter tuning set
Regularization
Regularization
Attempts to impose the "Occam's razor" principle
Add a penalty term for model complexity
Most commonly used:
L2 regularization (ridge regression) minimizes:
E(w) = ‖Xw − y‖² + λ‖w‖²
where λ ≥ 0 and ‖w‖² = w^T w
L1 regularization (LASSO) minimizes:
E(w) = ‖Xw − y‖² + λ|w|_1
where λ ≥ 0 and |w|_1 = Σ_{i=1}^{D} |ω_i|
Regularization
L2: closed-form solution w = (X^T X + λI)^{-1} X^T y
L1: no closed-form solution. Use quadratic programming:
minimize ‖Xw − y‖² s.t. ‖w‖_1 ≤ s
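A NumPy sketch of the L2 closed form above; it penalizes every weight exactly as the formula is written (whether to exempt the intercept ω_0 is a convention the slides do not specify):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """w = (X^T X + lambda * I)^{-1} X^T y."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)
```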
L2 Regularization Example
Model Selection
Which model? A central problem in supervised learning
Simple models "underfit" the data:
    Constant function
    Linear model applied to quadratic data
Complex models "overfit" the data:
    High-degree polynomials
    Models with hidden logic that fit the data to completion
Bias-Variance Trade-off
Consider E[(1/N) Σ_{n=1}^{N} (w^T x_n − y_n)²] and let ŷ = w^T x_n.
E[(ŷ − y_n)²] can be decomposed into (reading):
var{noise} + bias² + var{ŷ}
var{noise}: can't be reduced
bias² + var{ŷ} is what counts for prediction
High bias²: model mismatch, often due to "underfitting"
High var{ŷ}: training set and test set mismatch, often due to "overfitting"
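Written out, one standard form of the decomposition referenced above (assuming zero-mean noise, so that E[y] equals the true regression function):

```latex
\mathbb{E}\!\left[(\hat{y} - y)^2\right]
= \underbrace{\operatorname{var}\{\text{noise}\}}_{\text{irreducible}}
+ \underbrace{\left(\mathbb{E}[\hat{y}] - \mathbb{E}[y]\right)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{var}\{\hat{y}\}}_{\text{variance}}
```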
Bias-Variance Trade-off
Often:
low bias ⇒ high variance
low variance ⇒ high bias
Trade-off:
How to choose λ?
But we still need to pick λ.
Use the test set data? NO!
Set aside another evaluation set
Small evaluation set ⇒ inaccurate estimated error
Large evaluation set ⇒ small training set
Cross-validation
Cross Validation (CV)
Divide the data into K folds
In turn, train on all but the k-th fold, and test on the k-th fold
Cross Validation (CV)
How to choose K?
Common choices: K = 5, 10, or N (LOOCV)
Measure the average performance across folds
Cost of computation: K folds × number of choices of λ
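A minimal sketch of K-fold selection of λ, reusing the ridge_fit and mse helpers sketched earlier (both are assumptions of these examples, not code from the slides):

```python
import numpy as np

def cross_validate_lambda(X, y, lambdas, K=5, seed=0):
    """Pick the lambda with the lowest average validation MSE over K folds."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, K)
    avg_errors = []
    for lam in lambdas:
        fold_errors = []
        for k in range(K):
            val = folds[k]                                        # k-th fold: validation
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            w = ridge_fit(X[train], y[train], lam)                # train on the rest
            fold_errors.append(mse(X[val], w, y[val]))
        avg_errors.append(np.mean(fold_errors))
    return lambdas[int(np.argmin(avg_errors))]
```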
Learning Curve
A learning curve plots the performance of the algorithm as a function of the size of the training data.
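A sketch of how such a curve could be computed, assuming the ols_fit and mse helpers above and a held-out test set (X_test, y_test); the grid of training sizes is illustrative:

```python
import numpy as np

def learning_curve(X_train, y_train, X_test, y_test, sizes):
    """Fit on growing prefixes of the training data; record train and test MSE."""
    curve = []
    for m in sizes:
        w = ols_fit(X_train[:m], y_train[:m])
        curve.append((m, mse(X_train[:m], w, y_train[:m]), mse(X_test, w, y_test)))
    return curve

# Hypothetical usage: sizes = np.linspace(10, len(y_train), num=10, dtype=int)
```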