Applied Machine Learning: Linear Regression. Siamak Ravanbakhsh, COMP 551 (Fall 2020)
Learning objectives: the linear model, evaluation criteria, how to find the best fit, the geometric interpretation, and the maximum likelihood interpretation.
Motivation. History: the method of least squares was invented by Legendre and Gauss (early 1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt). A modern example: the effect of income inequality on health and social problems. Source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/
Notation (recall). Each instance is a vector $x^{(n)} \in \mathbb{R}^D$ with a label $y^{(n)} \in \mathbb{R}$. Vectors are assumed to be column vectors: $x = [x_1, x_2, \ldots, x_D]^\top$. We assume $N$ instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$. Each instance has $D$ features indexed by $d$; for example, $x_d^{(n)} \in \mathbb{R}$ is feature $d$ of instance $n$.
Notation (recall). Design matrix: concatenate all instances, so each row is a datapoint and each column is a feature:
$$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times D}$$
One row is one instance; one column is one feature.
Notation. Example: microarray data $X \in \mathbb{R}^{N \times D}$ contains gene expression levels, with one row per patient ($n$) and one column per gene ($d$). The labels $y$ can be a {cancer / no cancer} label for each patient.
Linear model. $f_w : \mathbb{R}^D \to \mathbb{R}$ (assuming a scalar output; we will generalize to a vector output later):
$$f_w(x) = w_0 + w_1 x_1 + \ldots + w_D x_D$$
Here $w$ are the model parameters or weights, and $w_0$ is the bias or intercept. Simplification: concatenate a 1 to $x$, i.e., $x = [1, x_1, \ldots, x_D]^\top$, so that $f_w(x) = w^\top x$.
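As a small illustration (not from the slides), a minimal NumPy sketch of this model, with the bias folded into the weight vector by prepending a constant 1 to the input; the names and numbers are purely illustrative:

```python
import numpy as np

def predict(w, x):
    """Linear model f_w(x) = w^T [1, x_1, ..., x_D], with the bias w_0 folded into w."""
    x_aug = np.concatenate(([1.0], x))  # prepend the constant feature for the intercept
    return w @ x_aug

# hypothetical example with D = 2 features, w = [w_0, w_1, w_2]
w = np.array([0.5, 2.0, -1.0])
print(predict(w, np.array([3.0, 1.0])))  # 0.5 + 2*3 - 1*1 = 5.5
```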
Loss function. Objective: find parameters $w$ that fit the data, $f_w(x^{(n)}) \approx y^{(n)}\ \forall n$, by minimizing a measure of the difference between $\hat{y}^{(n)} = f_w(x^{(n)})$ and $y^{(n)}$. The squared error loss (a.k.a. L2 loss) for a single instance is $L(y, \hat{y}) \triangleq \frac{1}{2}(y - \hat{y})^2$ (the factor $\frac{1}{2}$ is for future convenience). For the whole dataset, the sum-of-squared-errors cost function is
$$J(w) = \frac{1}{2} \sum_{n=1}^N \left(y^{(n)} - w^\top x^{(n)}\right)^2.$$
Example ($D = 1$, or $D = 2$ with the bias). [Figure: data points $(x^{(n)}, y^{(n)})$, the fitted line $f_{w^*}(x) = w_0^* + w_1^* x$, and the residuals $y^{(n)} - f(x^{(n)})$; here $x = [1, x_1]^\top$.] Linear least squares: $w^* = \arg\min_w \sum_n (y^{(n)} - w^\top x^{(n)})^2$.
Example ($D = 2$, or $D = 3$ with the bias). [Figure: data points and the fitted plane $f_{w^*}(x) = w_0^* + w_1^* x_1 + w_2^* x_2$.] Linear least squares: $w^* = \arg\min_w \sum_n (y^{(n)} - w^\top x^{(n)})^2$.
Matrix form. Instead of $\hat{y}^{(n)} = w^\top x^{(n)}$ (with $w^\top \in \mathbb{R}^{1 \times D}$ and $x^{(n)} \in \mathbb{R}^{D \times 1}$), use the design matrix to write $\hat{y} = Xw$, where $\hat{y} \in \mathbb{R}^{N \times 1}$, $X \in \mathbb{R}^{N \times D}$, and $w \in \mathbb{R}^{D \times 1}$. Linear least squares:
$$\arg\min_w \frac{1}{2} \|y - Xw\|_2^2 = \arg\min_w \frac{1}{2}(y - Xw)^\top (y - Xw),$$
i.e., minimizing the squared L2 norm of the residual vector.
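A minimal sketch of this cost in NumPy (the function name is an assumption, not from the slides):

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum-of-squared-errors cost J(w) = 0.5 * ||y - Xw||^2."""
    residual = y - X @ w          # residual vector, shape (N,)
    return 0.5 * residual @ residual
```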
Minimizing the cost the cost function is a smooth function of w find minimum by setting partial derivatives to zero
Simple case: $D = 1$ (no intercept). Model: $f_w(x) = wx$, with $x$ and $w$ both scalar. Cost function: $J(w) = \frac{1}{2}\sum_n (y^{(n)} - wx^{(n)})^2$. Derivative: $\frac{dJ}{dw} = \sum_n (wx^{(n)} - y^{(n)}) x^{(n)}$. Setting the derivative to zero gives
$$w^* = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n (x^{(n)})^2}.$$
This is the global minimum because the cost is smooth and convex (more on convexity later).
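A quick numerical check of this scalar formula, assuming NumPy and some made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])      # roughly y = 2x
w_star = (x @ y) / (x @ x)         # w* = sum_n x^(n) y^(n) / sum_n (x^(n))^2
print(w_star)                      # close to 2
```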
Gradient. For a multivariate function $J(w_0, w_1)$ we use partial derivatives instead of the derivative; a partial derivative is the derivative taken while the other variables are fixed:
$$\frac{\partial}{\partial w_1} J(w_0, w_1) \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}.$$
Critical point: all partial derivatives are zero. Gradient: the vector of all partial derivatives, $\nabla J(w) = \left[\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)\right]^\top$.
Finding $w$ (any $D$). Setting $\frac{\partial}{\partial w_d} J(w) = 0$:
$$\frac{\partial}{\partial w_d} \frac{1}{2} \sum_n \left(y^{(n)} - f_w(x^{(n)})\right)^2 = 0.$$
Using the chain rule, $\frac{\partial J}{\partial w_d} = \frac{dJ}{df_w} \frac{\partial f_w}{\partial w_d}$, we get
$$\sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_d^{(n)} = 0 \quad \forall d \in \{1, \ldots, D\}.$$
The cost is a smooth and convex function of $w$.
Normal equation. The optimal weights $w^*$ satisfy $\sum_n (y^{(n)} - w^\top x^{(n)})\, x_d^{(n)} = 0\ \forall d$. In matrix form (using the design matrix): $X^\top (y - Xw) = 0$, where each row enforces one of the $D$ equations (e.g., the second row uses the second column of the design matrix). This is called the normal equation because, for the optimal $w$, the residual vector $y - Xw$ is normal (orthogonal) to the column space of the design matrix. It is a system of $D$ linear equations, $X^\top X w = X^\top y$, of the form $Aw = b$.
Direct solution. We can get a closed-form solution:
$$X^\top (y - Xw) = 0 \;\Rightarrow\; X^\top X w = X^\top y \;\Rightarrow\; w^* = (X^\top X)^{-1} X^\top y,$$
where $(X^\top X)^{-1} X^\top$ is the pseudo-inverse of $X$ (shapes: $D \times D$, $D \times N$, $N \times 1$). The fitted values are $\hat{y} = Xw^* = X (X^\top X)^{-1} X^\top y$, where $X (X^\top X)^{-1} X^\top$ is the projection matrix onto the column space of $X$.
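A minimal sketch of this closed-form fit, solving the normal equations rather than forming the inverse explicitly; the synthetic data and variable names below are assumptions for illustration:

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# usage on synthetic data, with a column of ones for the bias
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(fit_least_squares(X, y))  # close to w_true
```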
Uniqueness of the solution. The closed form $w^* = (X^\top X)^{-1} X^\top y$ requires $X^\top X$ to be invertible. When is it not invertible? Recall the eigenvalue decomposition of this matrix, $X^\top X = Q \Lambda Q^\top$, so that $(Q \Lambda Q^\top)^{-1} = Q \Lambda^{-1} Q^\top$ where
$$\Lambda^{-1} = \begin{bmatrix} \frac{1}{\lambda_1} & 0 & \ldots & 0 \\ 0 & \frac{1}{\lambda_2} & \ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{\lambda_D} \end{bmatrix}.$$
This matrix is not well-defined when some eigenvalues are zero.
Uniqueness of the solution. $(Q \Lambda Q^\top)^{-1} = Q \Lambda^{-1} Q^\top$ does not exist when some eigenvalues are zero (e.g., $\lambda_1 = \lambda_2 = 0$); that is, when features are completely correlated, or more generally when the features are not linearly independent: there exist some $\{\alpha_d\}$ such that $\sum_d \alpha_d x_d = 0$.
Uniqueness of the solution. If the features are not linearly independent, there exist some $\{\alpha_d\}$ such that $\sum_d \alpha_d x_d = 0$, i.e., $X\alpha = 0$. Alternatively: if $w^*$ satisfies the normal equation $X^\top (y - Xw^*) = 0$, then $w^* + c\alpha$ is also a solution for any $c$, because $X(w^* + c\alpha) = Xw^* + cX\alpha = Xw^*$. Examples: having a binary feature $x_1$ as well as its negation $x_2 = 1 - x_1$; or having many features with $D \geq N$, as in the gene expression data (one row per patient $n$, one column per gene $d$).
Time complexity. For $w^* = (X^\top X)^{-1} X^\top y$: computing $X^\top y$ costs $O(ND)$ ($D$ elements, each using $N$ operations); computing $X^\top X$ costs $O(ND^2)$ ($D \times D$ elements, each requiring $N$ multiplications); the matrix inversion costs $O(D^3)$. The total complexity for $N > D$ is $O(ND^2 + D^3)$. In practice we don't directly use matrix inversion (it is numerically unstable); other, more stable solutions (e.g., Gaussian elimination) have similar complexity.
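As a sketch of the "avoid explicit inversion" point, NumPy's built-in least-squares routine uses a more stable factorization (SVD) and also copes with the rank-deficient case from the previous slides; this is one possible choice, not the only one:

```python
import numpy as np

def fit_least_squares_stable(X, y):
    """Least-squares fit without explicitly forming (X^T X)^{-1}."""
    w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return w
```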
Multiple targets. Instead of $y \in \mathbb{R}^N$ we have $Y \in \mathbb{R}^{N \times D'}$, with a different weight vector for each target; each column of $Y$ is associated with a column of $W$:
$$\hat{Y} = XW \quad (N \times D' = (N \times D)(D \times D')), \qquad W^* = (X^\top X)^{-1} X^\top Y.$$
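A minimal sketch for the multi-target case, assuming NumPy: the same normal equations are solved with a matrix right-hand side, giving one column of W per target.

```python
import numpy as np

def fit_multi_target(X, Y):
    """Solve X^T X W = X^T Y; column j of W holds the weights for target j."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```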
Feature engineering. So far we learned a linear function $f_w = \sum_d w_d x_d$; sometimes this may be too simplistic. Idea: create new, more useful features out of the initial set of given features, e.g., $x_1^2$, $x_1 x_2$, $\log(x_1)$. How about $x_1 + 2x_3$?
Nonlinear basis functions. So far we learned a linear function $f_w = \sum_d w_d x_d$. Let's denote the set of all features by $\phi_d(x)\ \forall d$. The problem of linear regression doesn't change: $f_w = \sum_d w_d \phi_d(x)$, where the (nonlinear) feature $\phi_d(x)$ plays the role of the new $x_d$. The solution simply becomes $\Phi^\top \Phi\, w^* = \Phi^\top y$, with $\Phi$ replacing $X$:
$$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$$
(one row per instance, one column per feature).
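A minimal sketch of building $\Phi$ and reusing the same least-squares fit, here with a polynomial basis; the data, basis choice, and degree are illustrative assumptions:

```python
import numpy as np

def polynomial_features(x, degree):
    """Phi[n, k] = phi_k(x^(n)) = (x^(n))^k for k = 0..degree (k = 0 gives the intercept column)."""
    return np.column_stack([x ** k for k in range(degree + 1)])

# the fit is unchanged: solve (Phi^T Phi) w = Phi^T y
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)
Phi = polynomial_features(x, degree=3)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ w
```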
Nonlinear basis functions. Example with a scalar original input $x \in \mathbb{R}$: polynomial bases $\phi_k(x) = x^k$; Gaussian bases $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$; sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$.
Example: Gaussian bases $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$, using a fixed standard deviation of $s = 1$. The fit is $\hat{y}^{(n)} = w_0 + \sum_k w_k \phi_k(x^{(n)})$. [Figure: the green curve (our fit) is the sum of these scaled Gaussian bases plus the intercept; each basis is scaled by its corresponding weight.]
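A sketch of such a fit with Gaussian bases, assuming NumPy; the target function, the centers $\mu_k$, and the scale $s = 1$ below are illustrative choices, not from the slides:

```python
import numpy as np

def gaussian_features(x, centers, s=1.0):
    """Phi[n, k] = exp(-(x^(n) - mu_k)^2 / s^2), with a leading column of ones for the intercept."""
    Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / s ** 2)
    return np.column_stack([np.ones(len(x)), Phi])

x = np.linspace(-5.0, 5.0, 50)
y = np.tanh(x) + 0.1 * np.random.default_rng(0).normal(size=50)   # made-up noisy target
Phi = gaussian_features(x, centers=np.linspace(-4.0, 4.0, 9))
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
y_hat = Phi @ w   # the fitted curve: intercept plus weighted sum of the Gaussian bumps
```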
Example: Sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$, using a fixed scale of $s = 1$. The fit is $\hat{y}^{(n)} = w_0 + \sum_k w_k \phi_k(x^{(n)})$. [Figure: the green curve (our fit) is the sum of these scaled sigmoid bases plus the intercept; each basis is scaled by its corresponding weight.]