Applied Machine Learning
Linear Regression
Siamak Ravanbakhsh
COMP 551 (winter 2020)
Learning objectives
- linear model
- evaluation criteria
- how to find the best fit
- geometric interpretation
Motivation
History: the method of least squares was invented by Legendre and Gauss (early 1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).
Example application: the effect of income inequality on health and social problems.
source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/
Motivation (?) [image-only slide]
Representing data
Each instance $x^{(n)} \in \mathbb{R}^D$ has a label $y^{(n)} \in \mathbb{R}$.
Vectors are assumed to be column vectors: $x = [x_1, x_2, \ldots, x_D]^\top$.
We assume N instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$.
Each instance has D features indexed by d; for example, $x_d^{(n)} \in \mathbb{R}$ is feature d of instance n.
Representing data
Design matrix: concatenate all instances; each row is a datapoint, each column is a feature.
$$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times D}$$
Representing data
Example: microarray data (X) contains gene expression levels; rows index patients (n), columns index genes (d), so $X \in \mathbb{R}^{N \times D}$.
The labels (y) can be {cancer / no cancer} for each patient.
Linear model
Assuming a scalar output, $f_w : \mathbb{R}^D \to \mathbb{R}$ (we will generalize to a vector output later):
$$f_w(x) = w_0 + w_1 x_1 + \ldots + w_D x_D$$
The $w_d$ are the model parameters or weights; $w_0$ is the bias or intercept.
yh_n = np.dot(w, x)
Simplification: concatenate a 1 to x, i.e. $x = [1, x_1, \ldots, x_D]^\top$, so that $f_w(x) = w^\top x$.
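A minimal sketch of this simplification (the bias trick), assuming x is a length-D numpy array and w holds D + 1 weights including the bias; the variable names are illustrative, not from the slides:

import numpy as np

D = 3
x = np.random.randn(D)               # one instance with D features
w = np.random.randn(D + 1)           # weights, including the bias w_0

x_aug = np.concatenate(([1.0], x))   # prepend a constant 1 for the bias
yh_n = np.dot(w, x_aug)              # prediction f_w(x) = w^T x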
Loss function
Objective: find parameters that fit the data, i.e. $f_w(x^{(n)}) \approx y^{(n)} \;\; \forall n$, by minimizing a measure of the difference between $\hat{y}^{(n)} = f_w(x^{(n)})$ and $y^{(n)}$.
Squared error loss (a.k.a. L2 loss) for a single instance: $L(y, \hat{y}) \triangleq \frac{1}{2}(y - \hat{y})^2$ (the $\frac{1}{2}$ is for future convenience).
For the whole dataset we use the sum-of-squared-errors cost function
$$J(w) = \frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} - w^\top x^{(n)} \right)^2$$
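A minimal sketch of this cost, assuming X is the design matrix (bias column included), y the label vector, and w the weights; the function name is mine, not from the slides:

import numpy as np

def sse_cost(w, X, y):
    """Sum-of-squared-errors cost J(w) = 1/2 * sum_n (y_n - w^T x_n)^2."""
    residual = y - X.dot(w)
    return 0.5 * np.sum(residual ** 2)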
Example (D = 1; with the bias, D = 2!)
The model is a line $f_{w^*}(x) = w_0^* + w_1^* x$ fit to the points $(x^{(n)}, y^{(n)})$; the figure shows the residual $y^{(3)} - f(x^{(3)})$ for one point. With the bias trick, $x = [1, x_1]^\top$.
Linear least squares: $\min_w \sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$
Example (D = 2; with the bias, D = 3!)
The model is a plane $f_{w^*}(x) = w_0^* + w_1^* x_1 + w_2^* x_2$ fit to points in $(x_1, x_2, y)$ space.
Linear least squares: $w^* = \arg\min_w \sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$
Matrix form
Instead of $\hat{y}^{(n)} = w^\top x^{(n)}$ (a $1 \times D$ row times a $D \times 1$ column), use the design matrix to write $\hat{y} = Xw$, where $\hat{y}$ is $N \times 1$, $X$ is $N \times D$, and $w$ is $D \times 1$.
Linear least squares: $\arg\min_w \frac{1}{2}\|y - Xw\|^2 = \frac{1}{2}(y - Xw)^\top (y - Xw)$, the squared L2 norm of the residual vector.
yh = np.dot(X, w)
cost = np.sum((yh - y)**2)/2.  # or cost = np.mean((yh - y)**2)/2.
Minimizing the cost
The figure contrasts data space (the fitted line over x and y) with weight space (the cost surface over $w_0, w_1$).
The objective is a smooth function of w; we find the minimum by setting the partial derivatives to zero.
image: Grosse, Farahmand, Carrasquilla
Simple case: D = 1
Model: $f_w(x) = wx$ (both w and x are scalars).
Cost function: $J(w) = \frac{1}{2} \sum_n \left(y^{(n)} - w x^{(n)}\right)^2$
Derivative: $\frac{dJ}{dw} = \sum_n \left(w x^{(n)} - y^{(n)}\right) x^{(n)}$
Setting the derivative to zero: $w = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n (x^{(n)})^2}$
This is the global minimum because the cost is smooth and convex (more on convexity later).
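A quick sketch of this D = 1 closed form, assuming x and y are 1-D numpy arrays and, as on the slide, no bias term; the data below is purely illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = np.sum(x * y) / np.sum(x ** 2)   # minimizer of J(w) = 1/2 sum_n (y_n - w x_n)^2
yh = w * x                           # predictions of the fitted line through the origin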
Gradient
For a multivariate function $J(w_0, w_1)$ we use partial derivatives instead of the derivative: the derivative with respect to one variable while the others are held fixed,
$$\frac{\partial}{\partial w_1} J(w_0, w_1) \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}$$
Critical point: all partial derivatives are zero.
Gradient: the vector of all partial derivatives, $\nabla J(w) = \left[\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)\right]^\top$.
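A small finite-difference sketch of this definition, useful as a sanity check against an analytic gradient; J is assumed to be any cost function of a weight vector (names are mine, not from the slides):

import numpy as np

def numerical_gradient(J, w, eps=1e-6):
    """Approximate each partial derivative of J at w with a finite difference."""
    grad = np.zeros_like(w)
    for d in range(len(w)):
        w_step = w.copy()
        w_step[d] += eps
        grad[d] = (J(w_step) - J(w)) / eps
    return grad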
Finding w (any D)
Setting $\frac{\partial}{\partial w_d} J(w) = 0$:
$$\frac{\partial}{\partial w_d} \frac{1}{2} \sum_n \left(y^{(n)} - f_w(x^{(n)})\right)^2 = 0$$
Using the chain rule, $\frac{\partial J}{\partial w_d} = \frac{dJ}{df} \frac{\partial f}{\partial w_d}$, we get
$$\sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_d^{(n)} = 0 \quad \forall d \in \{1, \ldots, D\}$$
The cost is a smooth and convex function of w.
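These D conditions can be collected into a single vector expression; a minimal sketch, assuming numpy arrays X of shape (N, D), y of shape (N,), and w of shape (D,):

import numpy as np

def gradient(w, X, y):
    """Gradient of J(w) = 1/2 ||y - Xw||^2; entry d is sum_n (w^T x_n - y_n) x_nd."""
    return X.T.dot(X.dot(w) - y)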
Normal equation
A system of D linear equations: $\sum_n \left(y^{(n)} - w^\top x^{(n)}\right) x_d^{(n)} = 0 \;\; \forall d$.
In matrix form (using the design matrix): $X^\top (y - Xw) = 0$, where $X^\top$ is $D \times N$ and $(y - Xw)$ is $N \times 1$; each row enforces one of the D equations.
This is called the normal equation because, for the optimal w, the residual vector $y - Xw$ is normal (orthogonal) to the column space of the design matrix; the figure shows $y \in \mathbb{R}^N$ projected onto the span of the columns $x_1, x_2$, giving $\hat{y}$.
Direct solution
We can get a closed-form solution:
$$X^\top (y - Xw) = 0 \;\Rightarrow\; X^\top X w = X^\top y \;\Rightarrow\; w^* = (X^\top X)^{-1} X^\top y$$
$(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X (shapes: $D \times D$ times $D \times N$ times $N \times 1$).
$$\hat{y} = Xw^* = X (X^\top X)^{-1} X^\top y$$
$X (X^\top X)^{-1} X^\top$ is the projection matrix onto the column space of X.
w = np.linalg.lstsq(X, y)[0]
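A hedged sketch of this closed form, assuming a tall X with linearly independent columns (the random data is illustrative); it checks that solving the normal equation agrees with numpy's least-squares routine:

import numpy as np

N, D = 100, 5
X = np.random.randn(N, D)
y = np.random.randn(N)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)    # solves X^T X w = X^T y
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # numerically stabler least-squares solver
assert np.allclose(w_closed, w_lstsq)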
Time complexity
For $w^* = (X^\top X)^{-1} X^\top y$:
- $X^\top y$: D elements, each using N ops: $O(ND)$
- $X^\top X$: $D \times D$ elements, each requiring N multiplications: $O(D^2 N)$
- matrix inversion: $O(D^3)$
Total complexity for $N > D$ is $O(ND^2 + D^3)$.
In practice we don't directly use matrix inversion (it is numerically unstable).
Multiple targets
Instead of $y \in \mathbb{R}^N$ we have $Y \in \mathbb{R}^{N \times D'}$, with a different weight vector for each target; each column of Y is associated with a column of W.
$$\hat{Y} = XW \qquad (N \times D' = N \times D \text{ times } D \times D')$$
$$W^* = (X^\top X)^{-1} X^\top Y \qquad (D \times D \text{ times } D \times N \text{ times } N \times D')$$
W = np.linalg.lstsq(X, Y)[0]
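A minimal multi-target sketch showing the shapes involved (the random data and dimensions below are purely illustrative):

import numpy as np

N, D, Dp = 50, 4, 3
X = np.random.randn(N, D)
Y = np.random.randn(N, Dp)

W = np.linalg.lstsq(X, Y, rcond=None)[0]   # D x D' weight matrix, one column per target
Yh = X @ W                                 # N x D' predictions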
Nonlinear basis functions
So far we learned a linear function $f = \sum_d w_d x_d$. Nothing changes if we use nonlinear bases: $f = \sum_d w_d \phi_d(x)$.
The solution simply becomes $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, with $\Phi$ replacing X:
$$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$$
Each row corresponds to one instance; each column is a (nonlinear) feature.
Nonlinear basis functions
Examples where the original input $x \in \mathbb{R}$ is a scalar:
- polynomial bases: $\phi_k(x) = x^k$
- Gaussian bases: $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
- sigmoid bases: $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$
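A brief sketch of these three basis families as numpy functions (the bandwidth s and the grid of centers mu are illustrative choices, not prescribed by the slides):

import numpy as np

poly = lambda x, k: x ** k
gauss = lambda x, mu, s=1.0: np.exp(-(x - mu) ** 2 / s ** 2)
sigmoid = lambda x, mu, s=1.0: 1.0 / (1.0 + np.exp(-(x - mu) / s))

x = np.linspace(0, 10, 100)
mu = np.linspace(0, 10, 10)
Phi = gauss(x[:, None], mu[None, :])   # N x 10 Gaussian feature matrix via broadcasting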
Example: Gaussian bases
$\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
Data: $y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon$, where $\epsilon$ is noise (the figure also shows the function before adding noise).

#x: N
#y: N
plt.plot(x, y, 'b.')
phi = lambda x,mu: np.exp(-(x-mu)**2)
mu = np.linspace(0,10,10) #10 Gaussian bases
Phi = phi(x[:,None], mu[None,:]) #N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi,w)
plt.plot(x, yh, 'g-')

Our fit to the data using 10 Gaussian bases.
Example: Sigmoid bases
$\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$
Data: $y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon$

#x: N
#y: N
plt.plot(x, y, 'b.')
phi = lambda x,mu: 1/(1 + np.exp(-(x - mu)))
mu = np.linspace(0,10,10) #10 sigmoid bases
Phi = phi(x[:,None], mu[None,:]) #N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi,w)
plt.plot(x, yh, 'g-')

Our fit to the data using 10 sigmoid bases.
Problematic settings
In $W^* = (X^\top X)^{-1} X^\top Y$:
What if we have a large dataset ($N > 100{,}000{,}000$)? Use stochastic gradient descent (later!).
What if $X^\top X$ is not invertible? This happens when the columns of X (features) are not linearly independent (either redundant features or D > N), so W* is not unique. Options:
- decomposition-based methods still work and find one of the solutions: W = np.linalg.lstsq(X, Y)[0] (not discussed)
- make the solution unique by removing redundant features
- use gradient descent (later!)
- regularization (later!)
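A small sketch of the non-invertible case, with an intentionally redundant (duplicated) feature column; all data here is illustrative:

import numpy as np

N = 20
x1 = np.random.randn(N)
X = np.column_stack([x1, x1, np.ones(N)])   # second column duplicates the first: X^T X is singular
y = 3 * x1 + 1 + 0.1 * np.random.randn(N)

# inverting X^T X is ill-posed here; lstsq (SVD-based) still returns one of the solutions
w = np.linalg.lstsq(X, y, rcond=None)[0]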