

  1. COMS 4721: Machine Learning for Data Science, Lecture 2, 1/19/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University

  2. LINEAR REGRESSION

  3. EXAMPLE: OLD FAITHFUL

  4. EXAMPLE: OLD FAITHFUL
  [Figure: scatter plot of Waiting Time (min) vs. Current Eruption Time (min) for Old Faithful eruptions.]
  Can we meaningfully predict the time between eruptions only using the duration of the last eruption?

  5. EXAMPLE: OLD FAITHFUL
  [Figure: the same scatter plot of Waiting Time (min) vs. Current Eruption Time (min), repeated from the previous slide.]
  Can we meaningfully predict the time between eruptions only using the duration of the last eruption?

  6. EXAMPLE: OLD FAITHFUL
  [Figure: the same scatter plot of Waiting Time (min) vs. Current Eruption Time (min).]
  One model for this:
      (wait time) ≈ w_0 + (last duration) × w_1
  ◮ w_0 and w_1 are to be learned.
  ◮ This is an example of linear regression.
  Refresher: w_1 is the slope; w_0 is called the intercept, bias, shift, or offset.
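  A minimal sketch of fitting these two parameters with NumPy. The duration/wait values below are hypothetical placeholders (not the real Old Faithful data); np.polyfit with degree 1 returns the least-squares slope and intercept.

```python
import numpy as np

# Hypothetical Old Faithful-style measurements (illustrative values only):
# duration of the last eruption (min) and the following waiting time (min).
duration = np.array([1.8, 2.3, 3.3, 3.9, 4.5])
wait = np.array([54.0, 60.0, 74.0, 82.0, 90.0])

# Fit (wait time) ≈ w0 + (last duration) * w1 by least squares.
w1, w0 = np.polyfit(duration, wait, deg=1)  # deg=1 returns [slope, intercept]

print(f"w0 (intercept) = {w0:.2f}, w1 (slope) = {w1:.2f}")
print(f"predicted wait after a 3.0-min eruption: {w0 + 3.0 * w1:.1f} min")
```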

  7. HIGHER DIMENSIONS
  Two inputs:
      (output) ≈ w_0 + (input 1) × w_1 + (input 2) × w_2
  With two inputs the intuition is the same:
      y = w_0 + x_1 w_1 + x_2 w_2
  [Figure: a plane fit to data in three dimensions, with axes x_1, x_2, and y.]

  8. REGRESSION: PROBLEM DEFINITION
  Data
  Input: x ∈ R^d (i.e., measurements, covariates, features, independent variables)
  Output: y ∈ R (i.e., response, dependent variable)
  Goal
  Find a function f : R^d → R such that y ≈ f(x; w) for the data pair (x, y). f(x; w) is called a regression function. Its free parameters are w.
  Definition of linear regression
  A regression method is called linear if the prediction f is a linear function of the unknown parameters w.

  9. LEAST SQUARES LINEAR REGRESSION MODEL
  Model
  The linear regression model we focus on now has the form
      y_i ≈ f(x_i; w) = w_0 + Σ_{j=1}^d x_ij w_j.
  Model learning
  We have the set of training data (x_1, y_1), ..., (x_n, y_n). We want to use this data to learn a w such that y_i ≈ f(x_i; w). But we first need an objective function to tell us what a "good" value of w is.
  Least squares
  The least squares objective tells us to pick the w that minimizes the sum of squared errors,
      w_LS = arg min_w Σ_{i=1}^n (y_i − f(x_i; w))^2 ≡ arg min_w L.
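  A small sketch of the objective itself, assuming the inputs are stacked in a NumPy array X (one row per x_i, intercept handled separately as w0) and the responses in y; it simply evaluates the sum of squared errors for a candidate w.

```python
import numpy as np

def least_squares_objective(w0, w, X, y):
    """Sum of squared errors L = sum_i (y_i - w0 - x_i^T w)^2."""
    residuals = y - (w0 + X @ w)  # one residual per training pair (x_i, y_i)
    return np.sum(residuals ** 2)

# Hypothetical usage: X is (n, d), w is (d,), y is (n,).
# L = least_squares_objective(1.0, np.array([0.5, -0.2]), X, y)
```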

  10. LEAST SQUARES IN PICTURES
  Observations:
  ◮ Vertical length is error.
  ◮ The objective function L is the sum of all the squared lengths.
  ◮ Find weights (w_1, w_2) plus an offset w_0 to minimize L.
  (w_0, w_1, w_2) defines this plane.

  11. EXAMPLE: EDUCATION, SENIORITY AND INCOME
  2-dimensional problem
  Input: (education, seniority) ∈ R^2. Output: (income) ∈ R
  Model: (income) ≈ w_0 + (education) w_1 + (seniority) w_2
  Question: Both w_1, w_2 > 0. What does this tell us?
  Answer: As education and/or seniority goes up, income tends to go up. (Caveat: This is a statement about correlation, not causation.)

  12. LEAST SQUARES LINEAR REGRESSION MODEL
  Thus far
  We have data pairs (x_i, y_i) of measurements x_i ∈ R^d and a response y_i ∈ R. We believe there is a linear relationship between x_i and y_i,
      y_i = w_0 + Σ_{j=1}^d x_ij w_j + ε_i,
  and we want to minimize the objective function
      L = Σ_{i=1}^n ε_i^2 = Σ_{i=1}^n (y_i − w_0 − Σ_{j=1}^d x_ij w_j)^2
  with respect to (w_0, w_1, ..., w_d).
  Can math notation make this easier to look at/work with?

  13. NOTATION: VECTORS AND MATRICES
  We think of data with d dimensions as a column vector:
      x_i = [x_i1, x_i2, ..., x_id]^T    (e.g., age, height, ..., income)
  A set of n vectors can be stacked into a matrix:
      X = [x_1^T ; x_2^T ; ... ; x_n^T], the n × d matrix with entry x_ij in row i, column j.
  Assumptions for now:
  ◮ All features are treated as continuous-valued (x ∈ R^d)
  ◮ We have more observations than dimensions (d < n)

  14. NOTATION: REGRESSION (AND CLASSIFICATION)
  Usually, for linear regression (and classification) we include an intercept term w_0 that doesn't interact with any element in the vector x ∈ R^d. It will be convenient to attach a 1 to the first dimension of each vector x_i (which we indicate by x_i ∈ R^{d+1}) and in the first column of the matrix X:
      x_i = [1, x_i1, x_i2, ..., x_id]^T,    X = [x_1^T ; x_2^T ; ... ; x_n^T], the n × (d+1) matrix whose i-th row is (1, x_i1, ..., x_id).
  We also now view w = [w_0, w_1, ..., w_d]^T as w ∈ R^{d+1}.
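  A brief sketch of building this augmented design matrix, assuming the raw inputs are stacked row-wise in a hypothetical NumPy array X_raw of shape (n, d); the prepended column of ones absorbs the intercept w_0.

```python
import numpy as np

# Hypothetical raw inputs: n = 4 observations, d = 2 features each.
X_raw = np.array([[2.0, 5.0],
                  [1.0, 3.0],
                  [4.0, 7.0],
                  [3.0, 2.0]])

# Attach a 1 as the first entry of every x_i, giving the n x (d + 1) matrix X,
# so the intercept w0 can be learned as the first entry of w = [w0, w1, ..., wd]^T.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X)  # first column is all ones
```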

  15. LEAST SQUARES IN VECTOR FORM
  Original least squares objective function:
      L = Σ_{i=1}^n (y_i − w_0 − Σ_{j=1}^d x_ij w_j)^2
  Using vectors, this can now be written:
      L = Σ_{i=1}^n (y_i − x_i^T w)^2
  Least squares solution (vector version)
  We can find w by setting
      ∇_w L = 0  ⇒  Σ_{i=1}^n ∇_w (y_i^2 − 2 w^T x_i y_i + w^T x_i x_i^T w) = 0.
  Solving gives
      Σ_{i=1}^n (2 x_i x_i^T w − 2 y_i x_i) = 0  ⇒  w_LS = (Σ_{i=1}^n x_i x_i^T)^{-1} (Σ_{i=1}^n y_i x_i).
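  A sketch of this closed-form solution, assuming the augmented design matrix X from the previous slide and a matching response vector y. Since Σ_i x_i x_i^T = X^T X and Σ_i y_i x_i = X^T y, w_LS can be computed by solving one linear system; solving is numerically preferable to forming the inverse explicitly.

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: solve (X^T X) w = X^T y for w.

    X : (n, d + 1) design matrix with a leading column of ones
    y : (n,) response vector
    """
    # Equivalent to w_LS = (sum_i x_i x_i^T)^{-1} (sum_i y_i x_i).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical usage with the augmented X built above and illustrative responses:
# y = np.array([9.0, 5.0, 14.0, 7.0])
# w_ls = fit_least_squares(X, y)   # w_ls[0] is the intercept w0
# y_hat = X @ w_ls                 # fitted values
```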
