5. Summary of linear regression so far
Main points
◮ Model/function/predictor class of linear regressors $x \mapsto w^\top x$.
◮ ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
◮ ERM solution for least squares: pick $w$ satisfying $A^\top A w = A^\top b$, which is not unique; one unique choice is the ordinary least squares solution $A^+ b$.
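As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that builds $A$ and $b$ from synthetic data, computes the ordinary least squares solution $A^+ b$, and checks that it satisfies the normal equations; the synthetic data and variable names are my own.

```python
import numpy as np

# Minimal sketch (synthetic data, not from the slides): least squares ERM via
# the pseudoinverse, plus a check of the normal equations.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                     # rows are the examples x_i^T
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                              # matrix A: rows x_i^T / sqrt(n)
b = y / np.sqrt(n)                              # vector b: entries y_i / sqrt(n)

w = np.linalg.pinv(A) @ b                       # ordinary least squares solution A^+ b
print(np.allclose(A.T @ A @ w, A.T @ b))        # normal equations hold: True
```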
Part 2 of linear regression lecture...
Recap on SVD. (A messy slide, I'm sorry.)
Suppose $0 \neq M \in \mathbb{R}^{n \times d}$, thus $r := \operatorname{rank}(M) > 0$.
◮ "Decomposition form" thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^\top$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$, and in general $M^+ M = \sum_{i=1}^r v_i v_i^\top \neq I$.
◮ "Factorization form" thin SVD: $M = U S V^\top$ with $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ having orthonormal columns (but $U U^\top$ and $V V^\top$ are not identity matrices in general), and $S = \operatorname{diag}(s_1, \ldots, s_r) \in \mathbb{R}^{r \times r}$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V S^{-1} U^\top$, and in general $M^+ M \neq M M^+ \neq I$.
◮ Full SVD: $M = U_f S_f V_f^\top$ with $U_f \in \mathbb{R}^{n \times n}$ and $V_f \in \mathbb{R}^{d \times d}$ orthogonal, so $U_f^\top U_f$ and $V_f^\top V_f$ are identity matrices, and $S_f \in \mathbb{R}^{n \times d}$ is zero everywhere except the first $r$ diagonal entries, which are $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V_f S_f^+ U_f^\top$, where $S_f^+$ is obtained by transposing $S_f$ and inverting its nonzero entries, and in general $M^+ M \neq M M^+ \neq I$. Additional property: agreement with the eigendecompositions of $M M^\top$ and $M^\top M$.
The "full SVD" adds columns to $U$ and $V$ which hit zeros of $S$ and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
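To make the different forms concrete, the following sketch (my own check on an arbitrary rank-deficient matrix, not from the slides) rebuilds $M$ from the decomposition form, recovers the pseudoinverse from the thin factorization, and confirms that $M^+ M$ is not the identity.

```python
import numpy as np

# Sketch (my own check): thin vs. full SVD of a rank-deficient matrix,
# and the pseudoinverse built from the thin factorization.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # rank r = 2, shape 5x4

Uf, sf, Vft = np.linalg.svd(M, full_matrices=True)      # full SVD
U, s, Vt = np.linalg.svd(M, full_matrices=False)        # "thin" SVD (still carries tiny singular values)
r = np.sum(s > 1e-10)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]                   # keep only the r nonzero singular values

# Decomposition form: M = sum_i s_i u_i v_i^T
M_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
print(np.allclose(M, M_rebuilt))                        # True

# Pseudoinverse from the thin factorization: M^+ = V S^{-1} U^T
M_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(M_pinv, np.linalg.pinv(M)))           # True

# M^+ M is a projection onto the row space, not the identity (since r < d)
print(np.allclose(M_pinv @ M, np.eye(4)))               # False
```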
Recap on SVD, zero matrix case
Suppose $0 = M \in \mathbb{R}^{n \times d}$, thus $r := \operatorname{rank}(M) = 0$.
◮ In all forms of the SVD, $M^+$ is $M^\top$ (another zero matrix).
◮ Technically speaking, $s$ is a singular value of $M$ iff there exist nonzero vectors $(u, v)$ with $M v = s u$ and $M^\top u = s v$; the zero matrix therefore has no singular values (or left/right singular vectors).
◮ The "factorization form" thin SVD becomes a little messy.
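A quick numerical sanity check of this convention (my own, not from the slides):

```python
import numpy as np

# The pseudoinverse of the n-by-d zero matrix is the d-by-n zero matrix (its transpose).
Z = np.zeros((3, 2))
print(np.array_equal(np.linalg.pinv(Z), Z.T))   # True
```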
6. More on the normal equations
Recall our matrix notation
Let labeled examples $((x_i, y_i))_{i=1}^n$ be given. Define the $n \times d$ matrix $A$ and $n \times 1$ column vector $b$ by
$$A := \frac{1}{\sqrt{n}} \begin{pmatrix} \leftarrow x_1^\top \rightarrow \\ \vdots \\ \leftarrow x_n^\top \rightarrow \end{pmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$
Can write empirical risk as
$$\widehat{R}(w) = \frac{1}{n} \sum_{i=1}^n \left( y_i - x_i^\top w \right)^2 = \| A w - b \|_2^2.$$
Necessary condition for $w$ to be a minimizer of $\widehat{R}$: $\nabla \widehat{R}(w) = 0$, i.e., $w$ is a critical point of $\widehat{R}$.
This translates to
$$(A^\top A) w = A^\top b,$$
a system of linear equations called the normal equations.
We'll now finally show that the normal equations imply optimality.
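Here is a small illustrative sketch (not from the slides) verifying numerically that the gradient of $\widehat{R}(w) = \|Aw - b\|^2$ is $2 A^\top (A w - b)$, so setting it to zero is exactly the normal equations; the random data is my own.

```python
import numpy as np

# Check the closed-form gradient of R_hat(w) = ||Aw - b||^2 against
# central finite differences.
rng = np.random.default_rng(1)
n, d = 30, 4
A = rng.normal(size=(n, d)) / np.sqrt(n)
b = rng.normal(size=n) / np.sqrt(n)

def risk(w):
    return np.sum((A @ w - b) ** 2)

w = rng.normal(size=d)
grad = 2 * A.T @ (A @ w - b)                      # closed-form gradient
eps = 1e-6
grad_fd = np.array([(risk(w + eps * e) - risk(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])          # central finite differences
print(np.allclose(grad, grad_fd, atol=1e-5))      # True
```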
Normal equations imply optimality
Consider $w$ with $A^\top A w = A^\top b$, and any $w'$; then
$$\| A w' - b \|^2 = \| A w' - A w + A w - b \|^2 = \| A w' - A w \|^2 + 2 (A w' - A w)^\top (A w - b) + \| A w - b \|^2.$$
Since $(A w' - A w)^\top (A w - b) = (w' - w)^\top (A^\top A w - A^\top b) = 0$, then
$$\| A w' - b \|^2 = \| A w' - A w \|^2 + \| A w - b \|^2 \geq \| A w - b \|^2.$$
This means $w$ is optimal.
Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^\top$,
$$\| A w' - A w \|^2 = (w' - w)^\top (A^\top A)(w' - w) = (w' - w)^\top \Big( \sum_{i=1}^r s_i^2 v_i v_i^\top \Big) (w' - w),$$
so $w'$ is also optimal iff $w' - w$ is in the right nullspace of $A$.
(We'll revisit all this with convexity later.)
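The following sketch (my own illustration, not from the slides) checks this on a rank-deficient $A$: adding a nullspace direction to a normal-equations solution leaves the empirical risk unchanged, so the minimizer is not unique.

```python
import numpy as np

# For rank(A) < d, perturbing a solution of the normal equations along the
# nullspace of A does not change ||Aw - b||^2.
rng = np.random.default_rng(2)
n, d, r = 20, 5, 3
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, d)) / np.sqrt(n)   # rank 3 < d
b = rng.normal(size=n) / np.sqrt(n)

w = np.linalg.pinv(A) @ b                       # one solution of the normal equations
_, _, Vt = np.linalg.svd(A)
z = Vt[-1]                                      # a right-nullspace direction of A (since rank < d)
print(np.allclose(A @ z, 0))                    # True: z lies in the nullspace
print(np.isclose(np.sum((A @ w - b) ** 2),
                 np.sum((A @ (w + 3.0 * z) - b) ** 2)))   # True: same empirical risk
```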
Geometric interpretation of least squares ERM
Let $a_j \in \mathbb{R}^n$ be the $j$-th column of matrix $A \in \mathbb{R}^{n \times d}$, so
$$A = \begin{pmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{pmatrix}.$$
Minimizing $\| A w - b \|_2^2$ is the same as finding the vector $\hat{b} \in \operatorname{range}(A)$ closest to $b$.
The solution $\hat{b}$ is the orthogonal projection of $b$ onto $\operatorname{range}(A) = \{ A w : w \in \mathbb{R}^d \}$.
◮ $\hat{b}$ is uniquely determined; indeed, $\hat{b} = A A^+ b = \sum_{i=1}^r u_i u_i^\top b$.
◮ If $r = \operatorname{rank}(A) < d$, then there is more than one way to write $\hat{b}$ as a linear combination of $a_1, \ldots, a_d$.
If $\operatorname{rank}(A) < d$, then the ERM solution is not unique.
[Figure: $b$ and its orthogonal projection $\hat{b}$ onto the plane spanned by the columns $a_1$, $a_2$.]
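A small sketch of this picture in NumPy (illustrative, with my own made-up rank-deficient data): the fitted vector $\hat{b} = A A^+ b$ is the orthogonal projection of $b$ onto $\operatorname{range}(A)$, and when $\operatorname{rank}(A) < d$ several different $w$'s produce that same $\hat{b}$.

```python
import numpy as np

# Projection interpretation of least squares, with a rank-deficient design.
rng = np.random.default_rng(3)
n = 10
A = np.column_stack([rng.normal(size=n), rng.normal(size=n)])
A = np.column_stack([A, A @ rng.normal(size=(2, 2))])     # d = 4 columns, rank 2
b = rng.normal(size=n)

b_hat = A @ np.linalg.pinv(A) @ b
print(np.allclose(A.T @ (b - b_hat), 0))    # residual is orthogonal to every column of A

w1 = np.linalg.pinv(A) @ b                  # minimum-norm ERM solution
_, _, Vt = np.linalg.svd(A)
w2 = w1 + Vt[-1]                            # another ERM solution (differs by a nullspace vector)
print(np.allclose(A @ w1, A @ w2))          # True: both give the same b_hat
```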