
RegML 2020 Class 3 Early Stopping and Spectral Regularization - PowerPoint PPT Presentation



  1. RegML 2020 Class 3: Early Stopping and Spectral Regularization. Lorenzo Rosasco, UNIGE-MIT-IIT

  2. Learning problem. Solve $\min_w E(w)$, $E(w) = \int d\rho(x, y)\, L(w^\top x, y)$, given $(x_1, y_1), \ldots, (x_n, y_n)$. Beyond linear models: non-linear features and kernels.

  3. Regularization by penalization. Replace $\min_w \hat E(w)$ by $\min_w \hat E_\lambda(w)$, with $\hat E_\lambda(w) = \hat E(w) + \lambda \|w\|^2$, where
     - $\hat E(w) = \frac{1}{n}\sum_{i=1}^n L(w^\top x_i, y_i)$
     - $\lambda > 0$ is the regularization parameter

  4. Loss functions and computational methods.
     - Logistic loss: $\log(1 + e^{-y w^\top x})$
     - Hinge loss: $|1 - y w^\top x|_+$
     $w_{t+1} = w_t - \gamma_t \nabla \hat E_\lambda(w_t)$, ...

  5. Square loss: $(1 - y w^\top x)^2 = (y - w^\top x)^2$ (for labels $y \in \{-1, +1\}$).

  6. Square loss: $(1 - y w^\top x)^2 = (y - w^\top x)^2$. Then $\hat E(w) = \frac{1}{n}\|\hat X w - \hat y\|^2$ and $\hat E_\lambda(w) = \hat E(w) + \lambda \|w\|^2$, with
     - $\hat X$ the $n \times d$ data matrix
     - $\hat y$ the $n \times 1$ output vector

  7. Ridge regression / Tikhonov regularization. $\hat E_\lambda(w) = \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|^2$ is smooth and strongly convex. Setting the gradient to zero, $\nabla \hat E_\lambda(w) = \frac{2}{n}\hat X^\top(\hat X w - \hat y) + 2\lambda w = 0 \;\Rightarrow\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$.

  8. Linear systems. $(\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$
     - $nd^2$ operations to form $\hat X^\top \hat X$
     - roughly $d^3$ to solve the linear system
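
A minimal numpy sketch of this computation (problem sizes, variable names, and the random data are illustrative, not from the slides): form the $d \times d$ matrix $\hat X^\top \hat X + \lambda n I$ and solve the resulting linear system.

```python
import numpy as np

# Sketch of ridge regression via the normal equations
# (X^T X + lambda*n*I) w = X^T y; sizes and data are illustrative.
rng = np.random.default_rng(0)
n, d, lam = 100, 10, 0.1

X = rng.standard_normal((n, d))                  # n x d data matrix
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

A = X.T @ X + lam * n * np.eye(d)                # ~ n d^2 operations to form
w = np.linalg.solve(A, X.T @ y)                  # ~ d^3 to solve the system
```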

  9. Representer theorem for square loss. $f(x) = x^\top w \;\Rightarrow\; f(x) = \sum_{i=1}^n x^\top x_i\, c_i$

  10. Representer theorem for square loss. $f(x) = x^\top w \;\Rightarrow\; f(x) = \sum_{i=1}^n x^\top x_i\, c_i$. Using the SVD of $\hat X$, $w = (\hat X^\top \hat X + \lambda n I)^{-1}\hat X^\top \hat y = \hat X^\top \underbrace{(\hat X \hat X^\top + \lambda n I)^{-1}\hat y}_{c}$, so $w = \hat X^\top c = \sum_{i=1}^n x_i c_i$.
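
A small numerical check of this identity (illustrative data and names): the primal $d \times d$ system and the dual $n \times n$ system give the same solution.

```python
import numpy as np

# Sketch: (X^T X + lam*n*I)^{-1} X^T y  ==  X^T (X X^T + lam*n*I)^{-1} y.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)
c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)   # n coefficients c
w_dual = X.T @ c                                        # w = sum_i x_i c_i

assert np.allclose(w_primal, w_dual)
```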

  11. Beyond linear models. $f(x) = x^\top w = \sum_{i=1}^n x^\top x_i\, c_i$, with $w = (\hat X^\top \hat X + \lambda n I)^{-1}\hat X^\top \hat y$ and $c = (\hat X \hat X^\top + \lambda n I)^{-1}\hat y$.
     - Non-linear features: $x \mapsto \Phi(x) = (\varphi_1(x), \ldots, \varphi_n(x))$, giving a non-linear function $f(x) = \Phi(x)^\top w$
     - Non-linear kernels: $\hat K = \hat X \hat X^\top$, $f(x) = \sum_{i=1}^n K(x, x_i)\, c_i$

  12. Interlude: linear systems and stability. $Aw = y$, $A = \mathrm{diag}(a_1, \ldots, a_d)$, condition number $a_1 / a_d < \infty$; then $w = A^{-1}y$, $A^{-1} = \mathrm{diag}(a_1^{-1}, \ldots, a_d^{-1})$. More generally, $A = U\Sigma U^\top$ with $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$, and $A^{-1} = U\Sigma^{-1}U^\top$ with $\Sigma^{-1} = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_d^{-1})$.

  13. Tikhonov regularization. $\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 \;\mapsto\; \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2$, and correspondingly $\hat X^\top \hat X w = \hat X^\top \hat y \;\mapsto\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$. Overfitting and numerical stability.

  14. Beyond Tikhonov: TSVD. $\hat X^\top \hat X = V \Sigma V^\top$, $w_M = (\hat X^\top \hat X)^{-1}_M \hat X^\top \hat y$
     - $(\hat X^\top \hat X)^{-1}_M = V \Sigma^{-1}_M V^\top$
     - $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_M^{-1}, 0, \ldots, 0)$
     Also known as principal component regression (PCR).
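
A minimal sketch of TSVD/PCR (the function name and the use of numpy's SVD are illustrative): compute the SVD of the data matrix and invert only the top-$M$ singular values.

```python
import numpy as np

# Sketch of TSVD / principal component regression: w_M = V diag(s)^{-1}_M U^T y.
def tsvd_fit(X, y, M):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T, s descending
    s_inv = np.zeros_like(s)
    s_inv[:M] = 1.0 / s[:M]                           # keep only the top-M components
    return Vt.T @ (s_inv * (U.T @ y))
```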

  15. Principal component analysis (PCA). Dimensionality reduction: $\hat X^\top \hat X = V \Sigma V^\top$. The eigenvectors are the directions of
     - maximum variance
     - best reconstruction

  16. TSVD and PCA. TSVD $\Leftrightarrow$ PCA + ERM. Regularization by projection.

  17. TSVD/PCR beyond linearity. Non-linear function $f(x) = \sum_{i=1}^p w_i \varphi_i(x) = \Phi(x)^\top w$ with $w = (\hat\Phi^\top \hat\Phi)^{-1}_M \hat\Phi^\top \hat y$, $\hat\Phi = (\Phi(x_1), \ldots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$. Let $\hat\Phi^\top \hat\Phi = V \Sigma V^\top$, $(\hat\Phi^\top \hat\Phi)^{-1}_M = V \Sigma^{-1}_M V^\top$, with $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$ and $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_M^{-1}, 0, \ldots)$.

  18. TSVD/PCR with kernels. $f(x) = \sum_{i=1}^n K(x, x_i)\, c_i$, $c = \hat K^{-1}_M \hat y$, where $\hat K_{ij} = K(x_i, x_j)$, $\hat K = U \Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, $\hat K^{-1}_M = U \Sigma^{-1}_M U^\top$, $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_M^{-1}, 0, \ldots)$.
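
The kernel version can be sketched the same way, eigendecomposing the kernel matrix instead of $\hat X^\top \hat X$ (function and variable names are illustrative):

```python
import numpy as np

# Sketch of kernel TSVD: c = K^{-1}_M y, inverting only the top-M eigenvalues.
def kernel_tsvd_fit(K, y, M):
    sigma, U = np.linalg.eigh(K)            # eigenvalues in ascending order
    sigma, U = sigma[::-1], U[:, ::-1]      # reorder so sigma_1 >= sigma_2 >= ...
    sigma_inv = np.zeros_like(sigma)
    sigma_inv[:M] = 1.0 / sigma[:M]
    return U @ (sigma_inv * (U.T @ y))      # c = U diag(sigma)^{-1}_M U^T y
```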

  19. Early stopping regularization. Another example of regularization: early stopping of an iterative procedure applied to noisy data.

  20. Gradient descent for square loss. $w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y)$, minimizing $\sum_{i=1}^n (y_i - w^\top x_i)^2 = \|\hat X w - \hat y\|^2$
     - no penalty
     - step size chosen a priori: $\gamma = 2 / \|\hat X^\top \hat X\|$
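
A minimal gradient descent sketch using the a-priori step size from the slide (function name, data layout, and the choice of keeping the whole path are illustrative); the path of iterates is stored so the stopping time can be chosen by validation, as in the next slides.

```python
import numpy as np

# Sketch of gradient descent on ||X w - y||^2 with step size gamma = 2 / ||X^T X||.
def gd_least_squares(X, y, t_max):
    gamma = 2.0 / np.linalg.norm(X.T @ X, 2)   # spectral norm of X^T X
    w = np.zeros(X.shape[1])
    path = []
    for _ in range(t_max):
        w = w - gamma * X.T @ (X @ w - y)      # w_{t+1} = w_t - gamma X^T (X w_t - y)
        path.append(w.copy())
    return path                                # early stopping: pick t by validation
```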

  21. Early stopping at work. Fitting on the training set, iteration #1. [plot of the fit on the training data]

  22. Early stopping at work. Fitting on the training set, iteration #2. [plot]

  23. Early stopping at work. Fitting on the training set, iteration #7. [plot]

  24. Early stopping at work. Fitting on the training set, iteration #5000. [plot]

  25. Semi-convergence. $\min_w \hat E(w)$ vs. $\min_w E(w)$.

  26. Connection to Tikhonov or TSVD. $w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y) = (I - \gamma \hat X^\top \hat X)\, w_t + \gamma \hat X^\top \hat y$; by induction, $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j\, \hat X^\top \hat y$, a truncated power series.
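
A quick numerical check of this identity (sizes and data are illustrative): run $t$ gradient steps from $w_0 = 0$ and compare with the truncated power series.

```python
import numpy as np

# Sketch: the t-th GD iterate equals gamma * sum_{j<t} (I - gamma X^T X)^j X^T y.
rng = np.random.default_rng(0)
n, d, t = 30, 4, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
gamma = 1.0 / np.linalg.norm(X.T @ X, 2)

w = np.zeros(d)                                 # gradient descent from w_0 = 0
for _ in range(t):
    w = w - gamma * X.T @ (X @ w - y)

A = np.eye(d) - gamma * X.T @ X                 # truncated power (Neumann) series
w_series = gamma * sum(np.linalg.matrix_power(A, j) for j in range(t)) @ (X.T @ y)

assert np.allclose(w, w_series)
```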

  27. Neumann series. $\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$

  28. Neumann series. $\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$
     - For $|a| < 1$: $(1 - a)^{-1} = \sum_{j=0}^{\infty} a^j \;\Rightarrow\; a^{-1} = \sum_{j=0}^{\infty} (1 - a)^j$

  29. Neumann series. $\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$
     - For $|a| < 1$: $(1 - a)^{-1} = \sum_{j=0}^{\infty} a^j \;\Rightarrow\; a^{-1} = \sum_{j=0}^{\infty} (1 - a)^j$
     - For $A \in \mathbb{R}^{d \times d}$ invertible with $\|A\| < 1$: $A^{-1} = \sum_{j=0}^{\infty} (I - A)^j$

  30. Stable matrix inversion. Truncated Neumann series: $(\hat X^\top \hat X)^{-1} = \gamma \sum_{j=0}^{\infty} (I - \gamma \hat X^\top \hat X)^j \approx \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$; compare to $(\hat X^\top \hat X)^{-1} \approx (\hat X^\top \hat X + \lambda n I)^{-1}$.

  31. Spectral filtering. Different instances of the same principle.
     - Tikhonov: $w_\lambda = (\hat X^\top \hat X + \lambda n I)^{-1} \hat X^\top \hat y$
     - Early stopping: $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j\, \hat X^\top \hat y$
     - TSVD: $w_M = (\hat X^\top \hat X)^{-1}_M \hat X^\top \hat y$
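
One way to make the common principle concrete, not shown explicitly on the slide: each method acts on every eigenvalue $\sigma$ of $\hat X^\top \hat X$ through a scalar filter approximating $1/\sigma$. A sketch under that reading (function names are illustrative, eigenvalues assumed positive):

```python
import numpy as np

# Sketch: three spectral filters, each an approximation of 1/s for eigenvalue s > 0.
def tikhonov_filter(s, lam, n):
    return 1.0 / (s + lam * n)                       # damps small eigenvalues

def early_stopping_filter(s, gamma, t):
    return gamma * sum((1.0 - gamma * s) ** j for j in range(t))   # truncated series

def tsvd_filter(s, threshold):
    return np.where(s >= threshold, 1.0 / s, 0.0)    # keep components above threshold
```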

  32. Statistics and optimization. $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j\, \hat X^\top \hat y$. The difference is in the computations: $w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y)$.
     - Tikhonov: $O(nd^2 + d^3)$
     - TSVD: $O(nd^2 + d^2 M)$
     - GD: $O(ndt)$

  33. Regularization path and warm restart. $\min_w \hat E(w)$ vs. $\min_w E(w)$.

  34. Beyond linear models. Non-linear function $f(x) = \sum_{i=1}^p w_i \varphi_i(x) = \Phi(x)^\top w$.
     - Replace $x$ by $\Phi(x) = (\varphi_1(x), \ldots, \varphi_p(x))$
     - Replace $\hat X$ by $\hat\Phi = (\Phi(x_1), \ldots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$
     $w_{t+1} = w_t - \gamma \hat\Phi^\top (\hat\Phi w_t - \hat y)$. Computational cost: $O(npt)$.

  35. Early stopping and kernels. $f(x) = \sum_{i=1}^n K(x_i, x)\, c_i$. By induction, $c_{t+1} = c_t - \gamma (\hat K c_t - \hat y)$, with $\hat K_{ij} = K(x_i, x_j)$. Computational complexity: $O(n^2 t)$.
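
A minimal sketch of the kernel iteration (the function name and the step-size choice are illustrative assumptions); each step costs $O(n^2)$ and the stopping time $t$ plays the role of the regularization parameter.

```python
import numpy as np

# Sketch of early stopping with kernels: c_{t+1} = c_t - gamma (K c_t - y).
def kernel_gd(K, y, t_max):
    gamma = 1.0 / np.linalg.norm(K, 2)   # illustrative a-priori step size
    c = np.zeros(K.shape[0])
    path = []
    for _ in range(t_max):
        c = c - gamma * (K @ c - y)      # O(n^2) per iteration
        path.append(c.copy())
    return path                          # choose the stopping time on validation data
```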

  36. What about other loss functions?
     - PCA + ERM
     - Gradient / subgradient descent
     Iterations for regularization, not only optimization!

  37. Going big... Bottleneck of kernel methods: memory, since $\hat K$ is $O(n^2)$.

  38. Approaches to large scale.
     - (Random) features: find $\tilde\Phi : X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \tilde\Phi(x)^\top \tilde\Phi(x')$

  39. Approaches to large scale.
     - (Random) features: find $\tilde\Phi : X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \tilde\Phi(x)^\top \tilde\Phi(x')$
     - Subsampling (Nyström): replace $f(x) = \sum_{i=1}^n K(x, x_i)\, c_i$ by $f(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$, with the $\tilde x_i$ subsampled from the training set, $M \ll n$

  40. Approaches to large scale.
     - (Random) features: find $\tilde\Phi : X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \tilde\Phi(x)^\top \tilde\Phi(x')$
     - Subsampling (Nyström): replace $f(x) = \sum_{i=1}^n K(x, x_i)\, c_i$ by $f(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$, with the $\tilde x_i$ subsampled from the training set, $M \ll n$
     - Greedy approaches
     - Neural nets
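
As one illustration of the random-features idea, here is a sketch of random Fourier features for a Gaussian kernel $K(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$; the specific kernel, $M$, $\sigma$, and function names are assumptions for the example, not taken from the slides.

```python
import numpy as np

# Sketch of random Fourier features: Phi(x)^T Phi(x') approximates a Gaussian kernel.
def random_features(X, M, sigma, rng):
    d = X.shape[1]
    W = rng.standard_normal((d, M)) / sigma        # frequencies w ~ N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, M)           # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)    # n x M feature matrix

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Phi = random_features(X, M=2000, sigma=1.0, rng=rng)
K_approx = Phi @ Phi.T   # n x n approximation of the Gaussian kernel matrix
```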

  41. This class: regularization beyond penalization.
     - Regularization by projection
     - Regularization by early stopping

  42. Next class: multioutput learning.
     - Multitask learning
     - Vector-valued learning
     - Multiclass learning
