Boosting, Min-Norm Interpolated Classifiers, and Overparametrization: a precise asymptotic theory
Tengyuan Liang
joint work with Pragya Sur (Harvard)
OUTLINE
● Motivation: min-norm interpolants under the overparametrized regime
● Classification: boosting on separable data
  ● precise asymptotics of the margin
  ● fixed point of a non-linear system of equations
  ● statistical and algorithmic implications
● Proof Sketch: Gaussian comparison and convex geometry tools
OVERPARAMETRIZED REGIME OF STAT / ML
Model class complex enough to interpolate the training data.
Zhang, Bengio, Hardt, Recht, and Vinyals (2016); Belkin et al. (2018); Liang and Rakhlin (2018); Bartlett et al. (2019); Hastie et al. (2019)
[Figure: Kernel regression on MNIST. log(error) versus regularization level $\lambda \in [0, 1.2]$, one curve per digit pair $[i, j]$ with $i \in \{2, 3, 4\}$ and $j \in \{5, \dots, 9\}$. $\lambda = 0$: the interpolants on training data. MNIST data from LeCun et al. (2010).]
OVERPARAMETRIZED REGIME OF STAT / ML
In fact, many models behave the same on the training data. Practical methods or algorithms favor certain functions!
Principle: among the models that interpolate, algorithms favor a certain form of minimalism.
OVERPARAMETRIZED REGIME OF STAT / ML
Principle: among the models that interpolate, algorithms favor a certain form of minimalism.
● overparametrized linear model and matrix factorization
● kernel regression
● support vector machines, Perceptron
● boosting, AdaBoost
● two-layer ReLU networks, deep neural networks (?)
Minimalism is typically measured in the form of a certain norm, which motivates the study of min-norm interpolants.
MIN-NORM INTERPOLANTS
Regression:
\[
\hat{f} = \arg\min_{f} \ \|f\|_{\mathrm{norm}}, \quad \text{s.t. } y_i = f(x_i) \ \ \forall i \in [n].
\]
Classification:
\[
\hat{f} = \arg\min_{f} \ \|f\|_{\mathrm{norm}}, \quad \text{s.t. } y_i \cdot f(x_i) \ge 1 \ \ \forall i \in [n].
\]
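To make the regression formulation concrete, here is a minimal NumPy sketch (my illustration, not from the slides) for the linear model $f(x) = x^\top \theta$ with the Euclidean norm: when $p > n$ the system $X\theta = y$ has infinitely many solutions, and the pseudoinverse picks out the minimum-$\ell_2$-norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                         # overparametrized: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# min ||theta||_2  subject to  X theta = y  (min-norm least-squares solution)
theta_hat = np.linalg.pinv(X) @ y

assert np.allclose(X @ theta_hat, y)   # interpolates the training data exactly
```

The $\ell_1$ classification analogue, which is the object of this talk, is set up as a linear program later in the deck.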
Precise High-Dimensional Asymptotic Theory for Boosting and Min-$L_1$-Norm Interpolated Classifiers
tyliang.github.io/Tengyuan.Liang/pdf/Liang-Sur-20.pdf
Classification:
\[
\hat{f} = \arg\min_{f} \ \|f\|_{\mathrm{norm}}, \quad \text{s.t. } y_i \cdot f(x_i) \ge 1 \ \ \forall i \in [n].
\]
PROBLEM FORMULATION
Given $n$ i.i.d. data pairs $\{(x_i, y_i)\}_{1 \le i \le n}$, with $(x, y) \sim P$:
● $y_i \in \{\pm 1\}$ binary labels,
● $x_i \in \mathbb{R}^p$ feature vector (weak learners).
Consider when the data is linearly separable:
\[
P\big(\exists\, \theta \in \mathbb{R}^p,\ y_i x_i^\top \theta > 0 \text{ for } 1 \le i \le n\big) \to 1.
\]
Natural to consider the overparametrized regime $p/n \to \psi \in (0, \infty)$.
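As a sanity check on this setup, the sketch below (my illustration; the value of $\psi$ and all names are arbitrary) simulates Gaussian features in the regime $p/n = \psi$ and tests linear separability via an LP feasibility problem, using the fact that $y_i x_i^\top \theta > 0$ for all $i$ can always be rescaled to $y_i x_i^\top \theta \ge 1$.

```python
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """Return True iff some theta satisfies y_i x_i^T theta >= 1 for every i."""
    n, p = X.shape
    Z = y[:, None] * X                     # row i is y_i * x_i
    res = linprog(np.zeros(p),             # feasibility only: zero objective
                  A_ub=-Z, b_ub=-np.ones(n),
                  bounds=(None, None), method="highs")
    return res.success

rng = np.random.default_rng(0)
n, psi = 100, 4.0                          # p / n -> psi
p = int(psi * n)
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)        # labels independent of X, for illustration
print(is_separable(X, y))                  # True with high probability for psi this large
```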
BOOSTING / ADABOOST
Initialize $\theta_0 = 0 \in \mathbb{R}^p$ and set data weights $\eta_0 = (1/n, \dots, 1/n) \in \Delta_n$; write $Z := y \circ X \in \mathbb{R}^{n \times p}$.
At time $t \ge 0$:
1. Learner/Feature Selection: $j^\star_t := \arg\max_{j \in [p]} |\eta_t^\top Z e_j|$, set $\gamma_t = \eta_t^\top Z e_{j^\star_t}$;
2. Adaptive Stepsize: $\alpha_t = \tfrac{1}{2} \log\big(\tfrac{1 + \gamma_t}{1 - \gamma_t}\big)$;
3. Coordinate Update: $\theta_{t+1} = \theta_t + \alpha_t \cdot e_{j^\star_t}$;
4. Weight Update: $\eta_{t+1}[i] \propto \eta_t[i] \exp\big(-\alpha_t\, y_i x_i^\top e_{j^\star_t}\big)$, normalized so that $\eta_{t+1} \in \Delta_n$.
Terminate after $T$ steps and output the vector $\theta_T$.
Freund and Schapire (1995, 1996)
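A minimal NumPy sketch of these four steps (my illustration; the function name is arbitrary). It assumes the entries of $Z = y \circ X$ lie in $[-1, 1]$ so that $|\gamma_t| < 1$; the clip is only a numerical safeguard, not part of the algorithm.

```python
import numpy as np

def adaboost_coordinate(X, y, T, eps=1e-12):
    """Coordinate-wise AdaBoost with adaptive stepsize, following steps 1-4 above."""
    n, p = X.shape
    Z = y[:, None] * X                                   # Z := y o X
    theta = np.zeros(p)
    eta = np.full(n, 1.0 / n)                            # data weights on the simplex
    for _ in range(T):
        corr = eta @ Z                                   # eta_t^T Z e_j for every j
        j_star = int(np.argmax(np.abs(corr)))            # 1. feature selection
        gamma = np.clip(corr[j_star], -1 + eps, 1 - eps)
        alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))  # 2. adaptive stepsize
        theta[j_star] += alpha                           # 3. coordinate update
        eta = eta * np.exp(-alpha * Z[:, j_star])        # 4. reweight the data
        eta /= eta.sum()                                 # renormalize onto the simplex
    return theta
```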
BOOSTING / ADABOOST
"... mystery of AdaBoost as the most important unsolved problem in Machine Learning"
Wald Lecture, Breiman (2004)
KEY: EMPIRICAL MARGIN
Empirical margin is key to Generalization and Optimization.
Generalization: for all $f(x) = x^\top \theta / \|\theta\|_1$ and $\kappa > 0$, with probability $1 - \delta$,
\[
\underbrace{P\big(y f(x) < 0\big)}_{\text{generalization error}}
\;\le\;
\underbrace{\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big(y_i f(x_i) < \kappa\big)}_{\text{empirical margin}}
\;+\; \sqrt{\frac{\log n \, \log p}{n \kappa^2}}
\;+\; \sqrt{\frac{\log(1/\delta)}{n}}.
\]
Schapire, Freund, Bartlett, and Lee (1998)
Choose the classifier $f$ that maximizes the minimal margin $\kappa$:
\[
\kappa = \max_{\theta \in \mathbb{R}^p} \ \min_{1 \le i \le n} \ y_i x_i^\top \theta / \|\theta\|_1,
\]
so that
\[
\text{generalization error} \;<\; \frac{1}{\sqrt{n}\,\kappa} \cdot \text{(log factors, constants)}.
\]
"An important open problem is to derive more careful and precise bounds which can be used for this purpose. Besides paying closer attention to constant factors, such an analysis might also involve the measurement of more sophisticated statistics."
Schapire, Freund, Bartlett, and Lee (1998)
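For concreteness, a tiny helper (my illustration) that evaluates the quantities entering the bound above: the normalized margins $y_i x_i^\top \theta / \|\theta\|_1$ and their minimum $\kappa$ for a candidate direction $\theta$.

```python
import numpy as np

def l1_normalized_margins(X, y, theta):
    """Per-example margins y_i * x_i^T theta / ||theta||_1."""
    return y * (X @ theta) / np.sum(np.abs(theta))

def min_margin(X, y, theta):
    """Minimal empirical margin kappa appearing in the generalization bound."""
    return float(l1_normalized_margins(X, y, theta).min())
```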
KEY: EMPIRICAL MARGIN
Empirical margin is key to Generalization and Optimization.
Optimization: for AdaBoost with $p$ weak learners and $Z := y \circ X \in \mathbb{R}^{n \times p}$,
\[
\sum_{i=1}^{n} \mathbf{1}\big(-y_i x_i^\top \theta_T > 0\big)
\;\le\; n e \cdot \exp\Big(-\sum_{t=1}^{T} \frac{\gamma_t^2}{2}\big(1 + o(\gamma_t)\big)\Big).
\]
By the minimax theorem,
\[
|\gamma_t| = \|Z^\top \eta_t\|_\infty
\;\ge\; \min_{\eta \in \Delta_n} \|Z^\top \eta\|_\infty
= \min_{\eta \in \Delta_n} \max_{\|\theta\|_1 \le 1} \eta^\top Z \theta
= \max_{\|\theta\|_1 \le 1} \min_{1 \le i \le n} e_i^\top Z \theta
\;\ge\; \kappa.
\]
Freund and Schapire (1995); Zhang and Yu (2005)
Stopping time (zero training error):
\[
\text{optimization steps} \;<\; \frac{1}{\kappa^2} \cdot \text{(log factors, constants)}.
\]
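A short calculation (my gloss, ignoring the $o(\gamma_t)$ correction) that turns the two displays above into the stated stopping time:
\[
|\gamma_t| \ge \kappa \ \text{ for all } t
\;\Longrightarrow\;
\sum_{i=1}^{n} \mathbf{1}\big(-y_i x_i^\top \theta_T > 0\big)
\;\le\; n e \cdot \exp\Big(-\frac{T \kappa^2}{2}\Big),
\]
and the right-hand side drops below $1$ (hence the training error is exactly zero) as soon as
\[
T \;>\; \frac{2\,(\log n + 1)}{\kappa^2}.
\]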
L1 GEOMETRY, MARGIN, AND INTERPOLATION
We consider the min-$L_1$-norm interpolated classifier on separable data:
\[
\hat{\theta}_{\ell_1} = \arg\min_{\theta} \ \|\theta\|_1, \quad \text{s.t. } y_i x_i^\top \theta \ge 1, \ \forall i \in [n].
\]
Algorithmic: on separable data, the boosting iterate $\theta^{T,s}_{\mathrm{boost}}$ with infinitesimal stepsize $s$ agrees with the min-$L_1$-norm interpolation asymptotically,
\[
\lim_{s \to 0} \lim_{T \to \infty} \ \theta^{T,s}_{\mathrm{boost}} / \|\theta^{T,s}_{\mathrm{boost}}\|_1 = \hat{\theta}_{\ell_1}.
\]
Freund and Schapire (1995); Rosset et al. (2004); Zhang and Yu (2005)
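The min-$\ell_1$-norm program above is a linear program; a minimal SciPy sketch (my illustration; the function name is arbitrary) via the standard variable split $\theta = u - v$ with $u, v \ge 0$:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolating_classifier(X, y):
    """Solve  min ||theta||_1  s.t.  y_i x_i^T theta >= 1  for all i.
    Returns None if the data is not linearly separable (LP infeasible)."""
    n, p = X.shape
    Z = y[:, None] * X
    c = np.ones(2 * p)                 # objective sum(u) + sum(v) = ||theta||_1
    A_ub = -np.hstack([Z, -Z])         # -Z u + Z v <= -1  <=>  Z (u - v) >= 1
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        return None
    u, v = res.x[:p], res.x[p:]
    return u - v                       # theta_hat
```

Its $\ell_1$ norm is exactly what controls the margin, as the equivalence on the next slide makes precise.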
L1 GEOMETRY, MARGIN, AND INTERPOLATION
Min-$L_1$-norm interpolation is equivalent to max-$L_1$-margin:
\[
\max_{\|\theta\|_1 \le 1} \ \min_{1 \le i \le n} \ y_i x_i^\top \theta \;=:\; \kappa_{\ell_1}(X, y).
\]
Prior understanding:
\[
\text{generalization error} \;<\; \frac{1}{\sqrt{n}\,\kappa} \cdot \text{(log factors, constants)},
\qquad
\text{optimization steps} \;<\; \frac{1}{\kappa^2} \cdot \text{(log factors, constants)}.
\]
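The equivalence is a one-line rescaling argument (my gloss, not spelled out on the slide): a unit-$\ell_1$ direction with margin $\kappa$ rescales to an interpolant of $\ell_1$ norm $1/\kappa$, and any interpolant normalizes back to a unit-$\ell_1$ direction with margin at least $1/\|\theta\|_1$. Hence
\[
\kappa_{\ell_1}(X, y)
\;=\; \max_{\|\theta\|_1 \le 1} \ \min_{1 \le i \le n} y_i x_i^\top \theta
\;=\; \Big( \min_{\theta:\ y_i x_i^\top \theta \ge 1 \ \forall i} \|\theta\|_1 \Big)^{-1}
\;=\; \frac{1}{\|\hat{\theta}_{\ell_1}\|_1}.
\]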