Intro. Min-norm Interpolant Regression Classification
Theory for Minimum Norm Interpolation: Regression and Classification in High Dimensions
Tengyuan Liang
Classification: with Pragya Sur (Harvard)
Regression: with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)
1 / 37
OUTLINE
• Motivation: min-norm interpolants
• Regression: multiple descent of risk
  • application to wide neural networks
  • restricted lower isometry of kernels
  • small-ball property
• Classification: boosting on separable data
  • precise high-dim asymptotics
  • convex Gaussian min-max theorem
  • algorithmic implications on boosting
OVER-PARAMETRIZED REGIME OF STAT / ML
Model class complex enough to interpolate the training data.
[Figure: Kernel Regression on MNIST. log(error) versus lambda (0.0 to 1.2), one curve per digit pair [i,j] with i in {2, 3, 4} and j in {5, 6, 7, 8, 9}.]
λ = 0: the interpolants on training data.
MNIST data from LeCun et al. (2010)
OVER-PARAMETRIZED REGIME OF STAT / ML
Model class complex enough to interpolate the training data. Zhang, Bengio, Hardt, Recht, and Vinyals (2016)
In fact, many models behave the same on training data. Practical methods or algorithms favor certain functions!
Principle: among the models that interpolate, algorithms favor a certain form of minimalism.
• over-parametrized linear model and matrix factorization
• kernel machines
• support vector machines
• boosting, AdaBoost
• two-layer ReLU networks
Minimalism is typically measured in the form of a certain norm, which motivates the study of min-norm interpolants.
MIN-NORM INTERPOLANTS
Regression: $\hat f = \arg\min_f \|f\|_{\mathrm{norm}}$ s.t. $y_i = f(x_i)$ for all $i \in [n]$.
Classification: $\hat f = \arg\min_f \|f\|_{\mathrm{norm}}$ s.t. $y_i \cdot f(x_i) \ge 1$ for all $i \in [n]$.
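As a minimal sketch of the regression definition, assume the simplest case from the list of examples: a linear model $f(x) = \langle w, x \rangle$ with the Euclidean norm on $w$. Under that assumption the min-norm interpolant is given by the pseudo-inverse. The function name `min_l2_norm_interpolant` and the random data are illustrative and not from the talk.

```python
import numpy as np

def min_l2_norm_interpolant(X, y):
    """Minimum l2-norm w with X @ w = y (assumes X has full row rank)."""
    return np.linalg.pinv(X) @ y

rng = np.random.default_rng(0)
n, d = 5, 20                      # over-parametrized: more features than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_hat = min_l2_norm_interpolant(X, y)

# Build a second interpolant by adding a null-space direction of X.
v = rng.standard_normal(d)
v_null = v - np.linalg.pinv(X) @ (X @ v)   # component of v in the null space of X
w_other = w_hat + v_null

print("w_hat interpolates:  ", np.allclose(X @ w_hat, y))
print("w_other interpolates:", np.allclose(X @ w_other, y))
print("norms:", np.linalg.norm(w_hat), np.linalg.norm(w_other))
```

Both vectors fit the training data exactly; the pseudo-inverse solution is the one with the smaller norm, which is the sense of "minimalism" used here.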
REGRESSION
Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels
with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)
SHAPE OF RISK CURVE
Classic: U-shape curve.
Recent: double descent curve. Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019)
Question: shape of the risk curve w.r.t. “over-parametrization”?
We model the intrinsic dim. $d = n^\alpha$ with $\alpha \in (0, 1)$, with feature cov. $\Sigma_d = I_d$. We consider the non-linear Kernel Regression model.
SHAPE OF RISK CURVE
We consider the intrinsic dim. $d = n^\alpha$ with $\alpha \in (0, 1)$ and a non-linear Kernel Regression model.
DGP.
• $\{x_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \mu = P^{\otimes d}$. The distribution of each coordinate $x \sim P$ satisfies a weak moment condition: $\forall t > 0$, $P(|x| > t) \le C (1 + t)^{-\nu}$.
• target $f_\star(x) := E[Y \mid X = x]$, with bounded $\mathrm{Var}[Y \mid X = x]$.
Kernel.
• $h \in C^\infty(\mathbb{R})$, $h(t) = \sum_{i=0}^{\infty} \alpha_i t^i$ with $\alpha_i \ge 0$.
• inner product kernel $k(x, z) = h(\langle x, z \rangle / d)$.
Target Function.
• Assume $f_\star(x) = \int k(x, z) \rho_\star(z) \mu(dz)$ with $\|\rho_\star\|_\mu \le C$.
Given $n$ i.i.d. data pairs $(x_i, y_i) \sim P_{X,Y}$. Risk curve for the minimum RKHS norm $\|\cdot\|_{\mathcal{H}}$ interpolant $\hat f$?
$\hat f = \arg\min_f \|f\|_{\mathcal{H}}$ s.t. $y_i = f(x_i)$ for all $i \in [n]$.
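The setup can be sketched numerically. The block below is an illustration under stated assumptions, not the authors' experiment: it takes $h(t) = e^t$ (whose power-series coefficients $1/i!$ are nonnegative), standard Gaussian coordinates for $P$, and a hypothetical target `tanh(x_1)` plus noise. It uses the standard representer form of the minimum RKHS norm interpolant, $\hat f(x) = k(x, X) K^{-1} y$. All function names are made up for this sketch.

```python
import numpy as np

def inner_product_kernel(X, Z, h=np.exp):
    """k(x, z) = h(<x, z> / d) for every pair of rows of X and Z."""
    d = X.shape[1]
    return h(X @ Z.T / d)

def fit_min_norm_interpolant(X, y, h=np.exp):
    """Coefficients c solving K c = y, so that f(x) = sum_i c_i k(x, x_i)."""
    K = inner_product_kernel(X, X, h)
    return np.linalg.solve(K, y)

def predict(X_train, coef, X_new, h=np.exp):
    """Evaluate the fitted interpolant at the rows of X_new."""
    return inner_product_kernel(X_new, X_train, h) @ coef

rng = np.random.default_rng(0)
n, alpha = 200, 0.5
d = int(round(n ** alpha))                 # intrinsic dimension d = n^alpha
X = rng.standard_normal((n, d))            # i.i.d. coordinates (assumed Gaussian)
y = np.tanh(X[:, 0]) + 0.1 * rng.standard_normal(n)   # hypothetical target + noise

coef = fit_min_norm_interpolant(X, y)
train_residual = np.max(np.abs(predict(X, coef, X) - y))
print("d =", d, "max training residual =", train_residual)
```

The training residual is zero up to floating-point error, so the fit interpolates the noisy labels; the question on the slide is how the risk of this interpolant on fresh data behaves as $\alpha$ varies.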
SHAPE OF RISK CURVE
Theorem (L., Rakhlin & Zhai, ’19). For any integer $\iota \ge 1$, consider $d = n^\alpha$ where $\alpha \in \left( \frac{1}{\iota + 1}, \frac{1}{\iota} \right)$. With probability at least $1 - \delta - e^{-n / d^\iota}$ on the design $X \in \mathbb{R}^{n \times d}$,
$E\left[ \|\hat f - f_\star\|_\mu^2 \mid X \right] \le C \cdot \left( \frac{d^\iota}{n} + \frac{n}{d^{\iota + 1}} \right) \asymp n^{-\beta}$, $\beta := \min\{ (\iota + 1)\alpha - 1,\ 1 - \iota\alpha \}$.
Here the constant $C(\delta, \iota, h, P)$ does not depend on $d$, $n$.
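To make the exponent concrete, the following sketch evaluates $\iota$ and $\beta = \min\{(\iota+1)\alpha - 1,\ 1 - \iota\alpha\}$ from the theorem for a given $\alpha$. The helper name `rate_exponent` is hypothetical, and exact rational arithmetic is used only to avoid floating-point noise at the boundaries.

```python
from fractions import Fraction
import math

def rate_exponent(alpha):
    """Return (iota, beta) for d = n^alpha with alpha in (1/(iota+1), 1/iota)."""
    alpha = Fraction(alpha)
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")
    inv = 1 / alpha
    if inv.denominator == 1:
        # alpha = 1/iota is an endpoint of the open interval in the theorem
        raise ValueError("alpha = 1/iota is excluded by the theorem")
    iota = math.floor(inv)                 # the unique iota with 1/(iota+1) < alpha < 1/iota
    beta = min((iota + 1) * alpha - 1, 1 - iota * alpha)
    return iota, beta

for a in [Fraction(3, 5), Fraction(2, 3), Fraction(2, 5), Fraction(3, 10)]:
    iota, beta = rate_exponent(a)
    print(f"alpha = {a}: iota = {iota}, beta = {beta}")
```

Passing `Fraction` values (or strings such as `"2/5"`) keeps the arithmetic exact; a float such as `0.4` is converted to its exact binary value and may land on either side of a boundary.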
MULTIPLE DESCENT
[Figure: rate curve against the scaling exponent; recoverable tick labels 0, ⋯, 1/4, 1/3, 1/2, 1.]
Multiple-descent behavior of the rates as the scaling $d = n^\alpha$ changes.
• valley: “valley” on the rate curve at $d = n^{\frac{1}{\iota + 1/2}}$, $\iota \in \mathbb{N}$
• over-parametrization: towards the over-parametrized regime, the good rate at the bottom of the valley is better
• empirical: preliminary empirical evidence of multiple descent
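The valley location follows from the exponent $\beta = \min\{(\iota+1)\alpha - 1,\ 1 - \iota\alpha\}$ in the theorem: the minimum of an increasing and a decreasing linear function of $\alpha$ is largest where the two agree. A short worked derivation, using only the theorem's formula:

```latex
\begin{align*}
(\iota + 1)\alpha - 1 = 1 - \iota\alpha
  \;&\Longleftrightarrow\; \alpha = \frac{2}{2\iota + 1} = \frac{1}{\iota + 1/2}, \\
\beta \,\Big|_{\alpha = \frac{1}{\iota + 1/2}}
  \;&=\; 1 - \frac{2\iota}{2\iota + 1} \;=\; \frac{1}{2\iota + 1}.
\end{align*}
```

For example, $\iota = 1$ gives $\alpha = 2/3$ with $\beta = 1/3$, and $\iota = 2$ gives $\alpha = 2/5$ with $\beta = 1/5$, so the bottom of the valley is better for smaller $\iota$, that is, for larger $d$ relative to $n$. At the interval endpoints $\alpha \to 1/\iota$ and $\alpha \to 1/(\iota+1)$ the exponent $\beta$ tends to 0.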
EMPIRICAL EVIDENCE
Empirical evidence of multiple-descent behavior as the scaling $d = n^\alpha$ changes.
MULTIPLE DESCENT
[Figure: multiple-descent rate curve, with panels labeled "theory" and "empirical".]
MULTIPLE DESCENT
Multiple-descent behavior of the rates as the scaling $d = n^\alpha$ changes.
• $\alpha = 1$: Liang and Rakhlin (2018)
• $\alpha = 0$: Rakhlin and Zhai (2018)
• $\alpha = 1$ double descent: Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019); Bartlett, Long, Lugosi, and Tsigler (2019)
• general $\alpha$, stair-case, random Fourier feature: Ghorbani, Mei, Misiakiewicz, and Montanari (2019)
APPLICATION TO WIDE NEURAL NETWORKS
Neural Tangent Kernel (NTK): Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018)......
$k_{\mathrm{NTK}}(x, x') = \frac{1}{4\pi} U\left( \frac{\langle x, x' \rangle}{\|x\| \|x'\|} \right)$, $U(t) = 3t(\pi - \arccos(t)) + \sqrt{1 - t^2}$
Corollary (L., Rakhlin & Zhai, ’19). Our results can be generalized to the following type of kernels
$k(x, x') = \sum_{i=0}^{\infty} \alpha_i \cdot \left( \frac{\langle x, x' \rangle}{\|x\| \|x'\|} \right)^i$
that include NTK. Consider an integer $\iota$ that satisfies $d^\iota \log d \precsim n \precsim d^{\iota + 1} / \log d$, then
$\text{Risk} \precsim \frac{d^\iota}{n} + \frac{n \log d}{d^{\iota + 1}}$
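A small sketch of evaluating this kernel, assuming the formula reads $k_{\mathrm{NTK}}(x, x') = \frac{1}{4\pi} U(\cos\theta)$ with $U(t) = 3t(\pi - \arccos t) + \sqrt{1 - t^2}$ and $\cos\theta = \langle x, x' \rangle / (\|x\| \|x'\|)$. The $1/(4\pi)$ prefactor is read off the slide layout and should be treated as an assumption; the function names are illustrative.

```python
import numpy as np

def U(t):
    """U(t) = 3 t (pi - arccos t) + sqrt(1 - t^2), for t in [-1, 1]."""
    t = np.clip(t, -1.0, 1.0)      # guard against rounding slightly outside [-1, 1]
    return 3.0 * t * (np.pi - np.arccos(t)) + np.sqrt(1.0 - t ** 2)

def ntk_kernel(x, x_prime):
    """Kernel value as a function of the cosine between x and x_prime (assumed 1/(4 pi) scaling)."""
    cos = np.dot(x, x_prime) / (np.linalg.norm(x) * np.linalg.norm(x_prime))
    return U(cos) / (4.0 * np.pi)

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
print("aligned inputs:   ", ntk_kernel(e1, e1))   # U(1) = 3*pi
print("orthogonal inputs:", ntk_kernel(e1, e2))   # U(0) = 1
```

Because the kernel depends on the inputs only through the normalized inner product, rescaling either input leaves its value unchanged, which is why the corollary states the kernel class in terms of $\langle x, x' \rangle / (\|x\| \|x'\|)$.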