

1. Minimum-Norm Interpolation in Statistical Learning: New Phenomena in High Dimensions
Tengyuan Liang
Regression: with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)
Classification: with Pragya Sur (Harvard)

2. OUTLINE
● Motivation: min-norm interpolants for over-parametrized models
● Regression: multiple descent of risk for kernels/neural networks
● Classification: precise asymptotics of boosting algorithms

3. OVERPARAMETRIZED REGIME OF STAT / ML
Model class complex enough to interpolate the training data.
Zhang, Bengio, Hardt, Recht, and Vinyals (2016); Belkin et al. (2018a,b); Liang and Rakhlin (2018); Bartlett et al. (2019); Hastie et al. (2019)
[Figure: kernel regression on MNIST digit pairs [i, j], i ∈ {2, 3, 4}, j ∈ {5, …, 9}; log test error plotted against the regularization level λ ∈ [0, 1.2].]
λ = 0: the interpolant on the training data. MNIST data from LeCun et al. (2010)
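To make the λ = 0 interpolant concrete, here is a minimal kernel "ridgeless" regression sketch in Python. It uses synthetic Gaussian data and an RBF kernel as a stand-in for the MNIST experiment in the talk; the target f_star and all parameter values are illustrative assumptions, not the talk's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 30
X = rng.normal(size=(n, d))
f_star = lambda Z: np.tanh(Z[:, 0] * Z[:, 1])       # hypothetical smooth target
y = f_star(X) + 0.1 * rng.normal(size=n)

def rbf(A, B, gamma=1.0 / 30):                      # k(x, z) = exp(-gamma ||x - z||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = rbf(X, X)
X_te = rng.normal(size=(1000, d))
K_te = rbf(X_te, X)

for lam in [1.0, 1e-2, 1e-4, 0.0]:                  # lambda = 0: the interpolant
    if lam > 0:
        c = np.linalg.solve(K + lam * np.eye(n), y)
    else:
        c = np.linalg.lstsq(K, y, rcond=None)[0]    # min-norm solve at lambda = 0
    test_mse = np.mean((K_te @ c - f_star(X_te)) ** 2)
    print(f"lambda={lam:g}  test MSE={test_mse:.4f}")
```

At λ = 0 the fitted function passes through every training point, which is the regime the rest of the talk studies.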

4. OVERPARAMETRIZED REGIME OF STAT / ML
In fact, many models behave the same on the training data, yet practical methods and algorithms favor certain functions among them.
Principle: among the models that interpolate, algorithms favor a certain form of minimalism.
● overparametrized linear model and matrix factorization
● kernel regression
● support vector machines, Perceptron
● boosting, AdaBoost
● two-layer ReLU networks, deep neural networks
Minimalism is typically measured by a norm, which motivates the study of min-norm interpolants. A minimal sketch of this implicit bias appears below.
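The first item on the list is the cleanest instance of the principle: gradient descent on an overparametrized least-squares problem, started at the origin, stays in the row span of the data and therefore converges to the minimum-ℓ2-norm interpolant. This is a standard fact, illustrated here with a minimal sketch; the problem sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                              # overparametrized: p >> n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta = np.zeros(p)                         # start at the origin
for _ in range(50_000):                     # plain gradient descent on ||X theta - y||^2
    theta -= 1e-3 * X.T @ (X @ theta - y)

theta_mn = np.linalg.pinv(X) @ y            # explicit min-l2-norm interpolant
print("max train residual:", np.abs(X @ theta - y).max())         # ~ 0: interpolates
print("distance to min-norm:", np.linalg.norm(theta - theta_mn))  # ~ 0: GD picked it
```

Infinitely many θ interpolate here (the solution set is a 80-dimensional affine space), yet the algorithm singles out the one of minimum norm.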

5. MIN-NORM INTERPOLANTS
Minimalism is typically measured by a norm, motivating the study of min-norm interpolants.
Regression: f̂ = arg min_f ∥f∥_norm, s.t. y_i = f(x_i) ∀ i ∈ [n].
Classification: f̂ = arg min_f ∥f∥_norm, s.t. y_i · f(x_i) ≥ 1 ∀ i ∈ [n].
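For the RKHS norm used in the regression part, the interpolation problem has a standard closed form via the representer theorem (a known fact recorded here for reference, not a new claim from the slides): with kernel matrix K_{ij} = k(x_i, x_j),

    f̂(·) = Σ_{i=1}^n ĉ_i k(·, x_i),  where  ĉ = K^{−1} y,  and  ∥f̂∥²_H = y^⊺ K^{−1} y.

So computing the min-RKHS-norm interpolant amounts to one linear solve, which is what the code sketches in this transcript do.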

6. Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels
with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)
Regression: f̂ = arg min_f ∥f∥_norm, s.t. y_i = f(x_i) ∀ i ∈ [n].

7. SHAPE OF RISK CURVE
Classic: U-shaped curve. Recent: double descent curve.
Belkin, Hsu, Ma, and Mandal (2018a); Hastie, Montanari, Rosset, and Tibshirani (2019)
Question: what is the shape of the risk curve with respect to "over-parametrization"?
We model the intrinsic dimension as d = n^α with α ∈ (0, 1) and feature covariance Σ_d = I_d, and consider the non-linear kernel regression model.

8. DATA GENERATING PROCESS
DGP.
● {x_i}_{i=1}^n i.i.d. ∼ µ = P^⊗d, where the distribution of each coordinate satisfies a weak moment condition.
● target f⋆(x) := E[Y | X = x], with bounded Var[Y | X = x].
Kernel.
● h ∈ C^∞(ℝ), h(t) = Σ_{i=0}^∞ α_i t^i with α_i ≥ 0.
● inner-product kernel k(x, z) = h(⟨x, z⟩/d).
Target function.
● Assume f⋆(x) = ∫ k(x, z) ρ⋆(z) µ(dz) with ∥ρ⋆∥_µ ≤ C.
Given n i.i.d. data pairs (x_i, y_i) ∼ P_{X,Y}: what is the risk curve for the minimum-RKHS-norm (∥·∥_H) interpolant f̂ = arg min_f ∥f∥_H, s.t. y_i = f(x_i) ∀ i ∈ [n]?
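A minimal Python sketch of this DGP, under illustrative assumptions: coordinates drawn uniform on [−√3, √3] (a bounded, unit-variance stand-in for the weak-moment condition), link h(t) = e^t (whose Taylor coefficients α_i = 1/i! are non-negative, as required), and a hypothetical target. It forms the kernel matrix and the min-RKHS-norm interpolant's coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 1000, 0.4
d = max(2, int(n ** alpha))                     # intrinsic dimension d = n^alpha

# coordinates i.i.d. from P (uniform, unit variance), so mu = P tensored d times
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, d))

h = np.exp                                      # h(t) = sum_i t^i / i!, all alpha_i >= 0
K = h(X @ X.T / d)                              # kernel matrix, k(x, z) = h(<x, z>/d)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # hypothetical target plus noise

c_hat = np.linalg.solve(K, y)                   # min-RKHS-norm interpolant coefficients
x_new = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(1, d))
f_hat_new = h(x_new @ X.T / d) @ c_hat          # f_hat(x) = sum_i c_i k(x, x_i)
```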

9. SHAPE OF RISK CURVE
Theorem (L., Rakhlin & Zhai, ’19). For any integer ι ≥ 1, consider d = n^α where α ∈ (1/(ι+1), 1/ι). With probability at least 1 − δ − e^{−n/d^ι} over the design X ∈ ℝ^{n×d},
E[∥f̂ − f⋆∥²_µ | X] ≤ C · (d^ι/n + n/d^{ι+1}) ≍ n^{−β}, where β := min{(ι+1)α − 1, 1 − ια}.
Here the constant C(δ, ι, h, P) does not depend on d, n.
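The exponent β is piecewise linear in α, and the two error terms d^ι/n and n/d^{ι+1} balance at α = 1/(ι + 1/2), which is where each "valley" bottoms out. The short check below only re-evaluates the stated formula; no new claims.

```python
import numpy as np

def beta(alpha):
    """Rate exponent: risk ~ n^{-beta} for d = n^alpha, alpha in (1/(iota+1), 1/iota)."""
    iota = int(np.floor(1.0 / alpha))       # the integer with alpha in (1/(iota+1), 1/iota)
    return min((iota + 1) * alpha - 1, 1 - iota * alpha)

for iota in [3, 2, 1]:
    a_star = 1.0 / (iota + 0.5)             # the two error terms balance here
    print(f"iota={iota}: valley at alpha={a_star:.3f}, best rate n^(-{beta(a_star):.3f})")
# valley bottoms improve toward over-parametrization: n^(-1/7), n^(-1/5), n^(-1/3)
```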

10. MULTIPLE DESCENT
[Figure: the rate exponent of the risk ≍ n^{−β} plotted against α, where d = n^α, with cusps at α = …, 1/4, 1/3, 1/2, 1.]
Multiple-descent behavior of the rates as the scaling d = n^α changes:
● valley: a "valley" on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ ℕ
● over-parametrization: toward the over-parametrized regime, the rate at the bottom of each valley improves
● empirical: preliminary empirical evidence of multiple descent

11. EMPIRICAL EVIDENCE
[Figure: empirical evidence of multiple-descent behavior as the scaling d = n^α changes.]
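In the same spirit, here is a toy simulation one could run to look for the effect. This is a hedged sketch, not the experiment from the talk: the kernel h = exp, the target, the noise level, and the jitter are all illustrative, and a clean multiple-descent curve would require averaging many replicates over a fine grid of α.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test = 500, 2000
f_star = lambda Z: np.exp(Z[:, 0] / 2)              # hypothetical target

def interpolant_risk(d):
    X = rng.normal(size=(n, d))
    y = f_star(X) + 0.1 * rng.normal(size=n)
    K = np.exp(X @ X.T / d)                         # inner-product kernel with h = exp
    c = np.linalg.solve(K + 1e-10 * np.eye(n), y)   # tiny jitter for numerical stability
    X_te = rng.normal(size=(n_test, d))
    return np.mean((np.exp(X_te @ X.T / d) @ c - f_star(X_te)) ** 2)

for alpha in [0.3, 0.4, 0.5, 0.67, 0.8]:            # sweep the scaling d = n^alpha
    d = max(2, round(n ** alpha))
    print(f"alpha={alpha}: d={d}, test MSE={interpolant_risk(d):.4f}")
```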

12. MULTIPLE DESCENT
[Figure: the theoretical rate curve over α, d = n^α, shown side by side with the empirical risk curve: theory vs. empirical.]

13. APPLICATION TO WIDE NEURAL NETWORKS
Neural Tangent Kernel (NTK). Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018), …
k_NTK(x, x′) = U(⟨x, x′⟩/(∥x∥∥x′∥)), with U(t) = (1/(4π)) · (3t(π − arccos(t)) + √(1 − t²))
Compositional kernel of a deep neural network (DNN). Daniely et al. (2016); Poole et al. (2016); Liang and Tran-Bach (2020)
k_DNN(x, x′) = Σ_{i=0}^∞ α_i · (⟨x, x′⟩/(∥x∥∥x′∥))^i
Corollary (L., Rakhlin & Zhai, ’19). The multiple descent phenomenon holds for kernels including the NTK and the compositional kernel of a DNN.
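A direct transcription of the slide's NTK formula into Python (the formula is taken as written above; the data is synthetic and illustrative). The resulting Gram matrix plugs into the same min-norm interpolation solve used earlier.

```python
import numpy as np

def U(t):
    # U(t) = (1/(4 pi)) * (3 t (pi - arccos t) + sqrt(1 - t^2)), as on the slide
    t = np.clip(t, -1.0, 1.0)                       # guard rounding just outside [-1, 1]
    return (3 * t * (np.pi - np.arccos(t)) + np.sqrt(1 - t ** 2)) / (4 * np.pi)

def ntk_gram(A, B):
    # k_NTK(x, x') = U(<x, x'> / (||x|| ||x'||))
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return U(An @ Bn.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.normal(size=100)
c = np.linalg.solve(ntk_gram(X, X), y)              # min-RKHS-norm NTK interpolant
```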

14. Precise High-Dimensional Asymptotic Theory for Boosting and Min-ℓ1-Norm Interpolated Classifiers
with Pragya Sur (Harvard)
Classification: f̂ = arg min_f ∥f∥_norm, s.t. y_i · f(x_i) ≥ 1 ∀ i ∈ [n].

15. PROBLEM FORMULATION
Given n i.i.d. data pairs {(x_i, y_i)}_{1≤i≤n} with (x, y) ∼ P: y_i ∈ {±1} are binary labels, and x_i ∈ ℝ^p is the feature vector (weak learners).
Consider the regime where the data is linearly separable:
P(∃ θ ∈ ℝ^p : y_i x_i^⊺ θ > 0 for all 1 ≤ i ≤ n) → 1.
It is natural to consider the overparametrized regime p/n → ψ ∈ (0, ∞).
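In this separable regime, the min-ℓ1-norm interpolated classifier of the section title is a linear program. Below is a minimal sketch via SciPy's linprog, using the standard split θ = u − v with u, v ≥ 0; the synthetic data (with separability forced by construction) and all sizes are illustrative assumptions, not the talk's asymptotic setup.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 50, 200                                  # overparametrized: psi = p / n = 4
theta_star = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = np.sign(X @ theta_star)                     # linearly separable by construction

# min ||theta||_1  s.t.  y_i <x_i, theta> >= 1, via the split theta = u - v, u, v >= 0
c = np.ones(2 * p)                              # objective: sum(u) + sum(v) = ||theta||_1
A_ub = -(y[:, None] * np.hstack([X, -X]))       # encodes -y_i x_i^T (u - v) <= -1
res = linprog(c, A_ub=A_ub, b_ub=-np.ones(n))   # default bounds already give u, v >= 0
theta_hat = res.x[:p] - res.x[p:]

print("l1 norm:", np.abs(theta_hat).sum())
print("min margin:", (y * (X @ theta_hat)).min())   # >= 1 at the optimum
```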
