Generalisation error in learning with random features and the hidden manifold model
B. Loureiro (IPhT), F. Gerace (IPhT), M. Mézard (ENS), L. Zdeborová (IPhT), F. Krzakala (ENS)
ICML 2020
Training a neural net: expectations. [Figure: generalisation error decomposed into variance and |bias|, plotted against model complexity.]
Training a neural net: expectations vs. reality [Geiger et al. ’18]. [Figure: the same quantities (error, variance, |bias|) against complexity, as observed in practice.] See also [Geman et al. ’92; Opper ’95; Neyshabur, Tomioka, Srebro ’15; Advani, Saxe ’17; Belkin, Hsu, Ma, Mandal ’19; Nakkiran et al. ’19]
The usual suspects: architecture, algorithms, data.
The two theory cultures. DATA: what worst-case analysis thinks it looks like, what typical-case analysis thinks it looks like, and what it really looks like.
Spoiler. [Figure: feature space vs. input space.]
Worst-case vs. typical-case: A concrete example
Concrete example. Dataset D = {x^μ, y^μ}_{μ=1}^n with x^μ ∼ 𝒩(0, I_d) and labels y^μ = sign(x^μ · θ_0). [Figure: generalisation error ε_g vs. α = # datapoints / # dimensions, comparing the Rademacher bound for the function class f_θ(x) = sign(x · θ), the Bayes-optimal error, and out-of-the-box logistic regression (sklearn, cross-validated).] [Abbara, Aubin, Krzakala, Zdeborová ’19]
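A minimal numerical sketch of this experiment, assuming d = 200, a 5-fold cross-validated sklearn LogisticRegressionCV and a large held-out test set (illustrative choices, not the paper's exact protocol):

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
d = 200                                        # input dimension (illustrative)
theta0 = rng.standard_normal(d)                # teacher vector

def sample(n):
    X = rng.standard_normal((n, d))            # x^mu ~ N(0, I_d)
    y = np.sign(X @ theta0)                    # y^mu = sign(x^mu . theta0)
    return X, y

for alpha in [0.5, 1, 2, 4, 8]:                # alpha = # datapoints / # dimensions
    X, y = sample(int(alpha * d))
    Xt, yt = sample(20000)                     # large held-out test set
    clf = LogisticRegressionCV(cv=5, max_iter=1000).fit(X, y)
    eps_g = np.mean(clf.predict(Xt) != yt)     # generalisation (classification) error
    print(f"alpha = {alpha:4}: eps_g = {eps_g:.3f}")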
Can we do better?
Hidden Manifold Model [Goldt, Mézard, Krzakala, Zdeborová ’19]. Idea: a dataset in which both the data points and the labels depend only on a set of latent variables:
x^μ = σ(F^⊤ c^μ / √d)
[Figure: feature space vs. input space.]
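A minimal sketch of sampling from the Hidden Manifold Model; the sizes (d, p, n), the erf non-linearity and the sign teacher acting on the latent variables are illustrative assumptions:

import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, p, n = 50, 500, 1000                  # latent dim, input dim, number of samples
F = rng.standard_normal((d, p))          # fixed projection F
theta0 = rng.standard_normal(d)          # teacher acting on the latent space

C = rng.standard_normal((n, d))          # latent variables c^mu ~ N(0, I_d)
X = erf(C @ F / np.sqrt(d))              # inputs x^mu = sigma(F^T c^mu / sqrt(d))
y = np.sign(C @ theta0 / np.sqrt(d))     # labels depend only on the latent c^mu
print(X.shape, y.shape)                  # (1000, 500) (1000,)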
Aim: study classification and regression tasks on this dataset
The task. Learn the labels using a linear model trained by empirical risk minimisation,
ŵ = argmin_w Σ_{μ=1}^n ℓ(y^μ, w · x^μ / √p) + (λ/2) ∥w∥²,
where ℓ is the loss function and (λ/2)∥w∥² a ridge penalty. Examples:
• Ridge regression: ℓ(y, x) = ½ (y − x)²
• Logistic regression: ℓ(y, x) = log(1 + e^{−yx})
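A minimal sketch of the two example estimators on HMM features, using sklearn's Ridge and LogisticRegression as stand-ins for the two losses; the mapping between λ and sklearn's regularisation parameters, the tanh non-linearity and all sizes are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(1)
d, p, n, lam = 50, 300, 600, 0.1
F = rng.standard_normal((d, p))
theta0 = rng.standard_normal(d)
C = rng.standard_normal((n, d))
X = np.tanh(C @ F / np.sqrt(d))                    # HMM inputs (tanh as an example sigma)
y = np.sign(C @ theta0 / np.sqrt(d))

# Square loss + L2 penalty (sklearn's Ridge convention: ||y - Xw||^2 + alpha ||w||^2)
ridge = Ridge(alpha=lam).fit(X, y)
# Logistic loss + L2 penalty (sklearn parameterises the penalty through C ~ 1/lambda)
logit = LogisticRegression(C=1.0 / lam, max_iter=1000).fit(X, y)
print(ridge.coef_.shape, logit.coef_.shape)        # both learn a weight vector w in R^p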
Two alternative points of view [Williams ’98; Rahimi, Recht ’07; Mei, Montanari ’19]. Dataset D = {c^μ, y^μ}_{μ=1}^n.
• Random features: learn on the feature map Φ_F(c) = σ(F^⊤ c).
• Kernel: Φ_F(c) · Φ_F(c′) → K(c, c′) as p → ∞ (Mercer’s theorem).
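A minimal sketch of the kernel limit: the (normalised) inner product of the random features concentrates as p grows; the sizes, the 1/√d scaling and σ = erf are illustrative assumptions:

import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d = 30
c1, c2 = rng.standard_normal(d), rng.standard_normal(d)

for p in [100, 1_000, 10_000, 100_000]:
    F = rng.standard_normal((d, p))           # random first layer
    phi1 = erf(F.T @ c1 / np.sqrt(d))         # Phi_F(c) = sigma(F^T c / sqrt(d))
    phi2 = erf(F.T @ c2 / np.sqrt(d))
    print(p, phi1 @ phi2 / p)                 # concentrates on K(c1, c2) as p grows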
Main result: asymptotic generalisation error for arbitrary loss ℓ and projection F.
Definitions. Consider the unique fixed point of the system of saddle-point equations coupling the overlaps (V_s, q_s, m_s, V_w, q_w) to their conjugates (V̂_s, q̂_s, m̂_s, V̂_w, q̂_w), written in terms of the Stieltjes transform g_μ of the spectral density of FF^⊤ and of
η(y, ω) = argmin_{x ∈ ℝ} [ (x − ω)² / (2V) + ℓ(y, x) ],
𝒵(y, ω) = ∫ dx (2π V_0)^{−1/2} e^{−(x−ω)²/(2V_0)} δ(y − f_0(x)),  V_0 = ρ − M²/Q,
V = κ_1² V_s + κ_⋆² V_w,  Q = κ_1² q_s + κ_⋆² q_w,  M = κ_1 m_s,
ω_0 = (M/√Q) ξ,  ω_1 = √Q ξ,  ξ ∼ 𝒩(0, 1),
κ_0 = 𝔼[σ(z)],  κ_1 ≡ 𝔼[z σ(z)],  κ_⋆² ≡ 𝔼[σ(z)²] − κ_0² − κ_1²,  z ∼ 𝒩(0, 1).
In the high-dimensional limit:
ε_gen = 𝔼_{ν,λ}[ (f_0(ν) − f̂(λ))² ],  with (ν, λ) ∼ 𝒩(0, [[ρ, M⋆], [M⋆, Q⋆]]),
ℒ_training = (λ/(2α)) q⋆_w + 𝔼_{ξ,y}[ 𝒵(y, ω⋆_0) ℓ(y, η(y, ω⋆_1)) ],  with ω⋆_0 = (M⋆/√Q⋆) ξ,  ω⋆_1 = √Q⋆ ξ.
Agrees with [Mei, Montanari ’19], who solved the particular case ℓ(x, y) = ½∥x − y∥², linear f_0 and Gaussian random weights F using random matrix theory.
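A minimal sketch of the constants κ_0, κ_1, κ_⋆ entering the result, evaluated by Gauss–Hermite quadrature; σ = erf is just an example choice:

import numpy as np
from scipy.special import erf
from numpy.polynomial.hermite_e import hermegauss

z, w = hermegauss(100)                 # Gauss-Hermite nodes/weights, weight exp(-z^2/2)
w = w / np.sqrt(2 * np.pi)             # normalise: weights now sum to 1 (standard Gaussian)

sigma = erf
k0 = np.sum(w * sigma(z))                                    # kappa_0 = E[sigma(z)]
k1 = np.sum(w * z * sigma(z))                                # kappa_1 = E[z sigma(z)]
kstar = np.sqrt(np.sum(w * sigma(z) ** 2) - k0**2 - k1**2)   # kappa_*
print(k0, k1, kstar)                   # for erf: k0 = 0, k1 = 2/sqrt(3*pi) ~ 0.6515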
Technical note: replicated Gaussian Equivalence. An important step in the derivation is the observation that the generalisation and training properties of the dataset {x^μ, y^μ}_{μ=1}^n are statistically equivalent to those of the dataset {x̃^μ, y^μ}_{μ=1}^n with the same labels but
x̃^μ = (κ_1/√d) F^⊤ c^μ + κ_⋆ z^μ,  z^μ ∼ 𝒩(0, I_p),
where the coefficients are chosen to match
κ_1 = 𝔼_ξ[ξ σ(ξ)],  κ_⋆² = 𝔼_ξ[σ(ξ)²] − κ_1²,  ξ ∼ 𝒩(0, 1).
A generalisation of an observation in [Mei, Montanari ’19; Goldt, Mézard, Krzakala, Zdeborová ’19].
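A minimal numerical sketch of this equivalence: the empirical second moments of the nonlinear features and of their Gaussian surrogate nearly coincide; the sizes and σ = erf are illustrative assumptions:

import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
sigma = erf

# kappa_1 and kappa_* estimated by Monte Carlo over xi ~ N(0, 1)
xi = rng.standard_normal(1_000_000)
k1 = np.mean(xi * sigma(xi))
kstar = np.sqrt(np.mean(sigma(xi) ** 2) - k1 ** 2)

d, p, n = 100, 50, 100_000
F = rng.standard_normal((d, p))
C = rng.standard_normal((n, d))
Z = rng.standard_normal((n, p))

X = sigma(C @ F / np.sqrt(d))                 # nonlinear features
Xg = k1 * (C @ F) / np.sqrt(d) + kstar * Z    # Gaussian-equivalent features, same labels

# Empirical second moments of the two feature ensembles nearly coincide
print(np.max(np.abs(X.T @ X / n - Xg.T @ Xg / n)))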
Drawing the consequences of our formula
Learning in the HMM. Ridge regression: ℓ(x, y) = ½(x − y)², Gaussian F, σ = erf, f̂ = sign, d/p = 0.1 (# latent / # input dimensions), optimal λ. [Figure: generalisation error vs. # samples / # input dimensions.] Good generalisation performance for small latent space, even at small sample complexity.
Classification tasks. Gaussian F, f_0 = σ = sign, f̂ = sign. [Figure.]
Random vs. orthogonal projections (ridge regression and logistic regression). First layer random i.i.d. Gaussian: F_{iρ} ∼ 𝒩(0, 1/d). First layer structured orthogonal (e.g. subsampled Fourier matrix): F = U^⊤ D V with U, V ∼ Haar [Choromanski, Rowland, Weller ’17].
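A minimal sketch of the two first-layer ensembles, with Haar matrices drawn via QR of Gaussian matrices and a flat singular-value profile as an illustrative assumption:

import numpy as np

rng = np.random.default_rng(0)
d, p = 200, 400

def haar(n):
    # Haar-distributed orthogonal matrix from the QR decomposition of a Gaussian matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))            # fix the sign ambiguity of QR

F_gauss = rng.standard_normal((d, p)) / np.sqrt(d)   # i.i.d. Gaussian: F_ij ~ N(0, 1/d)

U, V = haar(d), haar(p)
D = np.zeros((d, p))
np.fill_diagonal(D, 1.0)                             # flat singular values (illustrative)
F_orth = U.T @ D @ V                                 # structured orthogonal projection

print(np.allclose(F_orth @ F_orth.T, np.eye(d)))     # True: rows are exactly orthonormal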
Separability transition in logistic regression: ℓ(x, y) = log(1 + e^{−xy}), f_0 = sign, σ = erf, f̂ = sign. Cf. Cover ’65. [Figure: phase diagram in the (snr, p/n) plane, cf. Sur & Candès ’18.]
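A minimal sketch of detecting the separability transition with a linear-programming feasibility check (an illustrative criterion; the data generator and sizes are assumptions, not the paper's exact setting):

import numpy as np
from scipy.optimize import linprog
from scipy.special import erf

rng = np.random.default_rng(0)

def separable(X, y):
    # Feasibility of y_i * (x_i . w) >= 1 for some w: equivalent to linear separability
    n, p = X.shape
    A_ub = -y[:, None] * X                     # encodes -y_i (x_i . w) <= -1
    res = linprog(c=np.zeros(p), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * p)
    return res.success

d, p = 50, 200
F = rng.standard_normal((d, p))
theta0 = rng.standard_normal(d)
for alpha in [0.5, 1.0, 2.0, 4.0]:             # alpha = # samples / # features
    n = int(alpha * p)
    C = rng.standard_normal((n, d))
    X = erf(C @ F / np.sqrt(d))
    y = np.sign(C @ theta0 / np.sqrt(d))
    print(f"n/p = {alpha}: separable = {separable(X, y)}")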
Next steps: learning F?
Thank you for your attention! Check our paper on arXiv: 2002.09339 [math.ST]. Contact: brloureiro@gmail.com
References in this talk
F. Gerace, B. Loureiro, F. Krzakala, M. Mézard, L. Zdeborová, “Generalisation error in learning with random features and the hidden manifold model”, arXiv:2002.09339
M. Geiger, S. Spigler, S. d’Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, M. Wyart, “Jamming transition as a paradigm to understand the loss landscape of deep neural networks”, Physical Review E 100(1):012115
S. Goldt, M. Mézard, F. Krzakala, L. Zdeborová, “Modelling the influence of data structure on learning in neural networks: the hidden manifold model”, arXiv:1909.11500
A. Abbara, B. Aubin, F. Krzakala, L. Zdeborová, “Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning”, arXiv:1912.02729
C. Williams, “Computing with infinite networks”, NIPS ’98
A. Rahimi, B. Recht, “Random Features for Large-Scale Kernel Machines”, NIPS ’07
S. Mei, A. Montanari, “The generalization error of random features regression: Precise asymptotics and double descent curve”, arXiv:1908.05355
K. Choromanski, M. Rowland, A. Weller, “The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings”, NIPS ’17
P. Sur, E. J. Candès, “A modern maximum-likelihood theory for high-dimensional logistic regression”, PNAS ’19