Generalisation error in learning with random features and the hidden manifold model




  1. Generalisation error in learning with random features and the hidden manifold model. B. Loureiro (IPhT), F. Gerace (IPhT), M. Mézard (ENS), L. Zdeborová (IPhT), F. Krzakala (ENS). ICML 2020

  2. TRAINING A NEURAL NET: expectations. [Figure: error, variance and |bias| vs. complexity]

  3. TRAINING A NEURAL NET: expectations vs. reality [Geiger et al. '18]. [Figure: error, variance and |bias| vs. complexity] See also [Geman et al. '92; Opper '95; Neyshabur, Tomioka, Srebro '15; Advani, Saxe '17; Belkin, Hsu, Ma, Mandal '19; Nakkiran et al. '19]

  4. The usual suspects

  5. The usual suspects: architecture

  6. The usual suspects: algorithms, architecture

  7. The usual suspects: algorithms, architecture, data

  8. The two theory cultures. [Figure: DATA — what worst-case analysis thinks it looks like, what typical-case analysis thinks it looks like, and what it really looks like]

  9. Spoiler. [Figure: data in feature space vs. input space]

  14. Worst-case vs. typical-case: A concrete example

  15. Concrete example. Dataset D = {x^μ, y^μ}_{μ=1}^n with x^μ ~ 𝒩(0, I_d) and labels y^μ = sign(x^μ · θ_0). Function class: f_θ(x) = sign(x · θ). [Figure: generalisation error ε_g vs. α = #datapoints / #dimensions, comparing the Rademacher bound for this function class, the Bayes-optimal error, and out-of-the-box logistic regression (sklearn, cross-validated).] [Abbara, Aubin, Krzakala, Zdeborová '19]
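
The sklearn baseline on this slide is easy to reproduce in spirit. A minimal sketch, assuming d = 100, a short sweep over α and 5-fold cross-validation of the regularisation; the talk's exact settings are not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Toy version of the slide's experiment: x ~ N(0, I_d), y = sign(x . theta_0),
# fit out-of-the-box logistic regression with cross-validated regularisation.
rng = np.random.default_rng(0)
d = 100
theta0 = rng.standard_normal(d)

def dataset(n):
    X = rng.standard_normal((n, d))
    return X, np.sign(X @ theta0)

for alpha in [1, 2, 4, 8]:                     # alpha = n / d
    Xtr, ytr = dataset(alpha * d)
    Xte, yte = dataset(10_000)
    clf = LogisticRegressionCV(cv=5, fit_intercept=False).fit(Xtr, ytr)
    print(alpha, np.mean(clf.predict(Xte) != yte))   # empirical generalisation error
```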

  16. Can we do better?

  17. Hidden Manifold Model [Goldt, Mézard, Krzakala, Zdeborová '19]. Idea: a dataset where both the data points and the labels depend only on a low-dimensional set of latent variables: x^μ = σ(F^⊤ c^μ / √d). [Figure: feature space vs. input space]
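
A minimal sketch of how such a dataset could be generated, assuming σ = tanh, a sign teacher acting on the latent variables and illustrative dimensions; none of these choices are prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d latent dimensions, p input dimensions, n samples.
d, p, n = 30, 300, 1000

F = rng.standard_normal((d, p))          # fixed projection, F in R^{d x p}
theta0 = rng.standard_normal(d)          # teacher vector acting on the latent space

C = rng.standard_normal((n, d))          # latent Gaussian vectors c^mu
X = np.tanh(C @ F / np.sqrt(d))          # inputs x^mu = sigma(F^T c^mu / sqrt(d)), sigma = tanh here
y = np.sign(C @ theta0 / np.sqrt(d))     # labels depend only on the latent variables
```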

  18. Aim: study classification and regression tasks on this dataset

  19. The task. Learn the labels using a linear model trained by empirical risk minimisation, ŵ = argmin_w [ Σ_{μ=1}^n ℓ(y^μ, w · x^μ/√p) + (λ/2) ‖w‖² ], where ℓ is the loss function and λ the strength of the ridge penalty. Examples:
 • Ridge regression: ℓ(x, y) = ½(x − y)²
 • Logistic regression: ℓ(x, y) = log(1 + e^{−xy})
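
A minimal sketch of the two example objectives, using scikit-learn estimators as stand-ins for the penalised ERM problems above; the data are a toy Gaussian placeholder rather than the HMM inputs, and all sizes and penalty strengths are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

# Toy data standing in for the features (illustrative sizes).
rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))
y = np.sign(X @ rng.standard_normal(p))

# Square loss + L2 penalty ("ridge regression" on the labels).
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)

# Logistic loss + L2 penalty; C is the inverse of the regularisation strength.
logreg = LogisticRegression(C=1.0, fit_intercept=False).fit(X, y)

# Predictions with hat f = sign of the linear predictor.
y_ridge = np.sign(X @ ridge.coef_)
y_logit = logreg.predict(X)
```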


  20. Two alternative points of view [Williams '98; Rahimi, Recht '07; Mei, Montanari '19]. Dataset D = {c^μ, y^μ}_{μ=1}^n

  21. Two alternative points of view [Williams '98; Rahimi, Recht '07; Mei, Montanari '19]. Dataset D = {c^μ, y^μ}_{μ=1}^n. Feature map Φ_F(c) = σ(F^⊤ c); by Mercer's theorem, Φ_F(c) · Φ_F(c′) → K(c, c′) as p → ∞.
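
This kernel limit is easy to see numerically. A minimal sketch, assuming σ = tanh and one common normalisation (inputs scaled by 1/√d inside σ and the inner product divided by p); the slide itself does not fix these conventions.

```python
import numpy as np

# As the number of random features p grows, the normalised inner product of the
# feature maps of two fixed inputs concentrates around a deterministic kernel value.
rng = np.random.default_rng(1)
d = 50
c1, c2 = rng.standard_normal(d), rng.standard_normal(d)

for p in [100, 1_000, 10_000, 100_000]:
    F = rng.standard_normal((d, p))                # random projection F in R^{d x p}
    phi1 = np.tanh(F.T @ c1 / np.sqrt(d))          # Phi_F(c1)
    phi2 = np.tanh(F.T @ c2 / np.sqrt(d))          # Phi_F(c2)
    print(p, phi1 @ phi2 / p)                      # -> K(c1, c2) as p -> infinity
```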

  22. Main result: asymptotic generalisation error for arbitrary loss ℓ and projection F

  23. Definitions. Consider the unique fixed point of a closed system of saddle-point equations for the overlaps (V_s, q_s, m_s, V_w, q_w) and their conjugates (V̂_s, q̂_s, m̂_s, V̂_w, q̂_w); the equations involve the Stieltjes transform g_μ of the spectral density of FF^⊤ and the quantities
  V = κ_1² V_s + κ_⋆² V_w,  Q = κ_1² q_s + κ_⋆² q_w,  M = κ_1 m_s,  V_0 = ρ − M²/Q,
  ω_0 = (M/√Q) ξ,  ω_1 = √Q ξ,  ξ ~ 𝒩(0, 1),
  κ_0 = 𝔼[σ(z)],  κ_1 ≡ 𝔼[z σ(z)],  κ_⋆² ≡ 𝔼[σ(z)²] − κ_0² − κ_1²,  z ~ 𝒩(0, 1),
  𝒵(y, ω) = ∫ dx (2π V_0)^{−1/2} e^{−(x − ω)²/(2V_0)} δ(y − f_0(x)),
  η(y, ω) = argmin_{x ∈ ℝ} [ (x − ω)²/(2V) + ℓ(y, x) ]
  (the full system of equations is written out in arXiv:2002.09339).
  In the high-dimensional limit:
  ε_gen = 𝔼_{ν,λ}[ (f_0(ν) − f̂(λ))² ]  with (ν, λ) ~ 𝒩(0, [[ρ, M^⋆], [M^⋆, Q^⋆]]),
  ℒ_training = (λ/(2α)) q_w^⋆ + 𝔼_{ξ,y}[ 𝒵(y, ω_0^⋆) ℓ(y, η(y, ω_1^⋆)) ]  with ω_0^⋆ = (M^⋆/√Q^⋆) ξ,  ω_1^⋆ = √Q^⋆ ξ.
  Agrees with [Mei, Montanari '19], who solved the particular case ℓ(x, y) = ½‖x − y‖², linear f_0 and Gaussian random weights F using random matrix theory.
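
Once the fixed-point overlaps are known, the generalisation error above is just a two-dimensional Gaussian average. A minimal Monte Carlo sketch, with placeholder values for ρ, M^⋆, Q^⋆ and sign functions for f_0 and f̂; none of these numbers come from the slide.

```python
import numpy as np

# Monte Carlo evaluation of eps_gen = E[(f0(nu) - fhat(lam))^2] for
# (nu, lam) ~ N(0, [[rho, M], [M, Q]]). The overlaps below are placeholders:
# in practice they come from solving the saddle-point equations.
rho, M, Q = 1.0, 0.6, 0.9
f0, fhat = np.sign, np.sign

rng = np.random.default_rng(0)
cov = np.array([[rho, M], [M, Q]])
nu, lam = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T
print(np.mean((f0(nu) - fhat(lam)) ** 2))
```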

  24. Technical note: replicated Gaussian Equivalence. An important step in the derivation of this result is the observation that the generalisation and training properties of the dataset {x^μ, y^μ}_{μ=1}^n are statistically equivalent to those of a dataset {x̃^μ, y^μ}_{μ=1}^n with the same labels but
  x̃^μ = κ_1 F^⊤ c^μ/√d + κ_⋆ z^μ,  z^μ ~ 𝒩(0, I_p),
  where the coefficients are chosen to match: κ_1 = 𝔼_ξ[ξ σ(ξ)], κ_⋆² = 𝔼_ξ[σ(ξ)²] − κ_1², ξ ~ 𝒩(0, 1).
  Generalisation of an observation in [Mei, Montanari '19; Goldt, Mézard, Krzakala, Zdeborová '19].
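
A rough numerical sanity check of this equivalence, assuming σ = tanh, a sign teacher, ridge regression with an arbitrary penalty and illustrative sizes: the classifier trained on the HMM inputs and the one trained on the equivalent Gaussian model should reach similar test errors.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d, p, n = 40, 400, 2000
F = rng.standard_normal((d, p))
theta0 = rng.standard_normal(d)

# Monte Carlo estimates of kappa_1 and kappa_star for sigma = tanh (kappa_0 = 0 since tanh is odd).
xi = rng.standard_normal(1_000_000)
k1 = np.mean(xi * np.tanh(xi))
kstar = np.sqrt(np.mean(np.tanh(xi) ** 2) - k1 ** 2)

def data(m):
    C = rng.standard_normal((m, d))
    y = np.sign(C @ theta0 / np.sqrt(d))
    X = np.tanh(C @ F / np.sqrt(d))                                        # HMM inputs
    Xeq = k1 * (C @ F) / np.sqrt(d) + kstar * rng.standard_normal((m, p))  # equivalent Gaussian model
    return X, Xeq, y

Xtr, Xeq_tr, ytr = data(n)
Xte, Xeq_te, yte = data(n)

for name, A, B in [("hmm", Xtr, Xte), ("equivalent", Xeq_tr, Xeq_te)]:
    w = Ridge(alpha=1.0, fit_intercept=False).fit(A, ytr)
    print(name, np.mean(np.sign(B @ w.coef_) != yte))   # test errors should be close
```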

  25. Drawing the consequences of our formula

  26. Learning in the HMM: ℓ(x, y) = ½(x − y)², Gaussian F, σ = erf, f_0 = sign, f̂ = sign, optimal λ. [Figure: generalisation error vs. #samples / #input dimensions for several values of #latent / #input dimensions; d/p = 0.1.] Good generalisation performance for a small latent space, even at small sample complexity.
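
A rough numerical counterpart of this plot, assuming ridge regression with a fixed penalty, σ = erf, a sign teacher and illustrative sizes; it only probes the qualitative trend (smaller #latent / #input ratio, lower error), not the theory curves.

```python
import numpy as np
from scipy.special import erf
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
p, n = 300, 600

for d in [10, 30, 100, 300]:                      # latent dimension, p and n fixed
    F = rng.standard_normal((d, p))
    theta0 = rng.standard_normal(d)

    def data(m):
        C = rng.standard_normal((m, d))
        return erf(C @ F / np.sqrt(d)), np.sign(C @ theta0 / np.sqrt(d))

    Xtr, ytr = data(n)
    Xte, yte = data(n)
    w = Ridge(alpha=1.0, fit_intercept=False).fit(Xtr, ytr)
    print(d / p, np.mean(np.sign(Xte @ w.coef_) != yte))   # error vs. #latent / #input
```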

  27. Classification tasks: Gaussian F, f_0 = sign, σ = sign, f̂ = sign

  28. Random vs. orthogonal projections. [Figure: ridge regression and logistic regression.] First layer random i.i.d. Gaussian: F_{iρ} ~ 𝒩(0, 1/d). First layer subsampled Fourier matrix [NIPS '17]: F = U^⊤ D V with U, V ~ Haar.
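
A sketch of how the two projection ensembles could be instantiated, with Haar matrices drawn via QR of Gaussian matrices standing in for the slide's subsampled Fourier construction, and a flat spectrum D chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 100, 100   # square case for simplicity

def haar(n):
    # QR of a Gaussian matrix, with a sign correction, gives a Haar orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

F_gauss = rng.standard_normal((d, p)) / np.sqrt(d)   # F_{i rho} ~ N(0, 1/d)

U, V = haar(d), haar(p)
D = np.ones(d)                                       # flat spectrum; the Fourier case fixes D differently
F_orth = U.T @ np.diag(D) @ V                        # orthogonally structured projection
```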

  30. Separability transition in logistic regression: ℓ(x, y) = log(1 + e^{−xy}), f_0 = sign, σ = erf, f̂ = sign. [Cover '65]
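
The logistic loss has no minimiser once the data are linearly separable, which is what drives this transition. A minimal sketch of the underlying separability question in Cover's random-label setting (not the teacher-student phase diagram of the next slide): for Gaussian inputs with random ±1 labels, a separating hyperplane exists with high probability only when n is below roughly 2p. Sizes and the LP-feasibility test are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
p = 200

for alpha in [1.0, 1.5, 1.9, 2.1, 2.5, 3.0]:          # alpha = n / p
    n = int(alpha * p)
    X = rng.standard_normal((n, p))
    y = rng.choice([-1.0, 1.0], size=n)               # random labels (Cover's setting)
    # Feasibility LP: find w with y_i * (x_i . w) >= 1 for all i.
    res = linprog(c=np.zeros(p), A_ub=-y[:, None] * X, b_ub=-np.ones(n),
                  bounds=[(None, None)] * p, method="highs")
    print(alpha, "separable" if res.status == 0 else "not separable")
```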

  31. Separability transition in logistic regression. [Figure: phase diagram in the (snr, p/n) plane.] [Sur & Candès '18]

  32. Next steps: learning F?

  33. Thank you for your attention! Check our paper @ arXiv:2002.09339 [math.ST]. Contact: brloureiro@gmail.com

  34. References in this talk
  F. Gerace, B. Loureiro, F. Krzakala, M. Mézard, L. Zdeborová, "Generalisation error in learning with random features and the hidden manifold model", arXiv:2002.09339
  M. Geiger, S. Spigler, S. d'Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, M. Wyart, "Jamming transition as a paradigm to understand the loss landscape of deep neural networks", Physical Review E, 100(1):012115
  S. Goldt, M. Mézard, F. Krzakala, L. Zdeborová, "Modelling the influence of data structure on learning in neural networks: the hidden manifold model", arXiv:1909.11500
  A. Abbara, B. Aubin, F. Krzakala, L. Zdeborová, "Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning", arXiv:1912.02729
  C. Williams, "Computing with infinite networks", NIPS '98
  A. Rahimi, B. Recht, "Random Features for Large-Scale Kernel Machines", NIPS '07
  S. Mei, A. Montanari, "The generalization error of random features regression: Precise asymptotics and double descent curve", arXiv:1908.05355
  K. Choromanski, M. Rowland, A. Weller, "The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings", NIPS '17
  P. Sur, E. J. Candès, "A modern maximum-likelihood theory for high-dimensional logistic regression", PNAS '19
