  1. The committee machine: Computational to statistical gaps in learning a two-layers neural network. Benjamin Aubin, Antoine Maillard, Jean Barbier, Nicolas Macris, Florent Krzakala & Lenka Zdeborová. Benjamin Aubin, Institut de Physique Théorique. NeurIPS 2018.

  2. « Can we efficiently learn a teacher network from a limited number of samples? »
[Diagram: teacher and student two-layer networks with p input features, K hidden units, first-layer weights W* ∈ R^{p×K} (teacher) and W (student), activation f^(1), a fixed second layer f^(2), and an output Y_i for each input sample X_i.]
๏ Teacher: generates the labels Y_i* from the inputs (X_i)_{i=1}^n
✓ Committee machine: second layer fixed [Schwarze'93]
✓ i.i.d. samples
๏ Student: learns W from the samples (X_i, Y_i*)_{i=1}^n (see the data-generation sketch below)
✓ Learning task possible?
✓ Computational complexity?
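A minimal data-generation sketch of this teacher-student setup, assuming for illustration Gaussian i.i.d. inputs, a Gaussian teacher prior, sign activations, and a majority-vote second layer (the slide keeps f^(1), f^(2) and the prior P_0 generic, so these concrete choices are assumptions):

```python
import numpy as np

# Sketch of the teacher committee machine (assumed choices: Gaussian inputs and
# teacher weights, sign activations, fixed majority-vote second layer).
rng = np.random.default_rng(0)

p, K, n = 1000, 3, 3000                        # features, hidden units, samples (n/p = Theta(1))
W_star = rng.standard_normal((p, K))           # teacher first-layer weights W* in R^{p x K}
X = rng.standard_normal((n, p)) / np.sqrt(p)   # i.i.d. input samples X_i

hidden = np.sign(X @ W_star)                   # first layer f^(1) = sign, one value per hidden unit
Y_star = np.sign(hidden.sum(axis=1))           # fixed second layer f^(2): sign of the sum (majority vote)

# The student only observes (X, Y_star) and tries to recover W
# (at best up to a permutation of the hidden units).
print(X.shape, Y_star.shape)                   # (3000, 1000) (3000,)
```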

  3. Motivation
➡ Traditional approach
๏ Worst-case scenario / PAC bounds: VC dimension & Rademacher complexity
๏ Numerical experiments
➡ Complementary approach
✓ Revisit the statistical-physics typical-case scenario [Sompolinsky'92, Mézard'87]: i.i.d. data coming from a probabilistic model
✓ Theoretical understanding of the generalization performance
✓ Regime: p → ∞, n/p = Θ(1)

  4. Main result (1) - Generalization error
๏ Information-theoretically optimal generalization error (Bayes-optimal case), with an EXPLICIT limit:
$$\epsilon_g^{(p)} \;\equiv\; \frac{1}{2}\,\mathbb{E}_{X,W^\star}\!\Big[\big(Y^\star(XW^\star)-\mathbb{E}_{W|X}\big[Y(XW)\big]\big)^2\Big] \;\xrightarrow[p\to\infty]{}\; \epsilon_g(q^\ast)$$
๏ q*: extremizer of the variational formulation of the mutual information (a toy extremization sketch follows below):
$$\lim_{p\to\infty}\frac{1}{p}\,I(W;Y|X) \;=\; -\sup_{q\in S_K^+}\,\inf_{r\in S_K^+}\Big\{\psi_{P_0}(r)+\alpha\,\Psi_{\rm out}(q)-\frac{1}{2}\mathrm{Tr}(rq)\Big\} + \mathrm{cst}$$
✓ Heuristic replica formula for the mutual information, well known in statistical physics since the 80's
✓ Main contribution: rigorous proof by adaptive (Guerra) interpolation
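To make the role of q* concrete, here is a toy sketch of the sup-inf extremization for a scalar order parameter (K = 1). The two potentials below are hypothetical placeholders, not the paper's ψ_{P_0} and Ψ_out; the point is only how the extremizer q* is read off and then fed into ε_g(q*):

```python
import numpy as np

# Toy sup-inf extremization with scalar q, r (K = 1).
# psi_P0 and Psi_out are hypothetical placeholders, NOT the paper's potentials.
def psi_P0(r):
    return 0.5 * r**2              # convex placeholder for the prior term

def Psi_out(q):
    return q - q**2                # concave placeholder for the channel term

alpha = 2.0                        # sample ratio alpha = n/p
qs = np.linspace(0.0, 1.0, 401)    # grid for the order parameter q
rs = np.linspace(0.0, 5.0, 501)    # grid for the conjugate parameter r

# phi(q, r) = psi_P0(r) + alpha * Psi_out(q) - (1/2) * r * q  (scalar version of Tr(rq)/2)
phi = psi_P0(rs)[None, :] + alpha * Psi_out(qs)[:, None] - 0.5 * np.outer(qs, rs)

inner = phi.min(axis=1)            # inf over r, for each q
q_star = qs[inner.argmax()]        # sup over q -> the extremizer q*
print(f"q* ~ {q_star:.2f}")        # ~ 0.47 with these placeholder potentials
```

The grid extremizer found this way is what the result plugs into the explicit expression ε_g(q*) for the Bayes-optimal generalization error.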

  5. Main result (2) - Message Passing Algorithm
๏ Traditional approach:
‣ Minimize a loss function. Not optimal for a limited number of samples.
๏ Approximate Message Passing (AMP) algorithm:
‣ Expansion of the BP equations on a factor graph; closed set of iterative equations. The messages m_{j→i}(w_j) yield estimates m_j(w_j) of the marginal probabilities.
[Figure: factor graph representation of the committee machine, with variable nodes w_j carrying the prior P_0(w_j) and factor nodes P_out(Y_i | X_i W).]
✓ Conjectured to be optimal among polynomial algorithms
✓ Can be tracked rigorously (state evolution given by the critical points of the replica mutual information) [Montanari-Bayati '10]
(A hedged AMP sketch on a simpler channel follows below.)
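The paper's AMP handles the K-dimensional committee-machine channel P_out(Y | XW); as a hedged illustration of the algorithmic structure only (matched-filter step, prior denoising step, Onsager correction, residual-based noise tracking), here is a minimal AMP sketch on a simpler model swapped in for readability, a linear-Gaussian channel with a Gaussian prior. All names and constants below are assumptions of this sketch:

```python
import numpy as np

# Minimal AMP sketch on y = A x + noise with a Gaussian prior (a simpler channel
# than the committee machine, used only to show the structure of the iteration).
rng = np.random.default_rng(1)

N, n = 1000, 2000                              # unknowns, samples; delta = n/N
delta, sigma = n / N, 0.1
A = rng.standard_normal((n, N)) / np.sqrt(n)   # sensing matrix with ~unit-norm columns
x0 = rng.standard_normal(N)                    # ground truth drawn from the prior N(0, 1)
y = A @ x0 + sigma * rng.standard_normal(n)

def denoise(v, tau2):
    """Posterior mean (and its derivative) of X ~ N(0,1) observed as v = X + N(0, tau2)."""
    return v / (1.0 + tau2), 1.0 / (1.0 + tau2)

x = np.zeros(N)                                # current estimate of x0
z = np.zeros(n)                                # residuals (Onsager-corrected)
eta_prime = 0.0                                # average derivative of the denoiser
for t in range(30):
    z = y - A @ x + (1.0 / delta) * eta_prime * z   # residuals plus Onsager correction
    tau2 = z @ z / n                                # effective noise level on the estimate below
    x, eta_prime = denoise(x + A.T @ z, tau2)       # matched filter, then prior denoising

print("MSE:", np.mean((x - x0) ** 2))          # ~ 0.02 here, vs prior variance 1
```

The state evolution mentioned on the slide plays the same role here: a scalar recursion on tau2 that tracks the per-iteration error of this loop, and whose fixed points match the critical points of the replica potential.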

  6. Gaussian weights - sign activation. Large number of hidden units, K = Θ_p(1).
