The committee machine: Computational to statistical gaps in learning a two-layers neural network
Benjamin Aubin, Antoine Maillard, Jean Barbier, Nicolas Macris, Florent Krzakala & Lenka Zdeborová
Benjamin Aubin - Institut de Physique Théorique - NeurIPS 2018
« Can we efficiently learn a teacher network from a limited number of samples? »

๏ Teacher: a two-layer network with $p$ features, $K$ hidden units and one output; first-layer weights $W^\star \in \mathbb{R}^{p \times K}$, activations $f^{(1)}$ and $f^{(2)}$, second-layer weights $W^{(2)}$ fixed. It generates labels $Y_i^\star = f^{(2)}\big(f^{(1)}(X_i W^\star)\big)$ (sketched in code below).
✓ Committee machine: second layer fixed [Schwarze '93]
✓ i.i.d. samples $(X_i)_{i=1}^n$

๏ Student: same architecture ($f^{(1)}$, $f^{(2)}$, fixed $W^{(2)}$), learns $W$ from the samples $(X_i, Y_i^\star)_{i=1}^n$.
✓ Is the learning task possible?
✓ At what computational complexity?
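As a concrete illustration, here is a minimal NumPy sketch of the teacher model, assuming sign activations for both layers (the paper's canonical example) and i.i.d. Gaussian data and weights; the $1/\sqrt{p}$ scaling of the inputs is an assumption made here to keep the pre-activations of order one.

```python
import numpy as np

rng = np.random.default_rng(0)

p, K, n = 1000, 3, 5000   # features, hidden units, samples (K odd avoids ties)
alpha = n / p             # sample ratio: the control parameter of the analysis

# Teacher weights W* in R^{p x K}: i.i.d. Gaussian entries
W_star = rng.standard_normal((p, K))

# i.i.d. Gaussian samples, scaled so the pre-activations X @ W* are O(1)
X = rng.standard_normal((n, p)) / np.sqrt(p)

def committee(X, W):
    """Committee machine: f1 = f2 = sign, second layer fixed to a majority vote."""
    return np.sign(np.sign(X @ W).sum(axis=1))

Y_star = committee(X, W_star)   # labels the student must learn W from
```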
Motivation

➡ Traditional approach
๏ Worst-case scenario / PAC bounds: VC dimension & Rademacher complexity
๏ Numerical experiments

➡ Complementary approach
✓ Revisit the statistical-physics typical-case scenario [Sompolinsky '92, Mézard '87]: i.i.d. data coming from a probabilistic model
✓ Theoretical understanding of the generalization performance
✓ Regime: $p \to \infty$, $n/p = \alpha = \Theta(1)$
Main result (1) - Generalization error

๏ Information-theoretically optimal generalization error (Bayes-optimal case), with an EXPLICIT limit:

$$\epsilon_g^{(p)} \equiv \frac{1}{2}\,\mathbb{E}_{X, W^\star}\Big[\big(Y^\star(X W^\star) - \mathbb{E}_{W \mid X}\big[Y(X W)\big]\big)^2\Big] \;\xrightarrow[p \to \infty]{}\; \epsilon_g(q^*)$$

๏ $q^*$: extremizer of the variational formulation of the mutual information (iterated numerically below):

$$\lim_{p \to \infty} \frac{1}{p}\, I(W; Y \mid X) = - \sup_{q \in S_K^+} \inf_{r \in S_K^+} \Big\{ \psi_{P_0}(r) + \alpha\, \Psi_{\rm out}(q) - \frac{1}{2} \operatorname{Tr}(r q) \Big\} + \mathrm{cst}$$

Heuristic replica mutual information, well known in statistical physics since the 80's.
✓ Main contribution: rigorous proof by adaptive (Guerra) interpolation
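In practice $q^*$ can be found by iterating the stationarity conditions of the sup-inf problem, $r = 2\alpha \nabla_q \Psi_{\rm out}(q)$ and $q = 2 \nabla_r \psi_{P_0}(r)$. Here is a minimal sketch of this fixed-point iteration, assuming hypothetical callables `Psi_out_grad` and `psi_P0_grad` for the problem-specific gradients (their explicit forms depend on the prior $P_0$ and the output channel, and are not given on the slides):

```python
import numpy as np

def extremize(psi_P0_grad, Psi_out_grad, alpha, K,
              damping=0.5, tol=1e-8, max_iter=1000):
    """Damped fixed-point iteration on the stationarity conditions of the
    sup-inf free-energy functional; q and r are K x K overlap matrices.
    psi_P0_grad and Psi_out_grad are hypothetical problem-specific gradients."""
    q = 0.5 * np.eye(K)                       # initial overlap guess
    for _ in range(max_iter):
        r = 2.0 * alpha * Psi_out_grad(q)     # stationarity in r
        q_new = 2.0 * psi_P0_grad(r)          # stationarity in q
        q_next = damping * q + (1.0 - damping) * q_new
        if np.linalg.norm(q_next - q) < tol:
            return q_next                     # converged overlap q*
        q = q_next
    return q
```

These fixed points coincide with the state-evolution equations that track the AMP algorithm of the next slide.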
Main result (2) - Message Passing Algorithm

๏ Traditional approach:
‣ Minimize a loss function. Not optimal for a limited number of samples.

๏ Approximate Message Passing (AMP) algorithm (a sketch follows below):
‣ Expansion of the BP equations on a factor graph with variable nodes $w_j$ (prior $P_0(w_j)$) and factor nodes $P_{\rm out}(Y_i \mid X_i W)$; messages $m_{j \to i}(w_j)$ and marginals $m_j(w_j)$. A closed set of iterative equations that estimates the marginal probabilities.

[Figure: factor graph representation of the committee machine]

✓ Conjectured to be optimal among polynomial algorithms
✓ Can be tracked rigorously (state evolution given by the critical points of the replica mutual information) [Montanari-Bayati '10]
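To make the structure of the iterations concrete, here is a minimal GAMP-style sketch, assuming the simplest $K = 1$ setting with a Gaussian prior and a Gaussian output channel of variance `delta` (both assumptions made for brevity; the paper's algorithm generalizes this to vector-valued weights $w_j \in \mathbb{R}^K$ and the committee channel $P_{\rm out}$):

```python
import numpy as np

def amp(X, y, delta=0.01, n_iter=50):
    """GAMP for K = 1, prior w ~ N(0, 1), channel y = X w + N(0, delta).
    Illustrates the generic AMP structure: mean/variance messages on both
    sides of the factor graph, plus the Onsager correction term."""
    n, p = X.shape
    X2 = X ** 2
    w_hat, v_w = np.zeros(p), np.ones(p)   # posterior means / variances of w
    g = np.zeros(n)                        # output-channel messages
    for _ in range(n_iter):
        V = X2 @ v_w                       # variance of the pre-activation estimate
        omega = X @ w_hat - V * g          # mean, with the Onsager correction -V*g
        g = (y - omega) / (delta + V)      # g_out for the Gaussian channel
        sigma = 1.0 / (X2.T @ (1.0 / (delta + V)))  # prior-side variance
        R = w_hat + sigma * (X.T @ g)               # prior-side mean
        w_hat = R / (1.0 + sigma)          # Gaussian-prior denoiser: posterior mean
        v_w = sigma / (1.0 + sigma)        #   ... and posterior variance
    return w_hat
```

The scalar quantities `V` and `sigma` concentrate in the high-dimensional limit, which is what makes the rigorous state-evolution tracking of the iterates possible.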
Gaussian weights - sign activation: large number of hidden units, $K = \Theta_p(1)$ ($K$ kept finite as $p \to \infty$, with the large-$K$ limit taken afterwards)