Breaking the Gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms
Ashok Vardhan Makkuva, University of Illinois at Urbana-Champaign
Joint work with Sewoong Oh, Sreeram Kannan, and Pramod Viswanath
Mixture-of-Experts (MoE) [Jacobs, Jordan, Nowlan and Hinton, 1991]
Given an input x, a gating network f(w⊺x) mixes two experts g(a_1⊺x) and g(a_2⊺x) to produce the output y, where f = sigmoid and g ∈ {linear, tanh, ReLU, leakyReLU}.
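As a concrete illustration (not code from the talk), here is a minimal sketch of sampling from this two-expert MoE. The dimension d, noise level sigma, choice g = tanh, and all function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_moe(x, w, a1, a2, g=np.tanh, sigma=0.1, rng=None):
    """Draw y from the 2-expert MoE: the gate f(w^T x) picks expert 1 or 2."""
    rng = np.random.default_rng() if rng is None else rng
    p_gate = sigmoid(w @ x)                  # probability of choosing expert 1
    expert = 1 if rng.random() < p_gate else 2
    mean = g(a1 @ x) if expert == 1 else g(a2 @ x)
    return mean + sigma * rng.normal()       # Gaussian observation noise

# tiny usage example with random ground-truth parameters
d = 5
rng = np.random.default_rng(0)
w, a1, a2 = rng.normal(size=(3, d))
x = rng.normal(size=d)
y = sample_moe(x, w, a1, a2, rng=rng)
```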
Motivation I: Modern relevance of MoE
MoE layers are the key ingredient of "Outrageously Large Neural Networks" (Shazeer et al., 2017), which scale models to billions of parameters via sparsely gated mixtures of experts.
Motivation II: Gated RNNs
Figure: Gated Recurrent Unit (GRU)
Key features: gating mechanism, long-term memory
Motivation II: GRU
Gates: z_t, r_t ∈ [0, 1]^d depend on the input x_t and the past state h_{t−1}
States: h_t, h̃_t ∈ R^d
Update equations for each t:
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
    h̃_t = f(A x_t + r_t ⊙ B h_{t−1})
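A minimal numpy sketch of one GRU-style update with these equations. The logistic gate parameterization with weight matrices W_z, U_z, W_r, U_r and all names are assumptions for illustration, not necessarily the exact form used in the talk.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gru_step(x_t, h_prev, A, B, W_z, U_z, W_r, U_r, f=np.tanh):
    """One GRU update: gates z_t, r_t in [0,1]^d, candidate state h_tilde."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_tilde = f(A @ x_t + r_t * (B @ h_prev))        # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # gated convex combination

# usage: d-dimensional state, p-dimensional input
d, p = 4, 3
rng = np.random.default_rng(1)
A, W_z, W_r = (rng.normal(size=(d, p)) for _ in range(3))
B, U_z, U_r = (rng.normal(size=(d, d)) for _ in range(3))
h = np.zeros(d)
for x_t in rng.normal(size=(5, p)):
    h = gru_step(x_t, h, A, B, W_z, U_z, W_r, U_r)
```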
MoE: Building blocks of GRU
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ (1 − r_t) ⊙ f(A x_t) + z_t ⊙ r_t ⊙ f(A x_t + B h_{t−1})
Diagram: h_t is a gated combination, through (1 − z_t, z_t) and then (1 − r_t, r_t), of three expert networks NN-1, NN-2, NN-3, each taking (x_t, h_{t−1}) as input.
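A minimal sketch (an illustration, not the speaker's code) that evaluates this three-expert form directly, with the gates z_t, r_t supplied as given vectors; all names are placeholders.

```python
import numpy as np

def gru_as_moe(x_t, h_prev, z_t, r_t, A, B, f=np.tanh):
    """GRU state written as a gated mixture of three experts."""
    expert_past = h_prev                      # expert 1: carry the old state
    expert_input = f(A @ x_t)                 # expert 2: input-only network
    expert_full = f(A @ x_t + B @ h_prev)     # expert 3: input + past state
    return ((1 - z_t) * expert_past
            + z_t * (1 - r_t) * expert_input
            + z_t * r_t * expert_full)
```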
What is known about MoE?
No provable learning algorithms for the parameters.¹
¹ "20 years of MoE"; "MoE: a literature survey"
Open problem for 25+ years
The MoE model is equivalent to the conditional density
    P(y | x) = f(w⊺x) · N(y | g(a_1⊺x), σ²) + (1 − f(w⊺x)) · N(y | g(a_2⊺x), σ²)
Open question: Given n i.i.d. samples (x^(i), y^(i)), does there exist an efficient learning algorithm with provable theoretical guarantees to learn the regressors a_1, a_2 and the gating parameter w?
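To make the conditional density concrete, here is a minimal sketch (an illustrative assumption, not code from the talk) that evaluates P(y | x) and the average log-likelihood over a sample.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def moe_density(y, x, w, a1, a2, g=np.tanh, sigma=0.1):
    """P(y | x) for the 2-component MoE: a gated mixture of two Gaussians."""
    p = sigmoid(w @ x)
    return (p * norm.pdf(y, loc=g(a1 @ x), scale=sigma)
            + (1 - p) * norm.pdf(y, loc=g(a2 @ x), scale=sigma))

def log_likelihood(X, Y, w, a1, a2, **kw):
    """Average log-likelihood over n i.i.d. samples (x^(i), y^(i))."""
    return np.mean([np.log(moe_density(y, x, w, a1, a2, **kw))
                    for x, y in zip(X, Y)])
```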
Modular structure
MoE is a mixture of a classification problem (the gating parameter w) and regression problems (the regressors a_1, a_2).
Key observation
If we know the regressors, learning the gating parameter is easy, and vice versa. How do we break the gridlock?
Breaking the gridlock: An overview
Recall the MoE model:
    P(y | x) = f(w⊺x) · N(y | g(a_1⊺x), σ²) + (1 − f(w⊺x)) · N(y | g(a_2⊺x), σ²)
Main message: we propose a novel algorithm with the first recovery guarantees.
- We learn (a_1, a_2) and w separately.
- First recover (a_1, a_2) without knowing w at all (a sketch of this stage follows below).
- Later learn w using traditional methods such as EM.
- Global consistency guarantees (population setting).
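The first stage can be illustrated with a minimal sketch: assuming a standard Gaussian input x, a transform of y is paired with the third-order score function of x to form a symmetric moment tensor, which is then decomposed by tensor power iteration. The specific transform y ↦ y³, the Gaussian-input assumption, and all function names here are illustrative placeholders; the talk's algorithm uses a particular cubic transform and score functions for general input distributions.

```python
import numpy as np

def score_tensor_T3(x):
    """Third-order score function of a standard Gaussian input:
    S_3(x) = x ox x ox x - sym(x ox I)  (the order-3 Hermite tensor)."""
    d = x.shape[0]
    T = np.einsum('i,j,k->ijk', x, x, x)
    I = np.eye(d)
    T -= (np.einsum('i,jk->ijk', x, I)
          + np.einsum('j,ik->ijk', x, I)
          + np.einsum('k,ij->ijk', x, I))
    return T

def estimate_moment_tensor(X, Y, transform=lambda y: y**3):
    """Empirical tensor T_hat = E[ transform(y) * S_3(x) ]; for a suitable
    transform of y this is (approximately) a rank-2 symmetric tensor whose
    components align with a_1 and a_2."""
    d = X.shape[1]
    T = np.zeros((d, d, d))
    for x, y in zip(X, Y):
        T += transform(y) * score_tensor_T3(x)
    return T / len(Y)

def tensor_power_iteration(T, n_components=2, n_restarts=10, n_iters=100, rng=None):
    """Extract components of a (near) rank-k symmetric tensor by
    power iteration with deflation."""
    rng = np.random.default_rng() if rng is None else rng
    d = T.shape[0]
    components, T_work = [], T.copy()
    for _ in range(n_components):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T_work, v, v)
                v /= np.linalg.norm(v) + 1e-12
            lam = np.einsum('ijk,i,j,k->', T_work, v, v, v)
            if abs(lam) > best_lam:
                best_v, best_lam = v, abs(lam)
        lam = np.einsum('ijk,i,j,k->', T_work, best_v, best_v, best_v)
        components.append((lam, best_v))
        T_work -= lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return components  # directions recover a_1, a_2 up to sign and scaling
```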
Algorithm
Pipeline: samples (x, y) → cubic transform of y, score function of x → tensor decomposition → {â_1, â_2} → EM → ŵ
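For the last block of the pipeline, here is a minimal sketch of one way to fit the gating parameter once the regressors are fixed: an EM-style loop that alternates posterior responsibilities with gradient steps on a weighted logistic objective. The step size, iteration counts, initialization, and function names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_gating_em(X, Y, a1_hat, a2_hat, g=np.tanh, sigma=0.1,
                  n_em=50, n_grad=25, lr=0.1):
    """EM for the gating parameter w, with a1_hat, a2_hat held fixed."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_em):
        # E-step: posterior probability that each sample came from expert 1
        p = sigmoid(X @ w)
        l1 = p * norm.pdf(Y, loc=g(X @ a1_hat), scale=sigma)
        l2 = (1 - p) * norm.pdf(Y, loc=g(X @ a2_hat), scale=sigma)
        resp = l1 / (l1 + l2 + 1e-12)
        # M-step: gradient ascent on sum_i resp_i*log f(w.x_i) + (1-resp_i)*log(1-f(w.x_i))
        for _ in range(n_grad):
            grad = X.T @ (resp - sigmoid(X @ w)) / n
            w += lr * grad
    return w
```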
Comparison with EM
Figure: Parameter estimation error vs. number of EM iterations, for Spectral+EM and plain EM. (a) 3 mixtures, (b) 4 mixtures.
Summary
Algorithmic innovation: the first provably consistent algorithms for MoE in 25+ years.
Global convergence: our algorithms come with global convergence guarantees, without the need for careful initialization.
Conclusion
Poster #210 Thank you!