
Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient algorithms



  1. Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient algorithms Ashok Vardhan Makkuva University of Illinois at Urbana-Champaign Joint work with Sewoong Oh, Sreeram Kannan, Pramod Viswanath

  2. Mixture-of-Experts (MoE) [Jacobs, Jordan, Nowlan and Hinton, 1991]. Figure: a two-expert MoE in which the gate $f(w^\top x)$ selects between the expert outputs $g(a_1^\top x)$ and $g(a_2^\top x)$; here $f = \text{sigmoid}$ and $g \in \{\text{linear}, \tanh, \text{ReLU}, \text{leakyReLU}\}$.
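
The generative model on this slide can be simulated in a few lines. Below is a minimal numpy sketch, assuming standard Gaussian inputs and linear experts by default; `sample_moe` and all parameter names are illustrative, not from the paper's code.

```python
import numpy as np

def sample_moe(n, w, a1, a2, sigma=0.1, g=lambda z: z, rng=None):
    """Draw n samples (x, y) from a two-expert MoE with sigmoid gating."""
    rng = np.random.default_rng(rng)
    d = w.shape[0]
    x = rng.standard_normal((n, d))              # inputs x ~ N(0, I_d)
    gate = 1.0 / (1.0 + np.exp(-x @ w))          # f(w^T x), probability of expert 1
    pick_first = rng.random(n) < gate            # latent expert choice
    mean = np.where(pick_first, g(x @ a1), g(x @ a2))
    y = mean + sigma * rng.standard_normal(n)    # additive Gaussian noise
    return x, y

# Example: dimension 5, linear experts
d = 5
rng = np.random.default_rng(0)
w, a1, a2 = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
x, y = sample_moe(10_000, w, a1, a2)
```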

  3. Motivation-I: Modern relevance of MoE ("Outrageously large neural networks").

  4. Motivation-II: Gated RNNs. Figure: Gated Recurrent Unit (GRU). Key features: gating mechanism, long-term memory.

  5. Motivation-II: GRU. Gates: $z_t, r_t \in [0,1]^d$ depend on the input $x_t$ and the past state $h_{t-1}$. States: $h_t, \tilde{h}_t \in \mathbb{R}^d$. Update equations for each $t$: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ and $\tilde{h}_t = f(A x_t + r_t \odot B h_{t-1})$.
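
A minimal numpy sketch of one GRU step implementing these update equations. The gates are computed as sigmoids of linear maps of $x_t$ and $h_{t-1}$, which is the standard GRU parameterization and an assumption beyond what the slide states; biases are omitted for brevity.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(x_t, h_prev, params, f=np.tanh):
    """One GRU update: h_t = (1 - z_t) * h_prev + z_t * f(A x_t + r_t * B h_prev)."""
    A, B, Wz, Uz, Wr, Ur = params
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)       # update gate in [0, 1]^d
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)       # reset gate in [0, 1]^d
    h_tilde = f(A @ x_t + r_t * (B @ h_prev))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde

d_in, d = 3, 4
rng = np.random.default_rng(1)
params = tuple(rng.standard_normal(s) for s in
               [(d, d_in), (d, d), (d, d_in), (d, d), (d, d_in), (d, d)])
h = np.zeros(d)
for x_t in rng.standard_normal((10, d_in)):     # roll the GRU over a short sequence
    h = gru_step(x_t, h, params)
```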

  6. MoE: Building blocks of GRU. $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot (1 - r_t) \odot f(A x_t) + z_t \odot r_t \odot f(A x_t + B h_{t-1})$. Figure: tree view of this update, with the gates $z_t$ and $r_t$ routing between $h_{t-1}$ and the networks NN-1, NN-2, NN-3, each driven by $(x_t, h_{t-1})$.
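
The rewrite above expresses one GRU step as a two-level mixture of experts. For the candidate-state term it is an exact identity whenever the reset gate $r_t$ is binary; for $r_t$ strictly inside $(0,1)$ it is the MoE reading of the GRU rather than an algebraic equality. A small numeric check of the binary case, with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d = 3, 4
A, B = rng.standard_normal((d, d_in)), rng.standard_normal((d, d))
x_t, h_prev = rng.standard_normal(d_in), rng.standard_normal(d)
z_t = rng.random(d)
r_t = rng.integers(0, 2, size=d).astype(float)   # binary reset gate

f = np.tanh
# Standard GRU form of the update.
gru = (1 - z_t) * h_prev + z_t * f(A @ x_t + r_t * (B @ h_prev))
# MoE form from the slide: z_t and r_t gate three "experts".
moe = ((1 - z_t) * h_prev
       + z_t * (1 - r_t) * f(A @ x_t)
       + z_t * r_t * f(A @ x_t + B @ h_prev))
assert np.allclose(gru, moe)
```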


  8. What is known about MoE? No provable learning algorithms for the parameters [Twenty years of MoE; MoE: a literature survey].

  9. Open problem for 25+ years. The MoE model is equivalent to the conditional density $P(y \mid x) = f(w^\top x)\,\mathcal{N}(y \mid g(a_1^\top x), \sigma^2) + (1 - f(w^\top x))\,\mathcal{N}(y \mid g(a_2^\top x), \sigma^2)$. Open question: given $n$ i.i.d. samples $(x^{(i)}, y^{(i)})$, does there exist an efficient learning algorithm with provable theoretical guarantees to learn the regressors $a_1, a_2$ and the gating parameter $w$?
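
For concreteness, the conditional density above transcribes directly into code. A sketch assuming linear experts; `moe_cond_density` is an illustrative helper, not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def moe_cond_density(y, x, w, a1, a2, sigma=0.1, g=lambda z: z):
    """P(y | x) for the two-expert MoE with sigmoid gate f(w^T x)."""
    p1 = 1.0 / (1.0 + np.exp(-x @ w))                       # probability of expert 1
    return (p1 * norm.pdf(y, loc=g(x @ a1), scale=sigma)
            + (1 - p1) * norm.pdf(y, loc=g(x @ a2), scale=sigma))
```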

  10. Modular structure: a mixture of a classification problem ($w$) and regression problems ($a_1, a_2$).

  11. Key observation: if we know the regressors, learning the gating parameter is easy, and vice versa. How do we break this gridlock?
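
To see the first half of the observation concretely: with $(a_1, a_2)$ fixed, $w$ is the only unknown in the log-likelihood, and even plain gradient ascent finds it. The slides use EM for this step, so the sketch below is a simplified stand-in with illustrative step size and iteration count.

```python
import numpy as np
from scipy.stats import norm

def learn_gate_given_experts(x, y, a1, a2, sigma=0.1, lr=0.5, iters=200):
    """Gradient ascent on the MoE log-likelihood over w alone (a1, a2 fixed)."""
    n, d = x.shape
    w = np.zeros(d)
    n1 = norm.pdf(y, loc=x @ a1, scale=sigma)     # expert-1 likelihoods
    n2 = norm.pdf(y, loc=x @ a2, scale=sigma)     # expert-2 likelihoods
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-x @ w))          # gate probabilities f(w^T x)
        mix = p * n1 + (1 - p) * n2 + 1e-12       # mixture density (guarded)
        grad = ((n1 - n2) * p * (1 - p) / mix) @ x / n
        w += lr * grad
    return w
```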

  12. Breaking the gridlock: an overview. Recall the MoE model: $P(y \mid x) = f(w^\top x)\,\mathcal{N}(y \mid g(a_1^\top x), \sigma^2) + (1 - f(w^\top x))\,\mathcal{N}(y \mid g(a_2^\top x), \sigma^2)$. Main message: we propose a novel algorithm with the first recovery guarantees. We learn $(a_1, a_2)$ and $w$ separately: first recover $(a_1, a_2)$ without knowing $w$ at all, then learn $w$ using traditional methods such as EM. Global consistency guarantees (population setting).

  13. Algorithm. Pipeline: the samples $(x^{(i)}, y^{(i)})$ are combined through a cubic transform of $y$ and the score function of $x$ to form a tensor; tensor decomposition yields $\{\hat{a}_1, \hat{a}_2\}$, and EM then yields $\hat{w}$.
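
The two computational primitives in this pipeline, a score-function cross-moment tensor and its rank-one decomposition, can be sketched as follows. This assumes $x \sim \mathcal{N}(0, I_d)$, uses a placeholder transform $h(y) = y^3$ rather than the paper's specific cubic transform, and omits the whitening step that a non-orthogonal decomposition would need; it illustrates the shape of the computation, not the paper's exact estimator.

```python
import numpy as np

def score_tensor_3(x):
    """Third-order Gaussian score (Hermite) tensor S3(x) for x ~ N(0, I_d)."""
    d = x.shape[0]
    eye = np.eye(d)
    t = np.einsum('i,j,k->ijk', x, x, x)
    t -= np.einsum('i,jk->ijk', x, eye)
    t -= np.einsum('j,ik->ijk', x, eye)
    t -= np.einsum('k,ij->ijk', x, eye)
    return t

def empirical_moment_tensor(x, y, h=lambda y: y ** 3):
    """T_hat = (1/n) * sum_i h(y_i) * S3(x_i)."""
    return np.mean([h(yi) * score_tensor_3(xi) for xi, yi in zip(x, y)], axis=0)

def tensor_power_iteration(T, n_components=2, iters=100, rng=None):
    """Extract rank-one directions of a symmetric 3-tensor by power iteration + deflation."""
    rng = np.random.default_rng(rng)
    d = T.shape[0]
    components = []
    for _ in range(n_components):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        for _ in range(iters):
            u = np.einsum('ijk,j,k->i', T, u, u)   # T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)
        components.append((lam, u))
        T = T - lam * np.einsum('i,j,k->ijk', u, u, u)   # deflate
    return components
```

Running `tensor_power_iteration` on the output of `empirical_moment_tensor` returns candidate directions; in the slide's pipeline these are post-processed into $\{\hat{a}_1, \hat{a}_2\}$, after which EM estimates $\hat{w}$.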

  14. Comparison with EM. Figure: parameter estimation error vs. number of EM iterations for (a) 3 mixtures and (b) 4 mixtures, comparing Spectral+EM against plain EM.

  15. Summary. Algorithmic innovation: the first provably consistent algorithms for MoE in 25+ years. Global convergence: our algorithms converge from arbitrary initializations.

  16. Conclusion

  17. Poster #210 Thank you!

