Breaking the Gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms
Ashok Vardhan Makkuva, University of Illinois at Urbana-Champaign
Joint work with Sewoong Oh, Sreeram Kannan, and Pramod Viswanath
Mixture-of-Experts (MoE) [Jacobs, Jordan, Nowlan and Hinton, 1991]
Given an input x, a gating network f(w⊺x) mixes two experts g(a_1⊺x) and g(a_2⊺x) to produce the output y, where f = sigmoid and g ∈ {linear, tanh, ReLU, leakyReLU}.
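As a concrete illustration (not code from the talk), here is a minimal sketch of sampling from this two-expert MoE. The dimension d, noise level sigma, choice g = tanh, and all function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_moe(x, w, a1, a2, g=np.tanh, sigma=0.1, rng=None):
    """Draw y from the 2-expert MoE: the gate f(w^T x) picks expert 1 or 2."""
    rng = np.random.default_rng() if rng is None else rng
    p_gate = sigmoid(w @ x)                  # probability of choosing expert 1
    expert = 1 if rng.random() < p_gate else 2
    mean = g(a1 @ x) if expert == 1 else g(a2 @ x)
    return mean + sigma * rng.normal()       # Gaussian observation noise

# tiny usage example with random ground-truth parameters
d = 5
rng = np.random.default_rng(0)
w, a1, a2 = rng.normal(size=(3, d))
x = rng.normal(size=d)
y = sample_moe(x, w, a1, a2, rng=rng)
```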
Motivation I: Modern relevance of MoE
MoE layers are the key ingredient of "Outrageously Large Neural Networks" (Shazeer et al., 2017), which scale models to billions of parameters via sparsely gated mixtures of experts.
Motivation II: Gated RNNs
Figure: Gated Recurrent Unit (GRU)
Key features: gating mechanism, long-term memory
Motivation II: GRU
Gates: z_t, r_t ∈ [0, 1]^d depend on the input x_t and the past state h_{t−1}
States: h_t, h̃_t ∈ R^d
Update equations for each t:
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
    h̃_t = f(A x_t + r_t ⊙ B h_{t−1})
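A minimal numpy sketch of one GRU-style update with these equations. The logistic gate parameterization with weight matrices W_z, U_z, W_r, U_r and all names are assumptions for illustration, not necessarily the exact form used in the talk.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gru_step(x_t, h_prev, A, B, W_z, U_z, W_r, U_r, f=np.tanh):
    """One GRU update: gates z_t, r_t in [0,1]^d, candidate state h_tilde."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_tilde = f(A @ x_t + r_t * (B @ h_prev))        # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # gated convex combination

# usage: d-dimensional state, p-dimensional input
d, p = 4, 3
rng = np.random.default_rng(1)
A, W_z, W_r = (rng.normal(size=(d, p)) for _ in range(3))
B, U_z, U_r = (rng.normal(size=(d, d)) for _ in range(3))
h = np.zeros(d)
for x_t in rng.normal(size=(5, p)):
    h = gru_step(x_t, h, A, B, W_z, U_z, W_r, U_r)
```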
MoE: Building blocks of GRU
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ (1 − r_t) ⊙ f(A x_t) + z_t ⊙ r_t ⊙ f(A x_t + B h_{t−1})
Diagram: h_t is a gated combination, through (1 − z_t, z_t) and then (1 − r_t, r_t), of three expert networks NN-1, NN-2, NN-3, each taking (x_t, h_{t−1}) as input.
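A minimal sketch (an illustration, not the speaker's code) that evaluates this three-expert form directly, with the gates z_t, r_t supplied as given vectors; all names are placeholders.

```python
import numpy as np

def gru_as_moe(x_t, h_prev, z_t, r_t, A, B, f=np.tanh):
    """GRU state written as a gated mixture of three experts."""
    expert_past = h_prev                      # expert 1: carry the old state
    expert_input = f(A @ x_t)                 # expert 2: input-only network
    expert_full = f(A @ x_t + B @ h_prev)     # expert 3: input + past state
    return ((1 - z_t) * expert_past
            + z_t * (1 - r_t) * expert_input
            + z_t * r_t * expert_full)
```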
What is known about MoE?
No provable learning algorithms for the parameters.¹
¹ "20 years of MoE"; "MoE: a literature survey"
Open problem for 25+ years
The MoE model is equivalent to the conditional density
    P(y | x) = f(w⊺x) · N(y | g(a_1⊺x), σ²) + (1 − f(w⊺x)) · N(y | g(a_2⊺x), σ²)
Open question: Given n i.i.d. samples (x^(i), y^(i)), does there exist an efficient learning algorithm with provable theoretical guarantees to learn the regressors a_1, a_2 and the gating parameter w?
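To make the conditional density concrete, here is a minimal sketch (an illustrative assumption, not code from the talk) that evaluates P(y | x) and the average log-likelihood over a sample.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def moe_density(y, x, w, a1, a2, g=np.tanh, sigma=0.1):
    """P(y | x) for the 2-component MoE: a gated mixture of two Gaussians."""
    p = sigmoid(w @ x)
    return (p * norm.pdf(y, loc=g(a1 @ x), scale=sigma)
            + (1 - p) * norm.pdf(y, loc=g(a2 @ x), scale=sigma))

def log_likelihood(X, Y, w, a1, a2, **kw):
    """Average log-likelihood over n i.i.d. samples (x^(i), y^(i))."""
    return np.mean([np.log(moe_density(y, x, w, a1, a2, **kw))
                    for x, y in zip(X, Y)])
```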
Modular structure
MoE is a mixture of a classification problem (the gating parameter w) and regression problems (the regressors a_1, a_2).
Key observation
If we know the regressors, learning the gating parameter is easy, and vice versa. How do we break the gridlock?
Breaking the gridlock: An overview
Recall the MoE model:
    P(y | x) = f(w⊺x) · N(y | g(a_1⊺x), σ²) + (1 − f(w⊺x)) · N(y | g(a_2⊺x), σ²)
Main message: we propose a novel algorithm with the first recovery guarantees.
- We learn (a_1, a_2) and w separately.
- First recover (a_1, a_2) without knowing w at all (a sketch of this stage follows below).
- Later learn w using traditional methods such as EM.
- Global consistency guarantees (population setting).
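The first stage can be illustrated with a minimal sketch: assuming a standard Gaussian input x, a transform of y is paired with the third-order score function of x to form a symmetric moment tensor, which is then decomposed by tensor power iteration. The specific transform y ↦ y³, the Gaussian-input assumption, and all function names here are illustrative placeholders; the talk's algorithm uses a particular cubic transform and score functions for general input distributions.

```python
import numpy as np

def score_tensor_T3(x):
    """Third-order score function of a standard Gaussian input:
    S_3(x) = x ox x ox x - sym(x ox I)  (the order-3 Hermite tensor)."""
    d = x.shape[0]
    T = np.einsum('i,j,k->ijk', x, x, x)
    I = np.eye(d)
    T -= (np.einsum('i,jk->ijk', x, I)
          + np.einsum('j,ik->ijk', x, I)
          + np.einsum('k,ij->ijk', x, I))
    return T

def estimate_moment_tensor(X, Y, transform=lambda y: y**3):
    """Empirical tensor T_hat = E[ transform(y) * S_3(x) ]; for a suitable
    transform of y this is (approximately) a rank-2 symmetric tensor whose
    components align with a_1 and a_2."""
    d = X.shape[1]
    T = np.zeros((d, d, d))
    for x, y in zip(X, Y):
        T += transform(y) * score_tensor_T3(x)
    return T / len(Y)

def tensor_power_iteration(T, n_components=2, n_restarts=10, n_iters=100, rng=None):
    """Extract components of a (near) rank-k symmetric tensor by
    power iteration with deflation."""
    rng = np.random.default_rng() if rng is None else rng
    d = T.shape[0]
    components, T_work = [], T.copy()
    for _ in range(n_components):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T_work, v, v)
                v /= np.linalg.norm(v) + 1e-12
            lam = np.einsum('ijk,i,j,k->', T_work, v, v, v)
            if abs(lam) > best_lam:
                best_v, best_lam = v, abs(lam)
        lam = np.einsum('ijk,i,j,k->', T_work, best_v, best_v, best_v)
        components.append((lam, best_v))
        T_work -= lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return components  # directions recover a_1, a_2 up to sign and scaling
```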
Algorithm
Pipeline: samples (x, y) → cubic transform of y, score function of x → tensor decomposition → {â_1, â_2} → EM → ŵ
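For the last block of the pipeline, here is a minimal sketch of one way to fit the gating parameter once the regressors are fixed: an EM-style loop that alternates posterior responsibilities with gradient steps on a weighted logistic objective. The step size, iteration counts, initialization, and function names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_gating_em(X, Y, a1_hat, a2_hat, g=np.tanh, sigma=0.1,
                  n_em=50, n_grad=25, lr=0.1):
    """EM for the gating parameter w, with a1_hat, a2_hat held fixed."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_em):
        # E-step: posterior probability that each sample came from expert 1
        p = sigmoid(X @ w)
        l1 = p * norm.pdf(Y, loc=g(X @ a1_hat), scale=sigma)
        l2 = (1 - p) * norm.pdf(Y, loc=g(X @ a2_hat), scale=sigma)
        resp = l1 / (l1 + l2 + 1e-12)
        # M-step: gradient ascent on sum_i resp_i*log f(w.x_i) + (1-resp_i)*log(1-f(w.x_i))
        for _ in range(n_grad):
            grad = X.T @ (resp - sigmoid(X @ w)) / n
            w += lr * grad
    return w
```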
Comparison with EM
Figure: Parameter estimation error vs. number of EM iterations, for Spectral+EM and plain EM. (a) 3 mixtures, (b) 4 mixtures.
Summary
Algorithmic innovation: the first provably consistent algorithms for MoE in 25+ years.
Global convergence: our algorithms come with global convergence guarantees, without the need for careful initialization.
Conclusion
Poster #210 Thank you!