Breaking the Softmax Bottleneck via Monotonic Functions
Octavian Ganea, Sylvain Gelly, Gary Bécigneul, Aliaksei Severyn
Softmax Layer (for Language Models)
● Natural language as conditional distributions: P*(x | c) over the next word x given the context c
● Parametric distributions via softmax: P_θ(x | c) = exp(h_c^T w_x) / Σ_{x'} exp(h_c^T w_{x'}), with context embedding h_c and word embedding w_x
● Challenge: can we always find parameters θ such that P_θ(· | c) = P*(· | c) for all c?
● No, when the embedding size d < label cardinality (vocabulary size)!
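As a concrete reference point (not part of the talk), here is a minimal NumPy sketch of the softmax layer described above; the names h and W are hypothetical stand-ins for the context embedding and the word-embedding matrix.

```python
import numpy as np

def softmax_layer(h, W):
    """Softmax over the vocabulary: P(x | c) = exp(h^T w_x) / sum_x' exp(h^T w_x').

    h: (d,) context embedding, W: (M, d) word embeddings (hypothetical names).
    """
    logits = W @ h                    # (M,) scores h^T w_x
    logits -= logits.max()            # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Tiny usage example with random embeddings (d = 4, vocabulary size M = 10).
rng = np.random.default_rng(0)
h = rng.normal(size=4)
W = rng.normal(size=(10, 4))
print(softmax_layer(h, W).sum())      # ~1.0
```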
What is the Softmax Bottleneck (Yang et al., '18)?
● log-P matrix: A ∈ R^{N×M} with A_{c,x} = log P*(x | c); N contexts, M = label cardinality = vocabulary size
● Then: the softmax log-probabilities equal H W^T (H ∈ R^{N×d}, W ∈ R^{M×d}) up to a per-row shift, so their rank is at most d + 1; when d + 1 < rank(A), no softmax parameters can match the true distributions -- the softmax bottleneck
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018
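A small NumPy sketch (my own illustration, with a random full-rank target and d = 2) of the rank argument: the matrix of softmax log-probabilities is a row-shifted version of H W^T, so its rank stays below d + 1, which can be far below the rank of the true log-P matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 50, 30, 2                         # contexts, vocab size, embedding size

# A "true" log-P matrix: generically full rank (rank = min(N, M) = 30).
true_logP = np.log(rng.dirichlet(np.ones(M), size=N))
print("rank of true log-P:", np.linalg.matrix_rank(true_logP))

# Log-probabilities of a softmax model: row-wise shifted H W^T.
H = rng.normal(size=(N, d))
W = rng.normal(size=(M, d))
logits = H @ W.T
model_logP = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print("rank of softmax log-P:", np.linalg.matrix_rank(model_logP))   # <= d + 1 = 3
```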
Breaking the Softmax Bottleneck [1]
● MoS [1]: Mixture of K Softmaxes: P(x | c) = Σ_{k=1}^{K} π_{c,k} · exp(h_{c,k}^T w_x) / Σ_{x'} exp(h_{c,k}^T w_{x'})
● Improves perplexity
● Slower than vanilla softmax: 2 - 6.4x
● GPU memory: an M x N x K tensor, vs. M x N for vanilla softmax
[1] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018
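A minimal PyTorch sketch of a Mixture-of-Softmaxes head in the spirit of MoS [1]; the layer sizes, the projection producing the K context vectors, and the prior network are my assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """P(x|c) = sum_k pi_{c,k} * softmax(h_{c,k} W^T)_x   (sketch, hypothetical sizes)."""

    def __init__(self, d_context, d_embed, vocab_size, K):
        super().__init__()
        self.K, self.d_embed = K, d_embed
        self.prior = nn.Linear(d_context, K)              # mixture weights pi_{c,k}
        self.proj = nn.Linear(d_context, K * d_embed)     # K context vectors h_{c,k}
        self.W = nn.Parameter(torch.randn(vocab_size, d_embed) * 0.02)

    def forward(self, h_context):                         # (B, d_context)
        B = h_context.shape[0]
        pi = F.softmax(self.prior(h_context), dim=-1)     # (B, K)
        h = torch.tanh(self.proj(h_context)).view(B, self.K, self.d_embed)
        probs = F.softmax(h @ self.W.T, dim=-1)           # (B, K, M): the extra K factor in memory
        return (pi.unsqueeze(-1) * probs).sum(dim=1)      # (B, M)

probs = MixtureOfSoftmaxes(d_context=16, d_embed=8, vocab_size=100, K=3)(torch.randn(4, 16))
print(probs.sum(dim=-1))                                  # each row sums to ~1
```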
Breaking the Softmax Bottleneck [2]
● Sig-Softmax [2]: P(x | c) ∝ exp(z_{c,x}) · σ(z_{c,x}), i.e. the softmax exponential is replaced by the fixed non-linearity g(z) = exp(z) σ(z)
● Small improvement over vanilla Softmax
● Can we learn the best non-linearity to deform the logits?
[2] Sigsoftmax: Reanalysis of the Softmax Bottleneck, S. Kanai et al., NIPS 2018
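A minimal sketch of the sigsoftmax output function of [2] (probabilities proportional to exp(z) · sigmoid(z)); the function name and the log-space implementation detail are mine.

```python
import torch
import torch.nn.functional as F

def sigsoftmax(logits, dim=-1):
    """Sigsoftmax [2]: P_i proportional to exp(z_i) * sigmoid(z_i).

    Computed in log space for numerical stability:
    log g(z) = z + log sigmoid(z), then normalize with a softmax.
    """
    log_g = logits + F.logsigmoid(logits)
    return torch.softmax(log_g, dim=dim)

z = torch.tensor([[2.0, -1.0, 0.5]])
print(sigsoftmax(z))    # rows sum to 1; a fixed non-linear deformation of softmax(z)
```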
Can we do better?
● Our idea - learn a pointwise monotonic function f applied on top of the logits: P(x | c) = exp(f(h_c^T w_x)) / Σ_{x'} exp(f(h_c^T w_{x'}))
● f should be:
1. With unbounded image set -- to model sparse distributions
2. Continuous and (piecewise) differentiable -- for backprop
3. Non-linear -- to break the softmax bottleneck (see the rank sketch after this slide)
4. Monotonic -- to preserve the ranking of logits
5. Fast and memory efficient -- comparable w/ vanilla softmax
● Theorem: these properties are not restrictive in terms of rank deficiency
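A NumPy sketch of why property 3 matters; the particular non-linearity (sinh) is just an illustrative choice, not the learned function from the talk. Applying a strictly increasing, non-linear function pointwise to a rank-d logit matrix typically yields a matrix of much higher rank, so the resulting log-probabilities are no longer capped at rank d + 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 50, 30, 2
Z = rng.normal(size=(N, d)) @ rng.normal(size=(M, d)).T      # logits, rank <= d

f = np.sinh     # strictly increasing, non-linear, unbounded (illustrative choice only)

print("rank of logits Z:", np.linalg.matrix_rank(Z))          # 2
print("rank of f(Z):    ", np.linalg.matrix_rank(f(Z)))       # much larger than d
# The model's log-probabilities are f(Z) shifted by a per-row constant,
# so their rank is no longer limited to d + 1.
```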
Learnable parametric monotonic real functions
● A neural network with 1 hidden layer and positive (constrained) weights [3]
● Universal approximator for all monotonic functions (when the hidden width K is large enough!)
[3] Monotone and Partially Monotone Neural Networks, Daniels and Velikova, 2010, IEEE Transactions on Neural Networks
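A minimal PyTorch sketch of a 1-hidden-layer monotone network in the spirit of [3]; the softplus positivity constraint, the tanh activation, and the extra linear term (added so the image is unbounded, per property 1) are my assumptions, not necessarily the exact parametrization used in the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneMLP(nn.Module):
    """f(z) = s*z + sum_k a_k * tanh(b_k * z + c_k), with s, a_k, b_k > 0.

    Positive weights via softplus make f strictly increasing for any parameter
    values; the linear term keeps the image unbounded. (Sketch only.)
    """

    def __init__(self, K=100):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(K))     # softplus(0) > 0
        self.b = nn.Parameter(torch.zeros(K))
        self.c = nn.Parameter(torch.zeros(K))
        self.s = nn.Parameter(torch.zeros(1))

    def forward(self, z):                         # z: logits of any shape
        zk = z.unsqueeze(-1)                      # broadcast against the K hidden units
        hidden = torch.tanh(F.softplus(self.b) * zk + self.c)
        return F.softplus(self.s) * z + (F.softplus(self.a) * hidden).sum(-1)

f = MonotoneMLP(K=100)
print(f(torch.linspace(-3, 3, 7)))                # non-decreasing in z by construction
# Note: applied to an (N, M) logit matrix this materializes an (N, M, K) tensor,
# which is exactly the memory cost that motivates PLIF on the next slide.
```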
Synthetic Experiment
● Goal: separate the softmax bottleneck from the context embedding bottleneck
● Sample N different categorical distributions with M outcomes each: P*(· | c), c = 1, ..., N
● Goal: fit P_θ(· | c) to the sampled P*(· | c) with embedding size d < M
● Independent context embeddings; shared word embeddings
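A possible setup for this experiment, sketched under my own assumptions: I take the N target categoricals from a Dirichlet prior (the β = 0.01 on the next slide reads like a Dirichlet concentration, but that is a guess), with one free embedding per context and a shared word-embedding matrix, so the only remaining constraint is the softmax itself.

```python
import torch

N, M, d, beta = 512, 100, 10, 0.01

# Target distributions: N categoricals over M outcomes. Dirichlet(beta) is an
# assumption; a small beta gives peaky, near-sparse targets.
targets = torch.distributions.Dirichlet(torch.full((M,), beta)).sample((N,))   # (N, M)

# Free (independent) context embeddings and a shared word-embedding matrix,
# so the only bottleneck left is the softmax (d < M).
H = torch.nn.Parameter(torch.randn(N, d) * 0.1)
W = torch.nn.Parameter(torch.randn(M, d) * 0.1)

def model_log_probs():
    # Vanilla softmax baseline; the monotonic variant would apply f to H @ W.T first.
    return torch.log_softmax(H @ W.T, dim=-1)

loss = -(targets * model_log_probs()).sum(-1).mean()   # cross-entropy to the targets
loss.backward()
```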
Synthetic Experiments - Mode Matching (β = 0.01)
● Percentage of contexts c for which the model's mode matches the true mode, i.e. argmax_x P_θ(x | c) = argmax_x P*(x | c)
[Figure: three panels, Vanilla Softmax, Monotonic fn (K=100), and Ratio 2nd / 1st, each plotted against the number of words M]
● Similar results for cross-entropy and other values of β
Piecewise Linear Increasing Functions (PLIF)
● NN w/ 1 hidden layer ⇒ memory hungry: a tensor of size N x M x K on GPU, where K >= 1000
● PLIF: a piecewise linear function with K fixed knots and learned positive slopes, hence increasing by construction
● Forward & backward passes: just a lookup in two K-dim vectors
● Memory and running time very efficient (comparable with Vanilla Softmax)
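A minimal PLIF sketch under my own assumptions about the parametrization (fixed knots on [-t, t], positive slopes via softplus, cumulative-sum offsets); the point from the slide it illustrates is that evaluating f reduces to bucketing the input and looking up two K-dimensional vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PLIF(nn.Module):
    """Piecewise linear increasing function on K fixed bins over [-t, t] (sketch).

    Positive slopes (via softplus) make f increasing; outside [-t, t] the first /
    last linear piece is extended, so the image stays unbounded.
    """

    def __init__(self, K=1000, t=10.0):
        super().__init__()
        self.K, self.t = K, t
        self.width = 2 * t / K                                  # bin width
        self.raw_slopes = nn.Parameter(torch.zeros(K))          # softplus(0) > 0

    def forward(self, z):
        slopes = F.softplus(self.raw_slopes)                    # (K,) positive slopes
        # Offsets: value of f at the left edge of each bin (cumulative sum of slopes).
        offsets = torch.cat([torch.zeros(1, device=z.device),
                             torch.cumsum(slopes * self.width, 0)[:-1]])
        # Bucket each input, then look up its bin's slope and offset (two K-dim vectors).
        idx = ((z + self.t) / self.width).floor().long().clamp(0, self.K - 1)
        left_edge = -self.t + idx.to(z.dtype) * self.width
        return offsets[idx] + slopes[idx] * (z - left_edge)

f = PLIF(K=1000)
print(f(torch.linspace(-12, 12, 9)))      # non-decreasing; end pieces extend past [-t, t]
```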
Language Modeling Results
[Table: language modeling (perplexity) results; GPU memory N x M vs. N x M x K]
Thank you! Poster #23