Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification
Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney
Agenda
• Background
• Methodology
• Experiments
• Conclusion
Background: Contextual Word Classification
• language modeling (LM) and machine translation (MT)
• dominated by neural networks (NN)
• $p(w_1^T) = \prod_{t=1}^{T} p(w_t \mid w_0^{t-1}) = \prod_{t=1}^{T} p(w_t \mid h_t)$
• despite many choices to learn the context vector $h$:
  – feed-forward NN
  – recurrent NN
  – convolutional NN
  – self-attention NN
  – ...
• output is often modeled with standard softmax and trained with cross entropy (a minimal sketch follows):
  $\mathrm{p}(w_v \mid h) = \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}, \qquad L = -\log \mathrm{p}(w_v \mid h)$
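To make the output layer concrete, here is a minimal NumPy sketch of the standard softmax with cross-entropy loss. The vocabulary size, hidden dimension, random parameters, and target index are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 512                 # illustrative vocabulary size and hidden dimension
W = rng.normal(0, 0.02, (d, V))    # output word embeddings, one column per word
h = rng.normal(size=d)             # context vector from some NN encoder

# standard softmax: p(w_v | h) = exp(W_v^T h) / sum_v' exp(W_v'^T h)
logits = W.T @ h
logits -= logits.max()             # shift by the max for numerical stability
p = np.exp(logits) / np.exp(logits).sum()

# cross-entropy loss for an observed word v (hypothetical index)
v = 42
loss = -np.log(p[v])
print(f"p(w_v | h) = {p[v]:.3e}, loss = {loss:.3f}")
```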
Background: Softmax Bottleneck [Yang et al., 2018]
• from previous: $L = -\log \mathrm{p}(w_v \mid h) = -\log \left( \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)} \right)$
• exponential-and-logarithm calculation: $\log \mathrm{p}(w_v \mid h) + \log \sum_{v'=1}^{V} \exp(W_{v'}^T h) = W_v^T h$
• approximate true log posteriors with inner products: $\log \tilde{\mathrm{p}}(w_v \mid h) + C_h \approx W_v^T h$
Background: Softmax Bottleneck Cont.
• from previous: $\log \tilde{\mathrm{p}}(w_v \mid h) + C_h \approx W_v^T h$
• in matrix form:
  $\log \underbrace{\begin{pmatrix} \tilde{\mathrm{p}}(w_1 \mid h_1) & \cdots & \tilde{\mathrm{p}}(w_1 \mid h_N) \\ \vdots & \ddots & \vdots \\ \tilde{\mathrm{p}}(w_V \mid h_1) & \cdots & \tilde{\mathrm{p}}(w_V \mid h_N) \end{pmatrix}}_{V \times N, \ \mathrm{rank} \sim V} + \underbrace{\begin{pmatrix} C_{h_1} & \cdots & C_{h_N} \end{pmatrix}}_{1 \times N} \approx \underbrace{\begin{pmatrix} W_1 & \cdots & W_V \end{pmatrix}^T}_{V \times d} \underbrace{\begin{pmatrix} h_1 & \cdots & h_N \end{pmatrix}}_{d \times N, \ \mathrm{rank} \sim d}$
• factorization of the true log posterior matrix: $\log \tilde{P} + C \approx W^T H$
• Softmax Bottleneck:
  – $\log \tilde{P}$ is high-rank for natural language: $\mathrm{rank}(\log \tilde{P}) \sim V$
  – $C$ decreases the rank of the left-hand side by at most 1
  – rank of $W^T H$ is bounded by the hidden dimension: $\mathrm{rank}(W^T H) \sim d$
  – typically $V \sim 100{,}000$ and $d \sim 1000$, so $\approx$ becomes $\not\approx$ (a numeric rank check follows)
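The rank bound is easy to check numerically. A minimal sketch with made-up sizes: however large $V$ and $N$ are, the logit matrix $W^T H$ can never exceed rank $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, N = 500, 16, 300             # illustrative sizes with d << V, N
W = rng.normal(size=(d, V))        # word embeddings
H = rng.normal(size=(d, N))        # context vectors

A = W.T @ H                        # V x N matrix of logits
print(np.linalg.matrix_rank(A))    # -> 16, bounded by d, far below min(V, N)
```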
Background: Breaking Softmax Bottleneck
• mixture-of-softmax (MoS) [Yang et al., 2018]:
  $\mathrm{p}_{\text{mos}}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(W_v^T f_k(h))}{\sum_{v'=1}^{V} \exp(W_{v'}^T f_k(h))}$
• sigsoftmax [Kanai et al., 2018]:
  $\mathrm{p}_{\text{sigsoftmax}}(w_v \mid h) = \frac{\exp(W_v^T h) \, \sigma(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h) \, \sigma(W_{v'}^T h)}$
• weight norm regularization [Herold et al., 2018]:
  $L_{\text{wnr}} = -\sum_{\text{data}} \log(\mathrm{p}_{\text{mos}}(w_v \mid h)) + \rho \sqrt{\frac{\sum_{v=1}^{V} (\|W_v\|_2 - \nu)^2}{V}}$
(a MoS code sketch follows)
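A minimal NumPy sketch of mixture-of-softmax; the tanh context transforms follow [Yang et al., 2018], while the sizes and random parameters are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
V, d, K = 1000, 64, 4                      # illustrative sizes
W = rng.normal(0, 0.02, (d, V))            # shared word embeddings
M = rng.normal(0, 0.02, (d, K))            # mixture-weight projection
Q = rng.normal(0, 0.02, (K, d, d))         # per-component context projections
h = rng.normal(size=d)

pi = softmax(M.T @ h)                      # mixture weights pi_k
f = np.tanh(np.einsum('kij,j->ki', Q, h))  # transformed contexts f_k(h)
p_mos = pi @ softmax(f @ W, axis=-1)       # sum_k pi_k * softmax(W^T f_k(h))
print(p_mos.sum())                         # -> 1.0, a valid distribution
```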
Background: Breaking Softmax Bottleneck Cont.
• let $z = W_v^T h$
• theoretically, to break the softmax bottleneck with an activation $g(z)$, it should satisfy [Kanai et al., 2018]:
  – nonlinearity of $\log(g(z))$
  – numerical stability
  – non-negativity
  – monotonic increase
(a small numeric check follows)
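As a sanity check (an illustrative sketch, not from the talk), the sigsoftmax activation $g(z) = \exp(z)\,\sigma(z)$ can be tested against these criteria numerically:

```python
import numpy as np

z = np.linspace(-5, 5, 11)
g = np.exp(z) / (1 + np.exp(-z))    # g(z) = exp(z) * sigmoid(z)

# log g(z) = z + log sigmoid(z), written in a numerically stable form
log_g = z - np.log1p(np.exp(-z))

# nonlinearity: second differences of log g(z) are non-zero
print(np.allclose(np.diff(log_g, 2), 0))        # -> False, log g is nonlinear
# non-negativity and monotonic increase
print((g >= 0).all(), (np.diff(g) > 0).all())   # -> True True
```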
Background: Geometric Explanation of Softmax Bottleneck
• an intuitive example:
  – $\tilde{\mathrm{p}}(\text{dog} \mid \text{a common home pet is ...}) \approx \tilde{\mathrm{p}}(\text{cat} \mid \text{a common home pet is ...}) \approx 50\%$
  – learned word vectors are close: $W_{\text{dog}} \approx W_{\text{cat}}$
  – posteriors over dog and cat are thus close for every context: $\mathrm{p}(\text{dog} \mid ...) \approx \mathrm{p}(\text{cat} \mid ...)$
  – there exist contexts that would fail the model, e.g. ones where dog should be far more likely than cat
  – overall an expressiveness problem (a numeric illustration follows)
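To illustrate with a hypothetical numeric sketch: once $W_{\text{dog}}$ and $W_{\text{cat}}$ are close, the log-posterior gap $|\log \mathrm{p}(\text{dog} \mid h) - \log \mathrm{p}(\text{cat} \mid h)| = |(W_{\text{dog}} - W_{\text{cat}})^T h|$ stays small for any bounded context $h$, so no context can strongly prefer one of the two words.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W_dog = rng.normal(size=d)
W_cat = W_dog + 1e-3 * rng.normal(size=d)   # nearly identical embeddings

# worst-case log-probability gap over contexts with ||h||_2 <= 10 (Cauchy-Schwarz)
gap = np.linalg.norm(W_dog - W_cat) * 10.0
print(f"max |log p(dog|h) - log p(cat|h)| <= {gap:.4f}")   # tiny -> near-equal posteriors
```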
Background: Kernel Trick
• widely used in support-vector machines (SVM) and logistic regression
• improves expressiveness by implicitly transforming data into high-dimensional feature spaces [Eric, 2019]
Background: Kernel Trick Cont.
• $K(x, y) = \langle \phi(x), \phi(y) \rangle$
• $K_{\text{sq}}\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\!, \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right) = (x_1 y_1 + x_2 y_2)^2 = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}^{\!T} \begin{pmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{pmatrix} = \phi^T\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) \phi\!\left(\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right)$ (verified numerically below)
• for a kernel function (kernel) to be valid:
  – positive semidefinite (PSD) Gram matrix
  – corresponds to a scalar product in some feature space
• empirically, non-PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005] → we do not enforce PSD
• where there is an inner product, there could be an application of the kernel trick:
  – import vector machine [Zhu and Hastie, 2002]
  – multilayer kernel machine [Cho and Saul, 2009]
  – gated softmax [Memisevic et al., 2010]
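The $K_{\text{sq}}$ identity above is easy to verify numerically (a minimal sketch):

```python
import numpy as np

def k_sq(x, y):
    # kernel computed directly on the 2-D inputs
    return (x @ y) ** 2

def phi(v):
    # explicit feature map: (v1^2, sqrt(2) v1 v2, v2^2)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x, y = np.array([1.5, -0.3]), np.array([0.7, 2.0])
print(np.isclose(k_sq(x, y), phi(x) @ phi(y)))   # -> True
```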
Background: Non-Euclidean Word Embedding
• Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]
(figure from [Athiwaratkun and Wilson, 2017])
Background: Non-Euclidean Word Embedding Cont.
• hyperbolic (Poincaré) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]
(figure from [Nickel and Kiela, 2017])
Background: Quick Recap
• contextual word classification
• softmax bottleneck
• breaking softmax bottleneck
• geometric explanation of softmax bottleneck
• kernel trick
• non-Euclidean word embedding
→ kernels in softmax
Agenda
• Background
• Methodology
• Experiments
• Conclusion
Methodology: Generalized Softmax
• model posterior:
  $\mathrm{p}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$
• mixture weight:
  $\pi_k = \frac{\exp(M_k^T h)}{\sum_{k'=1}^{K} \exp(M_{k'}^T h)}$
• nonlinearly transformed context:
  $f_k(h) = \tanh(Q_k^T h)$
• with trainable parameters: $W \in \mathbb{R}^{d \times V}$, $M \in \mathbb{R}^{d \times K}$ and $Q_k \in \mathbb{R}^{d \times d}$
(a code sketch follows)
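A minimal NumPy sketch of the generalized softmax; the kernel plugged in here is the RBF score $S_{\text{rbf}}$ from the kernel list later in the talk, and the sizes and random parameters are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def s_rbf(W, f, gamma=1.0):
    # S_rbf(W_v, f) = exp(-gamma * ||W_v - f||^2), computed for all v at once
    return np.exp(-gamma * ((W - f[:, None]) ** 2).sum(axis=0))

rng = np.random.default_rng(0)
V, d, K = 1000, 64, 4
W = rng.normal(0, 0.1, (d, V))             # shared word embeddings
M = rng.normal(0, 0.1, (d, K))             # mixture-weight projection
Q = rng.normal(0, 0.1, (K, d, d))          # per-component context projections
h = rng.normal(size=d)

pi = softmax(M.T @ h)                                # mixture weights pi_k
f = np.tanh(np.einsum('kij,j->ki', Q, h))            # transformed contexts f_k(h)
S = np.stack([s_rbf(W, f[k]) for k in range(K)])     # K x V kernel scores
p = pi @ softmax(S, axis=-1)                         # generalized softmax posterior
print(p.shape, p.sum())                              # -> (1000,) 1.0
```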
Methodology: Generalized Softmax Cont.
• from previous:
  $\mathrm{p}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$
• note:
  – replace inner product with kernels: $W_v^T h \rightarrow S_k(W_v, f_k(h))$
  – replace single softmax with a mixture: $1 \rightarrow \sum_{k=1}^{K} \pi_k$
  – replace context vector with transformed ones: $h \rightarrow f_k(h)$
  – shared word vectors due to memory restrictions
• motivations:
  – different kernels give different feature spaces
  – based on the context, the model chooses which feature space is suitable
  – for each feature space, the context vector could be different
  – ideally, for each feature space, the word vector could also be different
Methodology: Individual Kernels
$S_{\text{lin}}(W_v, h) = W_v^T h$
$S_{\text{log}}(W_v, h) = -\log(\|W_v - h\|_p + 1)$
$S_{\text{pow}}(W_v, h) = -\|W_v - h\|_p$
$S_{\text{pol}}(W_v, h) = (\alpha W_v^T h + c)^p$
$S_{\text{rbf}}(W_v, h) = \exp(-\gamma \|W_v - h\|^2)$
$S_{\text{wav}}(W_v, h) = \cos\left(\frac{\|W_v - h\|_2}{a}\right) \exp\left(-\frac{\|W_v - h\|_2}{b}\right)$
$S_{\text{ssg}}(W_v, h) = \log \int \mathcal{N}(\mu_{W_v}, \Sigma_{W_v})\, \mathcal{N}(\mu_h, \Sigma_h)$
$S_{\text{mog}}(W_v, h) = \log \sum_{i,j} \int \mathcal{N}(\mu_{i,W_v}, \Sigma_{i,W_v})\, \mathcal{N}(\mu_{j,h}, \Sigma_{j,h})$
$S_{\text{hpb}}(W_v, h) = -\operatorname{acosh}\left(1 + \frac{2\|W_v - h\|^2}{(1 - \|W_v\|^2)(1 - \|h\|^2)}\right)$
(NumPy versions of several of these follow)
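A few of these kernels as NumPy functions (a sketch; the hyperparameter values are illustrative, and the hyperbolic kernel assumes both vectors lie inside the unit ball):

```python
import numpy as np

def s_lin(w, h):
    return w @ h

def s_log(w, h, p=2):
    return -np.log(np.linalg.norm(w - h, ord=p) + 1)

def s_pol(w, h, alpha=1.0, c=1.0, p=2):
    return (alpha * (w @ h) + c) ** p

def s_wav(w, h, a=1.0, b=1.0):
    dist = np.linalg.norm(w - h)
    return np.cos(dist / a) * np.exp(-dist / b)

def s_hpb(w, h):
    # negative Poincare distance; requires ||w|| < 1 and ||h|| < 1
    num = 2 * np.sum((w - h) ** 2)
    den = (1 - np.sum(w ** 2)) * (1 - np.sum(h ** 2))
    return -np.arccosh(1 + num / den)

rng = np.random.default_rng(0)
w, h = 0.1 * rng.normal(size=8), 0.1 * rng.normal(size=8)  # inside the unit ball
for s in (s_lin, s_log, s_pol, s_wav, s_hpb):
    print(s.__name__, f"{s(w, h):.4f}")
```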