Token-level and Sequence-level Loss Smoothing for RNN Language Models
Maha Elbayad (1,2), Laurent Besacier (1), and Jakob Verbeek (2)
(1) LIG, (2) INRIA, Grenoble, France
ACL 2018, Melbourne, Australia
Language generation | Equivalence in the target space
• Ground-truth sequences lie in a union of low-dimensional subspaces where sequences convey the same message.
  ◮ France won the world cup for the second time.
  ◮ France captured its second world cup title.
• Some words in the vocabulary share the same meaning.
  ◮ Capture, conquer, win, gain, achieve, accomplish, ...
Contributions
Take into consideration the nature of the target language space with:
• A token-level smoothing for "robust" multi-class classification.
• A sequence-level smoothing to explore relevant alternative sequences.
Maximum likelihood estimation (MLE)
For a pair $(x, y)$, we model the conditional distribution:
$p_\theta(y|x) = \prod_{t=1}^{|y|} p_\theta(y_t \mid y_{<t}, x)$  (1)
Given the ground-truth target sequence $y^\star$:
$\ell_{MLE}(y^\star, x) = -\ln p_\theta(y^\star|x) = D_{KL}\big(\delta(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (2)
$= \sum_{t=1}^{|y^\star|} D_{KL}\big(\delta(y_t|y^\star_t) \,\|\, p_\theta(y_t \mid y^\star_{<t}, x)\big)$  (3)
Maximum likelihood estimation (MLE)
$\ell_{MLE}(y^\star, x) = -\ln p_\theta(y^\star|x) = D_{KL}\big(\delta(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (2)
$= \sum_{t=1}^{T} D_{KL}\big(\delta(y_t|y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big)$  (3)
Issues:
• Zero-one loss: all outputs $y \neq y^\star$ are treated equally.
• Discrepancy at the sentence level between the training objective (1-gram) and the evaluation metric (4-gram).
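To make the objective concrete, here is a minimal sketch (not the authors' code) of the per-token MLE loss; `log_probs` and `target_ids` are illustrative names:

```python
import numpy as np

def mle_loss(log_probs, target_ids):
    """Negative log-likelihood of the ground truth, i.e. the sum over t of
    KL(delta(. | y*_t) || p_theta(. | h_t)) up to a constant.
    log_probs: (T, V) array of log p_theta(. | h_t); target_ids: (T,) indices."""
    T = len(target_ids)
    return -log_probs[np.arange(T), target_ids].sum()
```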
Loss smoothing
Smooth the Dirac target $\delta(y^\star)$ (resp. $\delta(y^\star_t)$) into a reward distribution $r_\tau(y|y^\star)$ (resp. $r_\tau(y_t|y^\star_t)$):
$D_{KL}\big(\delta(y|y^\star) \,\|\, p_\theta(y|x)\big) \;\longrightarrow\; \ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (Norouzi et al., 2016)
$\sum_{t=1}^{T} D_{KL}\big(\delta(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big) \;\longrightarrow\; \ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$
Token-level smoothing
Loss smoothing | Token-level
$\ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$  (4)
• Uniform label smoothing over all words in the vocabulary (Szegedy et al., 2016):
  $r_\tau(y_t|y^\star_t) = \delta(y_t|y^\star_t) + \tau \cdot u(\mathcal{V})$
• We can leverage word co-occurrence statistics to build a non-uniform, "meaningful" distribution.
Loss smoothing | Token-level
$\ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$  (4)
Prerequisite: a word embedding $w$ (e.g. GloVe) in the target space and a distance $d$.
$r_\tau(y_t|y^\star_t) = \frac{1}{Z} \exp\Big(\frac{-d(w(y_t), w(y^\star_t))}{\tau}\Big)$,
with a temperature $\tau$ such that $r_\tau \to \delta$ as $\tau \to 0$, and $Z$ such that $\sum_{y_t \in \mathcal{V}} r_\tau(y_t|y^\star_t) = 1$.
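A sketch of this token-level reward, assuming an embedding matrix `E` of shape (V, dim) and the Euclidean distance (the slide leaves the choice of embedding and distance open):

```python
import numpy as np

def token_reward(E, target_id, tau):
    """r_tau(. | y*_t) from embedding distances.
    E: (V, dim) word-embedding matrix (e.g. GloVe); target_id: index of y*_t."""
    dists = np.linalg.norm(E - E[target_id], axis=1)  # d(w(y_t), w(y*_t))
    logits = -dists / tau
    logits -= logits.max()                            # numerical stability
    r = np.exp(logits)
    return r / r.sum()                                # divide by Z
```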
Loss smoothing | Token-level
[Figure: token-level reward distributions $r_\tau$ at $\tau = 0.12$ and $\tau = 0.70$]
Loss smoothing | Token-level
$\ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$  (4)
$= \sum_{t=1}^{T} \sum_{y_t \in \mathcal{V}} r_\tau(y_t|y^\star_t) \log\Big(\frac{r_\tau(y_t|y^\star_t)}{p_\theta(y_t|h_t)}\Big)$  (5)
We can compute the exact KL divergence for every target token; no approximation is needed.
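Since $r_\tau$ can be enumerated over the vocabulary, the loss in (5) is computable exactly; a minimal sketch reusing the hypothetical `token_reward` above:

```python
def token_level_loss(log_probs, target_ids, E, tau):
    """Exact sum over t of KL(r_tau(. | y*_t) || p_theta(. | h_t)), eq. (5).
    log_probs: (T, V) array of model log-probabilities."""
    loss = 0.0
    for t, tgt in enumerate(target_ids):
        r = token_reward(E, tgt, tau)      # strictly positive, full support
        loss += np.sum(r * (np.log(r) - log_probs[t]))
    return loss
```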
Sequence-level smoothing
Loss smoothing | Sequence-level
$\ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (6)
Prerequisite: a distance $d$ on the sequence space $\mathcal{V}^n$, $n \in \mathbb{N}$.
$r_\tau(y|y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$, with $Z$ such that $\sum_{y \in \mathcal{V}^n, n \in \mathbb{N}} r_\tau(y|y^\star) = 1$.
Possible (pseudo-)distances:
• Hamming
• Edit
• 1 − BLEU
• 1 − CIDEr
Loss smoothing | Sequence-level
Can we evaluate the partition function $Z$ for a given reward?
$r_\tau(y|y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$, $\quad Z = \sum_{y \in \mathcal{V}^n, n \in \mathbb{N}} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$
We can approximate $Z$ for the Hamming distance.
Loss smoothing | Sequence-level | Hamming distance
Assumption: consider only sequences of the same length as $y^\star$ (the reward is zero if $|y| \neq |y^\star|$).
We partition the set of sequences $\mathcal{V}^T$ w.r.t. their distance to the ground truth $y^\star$:
$S_d = \{ y \in \mathcal{V}^T \mid d(y, y^\star) = d \}$, $\quad \mathcal{V}^T = \cup_d S_d$, $\quad \forall d \neq d': S_d \cap S_{d'} = \emptyset$.
• The reward in each subset is constant.
• The cardinality of each subset is known.
$Z = \sum_d |S_d| \exp\Big(\frac{-d}{\tau}\Big)$
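For the Hamming distance, $|S_d| = \binom{T}{d}(|\mathcal{V}|-1)^d$: choose which $d$ positions differ, then one of the $|\mathcal{V}|-1$ other tokens at each. A sketch under that assumption:

```python
from math import comb, exp

def hamming_partition(T, V, tau):
    """Z = sum_d |S_d| * exp(-d / tau), with |S_d| = C(T, d) * (V - 1)**d
    sequences at Hamming distance exactly d from y*."""
    return sum(comb(T, d) * (V - 1)**d * exp(-d / tau) for d in range(T + 1))
```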
Loss smoothing | Sequence-level | Hamming distance
We can easily draw from $r_\tau$ with the Hamming distance:
1. Sample a distance $d$ from $\{0, \ldots, T\}$.
2. Pick $d$ positions to change among $\{1, \ldots, T\}$.
3. Sample substitutions from the vocabulary $\mathcal{V}$.
Monte Carlo estimation:
$\ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (6)
$= -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot|x)] + \text{cst}$  (7)
$\approx -\frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y^l|x), \quad y^l \sim r_\tau$  (8)
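A sketch of the three-step sampler; it draws $d$ with probability proportional to $|S_d|\exp(-d/\tau)$ and, as an assumption matching the partition count above, substitutes uniformly among the tokens other than the ground-truth one:

```python
import numpy as np
from math import comb

def sample_hamming(y_star, V, tau, rng=np.random):
    """Draw one sequence y ~ r_tau(. | y*) under the Hamming reward."""
    T = len(y_star)
    # Step 1: sample d with P(d) proportional to |S_d| * exp(-d / tau)
    weights = np.array([comb(T, d) * (V - 1)**d * np.exp(-d / tau)
                        for d in range(T + 1)])
    d = rng.choice(T + 1, p=weights / weights.sum())
    # Step 2: pick the d positions to change
    positions = rng.choice(T, size=d, replace=False)
    # Step 3: substitute uniformly among the V - 1 tokens != y*_t
    y = np.array(y_star, copy=True)
    for t in positions:
        sub = rng.randint(V - 1)
        y[t] = sub + (sub >= y_star[t])  # skip over the ground-truth token
    return y
```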
Loss smoothing | Sequence-level | Other distances
We cannot "easily" sample from more complicated rewards such as BLEU or CIDEr.
Importance sampling:
$\ell^{seq}_{RAML}(y^\star, x) = -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot|x)]$  (9)
$= -\mathbb{E}_q\Big[\frac{r_\tau}{q} \log p_\theta\Big]$  (10)
$\approx -\frac{1}{L} \sum_{l=1}^{L} \omega_l \log p_\theta(y^l|x), \quad y^l \sim q$  (11)
$\omega_l \approx \frac{r_\tau(y^l|y^\star)/q(y^l|y^\star)}{\sum_{k=1}^{L} r_\tau(y^k|y^\star)/q(y^k|y^\star)}$
We choose $q$ to be the reward distribution relative to the Hamming distance.
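A sketch of the self-normalized weights $\omega_l$; because they are normalized over the $L$ samples, the partition functions of $r_\tau$ and $q$ cancel, so unnormalized scores $\exp(-d/\tau)$ suffice. Here `target_dist` (e.g. $1-$BLEU) and `proposal_dist` (Hamming) are placeholder callables:

```python
import numpy as np

def importance_weights(samples, y_star, tau, target_dist, proposal_dist):
    """Self-normalized weights omega_l of eq. (11).
    target_dist / proposal_dist: callables returning d(y, y*) for the
    reward r_tau (e.g. 1 - BLEU) and the proposal q (Hamming)."""
    log_ratio = np.array([(proposal_dist(y, y_star) - target_dist(y, y_star)) / tau
                          for y in samples])   # log(r_tau / q) up to constants
    log_ratio -= log_ratio.max()               # numerical stability
    w = np.exp(log_ratio)
    return w / w.sum()
```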
Loss smoothing | Sequence-level | Support reduction
$\ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (6)
Can we reduce the support of $r_\tau$?
$r_\tau(y|y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$, $\quad Z = \sum_{y \in \mathcal{V}^T} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$
Reduce the support from $\mathcal{V}^{|y^\star|}$ to $\mathcal{V}_{sub}^{|y^\star|}$ where $\mathcal{V}_{sub} \subset \mathcal{V}$:
• $\mathcal{V}_{sub} = \mathcal{V}_{batch}$: tokens occurring in the SGD mini-batch.
• $\mathcal{V}_{sub} = \mathcal{V}_{refs}$: tokens occurring in the available references.
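Support reduction only changes where substitutions are drawn from; a sketch that restricts the sampler above to a reduced token set `V_sub` (an illustrative list of token ids from the mini-batch or the references, assumed to contain the tokens of $y^\star$):

```python
import numpy as np
from math import comb

def sample_hamming_reduced(y_star, V_sub, tau, rng=np.random):
    """Like sample_hamming, but substitutions come from the reduced
    support V_sub, so only |V_sub| - 1 alternatives exist per position."""
    T, K = len(y_star), len(V_sub)
    weights = np.array([comb(T, d) * (K - 1)**d * np.exp(-d / tau)
                        for d in range(T + 1)])
    d = rng.choice(T + 1, p=weights / weights.sum())
    y = np.array(y_star, copy=True)
    for t in rng.choice(T, size=d, replace=False):
        candidates = [v for v in V_sub if v != y_star[t]]
        y[t] = candidates[rng.randint(len(candidates))]
    return y
```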
Loss smoothing | Sequence-level | Lazy training
Both variants optimize $\ell^{seq}_{RAML}(y^\star, x) = -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot|x)] \approx -\frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y^l|x)$.
Default training: each $y^l$ is (1) forwarded in the RNN and (2) used as target, i.e. $\log p_\theta(y^l \mid y^l, x)$. Complexity: $O(2L\lambda)$.
Lazy training: each $y^l$ is (1) not forwarded in the RNN (only $y^\star$ is) and (2) used as target, i.e. $\log p_\theta(y^l \mid y^\star, x)$. Complexity: $O((L+1)\lambda)$.
$\lambda = |y| \cdot |\theta_{cell}|$, where $\theta_{cell}$ are the cell parameters.
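A rough sketch of the difference; `model.log_prob` and `model.step_log_probs` are hypothetical helpers, the point being that lazy training runs the RNN once on $y^\star$ and only re-indexes its per-step log-probabilities for each sample:

```python
import numpy as np

def default_loss(model, x, samples):
    """Each sample is forwarded and scored: log p(y^l_t | y^l_{<t}, x)."""
    return -np.mean([model.log_prob(x, inputs=y_l, targets=y_l)
                     for y_l in samples])

def lazy_loss(model, x, y_star, samples):
    """One forward pass on the ground truth; samples are scored against the
    resulting states: log p(y^l_t | y*_{<t}, x)."""
    log_probs = model.step_log_probs(x, inputs=y_star)   # (T, V)
    T = len(y_star)
    return -np.mean([log_probs[np.arange(T), y_l].sum() for y_l in samples])
```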
Experiments