Token-level and Sequence-level Loss Smoothing for RNN Language Models
Maha Elbayad (1,2), Laurent Besacier (1), and Jakob Verbeek (2)
(1) LIG, (2) INRIA, Grenoble, France
ACL 2018, Melbourne, Australia
Language generation | Equivalence in the target space
• Ground-truth sequences lie in a union of low-dimensional subspaces where sequences convey the same message.
  ◮ France won the world cup for the second time.
  ◮ France captured its second world cup title.
• Some words in the vocabulary share the same meaning.
  ◮ Capture, conquer, win, gain, achieve, accomplish, ...
Contributions
Take into consideration the nature of the target language space with:
• A token-level smoothing for "robust" multi-class classification.
• A sequence-level smoothing to explore relevant alternative sequences.
Maximum likelihood estimation (MLE)
For a pair $(x, y)$, we model the conditional distribution:
$p_\theta(y|x) = \prod_{t=1}^{|y|} p_\theta(y_t \mid y_{<t}, x)$  (1)
Given the ground-truth target sequence $y^\star$:
$\ell_{MLE}(y^\star, x) = -\ln p_\theta(y^\star|x) = D_{KL}\big(\delta(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (2)
$= \sum_{t=1}^{|y^\star|} D_{KL}\big(\delta(y_t|y^\star_t) \,\|\, p_\theta(y_t \mid y^\star_{<t}, x)\big)$  (3)
Maximum likelihood estimation (MLE)
$\ell_{MLE}(y^\star, x) = -\ln p_\theta(y^\star|x) = D_{KL}\big(\delta(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (2)
$= \sum_{t=1}^{T} D_{KL}\big(\delta(y_t|y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big)$  (3)
Issues:
• Zero-one loss: all outputs $y \neq y^\star$ are treated equally.
• Discrepancy at the sentence level between the training objective (1-gram) and the evaluation metric (4-gram).
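To make the objective concrete, here is a minimal sketch (not the authors' code) of the per-token MLE loss; `log_probs` and `target_ids` are illustrative names:

```python
import numpy as np

def mle_loss(log_probs, target_ids):
    """Negative log-likelihood of the ground truth, i.e. the sum over t of
    KL(delta(. | y*_t) || p_theta(. | h_t)) up to a constant.
    log_probs: (T, V) array of log p_theta(. | h_t); target_ids: (T,) indices."""
    T = len(target_ids)
    return -log_probs[np.arange(T), target_ids].sum()
```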
Loss smoothing
Smooth the Dirac target $\delta(y^\star)$ (resp. $\delta(y^\star_t)$) into a reward distribution $r_\tau(y|y^\star)$ (resp. $r_\tau(y_t|y^\star_t)$):
$D_{KL}\big(\delta(y|y^\star) \,\|\, p_\theta(y|x)\big) \;\longrightarrow\; \ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (Norouzi et al., 2016)
$\sum_{t=1}^{T} D_{KL}\big(\delta(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big) \;\longrightarrow\; \ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$
Token-level smoothing
Loss smoothing | Token-level
$\ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$  (4)
• Uniform label smoothing over all words in the vocabulary (Szegedy et al., 2016):
  $r_\tau(y_t|y^\star_t) = \delta(y_t|y^\star_t) + \tau \cdot u(\mathcal{V})$
• We can leverage word co-occurrence statistics to build a non-uniform, "meaningful" distribution.
Loss smoothing | Token-level
$\ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$  (4)
Prerequisite: a word embedding $w$ (e.g. GloVe) in the target space and a distance $d$.
$r_\tau(y_t|y^\star_t) = \frac{1}{Z} \exp\Big(\frac{-d(w(y_t), w(y^\star_t))}{\tau}\Big)$,
with a temperature $\tau$ such that $r_\tau \to \delta$ as $\tau \to 0$, and $Z$ such that $\sum_{y_t \in \mathcal{V}} r_\tau(y_t|y^\star_t) = 1$.
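A sketch of this token-level reward, assuming an embedding matrix `E` of shape (V, dim) and the Euclidean distance (the slide leaves the choice of embedding and distance open):

```python
import numpy as np

def token_reward(E, target_id, tau):
    """r_tau(. | y*_t) from embedding distances.
    E: (V, dim) word-embedding matrix (e.g. GloVe); target_id: index of y*_t."""
    dists = np.linalg.norm(E - E[target_id], axis=1)  # d(w(y_t), w(y*_t))
    logits = -dists / tau
    logits -= logits.max()                            # numerical stability
    r = np.exp(logits)
    return r / r.sum()                                # divide by Z
```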
Loss smoothing | Token-level
[Figure: token-level reward distributions $r_\tau$ at $\tau = 0.12$ and $\tau = 0.70$]
Loss smoothing | Token-level
$\ell^{tok}_{RAML}(y^\star, x) = \sum_{t=1}^{T} D_{KL}\big(r_\tau(y_t|y^\star_t) \,\|\, p_\theta(y_t|h_t)\big)$  (4)
$= \sum_{t=1}^{T} \sum_{y_t \in \mathcal{V}} r_\tau(y_t|y^\star_t) \log\Big(\frac{r_\tau(y_t|y^\star_t)}{p_\theta(y_t|h_t)}\Big)$  (5)
We can compute the exact KL divergence for every target token; no approximation is needed.
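Since $r_\tau$ can be enumerated over the vocabulary, the loss in (5) is computable exactly; a minimal sketch reusing the hypothetical `token_reward` above:

```python
def token_level_loss(log_probs, target_ids, E, tau):
    """Exact sum over t of KL(r_tau(. | y*_t) || p_theta(. | h_t)), eq. (5).
    log_probs: (T, V) array of model log-probabilities."""
    loss = 0.0
    for t, tgt in enumerate(target_ids):
        r = token_reward(E, tgt, tau)      # strictly positive, full support
        loss += np.sum(r * (np.log(r) - log_probs[t]))
    return loss
```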
Sequence-level smoothing
Loss smoothing | Sequence-level
$\ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (6)
Prerequisite: a distance $d$ on the sequence space $\mathcal{V}^n$, $n \in \mathbb{N}$.
$r_\tau(y|y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$, with $Z$ such that $\sum_{y \in \mathcal{V}^n, n \in \mathbb{N}} r_\tau(y|y^\star) = 1$.
Possible (pseudo-)distances:
• Hamming
• Edit
• 1 − BLEU
• 1 − CIDEr
Loss smoothing | Sequence-level
Can we evaluate the partition function $Z$ for a given reward?
$r_\tau(y|y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$, $\quad Z = \sum_{y \in \mathcal{V}^n, n \in \mathbb{N}} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$
We can approximate $Z$ for the Hamming distance.
Loss smoothing | Sequence-level | Hamming distance
Assumption: consider only sequences of the same length as $y^\star$ (the reward is zero if $|y| \neq |y^\star|$).
We partition the set of sequences $\mathcal{V}^T$ w.r.t. their distance to the ground truth $y^\star$:
$S_d = \{ y \in \mathcal{V}^T \mid d(y, y^\star) = d \}$, $\quad \mathcal{V}^T = \cup_d S_d$, $\quad \forall d \neq d': S_d \cap S_{d'} = \emptyset$.
• The reward in each subset is constant.
• The cardinality of each subset is known.
$Z = \sum_d |S_d| \exp\Big(\frac{-d}{\tau}\Big)$
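For the Hamming distance, $|S_d| = \binom{T}{d}(|\mathcal{V}|-1)^d$: choose which $d$ positions differ, then one of the $|\mathcal{V}|-1$ other tokens at each. A sketch under that assumption:

```python
from math import comb, exp

def hamming_partition(T, V, tau):
    """Z = sum_d |S_d| * exp(-d / tau), with |S_d| = C(T, d) * (V - 1)**d
    sequences at Hamming distance exactly d from y*."""
    return sum(comb(T, d) * (V - 1)**d * exp(-d / tau) for d in range(T + 1))
```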
Loss smoothing | Sequence-level | Hamming distance
We can easily draw from $r_\tau$ with the Hamming distance:
1. Sample a distance $d$ from $\{0, \ldots, T\}$.
2. Pick $d$ positions to change among $\{1, \ldots, T\}$.
3. Sample substitutions from the vocabulary $\mathcal{V}$.
Monte Carlo estimation:
$\ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (6)
$= -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot|x)] + \text{cst}$  (7)
$\approx -\frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y^l|x), \quad y^l \sim r_\tau$  (8)
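A sketch of the three-step sampler; it draws $d$ with probability proportional to $|S_d|\exp(-d/\tau)$ and, as an assumption matching the partition count above, substitutes uniformly among the tokens other than the ground-truth one:

```python
import numpy as np
from math import comb

def sample_hamming(y_star, V, tau, rng=np.random):
    """Draw one sequence y ~ r_tau(. | y*) under the Hamming reward."""
    T = len(y_star)
    # Step 1: sample d with P(d) proportional to |S_d| * exp(-d / tau)
    weights = np.array([comb(T, d) * (V - 1)**d * np.exp(-d / tau)
                        for d in range(T + 1)])
    d = rng.choice(T + 1, p=weights / weights.sum())
    # Step 2: pick the d positions to change
    positions = rng.choice(T, size=d, replace=False)
    # Step 3: substitute uniformly among the V - 1 tokens != y*_t
    y = np.array(y_star, copy=True)
    for t in positions:
        sub = rng.randint(V - 1)
        y[t] = sub + (sub >= y_star[t])  # skip over the ground-truth token
    return y
```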
Loss smoothing | Sequence-level | Other distances
We cannot "easily" sample from more complicated rewards such as BLEU or CIDEr.
Importance sampling:
$\ell^{seq}_{RAML}(y^\star, x) = -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot|x)]$  (9)
$= -\mathbb{E}_q\Big[\frac{r_\tau}{q} \log p_\theta\Big]$  (10)
$\approx -\frac{1}{L} \sum_{l=1}^{L} \omega_l \log p_\theta(y^l|x), \quad y^l \sim q$  (11)
$\omega_l \approx \frac{r_\tau(y^l|y^\star)/q(y^l|y^\star)}{\sum_{k=1}^{L} r_\tau(y^k|y^\star)/q(y^k|y^\star)}$
We choose $q$ to be the reward distribution relative to the Hamming distance.
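A sketch of the self-normalized weights $\omega_l$; because they are normalized over the $L$ samples, the partition functions of $r_\tau$ and $q$ cancel, so unnormalized scores $\exp(-d/\tau)$ suffice. Here `target_dist` (e.g. $1-$BLEU) and `proposal_dist` (Hamming) are placeholder callables:

```python
import numpy as np

def importance_weights(samples, y_star, tau, target_dist, proposal_dist):
    """Self-normalized weights omega_l of eq. (11).
    target_dist / proposal_dist: callables returning d(y, y*) for the
    reward r_tau (e.g. 1 - BLEU) and the proposal q (Hamming)."""
    log_ratio = np.array([(proposal_dist(y, y_star) - target_dist(y, y_star)) / tau
                          for y in samples])   # log(r_tau / q) up to constants
    log_ratio -= log_ratio.max()               # numerical stability
    w = np.exp(log_ratio)
    return w / w.sum()
```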
Loss smoothing | Sequence-level | Support reduction
$\ell^{seq}_{RAML}(y^\star, x) = D_{KL}\big(r_\tau(y|y^\star) \,\|\, p_\theta(y|x)\big)$  (6)
Can we reduce the support of $r_\tau$?
$r_\tau(y|y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$, $\quad Z = \sum_{y \in \mathcal{V}^T} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$
Reduce the support from $\mathcal{V}^{|y^\star|}$ to $\mathcal{V}_{sub}^{|y^\star|}$ where $\mathcal{V}_{sub} \subset \mathcal{V}$:
• $\mathcal{V}_{sub} = \mathcal{V}_{batch}$: tokens occurring in the SGD mini-batch.
• $\mathcal{V}_{sub} = \mathcal{V}_{refs}$: tokens occurring in the available references.
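Support reduction only changes where substitutions are drawn from; a sketch that restricts the sampler above to a reduced token set `V_sub` (an illustrative list of token ids from the mini-batch or the references, assumed to contain the tokens of $y^\star$):

```python
import numpy as np
from math import comb

def sample_hamming_reduced(y_star, V_sub, tau, rng=np.random):
    """Like sample_hamming, but substitutions come from the reduced
    support V_sub, so only |V_sub| - 1 alternatives exist per position."""
    T, K = len(y_star), len(V_sub)
    weights = np.array([comb(T, d) * (K - 1)**d * np.exp(-d / tau)
                        for d in range(T + 1)])
    d = rng.choice(T + 1, p=weights / weights.sum())
    y = np.array(y_star, copy=True)
    for t in rng.choice(T, size=d, replace=False):
        candidates = [v for v in V_sub if v != y_star[t]]
        y[t] = candidates[rng.randint(len(candidates))]
    return y
```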
Loss smoothing | Sequence-level | Lazy training
Both variants optimize $\ell^{seq}_{RAML}(y^\star, x) = -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot|x)] \approx -\frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y^l|x)$.
Default training: each $y^l$ is (1) forwarded in the RNN and (2) used as target, i.e. $\log p_\theta(y^l \mid y^l, x)$. Complexity: $O(2L\lambda)$.
Lazy training: each $y^l$ is (1) not forwarded in the RNN (only $y^\star$ is) and (2) used as target, i.e. $\log p_\theta(y^l \mid y^\star, x)$. Complexity: $O((L+1)\lambda)$.
$\lambda = |y| \cdot |\theta_{cell}|$, where $\theta_{cell}$ are the cell parameters.
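A rough sketch of the difference; `model.log_prob` and `model.step_log_probs` are hypothetical helpers, the point being that lazy training runs the RNN once on $y^\star$ and only re-indexes its per-step log-probabilities for each sample:

```python
import numpy as np

def default_loss(model, x, samples):
    """Each sample is forwarded and scored: log p(y^l_t | y^l_{<t}, x)."""
    return -np.mean([model.log_prob(x, inputs=y_l, targets=y_l)
                     for y_l in samples])

def lazy_loss(model, x, y_star, samples):
    """One forward pass on the ground truth; samples are scored against the
    resulting states: log p(y^l_t | y*_{<t}, x)."""
    log_probs = model.step_log_probs(x, inputs=y_star)   # (T, V)
    T = len(y_star)
    return -np.mean([log_probs[np.arange(T), y_l].sum() for y_l in samples])
```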
Experiments