
slide-1
SLIDE 1

Sequence-to-Sequence Learning as Beam-Search Optimization

Sam Wiseman and Alexander M. Rush

slide-2
SLIDE 2

Seq2Seq as a General-purpose NLP/Text Generation Tool

Machine Translation [Luong et al. 2015]
Question Answering
Conversation
Parsing [Vinyals et al. 2015]
Sentence Compression [Filippova et al. 2015]
Summarization
Caption Generation
Video-to-Text
Grammar Correction

slide-3
SLIDE 3

Room for Improvement?

Despite its tremendous success, there are some potential issues with standard Seq2Seq [Ranzato et al. 2016; Bengio et al. 2015]:

(1) Train/test mismatch
(2) Seq2Seq models next-words, rather than whole sequences

Goal of the talk: describe a simple variant of Seq2Seq, and a corresponding beam-search training scheme, that addresses these issues.

slide-4
SLIDE 4

Review: Sequence-to-sequence (Seq2Seq) Models

Encoder RNN (red) encodes the source into a representation x
Decoder RNN (blue) generates the translation word-by-word

slide-5
SLIDE 5

Review: Seq2Seq Generation Details

Decoder hidden states $h_1, h_2, h_3, \dots$, with $h_t = \mathrm{RNN}(w_t, h_{t-1})$, computed over the generated words $w_1, w_2, w_3, \dots$

Probability of generating the $t$'th word: $p(w_t \mid w_1, \dots, w_{t-1}, x;\, \theta) = \mathrm{softmax}(W_{\mathrm{out}}\, h_{t-1} + b_{\mathrm{out}})$
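For concreteness, a minimal sketch of one decoder step with a vanilla RNN cell (not the paper's LSTM; the weight names W_h, W_x, b, W_out, b_out and the toy shapes are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(w_emb, h_prev, W_h, W_x, b, W_out, b_out):
    """One decoder step: update the hidden state from the embedding of the most
    recently generated word, then score the whole vocabulary for the next word."""
    h = np.tanh(W_h @ h_prev + W_x @ w_emb + b)   # h_t = RNN(w_t, h_{t-1}), vanilla RNN cell
    p_next = softmax(W_out @ h + b_out)           # distribution over the next word
    return h, p_next

# Toy shapes: hidden size 4, embedding size 3, vocabulary size 5.
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
W_out, b_out = rng.normal(size=(5, 4)), np.zeros(5)
h, p = decoder_step(rng.normal(size=3), np.zeros(4), W_h, W_x, b, W_out, b_out)
print(p.sum())  # 1.0
```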

slide-6
SLIDE 6

Review: Train and Test

Train Objective: Given source-target pairs $(x, y_{1:T})$, minimize the NLL of each word independently, conditioned on the gold history $y_{1:t-1}$:

$\mathrm{NLL}(\theta) = -\sum_t \ln p(w_t = y_t \mid y_{1:t-1}, x;\, \theta)$

Test Objective: Structured prediction:

$\hat{y}_{1:T} = \arg\max_{w_{1:T}} \sum_t \ln p(w_t \mid w_{1:t-1}, x;\, \theta)$

Typical to approximate the arg max with beam search
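As a reference point, a small sketch of the word-level NLL objective under teacher forcing, with made-up toy numbers; the test-time arg max over whole sequences $w_{1:T}$ is what beam search approximates:

```python
import numpy as np

def sequence_nll(gold_ids, step_probs):
    """Word-level training loss: sum of -log p(y_t | gold history).
    step_probs[t] is the model's distribution over the vocabulary at step t,
    computed while feeding the *gold* prefix y_{1:t-1} (teacher forcing)."""
    return -sum(np.log(step_probs[t][y_t]) for t, y_t in enumerate(gold_ids))

# Toy example: vocabulary of size 4, gold sequence [2, 0, 3].
probs = [np.array([0.1, 0.2, 0.6, 0.1]),
         np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.2, 0.2, 0.1, 0.5])]
print(sequence_nll([2, 0, 3], probs))  # -(ln 0.6 + ln 0.7 + ln 0.5) ≈ 1.56
```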

slide-7
SLIDE 7

Review: Beam Search at Test Time (K = 3)

Beam after t = 1 (figure): "a", "the", "red"

For t = 1, . . . , T:
  For all k and for all possible output words w:
    $s(w_t = w,\ \hat{y}^{(k)}_{1:t-1}) \leftarrow \ln p(\hat{y}^{(k)}_{1:t-1} \mid x) + \ln p(w_t = w \mid \hat{y}^{(k)}_{1:t-1}, x)$
  Update beam:
    $\hat{y}^{(1:K)}_{1:t} \leftarrow \operatorname{K\text{-}argmax}_{w_{1:t}}\ s(w_t, \hat{y}^{(k)}_{1:t-1})$
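To make the update above concrete, here is a minimal beam-search sketch over a toy scoring function; the toy_log_p model and the vocabulary are invented for illustration and ignore the prefix and EOS handling:

```python
import math

def beam_search(log_p_next, vocab, T, K=3):
    """Generic beam search: log_p_next(prefix, w) returns ln p(w | prefix, x).
    Keeps the K highest-scoring prefixes at every step."""
    beam = [((), 0.0)]                      # (prefix, cumulative log-probability)
    for _ in range(T):
        candidates = []
        for prefix, score in beam:
            for w in vocab:
                candidates.append((prefix + (w,), score + log_p_next(prefix, w)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beam

# Toy model: a fixed log-softmax over word scores, just to exercise the search.
vocab = ["a", "dog", "red", "the"]
def toy_log_p(prefix, w):
    scores = {v: float(i) for i, v in enumerate(sorted(vocab))}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[w] - log_z

print(beam_search(toy_log_p, vocab, T=3, K=3))
```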

slide-13
SLIDE 13

Review: Beam Search at Test Time (K = 3)

Beam after t = 2 (figure): "a red", "the dog", "red blue"

(Scored and extended with the same update rule as in Slide 7.)

slide-14
SLIDE 14

Review: Beam Search at Test Time (K = 3)

Beam after t = 3 (figure): "a red dog", "the dog dog", "red blue cat"

(Same update rule as in Slide 7.)

slide-15
SLIDE 15

Review: Beam Search at Test Time (K = 3)

Beam after t = 4 (figure): "a red dog smells", "the dog dog barks", "red blue cat walks"

(Same update rule as in Slide 7.)

slide-16
SLIDE 16

Review: Beam Search at Test Time (K = 3)

Beam after t = 5 (figure): "a red dog smells home", "the dog dog barks quickly", "red blue cat walks straight"

(Same update rule as in Slide 7.)

slide-17
SLIDE 17

Review: Beam Search at Test Time (K = 3)

Beam after t = 6 (figure): "a red dog smells home today", "the dog dog barks quickly Friday", "red blue cat walks straight now"

(Same update rule as in Slide 7.)

slide-18
SLIDE 18

Seq2Seq Issues Revisited

Issue #1: Train/Test Mismatch (cf. Ranzato et al. [2016])

$\mathrm{NLL}(\theta) = -\sum_t \ln p(w_t = y_t \mid y_{1:t-1}, x;\, \theta)$

(a) Training conditions on the true history ("Exposure Bias")
(b) Train with word-level NLL, but evaluate with BLEU-like metrics

Idea #1: Train with beam search
Use a loss that incorporates (sub)sequence-level costs

slide-22
SLIDE 22

Idea #1: Train with Beam Search

Replace NLL with a loss that penalizes search error:

$L(\theta) = \sum_t \Delta(\hat{y}^{(K)}_{1:t}) \left[ 1 - s(y_t, y_{1:t-1}) + s(\hat{y}^{(K)}_t, \hat{y}^{(K)}_{1:t-1}) \right]$

$y_{1:t}$ is the gold prefix; $\hat{y}^{(K)}_{1:t}$ is the K'th prefix on the beam
$s(\hat{y}^{(k)}_t, \hat{y}^{(k)}_{1:t-1})$ is the score of the history $(\hat{y}^{(k)}_t, \hat{y}^{(k)}_{1:t-1})$
$\Delta(\hat{y}^{(K)}_{1:t})$ allows us to scale the loss by the badness of predicting $\hat{y}^{(K)}_{1:t}$
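A hedged sketch of the per-step term of this loss; the explicit max(0, ...) reflects that the term only fires on a margin violation (as discussed on the following slides), and the function and argument names are mine:

```python
def bso_step_loss(score_gold, score_kth, delta=1.0, margin=1.0):
    """Margin loss for one time step, following the slide's L(theta):
    penalize when the gold prefix does not beat the K'th beam prefix by `margin`.
    score_gold = s(y_t, y_{1:t-1}); score_kth = s(yhat^{(K)}_t, yhat^{(K)}_{1:t-1});
    delta is the (sub)sequence-level cost Delta(yhat^{(K)}_{1:t})."""
    return delta * max(0.0, margin - score_gold + score_kth)

# Example: gold scores -1.2, the K'th beam prefix scores -1.0 -> margin is violated.
print(bso_step_loss(score_gold=-1.2, score_kth=-1.0))   # 1.2
print(bso_step_loss(score_gold=-0.5, score_kth=-2.0))   # 0.0 (no violation)
```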

slide-26
SLIDE 26

Seq2Seq Issues Revisited

Issue #2: Seq2Seq models next-word probabilities:

$s(w_t = w,\ \hat{y}^{(k)}_{1:t-1}) \leftarrow \ln p(\hat{y}^{(k)}_{1:t-1} \mid x) + \ln p(w_t = w \mid \hat{y}^{(k)}_{1:t-1}, x)$

(a) The sequence score is a sum of locally normalized word-scores, which gives rise to "Label Bias" [Lafferty et al. 2001]
(b) What if we want to train with sequence-level constraints?

Idea #2: Don't locally normalize

slide-30
SLIDE 30

Idea #2: Don’t locally normalize

Beam decoder states (figure): $h^{(k)}_1, h^{(k)}_2, h^{(k)}_3 = \mathrm{RNN}(y^{(k)}_3, h^{(k)}_2)$ over beam words $y^{(k)}_1, y^{(k)}_2, y^{(k)}_3$

$s(w, \hat{y}^{(k)}_{1:t-1}) = \ln p(\hat{y}^{(k)}_{1:t-1} \mid x) + \ln \mathrm{softmax}(W_{\mathrm{out}}\, h^{(k)}_{t-1} + b_{\mathrm{out}})$

slide-31
SLIDE 31

Idea #2: Don’t locally normalize

Beam decoder states (figure): $h^{(k)}_1, h^{(k)}_2, h^{(k)}_3 = \mathrm{RNN}(y^{(k)}_3, h^{(k)}_2)$ over beam words $y^{(k)}_1, y^{(k)}_2, y^{(k)}_3$

$s(w, \hat{y}^{(k)}_{1:t-1}) = \ln p(\hat{y}^{(k)}_{1:t-1} \mid x) + \ln \mathrm{softmax}(W_{\mathrm{out}}\, h^{(k)}_{t-1} + b_{\mathrm{out}})$
$\quad\Rightarrow\quad s(w, \hat{y}^{(k)}_{1:t-1}) = W_{\mathrm{out}}\, h^{(k)}_{t-1} + b_{\mathrm{out}}$

slide-32
SLIDE 32

Idea #2: Don’t locally normalize

Beam decoder states (figure): $h^{(k)}_1, h^{(k)}_2, h^{(k)}_3 = \mathrm{RNN}(y^{(k)}_3, h^{(k)}_2)$ over beam words $y^{(k)}_1, y^{(k)}_2, y^{(k)}_3$

$s(w, \hat{y}^{(k)}_{1:t-1}) = \ln p(\hat{y}^{(k)}_{1:t-1} \mid x) + \ln \mathrm{softmax}(W_{\mathrm{out}}\, h^{(k)}_{t-1} + b_{\mathrm{out}})$
$\quad\Rightarrow\quad s(w, \hat{y}^{(k)}_{1:t-1}) = W_{\mathrm{out}}\, h^{(k)}_{t-1} + b_{\mathrm{out}}$

Can set $s(w, \hat{y}^{(k)}_{1:t-1}) = -\infty$ if $(w, \hat{y}^{(k)}_{1:t-1})$ violates a hard constraint
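A small sketch of this unnormalized, constraint-aware scoring (the shapes and the violates_constraint predicate are illustrative assumptions, e.g. a word-ordering constraint that forbids reusing an already-placed word):

```python
import numpy as np

def unnormalized_scores(h_prev, W_out, b_out, violates_constraint=None):
    """Score every candidate next word with the raw affine output (no softmax),
    and mask words that would violate a hard constraint with -inf.
    violates_constraint(w) is a caller-supplied predicate over word ids."""
    scores = W_out @ h_prev + b_out            # s(w, history) for all w at once
    if violates_constraint is not None:
        for w in range(len(scores)):
            if violates_constraint(w):
                scores[w] = -np.inf            # hard constraint: never selected by the beam
    return scores

# Toy usage: vocabulary of 5 words, forbid word id 3 (e.g., a word already used).
rng = np.random.default_rng(0)
W_out, b_out, h = rng.normal(size=(5, 4)), np.zeros(5), rng.normal(size=4)
print(unnormalized_scores(h, W_out, b_out, violates_constraint=lambda w: w == 3))
```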

slide-33
SLIDE 33

Computing Gradients of the Loss (K = 3)

Beam after t = 1 (figure): "a", "the", "red"

$L(\theta) = \sum_t \Delta(\hat{y}^{(K)}_{1:t}) \left[ 1 - s(y_t, y_{1:t-1}) + s(\hat{y}^{(K)}_t, \hat{y}^{(K)}_{1:t-1}) \right]$

Color gold: target sequence $y$
Color gray: violating sequence $\hat{y}^{(K)}$

slide-34
SLIDE 34

Computing Gradients of the Loss (K = 3)

Beam after t = 2 (figure): "a red", "the dog", "red blue"

(Same loss as in Slide 33; gold = target sequence, gray = violating sequence.)

slide-35
SLIDE 35

Computing Gradients of the Loss (K = 3)

Beam after t = 3 (figure): "a red dog", "the dog dog", "red blue cat"

(Same loss as in Slide 33; gold = target sequence, gray = violating sequence.)

slide-36
SLIDE 36

Computing Gradients of the Loss (K = 3)

Beam after t = 4 (figure): "a red dog smells", "the dog dog barks", "red blue cat barks", with the gold continuation "runs" shown off the beam

(Same loss as in Slide 33; gold = target sequence, gray = violating sequence.)

slide-37
SLIDE 37

Computing Gradients of the Loss (K = 3)

Beam after t = 4 (figure): "a red dog smells", "the dog dog barks", "red blue cat barks", with the gold continuation "runs" shown off the beam

$L(\theta) = \sum_t \Delta(\hat{y}^{(K)}_{1:t}) \left[ 1 - s(y_t, y_{1:t-1}) + s(\hat{y}^{(K)}_t, \hat{y}^{(K)}_{1:t-1}) \right]$

Need to BPTT for both $y_{1:t}$ and $\hat{y}^{(K)}_{1:t}$, which is $O(T)$
Worst case: a violation at each $t$ gives an $O(T^2)$ backward pass
Idea: use the LaSO [Daumé III and Marcu 2005] beam update

slide-40
SLIDE 40

Computing Gradients of the Loss (K = 3)

Beam after t = 5 (figure): "a red dog smells home", "the dog dog barks quickly", "red blue cat barks straight", with the gold continuation "runs" shown off the beam

$L(\theta) = \sum_t \Delta(\hat{y}^{(K)}_{1:t}) \left[ 1 - s(y_t, y_{1:t-1}) + s(\hat{y}^{(K)}_t, \hat{y}^{(K)}_{1:t-1}) \right]$

LaSO [Daumé III and Marcu 2005]:
  If there is no margin violation at $t - 1$, update the beam as usual
  Otherwise, update the beam with sequences prefixed by the gold prefix $y_{1:t-1}$
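A schematic sketch of this LaSO-style update, not the authors' implementation; extend and score_prefix stand in for the model's successor function and scoring function:

```python
def laso_beam_update(beam, gold_prefix, score_prefix, extend, K=3, margin=1.0):
    """One training-time beam update in the LaSO style: if the gold prefix has
    fallen off the beam by `margin`, restart the beam from the gold prefix;
    otherwise extend the current beam as usual.
    extend(prefixes, K) returns the K best one-word extensions of `prefixes`;
    score_prefix(p) scores a prefix."""
    kth_score = min(score_prefix(p) for p in beam)
    violation = score_prefix(gold_prefix) < kth_score + margin
    if violation:
        return extend([gold_prefix], K), True      # reset: successors of the gold prefix only
    return extend(beam, K), False                  # no violation: ordinary beam-search step

# Toy usage: prefixes scored by positional overlap with a gold sequence.
gold = ("a", "red", "dog", "runs")
def score_prefix(p):
    return float(sum(a == b for a, b in zip(p, gold)))
def extend(prefixes, K):
    cands = [p + (w,) for p in prefixes for w in ("a", "red", "dog", "runs", "barks")]
    return sorted(cands, key=score_prefix, reverse=True)[:K]

beam = [("a", "red"), ("the", "dog"), ("red", "blue")]
print(laso_beam_update(beam, gold_prefix=("a", "red"), score_prefix=score_prefix, extend=extend))
```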

slide-41
SLIDE 41

Computing Gradients of the Loss (K = 3)

Beam after t = 6 (figure): "a red dog smells home today", "the dog dog barks quickly Friday", "red blue cat barks straight now", with the gold continuation "runs today" shown off the beam

$L(\theta) = \sum_t \Delta(\hat{y}^{(K)}_{1:t}) \left[ 1 - s(y_t, y_{1:t-1}) + s(\hat{y}^{(K)}_t, \hat{y}^{(K)}_{1:t-1}) \right]$

LaSO [Daumé III and Marcu 2005]:
  If there is no margin violation at $t - 1$, update the beam as usual
  Otherwise, update the beam with sequences prefixed by the gold prefix $y_{1:t-1}$

slide-42
SLIDE 42

Backpropagation over Structure

(Figure: the search tree built from the beams above, with the gold sequence "a red dog runs quickly today" highlighted against the violating beam sequences)

Margin gradients are sparse: only violating sequences get updates.
Backprop requires only about 2x the time of standard methods.

slide-43
SLIDE 43

(Recent) Related Work and Discussion

Recent approaches to Exposure Bias and Label Bias:

Data as Demonstrator [Venkatraman et al. 2015], Scheduled Sampling [Bengio et al. 2015]
Globally Normalized Transition-Based Networks [Andor et al. 2016]

RL-based approaches:

MIXER [Ranzato et al. 2016]
Actor-Critic [Bahdanau et al. 2016]

Training with beam search attempts to offer similar benefits
It uses the fact that we typically have gold prefixes in supervised text generation to avoid RL

slide-44
SLIDE 44

Experiments

Experiments run on three Seq2Seq baseline tasks: Word Ordering, Dependency Parsing, Machine Translation

We compare with Yoon Kim's implementation [1] of the Seq2Seq architecture of Luong et al. [2015]
Uses LSTM encoders and decoders, attention, and input feeding
All models trained with Adagrad [Duchi et al. 2011]
Pre-trained with NLL; K increased gradually during training
"BSO" uses unconstrained search; "ConBSO" uses constraints

[1] https://github.com/harvardnlp/seq2seq-attn
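A hedged sketch of that curriculum; the epoch counts and beam-size schedule below are illustrative assumptions, not the paper's exact settings:

```python
def training_schedule(total_epochs=16, nll_epochs=4, k_final=6):
    """Illustrative curriculum: pre-train with word-level NLL, then switch to
    beam-search optimization while gradually growing the training beam size."""
    schedule = []
    for epoch in range(1, total_epochs + 1):
        if epoch <= nll_epochs:
            schedule.append((epoch, "NLL", 1))
        else:
            # grow the beam by one slot per epoch until reaching k_final
            k = min(2 + (epoch - nll_epochs - 1), k_final)
            schedule.append((epoch, "BSO", k))
    return schedule

for epoch, objective, k in training_schedule():
    print(f"epoch {epoch:2d}: objective={objective}, beam size K_tr={k}")
```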

slide-45
SLIDE 45

Word Ordering Experiments

Word Ordering (BLEU)   K_te = 1   K_te = 5   K_te = 10
Seq2Seq                25.2       29.8       31.0
BSO                    28.0       33.2       34.3
ConBSO                 28.6       34.3       34.5

Map a shuffled sentence to the correctly ordered sentence
Same setup as Liu et al. [2015]
BSO models trained with a beam of size 6

slide-48
SLIDE 48

Dependency Parsing Experiments

Source: Ms. Haag plays Elianti .
Target: Ms. Haag @L_NN plays @L_NSUBJ Elianti @R_DOBJ . @R_PUNCT

Dependency Parsing (UAS/LAS)   K_te = 1       K_te = 5       K_te = 10
Seq2Seq                        87.33/82.26    88.53/84.16    88.66/84.33
BSO                            86.91/82.11    91.00/87.18    91.17/87.41
ConBSO                         85.11/79.32    91.25/86.92    91.57/87.26

BSO models trained with a beam of size 6
Same setup and evaluation as Chen and Manning [2014]
Certainly not state of the art, but reasonable for a word-only, left-to-right model

slide-49
SLIDE 49

Machine Translation: Impact of Non-0/1 ∆

Machine Translation (BLEU)                                                                K_te = 1   K_te = 5   K_te = 10
$\Delta(\hat{y}^{(k)}_{1:t}) = \mathbb{1}\{\text{margin violation}\}$                     25.73      28.21      27.43
$\Delta(\hat{y}^{(k)}_{1:t}) = 1 - \mathrm{SentBLEU}(\hat{y}^{(K)}_{r+1:t}, y_{r+1:t})$   25.99      28.45      27.58

IWSLT 2014 DE-EN, development set
BSO models trained with a beam of size 6
Nothing to write home about, but nice that we can tune to metrics

slide-50
SLIDE 50

Machine Translation Experiments

Machine Translation (BLEU)             K_te = 1   K_te = 5   K_te = 10
Seq2Seq                                22.53      24.03      23.87
BSO                                    23.83      26.36      25.48
NLL                                    17.74      20.10      20.28
DAD [Venkatraman et al. 2015]          20.12      22.25      22.40
MIXER/RL [Ranzato et al. 2016]         20.73      21.81      21.83

IWSLT 2014 DE-EN
BSO models trained with a beam of size 6, with $\Delta(\hat{y}^{(k)}_{1:t}) = 1 - \mathrm{SentBLEU}(\hat{y}^{(K)}_{r+1:t}, y_{r+1:t})$
Results in the bottom sub-table are from Ranzato et al. [2016]
Note similar improvements to MIXER

slide-52
SLIDE 52

Conclusion

Introduced a variant of Seq2Seq and a training procedure that:
  Attempts to mitigate Label Bias and Exposure Bias
  Allows tuning to test-time metrics
  Allows training with hard constraints
  Doesn't require RL

N.B. Backprop through search is a thing now/again: it is one piece of the CCG parsing approach of Lee et al. (2016), an EMNLP 2016 Best Paper!

slide-53
SLIDE 53

Thanks!

slide-54
SLIDE 54

Training with Different Beam Sizes

Word Ordering (BLEU)   K_te = 1   K_te = 5   K_te = 10
K_tr = 2               30.59      31.23      30.26
K_tr = 6               28.20      34.22      34.67
K_tr = 11              26.88      34.42      34.88

ConBSO model, development set results

slide-55
SLIDE 55

Pseudocode

(h denotes hidden states along the gold/restarted prefix; ĥ denotes hidden states along beam hypotheses.)

 1: procedure BSO(x, K_tr, succ)
 2:   Init empty storage ŷ_{1:T} and ĥ_{1:T}; init S_1
 3:   r ← 0; violations ← {0}
 4:   for t = 1, . . . , T do                                          ⊲ Forward
 5:     K = K_tr if t ≠ T, else argmax_{k : ŷ^{(k)}_{1:t} ≠ y_{1:t}} f(ŷ^{(k)}_t, ĥ^{(k)}_{t−1})
 6:     if f(y_t, h_{t−1}) < f(ŷ^{(K)}_t, ĥ^{(K)}_{t−1}) + 1 then
 7:       ĥ_{r:t−1} ← ĥ^{(K)}_{r:t−1}
 8:       ŷ_{r+1:t} ← ŷ^{(K)}_{r+1:t}
 9:       Add t to violations; r ← t
10:       S_{t+1} ← topK(succ(y_{1:t}))
11:     else
12:       S_{t+1} ← topK(∪_{k=1}^{K} succ(ŷ^{(k)}_{1:t}))
13:   grad_{h_T} ← 0; grad_{ĥ_T} ← 0
14:   for t = T − 1, . . . , 1 do                                       ⊲ Backward
15:     grad_{h_t} ← BRNN(∇_{h_t} L_{t+1}, grad_{h_{t+1}})
16:     grad_{ĥ_t} ← BRNN(∇_{ĥ_t} L_{t+1}, grad_{ĥ_{t+1}})
17:     if t − 1 ∈ violations then
18:       grad_{h_t} ← grad_{h_t} + grad_{ĥ_t}
19:       grad_{ĥ_t} ← 0

slide-56
SLIDE 56

Backpropagation over Structure

(Figure: the search tree from the earlier gradient slides, with the gold sequence "a red dog runs quickly today" highlighted against the violating beam sequences)

Margin gradients are sparse: only violating sequences get updates.
Backprop requires only about 2x the time of standard methods.

slide-57
SLIDE 57

References

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.

Danqi Chen and Christopher D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, 2014.

Hal Daumé III and Daniel Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), pages 169–176, 2005.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368, 2015.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 282–289, 2001.

Yijia Liu, Yue Zhang, Wanxiang Che, and Bing Qin. Transition-based syntactic linearization. In Proceedings of NAACL, 2015.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1412–1421, 2015.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2755–2763, 2015.