CS 533: Natural Language Processing
Conditional Neural Language Models
Karl Stratos
Rutgers University
Language Models Considered So Far

$p_{Y|X}(y \mid x_{1:100})$

◮ Classical trigram models: $q_{Y|X}(y \mid x_{99}, x_{100})$
  ◮ Training: closed-form solution
◮ Log-linear models: $\mathrm{softmax}_y\big([w^\top \phi((x_{99}, x_{100}), y')]_{y'}\big)$
  ◮ Training: gradient descent on convex loss
◮ Neural models
  ◮ Feedforward: $\mathrm{softmax}_y\big(\mathrm{FF}([E_{x_{99}}, E_{x_{100}}])\big)$
  ◮ Recurrent: $\mathrm{softmax}_y\big(\mathrm{FF}(h(x_{1:99}), E_{x_{100}})\big)$
  ◮ Training: gradient descent on nonconvex loss
Conditional Language Models

◮ Machine translation: "Le programme a été mis en application" ⇒ "And the programme has been implemented"
◮ Summarization: "russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism" ⇒ "russia calls for joint front against terrorism"
◮ Data-to-text generation (Wiseman et al., 2017)
◮ Image captioning: image ⇒ "the dog saw the cat"
Encoder-Decoder Models

Much of machine learning is learning $x \mapsto y$ where $x, y$ are some complicated structures.

Encoder-decoder models are conditional models that handle this wide class of problems in two steps:
1. Encode the given input $x$ using some architecture.
2. Decode the output $y$.

Training: again minimize cross entropy
$$\min_\theta \; \mathbb{E}_{(\mathrm{input},\, \mathrm{output}) \sim p_{Y|X}} \left[ -\ln q^\theta_{Y|X}(\mathrm{output} \mid \mathrm{input}) \right]$$
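As a concrete reading of this objective, here is a minimal Python sketch that estimates the cross entropy from a sample of (input, output) pairs; `model_logprob` is a hypothetical stand-in for whatever encoder-decoder model computes $\ln q^\theta_{Y|X}(\text{output} \mid \text{input})$.

```python
import numpy as np

def cross_entropy_loss(model_logprob, pairs):
    """Monte Carlo estimate of E[-ln q_theta(output | input)] over (input, output) pairs.

    model_logprob(x, y) is assumed to return ln q_theta(y | x); training minimizes
    this quantity over the model parameters theta.
    """
    return -np.mean([model_logprob(x, y) for x, y in pairs])
```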
Agenda

1. MT
2. Attention in detail
3. Beam search
Machine Translation (MT)

◮ Goal: translate text from one language to another.
◮ One of the oldest problems in artificial intelligence.
Some History

◮ Early ’90s: rise of statistical MT (SMT)
◮ Exploit parallel text: "Le programme a été mis en application" / "And the programme has been implemented"
◮ Infer word alignment ("IBM" models, Brown et al., 1993)
SMT: Huge Pipeline

1. Use IBM models to extract word alignment and phrase alignment (Koehn et al., 2003).
2. Use syntactic analyzers (e.g., a parser) to extract features and manipulate text (e.g., phrase re-ordering).
3. Use a separate language model to enforce fluency.
4. . . .

Multiple independently trained models patched together
◮ Really complicated, prone to error propagation
Rise of Neural MT

Started taking off around 2014
◮ Replaced the entire pipeline with a single model
◮ Called "end-to-end" training/prediction

Input: Le programme a été mis en application
Output: And the programme has been implemented

◮ Revolution in MT
  ◮ Better performance, way simpler system
  ◮ A hallmark of the recent neural domination in NLP
◮ Key: attention mechanism
Recap: Recurrent Neural Network (RNN)

◮ Always think of an RNN as a mapping $\phi: \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$
  Input: an input vector $x \in \mathbb{R}^d$ and a state vector $h \in \mathbb{R}^{d'}$
  Output: a new state vector $h' \in \mathbb{R}^{d'}$
◮ A left-to-right RNN processes an input sequence $x_1 \ldots x_m \in \mathbb{R}^d$ as $h_i = \phi(x_i, h_{i-1})$, where $h_0$ is an initial state vector.
◮ Idea: $h_i$ is a representation of $x_i$ that has incorporated all inputs to the left:
$$h_i = \phi(x_i, \phi(x_{i-1}, \phi(x_{i-2}, \cdots \phi(x_1, h_0) \cdots)))$$
Variety 1: "Simple" RNN

◮ Parameters $U \in \mathbb{R}^{d' \times d}$ and $V \in \mathbb{R}^{d' \times d'}$:
$$h_i = \tanh(U x_i + V h_{i-1})$$
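A minimal NumPy sketch of this recurrence; the dimensions ($d = 4$, $d' = 3$) and the random parameters are illustrative only.

```python
import numpy as np

def simple_rnn_step(x, h_prev, U, V):
    """One step of the simple RNN: h_i = tanh(U x_i + V h_{i-1})."""
    return np.tanh(U @ x + V @ h_prev)

def run_rnn(xs, U, V, h0):
    """Process x_1 ... x_m left to right, returning the states h_1 ... h_m."""
    states, h = [], h0
    for x in xs:
        h = simple_rnn_step(x, h, U, V)
        states.append(h)
    return states

# Toy usage with illustrative sizes.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
xs = [rng.normal(size=4) for _ in range(5)]
hs = run_rnn(xs, U, V, h0=np.zeros(3))  # hs[-1] has seen the whole sequence
```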
Stacked Simple RNN

◮ Parameters $U^{(1)} \in \mathbb{R}^{d' \times d}$, $U^{(2)} \ldots U^{(L)} \in \mathbb{R}^{d' \times d'}$, and $V^{(1)} \ldots V^{(L)} \in \mathbb{R}^{d' \times d'}$:
$$h_i^{(1)} = \tanh\left(U^{(1)} x_i + V^{(1)} h_{i-1}^{(1)}\right)$$
$$h_i^{(2)} = \tanh\left(U^{(2)} h_i^{(1)} + V^{(2)} h_{i-1}^{(2)}\right)$$
$$\vdots$$
$$h_i^{(L)} = \tanh\left(U^{(L)} h_i^{(L-1)} + V^{(L)} h_{i-1}^{(L)}\right)$$
◮ Think of it as a mapping $\phi: \mathbb{R}^d \times \mathbb{R}^{Ld'} \to \mathbb{R}^{Ld'}$:
$$\left(x_i,\; \begin{bmatrix} h_{i-1}^{(1)} \\ \vdots \\ h_{i-1}^{(L)} \end{bmatrix}\right) \mapsto \begin{bmatrix} h_i^{(1)} \\ \vdots \\ h_i^{(L)} \end{bmatrix}$$
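A minimal sketch of one stacked-RNN step, assuming (as in the dimensions above) that layer 1 reads the input $x_i$ while each higher layer reads the new state of the layer below it; the parameter lists Us, Vs are an illustrative packaging.

```python
import numpy as np

def stacked_rnn_step(x, h_prev, Us, Vs):
    """One step of an L-layer stacked simple RNN.

    h_prev is a list of L state vectors [h^(1)_{i-1}, ..., h^(L)_{i-1}];
    returns the list of new states [h^(1)_i, ..., h^(L)_i].
    """
    h_new, inp = [], x
    for U, V, h in zip(Us, Vs, h_prev):
        h_layer = np.tanh(U @ inp + V @ h)
        h_new.append(h_layer)
        inp = h_layer  # the next layer's input is this layer's new state
    return h_new
```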
Variety 2: Long Short-Term Memory (LSTM)

◮ Parameters $U^q, U^c, U^o \in \mathbb{R}^{d' \times d}$ and $V^q, V^c, V^o, W^q, W^o \in \mathbb{R}^{d' \times d'}$:
$$q_i = \sigma\left(U^q x_i + V^q h_{i-1} + W^q c_{i-1}\right)$$
$$c_i = (1 - q_i) \odot c_{i-1} + q_i \odot \tanh\left(U^c x_i + V^c h_{i-1}\right)$$
$$o_i = \sigma\left(U^o x_i + V^o h_{i-1} + W^o c_i\right)$$
$$h_i = o_i \odot \tanh(c_i)$$
◮ Idea: "memory cells" $c_i$ can carry long-range information.
◮ What happens if $q_i$ is close to zero?
◮ Can be stacked as in the simple RNN.
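A direct NumPy transcription of the four update equations for this LSTM variant (a coupled input/forget gate $q_i$ with peephole terms $W^q, W^o$); packaging the matrices in a dict P is an illustrative choice, not part of the slides. Note that when $q_i \approx 0$ the cell simply copies $c_{i-1}$ forward.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step following the slide's equations. P holds the matrices
    Uq, Vq, Wq, Uc, Vc, Uo, Vo, Wo; returns the new state h and cell c."""
    q = sigmoid(P["Uq"] @ x + P["Vq"] @ h_prev + P["Wq"] @ c_prev)
    c = (1 - q) * c_prev + q * np.tanh(P["Uc"] @ x + P["Vc"] @ h_prev)
    o = sigmoid(P["Uo"] @ x + P["Vo"] @ h_prev + P["Wo"] @ c)
    h = o * np.tanh(c)
    return h, c
```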
Translation Problem

◮ Vocabulary of the source language $V_{\mathrm{src}}$:
$$V_{\mathrm{src}} = \{ 그, 개가, 보았다, 소식, 2017, 5월, \ldots \}$$
◮ Vocabulary of the target language $V_{\mathrm{trg}}$:
$$V_{\mathrm{trg}} = \{ \text{the, dog, cat, 2021, May}, \ldots \}$$
◮ Task. Given any sentence $x_1 \ldots x_m \in V_{\mathrm{src}}$, produce a corresponding translation $y_1 \ldots y_n \in V_{\mathrm{trg}}$.

개가 짖었다 ⇒ the dog barked
Evaluating Machine Translation

◮ $T$: human-translated sentences
◮ $\hat{T}$: machine-translated sentences
◮ $p_n$: precision of $n$-grams in $\hat{T}$ against $n$-grams in $T$ (sentence-wise)
◮ BLEU: controversial but popular scheme to automatically evaluate translation quality
$$\mathrm{BLEU} = \min\left(1, \frac{|\hat{T}|}{|T|}\right) \times \left(\prod_{n=1}^{4} p_n\right)^{1/4}$$
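A minimal sentence-level Python sketch of this score. It assumes clipped n-gram counts for $p_n$ (the slide does not specify clipping) and uses the simplified length ratio above rather than the standard exponential brevity penalty of corpus-level BLEU.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    with counts clipped by the reference counts."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / sum(cand.values())

def bleu(candidate, reference):
    """BLEU in the slide's simplified form:
    min(1, |T_hat|/|T|) times the geometric mean of p_1 ... p_4."""
    ps = [precision(candidate, reference, n) for n in range(1, 5)]
    if min(ps) == 0.0:  # geometric mean is zero if any p_n is zero
        return 0.0
    brevity = min(1.0, len(candidate) / len(reference))
    product = 1.0
    for p in ps:
        product *= p
    return brevity * product ** 0.25
```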
Translation Model: Conditional Language Model

A translation model defines a probability distribution $p(y_1 \ldots y_n \mid x_1 \ldots x_m)$ over all sentences $y_1 \ldots y_n \in V_{\mathrm{trg}}$, conditioned on any sentence $x_1 \ldots x_m \in V_{\mathrm{src}}$.

Goal: design a good translation model, e.g.,

p(the dog barked | 개가 짖었다) > p(the cat barked | 개가 짖었다) > p(dog the barked | 개가 짖었다) > p(oqc shgwqw#w 1g0 | 개가 짖었다)

How can we use an RNN to build a translation model?
Basic Encoder-Decoder Framework

Model parameters
◮ Vector $e_x \in \mathbb{R}^d$ for every $x \in V_{\mathrm{src}}$
◮ Vector $e_y \in \mathbb{R}^d$ for every $y \in V_{\mathrm{trg}} \cup \{*\}$
◮ Encoder RNN $\psi: \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$ for $V_{\mathrm{src}}$
◮ Decoder RNN $\phi: \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$ for $V_{\mathrm{trg}}$
◮ Feedforward $f: \mathbb{R}^{d'} \to \mathbb{R}^{|V_{\mathrm{trg}}|+1}$

Basic idea
1. Transform $x_1 \ldots x_m \in V_{\mathrm{src}}$ with $\psi$ into some representation $\xi$.
2. Build a language model $\phi$ over $V_{\mathrm{trg}}$ conditioned on $\xi$.
Encoder

For $i = 1 \ldots m$,
$$h_i^\psi = \psi\left(e_{x_i}, h_{i-1}^\psi\right)$$
so that
$$h_m^\psi = \psi\left(e_{x_m}, \psi\left(e_{x_{m-1}}, \psi\left(e_{x_{m-2}}, \cdots \psi\left(e_{x_1}, h_0^\psi\right) \cdots\right)\right)\right)$$
Decoder

Initialize $h_0^\phi = h_m^\psi$ and $y_0 = *$. For $i = 1, 2, \ldots$, the decoder defines a probability distribution over $V_{\mathrm{trg}} \cup \{\mathrm{STOP}\}$ as
$$h_i^\phi = \phi\left(e_{y_{i-1}}, h_{i-1}^\phi\right)$$
$$p_\Theta(y \mid x_1 \ldots x_m, y_0 \ldots y_{i-1}) = \mathrm{softmax}_y\left(f\left(h_i^\phi\right)\right)$$

Probability of translation $y_1 \ldots y_n$ given $x_1 \ldots x_m$:
$$p_\Theta(y_1 \ldots y_n \mid x_1 \ldots x_m) = \left(\prod_{i=1}^{n} p_\Theta(y_i \mid x_1 \ldots x_m, y_0 \ldots y_{i-1})\right) \times p_\Theta(\mathrm{STOP} \mid x_1 \ldots x_m, y_0 \ldots y_n)$$
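Putting the encoder and decoder together, here is a NumPy sketch that scores one candidate translation, i.e., computes $\ln p_\Theta(y_1 \ldots y_n, \mathrm{STOP} \mid x_1 \ldots x_m)$. The step functions psi/phi, the feedforward f, the embedding matrices, and the start/stop index conventions are all illustrative assumptions standing in for the model parameters above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def translation_logprob(src_ids, trg_ids, E_src, E_trg, psi, phi, f,
                        d_state, start_id, stop_id):
    """Log-probability of trg_ids (followed by STOP) given src_ids.

    psi and phi are RNN step functions (input vector, state) -> new state;
    f maps a decoder state to logits over V_trg plus STOP; start_id indexes
    the embedding of the start symbol *, stop_id indexes STOP in the logits."""
    # Encoder: run psi over the source and keep the final state.
    h = np.zeros(d_state)
    for x in src_ids:
        h = psi(E_src[x], h)
    # Decoder: start from the final encoder state with y_0 = *.
    prev_emb, logprob = E_trg[start_id], 0.0
    for y in list(trg_ids) + [stop_id]:
        h = phi(prev_emb, h)
        logprob += np.log(softmax(f(h))[y])
        if y != stop_id:
            prev_emb = E_trg[y]
    return logprob
```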
Slide credit: Danqi Chen & Karthik Narasimhan

Encoder (figures): the sentence "This cat is cute" is mapped to word embeddings x_1 ... x_4, which are fed through RNN states h_1 ... h_4 starting from h_0; the final state serves as the encoded representation h_enc.
Decoder (figure): starting from h_enc, the decoder RNN reads <s>, "ce", "chat", "est", "mignon" and outputs "ce", "chat", "est", "mignon", <e> one token at a time (states z_1 ... z_5).