Recurrent Networks, and Attention, for Statistical Machine Translation
Michael Collins, Columbia University
Mapping Sequences to Sequences
◮ Learn to map input sequences $x_1 \ldots x_n$ to output sequences $y_1 \ldots y_m$, where $y_m = $ STOP.
◮ Can decompose this as
$$p(y_1 \ldots y_m \mid x_1 \ldots x_n) = \prod_{j=1}^{m} p(y_j \mid y_1 \ldots y_{j-1}, x_1 \ldots x_n)$$
◮ Encoder/decoder framework: use an LSTM to map $x_1 \ldots x_n$ to a vector $h^{(n)}$, then model
$$p(y_j \mid y_1 \ldots y_{j-1}, x_1 \ldots x_n) = p(y_j \mid y_1 \ldots y_{j-1}, h^{(n)})$$
using a “decoding” LSTM
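As a concrete illustration of the chain-rule decomposition, here is a minimal Python sketch that scores a full target sequence by summing per-position conditional log-probabilities. `log_prob_next_token` is a hypothetical callback (not from the slides) standing in for whatever model, e.g. the encoder/decoder LSTM developed below, supplies $\log p(y_j \mid y_1 \ldots y_{j-1}, x_1 \ldots x_n)$.

```python
def sequence_log_prob(x_tokens, y_tokens, log_prob_next_token):
    """Chain rule: log p(y_1..y_m | x_1..x_n) is the sum of per-position
    conditional log-probabilities. `log_prob_next_token(prefix, source, y)`
    is a hypothetical model callback returning log p(y | prefix, source).
    y_tokens[-1] is assumed to be the STOP symbol."""
    total = 0.0
    for j in range(len(y_tokens)):
        total += log_prob_next_token(y_tokens[:j], x_tokens, y_tokens[j])
    return total
```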
The Computational Graph
Training A Recurrent Network for Translation
Inputs: A sequence of source-language words $x_1 \ldots x_n$ where each $x_j \in \mathbb{R}^d$. A sequence of target-language words $y_1 \ldots y_m$ where $y_m = $ STOP.
Definitions:
◮ $\theta_F$ = parameters of an “encoding” LSTM.
◮ $\theta_D$ = parameters of a “decoding” LSTM.
◮ $\mathrm{LSTM}(x^{(t)}, h^{(t-1)}; \theta)$ maps an input $x^{(t)}$ together with a hidden state $h^{(t-1)}$ to a new hidden state $h^{(t)}$. Here $\theta$ are the parameters of the LSTM.
Training A Recurrent Network for Translation (continued)
Computational Graph:
◮ Initialize $h^{(0)}$ to some values (e.g., a vector of all zeros)
◮ (Encoding step:) For $t = 1 \ldots n$:
  ◮ $h^{(t)} = \mathrm{LSTM}(x^{(t)}, h^{(t-1)}; \theta_F)$
◮ Initialize $\beta^{(0)}$ to some values (e.g., a vector of all zeros)
◮ (Decoding step:) For $j = 1 \ldots m$:
  ◮ $\beta^{(j)} = \mathrm{LSTM}(\mathrm{CONCAT}(y_{j-1}, h^{(n)}), \beta^{(j-1)}; \theta_D)$
  ◮ $l^{(j)} = V \times \mathrm{CONCAT}(\beta^{(j)}, y_{j-1}, h^{(n)}) + \gamma$, $\quad q^{(j)} = \mathrm{LS}(l^{(j)})$, $\quad o^{(j)} = -q^{(j)}_{y_j}$ (where $\mathrm{LS}$ denotes the log-softmax function)
◮ (Final loss is the sum of the per-position losses:)
$$o = \sum_{j=1}^{m} o^{(j)}$$
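A minimal PyTorch sketch of this training graph, under assumed toy dimensions and hypothetical module names (`enc`, `dec`, `emb`, `V`); the slides' $\mathrm{LSTM}(\cdot)$ is played here by `nn.LSTMCell`, which also carries a cell state that the slides' notation leaves implicit, and the cross-entropy call implements $o^{(j)} = -q^{(j)}_{y_j}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, hid, vocab = 32, 64, 1000                 # toy dimensions (assumptions)
enc = nn.LSTMCell(d, hid)                    # parameters theta_F
dec = nn.LSTMCell(d + hid, hid)              # parameters theta_D
emb = nn.Embedding(vocab, d)                 # embeddings for target words y_{j-1}
V = nn.Linear(hid + d + hid, vocab)          # the matrix V and bias gamma

def training_loss(x, y):
    """x: (n, d) source vectors; y: (m,) target word ids with y[-1] = STOP."""
    h = c = torch.zeros(1, hid)              # h^(0) initialised to zeros
    for t in range(x.size(0)):               # encoding step
        h, c = enc(x[t].unsqueeze(0), (h, c))
    hn = h                                   # h^(n)

    beta = bc = torch.zeros(1, hid)          # beta^(0) initialised to zeros
    prev = emb(torch.zeros(1, dtype=torch.long))   # hypothetical start symbol y_0
    loss = 0.0
    for j in range(y.size(0)):               # decoding step (teacher forcing)
        beta, bc = dec(torch.cat([prev, hn], dim=-1), (beta, bc))
        logits = V(torch.cat([beta, prev, hn], dim=-1))    # l^(j)
        loss = loss + F.cross_entropy(logits, y[j:j+1])    # o^(j) = -q^(j)_{y_j}
        prev = emb(y[j:j+1])                 # the gold y_j feeds the next step
    return loss                              # o = sum_j o^(j)
```

In practice one would batch sentences and truncate or pack sequences, but the per-step structure matches the graph above.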
The Computational Graph
Greedy Decoding with A Recurrent Network for Translation
◮ Encoding step: calculate $h^{(n)}$ from the input $x_1 \ldots x_n$
◮ $j = 1$. Do:
  ◮ $y_j = \arg\max_y p(y \mid y_1 \ldots y_{j-1}, h^{(n)})$
  ◮ $j = j + 1$
◮ Until: $y_{j-1} = $ STOP
Greedy Decoding with A Recurrent Network for Translation
Computational Graph:
◮ Initialize $h^{(0)}$ to some values (e.g., a vector of all zeros)
◮ (Encoding step:) For $t = 1 \ldots n$:
  ◮ $h^{(t)} = \mathrm{LSTM}(x^{(t)}, h^{(t-1)}; \theta_F)$
◮ Initialize $\beta^{(0)}$ to some values (e.g., a vector of all zeros)
◮ (Decoding step:) $j = 1$. Do:
  ◮ $\beta^{(j)} = \mathrm{LSTM}(\mathrm{CONCAT}(y_{j-1}, h^{(n)}), \beta^{(j-1)}; \theta_D)$
  ◮ $l^{(j)} = V \times \mathrm{CONCAT}(\beta^{(j)}, y_{j-1}, h^{(n)}) + \gamma$
  ◮ $y_j = \arg\max_y l^{(j)}_y$
  ◮ $j = j + 1$
◮ Until $y_{j-1} = $ STOP
◮ Return $y_1 \ldots y_{j-1}$
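The same setup run as a greedy decoder, as a sketch; the toy modules (`enc`, `dec`, `emb`, `V`) and the STOP/start-symbol ids are assumptions, redefined here so the snippet stands alone.

```python
import torch
import torch.nn as nn

d, hid, vocab, STOP = 32, 64, 1000, 1        # toy sizes; STOP id is hypothetical
enc = nn.LSTMCell(d, hid)                    # theta_F
dec = nn.LSTMCell(d + hid, hid)              # theta_D
emb = nn.Embedding(vocab, d)
V = nn.Linear(hid + d + hid, vocab)

def greedy_decode(x, max_len=50):
    """x: (n, d) source vectors; returns greedily chosen target word ids."""
    h = c = torch.zeros(1, hid)
    for t in range(x.size(0)):                       # encoding step
        h, c = enc(x[t].unsqueeze(0), (h, c))
    hn = h                                           # h^(n)

    beta = bc = torch.zeros(1, hid)
    prev = emb(torch.zeros(1, dtype=torch.long))     # hypothetical start symbol
    out = []
    while len(out) < max_len:                        # decoding step
        beta, bc = dec(torch.cat([prev, hn], dim=-1), (beta, bc))
        logits = V(torch.cat([beta, prev, hn], dim=-1))   # l^(j)
        y_j = int(torch.argmax(logits, dim=-1))      # y_j = argmax_y l^(j)_y
        out.append(y_j)
        if y_j == STOP:                              # until y_{j-1} = STOP
            break
        prev = emb(torch.tensor([y_j]))              # predicted y_j feeds the next step
    return out
```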
A bi-directional LSTM (bi-LSTM) for Encoding
Inputs: A sequence $x_1 \ldots x_n$ where each $x_j \in \mathbb{R}^d$.
Definitions: $\theta_F$ and $\theta_B$ are the parameters of a forward and a backward LSTM.
Computational Graph:
◮ $h^{(0)}$ and $\eta^{(n+1)}$ are set to some initial values.
◮ For $t = 1 \ldots n$:
  ◮ $h^{(t)} = \mathrm{LSTM}(x^{(t)}, h^{(t-1)}; \theta_F)$
◮ For $t = n \ldots 1$:
  ◮ $\eta^{(t)} = \mathrm{LSTM}(x^{(t)}, \eta^{(t+1)}; \theta_B)$
◮ For $t = 1 \ldots n$:
  ◮ $u^{(t)} = \mathrm{CONCAT}(h^{(t)}, \eta^{(t)})$ ⇐ the encoding for position $t$
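A sketch of the bi-LSTM encoder with assumed toy sizes; `fwd` and `bwd` play the roles of the LSTMs with parameters $\theta_F$ and $\theta_B$, and again `nn.LSTMCell` carries a cell state that the slides' notation leaves implicit.

```python
import torch
import torch.nn as nn

d, hid = 32, 64                 # toy dimensions (assumptions)
fwd = nn.LSTMCell(d, hid)       # parameters theta_F
bwd = nn.LSTMCell(d, hid)       # parameters theta_B

def encode(x):
    """x: (n, d) source vectors; returns u of shape (n, 2*hid)."""
    n = x.size(0)
    h = c = torch.zeros(1, hid)
    hs = []
    for t in range(n):                        # forward pass, t = 1..n
        h, c = fwd(x[t].unsqueeze(0), (h, c))
        hs.append(h)                          # h^(t)
    e = ec = torch.zeros(1, hid)
    es = [None] * n
    for t in reversed(range(n)):              # backward pass, t = n..1
        e, ec = bwd(x[t].unsqueeze(0), (e, ec))
        es[t] = e                             # eta^(t)
    # u^(t) = CONCAT(h^(t), eta^(t)): the encoding for position t
    return torch.cat([torch.cat(hs, dim=0), torch.cat(es, dim=0)], dim=-1)
```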
The Computational Graph
Incorporating Attention
◮ Old decoder:
  ◮ $c^{(j)} = h^{(n)}$ ⇐ context used in decoding at the $j$'th step
  ◮ $\beta^{(j)} = \mathrm{LSTM}(\mathrm{CONCAT}(y_{j-1}, c^{(j)}), \beta^{(j-1)}; \theta_D)$
  ◮ $l^{(j)} = V \times \mathrm{CONCAT}(\beta^{(j)}, y_{j-1}, c^{(j)}) + \gamma$
  ◮ $y_j = \arg\max_y l^{(j)}_y$
Incorporating Attention
◮ New decoder:
  ◮ Define
$$c^{(j)} = \sum_{i=1}^{n} a_{i,j} \, u^{(i)}$$
  where
$$a_{i,j} = \frac{\exp\{s_{i,j}\}}{\sum_{i'=1}^{n} \exp\{s_{i',j}\}} \qquad \text{and} \qquad s_{i,j} = A(\beta^{(j-1)}, u^{(i)}; \theta_A)$$
  and $A(\ldots)$ is a non-linear function (e.g., a feedforward network) with parameters $\theta_A$
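A sketch of this attention computation: a small feedforward network stands in for the scorer $A(\beta^{(j-1)}, u^{(i)}; \theta_A)$, a softmax over source positions gives the weights $a_{i,j}$, and the weighted sum gives the context $c^{(j)}$. Module names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid, enc_dim, att = 64, 128, 64               # toy sizes (assumptions)
# A(beta, u; theta_A): a small feedforward scorer producing one score per pair
A = nn.Sequential(nn.Linear(hid + enc_dim, att), nn.Tanh(), nn.Linear(att, 1))

def attention_context(beta_prev, u):
    """beta_prev: (1, hid) decoder state beta^(j-1); u: (n, enc_dim) encodings u^(1..n)."""
    n = u.size(0)
    pairs = torch.cat([beta_prev.expand(n, -1), u], dim=-1)   # (beta^(j-1), u^(i)) pairs
    s = A(pairs).squeeze(-1)                  # scores s_{i,j}, shape (n,)
    a = F.softmax(s, dim=0)                   # a_{i,j} = exp(s_{i,j}) / sum_{i'} exp(s_{i',j})
    c = (a.unsqueeze(-1) * u).sum(dim=0, keepdim=True)   # c^(j) = sum_i a_{i,j} u^(i)
    return c, a
```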
Greedy Decoding with Attention
◮ (Decoding step:) $j = 1$. Do:
  ◮ For $i = 1 \ldots n$: $s_{i,j} = A(\beta^{(j-1)}, u^{(i)}; \theta_A)$
  ◮ For $i = 1 \ldots n$: $a_{i,j} = \frac{\exp\{s_{i,j}\}}{\sum_{i'=1}^{n} \exp\{s_{i',j}\}}$
  ◮ Set $c^{(j)} = \sum_{i=1}^{n} a_{i,j} \, u^{(i)}$
  ◮ $\beta^{(j)} = \mathrm{LSTM}(\mathrm{CONCAT}(y_{j-1}, c^{(j)}), \beta^{(j-1)}; \theta_D)$
  ◮ $l^{(j)} = V \times \mathrm{CONCAT}(\beta^{(j)}, y_{j-1}, c^{(j)}) + \gamma$
  ◮ $y_j = \arg\max_y l^{(j)}_y$
  ◮ $j = j + 1$
◮ Until $y_{j-1} = $ STOP
◮ Return $y_1 \ldots y_{j-1}$
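Putting the pieces together, a sketch of greedy decoding with attention: the context $c^{(j)}$ is recomputed from $\beta^{(j-1)}$ at every step. The modules are the same hypothetical ones as in the earlier sketches, redefined so the snippet stands alone; `u` is assumed to come from a bi-LSTM encoder such as the one sketched above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, hid, enc_dim, vocab, STOP = 32, 64, 128, 1000, 1     # toy sizes (assumptions)
A = nn.Sequential(nn.Linear(hid + enc_dim, 64), nn.Tanh(), nn.Linear(64, 1))
dec = nn.LSTMCell(d + enc_dim, hid)
emb = nn.Embedding(vocab, d)
V = nn.Linear(hid + d + enc_dim, vocab)

def greedy_decode_with_attention(u, max_len=50):
    """u: (n, enc_dim) bi-LSTM encodings u^(1..n); returns target word ids."""
    beta = bc = torch.zeros(1, hid)
    prev = emb(torch.zeros(1, dtype=torch.long))         # hypothetical start symbol
    out = []
    while len(out) < max_len:
        s = A(torch.cat([beta.expand(u.size(0), -1), u], dim=-1)).squeeze(-1)
        a = F.softmax(s, dim=0)                          # attention weights a_{i,j}
        ctx = (a.unsqueeze(-1) * u).sum(dim=0, keepdim=True)    # context c^(j)
        beta, bc = dec(torch.cat([prev, ctx], dim=-1), (beta, bc))
        logits = V(torch.cat([beta, prev, ctx], dim=-1)) # l^(j)
        y_j = int(torch.argmax(logits, dim=-1))          # y_j = argmax_y l^(j)_y
        out.append(y_j)
        if y_j == STOP:
            break
        prev = emb(torch.tensor([y_j]))
    return out
```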
Training with Attention
◮ (Decoding step:) For $j = 1 \ldots m$:
  ◮ For $i = 1 \ldots n$: $s_{i,j} = A(\beta^{(j-1)}, u^{(i)}; \theta_A)$
  ◮ For $i = 1 \ldots n$: $a_{i,j} = \frac{\exp\{s_{i,j}\}}{\sum_{i'=1}^{n} \exp\{s_{i',j}\}}$
  ◮ Set $c^{(j)} = \sum_{i=1}^{n} a_{i,j} \, u^{(i)}$
  ◮ $\beta^{(j)} = \mathrm{LSTM}(\mathrm{CONCAT}(y_{j-1}, c^{(j)}), \beta^{(j-1)}; \theta_D)$
  ◮ $l^{(j)} = V \times \mathrm{CONCAT}(\beta^{(j)}, y_{j-1}, c^{(j)}) + \gamma$, $\quad q^{(j)} = \mathrm{LS}(l^{(j)})$, $\quad o^{(j)} = -q^{(j)}_{y_j}$
◮ (Final loss is the sum of the per-position losses:)
$$o = \sum_{j=1}^{m} o^{(j)}$$
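Training with attention differs from greedy decoding only in that the gold word $y_j$ is fed back at each step (teacher forcing) and the per-position losses $o^{(j)}$ are summed. A sketch, taking the hypothetical modules from the previous snippets as arguments:

```python
import torch
import torch.nn.functional as F

def attention_training_loss(u, y, A, dec, emb, V, hid=64):
    """u: (n, enc_dim) encodings u^(1..n); y: (m,) gold target ids with y[-1] = STOP.
    A, dec, emb, V are the (hypothetical) scorer, decoder cell, embedding, and
    output layer from the sketches above, with compatible dimensions."""
    beta = bc = torch.zeros(1, hid)
    prev = emb(torch.zeros(1, dtype=torch.long))              # hypothetical start symbol
    loss = 0.0
    for j in range(y.size(0)):
        s = A(torch.cat([beta.expand(u.size(0), -1), u], dim=-1)).squeeze(-1)
        a = F.softmax(s, dim=0)                               # a_{i,j}
        ctx = (a.unsqueeze(-1) * u).sum(dim=0, keepdim=True)  # c^(j)
        beta, bc = dec(torch.cat([prev, ctx], dim=-1), (beta, bc))
        logits = V(torch.cat([beta, prev, ctx], dim=-1))      # l^(j)
        loss = loss + F.cross_entropy(logits, y[j:j+1])       # o^(j) = -q^(j)_{y_j}
        prev = emb(y[j:j+1])                                  # teacher forcing
    return loss                                               # o = sum_j o^(j)
```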
The Computational Graph
Results from Wu et al. 2016
◮ From Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Wu et al., 2016). Human evaluations are on a 1–6 scale (6 is best). PBMT is a phrase-based translation system, using IBM alignment models as a starting point.
Results from Wu et al. 2016 (continued)
Conclusions
◮ Directly model
$$p(y_1 \ldots y_m \mid x_1 \ldots x_n) = \prod_{j=1}^{m} p(y_j \mid y_1 \ldots y_{j-1}, x_1 \ldots x_n)$$
◮ Encoding step: map $x_1 \ldots x_n$ to $u^{(1)} \ldots u^{(n)}$ using a bidirectional LSTM
◮ Decoding step: use an LSTM in decoding, together with attention