RNN: Extensions (3/3)
• Stack LSTM – an LSTM for representing stack structure
  – Extends the standard LSTM with a stack pointer
  – Previously only the push() operation was allowed; now a pop() operation is also supported
• Memory-augmented LSTMs
  – Neural Turing machine
  – Differentiable neural computer
  – Cf. the neural encoder-decoder and the Stack LSTM can be seen as special cases of memory-augmented LSTMs
• RNN architecture search with reinforcement learning
  – Training neural architectures that maximize the expected accuracy on a specific task
The Challenge of Long-Term Dependencies
• Example: a very simple recurrent network with no nonlinear activation function and no inputs
  – h^(t) = W^T h^(t-1), so h^(t) = (W^t)^T h^(0)
  – With the eigendecomposition W = Q Λ Q^T, this becomes h^(t) = Q^T Λ^t Q h^(0)
  – This form is ill-conditioned: eigenvalues with magnitude greater than one explode and eigenvalues with magnitude less than one decay to zero, so early states are either amplified or forgotten
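A minimal NumPy sketch of this effect (the 2x2 matrix W and the initial state are made-up illustrative values, not from the slide): repeated multiplication by W^T amplifies the eigendirection with |λ| > 1 and shrinks the one with |λ| < 1.

```python
import numpy as np

# Hypothetical recurrence matrix with one eigenvalue above 1 and one below 1.
W = np.array([[1.1, 0.0],
              [0.0, 0.9]])
h0 = np.array([1.0, 1.0])  # initial state h^(0)

for t in (1, 10, 50):
    h_t = np.linalg.matrix_power(W.T, t) @ h0
    print(t, h_t)  # first component grows like 1.1^t, second shrinks like 0.9^t
```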
The Challenge of Long-Term Dependencies
• Gradients vanish or explode in deep models
• BPTT for recurrent neural networks is a typical example
  – The error signal is obtained by repeated multiplication of W: δ^(k) = f'(a^(k)) ∘ W^T δ^(k+1)
  – Writing W = Q diag(λ) Q^(-1), we have W^t = Q diag(λ)^t Q^(-1), so the unrolled product contains the powers λ_i^t
  – The gradient explodes if |λ_i| > 1 and vanishes if |λ_i| < 1
Exploding and vanishing gradients [Bengio '94; Pascanu '13]
• BPTT error signal: δ^(t-1) = diag(f'(a^(t-1))) W^T δ^(t), so δ^(k) = (∏_{k<i≤t} diag(f'(a^(i))) W^T) δ^(t)
• Let ‖diag(f'(a^(i)))‖ ≤ γ for the bounded nonlinear function f, and let λ_1 be the largest singular value of W
• Sufficient condition for the vanishing gradient problem: λ_1 < 1/γ
  – then ‖diag(f'(a^(i))) W^T‖ ≤ γ λ_1 < 1, so the repeated product shrinks toward zero
• Necessary condition for the exploding gradient problem: λ_1 > 1/γ
  – obtained by just inverting the condition for the vanishing gradient problem
Gradient clipping [Pascanu '13]
• Deals with exploding gradients
• Clip the norm ‖g‖ of the gradient g just before the parameter update:
  if ‖g‖ > v:  g ← (v / ‖g‖) g
[Figure: error-surface trajectories without gradient clipping and with clipping]
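A minimal sketch of norm-based clipping as described above (the threshold name v follows the slide; the example gradient is made up for illustration):

```python
import numpy as np

def clip_gradient(g, v):
    """Rescale gradient g so that its L2 norm never exceeds the threshold v."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = (v / norm) * g
    return g

# Example: a gradient with norm 50 is rescaled to norm 5 before the update.
g = np.full(100, 5.0)                      # ||g|| = 50
print(np.linalg.norm(clip_gradient(g, 5.0)))  # -> 5.0
```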
Long Short-Term Memory (LSTM)
• LSTM makes it easier for RNNs to capture long-term dependencies by using gated units
  – Basic LSTM [Hochreiter and Schmidhuber '97]
    • Cell state unit c^(t) serves as an internal memory
    • Introduces an input gate and an output gate
    • Problem: the output stays close to zero as long as the output gate is closed
  – Modern LSTM: adds a forget gate [Gers et al. '00]
  – Variants of LSTM
    • Add peephole connections [Gers et al. '02]
      – Allow all gates to inspect the current cell state even when the output gate is closed
Long Short-Term Memory (LSTM)
[Figure: a vanilla recurrent network (input x, hidden state h, output y, weights W) shown next to an LSTM, which adds a memory cell (cell state unit) c]
Long Short-Term Memory (LSTM)
• Memory cell c: a gated unit controlled by the input, output, and forget gates
• Gated flow: z = g ∘ h, where g is a gate vector
  – f: forget gate, i: input gate, o: output gate
Long Short-Term Memory (LSTM)
• Computing gate values:
  – f^(t) = σ_f(x_t, h_(t-1))  (forget gate)
  – i^(t) = σ_i(x_t, h_(t-1))  (input gate)
  – o^(t) = σ_o(x_t, h_(t-1))  (output gate)
• New memory cell: c̃_t = tanh(W_c x_t + U_c h_(t-1))
• Memory cell (cell state unit): c^(t) = i^(t) ∘ c̃_t + f^(t) ∘ c_(t-1)
• Output: h^(t) = o^(t) ∘ tanh(c_t)
Long Short-Term Memory (LSTM)
• Controlling the flow by gate values
  – Gates: f^(t) = σ_f(x_t, h_(t-1)), i^(t) = σ_i(x_t, h_(t-1)), o^(t) = σ_o(x_t, h_(t-1))
  – New memory cell: c̃_t = tanh(W_c x_t + U_c h_(t-1))
  – Memory cell (cell state unit): c^(t) = i^(t) ∘ c̃_t + f^(t) ∘ c_(t-1)
  – Output: h^(t) = o^(t) ∘ tanh(c_t)
Long Short-Term Memory (LSTM): Cell Unit Notation (Simplified)
• c̃_t = tanh(W_c x_t + U_c h_(t-1))
• c^(t) = i^(t) ∘ c̃_t + f^(t) ∘ c_(t-1)
• h^(t) = o^(t) ∘ tanh(c_t)
Long Short-Term Memory (LSTM): Long-Term Dependencies
• x^(1) → h^(4): early inputs can be preserved in the memory cell over many time steps by the gating mechanism
[Figure: an LSTM unrolled over inputs x^(1), ..., x^(4); the cell state carries the contribution of x^(1) forward to h^(4)]
LSTM: Update Formula h^(t) = f(x_t, h^(t-1))
• i_t = σ(W^(i) x_t + U^(i) h_(t-1))  (input gate)
• f_t = σ(W^(f) x_t + U^(f) h_(t-1))  (forget gate)
• o_t = σ(W^(o) x_t + U^(o) h_(t-1))  (output/exposure gate)
• c̃_t = tanh(W^(c) x_t + U^(c) h_(t-1))  (new memory cell)
• c^(t) = f^(t) ∘ c_(t-1) + i^(t) ∘ c̃_t  (final memory cell)
• h^(t) = o^(t) ∘ tanh(c_t)
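A minimal NumPy sketch of one forward step of this update formula (the dimensions, random parameters, and dictionary layout are illustrative assumptions, not the slide's parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the i, f, o, c parameter blocks."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])        # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])        # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])  # new memory cell
    c = f * c_prev + i * c_tilde                              # final memory cell
    h = o * np.tanh(c)                                        # exposed hidden state
    return h, c

# Toy dimensions and random parameters, purely for illustration.
d_x, d_h = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_x)) for k in "ifoc"}
U = {k: rng.normal(size=(d_h, d_h)) for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, U, b)
```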
LSTM: Memory Cell c^(t)
• c^(t) = f^(t) ∘ c_(t-1) + i^(t) ∘ c̃_t
• h^(t) = o^(t) ∘ tanh(c_t)
[Figure: computation graph showing how x_t and h^(t-1) produce the gates i^(t), f^(t), o^(t) and the candidate c̃_t, which combine with c^(t-1) to give c^(t) and h^(t)]
LSTM: Memory Cell
• c^(t) behaves like a memory:
  – c^(t) = f^(t) ∘ c_(t-1) + i^(t) ∘ c̃_t
  – M(t) = FORGET * M(t-1) + INPUT * NEW_INPUT
  – H(t) = OUTPUT * M(t)
• FORGET: erase operation (memory reset)
• INPUT: write operation
• OUTPUT: read operation
Memory Cell - Example
• A worked example with 4-dimensional toy vectors: the new memory is M[t+1] = FORGET ∘ M[t] + INPUT ∘ (new input), and the output is H = OUTPUT ∘ tanh(M[t+1])
• For instance, with new memory M[t+1] = [0, 0.5, 1, 2] and output gate [0, 0, 1, 1], the output is H = [0, 0, 0.76, 0.96], since tanh(1) ≈ 0.76 and tanh(2) ≈ 0.96
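A short sketch reproducing the gating arithmetic of this example (the previous memory, forget gate, input gate, and new input below are illustrative stand-ins; only the resulting memory [0, 0.5, 1, 2] and the output gate [0, 0, 1, 1] are taken from the slide):

```python
import numpy as np

# Illustrative previous memory, gates, and candidate input (not the slide's exact values).
M_prev    = np.array([0.0, 1.0, 1.0, 1.0])
forget    = np.array([0.0, 0.5, 1.0, 1.0])
inp       = np.array([0.0, 0.0, 0.0, 1.0])
new_input = np.array([0.0, 0.5, 0.0, 1.0])

M_next = forget * M_prev + inp * new_input     # -> [0, 0.5, 1, 2]
output = np.array([0.0, 0.0, 1.0, 1.0])        # output gate from the slide
H = output * np.tanh(M_next)                   # -> [0, 0, 0.76, 0.96]
print(M_next, np.round(H, 2))
```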
Long Short-Term Memory (LSTM): Backpropagation
• Error signal through a gated flow z = g ∘ h:
  – δh = diag(g) δz = g ∘ δz
Long Short-Term Memory (LSTM): Backpropagation
• Simplified forward equations:
  – a_t = W x_t + U h_(t-1)
  – m_t = i_t ∘ tanh(a_t) + f_t ∘ m_(t-1)
  – h_t = o_t ∘ tanh(m_t)
[Figure: flow graph of the simplified cell, with error signals flowing back through m_t, a_t, and h_(t-1)]
Long Short-Term Memory (LSTM): Backpropagation
• Forward: a_t = W x_t + U h_(t-1);  m_t = i_t ∘ tanh(a_t) + f_t ∘ m_(t-1);  h_t = o_t ∘ tanh(m_t)
• Backward through the candidate path:
  – δa_t = tanh'(a_t) ∘ i_t ∘ δm_t
  – δh_(t-1) = U^T δa_t = U^T (tanh'(a_t) ∘ i_t ∘ δm_t)
Long Short-Term Memory (LSTM): Backpropagation
• δh_(t-1) = U^T (tanh'(a_t) ∘ i_t ∘ δm_t)
• Backward through the memory cell:
  – δm_(t-1) = tanh'(m_(t-1)) ∘ o_(t-1) ∘ δh_(t-1) + f_t ∘ δm_t
  – Substituting δh_(t-1):
    δm_(t-1) = tanh'(m_(t-1)) ∘ o_(t-1) ∘ U^T (tanh'(a_t) ∘ i_t ∘ δm_t) + f_t ∘ δm_t
Long Short-Term Memory (LSTM): Backpropagation
• Forward: a_t = W x_t + U h_(t-1);  m_t = i_t ∘ tanh(a_t) + f_t ∘ m_(t-1);  h_t = o_t ∘ tanh(m_t)
• Useful identity: tanh'(x) = 1 - tanh^2(x)
• Putting it together:
  – δh_(t-1) = U^T (tanh'(a_t) ∘ i_t ∘ δm_t)
  – δm_(t-1) = tanh'(m_(t-1)) ∘ o_(t-1) ∘ δh_(t-1) + f_t ∘ δm_t
             = tanh'(m_(t-1)) ∘ o_(t-1) ∘ U^T (tanh'(a_t) ∘ i_t ∘ δm_t) + f_t ∘ δm_t
LSTM vs. Vanilla RNN: Backpropagation
• Vanilla RNN: a_t = W h_(t-1) + U x_t,  h_t = tanh(a_t) = f(a_t)
  – δh_(t-1) = f'(a_t) ∘ W^T δh_t  (repeated multiplication by W)
• LSTM:
  – δm_(t-1) = tanh'(m_(t-1)) ∘ o_(t-1) ∘ U^T (tanh'(a_t) ∘ i_t ∘ δm_t) + f_t ∘ δm_t
• The additive term f_t ∘ δm_t is the key to dealing with vanishing gradient problems: when the forget gate is open, the error signal flows through the cell state without being squashed by a nonlinearity or repeatedly multiplied by W
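A small NumPy sketch contrasting the two recursions (all weights, gates, and states below are random or constant illustrative values, and the pre-activations are reused across steps purely for simplicity): the vanilla RNN error signal shrinks under repeated multiplication by W, while the LSTM cell-state error keeps an additive path scaled only by the forget gate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 8, 50
W = 0.1 * rng.normal(size=(d, d))      # small recurrent weights -> vanishing RNN gradient
U = 0.1 * rng.normal(size=(d, d))
f_gate = np.full(d, 0.95)              # forget gate close to 1 (illustrative)
i_gate = np.full(d, 0.5)
o_gate = np.full(d, 0.5)

delta_h = np.ones(d)                   # vanilla RNN error signal
delta_m = np.ones(d)                   # LSTM cell-state error signal
a = rng.normal(size=d)                 # pre-activation (reused each step for simplicity)
m = rng.normal(size=d)                 # cell state (reused each step for simplicity)

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2

for t in range(T):
    delta_h = dtanh(a) * (W.T @ delta_h)                        # repeated multiplication by W
    delta_m = dtanh(m) * o_gate * (U.T @ (dtanh(a) * i_gate * delta_m)) \
              + f_gate * delta_m                                # additive forget-gate path

# The RNN signal decays toward zero; the LSTM signal stays many orders of magnitude larger.
print(np.linalg.norm(delta_h), np.linalg.norm(delta_m))
```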
Exercise: Backpropagation for LSTM
• Complete the flow graph and derive the weight update formula
[Figure: LSTM flow graph with input x_t, previous state h_(t-1), gates i_t, f_t, o_t, candidate c̃_t, memory cell c_t, and output h_t]
Gated Recurrent Units [Cho et al. '14]
• Alternative architecture to handle long-term dependencies: h^(t) = f(x_t, h^(t-1))
  – z_t = σ(W^(z) x_t + U^(z) h_(t-1))  (update gate)
  – r_t = σ(W^(r) x_t + U^(r) h_(t-1))  (reset gate)
  – h̃_t = tanh(r_t ∘ U h_(t-1) + W x_t)  (new memory)
  – h^(t) = (1 - z_t) ∘ h̃_t + z_t ∘ h_(t-1)  (hidden state)
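A minimal NumPy sketch of one GRU step following these equations (dimensions and random parameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U):
    """One GRU step; W and U hold the z, r, and candidate ('h') parameter blocks."""
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev)               # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev)               # reset gate
    h_tilde = np.tanh(W["h"] @ x + r * (U["h"] @ h_prev))   # new memory
    return (1.0 - z) * h_tilde + z * h_prev                 # hidden state

d_x, d_h = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_x)) for k in ("z", "r", "h")}
U = {k: rng.normal(size=(d_h, d_h)) for k in ("z", "r", "h")}
h = gru_step(rng.normal(size=d_x), np.zeros(d_h), W, U)
```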
LSTM-CRF: RNN with Output Dependency
• The output layer of the RNN forms a directed graphical model with edges from past outputs y^(i) to the current output
  – This model is able to perform CRF-style tagging
[Figure: an RNN over x^(1), ..., x^(t) whose outputs y^(1), ..., y^(t) are chained, in addition to the hidden-state connections]
Recurrent Language Model
• Introducing the state variable h^(t-1) in the graphical model of the RNN
Bidirectional RNN
• Combine two RNNs
  – Forward RNN: an RNN that moves forward, beginning from the start of the sequence
  – Backward RNN: an RNN that moves backward, beginning from the end of the sequence
  – This makes the prediction of y^(t) depend on the whole input sequence
Bidirectional LSTM CRF [Huang '15]
• One of the state-of-the-art models for sequence labelling tasks
[Figure: a BI-LSTM-CRF model applied to named entity tasks]
Bidirectional LSTM CRF [Huang β15] Comparison of tagging performance on POS, chunking and NER tasks for various models [Huang et al. 15]
Neural Machine Translation β’ RNN encoder-decoder β Neural encoder-decoder : Conditional recurrent language model β’ Neural machine translation with attention mechanism β Encoder: Bidirectional LSTM β Decoder: Attention Mechanism [Bahdanau et al β15] β’ Character based NMT β Hierarchical RNN Encoder- Decoder [Ling β16] β Subword-level Neural MT [Sennrich β15] β Hybrid NMT [Luong & Manning β16] β Googleβs NMT [Wu et al β16]
Neural Encoder-Decoder
[Figure: input text → encoder → decoder → translated text]
Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf
Neural Encoder-Decoder: Conditional Recurrent Language Model Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf
Neural Encoder-Decoder [Cho et al. '14]
• Computing the log translation probability log P(y|x) with two RNNs
  – Encoder: RNN
  – Decoder: recurrent language model
Decoder: Recurrent Language Model Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf
Neural Encoder-Decoder with Attention Mechanism [Bahdanau et al. '15]
• Decoder with attention mechanism
  – Applies attention to the encoded representations before generating the next target word
  – Attention: find the aligned source words for each target word
    • Can be seen as an implicit alignment process
  – Context vector c:
    • Previously, the last hidden state of the RNN encoder [Cho et al. '14]
    • Now, chosen content-sensitively as a mixture of the hidden states of the input sentence when generating each target word
Decoder with Attention Mechanism
• Attention over the encoded representations H̄ = [h̄_1, ..., h̄_n]: a_t = softmax(score(z_(t-1), H̄))
• Attention scoring function (directly computes a soft alignment):
  – score(z_(t-1), h̄_s) = v^T tanh(W z_(t-1) + V h̄_s)
• Alignment weights:
  – a_t(s) = exp(score(z_(t-1), h̄_s)) / Σ_s' exp(score(z_(t-1), h̄_s'))
• Expected annotation (context vector): c_t = Σ_s a_t(s) h̄_s, where h̄_s is a source hidden state
Decoder with Attention Mechanism
• Original scoring function [Bahdanau et al. '15]:
  – score(z_(t-1), h̄_s) = v^T tanh(W z_(t-1) + V h̄_s)
• Extension of scoring functions [Luong et al. '15]:
  – Bilinear function: score(z_(t-1), h̄_s) = z_(t-1)^T W_a h̄_s
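A short NumPy sketch of both scoring variants and the resulting soft alignment and context vector (dimensions, random states, and parameter names are illustrative assumptions):

```python
import numpy as np

def additive_score(z, H_bar, W, V, v):
    """Bahdanau-style score: v^T tanh(W z + V h_bar_s) for every source state."""
    return np.array([v @ np.tanh(W @ z + V @ h) for h in H_bar])

def bilinear_score(z, H_bar, W_a):
    """Luong-style bilinear score: z^T W_a h_bar_s for every source state."""
    return np.array([z @ W_a @ h for h in H_bar])

def attention_context(scores, H_bar):
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax -> soft alignment a_t(s)
    return weights, weights @ H_bar     # context vector = expected annotation

d, n = 4, 5
rng = np.random.default_rng(0)
H_bar = rng.normal(size=(n, d))         # source hidden states
z = rng.normal(size=d)                  # previous decoder state
W, V, W_a = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
a, c = attention_context(additive_score(z, H_bar, W, V, v), H_bar)
```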
Neural Encoder-Decoder with Attention Mechanism [Luong et al. '15]
• Computation path: h_t → a_t → c_t → h̃_t
  – Previously [Bahdanau et al. '15]: h_(t-1) → a_t → c_t → h_t
• Attention scoring function
http://aclweb.org/anthology/D15-1166
Neural Encoder-Decoder with Attention Mechanism [Luong et al. '15]
• Input-feeding approach
  – The attentional vector at time step t is concatenated with the next input vector to form the input at t+1
  – Attentional vectors h̃_t are fed as inputs to the next time steps to inform the model about past alignment decisions
GNMT: Google's Neural Machine Translation [Wu et al. '16]
• A deep LSTM network with 8 encoder and 8 decoder layers, using residual connections as well as attention connections from the decoder network to the encoder
• Trained on Google's Tensor Processing Units (TPUs)
GNMT: Googleβs Neural Machine Translation [Wu et al β16] Mean of side-by-side scores on production data Reduces translation errors by an average of 60% compared to Googleβs phrase-based production system.
Pointer Network
• Uses attention as a pointer to select a member of the input sequence as the output
[Figure: a neural encoder-decoder compared with a pointer network, in which the attention weights serve directly as the output]
Neural Conversational Model [Vinyals and Le β 15] β’ Using neural encoder-decoder for conversations β Response generation http://arxiv.org/pdf/1506.05869.pdf
BIDAF for Machine Reading Comprehension [Seo β17] Bidirectional attention flow
Memory Augmented Neural Networks β Extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes β’ Writing & Reading mechanisms are added β’ Examples ο§ Neural Turing Machine ο§ Differentiable Neural Computer ο§ Memory networks
Neural Turing Machine [Graves '14]
• Two basic components: a neural network controller and a memory bank
• The controller network receives inputs from an external environment and emits outputs in response
  – It also reads from and writes to a memory matrix via a set of parallel read and write heads
Memory
• Memory M_t
  – The contents of the N × M memory matrix at time t (N locations, each a vector of size M)
Read/Write Operations for Memory
• Read from memory ("blurry")
  – w_t: a vector of weightings over the N locations emitted by a read head at time t, with Σ_i w_t(i) = 1
  – r_t: the length-M read vector, r_t = Σ_i w_t(i) M_t(i)
• Write to memory ("blurry")
  – Each write is an erase followed by an add:
    M̃_t(i) = M_(t-1)(i) ∘ (1 - w_t(i) e_t),  M_t(i) = M̃_t(i) + w_t(i) a_t
  – e_t: erase vector, a_t: add vector
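A minimal NumPy sketch of these blurry read and write operations (memory size, weightings, and the erase/add vectors below are made-up toy values):

```python
import numpy as np

def read(memory, w):
    """Blurry read: weighted sum of memory rows, r_t = sum_i w(i) M(i)."""
    return w @ memory

def write(memory, w, erase, add):
    """Blurry write: erase each row i by w(i)*e, then add w(i)*a."""
    memory = memory * (1.0 - np.outer(w, erase))   # erase step
    return memory + np.outer(w, add)               # add step

N, M = 4, 3                                        # 4 locations, each of width 3
memory = np.zeros((N, M))
w = np.array([0.7, 0.2, 0.1, 0.0])                 # attention over locations
memory = write(memory, w, erase=np.ones(M), add=np.array([1.0, 2.0, 3.0]))
print(read(memory, w))                             # blurry read of the written vector
```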
Addressing by Content
• Based on an attention mechanism
  – Focuses attention on locations based on the similarity between the current memory values and a key emitted by the controller:
    w_t^c(i) = exp(β_t K[k_t, M_t(i)]) / Σ_j exp(β_t K[k_t, M_t(j)])
  – k_t: the length-M key vector
  – β_t: a key strength, which can amplify or attenuate the precision of the focus
  – K[u, v]: similarity measure, here cosine similarity
Addressing
• Interpolating content-based weights with the previous weights, which results in the gated weighting
• A scalar interpolation gate g_t
  – Blends between the weighting w_(t-1) produced by the head at the previous time step and the weighting w_t^c produced by the content system at the current time step:
    w_t^g = g_t w_t^c + (1 - g_t) w_(t-1)
Addressing by Location
• Based on shifting
  – s_t: a shift weighting that defines a normalized distribution over the allowed integer shifts
    • E.g., the simplest way is to use a softmax layer
    • Scalar-based: if the shift scalar is 6.7, then s_t(6) = 0.3, s_t(7) = 0.7, and the rest of s_t is zero
  – γ_t: an additional scalar that sharpens the final weighting
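A compact NumPy sketch of the full addressing pipeline described above — content similarity, interpolation with the previous weighting, circular shift, and sharpening (the memory contents, key, and gate values are illustrative toy inputs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def address(memory, key, beta, g, shift, gamma, w_prev):
    # 1) Content addressing: similarity of the key to each memory row, scaled by beta.
    w_c = softmax(beta * np.array([cosine(key, row) for row in memory]))
    # 2) Interpolation with the previous weighting via the scalar gate g.
    w_g = g * w_c + (1.0 - g) * w_prev
    # 3) Location addressing: circular convolution with the shift distribution.
    N = len(w_g)
    w_s = np.array([sum(w_g[j] * shift[(i - j) % N] for j in range(N)) for i in range(N)])
    # 4) Sharpening with gamma >= 1, then renormalizing.
    w = w_s ** gamma
    return w / w.sum()

N, M = 4, 3
memory = np.eye(N, M)                      # toy memory contents
w_prev = np.full(N, 1.0 / N)
shift = np.array([0.0, 1.0, 0.0, 0.0])     # all mass on a shift of +1 location
w = address(memory, key=np.array([1.0, 0.0, 0.0]), beta=5.0,
            g=1.0, shift=shift, gamma=2.0, w_prev=w_prev)
print(w)                                   # focus moves from location 0 to location 1
```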
Addressing: Architecture
Controller
• The controller network (an FNN or RNN) receives the external input x_t ∈ R^X and emits the addressing parameters for each head:
  – Read head: key k_t ∈ R^M, key strength β_t > 0, interpolation gate g_t ∈ (0,1), shift weighting s_t, sharpening scalar γ_t ≥ 1
  – Write head: the same parameters, plus the erase vector e_t ∈ (0,1)^M and the add vector a_t ∈ R^M
NTM vs. LSTM: Copy task β’ Task: Copy sequences of eight bit random vectors, where sequence lengths were randomised b/w 1 and 20
NTM vs. LSTM: Mult copy
Differentiable Neural Computers
• Extension of the NTM with more advanced memory addressing
• Memory addressing is defined by three main attention mechanisms:
  – Content (also used in the NTM)
  – Memory allocation
  – Temporal order
• The controller interpolates among these mechanisms using scalar gates
Credit: http://people.idsia.ch/~rupesh/rnnsymposium2016/slides/graves.pdf
DNC: Overall architecture
DNC: bAbI Results
• Each story is treated as a separate sequence and presented to the network in the form of word vectors, one word at a time:
  mary journeyed to the kitchen. mary moved to the bedroom. john went back to the hallway. john picked up the milk there. what is john carrying ? - john travelled to the garden. john journeyed to the bedroom. what is john carrying ? - mary travelled to the bathroom. john took the apple there. what is john carrying ? - -
• The answers required at the "-" symbols, grouped by question into braces, are {milk}, {milk}, {milk, apple}
• The network was trained to minimize the cross-entropy of the softmax outputs with respect to the target words
DNC: bAbI Results http://www.nature.com/nature/journal/v538/n7626/full/nature20101.html
Deep learning for Natural language processing β’ Short intro to NLP β’ Word embedding β’ Deep learning for NLP
Natural Language Processing
• What is NLP?
  – The automatic processing of human language
    • Give computers the ability to process human language
  – Its goal is to enable computers to achieve human-like comprehension of texts and languages
• Tasks
  – Text processing
    • POS tagging / parsing / discourse analysis
  – Information extraction
  – Question answering
  – Dialog systems / chatbots
  – Machine translation
Linguistics and NLP
• Many NLP tasks correspond to structural subfields of linguistics
  – Phonetics, Phonology → Speech recognition
  – Morphology → Word segmentation, POS tagging
  – Syntax → Parsing
  – Semantics → Word sense disambiguation, Semantic role labeling, Semantic parsing
  – Pragmatics → Named entity recognition/disambiguation, Reading comprehension
Information Extraction
• Source text: According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms.
• Entity extraction: According to <Per>Robert Callahan</Per>, president of <Org>Eastern's</Org> flight attendants union, the past practice of <Org>Eastern's</Org> parent, <Loc>Houston</Loc>-based <Org>Texas Air Corp.</Org>, has involved ultimatums to unions to accept the carrier's terms.
• Relation extraction: <Employee_Of> Robert Callahan – Eastern's; <Located_In> Texas Air Corp. – Houston
POS Tagging β’ Input: Plays well with others β’ Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS β’ Output: Plays/VBZ well/RB with/IN others/NNS
Parsing
• Sentence: "John ate the apple"
• Grammar rules: S → NP VP; NP → N; NP → DET N; VP → V NP; N → John; V → ate; DET → the; N → apple
• Parse tree (PSG tree): (S (NP (N John)) (VP (V ate) (NP (DET the) (N apple))))
Dependency Parsing
• Sentence: "John ate the apple"
• PSG tree: (S (NP (N John)) (VP (V ate) (NP (DET the) (N apple))))
• Dependency tree: ate —SUBJ→ John, ate —OBJ→ apple, apple —MOD→ the
Semantic Role Labeling
• Example: [Agent Jim] gave [Patient the book] [Goal to the professor].
• Semantic roles and descriptions:
  – Agent: initiator of action, capable of volition
  – Patient: affected by action, undergoes change of state
  – Theme: entity moving, or being "located"
  – Experiencer: perceives action but not in control
  – Further roles: Location, Beneficiary, Instrument, Source, Goal
Sentiment analysis Posted by: big John (1) I bought a Samsung camera and my friends brought a Canon camera yesterday . (2) In the past week, we both used the cameras a lot . (3) The photos from my Samy are not that great, and the battery life is short too . (4) My friend was very happy with his camera and loves its picture quality . (5) I want a camera that can take good photos . (6) I am going to return it tomorrow . (Samsung, picture_quality, negative, big John) (Samsung, battery_life, negative, big John) (Canon, GENERAL, positive, big Johnβs_friend ) (Canon, picture_quality, positive, big Johnβs_friend )
Coreference Resolution
• Mentions: [A man named Lionel Gaedi] went to [the Port-au-Prince morgue] in search of [[his] brother], [Josef], but was unable to find [[his] body] among [the piles of corpses that had been left [there]].
• Resolved: [A man named Lionel Gaedi]1 went to [the Port-au-Prince morgue]2 in search of [[his]1 brother]3, [Josef]3, but was unable to find [[his]3 body]4 among [the piles of corpses that had been left [there]2]5.
Question Answering
• One of the oldest NLP tasks
• Modern QA systems
  – IBM's Watson, Apple's Siri, etc.
• Examples of factoid questions:
  – Where is the Louvre Museum located? → In Paris, France
  – What's the abbreviation for limited partnership? → L.P.
  – What are the names of Odin's ravens? → Huginn and Muninn
  – What currency is used in China? → The yuan
Example: IBM Watson System β’ Open-domain question answering system (DeepQA) β In 2011, Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! Challenge