Machine Learning Lecture 10: Recurrent Neural Networks Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set of notes is based on internet resources and references listed at the end. Nevin L. Zhang (HKUST) Machine Learning 1 / 43
Introduction Outline 1 Introduction 2 Recurrent Neural Networks 3 Long Short-Term Memory (LSTM) RNN 4 RNN Architectures 5 Attention Nevin L. Zhang (HKUST) Machine Learning 2 / 43
Introduction Introduction So far, we have been talking about neural network models for labelled data:

\{x_i, y_i\}_{i=1}^N \longrightarrow P(y \mid x),

where each training example consists of one input x_i and one output y_i. Next, we will talk about neural network models for sequential data:

\{(x_i^{(1)}, \ldots, x_i^{(\tau_i)}),\ (y_i^{(1)}, \ldots, y_i^{(\tau_i)})\}_{i=1}^N \longrightarrow P(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}),

where each training example consists of a sequence of inputs (x_i^{(1)}, \ldots, x_i^{(\tau_i)}) and a sequence of outputs (y_i^{(1)}, \ldots, y_i^{(\tau_i)}), and the current output y^{(t)} depends not only on the current input, but also on all previous inputs. Nevin L. Zhang (HKUST) Machine Learning 3 / 43
Introduction Introduction: Language Modeling Data: A collection of sentences. For each sentence, create an output sequence by shifting it:

Input: ("what", "is", "the", "problem"),  Output: ("is", "the", "problem", −)

From the training pairs, we can learn a neural language model: It is used to predict the next word: P(w_k \mid w_1, \ldots, w_{k-1}). It also defines a probability distribution over sentences:

P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1)\, P(w_4 \mid w_3, w_2, w_1) \cdots

Nevin L. Zhang (HKUST) Machine Learning 4 / 43
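As an aside not in the original slides, a minimal Python sketch of how such shifted input/target pairs might be built; the whitespace tokenizer and the "-" padding symbol are assumptions chosen to match the example above:

```python
# Build (input, target) pairs for a neural language model by shifting each
# sentence one position to the left, as on the slide.
def make_lm_pairs(sentences):
    pairs = []
    for sentence in sentences:
        tokens = sentence.lower().split()   # naive whitespace tokenizer (assumption)
        inputs = tokens                     # ("what", "is", "the", "problem")
        targets = tokens[1:] + ["-"]        # ("is", "the", "problem", "-")
        pairs.append((inputs, targets))
    return pairs

print(make_lm_pairs(["What is the problem"]))
# [(['what', 'is', 'the', 'problem'], ['is', 'the', 'problem', '-'])]
```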
Introduction Introduction: Dialogue and Machine Translation Data: A collection of matched pairs. "How are you?" ; "I am fine." We can still think of having an input and an output at each time point, except that some inputs and outputs are dummies:

Input: ("How", "are", "you", −, −, −),  Output: (−, −, −, "I", "am", "fine").

From the training pairs, we can learn a neural model for dialogue or machine translation. Nevin L. Zhang (HKUST) Machine Learning 5 / 43
Recurrent Neural Networks Outline 1 Introduction 2 Recurrent Neural Networks 3 Long Short-Term Memory (LSTM) RNN 4 RNN Architectures 5 Attention Nevin L. Zhang (HKUST) Machine Learning 6 / 43
Recurrent Neural Networks Recurrent Neural Networks An RNN can be drawn as a circuit diagram, aka recurrent graph (left), where a black square indicates a time-delayed dependence, or as an unfolded computational graph, aka unrolled graph (right). The length of the unrolled graph is determined by the length of the input. In other words, the unrolled graphs for different sequences can be of different lengths. Nevin L. Zhang (HKUST) Machine Learning 7 / 43
Recurrent Neural Networks Recurrent Neural Networks The input tokens x^{(t)} are represented as embedding vectors, which are determined together with other model parameters during learning. The hidden states h^{(t)} are also vectors. The current state h^{(t)} depends on the current input x^{(t)} and the previous state h^{(t-1)} as follows:

a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
h^{(t)} = \tanh(a^{(t)})

where b, W and U are model parameters. They are independent of time t. Nevin L. Zhang (HKUST) Machine Learning 8 / 43
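As an illustration (not part of the original slides), the recurrence might be written in NumPy as follows; all sizes and the random initialization are arbitrary assumptions:

```python
import numpy as np

hidden_size, input_size = 4, 3                   # arbitrary sizes (assumption)
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_size, hidden_size))  # state-to-state weights
U = rng.normal(size=(hidden_size, input_size))   # input-to-state weights
b = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """One recurrence step: a^(t) = b + W h^(t-1) + U x^(t), h^(t) = tanh(a^(t))."""
    a_t = b + W @ h_prev + U @ x_t
    return np.tanh(a_t)

h = np.zeros(hidden_size)                        # h^(0)
for x_t in rng.normal(size=(5, input_size)):     # dummy input sequence of length 5
    h = rnn_step(h, x_t)                         # the same W, U, b are reused at every t
```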
Recurrent Neural Networks Recurrent Neural Networks The output sequence is produced as follows:

o^{(t)} = c + V h^{(t)}
\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})

where c and V are model parameters. They are independent of time t. Nevin L. Zhang (HKUST) Machine Learning 9 / 43
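A matching sketch (again an illustration, not from the slides) of the output step; the vocabulary size and the dummy hidden state are assumptions:

```python
import numpy as np

hidden_size, vocab_size = 4, 10                  # arbitrary sizes (assumption)
rng = np.random.default_rng(0)
V = rng.normal(size=(vocab_size, hidden_size))   # state-to-output weights
c = np.zeros(vocab_size)

def rnn_output(h_t):
    """o^(t) = c + V h^(t);  y_hat^(t) = softmax(o^(t))."""
    o_t = c + V @ h_t
    e = np.exp(o_t - o_t.max())                  # numerically stable softmax
    return e / e.sum()

y_hat = rnn_output(rng.normal(size=hidden_size)) # a distribution over the vocabulary at step t
```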
Recurrent Neural Networks Recurrent Neural Networks This is the loss for one training pair:

L(\{x^{(1)}, \ldots, x^{(\tau)}\}, \{y^{(1)}, \ldots, y^{(\tau)}\}) = -\sum_{t=1}^{\tau} \log P_{\mathrm{model}}(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}),

where \log P_{\mathrm{model}}(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}) is obtained by reading the entry for y^{(t)} from the model's output vector \hat{y}^{(t)}. When there are multiple input-target sequence pairs, the losses are added up. Training objective: Minimize the total loss of all training pairs w.r.t. the model parameters and embedding vectors: W, U, V, b, c, \theta_{em}, where \theta_{em} are the embedding vectors. Nevin L. Zhang (HKUST) Machine Learning 10 / 43
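A small numerical illustration (not from the lecture) of this per-sequence loss; the output vectors and target indices below are made up:

```python
import numpy as np

def sequence_nll(y_hats, targets):
    """Negative log-likelihood of one (input, target) sequence pair.

    y_hats:  list of softmax output vectors y_hat^(1..tau)
    targets: list of target word indices y^(1..tau)
    """
    return -sum(np.log(y_hat[y_t]) for y_hat, y_t in zip(y_hats, targets))

# Toy example: tau = 2, vocabulary of size 3.
y_hats = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
targets = [0, 1]
print(sequence_nll(y_hats, targets))   # -(log 0.7 + log 0.8) ≈ 0.58
```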
Recurrent Neural Networks Training RNNs RNNs are trained using stochastic gradient descent. We need the gradients: \nabla_W L, \nabla_U L, \nabla_V L, \nabla_b L, \nabla_c L, \nabla_{\theta_{em}} L. They are computed using Backpropagation Through Time (BPTT), which is an adaptation of backpropagation to the unrolled computational graph. BPTT is implemented in deep learning packages such as TensorFlow. Nevin L. Zhang (HKUST) Machine Learning 11 / 43
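To make this concrete, here is a minimal sketch of one training step using PyTorch rather than TensorFlow (all sizes, names, and the dummy data are invented for the example); the framework's autograd performs BPTT on the unrolled graph automatically:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 1000, 32, 64        # arbitrary sizes (assumption)

class TinyRNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # theta_em
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)   # U, W, b
        self.out = nn.Linear(hidden_size, vocab_size)                 # V, c

    def forward(self, tokens):                  # tokens: (batch, tau)
        h, _ = self.rnn(self.embed(tokens))     # h: (batch, tau, hidden)
        return self.out(h)                      # logits: (batch, tau, vocab)

model = TinyRNNLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                 # negative log-likelihood of the targets

inputs = torch.randint(vocab_size, (8, 5))      # dummy batch: 8 sequences of length 5
targets = torch.randint(vocab_size, (8, 5))

logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                 # gradients via BPTT on the unrolled graph
optimizer.step()
```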
Recurrent Neural Networks RNN and Self-Supervised Learning Self-supervised learning is a learning technique where the training data is automatically labelled. It is still supervised learning, but the dataset does not need to be manually labelled by a human; the labels can instead be obtained by finding and exploiting relations between different parts of the input signals. RNN language-model training is self-supervised learning: the target sequence is obtained automatically by shifting the input sequence. Nevin L. Zhang (HKUST) Machine Learning 12 / 43
Long Short-Term Memory (LSTM) RNN Outline 1 Introduction 2 Recurrent Neural Networks 3 Long Short-Term Memory (LSTM) RNN 4 RNN Architectures 5 Attention Nevin L. Zhang (HKUST) Machine Learning 13 / 43
Long Short-Term Memory (LSTM) RNN Basic Idea The Long Short-Term Memory (LSTM) unit is a widely used technique for addressing long-term dependencies. The key idea is to use memory cells and gates:

c^{(t)} = f_t c^{(t-1)} + i_t a^{(t)}

where c^{(t)} is the memory state at t and a^{(t)} is the new input at t. If the forget gate f_t is open (i.e., 1) and the input gate i_t is closed (i.e., 0), the current memory is kept. If the forget gate f_t is closed (i.e., 0) and the input gate i_t is open (i.e., 1), the current memory is erased and replaced by the new input. If we can learn f_t and i_t from data, then we can automatically determine how much history to remember/forget. Nevin L. Zhang (HKUST) Machine Learning 14 / 43
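A tiny numerical illustration of this scalar gating rule (the values are made up, not from the slides):

```python
# c^(t) = f_t * c^(t-1) + i_t * a^(t)
def gate_update(c_prev, a_t, f_t, i_t):
    return f_t * c_prev + i_t * a_t

c_prev, a_t = 5.0, -2.0
print(gate_update(c_prev, a_t, f_t=1.0, i_t=0.0))  # 5.0  -> memory kept
print(gate_update(c_prev, a_t, f_t=0.0, i_t=1.0))  # -2.0 -> memory replaced by new input
print(gate_update(c_prev, a_t, f_t=0.5, i_t=0.5))  # 1.5  -> a learned mixture of both
```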
Long Short-Term Memory (LSTM) RNN Basic Idea In the case of vectors, c^{(t)} = f_t \otimes c^{(t-1)} + i_t \otimes a^{(t)}, where \otimes means pointwise product. f_t is called the forget gate vector because it determines which components of the previous state to remember/forget, and by how much. i_t is called the input gate vector because it determines which components of the new input (computed from h^{(t-1)} and x^{(t)}) should go into the current state, and by how much. If we can learn f_t and i_t from data, then we can automatically determine which components to remember/forget and how much of them to remember/forget. Nevin L. Zhang (HKUST) Machine Learning 15 / 43
Long Short-Term Memory (LSTM) RNN LSTM Cell In standard RNN,

a^{(t)} = b + W h^{(t-1)} + U x^{(t)},   h^{(t)} = \tanh(a^{(t)})

In LSTM, we introduce a cell state vector c_t, and set

c_t = f_t \otimes c_{t-1} + i_t \otimes a^{(t)}
h^{(t)} = \tanh(c_t)

where f_t and i_t are vectors. Nevin L. Zhang (HKUST) Machine Learning 16 / 43
Long Short-Term Memory (LSTM) RNN LSTM Cell: Learning the Gates f_t is determined based on the current input x^{(t)} and the previous hidden unit h^{(t-1)}:

f_t = \sigma(W_f x^{(t)} + U_f h^{(t-1)} + b_f),

where W_f, U_f, b_f are parameters to be learned from data. i_t is also determined based on the current input x^{(t)} and the previous hidden unit h^{(t-1)}:

i_t = \sigma(W_i x^{(t)} + U_i h^{(t-1)} + b_i),

where W_i, U_i, b_i are parameters to be learned from data. Note that the sigmoid activation function is used for the gates so that their values are often close to 0 or 1. In contrast, tanh is used for the output h^{(t)} so as to have a strong gradient signal during backprop. Nevin L. Zhang (HKUST) Machine Learning 17 / 43
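A quick illustration of the two activation choices (not from the slides; the sample points are arbitrary): the sigmoid squashes pre-activations into (0, 1) and saturates near 0 or 1 for large |z|, which suits gates, while tanh ranges over (-1, 1):

```python
import numpy as np

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
sigmoid = 1.0 / (1.0 + np.exp(-z))
print(sigmoid)       # approx [0.007, 0.269, 0.5, 0.731, 0.993]  -- gate values in (0, 1)
print(np.tanh(z))    # approx [-1.0, -0.762, 0.0, 0.762, 1.0]    -- state values in (-1, 1)
```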
Long Short-Term Memory (LSTM) RNN LSTM Cell: Output Gate We can also have an output gate to control which components of the state vector c_t, and how much of them, should be output:

o_t = \sigma(W_q x^{(t)} + U_q h^{(t-1)} + b_q)

where W_q, U_q, b_q are the learnable parameters, and set

h^{(t)} = o_t \otimes \tanh(c_t)

Nevin L. Zhang (HKUST) Machine Learning 18 / 43
Long Short-Term Memory (LSTM) RNN LSTM Cell: Summary A standard RNN cell:

a^{(t)} = b + W h^{(t-1)} + U x^{(t)},   h^{(t)} = \tanh(a^{(t)})

An LSTM cell:

f_t = \sigma(W_f x^{(t)} + U_f h^{(t-1)} + b_f)   (forget gate)
i_t = \sigma(W_i x^{(t)} + U_i h^{(t-1)} + b_i)   (input gate)
o_t = \sigma(W_q x^{(t)} + U_q h^{(t-1)} + b_q)   (output gate)
c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh(U x^{(t)} + W h^{(t-1)} + b)   (update memory)
h^{(t)} = o_t \otimes \tanh(c_t)   (next hidden unit)

Nevin L. Zhang (HKUST) Machine Learning 19 / 43
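A minimal NumPy sketch of one forward step of such an LSTM cell, following the summary equations above; the sizes and random initialization are arbitrary assumptions, and no training is shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3                           # arbitrary sizes (assumption)
rng = np.random.default_rng(0)

def mats():                                              # one (W_*, U_*, b_*) triple
    return (rng.normal(size=(hidden_size, input_size)),
            rng.normal(size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

(W_f, U_f, b_f), (W_i, U_i, b_i), (W_q, U_q, b_q), (U, W, b) = mats(), mats(), mats(), mats()

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM step following the summary equations on this slide."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)        # forget gate
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)        # input gate
    o_t = sigmoid(W_q @ x_t + U_q @ h_prev + b_q)        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(U @ x_t + W @ h_prev + b)   # update memory
    h_t = o_t * np.tanh(c_t)                             # next hidden unit
    return h_t, c_t

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):             # dummy input sequence of length 5
    h, c = lstm_step(x_t, h, c)
```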
Long Short-Term Memory (LSTM) RNN Gated Recurrent Unit The Gated Recurrent Unit (GRU) is another gating mechanism that allows RNNs to efficiently learn long-range dependencies. It is a simplified version of the LSTM unit: it has no separate memory cell and no output gate, and hence has fewer parameters. Its performance is also similar to that of LSTM, except that it tends to do better on small datasets. Nevin L. Zhang (HKUST) Machine Learning 20 / 43
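One concrete way to see the "fewer parameters" point is to count them in a deep learning library; a small sketch using PyTorch (the layer sizes are arbitrary assumptions): an LSTM layer has four weight blocks (three gates plus the candidate memory), a GRU layer has three, so the GRU has roughly 3/4 of the parameters.

```python
import torch.nn as nn

input_size, hidden_size = 32, 64                 # arbitrary sizes (assumption)
lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm))   # 25088 -- four gate/candidate blocks
print(count(gru))    # 18816 -- three blocks, roughly 3/4 of the LSTM's parameters
```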
Long Short-Term Memory (LSTM) RNN Gated Recurrent Unit Nevin L. Zhang (HKUST) Machine Learning 21 / 43