CSCE 496/896 Lecture 6: Recurrent Architectures - Stephen Scott - PowerPoint PPT Presentation



1. CSCE 496/896 Lecture 6: Recurrent Architectures. Stephen Scott (adapted from Vinod Variyam and Ian Goodfellow). sscott@cse.unl.edu

2. Introduction: All our architectures so far work on fixed-sized inputs. Recurrent neural networks work on sequences of inputs, e.g., text, biological sequences, video, and audio. One can also try 1D convolutions, but these lose long-term relationships in the input. RNNs are especially useful for NLP applications (translation, speech-to-text, sentiment analysis) and can also create novel output, e.g., Shakespearean text or music.

3. Outline: Basic RNNs; Input/Output Mappings; Example Implementations; Training; Long Short-Term Memory (LSTM); Gated Recurrent Unit (GRU).

4. Basic Recurrent Cell: A recurrent cell (or recurrent neuron) has connections pointing backward as well as forward. At time step (frame) t, the neuron receives the input vector x_(t) as usual, but also receives its own output y_(t−1) from the previous step.

5. Basic Recurrent Layer: Can build a layer of recurrent cells, where each node gets both the vector x_(t) and the vector y_(t−1).

6. Basic Recurrent Layer: Each node in the recurrent layer has independent weights for both x_(t) and y_(t−1). For a single recurrent node, denote these by w_x and w_y; for the entire layer, combine them into matrices W_x and W_y. For activation function φ and bias vector b, the output vector is y_(t) = φ(W_x^⊤ x_(t) + W_y^⊤ y_(t−1) + b).
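
As a rough illustration of this formula (a NumPy sketch with invented sizes, not the TensorFlow code used later in the lecture), one step of such a layer could be computed as follows:

    import numpy as np

    # One recurrent-layer step: y_(t) = tanh(Wx^T x_(t) + Wy^T y_(t-1) + b).
    # Sizes below are hypothetical; rows are instances, so the transposes in
    # the slide's formula become right-multiplications here.
    n_inputs, n_neurons, batch_size = 3, 5, 4
    rng = np.random.default_rng(0)
    Wx = rng.normal(size=(n_inputs, n_neurons))     # weights applied to x_(t)
    Wy = rng.normal(size=(n_neurons, n_neurons))    # weights applied to y_(t-1)
    b  = np.zeros(n_neurons)                        # bias vector

    x_t    = rng.normal(size=(batch_size, n_inputs))  # inputs at step t
    y_prev = np.zeros((batch_size, n_neurons))        # outputs from step t-1
    y_t    = np.tanh(x_t @ Wx + y_prev @ Wy + b)      # shape (batch_size, n_neurons)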

7. Memory and State: Since a node's output depends on its past, it can be thought of as having memory or state. The state at time t is h_(t) = f(h_(t−1), x_(t)) and the output is y_(t) = g(h_(t−1), x_(t)). The state could be the same as the output, or separate. Think of h_(t) as storing important information about the input sequence, analogous to convolutional outputs summarizing important image features.
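
To make the recurrence concrete, here is a small NumPy sketch (single instance, invented sizes) in which the state is carried across a whole sequence and doubles as the output, as in the basic cell above:

    import numpy as np

    # Carry the state h_(t) across an input sequence; for this basic cell the
    # state and the output coincide (y_(t) = h_(t)).
    def run_sequence(xs, Wx, Wy, b):
        h = np.zeros(b.shape[0])                    # h_(0): initial state
        ys = []
        for x_t in xs:                              # one iteration per frame t
            h = np.tanh(Wx.T @ x_t + Wy.T @ h + b)  # h_(t) = f(h_(t-1), x_(t))
            ys.append(h)                            # y_(t) = h_(t) here
        return np.stack(ys)                         # one output row per time step

    rng = np.random.default_rng(0)
    T, n_inputs, n_neurons = 6, 3, 5                # hypothetical sizes
    ys = run_sequence(rng.normal(size=(T, n_inputs)),
                      Wx=rng.normal(size=(n_inputs, n_neurons)),
                      Wy=rng.normal(size=(n_neurons, n_neurons)),
                      b=np.zeros(n_neurons))        # ys.shape == (T, n_neurons)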

8. Input/Output Mappings (Sequence to Sequence): There are many ways to employ this basic architecture. Sequence to sequence: the input is a sequence and the output is a sequence, e.g., a series of stock predictions, one day in advance.

9. Input/Output Mappings (Sequence to Vector): The input is a sequence and the output is a vector/score/classification, e.g., the sentiment score of a movie review.

10. Input/Output Mappings (Vector to Sequence): The input is a single vector (zeroes for the other time steps) and the output is a sequence, e.g., image to caption.

11. Input/Output Mappings (Encoder-Decoder Architecture): Encoder-decoder: a sequence-to-vector network (the encoder) followed by a vector-to-sequence network (the decoder). The input sequence (x_1, ..., x_T) yields hidden outputs (h_1, ..., h_T), which are then mapped to a context vector c = f(h_1, ..., h_T). Decoder output y_t' depends on the previously output (y_1, ..., y_(t'−1)) and on c. Example application: neural machine translation.
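
A toy NumPy sketch of this flow (random untrained weights, invented sizes; the context c is taken here to be the encoder's final hidden state, one common choice of f, and the decoder receives c as its initial state):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, T_in, T_out = 3, 8, 4, 5, 3   # hypothetical sizes

    W_ex, W_eh = rng.normal(size=(n_in, n_hid)), rng.normal(size=(n_hid, n_hid))
    W_dy, W_dh = rng.normal(size=(n_out, n_hid)), rng.normal(size=(n_hid, n_hid))
    W_out = rng.normal(size=(n_hid, n_out))

    # Encoder: consume the input sequence (x_1, ..., x_T), keep hidden outputs.
    h, hs = np.zeros(n_hid), []
    for x_t in rng.normal(size=(T_in, n_in)):
        h = np.tanh(x_t @ W_ex + h @ W_eh)
        hs.append(h)
    c = hs[-1]                                  # context vector

    # Decoder: each output y_t' depends on the previous output and on c.
    s, y_prev, ys = c.copy(), np.zeros(n_out), []
    for _ in range(T_out):
        s = np.tanh(y_prev @ W_dy + s @ W_dh)   # decoder state update
        y_prev = s @ W_out                      # next output (logits in practice)
        ys.append(y_prev)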

12. Input/Output Mappings (Encoder-Decoder Architecture: NMT Example): Pre-trained word embeddings are fed into the input. The encoder maps the word sequence to a vector; the decoder maps that vector to a translation via a softmax distribution. After training, translation is done by feeding the previously translated word y'_(t−1) to the decoder.

13. Input/Output Mappings (Encoder-Decoder Architecture): Works through an embedded space like an autoencoder, so it can represent the entire input as an embedded vector prior to decoding. Issue: the context vector fed into the decoder must be of sufficiently large dimension to represent the required context. Can address this representation problem via an attention mechanism: encode the input sequence into a sequence of vectors rather than a single vector, and as it decodes the translation, the decoder focuses on the relevant subset of those vectors.

14. Input/Output Mappings (E-D Architecture: Attention Mechanism, Bahdanau et al., 2015): A bidirectional RNN reads the input forward and backward simultaneously. The encoder builds annotation h_j as the concatenation of the forward and backward hidden states at position j, so h_j summarizes the preceding and following inputs. The i-th context vector is c_i = Σ_{j=1}^T α_ij h_j, where α_ij = exp(e_ij) / Σ_{k=1}^T exp(e_ik) and e_ij is an alignment score between the inputs around j and the outputs around i.
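
A small NumPy sketch of this weighting for one decoder position i (invented sizes): softmax the alignment scores into α_ij and take the expected annotation:

    import numpy as np

    def attention_context(e_i, H):
        # e_i: (T,) alignment scores e_i1..e_iT; H: (T, d) annotations h_1..h_T
        e_i = e_i - e_i.max()                      # numerical stabilization
        alpha_i = np.exp(e_i) / np.exp(e_i).sum()  # alpha_ij, sums to 1 over j
        return alpha_i @ H, alpha_i                # c_i of shape (d,), weights (T,)

    rng = np.random.default_rng(0)
    H = rng.normal(size=(7, 16))                   # T = 7 annotations, d = 16
    c_i, alpha_i = attention_context(rng.normal(size=7), H)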

15. Input/Output Mappings (E-D Architecture: Attention Mechanism, Bahdanau et al., 2015): The weight α_ij tells us the probability that target output y_i is aligned to (or translated from) input x_j; c_i is then the expected annotation over all annotations with probabilities α_ij. The alignment score e_ij indicates how much we should focus on word encoding h_j when generating output y_i (in decoder state s_(i−1)). We can compute e_ij via a dot product h_j^⊤ s_(i−1), a bilinear function h_j^⊤ W s_(i−1), or a nonlinear activation.
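
The two simplest score choices named above could be sketched like this (NumPy, invented sizes; the resulting scores would feed the softmax in the previous sketch):

    import numpy as np

    def dot_scores(H, s_prev):                 # H: (T, d), s_prev: (d,)
        return H @ s_prev                      # e_ij = h_j^T s_(i-1), shape (T,)

    def bilinear_scores(H, s_prev, W):         # W: (d, d_s), s_prev: (d_s,)
        return H @ (W @ s_prev)                # e_ij = h_j^T W s_(i-1), shape (T,)

    rng = np.random.default_rng(0)
    H, s_prev = rng.normal(size=(7, 16)), rng.normal(size=16)
    e_dot = dot_scores(H, s_prev)
    e_bil = bilinear_scores(H, s_prev, rng.normal(size=(16, 16)))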

16. Example Implementation (Static Unrolling for Two Time Steps):

    X0 = tf.placeholder(tf.float32, [None, n_inputs])
    X1 = tf.placeholder(tf.float32, [None, n_inputs])
    Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons], dtype=tf.float32))
    Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))
    b  = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))
    Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
    Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

Input:

    # Mini-batch: instance 0, instance 1, instance 2, instance 3
    X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # t = 0
    X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # t = 1
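
As a usage sketch (assuming TensorFlow 1.x, imports of tensorflow as tf and numpy as np, and n_inputs = 3 with some chosen n_neurons), the two outputs could be evaluated as:

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        init.run()
        Y0_val, Y1_val = sess.run([Y0, Y1],
                                  feed_dict={X0: X0_batch, X1: X1_batch})
    # Y0_val and Y1_val each have shape (4, n_neurons): one row per instance.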

17. Example Implementation (Static Unrolling for Two Time Steps): Can achieve the same thing more compactly via static_rnn():

    X0 = tf.placeholder(tf.float32, [None, n_inputs])
    X1 = tf.placeholder(tf.float32, [None, n_inputs])
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1],
                                                    dtype=tf.float32)
    Y0, Y1 = output_seqs

Automatically unrolls into a length-2 sequence RNN.

18. Example Implementation (Automatic Static Unrolling): Can avoid specifying one placeholder per time step via tf.stack and tf.unstack:

    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs,
                                                    dtype=tf.float32)
    outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
    ...
    X_batch = np.array([
        # t = 0      t = 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])

Uses static_rnn() again, but with all time steps folded into a single tensor. Still forms a large, static graph (possible memory issues).
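
A corresponding usage sketch (same TensorFlow 1.x assumptions as before): the whole mini-batch is fed through the single placeholder X, and the transposed result carries one output per instance per time step:

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        init.run()
        outputs_val = sess.run(outputs, feed_dict={X: X_batch})
    # outputs_val.shape == (4, n_steps, n_neurons), i.e. (batch, time, units)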
