Neural Network Part 4: Recurrent Neural Networks Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, Pedro Domingos, and Geoffrey Hinton.
Goals for the lecture: you should understand the following concepts • sequential data • computational graph • recurrent neural networks (RNNs) and their advantages • training recurrent neural networks • bidirectional RNNs • encoder-decoder RNNs
Introduction
Recurrent neural networks • Dates back to (Rumelhart et al., 1986) • A family of neural networks for handling sequential data, which involve variable-length inputs or outputs • Especially useful for natural language processing (NLP)
Sequential data • Each data point: a sequence of vectors 𝑥^(𝑡), for 1 ≤ 𝑡 ≤ 𝜏 • Batch data: many sequences with different lengths 𝜏 • Label: can be a scalar, a vector, or even a sequence • Example • Sentiment analysis • Machine translation
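To make this data format concrete, here is a minimal Python sketch of a batch of variable-length sequences; the dimensions and labels are made up purely for illustration.

```python
import numpy as np

# A batch of sequential data points: each point is a sequence of d-dimensional
# vectors x^(1), ..., x^(tau), and different sequences have different lengths tau.
d = 4                                    # feature dimension (hypothetical)
batch = [np.random.randn(tau, d)         # one array per sequence, shape (tau, d)
         for tau in (3, 7, 5)]           # three sequences of different lengths

# The label can be a scalar per sequence (e.g., a sentiment class) ...
sentiment_labels = np.array([1, 0, 1])
# ... or itself a sequence (e.g., a translated sentence), possibly of another length.
translation_labels = [np.random.randint(0, 1000, size=n) for n in (4, 8, 5)]
```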
Example: machine translation Figure from: devblogs.nvidia.com
More complicated sequential data • Data point: two-dimensional sequences, such as images • Label: a different type of sequence, such as a text sentence • Example: image captioning
Image captioning Figure from the paper “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, by Justin Johnson, Andrej Karpathy, Li Fei-Fei
Computational graphs
A typical dynamic system: 𝑠^(𝑡+1) = 𝑓(𝑠^(𝑡); 𝜃) Figure from Deep Learning, Goodfellow, Bengio and Courville
A system driven by external data: 𝑠^(𝑡+1) = 𝑓(𝑠^(𝑡), 𝑥^(𝑡+1); 𝜃) Figure from Deep Learning, Goodfellow, Bengio and Courville
Compact view: 𝑠^(𝑡+1) = 𝑓(𝑠^(𝑡), 𝑥^(𝑡+1); 𝜃) Figure from Deep Learning, Goodfellow, Bengio and Courville
Compact view: 𝑠^(𝑡+1) = 𝑓(𝑠^(𝑡), 𝑥^(𝑡+1); 𝜃) • Key: the same 𝑓 and 𝜃 are used for all time steps • (Black square: one-step time delay) Figure from Deep Learning, Goodfellow, Bengio and Courville
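A minimal Python sketch of what "unfolding" such a graph means: the same 𝑓 and 𝜃 are applied at every time step. The particular 𝑓 below (a tanh of a linear map) and the shapes are assumptions chosen only for illustration.

```python
import numpy as np

def f(s, x, theta):
    # One step of the dynamic system s^(t+1) = f(s^(t), x^(t+1); theta).
    W, U = theta                          # hypothetical parameters
    return np.tanh(W @ s + U @ x)

def unfold(s0, xs, theta):
    # Unfolded graph: repeatedly apply the *same* f with the *same* theta.
    states, s = [], s0
    for x in xs:                          # one node per time step
        s = f(s, x, theta)
        states.append(s)
    return states

theta = (0.1 * np.random.randn(3, 3), 0.1 * np.random.randn(3, 2))
states = unfold(np.zeros(3), np.random.randn(5, 2), theta)   # 5 time steps
```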
Recurrent neural networks (RNN)
Recurrent neural networks • Use the same computational function and parameters across different time steps of the sequence • Each time step: takes the input entry and the previous hidden state to compute the output entry • Loss: typically computed at every time step
Recurrent neural networks (figure: unrolled RNN computational graph showing input, state, output, loss, and label) Figure from Deep Learning, by Goodfellow, Bengio and Courville
Recurrent neural networks • Math formula: see the figure and the reconstruction below • Figure from Deep Learning, Goodfellow, Bengio and Courville
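The formula itself is only in the figure; as a hedged reconstruction, the standard vanilla-RNN update equations from the Goodfellow et al. book are written below in this deck's notation (state 𝑠^(𝑡), output 𝑜^(𝑡); the book writes ℎ^(𝑡) for the state). The tanh state, softmax output, and cross-entropy loss are the book's choices and may differ in detail from the slide.

```latex
\begin{align*}
a^{(t)} &= b + W s^{(t-1)} + U x^{(t)} \\
s^{(t)} &= \tanh\big(a^{(t)}\big) \\
o^{(t)} &= c + V s^{(t)} \\
\hat{y}^{(t)} &= \mathrm{softmax}\big(o^{(t)}\big), \qquad
L = \sum_t L^{(t)}, \quad L^{(t)} = -\log \hat{y}^{(t)}_{y^{(t)}}
\end{align*}
```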
Advantage • Hidden state: a lossy summary of the past • Shared functions and parameters: greatly reduce model capacity, which helps generalization • Explicitly uses the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., in NLP) • Yet still powerful (actually universal): any function computable by a Turing machine can be computed by such a recurrent network of finite size (see, e.g., Siegelmann and Sontag (1995))
Training RNN • Principle: unfold the computational graph, and use backpropagation • Called the back-propagation through time (BPTT) algorithm • Can then apply any general-purpose gradient-based technique • Conceptually: first compute the gradients of the internal nodes, then compute the gradients of the parameters (see the sketch below)
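A minimal NumPy sketch of this procedure under the update equations above (tanh state, softmax output, cross-entropy loss; all function and variable names are hypothetical). The forward pass unrolls the graph; the backward loop first accumulates gradients at the internal nodes 𝑜^(𝑡) and 𝑠^(𝑡), then at the parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, params, s0):
    """Unrolled forward pass; xs is a (tau, d_in) sequence of input vectors."""
    U, W, V, b, c = params                    # the same parameters at every step
    s, ss, yhats = s0, [], []
    for x in xs:
        s = np.tanh(b + W @ s + U @ x)        # new state from input + previous state
        ss.append(s)
        yhats.append(softmax(c + V @ s))      # per-step output distribution
    return ss, yhats

def rnn_bptt(xs, ys, params, s0):
    """Back-propagation through time: gradients of the total loss sum_t L^(t)."""
    U, W, V, b, c = params
    ss, yhats = rnn_forward(xs, params, s0)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    ds_next = np.zeros_like(s0)               # gradient flowing back from step t+1
    for t in reversed(range(len(xs))):
        do = yhats[t].copy()
        do[ys[t]] -= 1.0                      # softmax + cross-entropy gradient at o^(t)
        dV += np.outer(do, ss[t]); dc += do
        ds = V.T @ do + ds_next               # gradient at the internal node s^(t)
        da = (1.0 - ss[t] ** 2) * ds          # back through tanh
        s_prev = ss[t - 1] if t > 0 else s0
        dW += np.outer(da, s_prev); dU += np.outer(da, xs[t]); db += da
        ds_next = W.T @ da                    # pass gradient back to step t-1
    return dU, dW, dV, db, dc
```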
Recurrent neural networks (recall the update equations above) Figure from Deep Learning, Goodfellow, Bengio and Courville
Recurrent neural networks Gradient at 𝐿^(𝑡): (the total loss is the sum of the per-time-step losses) Figure from Deep Learning, Goodfellow, Bengio and Courville
Recurrent neural networks Gradient at 𝑜^(𝑡): Figure from Deep Learning, Goodfellow, Bengio and Courville
Recurrent neural networks Gradient at 𝑠^(𝜏): Figure from Deep Learning, Goodfellow, Bengio and Courville
Recurrent neural networks Gradient at 𝑠^(𝑡): Figure from Deep Learning, Goodfellow, Bengio and Courville
Recurrent neural networks Gradient at parameter 𝑉: Figure from Deep Learning, Goodfellow, Bengio and Courville
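For reference, a hedged reconstruction of the main gradient formulas these slides display (following Section 10.2.2 of the Goodfellow et al. book, with the state written 𝑠^(𝑡) as above; the book assumes softmax outputs with cross-entropy loss and tanh units):

```latex
\begin{align*}
\frac{\partial L}{\partial L^{(t)}} &= 1
  \qquad \text{since } L = \textstyle\sum_t L^{(t)} \\
\big(\nabla_{o^{(t)}} L\big)_i &= \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}} \\
\nabla_{s^{(\tau)}} L &= V^{\top} \nabla_{o^{(\tau)}} L \\
\nabla_{s^{(t)}} L &= W^{\top}\, \mathrm{diag}\!\big(1 - (s^{(t+1)})^2\big)\, \nabla_{s^{(t+1)}} L
                     \;+\; V^{\top} \nabla_{o^{(t)}} L \\
\nabla_{V} L &= \sum_t \big(\nabla_{o^{(t)}} L\big)\, s^{(t)\top}, \qquad
\nabla_{W} L = \sum_t \mathrm{diag}\!\big(1 - (s^{(t)})^2\big)\, \big(\nabla_{s^{(t)}} L\big)\, s^{(t-1)\top} \\
\nabla_{U} L &= \sum_t \mathrm{diag}\!\big(1 - (s^{(t)})^2\big)\, \big(\nabla_{s^{(t)}} L\big)\, x^{(t)\top}
\end{align*}
```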
The problem of exploding/vanishing gradients • What happens to the magnitude of the gradients as we backpropagate through many layers? – If the weights are small, the gradients shrink exponentially. – If the weights are big, the gradients grow exponentially. • Typical feed-forward neural nets can cope with these exponential effects because they only have a few hidden layers. • In an RNN trained on long sequences (e.g., 100 time steps) the gradients can easily explode or vanish. – We can avoid this by initializing the weights very carefully. • Even with good initial weights, it is very hard to detect that the current target output depends on an input from many time steps ago. – So RNNs have difficulty dealing with long-range dependencies.
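A tiny, purely illustrative NumPy experiment of this effect: back-propagating through 100 steps multiplies the gradient by (roughly) 𝑊ᵀ at every step, so its norm scales like ‖𝑊‖^100 and either vanishes or explodes. The scaling factors 0.9 and 1.1 below are arbitrary choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(50)                  # a gradient vector at the final time step
for scale in (0.9, 1.1):                     # "small" vs "big" recurrent weights
    W = scale * np.eye(50)
    v = g.copy()
    for _ in range(100):                     # back-propagate through 100 time steps
        v = W.T @ v                          # the repeated factor in BPTT
    print(scale, np.linalg.norm(v))          # shrinks by ~0.9**100, grows by ~1.1**100
```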
The Popular LSTM Cell • Gates: 𝑓_𝑡 = 𝜎(𝑊_𝑓 [𝑥_𝑡; ℎ_{𝑡−1}] + 𝑏_𝑓), and similarly for the input gate 𝑖_𝑡 and the output gate 𝑜_𝑡 • Cell state: 𝑐_𝑡 = 𝑓_𝑡 ⊗ 𝑐_{𝑡−1} + 𝑖_𝑡 ⊗ tanh(𝑊 [𝑥_𝑡; ℎ_{𝑡−1}]) • Output: ℎ_𝑡 = 𝑜_𝑡 ⊗ tanh(𝑐_𝑡) • (Figure: LSTM cell with input, output, and forget gates; the dashed line indicates a one-step time lag)
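A minimal NumPy sketch of one LSTM step following the gate equations above. All parameter names are hypothetical, and a bias is added to the candidate term, which the slide's formula omits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates control what is forgotten, written, and emitted."""
    W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c = params
    z = np.concatenate([x_t, h_prev])                    # the cell reads [x_t; h_{t-1}]
    f_t = sigmoid(W_f @ z + b_f)                         # forget gate
    i_t = sigmoid(W_i @ z + b_i)                         # input gate
    o_t = sigmoid(W_o @ z + b_o)                         # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)    # new cell state
    h_t = o_t * np.tanh(c_t)                             # new hidden state
    return h_t, c_t
```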
Some Other Variants of RNN
RNN • Use the same computational function and parameters across different time steps of the sequence • Each time step: takes the input entry and the previous hidden state to compute the output entry • Loss: typically computed at every time step • Many variants • Information about the past can be passed in other forms (e.g., via the previous output) • Output only at the end of the sequence
Example: use the output at the previous step Figure from Deep Learning, Goodfellow, Bengio and Courville
Example: output only at the end Figure from Deep Learning, Goodfellow, Bengio and Courville
Bidirectional RNNs • Many applications: the output at time 𝑡 may depend on the whole input sequence • Example in speech recognition: the correct interpretation of the current sound may depend on the next few phonemes, potentially even the next few words • Bidirectional RNNs are introduced to address this (a sketch follows the figure below)
BiRNNs Figure from Deep Learning, Goodfellow, Bengio and Courville
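A minimal sketch of the idea, reusing the rnn_forward function from the BPTT sketch above (all parameter names hypothetical): one RNN reads the sequence left to right, another right to left, and their states are concatenated so that the representation at each step can see the whole input.

```python
import numpy as np

def birnn(xs, fwd_params, bwd_params, s0):
    ss_f, _ = rnn_forward(xs, fwd_params, s0)          # left-to-right states
    ss_b, _ = rnn_forward(xs[::-1], bwd_params, s0)    # right-to-left states
    ss_b = ss_b[::-1]                                  # re-align with forward time
    # Per-step representation combines both directions; an output layer reads it.
    return [np.concatenate([sf, sb]) for sf, sb in zip(ss_f, ss_b)]
```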
Encoder-decoder RNNs • RNNs so far: can map a sequence to one vector, or to a sequence of the same length • What about mapping a sequence to a sequence of a different length? • Examples: speech recognition, machine translation, question answering, etc. (a sketch follows the figure below)
Figure from Deep Learning, Goodfellow, Bengio and Courville
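A minimal sketch of the encoder-decoder idea, again reusing rnn_forward and softmax from the BPTT sketch above (all names hypothetical): the encoder compresses the input sequence into a fixed-size context vector, and the decoder RNN then emits an output sequence whose length is independent of the input length. Real systems additionally feed back the previously generated token at each decoding step.

```python
import numpy as np

def encoder_decoder(xs, enc_params, dec_params, s0, out_len):
    ss, _ = rnn_forward(xs, enc_params, s0)   # encoder: read the whole input
    context = ss[-1]                          # fixed-size summary ("context") vector
    U, W, V, b, c = dec_params                # decoder: a second RNN driven by the context
    s, outputs = np.zeros_like(b), []
    for _ in range(out_len):                  # output length need not equal len(xs)
        s = np.tanh(b + W @ s + U @ context)
        outputs.append(softmax(c + V @ s))
    return outputs
```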