Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al. Presented by Berkan Demirel
Overview of LRCN LRCN is a class of models that is both spatially and temporally deep. It has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs. Image credit: main paper
Convolutional Neural Networks Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
Limitation 1 Fixed-size, static input - 224x224x3
Limitation 2 Output is a single choice from list of options
Background: Sequence Learning Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
Contributions Mapping variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text). LRCN is directly connected to modern visual convnet models. It is suitable for large-scale visual learning and is end-to-end trainable.
Sequential inputs / outputs Image credit: main paper
LRCN Model The LRCN model works by passing each visual input (an image in isolation, or a frame from a video) through a feature transformation parametrized by V to produce a fixed-length vector representation. In its most general form, a sequence model parametrized by W maps an input x_t and the previous timestep's hidden state h_{t-1} to an output z_t and an updated hidden state h_t. The final step in predicting a distribution P(y_t) at timestep t is to take a softmax over the outputs z_t of the sequence model.
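A minimal sketch of a single LRCN timestep in PyTorch (not the authors' Caffe implementation); the module names and layer sizes below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LRCNStep(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=256, num_classes=101):
        super().__init__()
        # phi_V: visual feature transformation (a stand-in for the CNN tower)
        self.visual_encoder = nn.Linear(224 * 224 * 3, feat_dim)
        # sequence model parametrized by W
        self.lstm_cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, x_t, state):
        """Map input x_t and previous state (h_{t-1}, c_{t-1}) to P(y_t) and the new state."""
        v_t = self.visual_encoder(x_t.flatten(1))   # fixed-length visual vector
        h_t, c_t = self.lstm_cell(v_t, state)       # recurrent update
        z_t = self.output(h_t)                      # per-timestep output z_t
        p_t = torch.softmax(z_t, dim=-1)            # P(y_t) via softmax over z_t
        return p_t, (h_t, c_t)
```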
LRCN Model - Activity Recognition Sequential input, fixed output: ⟨x_1, x_2, x_3, ..., x_T⟩ → y With sequential inputs and scalar outputs, we take a late fusion approach to merging the per-timestep predictions into a single prediction for the full sequence. Image credit: main paper
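A hedged sketch of the late-fusion step, assuming the LRCNStep module sketched above: per-timestep class distributions are averaged and the argmax gives the clip label.

```python
import torch

def classify_sequence(step, frames, hidden_dim=256):
    """frames: (B, T, 3, 224, 224) -> one class index per clip."""
    B = frames.shape[0]
    state = (torch.zeros(B, hidden_dim), torch.zeros(B, hidden_dim))
    probs = []
    for t in range(frames.shape[1]):
        p_t, state = step(frames[:, t], state)   # per-timestep prediction
        probs.append(p_t)
    avg = torch.stack(probs, dim=1).mean(dim=1)  # late fusion: average over T
    return avg.argmax(dim=-1)                    # single label for the sequence
```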
LRCN Model – Image Description Fixed input, sequential outputs: x → ⟨y_1, y_2, y_3, ..., y_T⟩ With fixed-size inputs and sequential outputs, we simply duplicate the input x at all T timesteps. Image credit: main paper
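A sketch of one captioning timestep under this scheme: the static image feature is simply re-used (duplicated) at every step alongside the previous word. All names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CaptionStep(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, feat_dim=4096, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, image_feat, state):
        # The same image feature is fed again at this timestep (duplicated input x)
        inp = torch.cat([self.embed(prev_word), image_feat], dim=-1)
        h, c = self.lstm_cell(inp, state)
        return self.output(h), (h, c)   # logits over the next word y_t
```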
LRCN Model – Video Description Sequential input, sequential outputs: ⟨x_1, x_2, x_3, ..., x_T⟩ → ⟨y_1, y_2, y_3, ..., y_T'⟩ For a sequence-to-sequence problem with (in general) different input and output lengths, an "encoder-decoder" approach is taken. One sequence model, the encoder, maps the input sequence to a fixed-length vector; another sequence model, the decoder, unrolls this vector into sequential outputs of arbitrary length. Image credit: main paper
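A minimal encoder-decoder sketch, assuming per-frame CNN features are already extracted; layer sizes and module names are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=1000, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, prev_words):
        # frame_feats: (B, T, feat_dim); prev_words: (B, T')
        _, state = self.encoder(frame_feats)               # fixed-length summary (h, c)
        dec_out, _ = self.decoder(self.embed(prev_words), state)
        return self.output(dec_out)                        # (B, T', vocab) word logits
```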
LRCN Model Under the proposed system, the weights (V, W) of the model's visual and sequential components can be learned jointly by maximizing the likelihood of the ground-truth outputs.
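Maximizing the likelihood of the ground-truth outputs is the same as minimizing the summed per-timestep negative log-likelihood; a sketch of that loss, assuming padded targets:

```python
import torch.nn as nn

def sequence_nll(logits, targets, pad_idx=0):
    """logits: (B, T, C) per-timestep scores; targets: (B, T) ground-truth indices."""
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,   # skip padded timesteps
    )                           # gradients flow to both V (CNN) and W (LSTM) jointly
```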
Activity Recognition • T individual frames are input to T convolutional networks, which are then connected to a single-layer LSTM with 256 hidden units. • The CNN base of the LRCN is a hybrid of the Caffe reference model (a minor variant of AlexNet) and the network used by Zeiler & Fergus, pre-trained on the 1.2M-image ILSVRC-2012 classification training subset of ImageNet.
Activity Recognition • Two variants of the LRCN architecture are used: one in which the LSTM is placed after the first fully connected layer of the CNN (LRCN-fc6) and another in which the LSTM is placed after the second fully connected layer of the CNN (LRCN-fc7). • Networks are trained on video clips of 16 frames. The LRCN predicts the video class at each timestep, and these predictions are averaged for the final classification. • Both RGB and optical flow inputs are considered (a sketch of the fc6 variant follows).
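A sketch of the LRCN-fc6 variant, using torchvision's AlexNet as a stand-in for the Caffe reference model (in practice the CNN would be ImageNet pre-trained); the fc6 slicing and other details are assumptions:

```python
import torch.nn as nn
import torchvision.models as models

class LRCNfc6(nn.Module):
    def __init__(self, num_classes=101, hidden_dim=256):
        super().__init__()
        cnn = models.alexnet()                     # assume pre-trained weights in practice
        self.features = cnn.features
        self.avgpool = cnn.avgpool
        # keep layers up to fc6 + ReLU (4096-d activations)
        self.fc6 = nn.Sequential(*list(cnn.classifier.children())[:3])
        self.lstm = nn.LSTM(4096, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                       # clip: (B, 16, 3, 224, 224)
        B, T = clip.shape[:2]
        x = self.features(clip.flatten(0, 1))
        x = self.avgpool(x).flatten(1)
        x = self.fc6(x).view(B, T, -1)             # per-frame fc6 features
        out, _ = self.lstm(x)
        logits = self.classifier(out)              # a prediction at every timestep
        return logits.softmax(-1).mean(dim=1)      # average over the 16 frames
```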
Activity recognition [Figure: per-frame CNNs feed an LSTM; the per-timestep predictions ("jumping", "jumping", "running", "sitting") are averaged into a single label, "jumping"]
Evaluation The architecture is evaluated on the UCF-101 dataset, which consists of over 12,000 videos categorized into 101 human action classes.
Image Description • In contrast to activity recognition, the static image description task only requires a single convolutional network. • At each timestep, both the image features and the previous word are provided as inputs to the sequential model, in this case a stack of LSTMs (each with 1000 hidden units). • This stack is used to learn the dynamics of the time-varying output sequence, natural language (a sketch of the factored two-layer variant follows).
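A sketch of the factored two-layer decoder, assuming the bottom LSTM sees only the previous word while the image feature enters at the second layer (the 1000-unit size follows the slide; everything else is illustrative):

```python
import torch
import torch.nn as nn

class FactoredCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, feat_dim=4096, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # language only
        self.fuse_lstm = nn.LSTM(hidden_dim + feat_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_words, image_feat):
        # prev_words: (B, T); image_feat: (B, feat_dim)
        lang, _ = self.lang_lstm(self.embed(prev_words))
        feat = image_feat.unsqueeze(1).expand(-1, lang.size(1), -1)  # copy at each step
        fused, _ = self.fuse_lstm(torch.cat([lang, feat], dim=-1))
        return self.output(fused)        # next-word logits at every timestep
```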
Image Description [Figure: single-layer decoder: the CNN feature and the previous word feed one LSTM, which emits "<BOS> a dog is jumping <EOS>"]
Image Description [Figure: two-layer decoder: the same caption, "a dog is jumping", generated with a stack of two LSTM layers]
Image Description – Two-layer factored variant [Figure: the first LSTM layer models the word sequence and the image feature enters at the second LSTM layer]
Evaluation – Image Retrieval The model is trained on the combined training sets of Flickr30k (28,000 training images) and COCO2014 (80,000 training images). Results are reported on Flickr30k (1,000 images each for test and validation).
Evaluation – Image Retrieval Image retrieval results for variants of the LRCN architectures.
Evaluation – Sentence Generation The BLEU (bilingual evaluation understudy) metric is used. Additionally, the authors report results on the new COCO2014 dataset, which has 80,000 training images and 40,000 validation images. The authors isolate 5,000 images from the validation set for testing purposes, and results are reported on this held-out split.
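A hedged example of computing BLEU with NLTK (the paper's exact evaluation script may differ); the sentences below are made up for illustration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "is", "jumping", "over", "a", "fence"]]   # ground-truth caption(s)
hypothesis = ["a", "dog", "jumps", "over", "the", "fence"]           # generated caption

# BLEU-4 with uniform n-gram weights; smoothing avoids zero counts on short captions
score = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```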
Evaluation – Human Evaluation Rankings Human evaluator rankings from 1 to 6 (lower is better), averaged for each method and criterion.
Image Description Results
Video Description • Due to the limitations of available video description datasets, the authors take a different path. • They rely on more "traditional" activity and video recognition processing for the input and use LSTMs for generating a sentence. • They assume that predictions of the objects, subjects, and verbs present in the video are available from a CRF run over the full video input (see the sketch below).
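A sketch of how the CRF's (subject, verb, object) predictions could be packed into a fixed-length decoder input, in either the "max" (one-hot of the best tuple) or "prob" (full marginals) form shown on the following slides; helper names and label vocabularies are assumptions:

```python
import torch
import torch.nn.functional as F

def crf_input(marginals, mode="prob"):
    """marginals: one tensor per slot (subject, verb, object), each a distribution
    of shape (num_labels_for_slot,). Returns a fixed-length vector for the LSTM decoder."""
    if mode == "max":
        # one-hot encoding of the CRF's most likely label per slot
        slots = [F.one_hot(m.argmax(), num_classes=m.numel()).float() for m in marginals]
    else:
        # keep the full CRF uncertainty per slot
        slots = list(marginals)
    return torch.cat(slots)
```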
Video Description [Figure: pre-trained detector (CRF) predictions feed an LSTM decoder, which emits "<BOS> a dog is jumping <EOS>"]
LSTM Encoder & Decoder Figure credit: supplementary material
LSTM Decoder with CRF Max Figure credit: supplementary material
LSTM Decoder with CRF Prob. Figure credit: supplementary material
Evaluation – Video Description Evaluation is performed on the TACoS multilevel dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation).
Video Description Figure credit: supplementary material
Conclusion LRCN is a flexible framework for vision problems involving sequences. It is able to handle: ✔ Sequences in the input (video) ✔ Sequences in the output (natural language description)
Future Directions Image credit: Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. "Segmentation from Natural Language Expressions." arXiv preprint arXiv:1603.06180 (2016).
Thank You!