Natural Language Video Description using Deep Recurrent Neural Networks - PowerPoint PPT Presentation


  1. Natural Language Video Description using Deep Recurrent Neural Networks. Thesis Proposal, 23 Nov. 2015. Subhashini Venugopalan, University of Texas at Austin.

  2. Problem Statement: Generate descriptions for events depicted in video clips, e.g., "A monkey pulls a dog’s tail and is chased by the dog."

  3. Applications: image and video retrieval by content; video description service (e.g., "Children are wearing green shirts. They are dancing as they sing the carol."); human-robot interaction; video surveillance.

  4. Outline

  5. Related Work

  6. Related Work - 1: Language & Vision. Language: increasingly focused on grounding meaning in perception. Vision: exploit linguistic ontologies to “tell a story” from images [Farhadi et al. ECCV’10; Kulkarni et al. CVPR’11]. Many early works on image description (Farhadi et al. ECCV’10, Kulkarni et al. CVPR’11, Mitchell et al. EACL’12, Kuznetsova et al. ACL’12 & ACL’13) identify objects and attributes and combine them with linguistic knowledge to “tell a story”, e.g., "There are one cow and one sky." (animal, stand, ground) or "The golden cow is by the blue sky." Interest has increased dramatically in the past year (8 papers in CVPR’15), e.g., [Donahue et al. CVPR’15]: "A group of young men playing a game of soccer." There is relatively little work on video description, which needs videos to capture the semantics of a wider range of actions.

  7. Related Work - 2: Video Description. Typical pipeline: extract object and action descriptors; learn object, action, and scene classifiers; use language to bias the visual interpretation; estimate the most likely agents and actions [Krishnamurthy et al. AAAI’13]; fill a template to generate the sentence. Others: Guadarrama et al. ICCV’13, Thomason et al. COLING’14. Limitations: narrow domains; small grammars [Yu and Siskind, ACL’13]; template-based sentences; several separate features and classifiers [Rohrbach et al. ICCV’13]. And which objects/actions/scenes should we build classifiers for?

  8. Can we learn directly from video-sentence pairs, without having to explicitly learn object/action/scene classifiers for our dataset? [Venugopalan et al. NAACL’15]

  9. Recurrent Neural Networks (RNNs) can map a vector to a sequence.
     ● [Sutskever et al. NIPS’14]: RNN encoder -> RNN decoder (English sentence -> French sentence).
     ● [Donahue et al. CVPR’15], [Vinyals et al. CVPR’15]: encode image -> RNN decoder -> sentence.
     ● [Venugopalan et al. NAACL’15] (this work): encode video -> RNN decoder -> sentence.
     Key insight: generate a feature representation of the video and “decode” it to a sentence.

  10. In this section: ● Background - Recurrent Neural Networks. ● Two deep methods for video description: the first learns from image description (it ignores the temporal frame sequence in videos); the second is temporally sensitive to the input.

  11. [Background] Recurrent Neural Networks. Successful in translation and speech. RNNs can map an input sequence to an output sequence: at each time step t the unit takes the input x_t and the previous hidden state h_{t-1}, produces a new hidden state h_t, and outputs Pr(y_t | input, y_0 ... y_{t-1}). Insight: each time step uses a layer with the same weights. Problems: (1) hard to capture long-term dependencies; (2) vanishing gradients (they shrink as they pass through many layers). Solution: the Long Short-Term Memory (LSTM) unit.
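
  A minimal NumPy sketch of one such time step; the weight names (W_xh, W_hh, W_hy) and the softmax output layer are illustrative assumptions, not notation from the thesis:

      import numpy as np

      def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
          """One vanilla RNN time step; the same weights are reused at every step."""
          h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # new hidden state
          scores = W_hy @ h_t + b_y                         # unnormalized output scores
          e = np.exp(scores - scores.max())
          y_t = e / e.sum()                                 # softmax: Pr(y_t | input, y_0..y_{t-1})
          return h_t, y_t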

  12. [Background] LSTM [Hochreiter and Schmidhuber ’97; Graves ’13]. The LSTM unit combines the current input x_t and the previous hidden state h_{t-1} through an input gate, a forget gate, an output gate, and an input modulation gate, which together update a memory cell and produce the new hidden state h_t (= z_t).
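
  Written out, the gates on this slide correspond to the standard LSTM update (the textbook formulation following Graves ’13, not equations copied from the slide):

      i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    % input gate
      f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    % forget gate
      o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    % output gate
      g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)     % input modulation gate
      c_t = f_t \odot c_{t-1} + i_t \odot g_t            % memory cell
      h_t = o_t \odot \tanh(c_t)                         % hidden state (z_t on the slide)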

  13. [Background] LSTM sequence decoders. All functions are differentiable, so the full gradient is computed by backpropagating through time, and the weights are updated with stochastic gradient descent. At each time step t = 0, 1, 2, ... the LSTM consumes an input and emits an output. Matches state of the art on: speech recognition [Graves & Jaitly ICML’14], machine translation (Eng-Fr) [Sutskever et al. NIPS’14], and image description [Donahue et al. CVPR’15; Vinyals et al. CVPR’15].
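
  A minimal PyTorch sketch of one such training step, where backpropagation through time falls out of automatic differentiation; all sizes and hyperparameters are illustrative, not the thesis's:

      import torch
      import torch.nn as nn

      vocab_size, input_dim, hidden = 1000, 500, 500     # illustrative sizes
      decoder = nn.LSTM(input_size=input_dim, hidden_size=hidden, batch_first=True)
      to_vocab = nn.Linear(hidden, vocab_size)
      criterion = nn.CrossEntropyLoss()
      params = list(decoder.parameters()) + list(to_vocab.parameters())
      optimizer = torch.optim.SGD(params, lr=0.01)

      inputs = torch.randn(8, 10, input_dim)             # batch of 8 sequences, 10 time steps
      targets = torch.randint(0, vocab_size, (8, 10))    # target word id at each step

      hidden_states, _ = decoder(inputs)                 # unroll the LSTM over time
      logits = to_vocab(hidden_states)                   # word scores at every step
      loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
      loss.backward()                                    # full gradient via backprop through time
      optimizer.step()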

  14. LSTM sequence decoders. Two LSTM layers: the second layer adds depth in temporal processing. A softmax over the vocabulary predicts the output word at each time step.
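
  A sketch of this stacked decoder in PyTorch (sizes are illustrative); at each step the top layer's hidden state is mapped to a softmax over the vocabulary:

      import torch
      import torch.nn as nn

      vocab_size, emb_dim, hidden = 1000, 500, 1000
      lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden,
                     num_layers=2, batch_first=True)       # 2nd layer of temporal depth
      to_vocab = nn.Linear(hidden, vocab_size)

      step_input, state = torch.randn(1, 1, emb_dim), None
      out, state = lstm(step_input, state)                 # both layers advance one time step
      probs = torch.softmax(to_vocab(out[:, -1]), dim=-1)  # softmax over the vocabulary
      next_word = probs.argmax(dim=-1)                     # greedy prediction for this step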

  15. Translating Videos to Natural Language [Venugopalan et al. NAACL’15].

  16. Test time, Step 1: Input. Sample frames from the video at a rate of 1 in 10 and scale each frame to 227x227 for the CNN.
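
  A sketch of this sampling step using OpenCV (the function name and defaults are illustrative assumptions):

      import cv2

      def sample_frames(video_path, every_nth=10, size=(227, 227)):
          """Keep every 10th frame and rescale it to the CNN's 227x227 input."""
          cap = cv2.VideoCapture(video_path)
          frames, idx = [], 0
          while True:
              ok, frame = cap.read()
              if not ok:
                  break
              if idx % every_nth == 0:
                  frames.append(cv2.resize(frame, size))
              idx += 1
          cap.release()
          return frames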

  17. [Background] Convolutional Neural Networks (CNNs). Successful in semantic visual recognition tasks. Each layer applies linear filters followed by a non-linear function; stacking layers learns a hierarchy of features of increasing semantic richness. (Image credit: Maurice Peeman.) Krizhevsky, Sutskever & Hinton 2012: the ImageNet classification breakthrough.

  18. Test time, Step 2: Feature extraction. Forward propagate each frame through the CNN and take the "fc7" activations (the layer before classification) as a 4096-dimensional CNN "feature vector".
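
  A sketch of fc7 extraction using torchvision's AlexNet as a stand-in for the Caffe model used in the thesis; the preprocessing constants are the usual ImageNet values, an assumption here:

      import torch
      import torchvision.models as models
      import torchvision.transforms as T

      alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
      # Drop the final classification layer (fc8) so the network outputs fc7 activations.
      alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

      preprocess = T.Compose([T.ToTensor(),
                              T.Normalize(mean=[0.485, 0.456, 0.406],
                                          std=[0.229, 0.224, 0.225])])

      def fc7_features(frame_rgb):
          """Forward propagate one 227x227 RGB frame; return its 4096-d fc7 vector."""
          x = preprocess(frame_rgb).unsqueeze(0)
          with torch.no_grad():
              return alexnet(x).squeeze(0)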

  19. Test time, Step 3: Mean pooling. Average the per-frame CNN features across all frames of the video. (arXiv: http://arxiv.org/abs/1505.00487)
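
  Continuing the sketches above (sample_frames and fc7_features are the illustrative helpers from the previous slides), mean pooling is a single average:

      import torch

      frames = sample_frames("video.avi")                           # illustrative path
      # (cv2 returns BGR; conversion to RGB is omitted here for brevity)
      frame_feats = torch.stack([fc7_features(f) for f in frames])  # num_frames x 4096
      video_feat = frame_feats.mean(dim=0)                          # one 4096-d video descriptor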

  20. Test time, Step 4: Generation. Input video -> convolutional net -> recurrent net -> output sentence.
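
  A simplified greedy-decoding sketch of this step. It glosses over details such as the special token ids and the exact way the model conditions on the video feature; in_proj, lstm, embed, and to_vocab are assumed to be trained modules like those sketched earlier:

      import torch

      EOS = 1                                               # illustrative end-of-sentence token id

      def generate(video_feat, in_proj, lstm, embed, to_vocab, max_len=20):
          """Condition the LSTM on the mean-pooled video feature, then greedily
          emit one word per step until the end-of-sentence token."""
          words, state = [], None
          step_input = in_proj(video_feat).view(1, 1, -1)   # video feature goes in first
          for _ in range(max_len):
              out, state = lstm(step_input, state)
              word = to_vocab(out[:, -1]).argmax(dim=-1)
              if word.item() == EOS:
                  break
              words.append(word.item())
              step_input = embed(word).view(1, 1, -1)       # last word becomes the next input
          return words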

  21. Training. Annotated video data is scarce. Key insight: use supervised pre-training on data-rich auxiliary tasks and transfer the learned weights.

  22. Step 1: CNN pre-training. ● Based on AlexNet [Krizhevsky et al. NIPS’12]. ● Trained on 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]. ● Provides the fc7 4096-dimensional CNN "feature vector" and initializes the weights of our network.

  23. Step 2: Image-caption training.

  24. Step 3: Fine-tuning. 1. Video dataset. 2. Mean-pooled feature. 3. Lower learning rate.
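
  A sketch of the fine-tuning setup, reusing the decoder modules from the earlier sketches; the specific learning-rate values are illustrative, not taken from the thesis:

      import torch

      # Continue training the pre-trained decoder on mean-pooled video features,
      # but with a reduced learning rate so the transferred weights are not destroyed.
      params = (list(lstm.parameters()) + list(to_vocab.parameters())
                + list(embed.parameters()))
      optimizer = torch.optim.SGD(params, lr=0.001)   # vs. e.g. 0.01 during pre-training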

  25. Experiments: Dataset. Microsoft Research Video Description dataset [Chen & Dolan, ACL’11]. Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/ ● 1970 YouTube video snippets: 10-30s each, typically a single activity, no dialogues; 1200 training, 100 validation, 670 test. ● Annotations: descriptions in multiple languages, ~40 English descriptions per video; descriptions and videos collected on AMT.

  26. Augment with image datasets. # training videos: 1300. Flickr30k: 30,000 images, 150,000 descriptions. MSCOCO: 120,000 images, 600,000 descriptions.

  27. Sample videos and gold descriptions.
     Video 1 (plowing): ● A man appears to be plowing a rice field with a plow being pulled by two oxen. ● A team of water buffalo pull a plow through a rice paddy. ● Domesticated livestock are helping a man plow. ● A man leads a team of oxen down a muddy path. ● Two oxen walk through some mud. ● A man is tilling his land with an ox pulled plow. ● Bulls are pulling an object. ● Two oxen are plowing a field. ● The farmer is tilling the soil. ● A man in ploughing the field.
     Video 2 (tightrope): ● A man is walking on a rope. ● A man is walking across a rope. ● A man is balancing on a rope. ● A man is balancing on a rope at the beach. ● A man walks on a tightrope at the beach. ● A man is balancing on a volleyball net. ● A man is walking on a rope held by poles. ● A man balanced on a wire. ● The man is balancing on the wire. ● A man is walking on a rope. ● A man is standing in the sea shore.

  28. Evaluation. ● Machine translation metrics: BLEU, METEOR. ● Human evaluation.

  29. Results - Generation. MT metrics (BLEU, METEOR) compare the system-generated sentences against (all) ground-truth references; compared systems include the prior work of [Thomason et al. COLING’14].
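
  A sketch of the multi-reference BLEU computation using NLTK; the example sentences are taken from the gold descriptions above, and METEOR is usually computed with its own official scorer:

      from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

      references = [r.split() for r in [
          "a man is walking on a rope",
          "a man is balancing on a rope at the beach",
          "a man walks on a tightrope at the beach",
      ]]
      candidate = "a man is walking on a rope".split()

      # Score the candidate against all ground-truth references at once.
      bleu = sentence_bleu(references, candidate,
                           smoothing_function=SmoothingFunction().method1)
      print(round(bleu, 3))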

  30. Human Evaluation. Relevance: rank the sentences by how accurately they describe the event depicted in the video; no two sentences can have the same rank. Grammar: rate the grammatical correctness of the sentences; multiple sentences can have the same rating.

  31. Results - Human Evaluation.
     Model | Relevance | Grammar
     [Thomason et al. COLING’14] | 2.26 | 3.99
     | 2.74 | 3.84
     | 2.93 | 3.64
     | 4.65 | 4.61

  32. More Examples.

  33. Translating Videos to Natural Language [Venugopalan et al. NAACL’15]: the mean-pooled CNN representation does not consider the temporal sequence of frames.

  34. Can our model be made sensitive to temporal structure, allowing both the input (sequence of frames) and the output (sequence of words) to be of variable length? [Venugopalan et al. ICCV’15]

  35. Recurrent Neural Networks (RNNs) can map a vector to a sequence.
     ● [Sutskever et al. NIPS’14]: RNN encoder -> RNN decoder (English sentence -> French sentence).
     ● [Donahue et al. CVPR’15], [Vinyals et al. CVPR’15]: encode image -> RNN decoder -> sentence.
     ● [Venugopalan et al. NAACL’15]: encode video -> RNN decoder -> sentence.
     ● [Venugopalan et al. ICCV’15] (this work): RNN encoder -> RNN decoder -> sentence.

  36. S2VT: Sequence to Sequence Video to Text [Venugopalan et al. ICCV’15]. Encoding stage: a stack of LSTMs reads the per-frame CNN features. Decoding stage: the same LSTM stack then decodes the sentence, e.g., "A man is talking ...".
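
  A much-simplified sketch of the S2VT encode-then-decode flow. It shares one two-layer LSTM across both stages but omits details such as the padding inputs used during encoding and decoding; all names and sizes are illustrative:

      import torch
      import torch.nn as nn

      feat_dim, emb_dim, hidden, vocab_size = 4096, 500, 1000, 1000

      in_proj = nn.Linear(feat_dim, emb_dim)      # project fc7 frame features to the LSTM input size
      embed = nn.Embedding(vocab_size, emb_dim)   # word embeddings use the same input size
      lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden, num_layers=2, batch_first=True)
      to_vocab = nn.Linear(hidden, vocab_size)

      frames = torch.randn(1, 30, feat_dim)       # fc7 features of 30 sampled frames

      # Encoding stage: read the whole frame sequence; keep only the final LSTM state.
      _, state = lstm(in_proj(frames))

      # Decoding stage: emit one word at a time, conditioned on that state.
      word = torch.tensor([[0]])                  # illustrative <BOS> token id
      sentence = []
      for _ in range(20):
          out, state = lstm(embed(word), state)
          word = to_vocab(out[:, -1]).argmax(dim=-1, keepdim=True)
          sentence.append(word.item())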

  37. RGB frames. 1. Train the CNN on ImageNet, 1000 categories [Krizhevsky et al. NIPS’12]. 2. Forward propagate each frame and take the activations from the layer before classification ("fc7") as the 4096-dimensional CNN "feature vector".
