Multimodal Compact Bilinear Pooling for VQA
Akira Fukui 1,2, Dong Huk Park 1, Daylen Yang 1, Anna Rohrbach 1,3, Trevor Darrell 1, Marcus Rohrbach 1
1 UC Berkeley EECS, CA, United States; 2 Sony Corp., Tokyo, Japan; 3 Max Planck Institute for Informatics, Saarbrücken, Germany
Multimodal language and visual understanding Description A table full of food for a feast
Multimodal language and visual understanding Grounding: The bowl with the brown sauce
Multimodal language and visual understanding Visual Question Answering: What is the brown sauce? Gravy
How to Combine Image Representation and Question Representation?
[Diagram: CNN image features ("spoon", "plate", "bowl", "table", "food", "corn", "person", ...) and LSTM question features ("Is this going to be a feast?") must be combined to predict the answer "Yes".]
Desired properties:
[ ] All elements can interact
[ ] Multiplicative interaction
How to Combine Image Representation and Question Representation?
Concatenation + FC layers (ReLU)
[Diagram: the CNN image feature and the LSTM question feature are concatenated and passed through fully connected layers with ReLU to predict "Yes".]
[✓] All elements can interact
[ ] Multiplicative interaction (difficult to learn the output classification)
How to Combine Image Representation and Question Representation?
Elementwise Multiplication (⨀) + FC layers (ReLU)
[Diagram: the CNN image feature and the LSTM question feature are multiplied elementwise and passed through fully connected layers with ReLU to predict "Yes".]
[ ] All elements can interact (difficult to learn the input embedding)
[✓] Multiplicative interaction
How to Combine Image Representation and Question Representation?
Outer Product / Bilinear Pooling [Lin ICCV 2015]
[Diagram: the outer product of the CNN image feature and the LSTM question feature, followed by a fully connected layer, predicts "Yes".]
[✓] All elements can interact
[✓] Multiplicative interaction
[Lin ICCV 2015] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. ICCV 2015.
How to Combine Image Representation and Question Representation?
Outer Product / Bilinear Pooling [Lin ICCV 2015]
[Diagram: the outer product of the 2048-d CNN image feature and the 2048-d LSTM question feature has ~4 million entries; the fully connected classifier on top has 4 million x 1000 parameters.]
[✓] All elements can interact
[✓] Multiplicative interaction
Drawbacks:
- High #activations & computation
- High #parameters
[Lin ICCV 2015] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. ICCV 2015.
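To make the cost concrete, here is a minimal NumPy sketch of full bilinear pooling with the feature sizes from the slide (variable names are illustrative, not from the authors' code):

```python
import numpy as np

v = np.random.randn(2048)           # image feature (e.g. CNN pooling output)
q = np.random.randn(2048)           # question feature (e.g. final LSTM state)
bilinear = np.outer(v, q).ravel()   # 2048 * 2048 ≈ 4.2 million activations
# A linear classifier on this vector needs ~4 million weights per output
# class, which is what the compact approximation on the next slides avoids.
```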
Multimodal Compact Bilinear Pooling
Compact Bilinear Pooling [Gao CVPR 16]
[Diagram: the 2048-d CNN image feature and the 2048-d LSTM question feature are combined by MCB into a 16k-d vector; the fully connected classifier on top has 16k x 1000 parameters.]
[✓] All elements can interact
[✓] Multiplicative interaction
[✓] Low #activations & computation
[✓] Low #parameters
[Gao CVPR 16] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. CVPR 2016.
[ICLR Workshops 2016] N. Zhang, E. Shelhamer, Y. Gao, T. Darrell. Fine-grained pose prediction, normalization, and recognition. ICLR Workshops 2016.
Multimodal Compact Bilinear Pooling
Random Projection: Count Sketch Ψ
[Diagram: the CNN image feature and the LSTM question feature are each projected with Count Sketch Ψ before the 16k x 1000 fully connected classifier.]
Pham & Pagh (2013): Ψ(x ⊗ z) = Ψ(x) ∗ Ψ(z)
[✓] All elements can interact
[✓] Multiplicative interaction
[ ] Low #activations & computation
[✓] Low #parameters
[Countsketch] M. Charikar, K. Chen, M. Farach-Colton. Finding frequent items in data streams. Automata, Languages and Programming 2002.
[Pham & Pagh 13] N. Pham, R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. KDD 2013.
Multimodal Compact Bilinear Pooling
Count Sketch Ψ + Convolution ∗
[Diagram: the sketched image and question features are combined by convolution before the 16k x 1000 fully connected classifier.]
Pham & Pagh (2013): Ψ(x ⊗ z) = Ψ(x) ∗ Ψ(z)
[✓] All elements can interact
[✓] Multiplicative interaction
[✓] Low #activations & computation
[✓] Low #parameters
[Countsketch: Charikar et al. 2002; Pham & Pagh: KDD 2013, as on the previous slide]
Multimodal Compact Bilinear Pooling
Count Sketch Ψ + Convolution via FFT
[Diagram: the convolution of the two sketches is computed as FFT⁻¹(FFT(Ψ(x)) ⨀ FFT(Ψ(z))) before the 16k x 1000 fully connected classifier.]
Pham & Pagh (2013): Ψ(x ⊗ z) = Ψ(x) ∗ Ψ(z)
[✓] All elements can interact
[✓] Multiplicative interaction
[✓] Low #activations & computation
[✓] Low #parameters
[Countsketch: Charikar et al. 2002; Pham & Pagh: KDD 2013, as on the previous slide]
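As a concrete illustration of the identity above, here is a minimal NumPy sketch of MCB (function and variable names are illustrative, not the authors' released code; the output size d = 16000 follows the slides):

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Count Sketch projection Psi(x): entry i of x is added to output
    index h[i] with random sign s[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb(v, q, d=16000, seed=0):
    """Multimodal Compact Bilinear pooling of an image feature v and a
    question feature q: approximates the sketched outer product via
    Psi(v ⊗ q) = Psi(v) ∗ Psi(q), with the circular convolution
    computed through the FFT."""
    rng = np.random.RandomState(seed)
    h_v = rng.randint(d, size=v.size)        # fixed random hash indices for v
    s_v = rng.choice([-1, 1], size=v.size)   # fixed random signs for v
    h_q = rng.randint(d, size=q.size)
    s_q = rng.choice([-1, 1], size=q.size)
    fv = np.fft.fft(count_sketch(v, h_v, s_v, d))
    fq = np.fft.fft(count_sketch(q, h_q, s_q, d))
    return np.real(np.fft.ifft(fv * fq))     # d-dimensional MCB feature

# Example: 2048-d image and question features -> 16000-d joint feature
phi = mcb(np.random.randn(2048), np.random.randn(2048))
print(phi.shape)  # (16000,)
```

Note that the hash indices and signs are drawn once from a fixed seed, so the same projection is used for every input, as the count-sketch construction requires.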
Related work
• Alternative approach to multiplicative interactions
  - DPP Net: Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016.
Experimental setup (without Attention)
• Solver
  - Cross-entropy loss, Adam, learning rate 0.0007
• Feature Extraction
  - ResNet-152, image size 448x448
• Answers
  - 3000 most frequent answers on train
  - Training answers sampled according to their annotator probability
• Trained on train / validated on val / tested on test-dev
[Architecture diagram: image -> ResNet-152 -> 2048-d feature, L2 normalized; question (13k-20k word vocabulary) -> 300-d embedding with tanh -> two LSTM layers with dropout (1024 each), concatenated to 2048-d; MCB -> 16k -> signed sqrt -> L2 normalization -> fully connected -> 3000 answers -> softmax (e.g. "Is this going to be a feast?" -> "Yes").]
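A hedged sketch of the head that sits on top of the MCB output in the diagram above (weight shapes are assumptions read off the slide; this is illustrative, not the authors' implementation):

```python
import numpy as np

def signed_sqrt(x):
    # Signed square root, applied elementwise to the MCB output
    return np.sign(x) * np.sqrt(np.abs(x))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def answer_distribution(mcb_feat, W, b):
    """mcb_feat: 16000-d MCB output (see the sketch above);
    W: (3000, 16000) and b: (3000,) are the final fully connected layer
    over the 3000 most frequent training answers."""
    x = l2_normalize(signed_sqrt(mcb_feat))
    logits = W @ x + b
    p = np.exp(logits - logits.max())        # numerically stable softmax
    return p / p.sum()
```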
Ablation: comparison to other multimodal methods
• MCB achieves the highest accuracy
• MCB is comparable to full (uncompressed) bilinear pooling
Trained on train; VQA test-dev accuracy [%]:
  Eltwise Sum: 56.5
  Concat: 57.5
  Concat + FC: 58.4
  Concat + FC + FC: 57.1
  Eltwise Product: 58.6
  Eltwise Product + FC: 56.4
  Eltwise Product + FC + FC: 57.8
  Full Bilinear (128x128 -> 16k): 58.5
  MCB (128x128 -> 4k): 58.7
  MCB (2048x2048 -> 16k): 59.8
Dimensionality of MCB
• The MCB dimension determines how well the outer product is approximated
[Plot: VQA Open-Ended test-dev accuracy [%] vs. MCB dimension for 2048-d inputs: 1024 -> 58.4, 2048 -> 58.8, 4096 -> 59.4, 8192 -> 59.7, 16000 -> 59.8, 32000 -> 59.7]
Multimodal language and visual understanding Visual Question Answering: What is the brown sauce? Gravy
MCB with Attention
• Predict spatial attention with MCB
[Architecture diagram: ResNet-152 gives 2048x14x14 spatial image features (L2 normalized); the 2048-d question feature (word embedding + LSTM) is tiled over the 14x14 grid and merged with the image features by MCB (16k x 14 x 14), followed by signed sqrt, 1x1 conv + ReLU (512 x 14 x 14), 1x1 conv (1 x 14 x 14) and a spatial softmax; the attention-weighted sum of the image features (2048-d) is combined with the question feature by a second MCB (16k), signed sqrt, L2 normalization, fully connected (3000 answers) and softmax, e.g. "What is the yellow food?" -> "Corn".]
Attention for captioning:
- K. Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Attention for VQA:
- H. Xu and K. Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering.
- J. Lu et al. Hierarchical Question-Image Co-Attention for Visual Question Answering.
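A rough, illustrative sketch of the attention branch in the diagram, reusing the mcb, signed_sqrt, and l2_normalize helpers from the earlier sketches (the shapes and the 1x1 convolutions, written here as plain matrix products, are assumptions read off the slide):

```python
import numpy as np

def attend(img_feat_map, q_feat, conv1, conv2):
    """img_feat_map: (2048, 14, 14) spatial ResNet features;
    q_feat: (2048,) question feature, tiled over the 14x14 grid;
    conv1: (512, 16000) and conv2: (1, 512) act as 1x1 convolutions."""
    C, H, W = img_feat_map.shape
    sketch = np.zeros((16000, H, W))
    for i in range(H):                        # loop over grid cells for clarity
        for j in range(W):
            # MCB between the question feature and each spatial location
            sketch[:, i, j] = l2_normalize(signed_sqrt(mcb(img_feat_map[:, i, j], q_feat)))
    hidden = np.maximum(0.0, np.einsum('kc,chw->khw', conv1, sketch))   # 1x1 conv + ReLU
    scores = np.einsum('kc,chw->khw', conv2, hidden)[0]                 # (14, 14) attention logits
    att = np.exp(scores - scores.max())
    att /= att.sum()                                                    # softmax over all locations
    return np.einsum('chw,hw->c', img_feat_map, att)                    # attention-weighted sum, (2048,)
```

The 2048-d attended image feature is then combined with the question feature by a second MCB and classified as in the non-attention model above.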
Attention Visualizations Is this person wearing a hat? Yes [Groundtruth: Yes]
Results on MCB with Attention
• MCB performs well with attention
[Bar chart: VQA test-dev accuracy [%] of attention models: Concat + FC: 58.4; MCB: 59.8; Concat + FC + Attention: 58.4; MCB + Attention: 62.5]
Techniques to improve performance
• Data Augmentation
  - VQA-style data from the Visual Genome dataset
  - ~1M additional question-answer pairs
  - Articles removed; only single-word answers kept
• Ensembles
  - Average the softmax outputs over models
[Bar chart: VQA Open-Ended test-dev accuracy [%] with Genome data and ensembling:
  MCB + Attention (train): 62.5
  MCB + Attention (train + val): 64.2
  MCB + Attention (train + val + genome): 65.1
  MCB + Attention + Ensemble (train + val + genome): 66.7
  MCB + Attention + Ensemble (train + val + genome), Multiple Choice: 70.2]
Visual Genome: Connecting language and vision using crowdsourced dense image annotations.
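The ensembling step ("average the softmax outputs") in a tiny illustrative helper; models is assumed to be a list of callables that each return a probability vector over the 3000 candidate answers:

```python
import numpy as np

def ensemble_predict(models, image, question):
    # Average the softmax outputs of independently trained models,
    # then pick the most probable of the 3000 candidate answers.
    probs = np.mean([m(image, question) for m in models], axis=0)
    return int(np.argmax(probs))
```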
MCB on other Datasets and Tasks
• Visual Grounding (Flickr30k Entities)
• Visual7W (Multiple Choice)
Our architecture for Visual7W: MCB with Attention and Answer Encoding.
Accuracy on Visual7W [%]:
  Zhu et al.: 54.3
  Concat + Attention: 52.8
  MCB + Attention: 62.2
Accuracy on Flickr30k Entities [%]:
  Plummer et al.: 43.8
  Wang et al.: 43.9
  Rohrbach et al.: 47.7
  Concat: 46.5
  Eltwise Prod: 47.4
  Eltwise Prod + Conv: 47.9
  MCB: 48.7
Zhu et al. Visual7W: Grounded Question Answering in Images.
Rohrbach et al. Grounding of textual phrases in images by reconstruction.
Examples for VQA
Attention Visualizations What is the woman feeding the giraffe? Carrot [Groundtruth: Carrot]
Attention Visualizations What color is her shirt? Purple [Groundtruth: Purple]
Attention Visualizations What is her hairstyle for the picture? Ponytail [Groundtruth: Ponytail]