Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
Mateusz Malinowski [1], Marcus Rohrbach [2], Mario Fritz [1]
[1] Max Planck Institute for Informatics
[2] University of California, Berkeley; ICSI
Human-like Comprehension
[Figure: the question "Is the water boiling?" shown next to strings of binary digits]
• How far are machines from human-quality understanding?
• How can we monitor progress and evaluate architectures?
M. Malinowski, M. Rohrbach, M. Fritz. Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
Visual Turing Test (NIPS’14)
• Holistic, open-ended task
‣ Visual scene understanding
‣ Natural language understanding
‣ Deduction
• No internal representation is evaluated
‣ Challenge is open to diverse approaches
• Scalable annotation and evaluation effort
‣ Only question-answer pairs
Example question-answer pairs:
- What is behind the table? sofa
- What is on the refrigerator? magnet, paper
- How many lamps are there? 2
- What color are the cabinets? brown
Related Work
• Symbolic-based Approaches
‣ M. Malinowski et al. Multiworld. NIPS’14
[Figure: symbolic scene facts such as chair(1, brown, position X, Y, Z) and window(1, blue, position X, Y, Z); a question "What …?" parsed to λx.Behind(x, Table); answer: window]
• Large Scale Datasets
‣ S. Antol et al. Visual QA. ICCV’15
‣ L. Yu et al. Visual Madlibs. ICCV’15
‣ D. Geman et al. Visual Turing Test. PNAS’15
‣ M. Ren et al. Image QA. NIPS’15
‣ H. Gao et al. Are You Talking to a Machine? NIPS’15
‣ Y. Zhu et al. Visual7W. arXiv’15
‣ L. Zhu et al. Uncovering Temporal Context. arXiv’15
‣ ...
• Neural-based Approaches
‣ M. Ren et al. Image QA. NIPS’15
‣ H. Gao et al. Are You Talking to a Machine? NIPS’15
‣ L. Ma et al. Learning to Answer Questions From Images. arXiv’15
• Attention-based Approaches
‣ Z. Yang et al. Stacked Attention Networks. arXiv’15
‣ Y. Zhu et al. Visual7W. arXiv’15
‣ J. Andreas et al. Deep Compositional QA. arXiv’15
‣ H. Xu et al. Ask, Attend and Answer. arXiv’15
‣ K. Chen et al. ABC-CNN. arXiv’15
‣ K. J. Shih et al. Where To Look. arXiv’15
• Hybrid Approaches
‣ H. Noh et al. Dynamic Parameter Prediction. arXiv’15
‣ J. Andreas et al. Deep Compositional QA. arXiv’15
[Figures: architecture diagrams and example question-answer pairs from the cited works, e.g. "What is the cat doing? Sitting on the umbrella", "Why is the person holding a knife? To cut the cake with.", "Where are the carrots? At the top.", "How many people are there? Three."]
Outline
• Neural approach to answer questions about images
[Figure: CNN image features and the question "What is behind the table ?" are fed to a chain of LSTM units, which outputs "chair window <end>"]
• Performance metrics based on additional annotations
Example: What is the object on the floor in front of the wall?
- Human 1: bed
- Human 2: shelf
- Human 3: bed
- Human 4: bookshelf
Method: Ask Your Neurons
[Figure: CNN image features and the question words "What is behind the table ?" are fed to a chain of LSTM units, which outputs the answer words "chairs window <end>"]
Method: Ask Your Neurons
[Figure: the CNN image representation x and the question words ..., q_{n-1}, q_n are fed to a chain of LSTM units; each previous answer word a_{t-1} is fed back in, and the units output the answer words a_1, ..., a_t]
• Predicting answer sequence
‣ Recursive formulation

$$\hat{a}_t = \arg\max_{a \in \mathcal{V}} \; p(a \mid x, q, \hat{A}_{t-1}; \theta)$$

where
- $x$ is the image representation,
- $q = [q_1, \ldots, q_{n-1}, [\![?]\!]]$ is the question, $q_j$ is a question word index, and $[\![?]\!]$ encodes the question mark,
- $\mathcal{V}$ is the vocabulary,
- $\hat{A}_{t-1} = \{\hat{a}_1, \ldots, \hat{a}_{t-1}\}$ is the set of previous answer words.
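The recursive formulation on this slide amounts to greedy decoding: at each step, pick the most probable word given the image, the question, and the answer words chosen so far. Below is a minimal sketch of that loop only. The names `greedy_decode` and `toy_score`, the `max_len` cap, and the hard-coded probabilities are assumptions for illustration; `toy_score` merely stands in for the trained CNN + LSTM model and is not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # end-of-answer token, as drawn in the slide figure

# Hypothetical scoring interface: returns p(a | x, q, previous answers) for each a in V.
ScoreFn = Callable[[Sequence[float], Sequence[str], Sequence[str]], Dict[str, float]]


def greedy_decode(score: ScoreFn, x: Sequence[float], q: Sequence[str],
                  max_len: int = 10) -> List[str]:
    """Repeatedly take the arg max over the vocabulary until <end> is predicted."""
    answers: List[str] = []
    for _ in range(max_len):  # max_len is an assumed safety cap, not from the slides
        probs = score(x, q, answers)
        a_t = max(probs, key=probs.get)  # arg max over a in V
        if a_t == END:
            break
        answers.append(a_t)
    return answers


def toy_score(x: Sequence[float], q: Sequence[str],
              prev: Sequence[str]) -> Dict[str, float]:
    """Stand-in for the CNN + LSTM model; the probabilities are made up."""
    table = {
        0: {"chairs": 0.6, "window": 0.3, END: 0.1},
        1: {"chairs": 0.1, "window": 0.7, END: 0.2},
    }
    return table.get(len(prev), {"chairs": 0.05, "window": 0.05, END: 0.9})


result = greedy_decode(toy_score, x=[0.0],
                       q=["What", "is", "behind", "the", "table", "?"])
print(result)  # ['chairs', 'window']
```

In this toy run the loop picks "chairs", then "window", then stops on `<end>`, mirroring the multi-word answer in the figure.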
Symbolic vs Neural-based Approaches
• Symbolic approach (NIPS’14)
‣ Explicit representation
‣ Independent components
- Detectors, Semantic Parser, Database
‣ Components trained separately
‣ Many ‘hard’ design decisions
[Figure: "What is behind the table ?" is parsed into the logical representation λx.Behind(x, Table), which is evaluated against a knowledge base to give "chairs, window"]
M. Malinowski et al. “A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input”. NIPS’14
Symbolic vs Neural-based Approaches
• Symbolic approach (NIPS’14)
‣ Explicit representation
‣ Independent components
- Detectors, Semantic Parser, Database
‣ Components trained separately
‣ Many ‘hard’ design decisions
[Figure: "What is behind the table ?" is parsed into the logical representation λx.Behind(x, Table), which is evaluated against a knowledge base to give "chairs, window"]
• Ask Your Neurons (Ours)
‣ Implicit representation
‣ End-to-end formulation
- From images and questions to answers
‣ Joint training
‣ Fewer design decisions
[Figure: end-to-end, jointly trained architecture; CNN image features and the question words "What is … ?" are fed to a chain of LSTM units, which outputs "chairs window <end>"]
M. Malinowski et al. “A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input”. NIPS’14
Neural Visual QA vs Neural Image Description
• Neural Image Description
‣ Conditions on an image
‣ Generates a description
- Sequence of words
‣ Loss at every step
[Figure: CNN image features are fed to a chain of LSTM units, which outputs "Large building with a clock <end>"; a loss is attached to every output word]
J. Donahue et al. “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”. CVPR’15
Neural Visual QA vs Neural Image Description
• Neural Image Description
‣ Conditions on an image
‣ Generates a description
- Sequence of words
‣ Loss at every step
[Figure: CNN image features are fed to a chain of LSTM units, which outputs "Large building with a clock <end>"; a loss is attached to every output word]
• Ask Your Neurons (Ours)
‣ Conditions on an image and a question
‣ Generates an answer
- Sequence of answer words
‣ Loss only at answer words
[Figure: CNN image features and the question words "What is … ?" are fed to a chain of LSTM units, which outputs "chairs window <end>"; a loss is attached only to the answer words]
J. Donahue et al. “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”. CVPR’15
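The difference between "loss at every step" and "loss only at answer words" can be illustrated with a masked negative log-likelihood. This is a sketch under stated assumptions, not the authors' training code: the function name `masked_nll`, the averaging over selected steps, the decision to count `<end>` as an answer step, and all probability values are made up for illustration.

```python
import math
from typing import Sequence


def masked_nll(step_probs: Sequence[float], use_step: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the time steps selected by the mask.

    step_probs[i] is the probability the model assigned to the correct word
    at step i; use_step[i] says whether that step contributes to the loss.
    """
    if len(step_probs) != len(use_step):
        raise ValueError("step_probs and use_step must have the same length")
    losses = [-math.log(p) for p, used in zip(step_probs, use_step) if used]
    if not losses:
        raise ValueError("mask selects no steps")
    return sum(losses) / len(losses)


# Sequence: "What is behind the table ?" (6 question steps)
# followed by "chairs window <end>" (3 answer steps; counting <end> is an assumption).
probs = [0.1] * 6 + [0.5] * 3

# Image description style: every step contributes to the loss.
caption_style_loss = masked_nll(probs, [True] * 9)

# Visual QA style from the slide: only the answer steps contribute.
qa_style_loss = masked_nll(probs, [False] * 6 + [True] * 3)

print(caption_style_loss, qa_style_loss)
```

With the mask in place, the model is not penalized for failing to predict the question words, which are given as input rather than generated.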