Designing deep architectures for Visual Question Answering
  1. Designing deep architectures for Visual Question Answering. Matthieu Cord, Sorbonne University, valeo.ai research lab, Paris. Thanks to H. Ben-younes, R. Cadène

  2. Visual Question Answering Question Answering: + What does Claudia do?

  3. Visual Question Answering Visual Question Answering: + What does Claudia do?

  4. Visual Question Answering Visual Question Answering: + What does Claudia do? Sitting at the bottom Standing at the back …

  5. Visual Question Answering Visual Question Answering: + What does Claudia do? Sitting at the bottom / Standing at the back / … Solving this task is interesting for: - Studying deep learning models in a multimodal context - Improving human-machine interaction - One step toward building a visual assistant for blind people

  6. Outline 1. Multimodal embedding • Deep nets to align text+image • Learning 2. VQA framework • Task modeling • Fusion in VQA • Reasoning in VQA

  7. Deep semantic-visual embedding RNN ConvNet

  8. Deep semantic-visual embedding RNN ConvNet Semantics of distance. Retrieval by NN search

  9. Deep semantic-visual embedding A car / A cat on a sofa / A dog playing. 2D semantic-visual space example: • Distance in the space has a semantic interpretation • Retrieval is done by finding nearest neighbors

  10. Deep semantic-visual embedding • Designing image and text embedding architectures • Learning scheme for these deep hybrid nets

  11. Deep semantic-visual embedding DeViSE: A Deep Visual-Semantic Embedding Model, A. Frome et al., NIPS 2013. Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018. Textual pipeline: • Pretrained word embedding (w2v) • Simple Recurrent Unit (SRU) • Normalization • Affine projection. Visual pipeline: • ResNet-152 pretrained • Weldon spatial pooling • Affine projection • Normalization. (diagram: (a, man, in, ski, gear, skiing, on, snow) → w2v → SRU+norm; image → ResNet → conv → pool → norm) θ and ϕ are the trained parameters

  12. Deep semantic-visual embedding Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018. Textual pipeline: • Pretrained word embedding • Simple Recurrent Unit (SRU) • Normalization • Affine projection. Visual pipeline: • ResNet-152 pretrained • Weldon spatial pooling • Affine projection • Normalization. (diagram: (a, man, in, ski, gear, skiing, on, snow) → w2v → SRU+norm; image → ResNet → conv → pool → norm)

  13. Deep semantic-visual embedding Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018. Textual pipeline: • Pretrained word embedding • Simple Recurrent Unit (SRU) • Normalization • Affine projection. Visual pipeline: • ResNet-152 pretrained • Weldon spatial pooling • Affine projection • Normalization. (diagram: (a, man, in, ski, gear, skiing, on, snow) → w2v → SRU+norm; image → ResNet → conv → pool → norm) The parameters θ and ϕ are learned using a training set
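The two pipelines above can be sketched as a dual encoder projecting text and image into a shared space. This is a minimal numpy sketch, not the authors' implementation: mean pooling stands in for the SRU, max pooling stands in for Weldon spatial pooling, and the random matrices `W_txt`/`W_img` stand in for the trained affine projections; all dimensions and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_WORD, D_IMG, D_EMB = 300, 2048, 512   # assumed w2v, ResNet-152, joint-space sizes

# Stand-ins for the trained affine projections of the two pipelines.
W_txt = rng.normal(0, 0.02, (D_WORD, D_EMB))
W_img = rng.normal(0, 0.02, (D_IMG, D_EMB))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def embed_text(word_vectors):
    """Mean-pool word vectors (stand-in for the SRU), project, L2-normalize."""
    sentence = word_vectors.mean(axis=0)
    return l2_normalize(sentence @ W_txt)

def embed_image(conv_features):
    """Max-pool spatial features (stand-in for Weldon pooling), project, normalize."""
    pooled = conv_features.max(axis=0)
    return l2_normalize(pooled @ W_img)

# On the unit sphere, cosine similarity is just a dot product.
caption = embed_text(rng.normal(size=(8, D_WORD)))   # 8 words
image = embed_image(rng.normal(size=(49, D_IMG)))    # 7x7 feature map, flattened
similarity = float(caption @ image)
print(similarity)
```

Because both embeddings are L2-normalized, a triplet or contrastive loss on this dot product would train θ and ϕ, as the slides describe.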

  14. How to get large training datasets? Cooking recipes: easy to get large multimodal datasets with aligned data. Learning Cross-modal Embeddings for Cooking Recipes and Food Images, A. Salvador et al., CVPR 2017. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings, M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, M. Cord, SIGIR 2018

  15. Deep semantic-visual embedding Demo: visiir.lip6.fr

  16. Cross-modal retrieval Query → closest elements. Text queries: "A plane in a cloudy sky", "A dog playing with a frisbee". Image query → closest captions: 1. A herd of sheep standing on top of a snow covered field. 2. There are sheep standing in the grass near a fence. 3. Some black and white sheep, a fence, dirt and grass
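Retrieval in the joint space reduces to a nearest-neighbor search over precomputed embeddings. A minimal sketch, assuming a gallery of L2-normalized embeddings (the 512-d size and the perturbed-copy query are illustrative assumptions, not real data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical precomputed gallery: 1000 image embeddings in a 512-d joint
# space, L2-normalized, plus one text query embedded in the same space.
gallery = rng.normal(size=(1000, 512))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = gallery[42] + 0.01 * rng.normal(size=512)   # slightly perturbed copy of item 42
query /= np.linalg.norm(query)

def retrieve(query, gallery, k=3):
    """Return indices of the k nearest gallery items by cosine similarity."""
    scores = gallery @ query          # dot product == cosine on unit vectors
    return np.argsort(-scores)[:k]

top = retrieve(query, gallery)
print(top)   # item 42 should rank first
```

The same function works in both directions (text→image or image→text), since both modalities live in one space.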

  17. Cross-modal retrieval and localization Visual grounding examples: • Generating multiple heat maps with different textual queries Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al, CVPR 2018

  18. Cross-modal retrieval and localization Emergence of color understanding:

  19. Outline 1. Multimodal embedding • Deep nets to align text+image • Learning 2. Visual Question Answering • Task modeling • Fusion in VQA • Reasoning in VQA

  20. VQA

  21. VQA What color is the fire hydrant on the left? Green

  22. VQA What color is the fire hydrant on the right? Yellow

  23. Who is wearing glasses? Similar images, different answers: man / woman. @VQA workshop, CVPR 2017 ⇒ Need very good visual and question (deep) representations ⇒ Full scene understanding ⇒ Need high-level multimodal interaction modeling ⇒ Merging operators, attention and reasoning

  24. Vanilla VQA scheme: 2 deep nets + fusion (question representation + image representation)

  25. VQA: the output space Image representation + question representation → VQA → "Yes". Question: Is the lady with the blue fur wearing glasses?

  26. VQA: the output space

  27. VQA: the output space Output space representation: => classify over the 3000 most frequent answers (covering ~95% of the data)
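Building that fixed answer vocabulary is a simple frequency cutoff. A toy sketch (the answer list and the `max_size=3` cutoff are illustrative; real VQA uses ~3000 answers):

```python
from collections import Counter

# Toy ground-truth answers; in the real dataset the 3000 most frequent
# answers cover about 95% of all annotations.
answers = ["yes", "no", "yes", "2", "green", "yes", "no", "2", "red", "yes"]

def build_answer_vocab(answers, max_size=3000):
    """Keep the max_size most frequent answers and report their coverage."""
    counts = Counter(answers)
    kept = [a for a, _ in counts.most_common(max_size)]
    coverage = sum(counts[a] for a in kept) / len(answers)
    return {a: i for i, a in enumerate(kept)}, coverage

vocab, coverage = build_answer_vocab(answers, max_size=3)
print(vocab, coverage)
```

VQA then becomes plain classification over this vocabulary with a cross-entropy loss, as slide 29 summarizes.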

  28. VQA: the output space Image representation + question representation → VQA → classes. Question: Is the lady with the blue fur wearing glasses?

  29. VQA processing Image: ● Convolutional network (VGG, ResNet, …) ● Detection system (EdgeBoxes, Faster R-CNN, …) Question: ● Bag-of-words ● Recurrent network (RNN, LSTM, GRU, SRU, …) Multimodal: ● Fusion ● Reasoning Learning: ● Fixed answer vocabulary ● Classification (cross-entropy)

  30. Fusion in VQA

  31. VQA: fusion Is the lady with the purple fur wearing glasses? Fusion operators: concat + projection; element-wise product; concat + MLP

  32. VQA: fusion Is the lady with the purple fur wearing glasses? Fusion operators: concat + projection; element-wise product; concat + MLP
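The three baseline fusion operators can be written in a few lines each. A minimal numpy sketch with random weights standing in for trained ones (all dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
D_Q, D_V, D_OUT = 2400, 2048, 1200   # assumed question/image/fused sizes

q = rng.normal(size=D_Q)   # question embedding (e.g. final RNN state)
v = rng.normal(size=D_V)   # image embedding (e.g. pooled ResNet features)

# (1) Concatenation + linear projection
W_cat = rng.normal(0, 0.01, (D_Q + D_V, D_OUT))
fused_concat = np.concatenate([q, v]) @ W_cat

# (2) Element-wise product, after projecting both inputs to a common size
W_q = rng.normal(0, 0.01, (D_Q, D_OUT))
W_v = rng.normal(0, 0.01, (D_V, D_OUT))
fused_elemwise = (q @ W_q) * (v @ W_v)

# (3) Concatenation + MLP (one ReLU hidden layer)
W1 = rng.normal(0, 0.01, (D_Q + D_V, 512))
W2 = rng.normal(0, 0.01, (512, D_OUT))
fused_mlp = np.maximum(0, np.concatenate([q, v]) @ W1) @ W2

print(fused_concat.shape, fused_elemwise.shape, fused_mlp.shape)
```

Concatenation only models additive interactions between the two modalities; the element-wise product captures multiplicative ones, which motivates the full bilinear models of the next slides.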

  33. VQA: bilinear fusion [Fukui, Akira et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, CVPR 2016] [Kim, Jin-Hwa et al., Hadamard Product for Low-rank Bilinear Pooling, ICLR 2017] Is the lady with the purple fur wearing glasses? Bilinear model: y_k = Σ_{i,j} T[i,j,k] q_i v_j

  34. VQA: bilinear fusion Learn the 3-way tensor coefficients • Different from signal-processing tensor analysis (representation) Problem: q, v and y are of dimension ~2000 => 8 billion free parameters in the tensor. Need to reduce the tensor size: • Idea: structure the tensor to reduce the number of parameters

  35. VQA: bilinear fusion Tensor structure: Tucker decomposition ⇔ constrain the rank of each unfolding of T
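A Tucker-structured bilinear fusion can be sketched directly: three factor matrices plus a small core tensor replace the full 2000×2000×2000 tensor. This is a simplified numpy sketch, not MUTAN itself; the tiny core size (20) is an illustrative assumption chosen so the example runs cheaply, and the weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
d_q = d_v = d_y = 2000        # dimensions from the slides (~2000 each)
t_q = t_v = t_y = 20          # toy core dimensions (illustrative only)

# Tucker decomposition: factor matrices W_q, W_v, W_y and a core tensor.
W_q = rng.normal(0, 0.01, (d_q, t_q))
W_v = rng.normal(0, 0.01, (d_v, t_v))
core = rng.normal(0, 0.01, (t_q, t_v, t_y))
W_y = rng.normal(0, 0.01, (t_y, d_y))

def tucker_fusion(q, v):
    """Project q and v, contract with the core, project to the output space."""
    z = np.einsum("i,j,ijk->k", q @ W_q, v @ W_v, core)
    return z @ W_y

full_params = d_q * d_v * d_y                               # 8 billion
tucker_params = W_q.size + W_v.size + core.size + W_y.size  # 128,000 here
y = tucker_fusion(rng.normal(size=d_q), rng.normal(size=d_v))
print(y.shape, full_params, tucker_params)
```

The einsum line is exactly y_k = Σ_{i,j} core[i,j,k] (qW_q)_i (vW_v)_j, so the model keeps the bilinear form while the parameter count drops from billions to the sum of the factor and core sizes.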

  36.–39. VQA: bilinear fusion (build slides: Tucker decomposition equation figures)

  40. VQA: bilinear fusion Other ways of structuring the tensor of parameters: Compact Bilinear Pooling (MCB); Low-Rank Bilinear Pooling (MLB); Tucker Decomposition (MUTAN). Ben-younes H.*, Cadène R.*, Thome N., Cord M., MUTAN: Multimodal Tucker Fusion for Visual Question Answering, ICCV 2017

  41. VQA: bilinear fusion [AAAI 2019]

  42. VQA: BLOCK fusion [AAAI 2019] (block-term decomposition figure: factors A, B, C)

  43. VQA: bilinear fusion Is the lady with the purple fur wearing glasses ?

  44. Reasoning in VQA

  45. VQA: reasoning What is reasoning (for VQA)? • Attentional reasoning • Relational reasoning • Iterative reasoning • Compositional reasoning

  46. VQA: reasoning What is reasoning (for VQA)? • Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image • Relational reasoning • Iterative reasoning • Compositional reasoning

  47. VQA: attentional reasoning Idea: focus only on the parts of the image relevant to Q ● Each region is scored according to the question ● Representation = weighted sum of all region embeddings. Example: What is sitting on the desk in front of the boys?
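Those two bullets translate directly into code: score every region against the question, softmax the scores, and take the weighted sum. A minimal numpy sketch; the simple bilinear scoring `W` is a stand-in for the MUTAN fusion the talk uses, and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_regions, d_v, d_q = 49, 2048, 2400   # assumed 7x7 ResNet grid, question size

regions = rng.normal(size=(n_regions, d_v))   # one embedding per image region
q = rng.normal(size=d_q)                      # question embedding

# Hypothetical scoring weights (stand-in for a learned fusion module).
W = rng.normal(0, 0.01, (d_q, d_v))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, regions):
    """Score each region against the question, return weights and the
    attention-weighted sum of region embeddings."""
    scores = regions @ (q @ W)        # one scalar per region
    alpha = softmax(scores)           # attention weights, non-negative, sum to 1
    return alpha, alpha @ regions     # weighted sum of all region embeddings

alpha, v_att = attend(q, regions)
print(alpha.sum(), v_att.shape)
```

The attended vector `v_att` then replaces the globally pooled image feature in the fusion-and-classify pipeline of slide 29.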

  48. VQA: attentional reasoning (diagram: image → ResNet; "What is sitting on the desk in front of the boys?" → GRU)

  49. VQA: attentional reasoning (diagram: image → ResNet; "What is sitting on the desk in front of the boys?" → GRU)

  50. VQA: attentional reasoning Attention mechanism with MUTAN fusion: image → ResNet; "What is sitting on the desk in front of the boys?" → GRU; the attention mechanism scores each region with a MUTAN fusion, and a second MUTAN fusion predicts "laptop". An attentional glimpse appears in most recent strategies [MLB, MCB, MUTAN, …]

  51. VQA: attentional reasoning

  52. VQA: attentional reasoning Focusing on multiple regions: multi-glimpse attention. Where is the smoke coming from?

  53. VQA: attentional reasoning with multi-glimpse attention (diagram: image → ResNet; "Where is the smoke coming from?" → GRU; one glimpse focuses on the train, another on the smoke; MUTAN fusion predicts "train")
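Multi-glimpse attention runs several attention maps in parallel and concatenates their attended vectors, so different glimpses can settle on different evidence (the train and the smoke in the slide's example). A minimal numpy sketch with random stand-in weights and assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
n_regions, d_v, d_q, n_glimpses = 49, 2048, 2400, 2   # illustrative sizes

regions = rng.normal(size=(n_regions, d_v))
q = rng.normal(size=d_q)

# One scoring map per glimpse: each glimpse has its own weights, so the
# glimpses can attend to different parts of the image.
W = rng.normal(0, 0.01, (n_glimpses, d_q, d_v))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_glimpse(q, regions):
    """Compute one attention map per glimpse and concatenate the results."""
    scores = np.einsum("gqv,q,nv->gn", W, q, regions)   # (n_glimpses, n_regions)
    alpha = softmax(scores, axis=1)                     # each row sums to 1
    glimpses = alpha @ regions                          # (n_glimpses, d_v)
    return np.concatenate(glimpses)                     # (n_glimpses * d_v,)

v_att = multi_glimpse(q, regions)
print(v_att.shape)   # (4096,)
```

The concatenated vector is larger than a single glimpse, so the downstream fusion must accept the enlarged input; that is the main cost of adding glimpses.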

  54. VQA: attentional reasoning with Multi-glimpse attention

  55. VQA: attentional reasoning Evaluation on the VQA dataset: best MUTAN score of 67.36% on test-std; human performance is about 83% on this dataset. The winner of the VQA Challenge at CVPR 2017 (and CVPR 2018) integrates adaptive grid selection from an additional region-detection learning process

  56. VQA: attentional reasoning

  57. VQA: reasoning What is reasoning (for VQA)? • Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image • Relational reasoning: object detection + mutual relationships (spatial, semantic, …), merging both with Q • Iterative reasoning • Compositional reasoning

  58. Bottom-up and Relational reasoning Determine the answer using relevant objects and relationships
