Multi-modal Reasoning: Bridging Vision and Language, Heming Zhang

Multi-modal Reasoning: Bridging Vision and Language Heming Zhang Media Communications Lab University of Southern California

Personal Assistant – AI Touchstone 2

The mass of an electron is approximately 9.109 × 10⁻³¹ kg. 3

Has Personal Assistant Come True? Illustration by Fiona Carswell 4

Vision & Language in MCL • Vision: object detection, semantic segmentation, video segmentation • Vision & Language: visual dialogue, vision & language navigation, multi-modal machine translation • Language: text classification, language graph learning 5

What is Visual Dialogue? • Dialogue that is grounded in vision A man wearing leather jacket standing next to a motorcycle Is it colored leather? Yes, it is. What color is his leather? 6

Why Visual Dialogue? • Aiding visually impaired users: "Daisy just sent you some pictures of her new house." "Great, is the living room large?" "Yes, there is a large living room with a fireplace." • Aiding analysts: "Did anyone pass the gate yesterday?" "Yes, 45 instances logged on camera." "Were any of them carrying a cardboard box?" 7

From Information Point of View Image Text 8

Previous Work • Encoder-decoder framework (Das et al., 2017; Lu et al., 2017; Wu et al., 2018, etc.): inputs Q_t, I, H_t → Encoder → embedding E_t → Decoder → answer Â_t – Encoder • Embeds the image, question and dialogue history – Decoder • Decodes the embedding into answers in natural language 9

Previous multi-modal encoders • Lu et al., 2017, Wu et al., 2018, etc. – Use one input as guidance to compute attention on another input 10

Attention • Weighted sum over features (diagram labels: features h, weights w, output c) 11
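The weighted sum on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function names `softmax` and `attend`, the use of a softmax to normalise the weights, and the toy feature vectors are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Return (weights w, output c) with c = sum_k w_k * h_k."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * h[d] for w, h in zip(weights, features))
               for d in range(dim)]
    return weights, context

# Three hypothetical 2-d feature vectors h_k, all given the same score,
# so the output is simply their average.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(features, [0.0, 0.0, 0.0])
```

In a trained model the scores would come from learned parameters; here they are passed in directly to keep the weighted sum itself visible.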

Attention with Guidance (diagram labels: guidance g, features h, weights w, output c) 12
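A rough sketch of attention with guidance: the guidance vector g decides how much weight each feature h_k receives. The dot-product scoring used here is an assumption chosen for brevity (the cited encoders may score with a learned network instead), and `guided_attention` is a hypothetical name.

```python
import math

def guided_attention(guide, features):
    """Weight each feature h_k by its similarity to the guidance g.

    Scoring by dot product is an assumption of this sketch.
    Returns (weights w, output c).
    """
    scores = [sum(g * x for g, x in zip(guide, h)) for h in features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    context = [sum(w * h[d] for w, h in zip(weights, features))
               for d in range(dim)]
    return weights, context

# A guidance vector pointing along the first axis pulls nearly all of
# the weight onto the first feature.
weights, context = guided_attention([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With an all-zero guidance vector every score is equal, so the function falls back to a plain average of the features.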

Previous multi-modal encoders • Lu et al., 2017, Wu et al., 2018, etc. – Use one input as guidance to compute attention on another input – Process inputs sequentially in pre-defined orders 13

Encoders with Sequential Attention • Lu et al. 2017 What color is his leather? A man wearing leather jacket standing next to a motorcycle Is it colored leather? Yes, it is. 14

Encoders with Sequential Attention • Wu et al., 2018 What color is his leather? A man wearing leather jacket standing next to a motorcycle Is it colored leather? Yes, it is. 15

Previous multi-modal encoders • Cannot adapt to different scenarios • How many people are there in the image? • Is there anything else on the table? 16

Adaptive reasoning [Diagram: features F_Q, F_I and F_H each pass through guided attention, giving f_{Q,i}, f_{I,i} and f_{H,i}; Comprehension produces f_{QIH,i} and Exploration produces f_{g,i}; Reasoning checks whether i = i_max: if no, an RNN feeds the next step; if yes, E is output] 17
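The loop in this diagram might look roughly like the sketch below. It is only an illustration of the control flow: the dot-product attention, the averaging used for the comprehension step, the tanh update standing in for the RNN, the fixed `i_max`, and all function names are assumptions, not details taken from the slides.

```python
import math

def guided_attention(guide, features):
    """Attend over `features` using `guide` (dot-product scores assumed)."""
    scores = [sum(g * x for g, x in zip(guide, h)) for h in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(features[0])
    return [sum((e / total) * h[d] for e, h in zip(exps, features))
            for d in range(dim)]

def adaptive_reasoning(F_Q, F_I, F_H, i_max=2):
    """Repeat attention over question, image and history for i_max steps."""
    dim = len(F_Q[0])
    guide = [0.0] * dim  # assumed start: zero guidance, uniform attention
    state = [0.0] * dim
    for _ in range(i_max):
        f_Q = guided_attention(guide, F_Q)
        f_I = guided_attention(guide, F_I)
        f_H = guided_attention(guide, F_H)
        # Comprehension (assumed): fuse the three attended features.
        f_QIH = [(q + v + h) / 3.0 for q, v, h in zip(f_Q, f_I, f_H)]
        # Stand-in for the RNN update of the reasoning state.
        state = [math.tanh(s + f) for s, f in zip(state, f_QIH)]
        # Exploration (assumed): the new state guides the next step.
        guide = state
    return state  # plays the role of the final embedding E

feats = [[1.0, 0.0], [0.0, 1.0]]
E = adaptive_reasoning(feats, feats, feats, i_max=1)
```

The point of the sketch is the ordering: no input is attended in a fixed, pre-defined sequence; each step re-attends all three inputs under guidance produced by the previous step.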

Attention Visualization Is the little boy on a beach? How old does he look? 18

Attention Visualization What color hair does he have? How old does he look? 19

Attention Visualization What color hair does he have? Is he dressed for summer? 20

Attention Visualization What color is the airplane? Time step i=1 21

Attention Visualization What color is the airplane? Time step i=2 22

Qualitative results 4 ducks are in a grassy island of a parking lot with their heads down 23

Qualitative results • Caption: 4 ducks are in a grassy island of a parking lot with their heads down • Question: Any grass? Human: Yes. Ours: Yes, a lot of grass. • Question: What color grass? Human: It is green with brownish dead spots. Ours: Green and brown. 24

Qualitative results • Caption: 4 ducks are in a grassy island of a parking lot with their heads down • Question: Any vehicles on the lot? Human: Yes. Ours: Yes, there are a lot of cars. • Question: Do they look new or old? Human: They look new. Ours: They look new. 25

IJCAI 2019 Generative Visual Dialogue System via Weighted Likelihood Estimation Heming Zhang, Shalini Ghosh, Larry Heck, Stephen Walsh, Junting Zhang, Jie Zhang, C.-C. Jay Kuo Thursday Aug. 15th 09:30 - 10:30 AM CV|LV - Language and Vision 2 (2501-2502) 26

Vision-grounded Problems Revisited • What is visual dialogue? • Dialogue that is grounded in vision A man wearing leather jacket standing next to a motorcycle Is it colored leather? Yes, it is. What color is his leather? 27

Vision-grounded Problems Revisited • From information point of view Image Text 28

Vision-grounded Problems Revisited • No alignment between image & text manifolds • Image: SIFT, BoW, CNN, … • Text: RNN, Transformer, … 29

Bridging Vision & Language ? Image Text 30

Bridging Vision & Language • Manifold alignment Image Joint Text 31

Bridging Vision & Language • Usually one-to-one mapping in other manifold alignment problems – E.g. machine translation (English, joint space, Dutch): "I like you" and "Ik hou van jou"; "I take you with me" and "Ik neem je mee" 32

Bridging Vision & Language • Alignment between vision and language – No one-to-one mapping Image Joint Text 33

Attention Revisited • Weighted sum over features (diagram labels: features h, weights w, output c) 34

Bridging Vision & Language • Alignment by attention – Joint learning of attention and alignment Image Joint Text 35
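One way to picture alignment by attention is as a soft alignment matrix between words and image regions that already live in a joint space. The sketch below assumes dot-product similarity and hand-picked toy vectors; `soft_alignment` is a hypothetical helper, and in a real system the embeddings and the attention would be learned jointly.

```python
import math

def soft_alignment(words, regions):
    """Row k holds the attention of word k over all image regions.

    Both inputs are assumed to be embedded in the same joint space.
    """
    matrix = []
    for w in words:
        scores = [sum(a * b for a, b in zip(w, r)) for r in regions]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix

# Toy joint space: the first word matches two regions equally well,
# so its attention is split between them rather than mapped one-to-one.
words = [[5.0, 0.0], [0.0, 5.0]]
regions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
alignment = soft_alignment(words, regions)
```

Because each row is a distribution rather than a single index, one word can align with several regions, which is the property a one-to-one mapping lacks.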

Related Research in MCL • Vision: object detection, semantic segmentation, video segmentation • Vision & Language: visual dialogue, vision & language navigation, multi-modal machine translation • Language: text classification, language graph learning 36

Vision-and-language Navigation • Instructions in natural language – Walk down and turn right. • Surrounding environment in vision 37

Co-attention between Vision & Language • Leave the room into the hall and go straight. • Head towards the stairs. • Stop on the round rug next to the flowers. 38

Unsupervised Multi-modal Neural Machine Translation 39

Media Communications Lab • Lab director: Prof. C.-C. Jay Kuo • Visiting scholars • PhD students • Master students 40

Thank you for listening Visit us at http://mcl.usc.edu/ 41
