Multimodal Learning for Image Captioning and Visual Question - PowerPoint PPT Presentation

Multimodal Learning for Image Captioning and Visual Question Answering
Xiaodong He
Deep Learning Technology Center, Microsoft Research
UC Berkeley, April 7th, 2016

Knowledge (Freebase), Text, Vision

Text: Barack Obama is an American politician serving as the 44th President of the United States. Born in Honolulu, Hawaii, … in 2008, he defeated Republican nominee and was inaugurated as president on January 20, 2009. (Wikipedia.org)

Vision: http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html

Image Captioning (one step from perception to cognition)

Describe objects, attributes, and relationships in an image, in a natural language form.

Example captions:
• a man holding a tennis racquet on a tennis court
• the man is on the tennis court playing a game

Two entries tied for 1st place at the COCO 2015 Caption Challenge

• Adopted the encoder-decoder framework from machine translation. Popular: Google, Montreal, Stanford, Berkeley. [Vinyals, Toshev, Bengio, Erhan, “Show and Tell: A Neural Image Caption Generator,” CVPR, June 2015]
• Visual concept detection => caption candidates generation => deep semantic ranking. A compositional framework can potentially exploit non-paired image-caption data more effectively. [Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, Mitchell, Platt, Zitnick, Zweig, “From Captions to Visual Concepts and Back,” CVPR, June 2015]

[Figure: words shown in the figure: sitting, cabinets, room, wooden, kitchen, stove, sink, floor.]

Repeat to generate 500 candidates. [Fang, et al., CVPR 2015]

Huang, He, Gao, Deng, Acero, Heck, “Learning Deep Structured Semantic Model for Web Search,” CIKM, 2013

Model architecture (top to bottom):
• Semantic layer: y (500)
• Semantic projection matrix: W_s
• Max pooling layer: v (500), produced by a max pooling operation over word positions
• Convolutional layer: h_t (500 per position)
• Convolution matrix: W_c
• Word hashing layer: f_t (15K per word)
• Word hashing matrix: W_f
• Word sequence: x_t = <s> w_1 w_2 … w_T <s>, e.g. "a man … bench"
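The slides do not spell out the word hashing step. In the cited DSSM work it is based on letter trigrams, so the following is a minimal sketch assuming that scheme; the function names `letter_trigrams` and `hash_word` are illustrative and not taken from the slides.

```python
from collections import Counter


def letter_trigrams(word):
    """Split a word into letter trigrams, using '#' as a word-boundary marker."""
    marked = "#" + word.lower() + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def hash_word(word):
    """Sparse count vector over letter trigrams (trigram -> count).

    This plays the role of f_t for one word; a real system would map each
    trigram to a fixed index in a vocabulary of trigrams (assumption).
    """
    return Counter(letter_trigrams(word))


if __name__ == "__main__":
    print(letter_trigrams("man"))
    print(hash_word("banana"))
```

Because the vector counts sub-word units rather than whole words, unseen or misspelled words still map to a non-empty representation.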

– What does the model learn at the convolutional layer?

Capture the local context-dependent word sense.
• Learn one embedding vector for each local context-dependent word:

$h_t = W_c \times [f_{t-1}, f_t, f_{t+1}]$

Example contexts in the semantic space: auto body repair, car body kits, car body shop, auto body part.

The similarity between different “body” within contexts (cosine similarity to "car body shop"):

Context                   Cosine similarity
car body kits             0.698   (high similarity)
auto body repair          0.578
auto body parts           0.555
wave body language        0.301
calculate body fat        0.220
forcefield body armour    0.165   (low similarity)
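The convolution formula on this slide can be sketched in a few lines. This is an illustration only: it zero-pads the sequence boundaries (an assumption; the architecture slide shows <s> markers instead), applies no nonlinearity because the slide's formula shows none, and uses tiny made-up dimensions instead of 15K and 500.

```python
def conv_layer(f, W_c):
    """Compute h_t = W_c x [f_{t-1}, f_t, f_{t+1}] for every position t.

    f   : list of T word-hashing vectors, each of length d_in
    W_c : list of d_out rows, each of length 3 * d_in
    Positions before the first and after the last word are zero vectors
    (assumption made for this sketch).
    """
    d_in = len(f[0])
    zero = [0.0] * d_in
    padded = [zero] + f + [zero]
    h = []
    for t in range(1, len(padded) - 1):
        # Concatenate the previous, current, and next word vectors.
        window = padded[t - 1] + padded[t] + padded[t + 1]
        h.append([sum(w * x for w, x in zip(row, window)) for row in W_c])
    return h


if __name__ == "__main__":
    f = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # T = 3 words, d_in = 2
    W_c = [[1, 1, 1, 1, 1, 1],                    # sums the whole window
           [0, 0, 1, 0, 0, 0]]                    # copies f_t[0]
    print(conv_layer(f, W_c))
```

Each output vector depends on a word and its two neighbours, which is why the same word ("body") gets a different vector in different contexts.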

Max pooling over the local features $h_1, h_2, \dots, h_T$ gives the global intent vector $v$:

$v(i) = \max_{t=1,\dots,T} h_t(i), \quad i = 1, \dots, 300$

Words that win the most active neurons at the max-pooling layers: auto body repair cost calculator software

Usually, those are salient words containing clear intents/topics.
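A small sketch of the max pooling step and of counting which word "wins" each dimension. The feature values below are made up for illustration, and the helper names are hypothetical.

```python
def max_pool(h):
    """v(i) = max over positions t of h_t(i)."""
    return [max(column) for column in zip(*h)]


def winning_words(words, h):
    """For each word, count the dimensions in which its h_t attains the maximum.

    Ties go to the earliest position (a choice made for this sketch).
    Returns a list of (word, wins) pairs in the original word order.
    """
    wins = [0] * len(words)
    for column in zip(*h):
        wins[column.index(max(column))] += 1
    return list(zip(words, wins))


if __name__ == "__main__":
    words = ["auto", "body", "repair", "cost"]
    h = [[0.1, 0.9, 0.2],     # made-up local features, one row per word
         [0.8, 0.3, 0.7],
         [0.5, 0.4, 0.1],
         [0.2, 0.1, 0.0]]
    print(max_pool(h))
    print(winning_words(words, h))
```

Words with many wins contribute most to the pooled vector, which matches the slide's observation that salient words dominate the max-pooling layer.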

[Figure: two plots comparing CDSSM d=300, CDSSM d=1000, and DSSM d=300. One shows Mean Reciprocal Rank % (ranking among 5000 candidates on the 5K validation set); the other shows Harmonic Mean Rank (ranking among 5000 candidates on the 5K val set).]
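For reference, the two metrics named in these plots can be computed as follows. This is a generic sketch of the standard definitions, not the evaluation code behind the slides; the ranks used in the example are made up.

```python
def rank_of_truth(scores, true_index):
    """1-based rank of the true candidate: 1 + number of strictly higher scores."""
    return 1 + sum(1 for s in scores if s > scores[true_index])


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def harmonic_mean_rank(ranks):
    """Harmonic mean of the ranks (lower is better); equals 1 / MRR."""
    return len(ranks) / sum(1.0 / r for r in ranks)


if __name__ == "__main__":
    ranks = [1, 2, 4, 4]                      # made-up ranks for four queries
    print(mean_reciprocal_rank(ranks))
    print(harmonic_mean_rank(ranks))
```

Since the harmonic mean rank is the reciprocal of the mean reciprocal rank, the two plots carry the same information on different scales.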

Turing Test Results at the MS COCO Captioning Challenge 2015

System              % of captions that pass the Turing Test   Official Rank
MSR                 32.2%                                      1st
Google              31.7%                                      1st
MSR Captivator      30.1%                                      3rd
Montreal/Toronto    27.2%                                      3rd
Berkeley LRCN       26.8%                                      5th
Human               67.5%                                      --

Still a big gap!

Other groups: Baidu/UCLA, Stanford, Tsinghua, etc.

System                    BLEU    % Better or Equal to Human
Model 1: MELM + DMSM      25.7    34.0%
Model 2: MRNN             25.7    29.0%

Human judges were shown the generated caption and a human caption, and chose which was “better”, or equal.

Devlin, Cheng, Fang, Gupta, Deng, He, Zweig, and Mitchell, “Language Models for Image Captioning: The Quirks and What Works,” ACL 2015

Example:
• MELM+DMSM: “A plate with a sandwich and a cup of coffee”
• MRNN: “A close up of a plate of food” (more generic)


[Diagram: system pipeline. Components shown: ConvNets features vector, visual concepts, celebrity model, landmark model, language model, DMSM, confidence (high / low).]

• High confidence: "A small boat in Ha Long Bay"
• Low confidence: "This image contains: water, boat, lake, mountain, etc."

[Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, Chris Sienkiewicz, submitted to CVPR Deep Vision 2016]
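The diagram suggests that the system outputs a full caption when confidence is high and falls back to listing visual concepts when it is low. A minimal sketch of that fallback, assuming a single scalar confidence and a hypothetical threshold of 0.5 (the slides do not state the actual rule or threshold):

```python
def describe(caption, confidence, concepts, threshold=0.5):
    """Return the caption if it is confident enough, else list the concepts.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    if confidence >= threshold:
        return caption
    return "This image contains: " + ", ".join(concepts)


if __name__ == "__main__":
    concepts = ["water", "boat", "lake", "mountain"]
    print(describe("A small boat in Ha Long Bay", 0.8, concepts))
    print(describe("A small boat in Ha Long Bay", 0.2, concepts))
```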

[He, Zhang, Ren, Sun, 2015]

The deep multimodal semantic model [Fang, et al., CVPR 2015]

Semantic space: The overall semantics of a caption will also be represented by a vector in this space. If these two vectors are close to each other, then the caption is a good match for the image. Otherwise, it is not a matching caption.

[Diagram: two networks with hidden layers H1, H2, H3 and weight matrices W1, W2, W3, W4, taking inputs labeled t1 and s. Text input: "a man holding a tennis racquet on a tennis court". Image input: raw image pixels, then convolution/pooling, then fully connected layers, giving the image feature.]

[Huang, He, Gao, Deng et al., 2013] [He, Zhang, Ren, Sun, 2015]
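The matching rule described here can be sketched as a similarity ranking. The example assumes cosine similarity as the closeness measure (the measure the slides report for CDSSM) and uses made-up two-dimensional vectors; the function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs):
    """Sort candidate captions by similarity to the image vector, best first.

    caption_vecs: dict mapping caption text -> caption vector.
    """
    scored = [(cosine(image_vec, vec), text) for text, vec in caption_vecs.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    image_vec = [1.0, 0.0]                      # made-up image vector
    candidates = {
        "a man holding a tennis racquet on a tennis court": [2.0, 0.0],
        "a plate of food": [0.0, 3.0],
        "a man on a court": [1.0, 1.0],
    }
    for score, text in rank_captions(image_vec, candidates):
        print(round(score, 3), text)
```

The top-ranked candidate is the caption whose vector lies closest to the image vector in the shared semantic space.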

[Guo, Zhang, Hu, He, Gao, 2016]

[Diagram: two networks with hidden layers H1, H2, H3 and weight matrices W1, W2, W3, W4, taking inputs labeled t1 and s: a caption ("a man holding a tennis racquet on a tennis court") and an image.]

System               Excellent   Good    Bad     Embarrassing
Fang et al., 2015    40.6%       26.8%   28.8%   3.8%
New system           51.8%       23.4%   22.5%   2.4%

Human evaluation on 1000 random samples of the COCO test set.

System               Excellent   Good    Bad     Embarrassing
Fang et al., 2015    12.0%       13.4%   63.0%   11.6%
New system           25.4%       24.1%   45.3%   5.2%

Human evaluation on the Instagram test set, which contains 1380 random images that we scraped from Instagram.

Conf. score   Excellent   Good    Bad     Embarrassing
Mean          0.59        0.51    0.26    0.20
Std dev       0.21        0.23    0.21    0.19

Example captions, Fang2015 vs. ours (one pair per image):

Fang2015: a black and white photo of a man wearing a hat
Ours:     a man wearing a bow tie looking at the camera

Fang2015: a man holding a baseball bat at a ball
Ours:     a man swinging a baseball bat in front of a crowd

Fang2015: a view of a sunset over water
Ours:     a view of a sunset in the ocean

Fang2015: a dog sitting on top of a grass covered field
Ours:     a dog sitting in the grass

Fang2015: a man holding a stop sign
Ours:     a man holding a stop sign

Fang2015: a man wearing a suit and tie
Ours:     Ian Somerhalder wearing a suit and tie

Fang2015: a man on a skateboard
Ours:     this picture is about photo

Fang2015: a man taking a picture in front of a mirror
Ours:     an picture about person

Fang2015: a woman standing in front of a christmas tree
Ours:     a woman standing next to a window

Fang2015: a colorful kite flying in the air
Ours:     a table topped with a kite

Fang2015: a black and white photo of a man wearing a hat
Ours:     a man posing for a picture

Fang2015: a couple of people at night
Ours:     a fire hydrant that is lit up at night

Fang2015: a woman sitting on a couch
Ours:     this picture is about person

Fang2015: a pair of scissors sitting on top of a table
Ours:     a bunch of different items

Fang2015: a woman holding a red umbrella
Ours:     the image is about person

Fang2015: a group of pictures on the wall
Ours:     this picture seems contain text

Fang2015: two women standing in front of a cake
Ours:     a woman standing in front of a mirror

Fang2015: a woman sitting on a bench
Ours:     a woman posing for a picture

Fang2015: a man holding a baseball bat on a field
Ours:     a woman sitting on a bench

Fang2015: a black and white photo of a woman brushing her hair
Ours:     a boy standing in front of a building

Fang2015: a person holding a cell phone
Ours:     a hand holding a cell phone

Fang2015: a man and a woman wearing a tie
Ours:     a couple posing for a photo

Fang2015: a man holding a teddy bear
Ours:     a picture about table

Fang2015: a pair of scissors
Ours:     the image is about clothing

Cognitive Services http://CaptionBot.ai

When Jen-Hsun Huang was giving a keynote showing off a GPU-powered VR visit to Mt. Everest, here is what our CaptionBot had to say.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, "Stacked Attention Networks for Image Question Answering," CVPR 2016 (oral)

Big improvement


Captions:
• a herd of elephants standing next to a man
• a herd of elephants standing next to Obama

Knowledge: Obama is the president, from the Democratic Party, whose competitor is the Republican Party, whose mascot is the elephant.

Image credit: "Obama is chased by his Republican competitors", http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html

Questions:
• Who is that person?
• What is behind that man?
• Why are these elephants chasing him?
