GPU Technology Conference (GTC) 2017 @San Jose, May 11th
Hideki Nakayama
The University of Tokyo, Graduate School of IST
Hideki Nakayama
◦ Assistant Professor @The University of Tokyo
◦ AI Research Center
Research topics:
◦ Computer Vision
◦ Natural Language Processing
◦ Deep Learning
◦ Large-scale image tagging
◦ Fine-grained recognition
◦ Wearable interface
◦ Medical image analysis
◦ Representation learning for vision
◦ Object discovery
◦ Vision-based recommendation
◦ Automatic question generation
◦ Word representation learning
◦ Flexible attention mechanism
(Figure: example caption "a cat is trying to eat the food")
◦ Image/video caption generation
◦ Multimodal deep models
◦ Multimodal machine translation
1. Background: cross-modal encoder-decoder learning with supervised data
2. Proposed idea: pivot-based learning
3. Zero-shot learning of machine translation system using image pivots
Goal: to learn a function that transforms data in one modality x (source) into another modality y (target).
How: statistical estimation from a lot of paired examples $\{(x_i, y_i)\}_{i=1}^{N}$.
(Figure: e.g., X = images of {cat, dog, bird}, Y = labels; $f(x)$ outputs scores such as cat 0.99, dog 0.01, bird 0.01.)
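As a concrete illustration, here is a minimal sketch of fitting such a function $f$ to paired examples with a neural network; the feature dimension, class set, optimizer settings, and synthetic data are all assumptions, not details from the talk.

```python
import torch
import torch.nn as nn

# Hypothetical setup: x = 2048-d image features, y = one of {cat, dog, bird}.
f = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 3))
opt = torch.optim.Adam(f.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Synthetic paired examples {(x_i, y_i)}, i = 1..N, standing in for real data.
N = 256
x, y = torch.randn(N, 2048), torch.randint(0, 3, (N,))

for step in range(100):           # statistical estimation from the pairs
    opt.zero_grad()
    loss = loss_fn(f(x), y)       # how well f(x_i) predicts y_i
    loss.backward()
    opt.step()

probs = f(x[:1]).softmax(dim=-1)  # e.g. (cat 0.99, dog 0.01, bird 0.01)
```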
Derive a hidden multimodal representation (vector) that aligns the coupled source and target data.
(Figure: X = images, Y = texts, linked through a shared Multimodal Space; image encoder (e.g., convolutional neural network) and text encoder/decoder (e.g., recurrent neural network); example captions: "A brown dog in front of a door.", "A black and white cow standing in a field.")
Prediction can be realized by encoding an input into the multimodal space and then decoding it: $\hat{y} = \mathrm{decode}(\mathrm{encode}(x))$.
(Figure: image encoder (convolutional neural network) maps $x$ into the multimodal space; text encoder/decoder (e.g., recurrent neural network) produces $\hat{y}$, e.g., "A black dog sitting on grass.")
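A toy sketch of this encode-then-decode pipeline follows; the linear image encoder, GRU decoder cell, vocabulary size, and greedy loop are illustrative stand-ins (a real system would use a CNN encoder and feed each predicted word's embedding back into the decoder).

```python
import torch
import torch.nn as nn

EMB = 256                             # multimodal space dimension (assumed)
image_encoder = nn.Linear(2048, EMB)  # stand-in for a CNN image encoder
decoder_rnn = nn.GRUCell(EMB, EMB)    # stand-in for an RNN text decoder
word_proj = nn.Linear(EMB, 10000)     # 10k-word vocabulary (assumed)

x = torch.randn(1, 2048)              # a precomputed image feature
h = image_encoder(x)                  # 1) encode into the multimodal space
inp = torch.zeros(1, EMB)             # placeholder <BOS> embedding
words = []
for t in range(20):                   # 2) decode greedily into caption words
    h = decoder_rnn(inp, h)
    words.append(word_proj(h).argmax(dim=-1).item())
```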
(Figure: example image-caption results from [Kiros et al., 2014].)
Reference: R. Kiros et al., "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models", TACL, 2015.
(Figure: further example results from [Kiros et al., 2014].)
As long as we have enough parallel data, we can now build many attractive applications:
◦ Image recognition / captioning
◦ Machine translation (私は学生です。 → "I am a student.")
◦ Multimedia synthesis ("This is a dog." → an image)
(Figure: each application as a cross-modal mapping from X to Y through the multimodal space.)
Supervised parallel data (X, Y) is not always available in real situations! Annotating data is very expensive:
◦ 1M parallel sentences (machine translation)
◦ 15M images in 10K categories (object recognition)
◦ etc.
What can we do when NO direct parallel data is available?
Existing workarounds:
◦ Semi-supervised learning: exploit unlabeled data $X_0$ and $Y_0$ in addition to the labeled pairs (X, Y).
◦ Transfer learning: exploit labeled pairs (X′, Y′) from another domain to help the target domain (X, Y).
(Figure: diagrams of the two setups.)
Learn a multimodal representation of X and Y from indirect data (X, Z) and (Z, Y), where Z is the "pivot" (a third modality).
Assumption: Z is a "common" modality (e.g., image, English text), and therefore (X, Z) and (Z, Y) are relatively easy to obtain.
(Figure: X and Y connected through pivot Z in a shared Multimodal Space via the (X, Z) and (Z, Y) pairs.)
1. Background: cross-modal encoder-decoder learning with supervised data
2. Proposed idea: pivot-based learning
3. Zero-shot learning of machine translation system
[1] R. Funaki and H. Nakayama, "Image-mediated Learning for Zero-shot Cross-lingual Document Retrieval", In Proc. of EMNLP, 2015.
[2] H. Nakayama and N. Nishida, "Toward Zero-resource Machine Translation by Multimodal Embedding with Multimedia Pivot", Machine Translation Journal, 2017 (in press).
Typical approach: Japanese–English parallel documents are hard to obtain…
Our approach (image pivot): we can find abundant monolingual documents with images (e.g., blogs, SNS, web news).
Training data: source–pivot pairs $\mathcal{T}^s = \{(x_k^s, z_k^s)\}_{k=1}^{N_s}$ (Japanese text, image) and pivot–target pairs $\mathcal{T}^t = \{(z_k^t, y_k^t)\}_{k=1}^{N_t}$ (image, English text).
◦ Multimodal embedding using image pivots.
◦ Puts the target-language decoder on top of the multimodal space.
◦ End-to-end learning with neural networks (deep learning).
Training data: $\mathcal{T}^s = \{(x_k^s, z_k^s)\}_{k=1}^{N_s}$ and $\mathcal{T}^t = \{(z_k^t, y_k^t)\}_{k=1}^{N_t}$.
(Figure: source-language encoder RNN $E^s$, image encoder CNN $E^v$, and target-language encoder RNN $E^t$ all map into the multimodal space; the target-language decoder RNN $D^t$ generates text from it.)
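A sketch of these four components as PyTorch modules is shown below; every dimension, vocabulary size, and layer choice here is an assumption for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

EMB, HID, SRC_V, TGT_V = 256, 512, 20000, 20000   # assumed sizes

class TextEncoder(nn.Module):
    """RNN encoder (E^s or E^t): word ids -> multimodal vector."""
    def __init__(self, vocab):
        super().__init__()
        self.emb = nn.Embedding(vocab, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, EMB)
    def forward(self, ids):                # ids: (batch, seq_len)
        _, h = self.rnn(self.emb(ids))
        return self.out(h[-1])             # (batch, EMB)

class TextDecoder(nn.Module):
    """RNN decoder D^t: multimodal vector -> target-word logits per step."""
    def __init__(self, vocab):
        super().__init__()
        self.emb = nn.Embedding(vocab, EMB)
        self.rnn = nn.GRU(EMB, EMB, batch_first=True)
        self.out = nn.Linear(EMB, vocab)
    def forward(self, v, ids):             # teacher forcing on gold ids
        h0 = v.unsqueeze(0)                # vector as initial hidden state
        o, _ = self.rnn(self.emb(ids), h0)
        return self.out(o)                 # (batch, seq_len, vocab)

E_s, E_t = TextEncoder(SRC_V), TextEncoder(TGT_V)
E_v = nn.Linear(2048, EMB)                 # image encoder E^v on CNN features
D_t = TextDecoder(TGT_V)
```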
Training step 1: align source-language texts and images in the multimodal space using $\mathcal{T}^s = \{(x_k^s, z_k^s)\}_{k=1}^{N_s}$, e.g., an image $z_k^s$ paired with the Japanese caption $x_k^s$ = 白い壁の隣に座っている小さな犬。("A small dog sitting next to a white wall.")
Pair-wise rank loss [Frome+, NIPS'13], where $s(\cdot,\cdot)$ is a similarity score function and $\alpha$ is a margin (hyperparameter):
$$L^s = \sum_{k} \sum_{i \neq k} \max\left\{0,\ \alpha - s\!\left(E^v(z_k^s), E^s(x_k^s)\right) + s\!\left(E^v(z_k^s), E^s(x_i^s)\right)\right\}$$
For each image, the paired text must outscore every negative (not paired) text by at least the margin.
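A sketch of this pair-wise rank loss over a mini-batch follows; it uses cosine similarity for $s(\cdot,\cdot)$ and in-batch negatives, both of which are common practical choices rather than details taken from the talk.

```python
import torch
import torch.nn.functional as F

def rank_loss(img_vecs, txt_vecs, alpha=0.2):
    """Pair-wise rank loss: the paired text of each image z_k must score
    higher (by margin alpha) than every non-paired text x_i in the batch."""
    img = F.normalize(img_vecs, dim=-1)     # cosine similarity as s(.,.)
    txt = F.normalize(txt_vecs, dim=-1)
    scores = img @ txt.t()                  # (N, N): s(E^v(z_k), E^s(x_i))
    pos = scores.diag().unsqueeze(1)        # s(E^v(z_k), E^s(x_k))
    hinge = (alpha - pos + scores).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return hinge.masked_fill(mask, 0.0).sum()   # drop the i == k terms

# Usage with the modules sketched earlier (assumed in scope):
# L_s = rank_loss(E_v(z_img), E_s(x_src))
```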
Training step 2: align target-language texts and images in the multimodal space using $\mathcal{T}^t = \{(z_k^t, y_k^t)\}_{k=1}^{N_t}$, e.g., an image $z_k^t$ paired with the English caption $y_k^t$ = "A black dog sitting on grass next to a sidewalk."
$$L^t = \sum_{k} \sum_{i \neq k} \max\left\{0,\ \alpha - s\!\left(E^v(z_k^t), E^t(y_k^t)\right) + s\!\left(E^v(z_k^t), E^t(y_i^t)\right)\right\}$$
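In the sketch above, this step is the same `rank_loss` applied with the target-language encoder, e.g. `L_t = rank_loss(E_v(z_img), E_t(y_tgt))`; only the encoder and the training pairs change, which is why a single loss implementation covers both alignment steps.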
Training step 3: feed the images in the target-side data $\mathcal{T}^t$ forward and decode them into texts with $D^t$, using a cross-entropy loss against the paired captions $y_k^t$ (e.g., "A black dog sitting on grass next to a sidewalk.").
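A sketch of this decoding loss, assuming the decoder interface from the earlier sketch and teacher forcing on the gold caption (the `pad_id` convention and the shifted inputs are assumptions):

```python
import torch.nn.functional as F

# y_in  = caption ids shifted right: <BOS> y_1 ... y_{T-1}
# y_out = gold ids to predict:       y_1 ... y_T <EOS>
def decoder_loss(decoder, encoder, inputs, y_in, y_out, pad_id=0):
    logits = decoder(encoder(inputs), y_in)        # (batch, seq, vocab)
    return F.cross_entropy(logits.flatten(0, 1),   # per-word cross-entropy
                           y_out.flatten(),
                           ignore_index=pad_id)    # skip padding positions

# For this step: decoder_loss(D_t, E_v, z_img, y_in, y_out)
```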
Training step 4: reconstruction loss on texts in the target language: encode $y_k^t$ with $E^t$ and decode it back with $D^t$. This can also improve decoder performance.
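In the same sketch, the reconstruction term just swaps the image encoder for the target-text encoder, e.g. `decoder_loss(D_t, E_t, y_tgt, y_in, y_out)`, so the decoder also learns to reproduce target sentences from their own embeddings in the multimodal space.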
Testing phase: just feed the source sentence forward through $E^s$ and $D^t$:
$$\hat{y}_q = D^t\!\left(E^s(x_q)\right)$$
We don't need images in the testing phase!
Example: 草地に立っている黒と白の牛。 → "A black and white cow standing in a grassy field."
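A sketch of this test-time composition with greedy decoding, reusing the decoder interface assumed earlier (the <BOS>/<EOS> ids and the length limit are assumptions):

```python
import torch

@torch.no_grad()
def translate(E_s, D_t, x_q, bos_id=1, eos_id=2, max_len=30):
    """Zero-shot translation: encode the source sentence (word ids, shape
    (1, seq)), then decode target words. No image is used at test time."""
    v = E_s(x_q)                               # (1, EMB) multimodal vector
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = D_t(v, ids)                   # re-decode prefix each step
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=1)
        if nxt.item() == eos_id:
            break
    return ids[0, 1:]                          # translated word ids
```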
Datasets:
◦ IAPR-TC12 [Grubinger+, 2006]: 20,000 images with English/German captions.
Example (English): "a photo of a brown sandy beach; the dark blue sea with small breaking waves behind it; a dark green palm tree in the foreground on the left; a blue sky with clouds on the horizon in the background;"
Example (German): "ein Photo eines braunen Sandstrands; das dunkelblaue Meer mit kleinen brechenden Wellen dahinter; eine dunkelgrüne Palme im Vordergrund links; ein blauer Himmel mit Wolken am Horizont im Hintergrund;"
◦ Multi30K [Elliott+, 2016]: 30,000 images with English/German captions.
We randomly split the data into our zero-shot setup and perform German-to-English translation.
Evaluation metric: BLEU scores (larger is better).
(Table: ours (zero-shot learning) vs. supervised baselines trained on a parallel corpus.)
Zero-shot results are comparable to supervised models using parallel corpora roughly 20% as large as our monolingual ones.
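For reference, corpus-level BLEU can be computed with a standard implementation such as NLTK's (shown below on toy tokens; this is not necessarily the scorer used in the paper):

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is scored against a list of reference token lists.
refs = [[["a", "black", "dog", "sitting", "on", "grass"]]]
hyps = [["a", "black", "dog", "sitting", "on", "the", "grass"]]
print(corpus_bleu(refs, hyps))   # in [0, 1]; larger is better
```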
◦ Cross-camera person identification (X = cam 1, Y = cam 2, Z = cam 3): all we need is the two losses $L_{XZ}$ and $L_{ZY}$! The data itself stays encapsulated at each camera.
◦ Recognizing other sensory data (X = depth, Y = caption, Z = image): e.g., describing a depth map as "A black sofa in a room."
• Routing "knowledge"
• Edge-side loss computation
• No need to open the data itself!
(Figure: X and Y connected through a network of pivots, with a pairwise loss L on each edge.)
Numerous new modalities arise in different types of data and different environments (≒ airports).
A "direct flight" (≒ supervised learning) for each pair is theoretically possible but practically infeasible:
◦ annotation cost, privacy, or company-side issues.
A "hub airport" (pivot) plays the key role!
(Figure: world airline routes, https://ja.wikipedia.org/wiki/航空会社)