

  1. GTC Technology Conference 2017 @San Jose, May 11th  Hideki Nakayama, The University of Tokyo, Grad School of IST

  2.  Hideki Nakayama ◦ Assistant Professor @The University of Tokyo ◦ AI research center  Research topics: ◦ Computer Vision ◦ Natural Language Processing ◦ Deep Learning

  3. Large-scale image tagging; Fine-grained recognition; Wearable interface; Medical image analysis; Representation learning; Object discovery for vision; Vision-based recommendation

  4. Automatic question generation; Word representation learning; Flexible attention mechanism

  5. Image/video caption generation (e.g., "a cat is trying to eat the food"); Multimodal deep models; Multimodal machine translation

  6.  1. Background: cross-modal encoder-decoder learning with supervised data  2. Proposed idea: pivot-based learning  3. Zero-shot learning of machine translation system using image pivots

  7.  Goal: to learn a function $f$ that transforms data in one modality $\mathbf{x}$ (source) into another modality $\mathbf{y}$ (target)  How: statistical estimation from a lot of paired examples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ ◦ Example: for a cat image $\mathbf{x}$ drawn from X = {cat, dog, bird} images, $f(\mathbf{x})$ outputs (cat: 0.99, dog: 0.01, bird: 0.01) over Y
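A minimal PyTorch sketch of this kind of supervised estimation from paired examples (the toy data, sizes, and the linear classifier are illustrative assumptions, not the talk's actual model):

```python
import torch
import torch.nn as nn

# Toy paired data {(x_i, y_i)}: 100 source vectors, labels over {cat, dog, bird}.
x = torch.randn(100, 32)              # source-modality features
y = torch.randint(0, 3, (100,))       # target labels: 0=cat, 1=dog, 2=bird

f = nn.Linear(32, 3)                  # the function f to be estimated
optimizer = torch.optim.Adam(f.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(f(x), y)           # statistical estimation from paired examples
    loss.backward()
    optimizer.step()

# After training, softmax over f(x) yields class probabilities,
# e.g. something like (cat: 0.99, dog: 0.01, bird: 0.01) for a cat input.
probs = torch.softmax(f(x[:1]), dim=-1)
```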

  8.  Derive a hidden multimodal representation (vector) that aligns the coupled source and target data ◦ Image encoder (e.g., convolutional neural network) and text encoder/decoder (e.g., recurrent neural network) both map into the Multimodal Space ◦ Example captions: "A brown dog in front of a door." / "A black and white cow standing in a field."
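A rough sketch of two encoders mapping into one shared space (PyTorch; the GRU text encoder, the use of pre-extracted CNN features, and all dimensions are assumptions for illustration):

```python
import torch
import torch.nn as nn

EMB_DIM = 256  # dimensionality of the shared multimodal space (assumed)

class ImageEncoder(nn.Module):
    """Stand-in for a CNN image encoder: projects pre-extracted features."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.proj = nn.Linear(feat_dim, EMB_DIM)
    def forward(self, feats):                       # feats: (batch, feat_dim)
        return nn.functional.normalize(self.proj(feats), dim=-1)

class TextEncoder(nn.Module):
    """Stand-in for an RNN text encoder: last GRU state projected to the space."""
    def __init__(self, vocab_size=10000, word_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.rnn = nn.GRU(word_dim, EMB_DIM, batch_first=True)
    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        _, h = self.rnn(self.embed(token_ids))
        return nn.functional.normalize(h[-1], dim=-1)

# Both encoders land in the same EMB_DIM-dimensional space, so paired
# image/caption vectors can later be aligned by a similarity-based loss.
img_vec = ImageEncoder()(torch.randn(4, 2048))
txt_vec = TextEncoder()(torch.randint(0, 10000, (4, 12)))
```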

  9.  Prediction can be realized by encoding an input $\mathbf{x}$ into the multimodal space and then decoding it into $\hat{\mathbf{y}}$ ◦ Image encoder (convolutional neural network) → Multimodal Space → text decoder (e.g., recurrent neural network) ◦ Example output: "A black dog sitting on grass."

  10. R. Kiros et al., “Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models”, TACL, 2015. [Kiros et al., 2014]

  11. [Kiros et al., 2014]

  12.  As long as we have enough parallel data, we can now build many attractive applications ◦ Image recognition / captioning (e.g., "This is a dog.") ◦ Machine translation (e.g., 私は学生です。 → "I am a student.") ◦ Multimedia synthesis

  13.  Supervised parallel data (X, Y) is not always available in real situations!  Annotating data is very expensive… ◦ 1M parallel sentences (machine translation) ◦ 15M images in 10K categories (object recognition) ◦ etc.  What can we do when NO direct parallel data is available?

  14.  Semi-supervised learning: exploit unlabeled data $X_0$ and $Y_0$ in addition to the paired (X, Y)  Transfer learning: transfer from another domain (X′, Y′) to the target domain (X, Y)

  15.  Learn multimodal representations of X and Y from indirect data (X, Z) and (Z, Y), where Z is the “pivot” (third modality)  Assumption: Z is a “common” modality (e.g., image, English text) and therefore (X, Z) and (Z, Y) are relatively easy to obtain ◦ X ↔ Z ↔ Y are connected through the Multimodal Space

  16.  1. Background: cross-modal encoder-decoder learning with supervised data  2. Proposed idea: pivot-based learning  3. Zero-shot learning of machine translation system [1] R. Funaki and H. Nakayama, “Image-mediated Learning for Zero-shot Cross-lingual Document Retrieval”, In Proc. of EMNLP, 2015. [2] H. Nakayama and N. Nishida, “Toward Zero-resource Machine Translation by Multimodal Embedding with Multimedia Pivot”, Machine Translation Journal, 2017. (in press)

  17.  Typical approach: parallel documents (Japanese ↔ English) are hard to obtain…  Our approach (image pivot): we can find abundant monolingual documents with images! ◦ E.g., blog, SNS, web news ◦ Source-side data (Japanese text, image): $T^s = \{(\mathbf{x}^s_k, \mathbf{z}^s_k)\}_{k=1}^{N^s}$ ◦ Target-side data (image, English text): $T^t = \{(\mathbf{z}^t_k, \mathbf{y}^t_k)\}_{k=1}^{N^t}$

  18. ◦ Multimodal embedding using image pivots ◦ Puts a target-language decoder on top of the multimodal space ◦ End-to-end learning with neural networks (deep learning) ◦ Training data: $T^s = \{(\mathbf{x}^s_k, \mathbf{z}^s_k)\}_{k=1}^{N^s}$ and $T^t = \{(\mathbf{z}^t_k, \mathbf{y}^t_k)\}_{k=1}^{N^t}$ ◦ Components around the multimodal space: source-language encoder RNN $E^s$, image encoder CNN $E^v$, target-language encoder RNN $E^t$, and target-language decoder RNN $D^t$
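One way such a four-component model could be organized, as a hypothetical PyTorch skeleton (single-layer GRUs, pre-extracted CNN features, and all sizes are assumptions, not the paper's exact configuration); later sketches reuse this `PivotTranslationModel`:

```python
import torch.nn as nn

EMB_DIM = 256  # shared multimodal space size (assumed)

class PivotTranslationModel(nn.Module):
    """Skeleton of the pivot-based model: E^s, E^v, E^t and D^t around one shared space."""
    def __init__(self, src_vocab=8000, tgt_vocab=8000, img_feat_dim=2048, word_dim=128):
        super().__init__()
        # E^s: source-language encoder RNN
        self.src_embed = nn.Embedding(src_vocab, word_dim)
        self.src_rnn = nn.GRU(word_dim, EMB_DIM, batch_first=True)
        # E^v: image encoder (projection on top of pre-extracted CNN features)
        self.img_proj = nn.Linear(img_feat_dim, EMB_DIM)
        # E^t: target-language encoder RNN
        self.tgt_embed = nn.Embedding(tgt_vocab, word_dim)
        self.tgt_rnn = nn.GRU(word_dim, EMB_DIM, batch_first=True)
        # D^t: target-language decoder RNN conditioned on a multimodal vector
        self.dec_rnn = nn.GRU(word_dim, EMB_DIM, batch_first=True)
        self.dec_out = nn.Linear(EMB_DIM, tgt_vocab)

    def encode_source(self, x):        # x: (batch, len) source token ids
        _, h = self.src_rnn(self.src_embed(x))
        return h[-1]
    def encode_image(self, z):         # z: (batch, img_feat_dim) CNN features
        return self.img_proj(z)
    def encode_target(self, y):        # y: (batch, len) target token ids
        _, h = self.tgt_rnn(self.tgt_embed(y))
        return h[-1]
    def decode_target(self, v, y_in):  # v: multimodal vector, y_in: teacher-forced inputs
        out, _ = self.dec_rnn(self.tgt_embed(y_in), v.unsqueeze(0))
        return self.dec_out(out)       # (batch, len, tgt_vocab) logits
```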

  19. ◦ Align source-language texts and images in the multimodal space ◦ The source-language encoder RNN $E^s$ encodes $\mathbf{x}^s_k$ (e.g., 白い壁の隣に座っている小さな犬。 "A small dog sitting next to a white wall.") and the image encoder CNN $E^v$ encodes the paired image $\mathbf{z}^s_k$

  20. ◦ Align source-language texts and images in the multimodal space over $T^s = \{(\mathbf{x}^s_k, \mathbf{z}^s_k)\}_{k=1}^{N^s}$ using the pair-wise rank loss [Frome+, NIPS’13]:
$$L^s = \sum_{k}^{N^s} \sum_{i \neq k} \max\left\{0,\ \alpha - s\!\left(E^v(\mathbf{z}^s_k), E^s(\mathbf{x}^s_k)\right) + s\!\left(E^v(\mathbf{z}^s_k), E^s(\mathbf{x}^s_i)\right)\right\}$$
◦ $s(\cdot,\cdot)$: similarity score function; $\alpha$: margin (hyperparameter); for each image, the first term scores its paired text and the second a negative (not paired) text
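A compact sketch of this pair-wise rank loss over a batch, with cosine similarity standing in for $s(\cdot,\cdot)$ (an assumption):

```python
import torch
import torch.nn.functional as F

def pairwise_rank_loss(img_emb, txt_emb, margin=0.2):
    """Pairwise rank loss over N paired examples: for each image z_k, its paired
    text x_k should score higher than any non-paired text x_i by at least the
    margin alpha.  Cosine similarity is used as s(.,.) here."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()                # s(E^v(z_k), E^s(x_i)) for all k, i
    pos = scores.diag().unsqueeze(1)              # s(E^v(z_k), E^s(x_k)), paired terms
    hinge = (margin - pos + scores).clamp(min=0)  # max{0, alpha - pos + neg}
    mask = 1.0 - torch.eye(scores.size(0))        # drop the i == k terms
    return (hinge * mask).sum()

# Usage with random stand-in embeddings of 8 (image, source-text) pairs:
L_s = pairwise_rank_loss(torch.randn(8, 256), torch.randn(8, 256), margin=0.2)
```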

  21. ◦ Align target-language texts and images in the multimodal space over $T^t = \{(\mathbf{z}^t_k, \mathbf{y}^t_k)\}_{k=1}^{N^t}$ (e.g., $\mathbf{y}^t_k$ = "A black dog sitting on grass next to a sidewalk."):
$$L^t = \sum_{k}^{N^t} \sum_{i \neq k} \max\left\{0,\ \alpha - s\!\left(E^v(\mathbf{z}^t_k), E^t(\mathbf{y}^t_k)\right) + s\!\left(E^v(\mathbf{z}^t_k), E^t(\mathbf{y}^t_i)\right)\right\}$$
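Since $L^t$ has the identical form, the hypothetical `pairwise_rank_loss` sketched above can be reused unchanged, only with target-side inputs:

```python
import torch

# Target-side alignment: same hinge ranking loss, now over (image, target-text)
# pairs (z_k^t, y_k^t) embedded by E^v and E^t.  Random tensors stand in here.
img_t = torch.randn(8, 256)   # E^v(z^t)
txt_t = torch.randn(8, 256)   # E^t(y^t)
L_t = pairwise_rank_loss(img_t, txt_t, margin=0.2)
```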

  22. ◦ Feed the images in the target-side data $T^t = \{(\mathbf{z}^t_k, \mathbf{y}^t_k)\}_{k=1}^{N^t}$ forward and decode them into texts: image encoder CNN $E^v$ → multimodal space → target-language decoder RNN $D^t$ ◦ Cross-entropy loss against the paired caption $\mathbf{y}^t_k$ (e.g., "A black dog sitting on grass next to a sidewalk.")
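A sketch of this decoding loss on top of the hypothetical skeleton above (teacher forcing with a shifted caption; the interface is assumed):

```python
import torch
import torch.nn as nn

# Decoding loss sketch: encode a target-side image with E^v, run the decoder D^t
# with teacher forcing, and score the paired caption y^t with cross-entropy.
criterion = nn.CrossEntropyLoss()

def decoder_loss(model, z_feats, y_tokens):
    v = model.encode_image(z_feats)                    # image pivot -> multimodal vector
    logits = model.decode_target(v, y_tokens[:, :-1])  # predict each next token
    return criterion(logits.reshape(-1, logits.size(-1)),
                     y_tokens[:, 1:].reshape(-1))
```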

  23. ◦ Reconstruction loss of texts in the target language: the target-language encoder RNN $E^t$ embeds $\mathbf{y}^t_k$ (e.g., "A black dog sitting on grass next to a sidewalk.") into the multimodal space and the decoder $D^t$ reconstructs it ◦ This can also improve decoder performance
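The reconstruction term can reuse the same cross-entropy, only the encoder changes (again a sketch over the hypothetical skeleton, reusing `criterion` from the previous snippet):

```python
def reconstruction_loss(model, y_tokens):
    # Autoencode target-language text: E^t embeds it into the multimodal space
    # and D^t must reproduce it, giving the decoder extra training signal.
    v = model.encode_target(y_tokens)
    logits = model.decode_target(v, y_tokens[:, :-1])
    return criterion(logits.reshape(-1, logits.size(-1)),
                     y_tokens[:, 1:].reshape(-1))
```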

  24.  At test time, just feed the query forward through $E^s$ and $D^t$: $\hat{\mathbf{y}}_q = D^t(E^s(\mathbf{x}_q))$  We don’t need images in the testing phase! ◦ Example: $\mathbf{x}_q$ = 草地に立っている黒と白の牛。 → "A black and white cow standing in a grassy field."
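A greedy-decoding sketch of this test-time path over the hypothetical skeleton (special token ids and the stopping rule are assumptions):

```python
import torch

def translate(model, x_q, bos_id=1, eos_id=2, max_len=30):
    """Zero-shot test-time translation, y_hat = D^t(E^s(x_q)), by greedy decoding."""
    v = model.encode_source(x_q)                      # no image is needed here
    tokens = torch.full((x_q.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode_target(v, tokens)       # (batch, len, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return tokens                                     # BOS + generated target tokens
```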

  25.  IAPR-TC12 [Grubinger+, 2006] ◦ 20,000 images with English/German captions ◦ English: "a photo of a brown sandy beach; the dark blue sea with small breaking waves behind it; a dark green palm tree in the foreground on the left; a blue sky with clouds on the horizon in the background;" ◦ German: "ein Photo eines braunen Sandstrands; das dunkelblaue Meer mit kleinen brechenden Wellen dahinter; eine dunkelgrüne Palme im Vordergrund links; ein blauer Himmel mit Wolken am Horizont im Hintergrund;"  Multi30K [Elliott+, 2016] ◦ 30,000 images with English/German captions  We randomly split the data into our zero-shot setup and perform German-to-English translation


  27.  Evaluation metric: BLEU score (larger is better), comparing Ours (zero-shot learning) with supervised baselines (parallel corpus)  Zero-shot results are comparable to supervised models using parallel corpora roughly 20% as large as our monolingual ones.
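For reference, BLEU can be computed with NLTK as in this toy sketch (the two sentences are illustrative only, not results from the experiments):

```python
from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "black", "and", "white", "cow", "standing", "in", "a", "field"]]]
hypotheses = [["a", "black", "and", "white", "cow", "in", "a", "grassy", "field"]]
print(corpus_bleu(references, hypotheses))  # in [0, 1]; larger is better
```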


  29.  Cross-camera person identification ◦ E.g., three cameras (cam 1, cam 2, cam 3) with one camera’s data Z as the pivot between X and Y: all we need is the two losses $L_{XZ}$ and $L_{ZY}$ (the data itself can stay encapsulated)  Recognizing other sensory data ◦ E.g., depth (X) → caption (Y) via images (Z) as the pivot: "A black sofa in a room."





  34. • Routing “knowledge” • Edge-side loss computation • No need to open data itself!

  35.  Numerous new modalities in different types of data and different environments (≒ airports)  A “direct flight” (≒ supervised learning) for each pair is theoretically possible but practically infeasible ◦ Annotation cost, privacy, or company-side issues  The “hub airport” (pivot) plays the key role! ◦ World airlines (https://ja.wikipedia.org/wiki/航空会社)
