GPU Technology Conference (GTC) 2017 @San Jose, May 11th
Hideki Nakayama
The University of Tokyo, Graduate School of IST
Hideki Nakayama
◦ Assistant Professor @The University of Tokyo
◦ AI Research Center
Research topics:
◦ Computer Vision
◦ Natural Language Processing
◦ Deep Learning
◦ Large-scale image tagging
◦ Fine-grained recognition
◦ Wearable interface
◦ Medical image analysis
◦ Representation learning for vision
◦ Object discovery
◦ Vision-based recommendation
◦ Automatic question generation
◦ Word representation learning
◦ Flexible attention mechanism
(Figure: example caption "a cat is trying to eat the food")
◦ Image/video caption generation
◦ Multimodal deep models
◦ Multimodal machine translation
1. Background: cross-modal encoder-decoder learning with supervised data
2. Proposed idea: pivot-based learning
3. Zero-shot learning of machine translation system using image pivots
Goal: to learn a function that transforms data in one modality x (source) into another modality y (target).
How: statistical estimation from a lot of paired examples $\{(x_i, y_i)\}_{i=1}^{N}$.
(Figure: e.g., X = images of {cat, dog, bird}, Y = labels; $f(x)$ outputs scores such as cat 0.99, dog 0.01, bird 0.01.)
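As a concrete illustration, here is a minimal sketch of fitting such a function $f$ to paired examples with a neural network; the feature dimension, class set, optimizer settings, and synthetic data are all assumptions, not details from the talk.

```python
import torch
import torch.nn as nn

# Hypothetical setup: x = 2048-d image features, y = one of {cat, dog, bird}.
f = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 3))
opt = torch.optim.Adam(f.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Synthetic paired examples {(x_i, y_i)}, i = 1..N, standing in for real data.
N = 256
x, y = torch.randn(N, 2048), torch.randint(0, 3, (N,))

for step in range(100):           # statistical estimation from the pairs
    opt.zero_grad()
    loss = loss_fn(f(x), y)       # how well f(x_i) predicts y_i
    loss.backward()
    opt.step()

probs = f(x[:1]).softmax(dim=-1)  # e.g. (cat 0.99, dog 0.01, bird 0.01)
```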
Derive a hidden multimodal representation (vector) that aligns the coupled source and target data.
(Figure: X = images, Y = texts, linked through a shared Multimodal Space; image encoder (e.g., convolutional neural network) and text encoder/decoder (e.g., recurrent neural network); example captions: "A brown dog in front of a door.", "A black and white cow standing in a field.")
Prediction can be realized by encoding an input into the multimodal space and then decoding it: $\hat{y} = \mathrm{decode}(\mathrm{encode}(x))$.
(Figure: image encoder (convolutional neural network) maps $x$ into the multimodal space; text encoder/decoder (e.g., recurrent neural network) produces $\hat{y}$, e.g., "A black dog sitting on grass.")
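A toy sketch of this encode-then-decode pipeline follows; the linear image encoder, GRU decoder cell, vocabulary size, and greedy loop are illustrative stand-ins (a real system would use a CNN encoder and feed each predicted word's embedding back into the decoder).

```python
import torch
import torch.nn as nn

EMB = 256                             # multimodal space dimension (assumed)
image_encoder = nn.Linear(2048, EMB)  # stand-in for a CNN image encoder
decoder_rnn = nn.GRUCell(EMB, EMB)    # stand-in for an RNN text decoder
word_proj = nn.Linear(EMB, 10000)     # 10k-word vocabulary (assumed)

x = torch.randn(1, 2048)              # a precomputed image feature
h = image_encoder(x)                  # 1) encode into the multimodal space
inp = torch.zeros(1, EMB)             # placeholder <BOS> embedding
words = []
for t in range(20):                   # 2) decode greedily into caption words
    h = decoder_rnn(inp, h)
    words.append(word_proj(h).argmax(dim=-1).item())
```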
(Figure: example image-caption results from [Kiros et al., 2014].)
Reference: R. Kiros et al., "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models", TACL, 2015.
(Figure: further example results from [Kiros et al., 2014].)
As long as we have enough parallel data, we can now build many attractive applications:
◦ Image recognition / captioning
◦ Machine translation (私は学生です。 → "I am a student.")
◦ Multimedia synthesis ("This is a dog." → an image)
(Figure: each application as a cross-modal mapping from X to Y through the multimodal space.)
Supervised parallel data (X, Y) is not always available in real situations! Annotating data is very expensive:
◦ 1M parallel sentences (machine translation)
◦ 15M images in 10K categories (object recognition)
◦ etc.
What can we do when NO direct parallel data is available?
Existing workarounds:
◦ Semi-supervised learning: exploit unlabeled data $X_0$ and $Y_0$ in addition to the labeled pairs (X, Y).
◦ Transfer learning: exploit labeled pairs (X′, Y′) from another domain to help the target domain (X, Y).
(Figure: diagrams of the two setups.)
Learn a multimodal representation of X and Y from indirect data (X, Z) and (Z, Y), where Z is the "pivot" (a third modality).
Assumption: Z is a "common" modality (e.g., image, English text), and therefore (X, Z) and (Z, Y) are relatively easy to obtain.
(Figure: X and Y connected through pivot Z in a shared Multimodal Space via the (X, Z) and (Z, Y) pairs.)
1. Background: cross-modal encoder-decoder learning with supervised data
2. Proposed idea: pivot-based learning
3. Zero-shot learning of machine translation system
[1] R. Funaki and H. Nakayama, "Image-mediated Learning for Zero-shot Cross-lingual Document Retrieval", In Proc. of EMNLP, 2015.
[2] H. Nakayama and N. Nishida, "Toward Zero-resource Machine Translation by Multimodal Embedding with Multimedia Pivot", Machine Translation Journal, 2017 (in press).
Typical approach: Japanese–English parallel documents are hard to obtain…
Our approach (image pivot): we can find abundant monolingual documents with images (e.g., blogs, SNS, web news).
Training data: source–pivot pairs $\mathcal{T}^s = \{(x_k^s, z_k^s)\}_{k=1}^{N_s}$ (Japanese text, image) and pivot–target pairs $\mathcal{T}^t = \{(z_k^t, y_k^t)\}_{k=1}^{N_t}$ (image, English text).
◦ Multimodal embedding using image pivots.
◦ Puts the target-language decoder on top of the multimodal space.
◦ End-to-end learning with neural networks (deep learning).
Training data: $\mathcal{T}^s = \{(x_k^s, z_k^s)\}_{k=1}^{N_s}$ and $\mathcal{T}^t = \{(z_k^t, y_k^t)\}_{k=1}^{N_t}$.
(Figure: source-language encoder RNN $E^s$, image encoder CNN $E^v$, and target-language encoder RNN $E^t$ all map into the multimodal space; the target-language decoder RNN $D^t$ generates text from it.)
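A sketch of these four components as PyTorch modules is shown below; every dimension, vocabulary size, and layer choice here is an assumption for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

EMB, HID, SRC_V, TGT_V = 256, 512, 20000, 20000   # assumed sizes

class TextEncoder(nn.Module):
    """RNN encoder (E^s or E^t): word ids -> multimodal vector."""
    def __init__(self, vocab):
        super().__init__()
        self.emb = nn.Embedding(vocab, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, EMB)
    def forward(self, ids):                # ids: (batch, seq_len)
        _, h = self.rnn(self.emb(ids))
        return self.out(h[-1])             # (batch, EMB)

class TextDecoder(nn.Module):
    """RNN decoder D^t: multimodal vector -> target-word logits per step."""
    def __init__(self, vocab):
        super().__init__()
        self.emb = nn.Embedding(vocab, EMB)
        self.rnn = nn.GRU(EMB, EMB, batch_first=True)
        self.out = nn.Linear(EMB, vocab)
    def forward(self, v, ids):             # teacher forcing on gold ids
        h0 = v.unsqueeze(0)                # vector as initial hidden state
        o, _ = self.rnn(self.emb(ids), h0)
        return self.out(o)                 # (batch, seq_len, vocab)

E_s, E_t = TextEncoder(SRC_V), TextEncoder(TGT_V)
E_v = nn.Linear(2048, EMB)                 # image encoder E^v on CNN features
D_t = TextDecoder(TGT_V)
```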
Training step 1: align source-language texts and images in the multimodal space using $\mathcal{T}^s = \{(x_k^s, z_k^s)\}_{k=1}^{N_s}$, e.g., an image $z_k^s$ paired with the Japanese caption $x_k^s$ = 白い壁の隣に座っている小さな犬。("A small dog sitting next to a white wall.")
Pair-wise rank loss [Frome+, NIPS'13], where $s(\cdot,\cdot)$ is a similarity score function and $\alpha$ is a margin (hyperparameter):
$$L^s = \sum_{k} \sum_{i \neq k} \max\left\{0,\ \alpha - s\!\left(E^v(z_k^s), E^s(x_k^s)\right) + s\!\left(E^v(z_k^s), E^s(x_i^s)\right)\right\}$$
For each image, the paired text must outscore every negative (not paired) text by at least the margin.
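A sketch of this pair-wise rank loss over a mini-batch follows; it uses cosine similarity for $s(\cdot,\cdot)$ and in-batch negatives, both of which are common practical choices rather than details taken from the talk.

```python
import torch
import torch.nn.functional as F

def rank_loss(img_vecs, txt_vecs, alpha=0.2):
    """Pair-wise rank loss: the paired text of each image z_k must score
    higher (by margin alpha) than every non-paired text x_i in the batch."""
    img = F.normalize(img_vecs, dim=-1)     # cosine similarity as s(.,.)
    txt = F.normalize(txt_vecs, dim=-1)
    scores = img @ txt.t()                  # (N, N): s(E^v(z_k), E^s(x_i))
    pos = scores.diag().unsqueeze(1)        # s(E^v(z_k), E^s(x_k))
    hinge = (alpha - pos + scores).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return hinge.masked_fill(mask, 0.0).sum()   # drop the i == k terms

# Usage with the modules sketched earlier (assumed in scope):
# L_s = rank_loss(E_v(z_img), E_s(x_src))
```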
Training step 2: align target-language texts and images in the multimodal space using $\mathcal{T}^t = \{(z_k^t, y_k^t)\}_{k=1}^{N_t}$, e.g., an image $z_k^t$ paired with the English caption $y_k^t$ = "A black dog sitting on grass next to a sidewalk."
$$L^t = \sum_{k} \sum_{i \neq k} \max\left\{0,\ \alpha - s\!\left(E^v(z_k^t), E^t(y_k^t)\right) + s\!\left(E^v(z_k^t), E^t(y_i^t)\right)\right\}$$
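In the sketch above, this step is the same `rank_loss` applied with the target-language encoder, e.g. `L_t = rank_loss(E_v(z_img), E_t(y_tgt))`; only the encoder and the training pairs change, which is why a single loss implementation covers both alignment steps.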
Training step 3: feed the images in the target-side data $\mathcal{T}^t$ forward and decode them into texts with $D^t$, using a cross-entropy loss against the paired captions $y_k^t$ (e.g., "A black dog sitting on grass next to a sidewalk.").
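A sketch of this decoding loss, assuming the decoder interface from the earlier sketch and teacher forcing on the gold caption (the `pad_id` convention and the shifted inputs are assumptions):

```python
import torch.nn.functional as F

# y_in  = caption ids shifted right: <BOS> y_1 ... y_{T-1}
# y_out = gold ids to predict:       y_1 ... y_T <EOS>
def decoder_loss(decoder, encoder, inputs, y_in, y_out, pad_id=0):
    logits = decoder(encoder(inputs), y_in)        # (batch, seq, vocab)
    return F.cross_entropy(logits.flatten(0, 1),   # per-word cross-entropy
                           y_out.flatten(),
                           ignore_index=pad_id)    # skip padding positions

# For this step: decoder_loss(D_t, E_v, z_img, y_in, y_out)
```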
Training step 4: reconstruction loss on texts in the target language: encode $y_k^t$ with $E^t$ and decode it back with $D^t$. This can also improve decoder performance.
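In the same sketch, the reconstruction term just swaps the image encoder for the target-text encoder, e.g. `decoder_loss(D_t, E_t, y_tgt, y_in, y_out)`, so the decoder also learns to reproduce target sentences from their own embeddings in the multimodal space.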
Testing phase: just feed the source sentence forward through $E^s$ and $D^t$:
$$\hat{y}_q = D^t\!\left(E^s(x_q)\right)$$
We don't need images in the testing phase!
Example: 草地に立っている黒と白の牛。 → "A black and white cow standing in a grassy field."
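A sketch of this test-time composition with greedy decoding, reusing the decoder interface assumed earlier (the <BOS>/<EOS> ids and the length limit are assumptions):

```python
import torch

@torch.no_grad()
def translate(E_s, D_t, x_q, bos_id=1, eos_id=2, max_len=30):
    """Zero-shot translation: encode the source sentence (word ids, shape
    (1, seq)), then decode target words. No image is used at test time."""
    v = E_s(x_q)                               # (1, EMB) multimodal vector
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = D_t(v, ids)                   # re-decode prefix each step
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=1)
        if nxt.item() == eos_id:
            break
    return ids[0, 1:]                          # translated word ids
```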
Datasets:
◦ IAPR-TC12 [Grubinger+, 2006]: 20,000 images with English/German captions.
Example (English): "a photo of a brown sandy beach; the dark blue sea with small breaking waves behind it; a dark green palm tree in the foreground on the left; a blue sky with clouds on the horizon in the background;"
Example (German): "ein Photo eines braunen Sandstrands; das dunkelblaue Meer mit kleinen brechenden Wellen dahinter; eine dunkelgrüne Palme im Vordergrund links; ein blauer Himmel mit Wolken am Horizont im Hintergrund;"
◦ Multi30K [Elliott+, 2016]: 30,000 images with English/German captions.
We randomly split the data into our zero-shot setup and perform German-to-English translation.
Evaluation metric: BLEU scores (larger is better).
(Table: ours (zero-shot learning) vs. supervised baselines trained on a parallel corpus.)
Zero-shot results are comparable to supervised models using parallel corpora roughly 20% as large as our monolingual ones.
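For reference, corpus-level BLEU can be computed with a standard implementation such as NLTK's (shown below on toy tokens; this is not necessarily the scorer used in the paper):

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is scored against a list of reference token lists.
refs = [[["a", "black", "dog", "sitting", "on", "grass"]]]
hyps = [["a", "black", "dog", "sitting", "on", "the", "grass"]]
print(corpus_bleu(refs, hyps))   # in [0, 1]; larger is better
```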
◦ Cross-camera person identification (X = cam 1, Y = cam 2, Z = cam 3): all we need is the two losses $L_{XZ}$ and $L_{ZY}$! The data itself stays encapsulated at each camera.
◦ Recognizing other sensory data (X = depth, Y = caption, Z = image): e.g., describing a depth map as "A black sofa in a room."
• Routing "knowledge"
• Edge-side loss computation
• No need to open the data itself!
(Figure: X and Y connected through a network of pivots, with a pairwise loss L on each edge.)
Numerous new modalities arise in different types of data and different environments (≒ airports).
A "direct flight" (≒ supervised learning) for each pair is theoretically possible but practically infeasible:
◦ annotation cost, privacy, or company-side issues.
A "hub airport" (pivot) plays the key role!
(Figure: world airline routes, https://ja.wikipedia.org/wiki/航空会社)