Recognize, Describe, and Generate: Introduction of Recent Work at MIL
The University of Tokyo, NVAIL Partner
Yoshitaka Ushiku
MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
Members:
• One Professor (Prof. Harada)
• One Lecturer (me)
• One Assistant Professor
• One Postdoc
• Two Office Administrators
• 11 Ph.D. students
• 23 Master students
• 8 Bachelor students
• 5 Interns
Varying research topics: ICCV, CVPR, ECCV, ICML, NIPS, ICASSP, SIGdial, ACM Multimedia, ICME, ICRA, IROS, etc.
The most important thing: We are hiring!
Journalist Robot
• Born in 2006
• Objective: publishing news automatically
– Recognize: objects, people, actions
– Describe: what is happening
– Generate: content as humans do
Outline
• Journalist Robot: the ancestor of current work in MIL; our research originates with this robot
– Recognize
• Basic: framework for DL, domain adaptation
• Classification: single modality, multiple modalities
– Describe
• Image captioning
• Video captioning
– Generate
• Image reconstruction
• Video generation
Recognize
MILJS: JavaScript × Deep Learning [Hidaka+, ICLR Workshop 2017]
MILJS: JavaScript × Deep Learning [Hidaka+, ICLR Workshop 2017]
• Supports both learning and inference
• Supports nodes with GPGPUs
– Currently WebCL is utilized
– Now working on WebGPU
• Supports nodes w/o GPGPUs
• No software installation required
– Even a ResNet with 152 layers can be trained
Let me show you a preliminary demonstration using MNIST!
Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
• Unsupervised domain adaptation: does a model trained on MNIST work on SVHN?
– Ground-truth labels are associated with the source (MNIST)
– However, there are no labels for the target (SVHN)
Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
• Asymmetric tri-training: assigns pseudo labels to the target domain
Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
1st round: training on MNIST → add pseudo labels to easy target samples (e.g., "eight", "nine")
2nd round onward: training on MNIST + pseudo-labeled target samples → add more pseudo labels (see the sketch below)
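As a rough illustration, here is a minimal PyTorch-style sketch of the pseudo-label selection step, assuming a shared feature encoder and two source-trained classifiers f1 and f2. All names, the agreement-plus-confidence rule, and the threshold are illustrative; the paper's exact training procedure differs.

```python
import torch
import torch.nn.functional as F

def assign_pseudo_labels(encoder, f1, f2, target_loader, threshold=0.9):
    """Sketch of the pseudo-label selection in asymmetric tri-training:
    keep target samples on which both source-trained classifiers agree
    with high confidence ("easy" samples)."""
    pseudo_x, pseudo_y = [], []
    with torch.no_grad():
        for x in target_loader:              # target labels are unavailable
            z = encoder(x)
            p1 = F.softmax(f1(z), dim=1)
            p2 = F.softmax(f2(z), dim=1)
            conf1, y1 = p1.max(dim=1)
            conf2, y2 = p2.max(dim=1)
            # keep samples where both classifiers agree and are confident
            keep = (y1 == y2) & (torch.minimum(conf1, conf2) > threshold)
            pseudo_x.append(x[keep])
            pseudo_y.append(y1[keep])
    return torch.cat(pseudo_x), torch.cat(pseudo_y)
```

The target-specific classifier would then be trained on the returned pseudo-labeled pairs, and the loop repeats with more samples added each round, as on the slide above.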
End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017]
Existing methods for speech / sound recognition:
① Feature extraction: Fourier transformation (log-mel features)
② Classification: CNN on the extracted feature map
Log-mel features are suitable for human speech; but for environmental sounds…?
End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017]
Proposed approach (EnvNet): a single CNN performs both ① feature map extraction and ② classification
(Figure: the “feature map” extracted by the first layers)
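To make the idea concrete, here is a minimal sketch of an end-to-end 1-D CNN over raw waveforms. Layer counts and sizes are illustrative, not the published EnvNet configuration.

```python
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    """Toy end-to-end classifier in the spirit of EnvNet: the first 1-D
    convolutions replace hand-crafted log-mel feature extraction."""
    def __init__(self, n_classes=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 40, kernel_size=8), nn.ReLU(),   # learned filterbank
            nn.Conv1d(40, 40, kernel_size=8), nn.ReLU(),
            nn.MaxPool1d(160),            # pool over time -> "feature map"
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(32), nn.Flatten(),
            nn.Linear(40 * 32, n_classes),
        )

    def forward(self, wave):              # wave: (batch, 1, n_samples)
        return self.classifier(self.features(wave))
```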
End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017]
Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]:
• log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
• End-to-end CNN (Ours): 64.0
• End-to-end CNN & log-mel feature + CNN (Ours): 71.0
EnvNet can extract discriminative features for environmental sounds
Visual Question Answering (VQA) [Saito+, ICME 2017]
Question answering system for
• an associated image
• a question in natural language
Q: Is it going to rain soon? Ground Truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other? Ground Truth A: shade
Visual Question Answering (VQA) [Saito+, ICME 2017]
VQA = multi-class classification
• Image → image feature f_I
• Question (e.g., “What objects are found on the bed?”) → question feature f_Q
• Integrated vector f_IQ → answer (e.g., “bed sheets, pillow”)
After integrating into f_IQ: usual classification
Visual Question Answering [Saito+, ICME 2017]
Current advancement: improving how to integrate f_I and f_Q into f_IQ
• Concatenation: f_IQ = [f_I; f_Q], e.g.) [Antol+, ICCV 2015]
• Summation: f_IQ = f_I + f_Q (image feature with attention + question feature), e.g.) [Xu+Saenko, ECCV 2016]
• Multiplication: f_IQ = f_I ⊙ f_Q, e.g.) bilinear multiplication [Fukui+, EMNLP 2016]
• This work: DualNet does summation, multiplication, and concatenation
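A minimal sketch of this kind of multi-way fusion, with illustrative feature dimensions and answer vocabulary size (not the exact DualNet architecture):

```python
import torch
import torch.nn as nn

class MultiWayFusionVQA(nn.Module):
    """Sketch: fuse image and question features by summation,
    elementwise multiplication, and concatenation, then classify
    over a fixed answer vocabulary."""
    def __init__(self, dim=1024, n_answers=3000):
        super().__init__()
        # sum (dim) + product (dim) + concatenation (2*dim) = 4*dim
        self.classifier = nn.Linear(4 * dim, n_answers)

    def forward(self, f_i, f_q):          # f_i, f_q: (batch, dim)
        f_sum = f_i + f_q                 # summation
        f_mul = f_i * f_q                 # elementwise multiplication
        f_cat = torch.cat([f_i, f_q], dim=1)   # concatenation
        f_iq = torch.cat([f_sum, f_mul, f_cat], dim=1)
        return self.classifier(f_iq)      # logits over answers
```

The fused vector f_IQ then feeds the usual multi-class classifier, exactly as in the formulation on the previous slide.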
Visual Question Answering (VQA) [Saito+, ICME 2017]
VQA Challenge 2016 (at CVPR 2016): won the 1st place on abstract images w/o attention mechanism
Q: What fruit is yellow and brown? A: banana
Q: How many screens are there? A: 2
Q: What is the boy playing with? A: teddy bear
Q: Are there any animals swimming in the pond? A: no
Describe
Automatic Image Captioning [Ushiku+, ACM MM 2011]
(Figure) Retrieval-based captioning: a query image is matched against the training dataset, and the nearest captions are transferred.
Training dataset captions include: “A small white dog wearing a flannel warmer.”, “A white van parked in an empty lot.”, “A small gray dog on a leash.”, “A white cat rests head on a stone.”, “A small white dog standing on a leash.”, “A black dog standing in a grassy area.”, “White and gray kitten lying on its side.”, “Silver car parked on side of road.”, “A woman posing on a red scooter.”
Nearest captions for the input image: “A small white dog wearing a flannel warmer.”, “A small gray dog on a leash.”, “A black dog standing in a grassy area.”
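A toy sketch of the retrieval step, assuming precomputed image features. Cosine similarity here is a stand-in; the actual ACM MM 2011 system builds captions at the phrase level rather than simply copying whole nearest captions.

```python
import numpy as np

def nearest_captions(query_feat, train_feats, train_captions, k=3):
    """Rank training images by cosine similarity of their features
    to the query image, and return the captions of the top-k."""
    sims = train_feats @ query_feat
    sims = sims / (np.linalg.norm(train_feats, axis=1)
                   * np.linalg.norm(query_feat))
    top = np.argsort(-sims)[:k]           # indices of most similar images
    return [train_captions[i] for i in top]
```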
Automatic Image Captioning [ACM MM 2012, ICCV 2015]
Generated captions: “Group of people sitting at a table with a dinner.” / “Tourists are standing on the middle of a flat desert.”
Image Captioning + Sentiment Terms [Andrew+, BMVC 2016]
“A confused man in a blue shirt is sitting on a bench.”
“A man in a blue shirt and blue jeans is standing in the dirty overlooked water.”
“A zebra standing in a field with a tree in the background.”
Image Captioning + Sentiment Terms [Andrew+, BMVC 2016]
Two steps for adding a sentiment term:
1. Usual image captioning using CNN+RNN; the most probable noun is memorized
Image Captioning + Sentiment Terms [Andrew+, BMVC 2016]
Two steps for adding a sentiment term:
1. Usual image captioning using CNN+RNN
2. The model is forced to predict a sentiment term before the memorized noun (see the sketch below)
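A toy sketch of step 2, where `sentiment_model` and its `best_adjective` method are hypothetical stand-ins for the trained sentiment predictor:

```python
def add_sentiment(caption_tokens, noun, sentiment_model):
    """Insert a sentiment term right before the memorized
    most-probable noun of the base caption."""
    i = caption_tokens.index(noun)
    # hypothetical model: scores candidate sentiment adjectives in context
    adjective = sentiment_model.best_adjective(caption_tokens, position=i)
    return caption_tokens[:i] + [adjective] + caption_tokens[i:]

# e.g., ["a", "man", "in", "a", "blue", "shirt", ...] with noun "man"
# could become ["a", "confused", "man", "in", "a", "blue", "shirt", ...]
```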
Beyond Caption to Narrative [Andrew+, ICIP 2016] A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food.
Beyond Caption to Narrative [Andrew+, ICIP 2016]
Narrative: “A man is holding a box of doughnuts.” → “he and a woman are standing next each other.” → “she is holding a plate of food.”
Beyond Caption to Narrative [Andrew+, ICIP 2016] A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water.
Generate
Image Reconstruction [Kato+, CVPR 2014]
Traditional pipeline for image classification:
Collecting images → Extracting local descriptors d_1, …, d_N → Calculating a global feature from p(d; θ) → Classifying (e.g., “Camera”, “Cat”)
Image Reconstruction [Kato+, CVPR 2014]
Inverse problem: image reconstruction from a label (e.g., “Pot”), i.e., running the pipeline backward from the class label to local descriptors and then to an image
Image Reconstruction [Kato+, CVPR 2014]
“Pot”: optimized arrangement of local descriptors using global location cost + adjacency cost (see the sketch below)
Other examples: cat (bombay), camera, grand piano, gramophone, headphone, pyramid, joshua tree, wheel chair
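A loose sketch of such an objective, where `location_model` and `adjacency_model` are hypothetical stand-ins for the class-conditional models learned in the paper:

```python
import numpy as np

def arrangement_cost(positions, descriptors, location_model, adjacency_model):
    """Score an arrangement of local descriptors as the sum of:
    - a global location cost: is descriptor d_j plausible at position x_j
      for this class?
    - an adjacency cost: are spatially neighbouring descriptors compatible?
    positions: (N, 2) array; descriptors: list of N descriptor vectors."""
    loc = sum(location_model(d, x) for d, x in zip(descriptors, positions))
    adj = 0.0
    for j, x in enumerate(positions):
        # penalize incompatible neighbouring descriptor pairs
        nearest = np.argsort(np.linalg.norm(positions - x, axis=1))[1]
        adj += adjacency_model(descriptors[j], descriptors[nearest])
    return loc + adj
```

Minimizing this cost over the positions yields the reconstructed descriptor layout, which is then rendered as an image.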
Video Generation [Yamamoto+, ACMMM 2016]
• Image generation is still challenging; it is only successful in controlled settings:
– Human faces, e.g., BEGAN [Berthelot+, 2017 Mar.]
– Birds and flowers, e.g., StackGAN [Zhang+, 2016 Dec.]
• Video generation is…
– Additionally requiring temporal consistency
– Extremely challenging [Vondrick+, NIPS 2016]
Video Generation [Yamamoto+, ACMMM 2016]
• This work: generating easy videos
– C3D (3D convolutional neural network) for conditional generation from an input label
– tempCAE (temporal convolutional auto-encoder) for regularizing the video to improve its naturalness
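For intuition, a minimal sketch of a label-conditional video generator built from 3-D transposed convolutions. The architecture and sizes are illustrative; the real system's C3D generator and tempCAE regularizer differ.

```python
import torch
import torch.nn as nn

class ConditionalVideoGenerator(nn.Module):
    """Toy label-conditional generator: a noise vector concatenated
    with a label embedding is upsampled by 3-D transposed convolutions
    into a short video clip of shape (batch, 3, T, H, W)."""
    def __init__(self, n_labels=10, z_dim=100):
        super().__init__()
        self.embed = nn.Embedding(n_labels, z_dim)
        self.net = nn.Sequential(
            nn.ConvTranspose3d(2 * z_dim, 128, kernel_size=(2, 4, 4)),
            nn.ReLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),                    # pixel values in [-1, 1]
        )

    def forward(self, z, label):          # z: (batch, z_dim), label: (batch,)
        cond = torch.cat([z, self.embed(label)], dim=1)
        # reshape to a 1x1x1 "volume" and upsample to an 8-frame 16x16 clip
        return self.net(cond[:, :, None, None, None])
```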
Video Generation [Yamamoto+, ACMMM 2016]
(Figure) Generated samples for the labels “Car runs to left” and “Rocket flies up”: Ours (C3D+tempCAE) vs. Only C3D
Conclusion
• MIL: Machine Intelligence Laboratory — Beyond Human Intelligence Based on Cyber-Physical Systems
• This talk introduced some of our current research
– Recognize
• Basic: framework for DL, domain adaptation
• Classification: single modality, multiple modalities
– Describe
• Image captioning, video captioning
– Generate
• Image reconstruction, video generation