Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension


  1. Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension
      Winston H. HSU (徐宏民), National Taiwan University & IBM TJ Watson Ctr., New York – http://winstonhsu.info/
      Hung-Yi LEE (李宏毅), National Taiwan University – http://speech.ee.ntu.edu.tw/~tlkagk/
      @GTC 2017, May 8, 2017

  2. (Figure-only slide; no recoverable text.)

  3. The First AI-Generated Movie Trailer – Identifying the "Horror" Factors by Multimodal Learning
      ▪ The first movie trailer generated by an AI system (IBM Watson), which labeled candidate scenes as tender, suspenseful, or scary
      https://www.ibm.com/blogs/think/2016/08/cognitive-movie-trailer/

  4. Detecting Activities of Daily Living (ADL) from Egocentric Videos
      ▪ Activities of daily living – used in healthcare to refer to people's daily self-care activities
      – Enabling technologies for exciting applications
      ▪ Very challenging!! (Example ADL: brushing teeth)
      https://www.advancedrm.com/measuring-adls-to-assess-needs-and-improve-independence/

  5. Our Proposal: Beyond Objects – Leveraging More Contexts by Multimodal Learning [Hsieh et al., ICME'16]
      ▪ Objects [1]: e.g., tap, cup, toothbrush
      ▪ Scenes: from a CNN for scene recognition (67 scene classes), e.g., bathroom: 0.8, kitchen: 0.1, living room: 0.01
      ▪ Sensors: accelerometer, microphone, heart rate
      [1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
      [2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016
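      A minimal sketch of this kind of multimodal late fusion, assuming the per-segment object scores, scene probabilities, and sensor features are simply concatenated before classification; apart from the 67 scene classes shown above, all names and dimensions are illustrative, not from [2]:

      ```python
      import torch
      import torch.nn as nn

      class ADLFusionNet(nn.Module):
          """Illustrative late-fusion classifier: concatenates object scores,
          scene probabilities, and sensor features, then predicts the activity.
          Feature sizes and the number of ADL classes are assumptions."""
          def __init__(self, n_objects=26, n_scenes=67, n_sensor=32, n_activities=20):
              super().__init__()
              self.classifier = nn.Sequential(
                  nn.Linear(n_objects + n_scenes + n_sensor, 128),
                  nn.ReLU(),
                  nn.Linear(128, n_activities),
              )

          def forward(self, object_scores, scene_probs, sensor_feats):
              fused = torch.cat([object_scores, scene_probs, sensor_feats], dim=1)
              return self.classifier(fused)

      # One video segment with per-modality mid-level features.
      net = ADLFusionNet()
      logits = net(torch.rand(1, 26), torch.rand(1, 67), torch.rand(1, 32))
      ```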

  6. Experimental Results for ADL – Multimodal Learning Matters!
      ▪ Egocentric videos collected from 20 people (with Google Glass and GeneActiv)
      (Bar chart: accuracy, 0–70%.)
      [1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
      [2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

  7. Perception/understanding is multimodal. How to design multimodal (end-to-end) deep learning frameworks?

  8. Outline
      ▪ Why learn with multimodal deep neural networks
      ▪ Required techniques for multimodal learning
      ▪ Sample projects
      – Medical segmentation by cross-modal and sequential learning
      – Cross-domain and cross-view learning for 3D retrieval
      – Speech summarization
      – Speech question answering
      – Audio word to vector

  9. 3D Medical Segmentation by Deep Neural Networks [Tseng et al., CVPR 2017]
      ▪ Motivations
      – 3D biomedical segmentation plays a vital role in biomedical analysis
      – Brain tumors have various shapes and can appear anywhere in the brain → very challenging to localize
      ▪ Goal: perform 3D segmentation with deep methods, segmenting by stacking all the 2D slices (sequences)
      ▪ Observation: oncologists leverage multi-modal signals in tumor diagnosis

  10. Multi-Modal Biomedical Images
      ▪ 3D multi-modal MRI
      – Different modalities are used to distinguish the boundaries of different tumor tissues (e.g., edema, enhancing core, non-enhancing core, necrosis)
      – Four modalities: Flair, T1, T1c, T2
      (Figure: example slices of the four modalities.)

  11. Related Work – SegNet (2D Images)
      ▪ Structured as an encoder and decoder with multi-resolution fusion (MRF)
      ▪ But
      – ignores multi-modalities
      – lacks sequential learning
      Badrinarayanan et al., SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, 2015

  12. 3D Medical Segmentation by Deep Neural Networks [Tseng et al., CVPR 2017]
      ▪ Our proposal – (first-ever) utilizing cross-modal learning in end-to-end sequential and convolutional neural networks, effectively aggregating multiple resolutions
      Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu and Chung-Yang Huang. Joint Sequence Learning and Cross-Modality Convolution for 3D Biomedical Segmentation. CVPR 2017

  13. ConvLSTM – Temporally Augmented Convolutional Neural Networks
      ▪ Convolutional + sequential networks, e.g., convLSTM
      – Modeling spatial cues over temporal (sequential) evolvement
      ▪ LSTM vs. convLSTM: a traditional LSTM employs fully-connected (matrix) transforms; convLSTM replaces them with convolutions, so hidden states keep their spatial layout
      Shi et al., Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, NIPS 2015
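      A minimal convLSTM cell sketch in PyTorch illustrating that replacement (the peephole terms of Shi et al. are omitted; layer sizes are illustrative):

      ```python
      import torch
      import torch.nn as nn

      class ConvLSTMCell(nn.Module):
          """Minimal ConvLSTM cell: the fully-connected transforms of a
          standard LSTM become convolutions over spatial feature maps."""
          def __init__(self, in_ch, hid_ch, k=3):
              super().__init__()
              # One convolution produces all four gates at once.
              self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

          def forward(self, x, state):
              h, c = state  # hidden and cell state, each (B, hid_ch, H, W)
              i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
              c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
              h = torch.sigmoid(o) * torch.tanh(c)
              return h, c

      # Run the cell over a sequence of 5 slices.
      cell = ConvLSTMCell(in_ch=4, hid_ch=16)
      h = c = torch.zeros(1, 16, 30, 30)
      for x in torch.rand(5, 1, 4, 30, 30):
          h, c = cell(x, (h, c))
      ```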

  14. Cross-Modality Convolution (CMC)
      (Figure 2: detailed architecture. For each slice, the four modalities – Flair, T2, T1, T1c – pass through per-modality multi-modal encoders (Conv + Batch Norm + ReLU, with max pooling). The resulting C-channel feature maps are stacked into a tensor (C × h × w × 4) and fused by cross-modality convolution with kernels of size 4 × 1 × 1 × C. The fused per-slice features feed a convolution LSTM across slices and then a decoder (Deconv + Conv + Batch Norm + ReLU).)
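      A sketch of the cross-modality convolution step, under the reading that stacking the four C-channel maps along the channel axis and applying 1×1 convolutions is equivalent to the 4 × 1 × 1 × C kernels above (each output kernel spans all 4 modalities and all C channels at a single spatial position); channel and spatial sizes are illustrative:

      ```python
      import torch
      import torch.nn as nn

      class CrossModalityConvolution(nn.Module):
          """Sketch of CMC: per-slice encoder features from the four MRI
          modalities, each (B, C, H, W), are stacked along the channel axis
          and fused by 1x1 convolutions spanning all modalities and channels."""
          def __init__(self, channels, n_modalities=4):
              super().__init__()
              # K = C output kernels keep #output channels == #input channels.
              self.fuse = nn.Conv2d(n_modalities * channels, channels, kernel_size=1)

          def forward(self, modality_feats):  # list of 4 tensors (B, C, H, W)
              stacked = torch.cat(modality_feats, dim=1)  # (B, 4*C, H, W)
              return self.fuse(stacked)

      cmc = CrossModalityConvolution(channels=64)
      feats = [torch.rand(2, 64, 30, 30) for _ in range(4)]  # Flair, T2, T1, T1c
      fused = cmc(feats)  # (2, 64, 30, 30), fed to the convLSTM across slices
      ```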

  15. Comparing with the State-of-the-Art on BRATS-2015
      (Figure: (a) MRI slices, (b) ground truth, (c) U-Net, (d) CMC (ours), (e) CMC + convLSTM (ours).)
      ▪ MRF is effective
      ▪ Multi-modal encoder (MME) + CMC is better than a regular encoder + decoder
      ▪ Two-phase training is an important strategy for imbalanced data
      ▪ convLSTM (sequential modeling) helps slightly

  16. Sketch/Image-Based 3D Model Search (demo) [Liu et al., ACMMM'15] [Lee et al., 2017]
      ▪ Speeding up 3D design and printing
      – Current 3D shape search engines take text inputs only
      – Leveraging large-scale, freely available 3D models
      ▪ Various applications of 3D models: 3D printing, AR, 3D game design, etc.

  17. Image-Based 3D Shape Retrieval [Lee et al., 2017]
      ▪ To retrieve 3D shapes based on photo inputs
      ▪ Challenges:
      – Effective feature representations of 3D shapes (with CNNs)
      – Image-to-3D cross-domain similarity learning
      (Figure: a query photo and the retrieved 3D shapes.)

  18. Our Proposal – Cross-Domain 3D Shape Retrieval with View Sequence Learning [Lee et al., 2017]
      ▪ Novel proposal – end-to-end deep neural networks for cross-domain and cross-view learning, with efficient triplet learning
      ▪ A brand-new problem
      (Figure: the query image passes through an Image-CNN and an adaptation layer to produce the image representation; each rendered view of a 3D shape passes through a shared View-CNN, and cross-view convolution aggregates the views into the shape representation; the top-ranked 3D shapes are returned by L2 distance.)

  19. Cross-Domain (Distance Metric) Learning: Siamese vs. Triplet Networks
      ▪ Siamese network: two identical, weight-shared networks (CNN/DNN) over (image1, image2), trained with a contrastive loss
      ▪ Triplet network: three identical, weight-shared streams over (anchor, positive, negative) images, trained with a triplet loss
      Wang, Jiang, et al. "Learning fine-grained image similarity with deep ranking." CVPR 2014.
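      Minimal sketches of the two losses (margin values are illustrative):

      ```python
      import torch
      import torch.nn.functional as F

      def triplet_loss(anchor, positive, negative, margin=0.2):
          """Hinge-style triplet loss: pull the anchor toward the positive
          and push it from the negative until they differ by `margin`."""
          d_pos = F.pairwise_distance(anchor, positive)
          d_neg = F.pairwise_distance(anchor, negative)
          return torch.clamp(d_pos - d_neg + margin, min=0).mean()

      def contrastive_loss(x1, x2, same, margin=1.0):
          """Siamese contrastive loss: `same` is a float tensor,
          1 for matching pairs and 0 otherwise."""
          d = F.pairwise_distance(x1, x2)
          return (same * d.pow(2)
                  + (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()
      ```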

  20. Baseline: MVCNN – 3D Shape Feature by Max Pooling, Ignoring Sequences
      ▪ Straightforward, but ignores view sequences
      – Each view is passed through the same CNN (shared weights, conv1 → pool5)
      – View-pooling is a MAX POOLING operation (output the same size as pool5), followed by fc6–fc8 for class scores (airplane, bed, car, …)
      Su, Hang, et al. "Multi-view convolutional neural networks for 3D shape recognition." CVPR 2015
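      The view-pooling step reduces to an element-wise max; a minimal sketch, assuming view features are batched as (B, V, C, H, W):

      ```python
      import torch

      def view_pool(view_feats):
          """MVCNN view-pooling: element-wise max over the V view features
          (each from the same shared-weight CNN), keeping the pool5 layout."""
          return view_feats.max(dim=1).values  # (B, V, C, H, W) -> (B, C, H, W)

      pooled = view_pool(torch.rand(2, 12, 256, 6, 6))  # -> (2, 256, 6, 6)
      ```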

  21. Our Proposal: Cross-Domain Triplet NN with View Sequence Learning
      ▪ Cross-view convolution aggregates multi-view features
      ▪ The adaptation layer adapts image features to the joint embedding space
      ▪ Late triplet sampling speeds up the training of cross-domain triplet learning

  22. Cross-View Convolution (CVC)
      ▪ Stack the feature maps from V views by channel: V × (H × W × C) → H × W × (V × C)
      ▪ Convolve the new tensor with K kernels (1 × 1 × V × C)
      – Setting K == C → #output channels == #input channels (for comparisons)
      – K = C = 256 = AlexNet pool5 feature-map #channels
      ▪ CVC works as a weighted summation across views and channels of the CNN features (see the sketch below)
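      A sketch of CVC under the same 1×1-convolution reading as the CMC sketch above; V = 12 rendered views is an assumption, while C = 256 and the 6×6 pool5 spatial size follow AlexNet:

      ```python
      import torch
      import torch.nn as nn

      # V per-view pool5 maps are stacked along the channel axis and mixed by
      # 1x1 kernels spanning all V views and all C channels -- a learned
      # weighted sum across views and channels.
      V, C = 12, 256                         # views; AlexNet pool5 channels
      cvc = nn.Conv2d(V * C, C, kernel_size=1)  # K = C output kernels

      views = torch.rand(1, V, C, 6, 6)         # (B, V, C, H, W) pool5 maps
      shape_repr = cvc(views.reshape(1, V * C, 6, 6))  # -> (1, C, 6, 6)
      ```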

  23. Late Triplet Sampling (Fast-CDTNN) – Speeding Up Cross-Domain Learning
      ▪ A naive cross-domain triplet neural network (CDTNN) has three streams
      ▪ Fast-CDTNN has two streams: it forwards the sampled images/3D shapes once and enumerates the triplets (combinations) at the integrated triplet loss layer
      ▪ In our experiments, Fast-CDTNN is ~4–5x faster
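      A sketch of how triplets can be enumerated at the loss layer after a single forward pass per image and per shape; the label-based positive/negative masking and all names are illustrative, not necessarily the paper's exact formulation:

      ```python
      import torch

      def late_triplet_loss(img_emb, shape_emb, img_labels, shape_labels, margin=0.2):
          """Embed each image and 3D shape once, then enumerate cross-domain
          triplets (anchor image, positive shape, negative shape) inside the
          loss layer instead of running three network streams per triplet."""
          d = torch.cdist(img_emb, shape_emb)  # (N_img, N_shape) distances
          same = img_labels[:, None] == shape_labels[None, :]  # positive mask
          # Element [a, p, n] = d(a, p) - d(a, n) + margin, for all pairs.
          loss = d[:, :, None] - d[:, None, :] + margin
          valid = same[:, :, None] & ~same[:, None, :]
          return torch.clamp(loss[valid], min=0).mean()

      img_emb, shape_emb = torch.rand(8, 128), torch.rand(16, 128)
      img_labels, shape_labels = torch.randint(4, (8,)), torch.randint(4, (16,))
      loss = late_triplet_loss(img_emb, shape_emb, img_labels, shape_labels)
      ```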
