Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension


  1. Learning Large-Scale Multimodal Data Streams – Ranking, Mining, and Machine Comprehension
      Winston H. HSU (徐宏民), National Taiwan University & IBM TJ Watson Ctr., New York – http://winstonhsu.info/
      Hung-Yi LEE (李宏毅), National Taiwan University – http://speech.ee.ntu.edu.tw/~tlkagk/
      @GTC 2017, May 8, 2017

  2. (Figure-only slide; no recoverable text.)

  3. The First AI-Generated Movie Trailer – Identifying the "Horror" Factors by Multimodal Learning
      ▪ The first movie trailer generated by an AI system (IBM Watson), which labeled candidate scenes as tender, suspenseful, or scary
      https://www.ibm.com/blogs/think/2016/08/cognitive-movie-trailer/

  4. Detecting Activities of Daily Living (ADL) from Egocentric Videos
      ▪ Activities of daily living – used in healthcare to refer to people's daily self-care activities
      – Enabling technologies for exciting applications
      ▪ Very challenging!! (Example ADL: brushing teeth)
      https://www.advancedrm.com/measuring-adls-to-assess-needs-and-improve-independence/

  5. Our Proposal: Beyond Objects – Leveraging More Contexts by Multimodal Learning [Hsieh et al., ICME'16]
      ▪ Objects [1]: e.g., tap, cup, toothbrush
      ▪ Scenes: from a CNN for scene recognition (67 scene classes), e.g., bathroom: 0.8, kitchen: 0.1, living room: 0.01
      ▪ Sensors: accelerometer, microphone, heart rate
      [1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
      [2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016
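      A minimal sketch of this kind of multimodal late fusion, assuming the per-segment object scores, scene probabilities, and sensor features are simply concatenated before classification; apart from the 67 scene classes shown above, all names and dimensions are illustrative, not from [2]:

      ```python
      import torch
      import torch.nn as nn

      class ADLFusionNet(nn.Module):
          """Illustrative late-fusion classifier: concatenates object scores,
          scene probabilities, and sensor features, then predicts the activity.
          Feature sizes and the number of ADL classes are assumptions."""
          def __init__(self, n_objects=26, n_scenes=67, n_sensor=32, n_activities=20):
              super().__init__()
              self.classifier = nn.Sequential(
                  nn.Linear(n_objects + n_scenes + n_sensor, 128),
                  nn.ReLU(),
                  nn.Linear(128, n_activities),
              )

          def forward(self, object_scores, scene_probs, sensor_feats):
              fused = torch.cat([object_scores, scene_probs, sensor_feats], dim=1)
              return self.classifier(fused)

      # One video segment with per-modality mid-level features.
      net = ADLFusionNet()
      logits = net(torch.rand(1, 26), torch.rand(1, 67), torch.rand(1, 32))
      ```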

  6. Experimental Results for ADL – Multimodal Learning Matters!
      ▪ Egocentric videos collected from 20 people (with Google Glass and GeneActiv)
      (Bar chart: accuracy, 0–70%.)
      [1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
      [2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

  7. Perception/understanding is multimodal. How to design multimodal (end-to-end) deep learning frameworks?

  8. Outline
      ▪ Why learn with multimodal deep neural networks
      ▪ Required techniques for multimodal learning
      ▪ Sample projects
      – Medical segmentation by cross-modal and sequential learning
      – Cross-domain and cross-view learning for 3D retrieval
      – Speech summarization
      – Speech question answering
      – Audio word to vector

  9. 3D Medical Segmentation by Deep Neural Networks [Tseng et al., CVPR 2017]
      ▪ Motivations
      – 3D biomedical segmentation plays a vital role in biomedical analysis
      – Brain tumors have various shapes and can appear anywhere in the brain → very challenging to localize
      ▪ Goal: perform 3D segmentation with deep methods, segmenting by stacking all the 2D slices (sequences)
      ▪ Observation: oncologists leverage multi-modal signals in tumor diagnosis

  10. Multi-Modal Biomedical Images
      ▪ 3D multi-modal MRI
      – Different modalities are used to distinguish the boundaries of different tumor tissues (e.g., edema, enhancing core, non-enhancing core, necrosis)
      – Four modalities: Flair, T1, T1c, T2
      (Figure: example slices of the four modalities.)

  11. Related Work – SegNet (2D Images)
      ▪ Structured as an encoder and decoder with multi-resolution fusion (MRF)
      ▪ But
      – ignores multi-modalities
      – lacks sequential learning
      Badrinarayanan et al., SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, 2015

  12. 3D Medical Segmentation by Deep Neural Networks [Tseng et al., CVPR 2017]
      ▪ Our proposal – (first-ever) utilizing cross-modal learning in end-to-end sequential and convolutional neural networks, effectively aggregating multiple resolutions
      Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu and Chung-Yang Huang. Joint Sequence Learning and Cross-Modality Convolution for 3D Biomedical Segmentation. CVPR 2017

  13. ConvLSTM – Temporally Augmented Convolutional Neural Networks
      ▪ Convolutional + sequential networks, e.g., convLSTM
      – Modeling spatial cues over temporal (sequential) evolvement
      ▪ LSTM vs. convLSTM: a traditional LSTM employs fully-connected (matrix) transforms; convLSTM replaces them with convolutions, so hidden states keep their spatial layout
      Shi et al., Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, NIPS 2015
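      A minimal convLSTM cell sketch in PyTorch illustrating that replacement (the peephole terms of Shi et al. are omitted; layer sizes are illustrative):

      ```python
      import torch
      import torch.nn as nn

      class ConvLSTMCell(nn.Module):
          """Minimal ConvLSTM cell: the fully-connected transforms of a
          standard LSTM become convolutions over spatial feature maps."""
          def __init__(self, in_ch, hid_ch, k=3):
              super().__init__()
              # One convolution produces all four gates at once.
              self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

          def forward(self, x, state):
              h, c = state  # hidden and cell state, each (B, hid_ch, H, W)
              i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
              c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
              h = torch.sigmoid(o) * torch.tanh(c)
              return h, c

      # Run the cell over a sequence of 5 slices.
      cell = ConvLSTMCell(in_ch=4, hid_ch=16)
      h = c = torch.zeros(1, 16, 30, 30)
      for x in torch.rand(5, 1, 4, 30, 30):
          h, c = cell(x, (h, c))
      ```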

  14. Cross-Modality Convolution (CMC)
      (Figure 2: detailed architecture. For each slice, the four modalities – Flair, T2, T1, T1c – pass through per-modality multi-modal encoders (Conv + Batch Norm + ReLU, with max pooling). The resulting C-channel feature maps are stacked into a tensor (C × h × w × 4) and fused by cross-modality convolution with kernels of size 4 × 1 × 1 × C. The fused per-slice features feed a convolution LSTM across slices and then a decoder (Deconv + Conv + Batch Norm + ReLU).)
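      A sketch of the cross-modality convolution step, under the reading that stacking the four C-channel maps along the channel axis and applying 1×1 convolutions is equivalent to the 4 × 1 × 1 × C kernels above (each output kernel spans all 4 modalities and all C channels at a single spatial position); channel and spatial sizes are illustrative:

      ```python
      import torch
      import torch.nn as nn

      class CrossModalityConvolution(nn.Module):
          """Sketch of CMC: per-slice encoder features from the four MRI
          modalities, each (B, C, H, W), are stacked along the channel axis
          and fused by 1x1 convolutions spanning all modalities and channels."""
          def __init__(self, channels, n_modalities=4):
              super().__init__()
              # K = C output kernels keep #output channels == #input channels.
              self.fuse = nn.Conv2d(n_modalities * channels, channels, kernel_size=1)

          def forward(self, modality_feats):  # list of 4 tensors (B, C, H, W)
              stacked = torch.cat(modality_feats, dim=1)  # (B, 4*C, H, W)
              return self.fuse(stacked)

      cmc = CrossModalityConvolution(channels=64)
      feats = [torch.rand(2, 64, 30, 30) for _ in range(4)]  # Flair, T2, T1, T1c
      fused = cmc(feats)  # (2, 64, 30, 30), fed to the convLSTM across slices
      ```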

  15. Comparing with the State-of-the-Art on BRATS-2015
      (Figure: (a) MRI slices, (b) ground truth, (c) U-Net, (d) CMC (ours), (e) CMC + convLSTM (ours).)
      ▪ MRF is effective
      ▪ Multi-modal encoder (MME) + CMC is better than a regular encoder + decoder
      ▪ Two-phase training is an important strategy for imbalanced data
      ▪ convLSTM (sequential modeling) helps slightly

  16. Sketch/Image-Based 3D Model Search (demo) [Liu et al., ACMMM'15] [Lee et al., 2017]
      ▪ Speeding up 3D design and printing
      – Current 3D shape search engines take text inputs only
      – Leveraging large-scale, freely available 3D models
      ▪ Various applications of 3D models: 3D printing, AR, 3D game design, etc.

  17. Image-Based 3D Shape Retrieval [Lee et al., 2017]
      ▪ To retrieve 3D shapes based on photo inputs
      ▪ Challenges:
      – Effective feature representations of 3D shapes (with CNNs)
      – Image-to-3D cross-domain similarity learning
      (Figure: a query photo and the retrieved 3D shapes.)

  18. Our Proposal – Cross-Domain 3D Shape Retrieval with View Sequence Learning [Lee et al., 2017]
      ▪ Novel proposal – end-to-end deep neural networks for cross-domain and cross-view learning, with efficient triplet learning
      ▪ A brand-new problem
      (Figure: the query image passes through an Image-CNN and an adaptation layer to produce the image representation; each rendered view of a 3D shape passes through a shared View-CNN, and cross-view convolution aggregates the views into the shape representation; the top-ranked 3D shapes are returned by L2 distance.)

  19. Cross-Domain (Distance Metric) Learning: Siamese vs. Triplet Networks
      ▪ Siamese network: two identical, weight-shared networks (CNN/DNN) over (image1, image2), trained with a contrastive loss
      ▪ Triplet network: three identical, weight-shared streams over (anchor, positive, negative) images, trained with a triplet loss
      Wang, Jiang, et al. "Learning fine-grained image similarity with deep ranking." CVPR 2014.
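      Minimal sketches of the two losses (margin values are illustrative):

      ```python
      import torch
      import torch.nn.functional as F

      def triplet_loss(anchor, positive, negative, margin=0.2):
          """Hinge-style triplet loss: pull the anchor toward the positive
          and push it from the negative until they differ by `margin`."""
          d_pos = F.pairwise_distance(anchor, positive)
          d_neg = F.pairwise_distance(anchor, negative)
          return torch.clamp(d_pos - d_neg + margin, min=0).mean()

      def contrastive_loss(x1, x2, same, margin=1.0):
          """Siamese contrastive loss: `same` is a float tensor,
          1 for matching pairs and 0 otherwise."""
          d = F.pairwise_distance(x1, x2)
          return (same * d.pow(2)
                  + (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()
      ```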

  20. Baseline: MVCNN – 3D Shape Feature by Max Pooling, Ignoring Sequences
      ▪ Straightforward, but ignores view sequences
      – Each view is passed through the same CNN (shared weights, conv1 → pool5)
      – View-pooling is a MAX POOLING operation (output the same size as pool5), followed by fc6–fc8 for class scores (airplane, bed, car, …)
      Su, Hang, et al. "Multi-view convolutional neural networks for 3D shape recognition." CVPR 2015
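      The view-pooling step reduces to an element-wise max; a minimal sketch, assuming view features are batched as (B, V, C, H, W):

      ```python
      import torch

      def view_pool(view_feats):
          """MVCNN view-pooling: element-wise max over the V view features
          (each from the same shared-weight CNN), keeping the pool5 layout."""
          return view_feats.max(dim=1).values  # (B, V, C, H, W) -> (B, C, H, W)

      pooled = view_pool(torch.rand(2, 12, 256, 6, 6))  # -> (2, 256, 6, 6)
      ```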

  21. Our Proposal: Cross-Domain Triplet NN with View Sequence Learning
      ▪ Cross-view convolution aggregates multi-view features
      ▪ The adaptation layer adapts image features to the joint embedding space
      ▪ Late triplet sampling speeds up the training of cross-domain triplet learning

  22. Cross-View Convolution (CVC)
      ▪ Stack the feature maps from V views by channel: V × (H × W × C) → H × W × (V × C)
      ▪ Convolve the new tensor with K kernels (1 × 1 × V × C)
      – Setting K == C → #output channels == #input channels (for comparisons)
      – K = C = 256 = AlexNet pool5 feature-map #channels
      ▪ CVC works as a weighted summation across views and channels of the CNN features (see the sketch below)
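      A sketch of CVC under the same 1×1-convolution reading as the CMC sketch above; V = 12 rendered views is an assumption, while C = 256 and the 6×6 pool5 spatial size follow AlexNet:

      ```python
      import torch
      import torch.nn as nn

      # V per-view pool5 maps are stacked along the channel axis and mixed by
      # 1x1 kernels spanning all V views and all C channels -- a learned
      # weighted sum across views and channels.
      V, C = 12, 256                         # views; AlexNet pool5 channels
      cvc = nn.Conv2d(V * C, C, kernel_size=1)  # K = C output kernels

      views = torch.rand(1, V, C, 6, 6)         # (B, V, C, H, W) pool5 maps
      shape_repr = cvc(views.reshape(1, V * C, 6, 6))  # -> (1, C, 6, 6)
      ```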

  23. Late Triplet Sampling (Fast-CDTNN) – Speeding Up Cross-Domain Learning
      ▪ A naive cross-domain triplet neural network (CDTNN) has three streams
      ▪ Fast-CDTNN has two streams: it forwards the sampled images/3D shapes once and enumerates the triplets (combinations) at the integrated triplet loss layer
      ▪ In our experiments, Fast-CDTNN is ~4–5x faster
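      A sketch of how triplets can be enumerated at the loss layer after a single forward pass per image and per shape; the label-based positive/negative masking and all names are illustrative, not necessarily the paper's exact formulation:

      ```python
      import torch

      def late_triplet_loss(img_emb, shape_emb, img_labels, shape_labels, margin=0.2):
          """Embed each image and 3D shape once, then enumerate cross-domain
          triplets (anchor image, positive shape, negative shape) inside the
          loss layer instead of running three network streams per triplet."""
          d = torch.cdist(img_emb, shape_emb)  # (N_img, N_shape) distances
          same = img_labels[:, None] == shape_labels[None, :]  # positive mask
          # Element [a, p, n] = d(a, p) - d(a, n) + margin, for all pairs.
          loss = d[:, :, None] - d[:, None, :] + margin
          valid = same[:, :, None] & ~same[:, None, :]
          return torch.clamp(loss[valid], min=0).mean()

      img_emb, shape_emb = torch.rand(8, 128), torch.rand(16, 128)
      img_labels, shape_labels = torch.randint(4, (8,)), torch.randint(4, (16,))
      loss = late_triplet_loss(img_emb, shape_emb, img_labels, shape_labels)
      ```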
