

  1. Multimodal Machine Learning. Louis-Philippe (LP) Morency, CMU Multimodal Communication and Machine Learning Laboratory [MultiComp Lab]

  2. CMU Course 11-777: Multimodal Machine Learning

  3. Lecture Objectives
     ▪ What is Multimodal?
     ▪ Multimodal: core technical challenges
       ▪ Representation learning, translation, alignment, fusion, and co-learning
     ▪ Multimodal representation learning
       ▪ Joint and coordinated representations
       ▪ Multimodal autoencoder and tensor representation
       ▪ Deep canonical correlation analysis
     ▪ Fusion and temporal modeling
       ▪ Multi-view LSTM and memory-based fusion
       ▪ Fusion with multiple attentions

  4. What is Multimodal?

  5. What is Multimodal?
     Multimodal distribution
     ➢ Multiple modes, i.e., distinct “peaks” (local maxima) in the probability density function
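
To make the statistical sense of “multimodal” concrete, here is a minimal Python sketch (all names are my own; it assumes only numpy and scipy) that builds a mixture of two Gaussians and locates the two distinct peaks in its density:

```python
import numpy as np
from scipy.stats import norm

# A mixture of two Gaussians: a textbook multimodal (here, bimodal) distribution.
xs = np.linspace(-6, 6, 1201)
pdf = 0.5 * norm.pdf(xs, loc=-2, scale=1) + 0.5 * norm.pdf(xs, loc=2, scale=1)

# Local maxima of the density are the distribution's "modes" (peaks).
is_peak = (pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] > pdf[2:])
modes = xs[1:-1][is_peak]
print(modes)  # ~[-2.  2.]: two distinct peaks, hence "multimodal"
```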

  6. What is Multimodal? Sensory Modalities

  7. Multimodal Communicative Behaviors
     Verbal: Lexicon (words); Syntax (part-of-speech, dependencies); Pragmatics (discourse acts)
     Vocal: Prosody (intonation, voice quality); Vocal expressions (laughter, moans)
     Visual: Gestures (head, eye, and arm gestures); Body language (body posture, proxemics); Eye contact (head gaze, eye gaze); Facial expressions (FACS action units; smile, frowning)

  8. What is Multimodal?
     Modality: the way in which something happens or is experienced.
     • Modality refers to a certain type of information and/or the representation format in which information is stored.
     • Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication.
     Medium (from Latin for “middle”): a means or instrumentality for storing or communicating information; a system of communication/transmission.
     • The medium is the means whereby information is delivered to the senses of the interpreter.

  9. Multiple Communities and Modalities: Psychology, Medical, Speech, Vision, Language, Multimedia, Robotics, Learning

  10. Examples of Modalities
      ❑ Natural language (both spoken and written)
      ❑ Visual (from images or videos)
      ❑ Auditory (including voice, sounds, and music)
      ❑ Haptics / touch
      ❑ Smell, taste, and self-motion
      ❑ Physiological signals
        ▪ Electrocardiogram (ECG), skin conductance
      ❑ Other modalities
        ▪ Infrared images, depth images, fMRI

  11. Prior Research on “Multimodal”
      Four eras of multimodal research:
      ➢ The “behavioral” era (1970s until late 1980s)
      ➢ The “computational” era (late 1980s until 2000)
      ➢ The “interaction” era (2000–2010)
      ➢ The “deep learning” era (2010s until …) ❖ main focus of this tutorial

  12. The McGurk Effect (1976): “Hearing lips and seeing voices”, Nature. An auditory /ba/ dubbed onto a visual /ga/ is typically heard as /da/: speech perception fuses the audio and visual channels.

  13. The McGurk Effect (1976): “Hearing lips and seeing voices”, Nature.

  14. The “Computational” Era (late 1980s until 2000)
      1) Audio-Visual Speech Recognition (AVSR)

  15. Core Technical Challenges

  16. Core Challenges in “Deep” Multimodal ML
      Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, https://arxiv.org/abs/1705.09406
      These challenges are not mutually exclusive.

  17. Core Challenge 1: Representation
      Definition: learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.
      A. Joint representations: Modality 1 + Modality 2 → a single shared representation.
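
As an illustration of a joint representation, here is a minimal PyTorch sketch (module names and dimensions are my own, not from the lecture): each modality gets its own encoder, and a shared layer maps the concatenated encodings into one multimodal vector:

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    """Encode two modalities, then project them into one shared vector."""
    def __init__(self, dim1=40, dim2=2048, hidden=128, joint=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(dim1, hidden), nn.ReLU())  # e.g., audio features
        self.enc2 = nn.Sequential(nn.Linear(dim2, hidden), nn.ReLU())  # e.g., visual features
        self.joint = nn.Linear(2 * hidden, joint)                      # shared multimodal space

    def forward(self, x1, x2):
        h = torch.cat([self.enc1(x1), self.enc2(x2)], dim=-1)
        return self.joint(h)

model = JointRepresentation()
z = model(torch.randn(8, 40), torch.randn(8, 2048))
print(z.shape)  # torch.Size([8, 64]) -- one joint vector per example
```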

  18. Joint Multimodal Representation: utterances such as “I like it!” (joyful tone) and “Wow!” (tense voice) are mapped into a single joint representation (multimodal space).

  19. Joint Multimodal Representations
      ▪ Audio-visual speech recognition [Ngiam et al., ICML 2011]: bimodal deep belief network
      ▪ Image captioning [Srivastava and Salakhutdinov, NIPS 2012]: multimodal deep Boltzmann machine
      ▪ Audio-visual emotion recognition [Kim et al., ICASSP 2013]: deep Boltzmann machine

  20. Multimodal Vector Space Arithmetic [Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
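
The kind of arithmetic this slide refers to (word2vec-style analogies applied across image and text embeddings) can be sketched with plain numpy. The embeddings below are random stand-ins, not the paper's trained model; only the query pattern is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings in a shared image-text space (random here; in the paper
# they come from a trained visual-semantic embedding model).
word_vec = {w: rng.normal(size=64) for w in ["blue", "red", "car", "boat"]}
image_db = rng.normal(size=(1000, 64))  # embedded images

def nearest(query, db):
    """Index of the database vector most cosine-similar to the query."""
    sims = db @ query / (np.linalg.norm(db, axis=1) * np.linalg.norm(query) + 1e-8)
    return int(np.argmax(sims))

# "image of a blue car" - "blue" + "red"  ->  retrieve a matching image
query = image_db[0] - word_vec["blue"] + word_vec["red"]
print(nearest(query, image_db))
```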

  21. Core Challenge 1: Representation
      Definition: learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.
      A. Joint representations: Modality 1 + Modality 2 → a single shared representation.
      B. Coordinated representations: each modality gets its own representation (Repres. 1, Repres. 2), and the two are coordinated through a constraint such as similarity.

  22. Coordinated Representation: Deep CCA
      Learn projections of the two views that are maximally correlated:
      $(\mathbf{u}^*, \mathbf{v}^*) = \operatorname{argmax}_{\mathbf{u},\mathbf{v}} \operatorname{corr}(\mathbf{u}^\top \mathbf{H}_x, \mathbf{v}^\top \mathbf{H}_y)$
      where $\mathbf{H}_x$ and $\mathbf{H}_y$ are the top-layer representations of the text view $\mathbf{X}$ and the image view $\mathbf{Y}$. [Andrew et al., ICML 2013]
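
Deep CCA stacks neural networks beneath this correlation objective; the linear core can be tried directly with scikit-learn's CCA. A minimal sketch on synthetic two-view data (the data generation is my own, just to show the call):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))  # shared signal underlying both views
X = latent @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(500, 20))  # "text" view
Y = latent @ rng.normal(size=(2, 30)) + 0.1 * rng.normal(size=(500, 30))  # "image" view

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)  # the projections u^T X and v^T Y

# Correlation of the first pair of canonical variates (near 1 here by construction).
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```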

  23. Core Challenge 2: Alignment
      Definition: identify the direct relations between (sub)elements from two or more different modalities.
      A. Explicit alignment: the goal is to directly find correspondences between elements of different modalities.
      B. Implicit alignment: uses an internally latent alignment of modalities in order to better solve a different problem.

  24. Temporal sequence alignment
      Applications:
      - Re-aligning asynchronous data
      - Finding similar data across modalities (we can estimate the alignment cost)
      - Event reconstruction from multiple sources
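
A standard tool for re-aligning asynchronous sequences (not named on the slide, but the classic baseline) is dynamic time warping. A minimal sketch of the cumulative-cost recursion:

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic time warping: minimal cumulative cost to align sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])        # local distance between elements
            D[i, j] = d + min(D[i - 1, j],      # step in a only
                              D[i, j - 1],      # step in b only
                              D[i - 1, j - 1])  # step in both (match)
    return D[n, m]

# A sequence and a time-warped version of it align cheaply...
a = np.sin(np.linspace(0, 4, 50))
b = np.sin(np.linspace(0, 4, 80) ** 1.1)
print(dtw_cost(a, b))   # small alignment cost
print(dtw_cost(a, -b))  # much larger: the sequences do not correspond
```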

  25. Alignment examples (multimodal)

  26. Implicit Alignment: Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, https://arxiv.org/pdf/1406.5679.pdf

  27. Core Challenge 3: Fusion
      Definition: to join information from two or more modalities to perform a prediction task.
      A. Model-agnostic approaches:
      1) Early fusion: concatenate the features of Modality 1 and Modality 2, then train a single classifier.
      2) Late fusion: train one classifier per modality, then combine their predictions.
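
The two model-agnostic recipes from the slide, sketched with scikit-learn (the feature arrays and the classifier choice are placeholders of my own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 40))  # placeholder per-modality features
X_video = rng.normal(size=(200, 64))
y = rng.integers(0, 2, size=200)      # placeholder binary labels

# 1) Early fusion: concatenate features, train a single classifier.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# 2) Late fusion: one classifier per modality, then combine the predictions
#    (here by averaging predicted probabilities).
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
proba = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
late_pred = proba.argmax(axis=1)

print(early.predict(np.hstack([X_audio, X_video]))[:5], late_pred[:5])
```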

  28. Core Challenge 3: Fusion
      Definition: to join information from two or more modalities to perform a prediction task.
      B. Model-based (intermediate) approaches:
      1) Deep neural networks
      2) Kernel-based methods (e.g., multiple kernel learning)
      3) Graphical models (e.g., multi-view hidden CRF)

  29. Core Challenge 4: Translation
      Definition: process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.
      A. Example-based
      B. Model-driven
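
Example-based translation in its simplest form is retrieval: embed the source-modality input and return the target-modality side of the nearest stored training pair. A hypothetical sketch (the embeddings here are random stand-ins for a trained cross-modal encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training set: paired image embeddings and their captions.
image_embs = rng.normal(size=(1000, 64))
captions = [f"caption #{i}" for i in range(1000)]

def translate_by_retrieval(image_emb):
    """Return the caption paired with the most similar training image."""
    sims = image_embs @ image_emb
    return captions[int(np.argmax(sims))]

print(translate_by_retrieval(image_embs[42]))  # "caption #42"
```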

  30. Core Challenge 4: Translation
      Example: transcriptions + audio streams → visual gestures (both speaker and listener gestures).
      Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013

  31. Core Challenge 5: Co-Learning
      Definition: transfer knowledge between modalities, including their representations and predictive models.

  32. Core Challenge 5: Co-Learning
      A. Parallel
      B. Non-parallel
      C. Hybrid

  33. Input modalities: language, visual, and acoustic streams feed a prediction (example caption: “Big dog on the beach”).

  34. Taxonomy of Multimodal Research [https://arxiv.org/abs/1705.09406]
      ▪ Representation
        ▪ Joint: neural networks; graphical models; sequential
        ▪ Coordinated: similarity; structured
      ▪ Translation
        ▪ Example-based: retrieval; combination
        ▪ Model-based: grammar-based; encoder-decoder; online prediction
      ▪ Alignment
        ▪ Explicit: unsupervised; supervised
        ▪ Implicit: graphical models; neural networks
      ▪ Fusion
        ▪ Model agnostic: early fusion; late fusion; hybrid fusion
        ▪ Model-based: kernel-based; graphical models; neural networks
      ▪ Co-learning
        ▪ Parallel data: co-training; transfer learning
        ▪ Non-parallel data: zero-shot learning; concept grounding; transfer learning
        ▪ Hybrid data: bridging
      Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

  35. Multimodal Applications [ https://arxiv.org/abs/1705.09406 ] Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

  36. Multimodal Representations

  37. Core Challenge: Representation
      Definition: learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.
      A. Joint representations: Modality 1 + Modality 2 → a single shared representation.
      B. Coordinated representations: each modality gets its own representation (Repres. 1, Repres. 2), and the two are coordinated through a constraint such as similarity.

  38. Deep Multimodal Autoencoders
      ▪ A deep representation learning approach
      ▪ A bimodal autoencoder
      ▪ Used for audio-visual speech recognition
      [Ngiam et al., Multimodal Deep Learning, 2011]

  39. Deep Multimodal Autoencoders: training
      ▪ Individual modalities can be pretrained
        ▪ RBMs
        ▪ Denoising autoencoders
      ▪ To train the model to reconstruct the other modality, alternate the inputs (see the sketch below):
        ▪ use both modalities
        ▪ remove the audio, keeping both modalities as the reconstruction target
      [Ngiam et al., Multimodal Deep Learning, 2011]
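
A minimal PyTorch sketch of the bimodal autoencoder idea, including the training trick from the slide (sometimes zeroing out one modality so the shared code learns to reconstruct both). Sizes and layer choices are my own, not the paper's architecture:

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Encode concatenated audio+video into a shared code; decode both back."""
    def __init__(self, d_audio=40, d_video=64, d_shared=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_audio + d_video, 128), nn.ReLU(),
                                 nn.Linear(128, d_shared))
        self.dec = nn.Sequential(nn.Linear(d_shared, 128), nn.ReLU(),
                                 nn.Linear(128, d_audio + d_video))

    def forward(self, audio, video):
        z = self.enc(torch.cat([audio, video], dim=-1))  # shared representation
        return self.dec(z)

model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio, video = torch.randn(16, 40), torch.randn(16, 64)
target = torch.cat([audio, video], dim=-1)

for step in range(100):
    # Half the time, remove the audio input; the target still contains both
    # modalities, so the model must reconstruct audio from video alone.
    a_in = audio if step % 2 == 0 else torch.zeros_like(audio)
    loss = nn.functional.mse_loss(model(a_in, video), target)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```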
