Aligning Audiovisual Features for Audiovisual Speech Recognition
Fei Tao and Carlos Busso
Multimodal Signal Processing (MSP) Laboratory
Department of Electrical Engineering, The University of Texas at Dallas, Richardson, TX 75080, USA
Audiovisual approaches enable robust ASR, and deep neural networks have emerged for AV-ASR: GMM-HMM systems (Neti et al. [2000]), multimodal deep learning (Ngiam et al. [2011]), and end-to-end AV-ASR (Petridis et al. [2018]).
Introduction
Fusing audiovisual features has followed a static scheme: linear interpolation (or extrapolation) is used to align the streams.
The audiovisual modalities are fused at the decision, model, or feature level.
[Figure: audio and video streams over time, showing a phase difference between them]
How should the audiovisual modalities be aligned?
Motivation
There is a phase difference between lip motion and speech [Tao et al., 2016].
[Figure: lip movement versus acoustic activity over time, illustrating the phase difference]
Bregler and Konig [1994] showed that the best alignment was obtained with a shift of 120 milliseconds.
However, the phase is time variant, so a fixed shift may not be the optimal approach.
Motivation
Audiovisual features are concatenated frame-by-frame, but:
For some phonemes, lip movements precede speech production.
For other phonemes, speech production precedes lip movements.
In some cases, the audiovisual modalities are well aligned [Hazen, 2006] (e.g., pronouncing the burst release of /b/).
Co-articulation effects and articulator inertia may cause the phase difference.
Lip movement can precede the audio for the phoneme /m/ in the transition from /g/ to /m/ (e.g., word segment ).
Deep Learning for Audiovisual ASR
Ninomiya et al. (2015) extracted bottleneck features for audiovisual fusion.
Ngiam et al. (2011) proposed a bimodal DNN for fusing the audiovisual modalities.
Tao et al. (2017) extended this to a bimodal RNN for the AV-SAD problem to model audiovisual temporal information.
These approaches rely on linear interpolation to align the audiovisual features.
Proposed approach: learn the alignment automatically from the data using an attention model.
Outline
1. Introduction
2. Proposed Approach
3. Corpus Description
4. Experiments and Results
5. Conclusions
Proposed Framework
The proposed approach relies on an attention model.
The attention model learns the alignment in sequence-to-sequence learning.
Each output is represented as a linear combination of the inputs at all time points, and the input and output sequences can have different lengths.
The weights of the linear combination are learned following a data-driven framework.
[Figure: output sequence y(t), y(t+1), ... generated from the input sequence h(1), ..., h(T) through attention weights a(1), ..., a(T)]
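As a concrete illustration of the linear combination described above, the following is a minimal NumPy sketch of soft attention: scores between a query vector and the input frames h(1), ..., h(T) are normalized with a softmax into weights a(1), ..., a(T), and the output is the weighted sum of the inputs. The dot-product scoring and the variable names are illustrative assumptions, not the exact formulation used in this work.

```python
import numpy as np

def soft_attention(query, inputs):
    """inputs: (T, D) sequence h(1)..h(T); query: (D,) vector.
    Returns the weights a(1)..a(T) and the output y = sum_t a(t) h(t),
    i.e., a linear combination of the inputs at all time points."""
    scores = inputs @ query                           # dot-product scores, shape (T,)
    scores -= scores.max()                            # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
    output = weights @ inputs                         # weighted sum, shape (D,)
    return weights, output

# usage: one output step attending over a 5-frame input sequence of 3-D features
h = np.random.randn(5, 3)
a, y = soft_attention(query=np.random.randn(3), inputs=h)
```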
Alignment Neural Network (AliNN)
The AliNN front-end combines a feature space transform with a temporal alignment.
It is trained as an audiovisual regression; after training, the aligned visual features are extracted together with the audio features.
[Figure: block diagram of AliNN, showing the feature space transform, the temporal alignment, the regression training, and the extraction of the aligned visual features from the audiovisual regression]
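To make the front-end concrete, below is a minimal PyTorch sketch in the spirit of AliNN: a bidirectional LSTM encodes the visual features (feature space transform), an attention decoder emits one context vector per audio frame (temporal alignment), and a linear readout regresses the audio features (regression training); after training, the context vectors serve as the aligned visual features. The layer sizes, the additive attention, and the GRU decoder are assumptions for illustration, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AliNNSketch(nn.Module):
    """Illustrative attention-based alignment front-end (not the exact AliNN)."""

    def __init__(self, dim_video=30, dim_audio=13, dim_enc=64, dim_dec=64):
        super().__init__()
        self.encoder = nn.LSTM(dim_video, dim_enc, batch_first=True,
                               bidirectional=True)                # feature space transform
        self.decoder = nn.GRUCell(2 * dim_enc, dim_dec)
        self.score = nn.Linear(dim_dec + 2 * dim_enc, 1)          # additive attention
        self.readout = nn.Linear(dim_dec + 2 * dim_enc, dim_audio)

    def forward(self, video, n_audio_frames):
        # video: (batch, T_video, dim_video), e.g., visual features at 30 fps
        enc, _ = self.encoder(video)                              # (B, T_v, 2*dim_enc)
        B, T_v, H = enc.shape
        state = enc.new_zeros(B, self.decoder.hidden_size)
        context = enc.new_zeros(B, H)
        aligned, predicted = [], []
        for _ in range(n_audio_frames):                           # audio frame rate (100 fps)
            state = self.decoder(context, state)
            # attention weights over the visual time axis (temporal alignment)
            scores = self.score(torch.cat(
                [state.unsqueeze(1).expand(-1, T_v, -1), enc], dim=-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)                   # (B, T_v)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)
            aligned.append(context)                               # aligned visual feature
            predicted.append(self.readout(torch.cat([state, context], dim=-1)))
        return torch.stack(predicted, dim=1), torch.stack(aligned, dim=1)

# Training would minimize a regression loss such as F.mse_loss(predicted, mfccs);
# at test time the aligned visual features are paired with the audio features
# for the ASR backend.
```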
Training AliNN
Training AliNN on whole utterances is computationally expensive.
We therefore segment each utterance into short sections: each segment is 1 second long, shifted by 0.5 seconds, and the sequence is zero-padded if needed.
[Figure: an utterance split into 1-second windows with a 0.5-second shift and zero padding at the end]
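A minimal NumPy sketch of this segmentation, assuming the input is a (T, D) feature matrix at a given frame rate (the function name and the 100 fps default are illustrative):

```python
import numpy as np

def segment_utterance(feats, fps=100, win_sec=1.0, shift_sec=0.5):
    """Split a (T, D) feature matrix into 1-second segments with a
    0.5-second shift, zero-padding the final segment if needed."""
    win, shift = int(win_sec * fps), int(shift_sec * fps)
    segments = []
    for start in range(0, max(len(feats) - shift, 1), shift):
        seg = feats[start:start + win]
        if len(seg) < win:                                   # pad the tail with zeros
            seg = np.vstack([seg, np.zeros((win - len(seg), feats.shape[1]))])
        segments.append(seg)
    return np.stack(segments)                                # (n_segments, win, D)

# usage: a 2.3-second utterance of 13-D features at 100 fps -> 4 padded segments
chunks = segment_utterance(np.random.randn(230, 13))
```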
Corpus Description
CRSS-4ENGLISH-14 corpus: 55 female and 50 male speakers (60 hours and 48 minutes).
Ideal condition: high-definition camera and close-talking microphone.
Challenging condition: tablet camera and tablet microphone.
Clean section (read and spontaneous speech) and noisy section (subset of the read speech).
Audiovisual Features
Audio features: 13D MFCCs (100 fps).
Visual features: 25D DCT coefficients + 5D geometric distances.
30 fps for the high-definition camera; 24 fps for the tablet camera.
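A hedged sketch of how such features could be computed with librosa and SciPy; the 16 kHz sampling rate, the top-left 5x5 block of 2-D DCT coefficients, and the helper names are illustrative assumptions (the 5D geometric distances from facial landmarks are not reproduced here):

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def audio_features(wav_path):
    """13D MFCCs at 100 frames per second (hop length = sr / 100)."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                hop_length=sr // 100).T          # (T_audio, 13)

def visual_dct_features(mouth_roi):
    """25D DCT features from a grayscale mouth region of interest,
    keeping the top-left 5x5 block of the 2-D DCT (one common choice)."""
    coeffs = dct(dct(mouth_roi.astype(float).T, norm='ortho').T, norm='ortho')
    return coeffs[:5, :5].flatten()                              # (25,)
```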
Experimental Setting
70 speakers for training, 10 for validation, and 25 for testing (gender balanced).
Training uses the ideal condition in the clean environment; testing covers the different conditions and environments.
Two backends:
GMM-HMM: features augmented with delta and delta-delta information.
DNN-HMM: 15 context frames.
The tablet data (24 fps) is linearly interpolated to 30 fps.
Linear interpolation is used as the pre-processing baseline, as sketched below.
The evaluation focuses on word error rate (WER).
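A minimal NumPy sketch of this linear-interpolation front-end, resampling a visual feature matrix from one frame rate to another (the function name and the per-dimension np.interp approach are illustrative assumptions):

```python
import numpy as np

def resample_linear(feats, fps_in, fps_out):
    """Linearly interpolate a (T, D) feature matrix from fps_in to fps_out,
    e.g., 24 -> 30 fps for the tablet video, or 30 -> 100 fps to match the
    audio frame rate in the LInterp baseline."""
    t_in = np.arange(len(feats)) / fps_in
    t_out = np.arange(int(len(feats) * fps_out / fps_in)) / fps_out
    return np.stack([np.interp(t_out, t_in, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)

# usage: upsample 24 fps tablet visual features to 30 fps
video_30fps = resample_linear(np.random.randn(240, 30), fps_in=24, fps_out=30)
```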
Experiment Results
Under the ideal condition, the proposed front-end always achieves the best performance.
Under the tablet condition, the proposed front-end achieves the best performance except with the GMM-HMM backend.
Linearly interpolating the tablet data to 30 fps may reduce the advantage of AliNN.

                          Ideal condition           Tablet condition
Front-end   Backend     Clean WER   Noise WER     Clean WER   Noise WER
LInterp     GMM-HMM       23.3        24.2          24.7        30.7
AliNN       GMM-HMM       17.5        19.2          22.7        35.6
LInterp     DNN-HMM        4.2         4.9          15.5        15.9
AliNN       DNN-HMM        4.1         4.5           4.6        10.0
Results Analysis
[Figure: bar charts of the WER values from the table above, contrasting the LInterp and AliNN front-ends under the ideal and tablet conditions]
Conclusions
This study proposed the alignment neural network (AliNN), which learns the alignment between the audio and visual modalities from the data and does not require alignment or task labels.
The proposed front-end was evaluated on the CRSS-4ENGLISH-14 corpus, a large corpus for audiovisual large-vocabulary ASR (AV-LVASR, over 60 hours).
The proposed front-end outperforms simple linear interpolation under various conditions.
Future work will extend the approach to an end-to-end framework.
References
C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, "Audio-visual speech recognition," Workshop 2000 Final Report, Technical Report 764, October 2000.
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, "Multimodal deep learning," in International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, June-July 2011, pp. 689-696.
S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, "End-to-end audiovisual speech recognition," arXiv preprint arXiv:1802.06424, 2018.
T.J. Hazen, "Visual model structures and synchrony constraints for audio-visual speech recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, pp. 1082-1089, May 2006.
C. Bregler and Y. Konig, "'Eigenlips' for robust speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1994), Adelaide, Australia, April 1994, vol. 2, pp. 669-672.
F. Tao, J.H.L. Hansen, and C. Busso, "Improving boundary estimation in audiovisual speech activity detection using Bayesian information criterion," in Interspeech 2016, San Francisco, CA, USA, September 2016, pp. 2130-2134.