Improving Boundary Estimation in Audiovisual Speech Activity - PowerPoint PPT Presentation

MSP - CRSS
Improving Boundary Estimation in Audiovisual Speech Activity Detection Using Bayesian Information Criterion
Fei Tao, John H.L. Hansen, Carlos Busso
Multimodal Signal Processing (MSP) Laboratory, Center for Robust Speech Systems (CRSS), Department of Electrical Engineering, The University of Texas at Dallas, Richardson TX 75080, USA
busso@utdallas.edu, msp.utdallas.edu

Introduction
• Speech Activity Detection (SAD) plays an important role in speech-based interfaces
• Audio-only SAD (A-SAD) may fail under:
  • Noise
  • Different speech modes (e.g., whispered speech)
• Visual SAD (V-SAD) is introduced to improve SAD [Aubrey et al. (2007), Joosten et al. (2013)]

• A key problem in V-SAD systems is the precise detection of boundaries
  • Lip movement associated with non-speech events (e.g., lip smacking, laughing)
  • Anticipatory facial movements (e.g., 10 ms)
  • Low temporal resolution of video (30 fps vs. 100 fps)
• The Bayesian Information Criterion (BIC) is used to improve boundary detection

Previous Work on SAD
• Supervised V-SAD
  • Aubrey et al. (2007) applied an HMM to develop a V-SAD system
  • Joosten et al. (2013) applied an SVM classifier
• AV-SAD fusion
  • Takeuchi et al. (2009) combined the V-SAD and A-SAD decision boundaries using logical operators
  • Almajai and Milner (2008) concatenated acoustic and visual features
• No prior work has focused on improving boundary detection

AV-SAD System: Audio Component
• Framework proposed by Sajadi and Hansen (2013)
• Audio feature (5-D): harmonicity, clarity, prediction gain, periodicity, perceptual spectral flux
• Principal Component Analysis (PCA) on the audio feature: 1-D combo feature
[Figure: the five audio features pass through PCA to produce the combo feature]
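
The reduction of a multi-dimensional feature to a 1-D combo feature can be sketched as a projection onto the first principal component. The sketch below is only an illustration in pure Python: the function name, the power-iteration solver, and its starting vector are assumptions, not details taken from the slides.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (direction, projections) for the first principal component.

    rows: list of equal-length feature vectors (e.g. 5-D audio features).
    Hypothetical helper for illustration only.
    """
    n, d = len(rows), len(rows[0])
    # Center every feature dimension.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-uniform start
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate case: keep the current vector
            break
        v = [x / norm for x in w]
    # 1-D combo feature: projection of each centered frame onto v.
    combo = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, combo

# Toy 2-D data lying on the line y = 2x.
direction, combo = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
```

In practice a linear-algebra library would replace the power iteration; the point here is only that each frame ends up as a single scalar.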

Unsupervised A-SAD
• Unsupervised clustering with the EM approach
[Figure: non-speech class and speech class separated by a threshold]
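
A minimal sketch of two-class EM clustering on a 1-D feature follows. It assumes a two-component Gaussian mixture, initial means at the data extremes, and that the class with the larger mean is speech. None of these choices are stated on the slide, and all names are hypothetical.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_class(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (means, variances, weights, posteriors of class 1).
    """
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]                              # assumed initialization
    var = [((hi - lo) / 4) ** 2 + 1e-6] * 2
    pi = [0.5, 0.5]
    resp = [0.5] * len(xs)
    for _ in range(iters):
        # E-step: posterior probability of class 1 for every frame.
        resp = []
        for x in xs:
            p0 = pi[0] * gauss(x, mu[0], var[0])
            p1 = pi[1] * gauss(x, mu[1], var[1])
            resp.append(p1 / (p0 + p1))
        # M-step: re-estimate mean, variance, and weight of each class.
        for k in (0, 1):
            w = [r if k == 1 else 1.0 - r for r in resp]
            total = sum(w)
            mu[k] = sum(wi * x for wi, x in zip(w, xs)) / total
            var[k] = sum(wi * (x - mu[k]) ** 2 for wi, x in zip(w, xs)) / total + 1e-6
            pi[k] = total / len(xs)
    return mu, var, pi, resp

combo = [0.1, 0.0, -0.1, 0.05, 2.0, 2.1, 1.9, 2.05]
means, variances, weights, post = em_two_class(combo)
is_speech = [p > 0.5 for p in post]  # assumed: higher-mean class is speech
```

Thresholding the posterior at 0.5 is equivalent to placing a decision threshold on the feature axis between the two fitted classes.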

AV-SAD System: Video Component
• Video feature [Tao et al. (2015)]:
  • Optical flow: OFx, OFy and OFx+OFy (OFxy)
  • Geometric feature: height (H), width (W), W×H and H+W
• Short-term statistics (0.3 s window)

Feature set                      OFx  OFy  OFxy  H   W   W+H  W×H
Temporal Variance                 ✓    ✓    ✓    ✓   ✓    ✓    ✓
Zero Crossing Rate                ✓    ✓    ✓    ✓   ✓    ✓    ✓
Speech Periodic Characteristic    ✓    ✓    ✓    ✓   ✓    ✓    ✓
First Order Derivative            ✓ for 4 of the 7 features (column positions lost in extraction)

25-D feature in total
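
Two of the short-term statistics can be sketched as follows. The 0.3 s window comes from the slide; the centered window, the 29.97 fps default, and defining the zero crossing rate on the mean-removed signal are assumptions made for this illustration.

```python
def short_term_stats(signal, fps=29.97, window_s=0.3):
    """Per-frame temporal variance and zero crossing rate.

    Uses a window of about window_s seconds centered on each frame
    (9 frames at 29.97 fps). Hypothetical helper for illustration.
    """
    win = max(2, round(fps * window_s))
    half = win // 2
    stats = []
    for t in range(len(signal)):
        seg = signal[max(0, t - half): t + half + 1]
        mean = sum(seg) / len(seg)
        # Temporal variance inside the window.
        var = sum((x - mean) ** 2 for x in seg) / len(seg)
        # Zero crossing rate: fraction of adjacent pairs whose
        # mean-removed values change sign.
        centered = [x - mean for x in seg]
        crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
        zcr = crossings / max(1, len(seg) - 1)
        stats.append((var, zcr))
    return stats

# Example: a rapidly alternating mouth-height track.
track = [1.0, -1.0] * 10
variance_mid, zcr_mid = short_term_stats(track)[10]
```

Applying the same function to each of the seven streams would give the variance and zero-crossing rows of the feature set.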

Unsupervised V-SAD
• Similar approach to unsupervised A-SAD
• PCA on the 25-D feature
• EM to form two classes on the “combo” feature
[Figure: 25-D feature passes through PCA to produce the visual combo feature, which a threshold splits into a non-speech class and a speech class]

Proposed Approach
• Unsupervised A-SAD and V-SAD [Sajadi and Hansen (2013), Tao et al. (2015)]
• Audio-visual fusion
  • Logical fusion: “AND” and “OR”
• BIC refinement
[Figure: system diagram with audio (5-D) and video (25-D) inputs]
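
Logical fusion of the two decision streams can be sketched in a few lines. The sketch assumes that the A-SAD and V-SAD decisions have already been aligned to a common frame rate; the function name is hypothetical.

```python
def fuse_decisions(audio, video, mode="and"):
    """Fuse per-frame speech decisions (booleans) from A-SAD and V-SAD."""
    if len(audio) != len(video):
        raise ValueError("decision streams must have the same length")
    if mode == "and":
        # Speech only when both modalities agree: fewer false alarms.
        return [a and v for a, v in zip(audio, video)]
    if mode == "or":
        # Speech when either modality fires: fewer missed frames.
        return [a or v for a, v in zip(audio, video)]
    raise ValueError("mode must be 'and' or 'or'")

a_sad = [False, True, True, True, False]
v_sad = [True, True, True, False, False]
fused_and = fuse_decisions(a_sad, v_sad, "and")
fused_or = fuse_decisions(a_sad, v_sad, "or")
```

"AND" trades recall for precision, while "OR" does the opposite.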

Bayesian Information Criterion (BIC) Refine
• The BIC is a criterion used to select a model among candidate models [Zhou and Hansen (2005)]
• Hypothesis 1 (H1): one single distribution over all N frames
• Hypothesis 2 (H2): bimodal distribution, with the first b frames and the remaining N−b frames modeled separately
• ∆BIC = BIC(H2) − BIC(H1)
• In the ∆BIC expression, d is the feature dimension, and the three covariance terms are the covariance of all N frames, of the first b frames, and of the remaining N−b frames
[Figure: Hypothesis 1 covers all N frames; Hypothesis 2 splits them after the first b frames]
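
The ∆BIC expression itself did not survive on this slide. A commonly used Gaussian form that matches the definitions given here is shown below. The symbols Σ, Σ₁, Σ₂ for the three covariances and the penalty weight λ are assumptions, so the exact expression used by the authors may differ.

```latex
\Delta\mathrm{BIC}(b) = \frac{N}{2}\log\lvert\Sigma\rvert
  - \frac{b}{2}\log\lvert\Sigma_1\rvert
  - \frac{N-b}{2}\log\lvert\Sigma_2\rvert
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

Under this sign convention a positive ∆BIC favors H2, that is, a change point after frame b.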

Bayesian Information Criterion (BIC) Refine
• Focus on the transition area
• Potential boundary given by the previous steps
• ∆BIC computed for each frame in the search window
• Extra frames added before and after the search window
[Figure: speech and non-speech regions around the potential boundary, with search windows labeled 0.5 s]

[Figure: candidate split point (∆BIC?) between segments 1 and 2, with extra frames on either side of the search window]
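
The refinement step can be sketched as a search for the frame with the largest ∆BIC around a potential boundary. The sketch makes several assumptions: a 1-D feature (d = 1), a standard Gaussian ∆BIC with penalty weight `lam`, window sizes given in frames, and keeping the original boundary when no candidate scores above zero. All names are hypothetical.

```python
import math

def variance(xs):
    """Population variance with a small floor to keep the log finite."""
    m = sum(xs) / len(xs)
    return max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-8)

def delta_bic(frames, b, lam=1.0):
    """Delta BIC for splitting 1-D frames after the first b frames."""
    n = len(frames)
    gain = 0.5 * (n * math.log(variance(frames))
                  - b * math.log(variance(frames[:b]))
                  - (n - b) * math.log(variance(frames[b:])))
    # Penalty (lam / 2) * (d + d(d+1)/2) * log n reduces to lam * log n for d = 1.
    return gain - lam * math.log(n)

def refine_boundary(feature, boundary, search=5, extra=5, lam=1.0):
    """Move a potential boundary to the best-scoring frame nearby."""
    start = max(0, boundary - search - extra)
    end = min(len(feature), boundary + search + extra + 1)
    segment = feature[start:end]        # search window plus extra frames
    best, best_score = boundary, 0.0    # keep the boundary if nothing is positive
    first = max(boundary - search, start + 2)
    last = min(boundary + search, end - 2)
    for cand in range(first, last + 1):
        score = delta_bic(segment, cand - start, lam)
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy feature whose statistics change at frame 20; initial guess is 17.
feature = [0.0, 0.1] * 10 + [5.0, 5.1] * 10
refined = refine_boundary(feature, 17)
```

The extra frames on both sides give each hypothesis enough data to estimate its statistics even when the candidate sits at the edge of the search window.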

Corpus Description
• MSP Audio-visual Whisper (MSP-AVW) corpus
• 20 male and 20 female speakers
• 120 TIMIT sentences per speaker (60 neutral, 60 whispered)
• Audio: SHURE close-talk microphone, 48 kHz
• Video: high-definition SONY cameras (1440 × 1080) at 29.97 fps

Experiment and Result
• Performance without BIC (Nsen: neutral sentences; Wsen: whisper sentences)

Modality  Set   Acc [%]  Pre [%]  Rec [%]  F [%]
A-SAD     Nsen  94.05    97.15    89.85    93.35
A-SAD     Wsen  67.96    61.02    88.65    72.28
V-SAD     Nsen  78.06    75.11    89.45    80.40
V-SAD     Wsen  78.20    72.69    89.10    80.06
AV-SAD    Nsen  89.47    97.90    79.93    88.00
AV-SAD    Wsen  81.28    81.73    79.21    80.45

• Whisper decreases A-SAD performance by ~20%
• V-SAD is robust to different speech modes
• Under the neutral condition, fusion decreases performance by ~5% relative to A-SAD
  • The ground-truth labels were annotated based only on audio
  • The original video sampling frequency is low (29.97 fps)
• Under the whisper condition, fusion improves performance by ~8% relative to A-SAD
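
For reference, the four reported metrics can be computed from frame-level decisions as sketched below. This uses the standard definitions of accuracy, precision, recall, and F-score with speech as the positive class; the function name and toy labels are illustrative only.

```python
def sad_metrics(pred, truth):
    """Frame-level accuracy, precision, recall, and F-score.

    pred and truth are equal-length sequences of 0/1 labels, 1 = speech.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    acc = (tp + tn) / len(truth)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return acc, pre, rec, f

pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
acc, pre, rec, f = sad_metrics(pred, truth)
```

High precision with lower recall, as in the AV-SAD rows, means the system rarely labels non-speech as speech but misses some speech frames.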

• Performance with BIC:
  • Apply BIC on the boundaries detected by AV-SAD

System           Set   Acc [%]  Pre [%]  Rec [%]  F [%]
AV-SAD           Nsen  89.47    97.90    79.93    88.00
AV-SAD           Wsen  81.28    81.73    79.21    80.45
AV-SAD + A-BIC   Nsen  91.11    97.47    83.77    90.10
AV-SAD + A-BIC   Wsen  82.91    84.47    79.48    81.90
AV-SAD + V-BIC   Nsen  88.53    92.22    83.18    87.47
AV-SAD + V-BIC   Wsen  78.67    76.63    80.54    78.53
AV-SAD + AV-BIC  Nsen  91.25    97.49    84.05    90.27
AV-SAD + AV-BIC  Wsen  82.87    83.76    80.37    82.03

• A-BIC improves the system:
  • For speech detection, ~2% absolute improvement
• V-BIC impairs the system
  • Modalities mismatch
• AV-BIC achieves the best performance on speech detection

Median Local Boundary Mismatch
• Local Boundary Mismatch (LBM)
  • The number of mismatched frames between the detected boundary and the ground truth in local regions
• Median Local Boundary Mismatch (MLBM)
  • Represents the boundary detection performance
  • Lower is better
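
One possible reading of the MLBM can be sketched as follows. It assumes that the LBM of each ground-truth boundary is the absolute frame distance to the nearest detected boundary; the slides do not give the exact definition of the local region, so this is an interpretation, and the function name is hypothetical.

```python
from statistics import median

def median_local_boundary_mismatch(detected, truth):
    """Median over ground-truth boundaries of the distance (in frames)
    to the nearest detected boundary. Assumed definition of LBM."""
    if not detected or not truth:
        raise ValueError("need at least one boundary in each list")
    lbm = [min(abs(d - t) for d in detected) for t in truth]
    return median(lbm)

truth_frames = [100, 300, 500]
detected_frames = [110, 290, 540]
mlbm = median_local_boundary_mismatch(detected_frames, truth_frames)
```

Using the median rather than the mean keeps one badly missed boundary, such as the 40-frame error above, from dominating the score.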

• Boundary detection performance:
  • Video up-sampled to 100 fps for MLBM comparison

System           Set   MLBM [frames at 100 fps]
AV-SAD           Nsen  35.00
AV-SAD           Wsen  64.00
AV-SAD + A-BIC   Nsen  25.00
AV-SAD + A-BIC   Wsen  56.00
AV-SAD + V-BIC   Nsen  42.00
AV-SAD + V-BIC   Wsen  71.00
AV-SAD + AV-BIC  Nsen  25.00
AV-SAD + AV-BIC  Wsen  53.00

• A-BIC improves the system:
  • MLBM improves by 28.5% relative under the neutral condition and 12.5% under the whisper condition
• V-BIC impairs the system
  • Modalities mismatch
• AV-BIC achieves the best performance on boundary detection
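
The relative improvements quoted for A-BIC follow from the MLBM values in the table. A small check, with a hypothetical helper name:

```python
def relative_improvement(baseline, refined):
    """Relative reduction in MLBM, in percent (lower MLBM is better)."""
    return 100.0 * (baseline - refined) / baseline

neutral = relative_improvement(35.0, 25.0)   # AV-SAD vs. AV-SAD + A-BIC, Nsen
whisper = relative_improvement(64.0, 56.0)   # AV-SAD vs. AV-SAD + A-BIC, Wsen
```

The neutral case evaluates to about 28.57%, which the slide reports as 28.5%, and the whisper case to exactly 12.5%.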

Conclusion and Future Work
• Conclusion
  • AV-SAD was explored, showing that the visual modality improves robustness under the whisper condition
  • Proposed an approach that uses BIC to improve boundary detection in SAD
  • AV-BIC achieves the best performance
• Future Work
  • Better fusion approaches need to be explored
