Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors for NIST LRE 2015
Alan McCree, Greg Sell, and Daniel Garcia-Romero
JHU HLTCOE
Odyssey 2016

Overview
• JHU HLTCOE submission to LRE15
  – One of top performers
• Approach
  – DNN features and state labels
  – Acoustic, phonotactic, and joint i-vectors
  – Simple fusion: scale factor and duration
  – Data augmentation

Outline
• System design
  – i-vector LID system
  – Improvement with DNNs
  – Alternative i-vectors
• LRE15 task
  – Data usage/augmentation
• Results and analysis

LID System
[Block diagram: Raw → MFCC → Compute stats → i-vector extractor → ivec → LDA → Gaussian scoring → scores; stages are labeled UNSUPERVISED and SUPERVISED]
• Two-covariance model in i-vector space
• Discriminative refinement of Gaussians
  – Within-class covariance scale factor
  – Language class means
  – Multiclass MMI

LID System
[Block diagram: Raw → MFCC → Compute stats → i-vector extractor → ivec → LDA → Gaussian scoring → scores, with a GMM supplying frame posteriors from the MFCCs to the statistics computation; stages are labeled UNSUPERVISED and SUPERVISED]

LID System
[Block diagram: Raw → MFCC → Compute stats → i-vector extractor → ivec → LDA → Gaussian scoring → scores, with a DNN fed by ASR features supplying frame posteriors to the statistics computation; other labels: STATS, MFCC; stages are labeled UNSUPERVISED and SUPERVISED]

DNN Architecture
[Figure: DNN diagram; labels: "9184 dim", "Bottleneck goes here"]
• Input: 9 spliced frames of 40-dim vectors from LDA+MLLT
• 5 hidden layers with p-norm pooling (p=2)
  – Input/output ratio of 10:1
• Output targets are clustered phone states (senones)
• Trained on SWB-1 using Kaldi
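The p-norm pooling bullet can be illustrated with a small sketch. It assumes that the 10:1 input/output ratio means each group of 10 consecutive activations is collapsed into one output; the function name `pnorm_pool` and the toy activations are hypothetical and are not taken from the slides or from Kaldi.

```python
import math


def pnorm_pool(activations, group_size=10, p=2.0):
    """Collapse each consecutive group of `group_size` values into its p-norm."""
    if len(activations) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    pooled = []
    for start in range(0, len(activations), group_size):
        group = activations[start:start + group_size]
        pooled.append(sum(abs(x) ** p for x in group) ** (1.0 / p))
    return pooled


# Input dimension implied by the slide: 9 spliced frames of 40-dim vectors.
spliced_dim = 9 * 40

# Toy layer of 20 activations (assumed values), pooled 10:1 with p=2.
layer = [3.0, 4.0] + [0.0] * 8 + [1.0] * 10
pooled = pnorm_pool(layer)
```

With p=2 each pooled unit is the Euclidean norm of its group, so 20 activations yield 2 outputs.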

Types of i-vectors
• Acoustic: $m_i = m_0 + T w_i$ (where $w_i$ is the i-vector)
  – Gaussian probability model (given alignments)
  – i-vector: analytical solution for MAP estimate
  – EM estimation of T
• Phonotactic: $p_i = \mathrm{softmax}(\log p_0 + T w_i)$ (where $w_i$ is the i-vector)
  – Multinomial (categorical) probability model
  – No closed-form MAP solution: Newton's method
  – Iterative approximate estimation of T
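The analytical MAP estimate for the acoustic i-vector can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a standard normal prior on the i-vector and unit covariances (pre-whitened features), so that the estimate is w = (I + Σ_c N_c T_c^T T_c)^{-1} Σ_c T_c^T F_c. The names `solve` and `acoustic_ivector` and the toy statistics are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def acoustic_ivector(counts, first_order, t_blocks):
    """MAP i-vector under a standard normal prior and unit covariances (assumed).

    counts[c]      zeroth-order statistic N_c for component c
    first_order[c] centered first-order statistic F_c (list of D floats)
    t_blocks[c]    D x R block of T belonging to component c
    """
    rank = len(t_blocks[0][0])
    # Posterior precision starts at the prior precision (identity).
    precision = [[1.0 if i == j else 0.0 for j in range(rank)] for i in range(rank)]
    rhs = [0.0] * rank
    for n_c, f_c, t_c in zip(counts, first_order, t_blocks):
        for i in range(rank):
            rhs[i] += sum(t_c[d][i] * f_c[d] for d in range(len(f_c)))
            for j in range(rank):
                precision[i][j] += n_c * sum(row[i] * row[j] for row in t_c)
    return solve(precision, rhs)


# Toy case: one component, 2-dim features, rank-2 subspace with T = identity.
w = acoustic_ivector([3.0], [[3.0, 6.0]], [[[1.0, 0.0], [0.0, 1.0]]])
```

In the toy case the precision is 4I and the right-hand side is [3, 6], so the prior shrinks the estimate from the per-frame mean offset [1, 2] toward zero.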

How to Combine?
• Score fusion
• I-vector stacking
• Joint i-vector (a single i-vector $w_i$ shared by both models):
  – $m_i = m_0 + T_m w_i$
  – $p_i = \mathrm{softmax}(\log p_0 + T_w w_i)$

Joint I-vector Details
• Based on subspace GMM approach
• Differences
  – MAP instead of ML i-vector estimate
  – Initialize i-vector with acoustic only (closed-form)
  – Diagonal Hessian in Newton update
  – Computation: similar to acoustic (was 10x)
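A diagonal-Hessian Newton update can be sketched for the multinomial (phonotactic) term alone. This is an illustration under stated assumptions rather than the authors' procedure: it uses a standard normal prior, starts from zero instead of the acoustic closed-form estimate, omits the acoustic term of the joint model, and adds step halving (not mentioned on the slides) so the MAP objective never decreases. All function names and the toy counts are hypothetical.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def model_probs(w, log_p0, t):
    """p = softmax(log p0 + T w), with T given as K rows of R values."""
    return softmax([lp + sum(a * b for a, b in zip(row, w))
                    for lp, row in zip(log_p0, t)])


def map_objective(w, counts, log_p0, t):
    """Multinomial log-likelihood plus log of a standard normal prior (assumed)."""
    p = model_probs(w, log_p0, t)
    loglik = sum(n * math.log(pk) for n, pk in zip(counts, p))
    return loglik - 0.5 * sum(x * x for x in w)


def newton_diag(counts, log_p0, t, iters=20):
    rank = len(t[0])
    total = sum(counts)
    w = [0.0] * rank
    for _ in range(iters):
        p = model_probs(w, log_p0, t)
        step = []
        for r in range(rank):
            col = [row[r] for row in t]
            grad = sum(c * (n - total * pk) for c, n, pk in zip(col, counts, p)) - w[r]
            mean = sum(pk * c for pk, c in zip(p, col))
            var = sum(pk * c * c for pk, c in zip(p, col)) - mean * mean
            hess = -total * var - 1.0  # diagonal entry of the Hessian
            step.append(-grad / hess)
        # Step halving (an added safeguard): accept the first non-worsening step.
        base = map_objective(w, counts, log_p0, t)
        scale = 1.0
        while scale > 1e-6:
            trial = [wr + scale * s for wr, s in zip(w, step)]
            if map_objective(trial, counts, log_p0, t) >= base:
                w = trial
                break
            scale *= 0.5
    return w


# Toy data (assumed): 3 senone classes, rank-2 subspace, counts favoring class 0.
counts = [8.0, 1.0, 1.0]
log_p0 = [math.log(1.0 / 3.0)] * 3
t = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_hat = newton_diag(counts, log_p0, t)
p_hat = model_probs(w_hat, log_p0, t)
```

Using only the diagonal of the Hessian replaces a matrix solve with one division per i-vector dimension, which fits the slide's remark that the computation becomes similar to the acoustic case.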

LRE15 Task
• Conversational speech
  – Telephone and broadcast narrowband
• 20 languages, 6 confusable clusters
  – Metric: average Bayes cost ($C_{avg}$)
• Limited training condition
  – Use distributed material only
  – Switchboard (English) + transcriptions
  – Variable amount per language
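The slides name the metric without giving its formula. The sketch below assumes the pairwise form commonly used in NIST language recognition evaluations, with unit miss and false-alarm costs and a target prior of 0.5; the function name `average_cost`, the language labels, and the error rates are hypothetical.

```python
def average_cost(p_miss, p_fa, p_target=0.5):
    """Average detection cost over target languages (assumed closed-set form).

    p_miss[t]   miss rate of the detector for target language t
    p_fa[t][n]  false-alarm rate of detector t on non-target language n
    """
    langs = list(p_miss)
    n_lang = len(langs)
    total = 0.0
    for tgt in langs:
        # Non-target prior is split evenly over the other languages.
        fa = sum(p_fa[tgt][n] for n in langs if n != tgt) / (n_lang - 1)
        total += p_target * p_miss[tgt] + (1.0 - p_target) * fa
    return total / n_lang


# Toy two-language example with made-up error rates.
p_miss = {"a": 0.2, "b": 0.4}
p_fa = {"a": {"b": 0.1}, "b": {"a": 0.3}}
cost = average_cost(p_miss, p_fa)
```

For a cluster-based task, one plausible use is to apply this within each cluster and average the results.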

LRE15 Systems
• All use DNNs, i-vectors, and MMI-trained Gaussian classifier
• Variations:
  – DNNs
    • Bottleneck
    • Clustered phone state (senones)
  – i-vectors
    • Acoustic
    • Phonotactic
    • Joint

Back-end and Fusion
• Systems are already calibrated
  – MMI training of covariance scaling and means
• Duration modeling/scaling
  – [equation garbled in extraction: c t = 0 n LL S + m m t t 0 n]
• Fusion by averaging calibrated scores
  – Learn overall scaling after average
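Fusion by averaging, followed by a single overall scale factor, can be sketched as below. The scale is passed in as a fixed number here; how it is learned is not described on the slide, and the function name `fuse` and the scores are hypothetical.

```python
def fuse(system_scores, scale=1.0):
    """Average per-language calibrated scores across systems, then apply one scale."""
    n_sys = len(system_scores)
    n_lang = len(system_scores[0])
    if any(len(s) != n_lang for s in system_scores):
        raise ValueError("all systems must score the same languages")
    return [scale * sum(s[k] for s in system_scores) / n_sys
            for k in range(n_lang)]


# Two systems, two languages (made-up calibrated scores).
fused = fuse([[1.0, -2.0], [3.0, 0.0]], scale=0.5)
```

Because the inputs are already calibrated, the only free fusion parameter in this sketch is the scalar applied after averaging.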

LRE15 Cluster Scoring
• Task: closed set detection per cluster
  – Use Bayes’ rule for each
  – ID posteriors sum to 1 per cluster (sum to 6 total)
  – Convert to detection LLRs
• No cluster-specific systems or fusion
  – This is a generic 20 language LID system!
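The per-cluster scoring steps can be sketched as follows, assuming a flat prior over the languages inside each cluster. The detection LLR for a language is then its log-likelihood minus the log of the average likelihood of the other cluster members. The function name `cluster_scores`, the cluster names, and the language labels are hypothetical.

```python
import math


def cluster_scores(loglikes, clusters):
    """Return (posteriors, detection LLRs), both normalized within each cluster.

    loglikes  dict: language -> log-likelihood from the generic LID system
    clusters  dict: cluster name -> list of member languages
    """
    posteriors, llrs = {}, {}
    for members in clusters.values():
        if len(members) < 2:
            raise ValueError("each cluster needs at least two languages")
        top = max(loglikes[m] for m in members)
        likes = {m: math.exp(loglikes[m] - top) for m in members}
        total = sum(likes.values())
        for m in members:
            # Bayes' rule with a flat prior over the cluster members (assumed).
            posteriors[m] = likes[m] / total
            rivals = sum(likes[o] for o in members if o != m) / (len(members) - 1)
            llrs[m] = (loglikes[m] - top) - math.log(rivals)
    return posteriors, llrs


# Toy example: two clusters of two languages each (made-up log-likelihoods).
loglikes = {"x1": 0.0, "x2": math.log(3.0), "y1": -1.0, "y2": -1.0}
clusters = {"X": ["x1", "x2"], "Y": ["y1", "y2"]}
posteriors, llrs = cluster_scores(loglikes, clusters)
```

The same 20-language log-likelihoods feed every cluster; only the normalization set changes, which matches the slide's point that no cluster-specific systems are needed.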

Training Data and Augmentation
• UBM/T
  – Simple: all provided data, full cuts (up to 120 sec), no augmentation
• Classifier
  – [Diagram: LRE training data → SEGMENT (3-30 seconds of speech, uniform) → AUGMENT (REVERB, NOISE, RESAMPLE, ENCODE)]

Augmentation types
• Sample rate perturbation
  – Distorts pitch and speaking rate
• Additive noise
  – Multiband modulated Gaussian noise
• Reverberation
  – Long or short synthetic impulse response
• Multiband compression
  – Dynamic range compression
• Cellular speech coding
  – GSM-AMR at 4.75 or 6.7 kb/s
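Sample rate perturbation can be sketched as resampling the waveform and then treating the result as if it were still at the original rate, which shifts both pitch and speaking rate. The sketch uses plain linear interpolation as a stand-in; the resampler actually used is not stated on the slides, and the function name `perturb_rate` is hypothetical.

```python
def perturb_rate(samples, factor):
    """Resample by linear interpolation.

    factor > 1 shortens the signal (faster speech, higher pitch when played
    back at the original rate); factor < 1 lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = int((len(samples) - 1) / factor) + 1
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# A linear ramp makes the interpolation easy to check by eye.
ramp = [float(n) for n in range(10)]
faster = perturb_rate(ramp, 2.0)
slower = perturb_rate(ramp, 0.5)
```

Linear interpolation introduces aliasing on real speech, so a production pipeline would normally use a band-limited resampler instead.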

Submission Performance
System                     Avg Cavg
[0] Bottleneck, joint      19.9
[1] Senone, acoustic       -
[2] Senone, phonotactic    -
[3] Senone, joint          19.7
[0,3] Fusion               18.8
[0,1,2] Fusion             18.5

Post-eval Improvements
• Classifier tuning
  – ML initialization to MMI instead of single-cut Bayesian (PLDA)
• List usage
  – No Switchboard for UBM/T
    • No segmentation/augmentation
    • Smallest possible number of cuts!

Post-eval Results
System                     Submission   Classifier   Class+lists
Acoustic baseline          -            23.8         22.2
[0] Bottleneck, joint      19.9         19.2         18.5
[1] Senone, acoustic       -            20.2         19.2
[2] Senone, phonotactic    -            20.9         20.3
[3] Senone, joint          19.7         19.2         18.4
[0,3] Fusion               18.8         18.1         17.3
[0,1,2] Fusion             18.5         18.0         17.3

Conclusion
• JHU HLTCOE strong performer in LRE15
• Key components
  – DNN features and state labels
  – Acoustic, phonotactic, and joint i-vectors
  – Simple fusion: scale factor and duration
  – Data augmentation
