
Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors



  1. Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors for NIST LRE 2015
     Alan McCree, Greg Sell, and Daniel Garcia-Romero
     JHU HLTCOE
     Odyssey 2016

  2. Overview
     • JHU HLTCOE submission to LRE15
       – One of top performers
     • Approach
       – DNN features and state labels
       – Acoustic, phonotactic, and joint i-vectors
       – Simple fusion: scale factor and duration
       – Data augmentation

  3. Outline
     • System design
       – i-vector LID system
       – Improvement with DNNs
       – Alternative i-vectors
     • LRE15 task
       – Data usage/augmentation
     • Results and analysis

  4. LID System
     [Block diagram: Raw → MFCC → Compute stats → i-vector extractor → LDA → Gaussian scoring → scores. UNSUPERVISED: MFCC, Compute stats, i-vector extractor. SUPERVISED: LDA, Gaussian scoring.]
     • Two-covariance model in i-vector space
     • Discriminative refinement of Gaussians
       – Within-class covariance scale factor
       – Language class means
       – Multiclass MMI
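The Gaussian scoring block can be illustrated with a minimal sketch. It assumes each language is modeled by a Gaussian with its own mean and a shared within-class covariance in the LDA-projected i-vector space. The function name, argument names, and the `scale` parameter are hypothetical; the MMI refinement listed on the slide is not shown.

```python
import numpy as np

def gaussian_scores(x, means, within_cov, scale=1.0):
    """Log-likelihood of i-vector x under each language Gaussian.

    x          : (dim,) i-vector after LDA
    means      : (n_lang, dim) language class means
    within_cov : (dim, dim) shared within-class covariance
    scale      : hypothetical within-class covariance scale factor
    """
    cov = scale * within_cov
    prec = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = means - x                                   # (n_lang, dim)
    maha = np.einsum("ld,de,le->l", diff, prec, diff)  # squared Mahalanobis distances
    dim = x.shape[0]
    return -0.5 * (maha + logdet + dim * np.log(2 * np.pi))
```

Because the covariance is shared, the per-language scores differ only through the Mahalanobis term, so the classifier is linear in the i-vector.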

  5. LID System
     [Block diagram: Raw → MFCC → Compute stats → i-vector extractor → LDA → Gaussian scoring → scores. A GMM on the MFCCs supplies the frame posteriors to Compute stats.]

  6. LID System
     [Block diagram: Raw → MFCC → Compute stats → i-vector extractor → LDA → Gaussian scoring → scores. A DNN on ASR features supplies the frame posteriors to Compute stats.]

  7. DNN Architecture
     [Figure: network diagram; labels: "9184 dim", "Bottleneck goes here"]
     • Input 9 spliced frames of 40 dim vectors from LDA+MLLT
     • 5 hidden layers with p-norm pooling (p=2)
       – Input/output ratio of 10:1
     • Output targets are clustered phone states (senones)
     • Trained on SWB-1 using Kaldi
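As a small illustration of the nonlinearity named on this slide, the sketch below applies p-norm pooling with p=2 over groups of 10 inputs, which gives the 10:1 input/output ratio. It assumes groups are formed from consecutive units; `pnorm_pool` is a hypothetical name, not a Kaldi function.

```python
import numpy as np

def pnorm_pool(x, group_size=10, p=2.0):
    """Reduce each group of `group_size` consecutive units to its p-norm."""
    x = np.asarray(x, dtype=float)
    groups = x.reshape(-1, group_size)          # (n_outputs, group_size)
    return (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)

# Input dimension implied by the slide: 9 spliced frames of 40-dim vectors.
spliced_dim = 9 * 40
```

With these numbers the network input has 360 dimensions, and a hidden layer with 10:1 pooling maps every 10 pre-pooling units to one output.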

  8. Types of i-vectors
     • Acoustic i-vector: $m_i = m_0 + T w_i$
       – Gaussian probability model (given alignments)
       – i-vector: analytical solution for MAP estimate
       – EM estimation of T
     • Phonotactic i-vector: $p_i = \mathrm{softmax}(\log p_0 + T w_i)$
       – Multinomial (categorical) probability model
       – No closed form MAP solution: Newton’s method
       – Iterative approximate estimation of T
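The two models on this slide can be sketched as follows. The acoustic function uses the standard closed-form MAP estimate under a standard-normal prior on the i-vector, assuming diagonal covariances and Baum-Welch statistics that are already computed and centered on the UBM means. The phonotactic function only evaluates the softmax model. All function and argument names are hypothetical.

```python
import numpy as np

def acoustic_ivector(T, sigma_inv, n, f):
    """Closed-form MAP i-vector: w = (I + T' S^-1 N T)^-1 T' S^-1 f.

    T         : (C*D, R) total variability matrix, rows ordered component by component
    sigma_inv : (C*D,) inverse diagonal covariances
    n         : (C,) zero-order statistics per component
    f         : (C*D,) first-order statistics, centered on the UBM means
    """
    cd, r = T.shape
    d = cd // n.shape[0]
    n_exp = np.repeat(n, d)                  # expand counts to every feature dimension
    tts = T.T * sigma_inv                    # T' S^-1, shape (R, C*D)
    precision = np.eye(r) + (tts * n_exp) @ T
    return np.linalg.solve(precision, tts @ f)

def phonotactic_probs(log_p0, T, w):
    """Evaluate p = softmax(log p0 + T w) for a phonotactic i-vector w."""
    z = log_p0 + T @ w
    z = z - z.max()                          # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()
```

For the phonotactic model, setting w to zero returns the base distribution p0, which mirrors the role of m0 in the acoustic model.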

  9. How to Combine?
     • Score fusion
     • I-vector stacking
     • Joint i-vector:
       $m_i = m_0 + T_m w_i$
       $p_i = \mathrm{softmax}(\log p_0 + T_w w_i)$

  10. Joint I-vector Details
      • Based on subspace GMM approach
      • Differences
        – MAP instead of ML i-vector estimate
        – Initialize i-vector with acoustic only (closed-form)
        – Diagonal Hessian in Newton update
        – Computation: similar to acoustic (was 10x)
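The Newton update with a diagonal Hessian can be sketched on the phonotactic term alone. This is a simplification, not the joint estimator from the slides: it maximizes the multinomial log-likelihood of state counts plus a standard-normal log-prior on w, and keeps only the diagonal of the Hessian. The `w_init` argument stands in for the acoustic-only closed-form initialization; all names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def map_gradient(w, counts, log_p0, T):
    """Gradient of sum_c counts[c] * log p_c(w) - 0.5 * ||w||^2."""
    p = softmax(log_p0 + T @ w)
    return T.T @ (counts - counts.sum() * p) - w

def newton_diag_map(counts, log_p0, T, w_init=None, n_iter=10):
    """MAP estimate of w by Newton steps using only the Hessian diagonal."""
    w = np.zeros(T.shape[1]) if w_init is None else np.array(w_init, dtype=float)
    total = counts.sum()
    for _ in range(n_iter):
        p = softmax(log_p0 + T @ w)
        grad = T.T @ (counts - total * p) - w
        mean_t = p @ T                        # expected row of T under p
        var_t = p @ (T ** 2) - mean_t ** 2    # per-dimension variance under p
        hess_diag = -(total * var_t) - 1.0    # always negative (concave objective)
        w = w - grad / hess_diag
    return w
```

No step-size control is included, so this sketch is only suitable for well-behaved small problems; a practical system would likely add a safeguard against overshooting.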

  11. LRE15 Task
      • Conversational speech
        – Telephone and broadcast narrowband
      • 20 languages, 6 confusable clusters
        – Metric: average Bayes cost (Cavg)
      • Limited training condition
        – Use distributed material only
        – Switchboard (English) + transcriptions
        – Variable amount per language

  12. LRE15 Systems
      • All use DNNs, i-vectors, and MMI-trained Gaussian classifier
      • Variations:
        – DNNs
          • Bottleneck
          • Clustered phone state (senones)
        – i-vectors
          • Acoustic
          • Phonotactic
          • Joint

  13. Back-end and Fusion
      • Systems are already calibrated
        – MMI training of covariance scaling and means
      • Duration modeling/scaling
        [Equation not recoverable from the extraction; symbols present: LL, S, c, t, m, with subscripts 0 and n]
      • Fusion by averaging calibrated scores
        – Learn overall scaling after average
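The fusion rule on this slide is simple enough to show directly. The sketch below averages already-calibrated score matrices from several systems and applies one overall scale factor. How that scale is learned is not shown here, so it is passed in as a given value; the names are hypothetical.

```python
import numpy as np

def fuse_scores(system_scores, scale=1.0):
    """Average calibrated scores across systems, then apply one overall scale.

    system_scores : list of (n_trials, n_langs) arrays, one per system
    scale         : overall scaling, assumed to be learned elsewhere
    """
    stacked = np.stack(system_scores)        # (n_systems, n_trials, n_langs)
    return scale * stacked.mean(axis=0)
```

A plain average has no per-system weights to train, which is what makes it workable when the individual systems are already calibrated.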

  14. LRE15 Cluster Scoring
      • Task: closed set detection per cluster
        – Use Bayes’ rule for each
        – ID posteriors sum to 1 per cluster (sum to 6 total)
        – Convert to detection LLRs
      • No cluster-specific systems or fusion
        – This is a generic 20 language LID system!
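The cluster scoring steps can be sketched as below, assuming a flat prior over the languages inside each cluster, clusters that partition the language indices, and at least two languages per cluster. The function names and the cluster lists in the test are hypothetical, not the LRE15 cluster definitions.

```python
import numpy as np

def cluster_posteriors(loglikes, clusters):
    """Bayes' rule with a flat prior, normalized separately inside each cluster."""
    loglikes = np.asarray(loglikes, dtype=float)
    post = np.empty_like(loglikes)
    for members in clusters:
        members = list(members)
        ll = loglikes[members]
        w = np.exp(ll - ll.max())            # stabilized likelihoods
        post[members] = w / w.sum()
    return post

def cluster_detection_llrs(loglikes, clusters):
    """Detection LLR: target log-likelihood minus log of the average competitor likelihood."""
    loglikes = np.asarray(loglikes, dtype=float)
    llrs = np.empty_like(loglikes)
    for members in clusters:
        members = list(members)
        ll = loglikes[members]
        k = len(members)
        for pos, idx in enumerate(members):
            others = np.delete(ll, pos)
            m = others.max()
            log_avg_others = m + np.log(np.exp(others - m).sum()) - np.log(k - 1)
            llrs[idx] = ll[pos] - log_avg_others
    return llrs
```

Only the normalization changes per cluster; the underlying log-likelihoods come from the same 20-language system, consistent with the slide's point that no cluster-specific systems are used.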

  15. Training Data and Augmentation
      • UBM/T
        – Simple: all provided data, full cuts (up to 120 sec), no augmentation
      • Classifier
        [Block diagram: LRE Training DATA → SEGMENT (3-30 seconds of speech, uniform) → AUGMENT; augmentation blocks labeled RESAMPLE, REVERB, NOISE, ENCODE]
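The SEGMENT step can be sketched as follows, assuming that "uniform" means segment durations are drawn uniformly between 3 and 30 seconds and that segments are cut back to back from each training cut. Both are assumptions about the diagram, and `segment_cut` is a hypothetical helper.

```python
import random

def segment_cut(total_sec, min_sec=3.0, max_sec=30.0, rng=None):
    """Split a cut of `total_sec` seconds into back-to-back (start, end) segments."""
    rng = rng or random.Random()
    segments = []
    start = 0.0
    # Stop once the remainder is too short to form a minimum-length segment.
    while total_sec - start >= min_sec:
        dur = min(rng.uniform(min_sec, max_sec), total_sec - start)
        segments.append((start, start + dur))
        start += dur
    return segments
```

For example, `segment_cut(120.0, rng=random.Random(0))` yields a reproducible list of segments covering a 120-second cut, with any leftover tail shorter than 3 seconds discarded.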

  16. Augmentation types
      • Sample rate perturbation
        – Distorts pitch and speaking rate
      • Additive Noise
        – Multiband modulated Gaussian noise
      • Reverberation
        – Long or short synthetic impulse response
      • Multiband compression
        – Dynamic range compression
      • Cellular speech coding
        – GSM-AMR at 4.75 or 6.7 kb/s
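Two of these augmentations are easy to sketch with NumPy. The versions below are simplified stand-ins: the resampler uses linear interpolation, and the noise is plain white Gaussian noise at a target SNR rather than the multiband modulated noise named on the slide. Function names and parameters are hypothetical.

```python
import numpy as np

def perturb_sample_rate(signal, factor):
    """Resample by `factor` using linear interpolation.

    Played back at the original rate, factor > 1 raises pitch and speeds up speech.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) / factor))
    src_pos = np.arange(n_out) * factor      # fractional read positions in the input
    return np.interp(src_pos, np.arange(len(signal)), signal)

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(len(signal))
    sig_pow = np.mean(signal ** 2)
    noise_pow = np.mean(noise ** 2)
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return signal + gain * noise
```

A usage example would be `add_noise(perturb_sample_rate(x, 1.1), 10.0, np.random.default_rng(0))`, which chains both distortions on a waveform `x`.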

  17. Submission Performance

      System                     Avg Cavg
      [0] Bottleneck, joint      19.9
      [1] Senone, acoustic       -
      [2] Senone, phonotactic    -
      [3] Senone, joint          19.7
      [0,3] Fusion               18.8
      [0,1,2] Fusion             18.5

  18. Post-eval Improvements
      • Classifier tuning
        – ML initialization to MMI instead of single-cut Bayesian (PLDA)
      • List usage
        – No Switchboard for UBM/T
      • No segmentation/augmentation
      • Smallest possible number of cuts!

  19. Post-eval Results

      System                     Submission   Classifier   Class+lists
      Acoustic baseline          -            23.8         22.2
      [0] Bottleneck, joint      19.9         19.2         18.5
      [1] Senone, acoustic       -            20.2         19.2
      [2] Senone, phonotactic    -            20.9         20.3
      [3] Senone, joint          19.7         19.2         18.4
      [0,3] Fusion               18.8         18.1         17.3
      [0,1,2] Fusion             18.5         18.0         17.3

  20. Conclusion
      • JHU HLTCOE strong performer in LRE15
      • Key components
        – DNN features and state labels
        – Acoustic, phonotactic, and joint i-vectors
        – Simple fusion: scale factor and duration
        – Data augmentation
