Tandem modeling investigations

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

Outline
1 What makes Tandem successful?
2 Can we make Tandem better?
3 Does Tandem work with LVCSR tricks?
1  What makes Tandem work?
(with Manuel Reyes)

[Figure: Tandem system block diagram - input sound → PLP/MSG feature calculation → neural net classifier → pre-nonlinearity outputs → PCA/KLT orthogonalization → Gauss-mix models → HTK decoder → words. Annotated relative WER improvements:]
  - Pre-nonlinearity over posteriors: +12%
  - KLT over direct: +8%
  - Combo over plp: +20%
  - Combo over msg: +20%
  - Combo over mfcc: +25%
  - Combo-into-HTK over combo-into-noway: +15%
  - NN over HTK: +15%
  - Tandem over HTK: +35%
  - Tandem over hybrid: +25%
  - Tandem combo over HTK mfcc baseline: +53%

• Model diversity?
  - try a phone-based GMM model
  - try training the NN model to HTK state labels
• Discriminative network training?
  - (try posteriors from GMM & Bayes)
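The two boxed gains above, taking the network outputs before the nonlinearity (+12%) and decorrelating them with a KLT (+8%), are the heart of the tandem front end. Below is a minimal numpy sketch of that path; the function and variable names are illustrative only, not from HTK or any ICSI tool.

```python
import numpy as np

def fit_klt(train_acts, n_dims):
    """Estimate the KLT/PCA decorrelating basis from training-set
    pre-nonlinearity activations of the neural net."""
    mean = train_acts.mean(axis=0)
    cov = np.cov(train_acts - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_dims]  # keep the top-variance axes
    return evecs[:, order], mean

def tandem_features(pre_nonlin, klt_basis, mean):
    """Per-frame tandem features for the Gauss-mix/HTK back end.
    Uses the linear outputs *before* the softmax (the +12% change):
    their distribution is far closer to Gaussian than the heavily
    skewed posteriors, which suits Gaussian-mixture modeling."""
    return (pre_nonlin - mean) @ klt_basis
```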
Phone vs. word models

[Figure: input sound → PLP → neural net classifier (trained on phoneme targets) → KLT orthog'n → Gauss-mix models (trained to subword states) → HTK decoder → words.]

• Try a phone-based HTK model (instead of whole-word models)
• Try training NN model to subword-state labels
  - 181 net outputs; reduce to 40 in KLT
• Results (Aurora2k, HTK-baseline WER ratio):

  System                  test A: matched   test B: var noise   test C: var chan
  Tandem PLP baseline     63.5%             70.3%               59.5%
  Phone-based HTK sys     63.6%             72.5%               61.5%
  Subword-based NN sys    63.1%             62.8%               55.1%

• Diversity doesn’t help - subword units may be good for NN
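Continuing the numpy sketch above, the subword-state system's 181 network outputs would be reduced to 40 decorrelated dimensions like this (shapes are from the slide; the activation array is a random stand-in for real training activations):

```python
import numpy as np

# stand-in for pre-nonlinearity activations over the training set
acts = np.random.randn(100_000, 181)        # 181 subword-state net outputs

basis, mean = fit_klt(acts, n_dims=40)      # defined in the sketch above
feats = tandem_features(acts, basis, mean)
assert feats.shape == (100_000, 40)         # 40-dim features into HTK
```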
2  Enhancements to Tandem-Aurora

• More tandem-feature-domain processing:

[Figure: neural net classifier → (norm / deltas?) → KLT orthog'n → (norm / deltas?) → Gauss-mix models: normalization and delta steps can be inserted before or after the KLT.]

• Results (HTK baseline WER ratio):

  System                    test A: matched   test B: var noise   test C: var chan
  PLP: Tandem baseline      63.5%             70.3%               59.5%
  PLP: norm - KLT           72.6%             71.2%               63.6%
  PLP: KLT - norm           57.8%             58.8%               51.3%
  PLP: KLT - delta          59.0%             60.2%               52.9%
  PLP: KLT - delta - norm   58.1%             59.9%               48.9%
  PLP: delta - KLT - norm   54.7%             53.6%               46.9%

  - delta-KLT-norm: 80% of Tandem-baseline WER
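A sketch of the two processing steps and the best-performing ordering (delta, then KLT, then normalization). The slide does not fix the details, so this assumes an HTK-style regression delta over ±2 frames, per-utterance mean/variance normalization, and a KLT fitted on the delta-augmented features:

```python
import numpy as np

def deltas(feats, win=2):
    """HTK-style regression deltas over +/-win frames (win=2 assumed)."""
    T = len(feats)
    pad = np.pad(feats, ((win, win), (0, 0)), mode='edge')
    num = np.zeros_like(feats)
    for t in range(1, win + 1):
        num += t * (pad[win + t:T + win + t] - pad[win - t:T + win - t])
    return num / (2 * sum(t * t for t in range(1, win + 1)))

def norm(feats):
    """Per-utterance mean/variance normalization."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def delta_klt_norm(raw, klt_basis, klt_mean):
    """Best row of the table: append deltas to the raw tandem outputs,
    decorrelate with a KLT fitted on the same augmented features,
    then normalize per utterance."""
    x = np.hstack([raw, deltas(raw)])
    return norm((x - klt_mean) @ klt_basis)
```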
Best effort Tandem system

• Deltas & norms help PLP: try on combo (PLP+MSG) system:

  System                  test A: matched   test B: var noise   test C: var chan
  PLP+MSG: baseline       51.1%             52.0%               45.6%
  PLP+MSG: dlt-KLT-nrm    50.9%             50.5%               43.6%
  PLP+MSG: KLT-nrm        48.3%             49.5%               39.4%

  - deltas hurt for MSG: features too sluggish?

• Deltas help clean, norms help noisy:

[Figure: two panels of WER (%) vs. SNR (-5 to 20 dB, plus clean) comparing baseline, KLT-delta (K-D), and KLT-norm (K-N) systems.]
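The PLP+MSG "combo" needs a rule for merging the two networks' frame outputs. The slide does not give one, so the sketch below assumes a common choice: averaging the streams' log posteriors (a per-frame geometric mean in the probability domain) before the tandem transform.

```python
import numpy as np

def combine_streams(logp_plp, logp_msg):
    """Merge two classifiers' per-frame log-posterior streams by
    averaging in the log domain, then renormalizing each frame
    so it is a proper distribution again (numerically stable)."""
    logp = 0.5 * (logp_plp + logp_msg)
    m = logp.max(axis=1, keepdims=True)
    return logp - (m + np.log(np.exp(logp - m).sum(axis=1, keepdims=True)))
```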
3  Tandem for LVCSR: the SPINE task
(with Rita Singh/CMU & Sunil Sivadas/OGI)

• Noisy spontaneous speech, ~5000 word vocab
• Recognition:

[Figure: input sound → PLP feature calculation → neural net classifier 1, and → MSG feature calculation → neural net classifier 2; the two pre-nonlinearity output streams are combined (+), then PCA decorrelation gives the tandem features for the SPHINX recognizer (GMM classifier → subword likelihoods → HMM decoder → words), with MLLR adaptation.]

  - same tandem features
  - NN training from Broadcast News boot + iterate
  - GMM-HMM has context-dependence, MLLR
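Pulling the earlier pieces together, an end-to-end sketch of the SPINE tandem front end in the diagram. It assumes, as the "+" junction suggests, that the two networks' pre-nonlinearity outputs are summed before PCA; the 56-dim figure comes from the next slide, and both are assumptions about the exact setup:

```python
import numpy as np

def spine_tandem(pre_plp, pre_msg, pca_basis, pca_mean):
    """Two-stream tandem front end feeding the SPHINX back end.
    pre_plp / pre_msg: per-frame pre-nonlinearity outputs of the
    PLP- and MSG-fed networks (same output layer / targets)."""
    combined = pre_plp + pre_msg                # the "+" block in the diagram
    return (combined - pca_mean) @ pca_basis    # e.g. 56-dim tandem features
```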
SPINE-Tandem results

• Evaluation WER results:

  Features (dimensions)     CI system   CD system   CD + MLLR
  MFCC + d + dd (39)        69.5%       35.1%       33.5%
  Tandem features (56)      47.6%       35.7%       32.8%

  - much better for CI systems
  - differences evaporate with CD, MLLR

• Not quite fair:
  - CD senones optimized for MFCC - worth 2-3% absolute?
• Not unexpected:
  - NN confounds CD variants
  - Tandem ‘space’ very nonlinear - bad for MLLR
• Any hope?
  - more training data / train CD classes / ...