Hallucinating system outputs for discriminative language modeling
Brian Roark, Center for Spoken Language Understanding, OHSU
Joint work with D. Bikel, C. Callison-Burch, Y. Cao, A. Çelebi, E. Dikici, N. Glenn, K. Hall, E. Hasler, D. Karakos, S. Khudanpur, P. Koehn, M. Lehr, A. Lopez, M. Post, E. Prud'hommeaux, D. Riley, K. Sagae, H. Sak, M. Saraçlar, I. Shafran, P. Xu
Symposium on Machine Learning in Speech and Language Processing (MLSLP), Portland
Project overview
• NSF-funded project and recent JHU summer workshop team
• General topic: discriminative language modeling for ASR and MT
  – Learning language models with discriminative objectives
• Specific topic: learning models from text only
  – Enabling use of much more training data; adaptation scenarios
• Have made some progress with ASR models (topic today)
  – Less progress on improving MT (even fully supervised)
• Talk includes a few other observations about DLM in general
Motivation
• Generative language models built from monolingual corpora are task agnostic
  – But tasks differ in the kinds of ambiguities that arise
• Supervised discriminative language modeling needs paired input:output sequences
  – Limited data vs. the vast amounts of monolingual text used in generative models
• Semi-supervised discriminative language modeling would have large benefits
  – Optimize models for task-specific objectives
  – Applicable to arbitrary amounts of monolingual text in the target language
• How would this work? Here's one method:
  – Use baseline models to discover confusable sequences for the observed target
  – Learn to discriminate between the observed sequence and its confusables
• Similar to Contrastive Estimation, but with observed output rather than input
Prior work
• Some prior research on ASR simulation for modeling
  – Work on small-vocabulary tasks
    ∗ Jyothi and Fosler-Lussier (2010) used phone confusion WFSTs for generating confusion sets for training
    ∗ Kurata et al. (2009; 2011) used phone confusions to perform "pseudo-ASR" and train discriminative language models
  – Tan et al. (2010) used machine translation approaches to simulate ASR, though without system gains
• Zhifei Li has also applied similar techniques for MT modeling (Li et al., COLING 2010; EMNLP 2011)
Discriminative language modeling
• Supervised training of language models
  – Training data (x, y), x ∈ X (inputs), y ∈ Y (outputs)
  – e.g., x input speech, y output reference transcript
• Run system on training inputs, update model (see the scoring sketch below)
  – Commonly a linear model, with n-gram features and others
  – Learn parameterizations using perceptron-like or global conditional likelihood methods
  – Use n-best or lattice output (Roark et al., 2004; 2007); or update directly on the decoding graph WFST (Kuo et al., 2007)
• Run to some stopping criterion; regularize final model
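A minimal sketch (not the workshop code) of the linear-model reranking step: each candidate is scored as a dot product between its sparse feature counts and the model weights, and the top-scoring candidate wins. The names linear_score, rerank, and the feature_fn argument are illustrative.

    def linear_score(weights, features):
        """Dot product of a sparse feature-count vector with model weights."""
        return sum(weights.get(f, 0.0) * count for f, count in features.items())

    def rerank(weights, nbest, feature_fn):
        """Return the candidate in an n-best list with the highest score."""
        return max(nbest, key=lambda hyp: linear_score(weights, feature_fn(hyp)))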
Acoustic confusions in speech recognition
• Given a text string from the NY Times:

    He has not hesitated to use his country's daunting problems as a kind of threat

• Acoustically confusable alternatives line up under each phrase:

    ... country's    problems    kind of    threat ...
        countries    problem     time       threats
        country      proms       kinds      thread
        countries'               kinda      threads
        trees                               spread
        conferees                           read
        conference                          fred
        company                             copy
Synthesizing confusions in ASR
[Figure: four-panel illustration (A)–(D) of the confusion synthesis pipeline, composing models (∘) to produce simulated output (⇓)]
Open questions
• Various ways to approach this simulation, many open questions
  – What is the best unit of confusion for simulation?
  – How might simulation output diverge from system output?
  – How to make simulation output "look like" system output?
  – What kind of data can be used to train simulation models?
• Experimented with some answers to these questions
  – Confusion models based on phones, syllables, words, phrases
  – Sampling to get n-best lists with particular characteristics
  – Training confusion models without the use of the reference
Three papers
Going to highlight results from three papers in this talk
• Sagae et al. Hallucinated n-best lists for discriminative language modeling. In Proceedings of ICASSP 2012.
  – Controlled experiments with three methods of hallucination
• Çelebi et al. Semi-supervised discriminative language modeling for Turkish ASR. In Proceedings of ICASSP 2012.
  – Experiments in Turkish with many other confusion model alternatives
  – Also sampling from simulated output to match the WER distribution
• Xu, Khudanpur and Roark. Phrasal cohort based unsupervised discriminative language modeling. In Proceedings of Interspeech 2012.
  – Unsupervised methods for deriving the confusion model
Sagae et al. (ICASSP, 2012)
• Simulating ASR errors or pseudo-ASR on an English CTS task; then training a discriminative LM (DLM) for n-best reranking
• Running controlled experiments under several conditions:
  – Three different methods of training data "hallucination"
  – Different sized training corpora
• Comparing WER reductions from real vs. hallucinated n-best lists
• Standard methods for training a linear model
  – Simple features: unigrams, bigrams and trigrams (see the sketch below)
  – Using the averaged perceptron algorithm
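For concreteness, a toy version of the n-gram feature extraction the slide describes; the function name and the sentence-boundary padding are assumptions, not details from the paper.

    from collections import Counter

    def ngram_features(words, max_order=3):
        """Count unigram, bigram and trigram features for one hypothesis,
        with sentence-boundary padding."""
        padded = ["<s>"] + list(words) + ["</s>"]
        feats = Counter()
        for n in range(1, max_order + 1):
            for i in range(len(padded) - n + 1):
                feats[" ".join(padded[i:i + n])] += 1
        return feats

    print(ngram_features("what kind of company".split()))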
Perceptron algorithm
• On-line learning approach, i.e.,
  – Consider each example in the training set in turn
  – Use the current model to produce output for the example
  – Update the model based on the example, move on to the next one
• For structured learning problems (parsing, tagging, transcription)
  – Given a set of input utterances and reference output sequences
  – Typically trying to learn parameters for features in a linear model
  – Need some kind of regularization (typically averaging)
• Learning a language model (see the training sketch below)
  – Consider each input utterance in the training set in turn
  – Use the current model to produce output for the example (transcription)
  – Update feature parameters based on the example, move on to the next one
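A compact sketch of averaged-perceptron training for n-best reranking, assuming each training example pairs an n-best list with its lowest-WER (oracle) candidate and a feature function like the n-gram extractor above; names and details are illustrative, not the workshop implementation.

    from collections import defaultdict

    def train_averaged_perceptron(examples, feature_fn, epochs=3):
        """examples: (nbest, oracle) pairs, where oracle is the lowest-WER
        candidate in the n-best list; feature_fn maps a hypothesis to a
        sparse dict of feature counts."""
        w = defaultdict(float)      # current weights
        total = defaultdict(float)  # running sum of weights for averaging
        steps = 0
        for _ in range(epochs):
            for nbest, oracle in examples:
                steps += 1
                # Use the current model to pick its preferred transcription.
                guess = max(nbest, key=lambda hyp: sum(
                    w.get(f, 0.0) * c for f, c in feature_fn(hyp).items()))
                if guess != oracle:
                    # Promote the oracle's features, demote the guess's.
                    for f, c in feature_fn(oracle).items():
                        w[f] += c
                    for f, c in feature_fn(guess).items():
                        w[f] -= c
                # Accumulate weights; averaging acts as the regularizer.
                for f, v in w.items():
                    total[f] += v
        return {f: v / steps for f, v in total.items()}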
Hallucination methods
• Three methods of hallucination being compared:
  – FST phone-based confusion model
    ∗ Phone confusion model encoded as a pair language model
  – Machine translation system: reference to ASR 1-best
    ∗ Build a parallel corpus and let Moses loose
  – Word-based phrasal cohorts model
    ∗ Direct induction of anchored phrasal alternatives
• Learn hallucination models by aligning ASR output and reference
  – Given new reference text, hallucinate a confusion set
FST phone-based confusion model
• Let S be a string; L a pronunciation lexicon; and G an n-gram language model
• Learn a phone confusion model X
• Create lattice of confusions: S ∘ L ∘ X ∘ L⁻¹ ∘ G
  – (Prune this very large composition in a couple of ways)
• Experimented with various methods to derive X (toy sketch below)
  – Best method was to train a pair language model, encoded as a transducer
    ∗ First, align the 1-best phone sequence with the reference phone sequence, e.g., escape to skater: e:ε s:s k:k a:a p:t ε:r
    ∗ Treat each symbol pair a:b as a token, train a language model
    ∗ Convert the resulting LM automaton into a transducer by splitting tokens
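A toy illustration of the pair-LM construction: aligned phone pairs become single "a:b" tokens, an n-gram model is trained over sequences of those tokens, and (not shown here) that model is re-encoded as the transducer X by splitting each token back into an input:output pair. The "<eps>" symbol stands in for ε; the alignment itself would come from an edit-distance computation.

    def pair_lm_tokens(aligned):
        """Turn aligned (hypothesis, reference) phone pairs into single
        'a:b' tokens, the units over which the pair LM is estimated."""
        return ["{}:{}".format(h, r) for h, r in aligned]

    # The escape/skater alignment from the slide, with <eps> for epsilon:
    aligned = [("e", "<eps>"), ("s", "s"), ("k", "k"),
               ("a", "a"), ("p", "t"), ("<eps>", "r")]
    print(pair_lm_tokens(aligned))
    # ['e:<eps>', 's:s', 'k:k', 'a:a', 'p:t', '<eps>:r']

Training an n-gram model over such tokens lets the confusion model condition substitutions on their neighbors, rather than treating each phone confusion independently.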
Machine translation system (using Moses)
[Figure: the MT-based hallucination pipeline, building a parallel corpus of reference transcripts and ASR 1-best output and training Moses on it]
Phrasal cohorts
• Levenshtein word-level alignment between reference and each candidate
• Find cohorts that share pivots (or anchors) on either side of variation
• Build phrase table, weighted by relative frequency (sketch below)

                hypothesis                             cohort member   weight
    reference   <s> What kind of company is it </s>
    1st-best    <s> What kind of company that </s>     company that    2/4 = 0.5
    2nd-best    <s> What kind of campaign that </s>    campaign that   1/4 = 0.25
    3rd-best    <s> What kind of company is it </s>    company is it   1/4 = 0.25
    4th-best    <s> Well kind of company that </s>
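A rough sketch of cohort extraction for a single n-best list; for simplicity it anchors variant spans on the longest shared prefix and suffix rather than a full Levenshtein alignment with explicit pivot words, so it illustrates the idea rather than reproducing the table above.

    from collections import Counter, defaultdict

    def extract_cohorts(reference, candidates):
        """Map each reference span to the variant spans (cohort members)
        observed at the same anchored position, with relative-frequency
        weights."""
        cohorts = defaultdict(Counter)
        for cand in candidates:
            # Longest shared prefix (left anchor).
            i = 0
            while i < min(len(reference), len(cand)) and reference[i] == cand[i]:
                i += 1
            # Longest shared suffix (right anchor), not overlapping the prefix.
            j = 0
            while (j < min(len(reference), len(cand)) - i
                   and reference[len(reference) - 1 - j] == cand[len(cand) - 1 - j]):
                j += 1
            ref_span = tuple(reference[i:len(reference) - j])
            cand_span = tuple(cand[i:len(cand) - j])
            cohorts[ref_span][cand_span] += 1
        n = len(candidates)
        # Normalize counts into relative-frequency weights.
        return {src: {tgt: c / n for tgt, c in members.items()}
                for src, members in cohorts.items()}

    ref = "what kind of company is it".split()
    hyp = "what kind of company that".split()
    print(extract_cohorts(ref, [hyp]))
    # {('is', 'it'): {('that',): 1.0}}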
Experimental setup
• Task and baseline ASR specs:

    ASR software     IBM Attila
    Training data    2000 hours (25M words) of English conversational telephone
                     speech (CTS): 11,000 conversations from Fisher, 3,500 from
                     Switchboard
    Dev & test data  NIST RT04 Fall dev and eval sets: 38K words, 37K words
                     respectively, ~2 hours
    Acoustic models  41 phones; 3-state left-to-right HMM topology for phones;
                     4K clustered quinphone states; 150K Gaussians; linear
                     discriminant transform; semi-tied covariance transform
    Features         13-coefficient perceptual linear prediction (PLP) vectors
                     with speaker-specific VTLN
    Baseline LM      4-gram language model over a 50K-word vocabulary, estimated
                     by interpolating the transcripts and similar data extracted
                     from the web

• Complicated double cross-validation method: 20 folds of 100 hours each
  – Needed to train the confusion models as well as the discriminative LM
• 100-best list output from the ASR system or from simulation
• Varied amount of DLM training data; compared with supervised DLM
Development set results
[Bar chart: WER (%) on the dev set, axis ~21–23, for the ASR 1-best baseline, real n-best DLM, and the Phone, MT and Cohorts hallucination methods, each trained on 2, 4 and 8 folds of data]
Evaluation set results
[Bar chart: WER (%) on the eval set, axis ~24–26, for the ASR 1-best baseline, real n-best DLM, and the Phone, MT and Cohorts hallucination methods, each trained on 8 folds of data]
Discussion
• All three methods yield models that improve on the baseline
  – About half the gain of wholly supervised DLM methods
• Phrasal cohorts slightly better than the others
  – Not significantly different
• Some take-away impressions/speculations
  – Hybrid/joint methods (e.g., phone and phrase) worth looking at
  – None of the approaches (incl. supervised) makes as much use of the extra data as we would hope