Hallucinating system outputs for discriminative language modeling
Brian Roark, Center for Spoken Language Understanding, OHSU
Joint work with D. Bikel, C. Callison-Burch, Y. Cao, A. Çelebi, E. Dikici, N. Glenn, K. Hall, E. Hasler, D. Karakos, S. Khudanpur, P. Koehn, M. Lehr, A. Lopez, M. Post, E. Prud'hommeaux, D. Riley, K. Sagae, H. Sak, M. Saraçlar, I. Shafran, P. Xu
Symposium on Machine Learning in Speech and Language Processing (MLSLP), Portland
Project overview
• NSF-funded project and recent JHU summer workshop team
• General topic: discriminative language modeling for ASR and MT
  – Learning language models with discriminative objectives
• Specific topic: learning models from text only
  – Enabling use of much more training data; adaptation scenarios
• Have made some progress with ASR models (topic today)
  – Less progress on improving MT (even fully supervised)
• Talk includes a few other observations about DLM in general
Motivation
• Generative language models built from monolingual corpora are task agnostic
  – But tasks differ in the kinds of ambiguities that arise
• Supervised discriminative language modeling needs paired input:output sequences
  – Limited data vs. the vast amounts of monolingual text used in generative models
• Semi-supervised discriminative language modeling would have large benefits
  – Optimize models for task-specific objectives
  – Applicable to arbitrary amounts of monolingual text in the target language
• How would this work? Here's one method:
  – Use baseline models to discover confusable sequences for the observed target
  – Learn to discriminate between the observed sequence and its confusables
• Similar to Contrastive Estimation, but with observed output rather than input
Prior work
• Some prior research on ASR simulation for modeling
  – Work on small-vocabulary tasks
    ∗ Jyothi and Fosler-Lussier (2010) used phone confusion WFSTs for generating confusion sets for training
    ∗ Kurata et al. (2009; 2011) used phone confusions to perform "pseudo-ASR" and train discriminative language models
  – Tan et al. (2010) used machine translation approaches to simulate ASR, though without system gains
• Zhifei Li has also applied similar techniques for MT modeling (Li et al., COLING 2010; EMNLP 2011)
Discriminative language modeling
• Supervised training of language models
  – Training data (x, y), x ∈ X (inputs), y ∈ Y (outputs)
  – e.g., x input speech, y output reference transcript
• Run system on training inputs, update model (see the scoring sketch below)
  – Commonly a linear model, with n-gram features and others
  – Learn parameterizations using perceptron-like or global conditional likelihood methods
  – Use n-best or lattice output (Roark et al., 2004; 2007); or update directly on the decoding graph WFST (Kuo et al., 2007)
• Run to some stopping criterion; regularize final model
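A minimal sketch (not the workshop code) of the linear-model reranking step: each candidate is scored as a dot product between its sparse feature counts and the model weights, and the top-scoring candidate wins. The names linear_score, rerank, and the feature_fn argument are illustrative.

    def linear_score(weights, features):
        """Dot product of a sparse feature-count vector with model weights."""
        return sum(weights.get(f, 0.0) * count for f, count in features.items())

    def rerank(weights, nbest, feature_fn):
        """Return the candidate in an n-best list with the highest score."""
        return max(nbest, key=lambda hyp: linear_score(weights, feature_fn(hyp)))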
Acoustic confusions in speech recognition
• Given a text string from the NY Times:

    He has not hesitated to use his country's daunting problems as a kind of threat

• Acoustically confusable alternatives line up under each phrase:

    ... country's    problems    kind of    threat ...
        countries    problem     time       threats
        country      proms       kinds      thread
        countries'               kinda      threads
        trees                               spread
        conferees                           read
        conference                          fred
        company                             copy
Synthesizing confusions in ASR
[Figure: four-panel illustration (A)–(D) of the confusion synthesis pipeline, composing models (∘) to produce simulated output (⇓)]
Open questions
• Various ways to approach this simulation, many open questions
  – What is the best unit of confusion for simulation?
  – How might simulation output diverge from system output?
  – How to make simulation output "look like" system output?
  – What kind of data can be used to train simulation models?
• Experimented with some answers to these questions
  – Confusion models based on phones, syllables, words, phrases
  – Sampling to get n-best lists with particular characteristics
  – Training confusion models without the use of the reference
Three papers
Going to highlight results from three papers in this talk
• Sagae et al. Hallucinated n-best lists for discriminative language modeling. In Proceedings of ICASSP 2012.
  – Controlled experiments with three methods of hallucination
• Çelebi et al. Semi-supervised discriminative language modeling for Turkish ASR. In Proceedings of ICASSP 2012.
  – Experiments in Turkish with many other confusion model alternatives
  – Also sampling from simulated output to match the WER distribution
• Xu, Khudanpur and Roark. Phrasal cohort based unsupervised discriminative language modeling. In Proceedings of Interspeech 2012.
  – Unsupervised methods for deriving the confusion model
Sagae et al. (ICASSP, 2012)
• Simulating ASR errors or pseudo-ASR on an English CTS task; then training a discriminative LM (DLM) for n-best reranking
• Running controlled experiments under several conditions:
  – Three different methods of training data "hallucination"
  – Different sized training corpora
• Comparing WER reductions from real vs. hallucinated n-best lists
• Standard methods for training a linear model
  – Simple features: unigrams, bigrams and trigrams (see the sketch below)
  – Using the averaged perceptron algorithm
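For concreteness, a toy version of the n-gram feature extraction the slide describes; the function name and the sentence-boundary padding are assumptions, not details from the paper.

    from collections import Counter

    def ngram_features(words, max_order=3):
        """Count unigram, bigram and trigram features for one hypothesis,
        with sentence-boundary padding."""
        padded = ["<s>"] + list(words) + ["</s>"]
        feats = Counter()
        for n in range(1, max_order + 1):
            for i in range(len(padded) - n + 1):
                feats[" ".join(padded[i:i + n])] += 1
        return feats

    print(ngram_features("what kind of company".split()))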
Perceptron algorithm
• On-line learning approach, i.e.,
  – Consider each example in the training set in turn
  – Use the current model to produce output for the example
  – Update the model based on the example, move on to the next one
• For structured learning problems (parsing, tagging, transcription)
  – Given a set of input utterances and reference output sequences
  – Typically trying to learn parameters for features in a linear model
  – Need some kind of regularization (typically averaging)
• Learning a language model (see the training sketch below)
  – Consider each input utterance in the training set in turn
  – Use the current model to produce output for the example (transcription)
  – Update feature parameters based on the example, move on to the next one
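A compact sketch of averaged-perceptron training for n-best reranking, assuming each training example pairs an n-best list with its lowest-WER (oracle) candidate and a feature function like the n-gram extractor above; names and details are illustrative, not the workshop implementation.

    from collections import defaultdict

    def train_averaged_perceptron(examples, feature_fn, epochs=3):
        """examples: (nbest, oracle) pairs, where oracle is the lowest-WER
        candidate in the n-best list; feature_fn maps a hypothesis to a
        sparse dict of feature counts."""
        w = defaultdict(float)      # current weights
        total = defaultdict(float)  # running sum of weights for averaging
        steps = 0
        for _ in range(epochs):
            for nbest, oracle in examples:
                steps += 1
                # Use the current model to pick its preferred transcription.
                guess = max(nbest, key=lambda hyp: sum(
                    w.get(f, 0.0) * c for f, c in feature_fn(hyp).items()))
                if guess != oracle:
                    # Promote the oracle's features, demote the guess's.
                    for f, c in feature_fn(oracle).items():
                        w[f] += c
                    for f, c in feature_fn(guess).items():
                        w[f] -= c
                # Accumulate weights; averaging acts as the regularizer.
                for f, v in w.items():
                    total[f] += v
        return {f: v / steps for f, v in total.items()}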
Hallucination methods
• Three methods of hallucination being compared:
  – FST phone-based confusion model
    ∗ Phone confusion model encoded as a pair language model
  – Machine translation system: reference to ASR 1-best
    ∗ Build a parallel corpus and let Moses loose
  – Word-based phrasal cohorts model
    ∗ Direct induction of anchored phrasal alternatives
• Learn hallucination models by aligning ASR output and reference
  – Given new reference text, hallucinate a confusion set
FST phone-based confusion model
• Let S be a string; L a pronunciation lexicon; and G an n-gram language model
• Learn a phone confusion model X
• Create lattice of confusions: S ∘ L ∘ X ∘ L⁻¹ ∘ G
  – (Prune this very large composition in a couple of ways)
• Experimented with various methods to derive X (toy sketch below)
  – Best method was to train a pair language model, encoded as a transducer
    ∗ First, align the 1-best phone sequence with the reference phone sequence, e.g., escape to skater: e:ε s:s k:k a:a p:t ε:r
    ∗ Treat each symbol pair a:b as a token, train a language model
    ∗ Convert the resulting LM automaton into a transducer by splitting tokens
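A toy illustration of the pair-LM construction: aligned phone pairs become single "a:b" tokens, an n-gram model is trained over sequences of those tokens, and (not shown here) that model is re-encoded as the transducer X by splitting each token back into an input:output pair. The "<eps>" symbol stands in for ε; the alignment itself would come from an edit-distance computation.

    def pair_lm_tokens(aligned):
        """Turn aligned (hypothesis, reference) phone pairs into single
        'a:b' tokens, the units over which the pair LM is estimated."""
        return ["{}:{}".format(h, r) for h, r in aligned]

    # The escape/skater alignment from the slide, with <eps> for epsilon:
    aligned = [("e", "<eps>"), ("s", "s"), ("k", "k"),
               ("a", "a"), ("p", "t"), ("<eps>", "r")]
    print(pair_lm_tokens(aligned))
    # ['e:<eps>', 's:s', 'k:k', 'a:a', 'p:t', '<eps>:r']

Training an n-gram model over such tokens lets the confusion model condition substitutions on their neighbors, rather than treating each phone confusion independently.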
Machine translation system (using Moses)
[Figure: the MT-based hallucination pipeline, building a parallel corpus of reference transcripts and ASR 1-best output and training Moses on it]
Phrasal cohorts
• Levenshtein word-level alignment between reference and each candidate
• Find cohorts that share pivots (or anchors) on either side of variation
• Build phrase table, weighted by relative frequency (sketch below)

                hypothesis                             cohort member   weight
    reference   <s> What kind of company is it </s>
    1st-best    <s> What kind of company that </s>     company that    2/4 = 0.5
    2nd-best    <s> What kind of campaign that </s>    campaign that   1/4 = 0.25
    3rd-best    <s> What kind of company is it </s>    company is it   1/4 = 0.25
    4th-best    <s> Well kind of company that </s>
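A rough sketch of cohort extraction for a single n-best list; for simplicity it anchors variant spans on the longest shared prefix and suffix rather than a full Levenshtein alignment with explicit pivot words, so it illustrates the idea rather than reproducing the table above.

    from collections import Counter, defaultdict

    def extract_cohorts(reference, candidates):
        """Map each reference span to the variant spans (cohort members)
        observed at the same anchored position, with relative-frequency
        weights."""
        cohorts = defaultdict(Counter)
        for cand in candidates:
            # Longest shared prefix (left anchor).
            i = 0
            while i < min(len(reference), len(cand)) and reference[i] == cand[i]:
                i += 1
            # Longest shared suffix (right anchor), not overlapping the prefix.
            j = 0
            while (j < min(len(reference), len(cand)) - i
                   and reference[len(reference) - 1 - j] == cand[len(cand) - 1 - j]):
                j += 1
            ref_span = tuple(reference[i:len(reference) - j])
            cand_span = tuple(cand[i:len(cand) - j])
            cohorts[ref_span][cand_span] += 1
        n = len(candidates)
        # Normalize counts into relative-frequency weights.
        return {src: {tgt: c / n for tgt, c in members.items()}
                for src, members in cohorts.items()}

    ref = "what kind of company is it".split()
    hyp = "what kind of company that".split()
    print(extract_cohorts(ref, [hyp]))
    # {('is', 'it'): {('that',): 1.0}}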
Experimental setup
• Task and baseline ASR specs:

    ASR software     IBM Attila
    Training data    2000 hours (25M words) of English conversational telephone
                     speech (CTS): 11,000 conversations from Fisher, 3,500 from
                     Switchboard
    Dev & test data  NIST RT04 Fall dev and eval sets: 38K words, 37K words
                     respectively, ~2 hours
    Acoustic models  41 phones; 3-state left-to-right HMM topology for phones;
                     4K clustered quinphone states; 150K Gaussians; linear
                     discriminant transform; semi-tied covariance transform
    Features         13-coefficient perceptual linear prediction (PLP) vectors
                     with speaker-specific VTLN
    Baseline LM      4-gram language model over a 50K-word vocabulary, estimated
                     by interpolating the transcripts and similar data extracted
                     from the web

• Complicated double cross-validation method: 20 folds of 100 hours each
  – Needed to train the confusion models as well as the discriminative LM
• 100-best list output from the ASR system or from simulation
• Varied amount of DLM training data; compared with supervised DLM
Development set results
[Bar chart: WER (%) on the dev set, axis ~21–23, for the ASR 1-best baseline, real n-best DLM, and the Phone, MT and Cohorts hallucination methods, each trained on 2, 4 and 8 folds of data]
Evaluation set results
[Bar chart: WER (%) on the eval set, axis ~24–26, for the ASR 1-best baseline, real n-best DLM, and the Phone, MT and Cohorts hallucination methods, each trained on 8 folds of data]
Discussion
• All three methods yield models that improve on the baseline
  – About half the gain of wholly supervised DLM methods
• Phrasal cohorts slightly better than the others
  – Not significantly different
• Some take-away impressions/speculations
  – Hybrid/joint methods (e.g., phone and phrase) worth looking at
  – None of the approaches (incl. supervised) makes as much use of the extra data as we would hope