Language Modeling for Codeswitching
Hila Gonen
PhD student at Yoav Goldberg's lab, Bar-Ilan University
Outline
• Background
  ◦ Codeswitching
  ◦ Language Modeling and Perplexity
• New Evaluation Method
  ◦ Definition
  ◦ Creation of the dataset
• Incorporation of Monolingual Data
• Discriminative Training
• Conclusion
Codeswitching
"the alternation of two languages within a single discourse, sentence or constituent" (Poplack, 1980)

English–Spanish:
  that es su tío that has lived with him like I don't know how like ya several years...
  (that is his uncle who has lived with him like, I don't know how, like several years already...)

French–Arabic:
  mais les filles ta3na ysedkou n'import quoi ana hada face book jamais cheftou khlah kalbi
  (Our girls believe anything, I have never seen this Facebook before.)
Codeswitching and its challenges
• Very popular, mainly among bilingual communities
• Extremely limited data
• Non-standard platforms (spoken data, social media)
• An important challenge for automatic speech recognition (ASR) systems
ASR with monolingual models
• Output of IBM models:
[Example ASR outputs shown on the slide are omitted here.]
Language Modeling
• The task of assigning a probability to a given sentence
• Useful for translation and for automatic speech recognition:
  ◦ The system produces several candidates and the LM scores them
• Given a word sequence, the LM estimates, for each word in the vocabulary, the probability that it comes next
• Standard training:
  ◦ Lots of unlabeled text is used
  ◦ Training examples: all sentence prefixes, along with the following word (see the sketch below)
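Not from the talk, but to make the training setup concrete: a minimal Python sketch of how (prefix, next-word) training examples can be extracted from raw text. The helper name is invented for illustration.

```python
# Minimal sketch: turning raw sentences into LM training examples.
# Each sentence prefix is paired with the word that follows it.

def make_training_examples(sentences):
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            examples.append((tokens[:i], tokens[i]))
    return examples

print(make_training_examples(["I love chocolate"]))
# [(['I'], 'love'), (['I', 'love'], 'chocolate')]
```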
Language Modeling
[Illustration: a chart of the next-word distribution after the prefix "I love", with candidates such as chocolate, cheesecakes, strawberries, winter, …, me and probabilities around 0.1, 0.08, 0.06, 0.04.]
Automatic Speech Recognition (ASR)
• Language models are traditionally used in the decoding process
• The ASR system produces candidates for a given acoustic signal
• The LM is used to rank the candidates:
  ◦ It needs to differentiate between "good" and "bad" sentences
• ASR systems are hard to set up and tune, and are not standardized
Previous Work – LM for CS
• Artificial CS data (Vu et al. 2012, Pratapa et al. 2018)
• Syntactic constraints (Li and Fung 2012, 2014)
• Factored LM (Adel et al. 2013, 2014, 2015)
• Most previous work depends on an ASR system (this conflates LM performance with other aspects of the ASR system, and makes the evaluation procedure hard to replicate and results hard to compare fairly)
• No previous work compares against the others
Previous Work – LM for CS
We want to evaluate the language model independently of an ASR system
Perplexity (Standard LM Evaluation)
Given a language model $M$ and a test sequence of words $w_1, \dots, w_N$, the perplexity of $M$ over the sequence is defined as:

$$\mathrm{PPL}_M(w_1, \dots, w_N) = P_M(w_1, \dots, w_N)^{-\frac{1}{N}}$$

where $P_M(w_1, \dots, w_N)$ is the probability the model assigns to the sequence. The lower the perplexity, the better the LM.
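As a concrete illustration of this definition, a minimal sketch that computes perplexity from the per-word conditional probabilities $p(w_i \mid w_1, \dots, w_{i-1})$; the probabilities below are toy numbers, not real model output.

```python
import math

def perplexity(word_probs):
    # word_probs[i] = p(w_i | w_1 .. w_{i-1}) for each word in the sequence;
    # their product is P(w_1 .. w_N), raised here to the power -1/N.
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

print(perplexity([0.2, 0.1, 0.4]))  # ~5.0; lower is better
```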
Shortcomings of Perplexity
• Not always well aligned with the quality of a language model (Tran et al. 2018)
• Better perplexities often do not translate to better word-error-rate (WER) scores (Huang et al. 2018)
• Does not penalize assigning high probability to highly implausible sentences
• Strong dependence on the vocabulary (e.g. word-based vs. char-based)
Shortcomings of Perplexity – Example
• We train a simple model on some data
• We then measure the effect of the vocabulary:
  ◦ We add words to the vocabulary
  ◦ We train a model in the same manner, on the same data
  ◦ The additional words never appear in training
• This results in a 2.37-point degradation in perplexity
• The addition of words alone, with no change in the training procedure, significantly changes perplexity – why? (A toy illustration follows.)
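One way such a gap can arise, shown as a toy sketch with invented logits (not the experiment from the talk): with a softmax output layer, every added vocabulary item receives some probability mass, so the probability of the gold word drops — and perplexity rises — even though training is unchanged.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits over the original vocabulary; the gold word is index 0.
logits = [2.0, 0.5, 0.1]
p_small = softmax(logits)[0]

# Add 1000 untrained words (logit 0.0 each): the softmax now spreads
# probability mass over them too, so the gold word's probability drops.
p_large = softmax(logits + [0.0] * 1000)[0]

print(p_small, p_large)  # ~0.73 vs. ~0.007
```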
Shortcomings of Perplexity – Example
We do not want to evaluate the language model with perplexity
New Evaluation Method
• We seek a method that meets the following requirements:
  1. Prefers LMs that prioritize correct sentences
  2. Does not depend on the vocabulary of the LM
  3. Is independent of an ASR system
New Evaluation Method
• We suggest a method that simulates the task of an LM in ASR
• Sets of sentences:
  ◦ A single gold sentence in each set
  ◦ ~30 similar-sounding alternatives in each set
• The LM should identify the gold sentence in each set
• We use accuracy as our metric (see the sketch below)
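A sketch of the protocol, assuming a scoring function `lm_score(sentence)` that returns the model's (log-)probability for a full sentence — a hypothetical interface, not a specific library call:

```python
def evaluate(lm_score, sets):
    # sets: list of (gold_sentence, [alternative_sentences]) pairs.
    # The model gets a set right if it scores the gold sentence highest.
    correct = sum(
        1
        for gold, alternatives in sets
        if lm_score(gold) > max(lm_score(s) for s in alternatives)
    )
    return correct / len(sets)
```

Because only the ranking within each set matters, models with different vocabularies (e.g. word-based vs. char-based) can be compared directly.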
New Evaluation Method
• This method meets all of our requirements:
  1. Prefers LMs that prioritize correct sentences
  2. Does not depend on the vocabulary of the LM
  3. Is independent of an ASR system
Codeswitching Corpus
Gold data:
◦ Bangor Miami Corpus – transcripts of conversations by Spanish speakers in Florida, all of whom are bilingual in English
◦ 45,621 sentences, split into train/dev/test
◦ All three types of sentences: English, Spanish and CS
Examples:
◦ So I asked what was happening
◦ Quieres un vaso de agua? (Do you want a glass of water?)
◦ Que by the way se vino ilegal (Who, by the way, came illegally)
Our Created Dataset
How do we obtain similar-sounding sentences to build the sets? We create them!
For each gold sentence, we create alternative sentences of all types:
◦ English sentences
◦ Spanish sentences
◦ CS sentences
We do that using finite state transducers (FSTs) – to be explained
Examples from the Dataset
[Table of example evaluation sets shown on the slide is omitted here.]
Dataset Statistics
[Table of dataset statistics shown on the slide is omitted here.]
Finite State Transducers – FSTs
• Similar to FSAs (finite state automata), but with an additional output component (transitions have both input and output labels)
• Capable of transforming one string into another
• An FST can convert a string $x$ into a string $y$ if there is a path from an initial state to a final state with $x$ as its input labels and $y$ as its output labels
• Composition – FSTs can be composed
• Weighted FSTs – transitions can be labelled with weights
Finite State Transducers – FSTs
Formally, an FST is a 6-tuple $(Q, \Sigma, \Gamma, I, F, \delta)$ such that:
◦ $Q$ – the set of states (finite)
◦ $\Sigma$ – input alphabet (finite)
◦ $\Gamma$ – output alphabet (finite)
◦ $I$ – initial states (subset of $Q$)
◦ $F$ – final states (subset of $Q$)
◦ $\delta$ – transition function, $\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Gamma \cup \{\epsilon\}) \times Q$
FSTs – Toy Example
A transducer with the arcs sad:happy and others:others maps:
◦ The girl is sad → The girl is happy
◦ This is a sad story → This is a happy story
(See the sketch below.)
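The same toy transducer can be sketched in a few lines of Python, with a dictionary standing in for the single state's arcs (a real FST toolkit such as Carmel or OpenFst represents this as states and labelled transitions):

```python
# Toy equivalent of the slide's transducer: the arc sad:happy rewrites
# "sad" to "happy"; the arc others:others maps every other word to itself.

def toy_fst(sentence):
    arcs = {"sad": "happy"}
    return " ".join(arcs.get(word, word) for word in sentence.split())

print(toy_fst("The girl is sad"))      # The girl is happy
print(toy_fst("This is a sad story"))  # This is a happy story
```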
Dataset Creation
We implement the creation of the dataset with Carmel, an FST toolkit, using three transducers:
1. An FST for converting a sentence into a sequence of phonemes
2. An FST that allows minor changes in the phoneme sequence
3. An FST for decoding a sequence of phonemes into a sentence (the inverse of 1)
1. Sentence to Phonemes
We use pronunciation dictionaries for both languages:
◦ book__en → B UH K
◦ cat__en → K AE T
◦ libro__sp → L IY B R OW
◦ gato__sp → G AA T OW
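A minimal sketch of this step, using a toy pronunciation dictionary (the entries mirror the slide; real dictionaries are far larger):

```python
# Step 1 as a lookup: each word, tagged with its language, maps to a
# phoneme sequence; a sentence maps to the concatenation of lookups.
PRON = {
    "book__en":  ["B", "UH", "K"],
    "cat__en":   ["K", "AE", "T"],
    "libro__sp": ["L", "IY", "B", "R", "OW"],
    "gato__sp":  ["G", "AA", "T", "OW"],
}

def to_phonemes(tagged_sentence):
    return [ph for word in tagged_sentence.split() for ph in PRON[word]]

print(to_phonemes("gato__sp"))  # ['G', 'AA', 'T', 'OW']
```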
2. Change Phoneme Sequence
We allow minor changes in the phoneme sequence to increase flexibility.
[Illustration from the slide omitted; see the sketch below.]
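The exact edit operations and weights in the actual FST are not shown here; as a sketch of the idea, the following generates all phoneme sequences within one substitution, deletion, or insertion of the input:

```python
def one_edit_neighbors(phonemes, inventory):
    # All sequences one substitution, deletion, or insertion away.
    out = []
    for i in range(len(phonemes)):
        out.append(phonemes[:i] + phonemes[i + 1:])            # deletion
        for p in inventory:
            out.append(phonemes[:i] + [p] + phonemes[i + 1:])  # substitution
    for i in range(len(phonemes) + 1):
        for p in inventory:
            out.append(phonemes[:i] + [p] + phonemes[i:])      # insertion
    return out

print(len(one_edit_neighbors(["G", "AA", "T", "OW"], ["K", "EY"])))
```

In the weighted-FST setting, each such edit would carry a cost, so sequences requiring fewer changes are preferred.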
3. Phonemes to Sentence
We use the same pronunciation dictionaries, inverted:
◦ S M EH L IY K EY T UW → smell y que to
◦ S M EH L IY K EY T UW → smelly que to
To favor frequent words over infrequent ones, we add unigram probabilities to the edges of the transducer (see the sketch below).
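A sketch of this decoding step, with a toy inverted dictionary and invented unigram probabilities: it segments a phoneme sequence into words and keeps the segmentation with the highest total unigram log-probability, so frequent words win.

```python
import math

# Toy inverse dictionary (phoneme tuple -> candidate words) and
# unigram probabilities; all entries are invented for illustration.
INV = {("K", "EY"): ["que__sp"], ("T", "UW"): ["to__en", "two__en"]}
UNIGRAM = {"que__sp": 0.01, "to__en": 0.02, "two__en": 0.001}

def best_segmentation(phones):
    # best[i] = (score, words) for the best decoding of phones[:i]
    best = {0: (0.0, [])}
    for i in range(1, len(phones) + 1):
        for j in range(i):
            if j not in best:
                continue
            for word in INV.get(tuple(phones[j:i]), []):
                score = best[j][0] + math.log(UNIGRAM[word])
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best.get(len(phones), (None, None))[1]

print(best_segmentation(["K", "EY", "T", "UW"]))  # ['que__sp', 'to__en']
```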
Dataset Creation – Example
Gold sentence: smelly gato (smelly:EN, gato:SP)
1. Sentence to phonemes (FST): S M EH L IY G AA T OW
2. Changing phonemes (FST): S M EH L IY G AA T OW → S M EH L IY K EY T UW (G AA → K EY, OW → UW)
3. Phonemes to sentence (FST): S M EH L IY K EY T UW → smell y que to (smell:EN, y:SP, que:SP, to:EN)
Alternative sentence: smelly que to
Dataset Creation – cont.
Implementation details:
• We can create monolingual and CS sentences regardless of the source sentence
• We only convert a sampled part of the gold sentence when creating a code-switched alternative
• For CS alternatives, we encourage sentences to include both languages and to differ from each other, using some heuristics (e.g. more words from the less dominant language)
• We randomly choose 250/750 sets in which the gold sentence is code-switched/monolingual
So far…
A new evaluation method that enables comparison of a wide range of models:
◦ Directly penalizes preferring "bad" sentences
◦ Does not depend on the vocabulary
◦ Is independent of an ASR system
A dataset created using FSTs
Applicable to any language or language pair