Focusing Language Models For Automatic Speech Recognition
Daniele Falavigna, Roberto Gretter
FBK, Italy
The work leading to these results has received funding from the European Union under grant agreement n° 287658
www.eu-bridge.eu
Outline
• Problem definition
• Auxiliary data selection
  • TFxIDF
  • Proposed method
  • Perplexity based method
• Computational issues
  • TFxIDF vs proposed method
• Experiments
• Discussion
Problem definition
• Given a general purpose text corpus and a speech to transcribe
• Build a LM focused on the particular (unknown) topic of the speech
• No need to be instantaneous, but it should be quick
• Approach:
  • Perform a first ASR pass
  • Use the recognition output to select text data “similar” to the context
  • Build a focused language model
  • Use the focused language model in the next ASR pass
Recognition setup
[Block diagram: the speech is decoded in a first ASR step with the baseline LM, producing a 1-best hypothesis and a word graph; the 1-best drives the automatic selection of an auxiliary corpus from the off-line text corpus; an auxiliary LM is trained on it and used together with the baseline LM in the second step to rescore the word graph, yielding the final 1-best.]
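A minimal sketch of this two-pass flow; every helper function below (first_pass, select_auxiliary, train_lm, interpolate, rescore) is a hypothetical placeholder standing in for an ASR component, not an actual FBK tool:

```python
# High-level sketch of the two-pass recognition setup shown in the
# diagram above. All helpers called here are hypothetical placeholders.
def transcribe(speech, text_corpus, baseline_lm, k_words):
    # First ASR step with the baseline LM: 1-best hypothesis + word graph.
    one_best, word_graph = first_pass(speech, baseline_lm)
    # Select corpus rows "similar" to the 1-best (off-line corpus,
    # on-line selection) until the auxiliary corpus reaches K words.
    auxiliary_corpus = select_auxiliary(one_best, text_corpus, k_words)
    # Train the auxiliary LM and interpolate it with the baseline LM.
    focused_lm = interpolate(baseline_lm, train_lm(auxiliary_corpus))
    # Second step: rescore the word graph with the focused LM.
    return rescore(word_graph, focused_lm)
```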
Terminology
• text corpus: composed of N rows (N documents); average length of a document: Lc
• dictionary: composed of D terms t_d, 1 ≤ d ≤ D
• auxiliary corpus: composed of rows of the text corpus; size: K words
• speech to recognize: TED talks, average length: Lt
Auxiliary data selection
• rationale:
  • score each row of the text corpus against the ASR output
  • sort the rows according to their score
  • select the top rows to form the auxiliary corpus (of size K words)
• 3 approaches implemented and compared:
  • TFxIDF
  • Proposed method
  • Perplexity based method
• plus a reference LM trained on domain specific data (TED LM)
Auxiliary data selection: TFxIDF
• for each talk i and for each word t_d compute:

  c^i[t_d] = (1 + log(tf^i_d)) · log(D / df_d),  1 ≤ d ≤ D

  where tf^i_d = frequency of term t_d inside talk i, and df_d = number of documents in the corpus containing t_d
• compute the same vector for each row R^n in the corpus, 1 ≤ n ≤ N
• estimate a (cosine) similarity score:

  s(C^i, R^n) = (C^i · R^n) / (|C^i| |R^n|)
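A minimal sketch of this scoring, assuming tokenized rows and a precomputed document-frequency table df; the function names and data layout are illustrative, not the authors' code:

```python
# TFxIDF selection sketch: score each corpus row against the ASR
# 1-best of a talk via cosine similarity of TF-IDF vectors.
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Map a token list to {term: (1 + log tf) * log(N / df)} weights."""
    tf = Counter(tokens)
    return {t: (1.0 + math.log(c)) * math.log(n_docs / df[t])
            for t, c in tf.items() if t in df}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_rows_tfidf(asr_tokens, corpus_rows, df, k_words):
    """Rank rows by similarity to the ASR output; keep the top-ranked
    ones until the auxiliary corpus reaches K words."""
    n_docs = len(corpus_rows)
    talk_vec = tfidf_vector(asr_tokens, df, n_docs)
    ranked = sorted(corpus_rows, reverse=True,
                    key=lambda row: cosine(talk_vec,
                                           tfidf_vector(row, df, n_docs)))
    auxiliary, size = [], 0
    for row in ranked:
        auxiliary.append(row)
        size += len(row)
        if size >= k_words:
            break
    return auxiliary
```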
Auxiliary data selection: Proposed method
• sort the words of the dictionary according to their frequency
• discard the most frequent words (rank < D1 = 100): they don't carry semantic information
• discard the rarest words (rank > D2 = 200K): too rare to help, and they include typos
• replace each word in the corpus by its index in the dictionary
• sort the indices in each row to allow a quick comparison
• estimate a similarity score:

  s'(C'^i, R'^n) = common(C'^i, R'^n) / (dim(C'^i) + dim(R'^n))
Auxiliary data selection: Proposed method
• example:
  • input sentence: “I would like your advice about rule one hundred forty three concerning inadmissibility”
  • word indices: 47 54 108 264 2837 63 1019 6 12 65 24 4890 166476
  • after discarding too frequent and too rare words: 108 264 2837 1019 4890 166476 (like your advice rule concerning inadmissibility)
  • sorted indices: 108 264 1019 2837 4890 166476
Auxiliary data selection: Proposed method
• similarity score computation: scan the two sorted index lists in parallel, always advancing the pointer with the lower index, and count the common indices
• example:
  108 264 1019 2837 4890 166476
  155 264 2222 2345 2837 166476
  common indices: 264, 2837, 166476 → score = 3 / (6 + 6) = 3/12
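A minimal sketch of this merge-like comparison, an assumed implementation consistent with the slides rather than the authors' code:

```python
# Proposed similarity score: both rows are sorted lists of dictionary
# indices; a merge-like scan counts common indices, always advancing
# the pointer with the lower value.
def similarity(c_sorted, r_sorted):
    """common(C', R') / (dim(C') + dim(R')) for two sorted index lists."""
    i = j = common = 0
    while i < len(c_sorted) and j < len(r_sorted):
        if c_sorted[i] == r_sorted[j]:
            common += 1
            i += 1
            j += 1
        elif c_sorted[i] < r_sorted[j]:
            i += 1          # advance the pointer with the lower index
        else:
            j += 1
    return common / (len(c_sorted) + len(r_sorted))

# Example from the slide: three common indices, score 3/12 = 0.25.
print(similarity([108, 264, 1019, 2837, 4890, 166476],
                 [155, 264, 2222, 2345, 2837, 166476]))
```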
Auxiliary data selection: Perplexity based method
• train a 3-gram LM on the ASR output
• estimate the perplexity of each row in the corpus with this LM
• use the perplexity as a similarity score (the lower the perplexity, the more similar the row)
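A minimal sketch of this scoring; the add-one smoothing is an assumption made to keep the example self-contained, and a real system would train the 3-gram LM with a proper toolkit:

```python
# Perplexity-based selection sketch: a 3-gram LM with simple add-one
# smoothing is trained on the ASR 1-best, then each corpus row is
# scored by its perplexity under that LM (lower = more similar).
import math
from collections import Counter

def train_trigram(tokens):
    """Count trigrams and their bigram histories in the ASR output."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return tri, bi

def perplexity(row, tri, bi, vocab_size):
    """exp(-1/n * sum(log p)) over the trigrams of a corpus row."""
    log_sum, n = 0.0, 0
    for w1, w2, w3 in zip(row, row[1:], row[2:]):
        # add-one smoothed trigram probability
        p = (tri[(w1, w2, w3)] + 1.0) / (bi[(w1, w2)] + vocab_size)
        log_sum += math.log(p)
        n += 1
    return math.exp(-log_sum / n) if n else float("inf")
```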
Auxiliary data selection: Run-time computational complexity
• corpus size: N (5.7M) rows, average row length L (272)
• dictionary size: D (1.6M), D2 = 200K

                        TFxIDF          Proposed method
Arithmetic operations   O(2 × N × L)    O(N × L / 2)
Memory requirements     O(D + N × L)    ---
Process size            650 MB          10 MB
Time                    114 min         16 min
Training data
• text corpus: Google News
  • 5.7M documents, 1.6G words
  • 272 words per document on average
• LM for rescoring:
  • 4-gram backoff LM, modified shift-beta smoothing
  • 1.6M unigrams, 73M bigrams, 120M 3-grams and 195M 4-grams
• FSN for first & second step:
  • 200K words, 37M bigrams, 34M 3-grams, 38M 4-grams
• auxiliary corpus: the most similar documents, K words in total
Test data
• TED talks (test sets of IWSLT 2011)
• auxiliary corpus and auxiliary LM computed for each talk

                    dev-set (19 talks)   test-set (8 talks)
#words              44505                12431
(min, max, mean)    (591, 4509, 2342)    (484, 2855, 1553)

• performance is reported as a function of K, the number of words used to train the auxiliary LMs
Results
• Perplexity as a function of K (K expressed in Kwords; K = 0 means no interpolation)
• [Plots: perplexity vs. K on the dev set and on the test set, one curve per selection method: PP (perplexity based), NEW (proposed), TFIDF]
• Perplexity when interpolating the baseline LM with a domain specific LM (trained on ted2011 text, 2 Mwords): dev set: 158, test set: 142
Results
• WER as a function of K (K expressed in Kwords; K = 0 means no interpolation)
• [Plots: WER vs. K on the dev set and on the test set, one curve per selection method: PP (perplexity based), NEW (proposed), TFIDF]
• WER when interpolating the baseline LM with a domain specific LM (trained on ted2011 text, 2 Mwords): dev set: 18.7, test set: 18.4
Conclusion
• A method for focusing LMs without using in-domain data
• Comparison between the proposed method and TFxIDF:
  • similar performance
  • much lighter computational requirements
• Results comparable to those obtained using in-domain data, at least in this setting
• Future work:
  • how to add new words (to reduce the OOV rate?)
  • instantaneous LM focusing
Thank you for your attention
LM interpolation
• LM probability associated to every arc of the word graph:

  P[w|h] = Σ_{j=1}^{J} λ_j P_j[w|h]

• J = number of LMs to combine
• λ_j = weights estimated to minimize the overall perplexity on a development set

The interpolation weights λ_base and λ^i_aux, associated to the two LMs used in the second ASR decoding step (LM_base and LM^i_aux), are estimated so as to minimize the overall LM perplexity on the 1-best output of the first pass (the same output used to build the i-th query document).
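A minimal sketch of one standard way to estimate such weights, EM over per-word probabilities on held-out text; this is the common technique for the task, not necessarily the exact procedure used here:

```python
# EM estimation of LM interpolation weights: given, for each held-out
# word, its probability under each of the J LMs, find the weights
# lambda_j that minimize the perplexity of the mixture.
def estimate_weights(prob_rows, iterations=50):
    """prob_rows: list of [p_1(w|h), ..., p_J(w|h)], one row per word."""
    J = len(prob_rows[0])
    lam = [1.0 / J] * J                     # uniform initialization
    for _ in range(iterations):
        expected = [0.0] * J
        for probs in prob_rows:
            mix = sum(l * p for l, p in zip(lam, probs))
            for j in range(J):
                # posterior responsibility of LM j for this word
                expected[j] += lam[j] * probs[j] / mix
        lam = [e / len(prob_rows) for e in expected]
    return lam
```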