Acoustic Modeling for Speech Recognition
Berlin Chen, 2003

References:
1. X. Huang et al., Spoken Language Processing, Chapter 8
2. The HTK Book (for HTK Version 3.2)
Introduction
• For the given acoustic observation X = x_1 x_2 ... x_n, the goal of speech recognition is to find the corresponding word sequence W = w_1 w_2 ... w_m that has the maximum posterior probability P(W | X):

    \hat{W} = \arg\max_{W} P(W \mid X)
            = \arg\max_{W} \frac{P(W)\, P(X \mid W)}{P(X)}
            = \arg\max_{W} P(W)\, P(X \mid W)

    where W = w_1, w_2, ..., w_i, ..., w_m and each w_i \in V = \{v_1, v_2, ..., v_N\}

• P(X | W): Acoustic Modeling, to be discussed here; covers possible variations in speaker, pronunciation, environment, context, etc.
• P(W): Language Modeling, to be discussed later on; covers domain, topic, style, etc.
Review: HMM Modeling
• Acoustic modeling using HMMs: the cepstral feature vectors extracted from overlapping speech frames are modeled
  (Figure: overlapping speech frames in the time domain are converted into cepstral feature vectors in the frequency domain)
• Three types of HMM state output probabilities are used
Review: HMM Modeling
• Discrete HMM (DHMM): b_j(v_k) = P(o_t = v_k | s_t = j)
  – The observations are quantized into a number of symbols
  – The symbols are normally generated by a vector quantizer
  (Figure: a left-to-right HMM)
  – With multiple codebooks (m: codebook index):

    b_j(v_k) = \sum_{m=1}^{M} c_{jm}\, p_m(o_t = v_k \mid s_t = j), \qquad \sum_{m=1}^{M} c_{jm} = 1
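A minimal computational sketch of the multiple-codebook output probability above; the array layout and the names (c, p, vq) are illustrative assumptions, not taken from the slides:

    /* Sketch: discrete HMM output probability with M codebooks.
       c[m]    : codebook weight c_{jm} for state j (sums to 1 over m)
       p[m][k] : p_m(o_t = v_k | s_t = j), discrete distribution of codebook m
       vq[m]   : VQ symbol index produced by codebook m for the current frame */
    double dhmm_output_prob(int M, const double *c,
                            const double *const *p, const int *vq)
    {
        double b = 0.0;
        for (int m = 0; m < M; m++)
            b += c[m] * p[m][vq[m]];   /* weighted sum over codebooks */
        return b;
    }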
Review: HMM Modeling
• Continuous HMM (CHMM)
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

    b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\, b_{jm}(\mathbf{o}_t)
                      = \sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}), \qquad \sum_{m=1}^{M} c_{jm} = 1

    N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})
      = \frac{1}{(2\pi)^{L/2}\, |\boldsymbol{\Sigma}_{jm}|^{1/2}}
        \exp\!\left( -\frac{1}{2} (\mathbf{o}_t - \boldsymbol{\mu}_{jm})^{T} \boldsymbol{\Sigma}_{jm}^{-1} (\mathbf{o}_t - \boldsymbol{\mu}_{jm}) \right)

    where L is the dimensionality of the feature vector o_t
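A minimal sketch of evaluating the CHMM state output probability, assuming diagonal covariance matrices for simplicity; the function and variable names are illustrative assumptions:

    #include <math.h>

    /* Sketch: CHMM state output probability b_j(o_t) with M diagonal-covariance
       Gaussian mixtures.
       o[d]     : observation (cepstral) vector of dimension L
       c[m]     : mixture weight c_{jm}
       mu[m][d] : mean of mixture m, dimension d
       var[m][d]: diagonal covariance entry of mixture m, dimension d */
    double chmm_output_prob(int M, int L, const double *o, const double *c,
                            const double *const *mu, const double *const *var)
    {
        const double log2pi = log(2.0 * 3.14159265358979323846);
        double b = 0.0;
        for (int m = 0; m < M; m++) {
            double logdet = 0.0, dist = 0.0;
            for (int d = 0; d < L; d++) {
                double diff = o[d] - mu[m][d];
                logdet += log(var[m][d]);
                dist   += diff * diff / var[m][d];
            }
            /* N(o; mu, Sigma) with diagonal Sigma */
            b += c[m] * exp(-0.5 * (dist + logdet + L * log2pi));
        }
        return b;
    }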
Review: HMM Modeling
• Semicontinuous or tied-mixture HMM (SCHMM)
  – The HMM state mixture density functions are tied together across all the models to form a set of shared kernels (shared Gaussians):

    b_j(\mathbf{o}) = \sum_{k=1}^{L} b_j(k)\, f(\mathbf{o} \mid v_k)
                    = \sum_{k=1}^{L} b_j(k)\, N(\mathbf{o}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})

  – With multiple codebooks:

    b_j(\mathbf{o}) = \sum_{m=1}^{M} c_m \sum_{k=1}^{L} b_{jm}(k)\, f_m(\mathbf{o} \mid v_{m,k})
                    = \sum_{m=1}^{M} c_m \sum_{k=1}^{L} b_{jm}(k)\, N(\mathbf{o}; \boldsymbol{\mu}_{m,k}, \boldsymbol{\Sigma}_{m,k})
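Because the L Gaussian kernels are shared by all states, they can be evaluated once per frame and then mixed with state-specific weights; a minimal sketch with a single codebook (the names are illustrative assumptions):

    /* Sketch: semicontinuous HMM output probability with one shared codebook.
       kernel[k] : f(o_t | v_k) = N(o_t; mu_k, Sigma_k), precomputed once per
                   frame for all L shared Gaussians (e.g., with a Gaussian
                   evaluation routine like the CHMM sketch above)
       w[k]      : state-specific mixture weight b_j(k) */
    double schmm_output_prob(int L, const double *w, const double *kernel)
    {
        double b = 0.0;
        for (int k = 0; k < L; k++)
            b += w[k] * kernel[k];   /* tied kernels, state-specific weights */
        return b;
    }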
Review: HMM Modeling
• Comparison of recognition performance
  (Figure: comparison of recognition performance; details omitted)
Measures of Speech Recognition Performance
• Evaluating the performance of speech recognition systems is critical, and the word recognition error rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?
Measures of Speech Recognition Performance
• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example:
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
  – "the" is deleted, "not" is inserted, and "effect", "is", "clear" are matched
  – Error analysis: one deletion and one insertion
• Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate      = (Sub. + Del. + Ins.) / (No. of words in the correct sentence) x 100% = 2/4 = 50%   (might be higher than 100%)
    Word Correction Rate = (Matched words) / (No. of words in the correct sentence) x 100% = 3/4 = 75%
    Word Accuracy Rate   = (Matched words - Ins.) / (No. of words in the correct sentence) x 100% = (3 - 1)/4 = 50%   (might be negative; WER + WAR = 100%)
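The three measures follow directly from the aligned error counts; a small sketch (the struct and function names are illustrative assumptions):

    /* Sketch: WER, WCR and WAR (in percent) from alignment counts.
       nref is the number of words in the correct/reference sentence. */
    typedef struct { double wer, wcr, war; } Rates;

    Rates score(int sub, int del, int ins, int hit, int nref)
    {
        Rates r;
        r.wer = 100.0 * (sub + del + ins) / nref;  /* may exceed 100% */
        r.wcr = 100.0 * hit / nref;
        r.war = 100.0 * (hit - ins) / nref;        /* may be negative */
        return r;
    }
    /* Example from the slide: sub=0, del=1, ins=1, hit=3, nref=4
       -> WER = 50%, WCR = 75%, WAR = 50% (so WER + WAR = 100%) */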
Measures of Speech Recognition Performance
• A Dynamic Programming Algorithm (Textbook)
  – n denotes the word length of the recognized/test sentence; m denotes the word length of the correct/reference sentence
  (Figure: an (n+1) x (m+1) grid in which one axis indexes the recognized/test words and the other the correct/reference words; each cell grid[i][j] stores the minimum word error/hit alignment reaching it, together with the kind of alignment/hit)
Measures of Speech Recognition Performance
• Algorithm (by Berlin Chen)
  – LT[i]: the i-th recognized/test word; LR[j]: the j-th correct/reference word

  Step 1: Initialization:
    G[0][0] = 0;
    for i = 1, ..., n {                 // test
        G[i][0] = G[i-1][0] + 1;
        B[i][0] = 1;                    // Insertion (horizontal direction)
    }
    for j = 1, ..., m {                 // reference
        G[0][j] = G[0][j-1] + 1;
        B[0][j] = 2;                    // Deletion (vertical direction)
    }

  Step 2: Iteration:
    for i = 1, ..., n {                 // test
        for j = 1, ..., m {             // reference
            G[i][j] = min of:
                G[i-1][j]   + 1         (Insertion)
                G[i][j-1]   + 1         (Deletion)
                G[i-1][j-1] + 1         (if LR[j] != LT[i], Substitution)
                G[i-1][j-1]             (if LR[j] == LT[i], Match)
            B[i][j] = 1  // Insertion    (horizontal direction)
                      2  // Deletion     (vertical direction)
                      3  // Substitution (diagonal direction)
                      4  // Match        (diagonal direction)
                      according to which case attains the minimum
        }                               // for j, reference
    }                                   // for i, test

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here

  Step 3: Measure and Backtrace:
    Word Error Rate    = G[n][m] / m x 100%
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path: B[n][m] -> ... -> B[0][0]
        if B[i][j] == 1       print "LT[i]";        // Insertion, then go left
        else if B[i][j] == 2  print "LR[j]";        // Deletion, then go down
        else                  print "LR[j]/LT[i]";  // Hit/Match or Substitution, then go down diagonally
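A compact C rendering of Steps 1 and 2 above with unit penalties; the function name and the fixed array bound are illustrative assumptions, and the backtrace (Step 3) is omitted:

    #include <string.h>

    #define MAXW 256   /* assumed maximum sentence length (illustrative) */

    /* Sketch: minimum word-error alignment with unit penalties.
       lTest[1..n]: recognized/test words, lRef[1..m]: correct/reference words
       (index 0 is unused, to match the 1-based indexing of the slides).
       Returns G[n][m], the minimum number of word errors; WER = 100.0 * G[n][m] / m. */
    int word_edit_distance(char *lTest[], int n, char *lRef[], int m)
    {
        static int G[MAXW][MAXW];

        /* Step 1: initialization along the two axes */
        G[0][0] = 0;
        for (int i = 1; i <= n; i++) G[i][0] = G[i-1][0] + 1;   /* insertions */
        for (int j = 1; j <= m; j++) G[0][j] = G[0][j-1] + 1;   /* deletions  */

        /* Step 2: iteration */
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int ins  = G[i-1][j] + 1;
                int del  = G[i][j-1] + 1;
                int diag = G[i-1][j-1] + (strcmp(lRef[j], lTest[i]) != 0);
                int best = diag;                 /* prefer match/substitution */
                if (ins < best) best = ins;
                if (del < best) best = del;
                G[i][j] = best;
            }
        return G[n][m];
    }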
Measures of Speech Recognition Performance
• A Dynamic Programming Algorithm (HTK)
  – Initialization
  (Figure: the (n+1) x (m+1) grid; cells along the recognized/test word axis accumulate insertions (1 Ins., 2 Ins., 3 Ins., ...) and cells along the correct/reference word axis accumulate deletions (1 Del., 2 Del., 3 Del., ...))

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;

    for (j = 1; j <= m; j++) {       /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

    for (i = 1; i <= n; i++) {       /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
Measures of Speech Recognition Performance
• Program (iteration)

    for (i = 1; i <= n; i++) {                 /* test */
        gridi = grid[i];  gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {             /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {            /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            }
            else if (h < v) {                  /* HOR = ins */
                gridi[j] = gridi1[j];
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            }
            else {                             /* VERT = del */
                gridi[j] = gridi[j-1];
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        } /* for j */
    } /* for i */

• Example 1   (cell entries: (Ins, Del, Sub, Hit))
  (Figure: the filled DP grid for Correct: "A C B C C" vs. Test: "B A B C")

  Alignment 1: one insertion, two deletions, three hits, so WER = (1 + 2 + 0)/5 x 100% = 60%
    Correct:  *    A    C    B    C    C
    Test:     B    A    *    B    C    *
              Ins  Hit  Del  Hit  Hit  Del
  – There is still another optimal alignment!
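The slides fill the grid but do not list the HTK-style backtrace explicitly; a minimal sketch of how the stored dir fields could be followed to recover one optimal alignment (the Cell layout, direction codes, and function name are assumptions mirroring the code above):

    #include <stdio.h>

    /* Assumed cell layout and direction codes, not taken from the HTK sources */
    enum { NIL, DIAG, HOR, VERT };
    typedef struct { int score, ins, del, sub, hit, dir; } Cell;

    /* Sketch: follow the stored directions from grid[n][m] back to grid[0][0]
       and print the alignment (in reverse, from the last word pair). */
    void print_alignment(Cell **grid, char *lTest[], char *lRef[], int n, int m)
    {
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && grid[i][j].dir == DIAG) {   /* hit or substitution */
                printf("%s/%s ", lRef[j], lTest[i]);
                i--; j--;
            } else if (i > 0 && grid[i][j].dir == HOR) {      /* insertion of a test word */
                printf("INS(%s) ", lTest[i]);
                i--;
            } else {                                          /* VERT: deletion of a reference word */
                printf("DEL(%s) ", lRef[j]);
                j--;
            }
        }
        printf("\n");
    }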
Measures of Speech Recognition Performance
• Example 2   (cell entries: (Ins, Del, Sub, Hit))
  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here
  (Figure: the filled DP grid for Correct: "A C B C C" vs. Test: "B A A C")

  Alignment 1: WER = (1 + 2 + 1)/5 x 100% = 80%
    Correct:  *    A    C    B    C    C
    Test:     B    A    *    A    C    *
              Ins  Hit  Del  Sub  Hit  Del

  Alignment 2: WER = 80%
    Correct:  *    A    C    B    C    C
    Test:     B    A    A    *    C    *
              Ins  Hit  Sub  Del  Hit  Del

  Alignment 3: WER = 80%
    Correct:  *    A    C    B    C    C
    Test:     B    A    *    A    *    C
              Ins  Hit  Del  Sub  Del  Hit