Acoustic Modeling
Hsin-min Wang

References:
1. X. Huang et al., Spoken Language Processing, Chapter 9
2. The HTK Book
Definition of the Speech Recognition Problem

■ Given the acoustic observation X = X1 X2 ... Xn, the goal of speech recognition is to find the corresponding word sequence W = w1 w2 ... wm that has the maximum posterior probability P(W|X):

    W^ = argmax_W P(W|X)
       = argmax_W P(W) P(X|W) / P(X)
       = argmax_W P(W) P(X|W)

  where W = w1 w2 ... wi ... wm and wi ∈ V = {v1, v2, ..., vN}; P(X) can be dropped because it does not depend on W. P(W) is the language modeling term and P(X|W) is the acoustic modeling term.
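To make the decision rule concrete, here is a minimal Python sketch of the argmax over a fixed list of candidate word sequences. The candidates and their log-probability scores are invented for illustration; a real decoder searches an enormous hypothesis space rather than enumerating candidates.

# Hypothetical log-scores for three candidate word sequences:
# lm_logprob ~ log P(W), am_logprob ~ log P(X|W). Values are made up.
lm_logprob = {
    ("write", "to", "ms", "wright"): -8.1,
    ("right", "to", "ms", "wright"): -9.5,
    ("wright", "to", "ms", "wright"): -11.0,
}
am_logprob = {
    ("write", "to", "ms", "wright"): -120.3,
    ("right", "to", "ms", "wright"): -120.1,
    ("wright", "to", "ms", "wright"): -120.2,
}

# W^ = argmax_W  log P(W) + log P(X|W)   (P(X) is constant over W)
w_hat = max(lm_logprob, key=lambda w: lm_logprob[w] + am_logprob[w])
print(" ".join(w_hat))  # -> "write to ms wright"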
Major Challenges

■ The practical challenge is how to build accurate acoustic models, P(X|W), and language models, P(W), that truly reflect the spoken language to be recognized
– Large-vocabulary speech recognition involves a very large number of words, so we need to decompose words into subword sequences; P(X|W) is thus closely related to phonetic modeling
– P(X|W) should take into account speaker variations, pronunciation variations, environment variations, and context-dependent phonetic coarticulation variations
– No static acoustic or language model can meet the needs of real applications, so it is vital to dynamically adapt both P(X|W) and P(W) to maximize P(W|X)
■ The decoding process of finding the best word sequence W to match the input speech signal X is more than a simple pattern recognition problem, since there is an infinite number of word patterns to search
Variability in the Speech Signal

■ Context Variability
– Context variability at the word/sentence level
  • E.g., "Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda"
  • Same pronunciation but different meaning: Wright, write, right
  • Phonetically identical, so only the semantic context can distinguish them: Ford or vs. four door
– Context variability at the phonetic level
  • The acoustic realization of the phoneme /ee/ in the words peat and wheel depends on its left and right context (consider also fast or spontaneous speech)
Variability in the Speech Signal (cont.)

■ Style Variability
– Isolated speech recognition
  • Users have to pause between words
  • Eliminates confusions such as Ford or vs. four door
  • A significant reduction in computational complexity
  • Unnatural to most people
  • The throughput is significantly lower than that of continuous speech
– Continuous speech recognition
  • The error rate for casual, spontaneous, and conversational speech is higher than for carefully articulated read speech
  • The higher the speaking rate, the higher the error rate
  • Emotional changes cause even more significant variations
Variability in the Speech Signal (cont.)

■ Speaker Variability
– Inter-speaker differences
  • vocal tract size, length and width of the neck, and a range of physical characteristics: gender, age, dialect, health, education, and personal style
– The same speaker is often unable to produce the same utterance precisely twice
  • The shape of the vocal tract, the articulation movements, and the rate of delivery may vary from utterance to utterance
– Speaker-independent (SI) speech recognition
  • Large performance fluctuations among different speakers
  • Speakers with accents have higher error rates
– Speaker-dependent (SD) speech recognition
  • With SD data and training, the system can capture the SD acoustic characteristics and thus improve recognition accuracy
  • A typical SD speech recognition system can reduce the word recognition error by more than 30% compared with a comparable SI system
Variability in the Speech Signal (cont.)

■ Environment Variability
– The world we live in is full of sounds of varying loudness from different sources
  • We have to deal with various background sounds (noises)
– In mobile environments, the spectrum of the noise varies significantly as the speaker moves around
– Noise may also come from the input device itself, such as microphone and A/D interference noise
– We can reduce the error rates by using multi-style training or adaptive techniques
– Environment variability remains one of the most severe challenges facing today's state-of-the-art speech systems
Evaluation of Automatic Speech Recognition

■ Performance evaluation of speech recognition systems is critical, and the word error rate (WER) is one of the most important measures
■ There are typically three types of word recognition errors
– Substitution
– Deletion
– Insertion
  E.g., Correct: "the effect is clear"; Recognized: "effect is not clear" → one deletion (the) and one insertion (not)

    Word Error Rate = 100% × (Subs + Dels + Ins) / No. of words in the correct sentence

  For the example above, WER = 100% × (0 + 1 + 1) / 4 = 50%
■ The WER is calculated by aligning the correct word string against the recognized word string
– A maximum substring matching problem
– Handled by the dynamic programming algorithm
Algorithm to Measure the WER

// Cor_i denotes the word length of the correct sentence
// Rec_j denotes the word length of the recognized sentence

// Two common settings of error penalties
subPen = 10;  /* HTK error penalties */
delPen = 7;
insPen = 7;
subPenNIST = 4;  /* NIST error penalties */
delPenNIST = 3;
insPenNIST = 3;

Presentation topic: Write a tool to calculate the speech recognition accuracy of the 2nd project, and give a presentation introducing your algorithm and source code.
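As a possible starting point for the presentation topic above, here is a sketch in Python of the dynamic programming alignment: it fills a cost table with the substitution/deletion/insertion penalties listed above and then backtracks to count each error type. The function and variable names are this sketch's own choices, not taken from HTK or NIST.

def align_wer(ref, hyp, sub_pen=10, del_pen=7, ins_pen=7):
    """Align the correct (ref) and recognized (hyp) word strings by
    dynamic programming and return (subs, dels, ins). The defaults are
    the HTK penalties above; pass 4/3/3 for the NIST setting."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum penalty to align ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * del_pen
    for j in range(1, m + 1):
        cost[0][j] = j * ins_pen
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else sub_pen)
            cost[i][j] = min(diag, cost[i - 1][j] + del_pen, cost[i][j - 1] + ins_pen)
    # Backtrack through the table to count each error type
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else sub_pen)):
            subs += 0 if ref[i - 1] == hyp[j - 1] else 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + del_pen:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return subs, dels, ins

ref = "the effect is clear".split()
hyp = "effect is not clear".split()
s, d, i = align_wer(ref, hyp)
print("WER = %.1f%%" % (100.0 * (s + d + i) / len(ref)))  # WER = 50.0%

On this example both penalty settings yield the same alignment: one deletion (the) and one insertion (not), giving 2/4 = 50%.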
Signal Processing – Extracting Features

■ Signal Acquisition
– microphone + PC sound card (sampling rate)
■ End-Point Detection
– We can use either push-to-talk (push and hold while talking) or continuous listening to activate speech signal acquisition; the latter needs a speech end-point detector
■ MFCC and Its Dynamic Features
– Time-domain features vs. frequency-domain features
– Capture temporal changes by using delta coefficients (see the sketch below)
■ Feature Transformation
– We can transform the feature vectors to improve class separability
– We can use a number of dimensionality reduction techniques to map the feature vectors into more effective representations, e.g., principal component analysis (PCA), linear discriminant analysis (LDA), etc.

Presentation topic: LDA for speech recognition
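As a sketch of the delta computation referenced above: the standard regression formula (used, e.g., in HTK) is d_t = sum_{k=1..K} k (c_{t+k} − c_{t−k}) / (2 sum_{k=1..K} k²). Below is a minimal NumPy version; the function name and the edge-replication boundary handling are this sketch's own choices, not from any particular toolkit.

import numpy as np

def delta(features, window=2):
    """First-order dynamic (delta) coefficients via the regression
    formula d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2).
    `features` is a (num_frames, num_coeffs) array, e.g. MFCCs.
    Edge frames are replicated, one common boundary choice."""
    T = features.shape[0]
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(features, dtype=float)
    for k in range(1, window + 1):
        # rows [window+k, window+k+T) hold c_{t+k}; [window-k, window-k+T) hold c_{t-k}
        deltas += k * (padded[window + k:window + k + T] - padded[window - k:window - k + T])
    return deltas / denom

# Usage: append deltas to 13 static MFCCs to get 26-dimensional features
mfcc = np.random.randn(100, 13)        # stand-in for real MFCC frames
obs = np.hstack([mfcc, delta(mfcc)])   # shape (100, 26)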
Phonetic Modeling – Selecting Appropriate Units

■ For general-purpose large-vocabulary speech recognition, it is difficult to build whole-word models because
– Every new task contains novel words with no available training data, such as proper nouns and newly invented words
– There are simply too many words, and different words may have different acoustic realizations; it is unlikely that we can observe sufficient repetitions of all words in all contexts
■ Issues in choosing appropriate modeling units
– Accurate: accurately represent the acoustic realizations that appear in different contexts
– Trainable: have enough data to estimate the parameters of the unit (the HMM model parameters)
– Generalizable: any new word can be derived from a predefined unit inventory for task-independent speech recognition
Comparison of Different Units

■ Word vs. Subword
– Word: carries semantic meaning and captures within-word coarticulation; accurate if enough data are available, but trainable only for small tasks and not generalizable
  • For small-vocabulary speech recognition, e.g., digit recognition, whole-word models are both accurate and trainable, and there is no need for them to be generalizable
– Phone: more trainable and generalizable, but less accurate
  • There are only about 50 phones in English and about 30 in Mandarin Chinese
  • The realization of a phoneme is strongly affected by its immediately neighboring phonemes
– Syllable: a compromise between the word and phonetic models
  • Syllables are larger than phones
  • There are only about 1,300 tone-dependent syllables in Mandarin Chinese and about 50 in Japanese, which makes the syllable a suitable unit for these languages
  • The large number of syllables (over 30,000) in English presents a challenge in terms of trainability
Context Dependency

■ Phone and Phoneme
– In speech science, the term phoneme denotes any of the minimal units of speech sound in a language that can serve to distinguish one word from another
– The term phone denotes a phoneme's acoustic realization
– E.g., the English phoneme /t/ has two very different acoustic realizations in the words sat and meter; we had better treat them as two different phones when building a spoken language system
■ Why Context Dependency?
– If we make the units context dependent, we can significantly improve recognition accuracy, provided there are enough training data
– A context usually refers to the immediate left and/or right neighboring phones
Context Dependency (cont.)

■ Triphone (intra-word triphone)
– A triphone model is a phonetic model that takes into consideration both the left and the right neighboring phones (see the sketch below)
– Two phones having the same identity but different left or right contexts are considered different triphones
– Triphone models capture the most important coarticulatory effects
– Trainability is a challenging issue: we need to balance trainability and accuracy with a number of parameter-sharing techniques
■ Modeling inter-word context-dependent phones is complicated
– The juncture effect on word boundaries is one of the most serious coarticulation phenomena in continuous speech
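To illustrate the notation, here is a small Python sketch that expands a phone sequence into intra-word triphones in the common HTK-style left-center+right form; the function name and the fallback to left/right biphones at word boundaries are choices of this sketch, not a standard.

def to_triphones(phones):
    """Expand a phone sequence into intra-word triphones written in
    HTK-style 'left-center+right' notation. Boundary phones lack one
    context and fall back to biphones here; real systems also handle
    cross-word context, which is more involved."""
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        units.append(left + p + right)
    return units

# The same vowel gets different models in different contexts,
# e.g. "peat" /p iy t/ vs. "wheel" /w iy l/ (ARPAbet-style symbols)
print(to_triphones(["p", "iy", "t"]))  # ['p+iy', 'p-iy+t', 'iy-t']
print(to_triphones(["w", "iy", "l"]))  # ['w+iy', 'w-iy+l', 'iy-l']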