8 - Speech Recognition
- Speech Recognition Concepts
- Speech Recognition Approaches
- Recognition Theories
- Bayes Rule
- Simple Language Model
- P(A|W)
- Network Types
7 - Speech Recognition (Cont'd)
- HMM Calculating Approaches
- Neural Components
- Three Basic HMM Problems
- Viterbi Algorithm
- State Duration Modeling
- Training in HMM
Recognition Tasks
- Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)
- Speaker-dependent, multiple-speaker, and speaker-independent
- Vocabulary size:
  - Small: < 20 words
  - Medium: 100 - 1000 words
  - Large: 1000 - 10000 words
  - Very large: > 10000 words
Speech Recognition Concepts
Speech recognition is the inverse of speech synthesis:
- Speech synthesis: Text -> NLP processing -> phone sequence -> Speech
- Speech recognition: Speech -> sequence processing -> NLP understanding -> Text
Speech Recognition Approaches
- Bottom-Up Approach
- Top-Down Approach
- Blackboard Approach
Bottom-Up Approach
Processing stages (with their knowledge sources):
Signal Processing (voiced/unvoiced/silence) -> Feature Extraction -> Segmentation (sound classification rules) -> Lexical Access (phonotactic rules) -> Language Model -> Recognized Utterance
Top-Down Approach
Knowledge sources: inventory of speech recognition units, word dictionary, task grammar, semantics.
Syntactic, lexical, and unit hypotheses are generated top-down, checked against the feature analysis of the input by an utterance verifier/matcher, and the best match becomes the recognized utterance.
Blackboard Approach
Independent knowledge processes communicate through a shared blackboard:
- Acoustic processes
- Lexical processes
- Environmental processes
- Semantic processes
- Syntactic processes
Recognition Theories
- Articulatory-based recognition: uses articulatory system modeling for recognition; this theory has been the most successful so far
- Auditory-based recognition: uses the auditory system for recognition
- Hybrid recognition: a combination of the above theories
- Motor theory: models the intended gesture of the speaker
Recognition Problem
We have a sequence of acoustic symbols and want to find the words uttered by the speaker.
Solution: find the most probable word sequence given the acoustic symbols.
Recognition Problem
A: acoustic symbols; W: word sequence. We should find Ŵ so that

P(Ŵ | A) = max_W P(W | A)
Bayes Rule

P(x | y) P(y) = P(x, y) = P(y | x) P(x)
P(x | y) = P(y | x) P(x) / P(y)
P(W | A) = P(A | W) P(W) / P(A)
Bayes Rule (Cont'd)

P(Ŵ | A) = max_W P(W | A) = max_W [ P(A | W) P(W) / P(A) ]
Ŵ = argmax_W P(W | A) = argmax_W P(A | W) P(W)

since P(A) does not depend on W.
Simple Language Model

P(w_1 w_2 ... w_n) = Π_{i=1..n} P(w_i | w_1 ... w_{i-1})
     = P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) P(w_4 | w_3, w_2, w_1) ... P(w_n | w_{n-1}, ..., w_1)

Computing this probability is very difficult and needs a very big database, so Trigram and Bigram models are used instead.
Simple Language Model (Cont'd)

Trigram:  P(w_1 ... w_n) ≈ Π_{i=1..n} P(w_i | w_{i-1}, w_{i-2})
Bigram:   P(w_1 ... w_n) ≈ Π_{i=1..n} P(w_i | w_{i-1})
Monogram: P(w_1 ... w_n) ≈ Π_{i=1..n} P(w_i)
Simple Language Model (Cont'd)
Computing method:

P(w_3 | w_2, w_1) = (number of occurrences of w_1 w_2 w_3) / (total number of occurrences of w_1 w_2)

Ad hoc method (interpolating lower-order frequencies):

P(w_3 | w_2, w_1) ≈ p_1 f(w_3 | w_2, w_1) + p_2 f(w_3 | w_2) + p_3 f(w_3)
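The counting and interpolation formulas above can be sketched as follows (a minimal sketch with a three-sentence toy corpus and arbitrary interpolation weights p_1, p_2, p_3; real systems tune the weights on held-out data):

```python
from collections import Counter

def train_ngrams(corpus):
    """Count unigrams, bigrams, and trigrams from a list of token lists."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
        tri.update(zip(sent, sent[1:], sent[2:]))
    return uni, bi, tri

def interp_prob(w1, w2, w3, uni, bi, tri, weights=(0.6, 0.3, 0.1)):
    """P(w3 | w1, w2) as an interpolation of relative frequencies."""
    total = sum(uni.values())
    f3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    f2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    f1 = uni[w3] / total if total else 0.0
    p1, p2, p3 = weights
    return p1 * f3 + p2 * f2 + p3 * f1

corpus = [["i", "like", "speech"], ["i", "like", "music"], ["we", "like", "speech"]]
uni, bi, tri = train_ngrams(corpus)
p = interp_prob("i", "like", "speech", uni, bi, tri)
```

The interpolation gives unseen trigrams a nonzero probability as long as the bigram or unigram has been observed, which is why the slide calls it an "ad hoc" fix for sparse counts.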
Error Production Factors
- Prosody (recognition should be prosody-independent)
- Noise (noise should be prevented)
- Spontaneous speech
P(A|W) Computing Approaches
- Dynamic Time Warping (DTW)
- Hidden Markov Model (HMM)
- Artificial Neural Network (ANN)
- Hybrid systems
Dynamic Time Warping 19
Dynamic Time Warping
Search limitations:
- First & end interval (endpoint constraints)
- Global limitation (figure)
- Local limitation (figure)
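The DTW alignment itself can be sketched with the standard dynamic-programming recurrence (a minimal version on 1-D sequences, using the common symmetric local constraint; the slides' global and endpoint constraints are omitted for brevity):

```python
# Minimal dynamic time warping between two 1-D feature sequences.
# Local constraint: each cell extends a match, insertion, or deletion.

def dtw_distance(x, y):
    """Total alignment cost between sequences x and y (absolute difference)."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# A time-stretched copy of a template aligns perfectly (cost 0),
# which is exactly why DTW suits variable speaking rates.
template = [0, 1, 2, 3, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(template, stretched))  # 0.0
```

In isolated-word recognition the input utterance is compared this way against every stored template and the word with the smallest distance wins.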
Artificial Neural Network
A simple computation element of a neural network, with inputs x_0 ... x_{N-1} and weights w_0 ... w_{N-1}:

y = f( Σ_{i=0..N-1} w_i x_i )
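A minimal sketch of this computation element (the sigmoid activation is an assumed choice, since the slide does not name f):

```python
import math

def neuron(x, w, activation=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """One computation element: y = f(sum_i w_i * x_i)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return activation(s)

# With these weights the weighted sum is 0, so sigmoid gives 0.5.
y = neuron([1.0, 0.5], [0.2, -0.4])
```

Whole networks are just layers of these elements wired together, which is what the perceptron slides below depict.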
Artificial Neural Network (Cont'd)
Neural network types:
- Perceptron
- Time Delay Neural Network (TDNN) and its computational element
Artificial Neural Network (Cont'd)
Single-layer perceptron: inputs x_0 ... x_N fully connected to outputs y_0 ... y_M (figure)
Artificial Neural Network (Cont'd)
Three-layer perceptron (figure)
Neural Network Topologies (figure)
TDNN 31
Neural Network Structures for Speech Recognition (figure)
Hybrid Methods
Hybrid neural network and matched filter for recognition: speech features pass through delay lines into a pattern classifier, which produces the acoustic output units (figure).
Neural Network Properties
- The system is simple, but many iterations are needed for training
- It does not require a specific structure to be determined in advance
- Despite its simplicity, the results are good
- The training set is large, so training should be done offline
Pre-processing
- Different preprocessing techniques are employed as the front end of speech recognition systems.
- The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.
The MFCC Method
The MFCC method is based on how the human ear perceives sounds. Compared to other features, MFCC behaves better in noisy environments. MFCC was developed primarily for speech recognition applications, but it also gives good performance in speaker recognition. The Mel, a unit of human hearing, is obtained from the mapping between frequency and perceived pitch (commonly written mel(f) = 2595 log10(1 + f/700)).
Steps of the MFCC Method
Step 1: mapping the signal from the time domain to the frequency domain using the short-time FFT.
- z(n): the speech signal
- w(n): a window function such as the Hamming window
- W_F = e^{-j2π/F}, m = 0, ..., F-1, where F is the frame length
Steps of the MFCC Method
Step 2: finding the energy of each filter-bank channel.
- M, the number of filter banks, is based on the Mel scale
- W_j(k), j = 0, 1, ..., M-1: the filter functions of the filter bank
Distribution of the filters based on the Mel scale (figure)
Steps of the MFCC Method
Step 4: compression, applying the DCT to the log filter-bank energies to obtain the MFCC coefficients of order n, n = 0, ..., L.
MFCC block diagram:
Framing -> |FFT|^2 -> Mel-scaling -> Logarithm -> IDCT -> Cepstra (low-order coefficients kept); a differentiator then produces the Delta and Delta-Delta Cepstra.
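The MFCC pipeline can be sketched end-to-end for a single frame (a minimal sketch: the triangular Mel filterbank construction and the use of a type-II DCT in place of the diagram's IDCT are common conventions, and all parameter values here are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centers evenly spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        for k in range(l, c):                       # rising edge
            fb[j - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling edge
            fb[j - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """|FFT|^2 -> Mel filterbank -> log -> DCT, for one windowed frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    fb = mel_filterbank(n_filters, len(frame), sr)
    energies = np.log(fb @ spec + 1e-10)            # avoid log(0)
    n = np.arange(n_filters)                        # type-II DCT matrix
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ energies                           # low-order cepstra

sr = 16000
t = np.arange(512) / sr
frame = np.hamming(512) * np.sin(2 * np.pi * 440 * t)  # a 440 Hz tone
ceps = mfcc_frame(frame, sr)
```

Delta and delta-delta features would then be obtained by differentiating these coefficients across successive frames.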
Mel-Frequency Cepstral Coefficients (MFCC)
Properties of the Mel Cepstrum (MFCC)
- Maps the Mel filter-bank energies (using the DCT) along the directions in which their variance is maximum
- Makes the speech features partially independent of one another (an effect of the DCT)
- Good performance in clean environments
- Reduced efficiency in noisy environments
Time-Frequency Analysis
Short-term Fourier Transform: the standard way of frequency analysis, decomposing the incoming signal into its constituent frequency components.
- w(n): windowing function
- N: frame length
- p: step size
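Using the slide's notation (frame length N, step size p, window w(n)), the short-term Fourier transform can be sketched as (a minimal sketch; the Hann window and parameter values are illustrative choices):

```python
import numpy as np

def stft(x, N=256, p=128, window=np.hanning):
    """Short-term Fourier transform: frames of length N, step size p."""
    w = window(N)
    frames = [x[i:i + N] * w for i in range(0, len(x) - N + 1, p)]
    return np.array([np.fft.rfft(f) for f in frames])

# One second of a 1 kHz tone at 8 kHz sampling rate.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)
S = stft(x)
peak_bin = np.argmax(np.abs(S[0]))  # bin -> Hz: peak_bin * sr / N
```

Each row of S is the spectrum of one frame, so the magnitude of S is the familiar spectrogram.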
Critical Band Integration
- Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
- Frequency components within a critical band are not resolved: the auditory system interprets the signals within a critical band as a whole.
Bark scale 53
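The Bark scale assigns one unit per critical band. A minimal sketch using one common closed-form approximation (Traunmüller's formula; the exact mapping is tabulated rather than analytic, so this is an assumption about which approximation to use):

```python
def hz_to_bark(f):
    """Traunmüller's approximation of the Bark (critical-band) scale."""
    return 26.81 * f / (1960.0 + f) - 0.53

# Critical bands widen with frequency: a fixed 100 Hz step spans
# far fewer Barks at high frequencies than at low frequencies.
low = hz_to_bark(200) - hz_to_bark(100)
high = hz_to_bark(8200) - hz_to_bark(8100)
```

This compressive behavior is what both the Bark and Mel warpings try to capture.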
Feature Orthogonalization
- Spectral values in adjacent frequency channels are highly correlated.
- This correlation leads to a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
- Decorrelation is therefore useful to improve the parameter estimation.
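The decorrelation effect of the DCT can be demonstrated on simulated channel energies (a minimal sketch: adjacent-channel correlation is imitated by smoothing white noise across channels, which stands in for real log filter-bank outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate "filter-bank energies" whose adjacent channels are correlated
# by averaging white noise over a sliding window of 5 channels.
n_frames, n_channels = 2000, 20
raw = rng.standard_normal((n_frames, n_channels + 4))
smooth = np.stack([raw[:, i:i + 5].mean(axis=1)
                   for i in range(n_channels)], axis=1)

# Orthonormal type-II DCT matrix.
k = np.arange(n_channels)
C = np.cos(np.pi * np.outer(k, 2 * np.arange(n_channels) + 1)
           / (2 * n_channels))
C[0] *= 1 / np.sqrt(2)
C *= np.sqrt(2 / n_channels)

ceps = smooth @ C.T  # DCT of each frame's channel vector

def mean_abs_offdiag(x):
    """Average absolute off-diagonal correlation between columns."""
    c = np.corrcoef(x, rowvar=False)
    return np.abs(c - np.diag(np.diag(c))).mean()

before = mean_abs_offdiag(smooth)
after = mean_abs_offdiag(ceps)
# after < before: the DCT largely removes adjacent-channel correlation.
```

With nearly uncorrelated coefficients, a diagonal covariance suffices, which is precisely the parameter saving the slide refers to.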