8-Speech Recognition (PowerPoint PPT Presentation)


  1. 8-Speech Recognition
     - Speech Recognition Concepts
     - Speech Recognition Approaches
     - Recognition Theories
     - Bayes Rule
     - Simple Language Model
     - P(A|W) Network Types

  2. 7-Speech Recognition (Cont'd)
     - HMM Calculating Approaches
     - Neural Components
     - Three Basic HMM Problems
     - Viterbi Algorithm
     - State Duration Modeling
     - Training in HMM

  3. Recognition Tasks
     - Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)
     - Speaker Dependent, Multiple Speaker, and Speaker Independent
     - Vocabulary size:
       - Small: <20
       - Medium: >100, <1000
       - Large: >1000, <10000
       - Very Large: >10000

  4. Speech Recognition Concepts
     Speech recognition is the inverse of speech synthesis:
     - Speech synthesis: Text -> NLP Processing -> Speech Processing -> Speech
     - Speech recognition: Speech -> Speech Processing -> Phone Sequence -> NLP Understanding -> Text

  5. Speech Recognition Approaches
     - Bottom-Up Approach
     - Top-Down Approach
     - Blackboard Approach

  6. Bottom-Up Approach (figure)
     Pipeline: Signal Processing -> Feature Extraction -> Segmentation -> Recognized Utterance
     Knowledge sources applied along the way: Voiced/Unvoiced/Silence detection, Sound Classification Rules, Phonotactic Rules, Lexical Access, Language Model

  7. Top-Down Approach (figure)
     Knowledge sources: Inventory of speech recognition units, Word Dictionary, Task Grammar, Semantic Model
     Flow: Feature Analysis -> Unit Matching System -> Lexical Hypothesis -> Syntactic Hypothesis -> Semantic Hypothesis -> Utterance Verifier/Matcher -> Recognized Utterance

  8. Blackboard Approach (figure)
     Processes communicating through a shared blackboard: Acoustic Processes, Lexical Processes, Environmental Processes, Semantic Processes, Syntactic Processes

  9. Recognition Theories
     - Articulatory-Based Recognition: uses articulatory-system modeling for recognition; this theory is the most successful so far
     - Auditory-Based Recognition: uses the auditory system for recognition
     - Hybrid-Based Recognition: a combination of the above theories
     - Motor Theory: models the intended gesture of the speaker

  10. Recognition Problem
     - We have a sequence of acoustic symbols and want to find the words uttered by the speaker
     - Solution: find the most probable word sequence given the acoustic symbols

  11. Recognition Problem
     - A: acoustic symbols
     - W: word sequence
     - We should find the estimate Ŵ such that
       P(Ŵ|A) = max_W P(W|A)

  12. Bayes Rule
     P(x|y) P(y) = P(x,y)
     P(x|y) = P(y|x) P(x) / P(y)
     P(W|A) = P(A|W) P(W) / P(A)

  13. Bayes Rule (Cont'd)
     P(Ŵ|A) = max_W P(W|A) = max_W P(A|W) P(W) / P(A)
     Ŵ = argmax_W P(W|A) = argmax_W P(A|W) P(W)
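To make the decision rule concrete, here is a toy Python sketch: the candidate word sequences and their probabilities are invented for illustration, and P(A) is dropped because it does not affect the argmax.

```python
# Hypothetical candidates with made-up acoustic and language-model scores.
candidates = {
    "recognize speech": {"p_a_given_w": 0.0020, "p_w": 0.0100},
    "wreck a nice beach": {"p_a_given_w": 0.0030, "p_w": 0.0001},
}

def best_sequence(cands):
    # argmax over W of P(A|W) * P(W); P(A) is a common factor and is ignored.
    return max(cands, key=lambda w: cands[w]["p_a_given_w"] * cands[w]["p_w"])

print(best_sequence(candidates))  # "recognize speech"
```

The language model P(W) outvotes the slightly better acoustic match of the second hypothesis, which is exactly why the P(A|W) P(W) decomposition is used.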

  14. Simple Language Model
     W = w_1 w_2 w_3 ... w_n
     P(W) = Π_{i=1}^{n} P(w_i | w_1 w_2 ... w_{i-1})
          = P(w_1) P(w_2|w_1) P(w_3|w_2,w_1) P(w_4|w_3,w_2,w_1) ... P(w_n|w_{n-1},...,w_1)
     Computing this probability is very difficult and needs a very big database, so we use trigram and bigram models.

  15. Simple Language Model (Cont'd)
     Trigram:  P(W) = Π_{i=1}^{n} P(w_i | w_{i-1}, w_{i-2})
     Bigram:   P(W) = Π_{i=1}^{n} P(w_i | w_{i-1})
     Monogram: P(W) = Π_{i=1}^{n} P(w_i)

  16. Simple Language Model (Cont'd)
     Computing method:
       P(w3|w2,w1) = (number of occurrences of w1 w2 w3) / (total number of occurrences of w1 w2)
     Ad hoc method (interpolation):
       P(w3|w2,w1) = a1 f(w3|w2,w1) + a2 f(w3|w2) + a3 f(w3)
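The counting and ad hoc interpolation formulas can be sketched as follows; the tiny corpus and the weights a1, a2, a3 are illustrative assumptions (the slide does not fix their values).

```python
from collections import Counter

def trigram_mle(tokens):
    """P(w3|w1,w2) = count(w1 w2 w3) / count(w1 w2)."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return lambda w1, w2, w3: (tri[(w1, w2, w3)] / bi[(w1, w2)]
                               if bi[(w1, w2)] else 0.0)

def interpolated(tokens, a1=0.6, a2=0.3, a3=0.1):
    """Ad hoc smoothing: a1*f(w3|w1,w2) + a2*f(w3|w2) + a3*f(w3)."""
    tri_p = trigram_mle(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    n = len(tokens)
    def p(w1, w2, w3):
        f_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
        return a1 * tri_p(w1, w2, w3) + a2 * f_bi + a3 * uni[w3] / n
    return p

tokens = "the cat sat on the mat".split()
p = interpolated(tokens)
print(p("the", "cat", "sat"))
```

The interpolation never assigns zero probability to a trigram whose shorter histories were seen, which is the point of the ad hoc method.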

  17. Error Production Factors
     - Prosody (recognition should be prosody-independent)
     - Noise (noise should be prevented)
     - Spontaneous speech

  18. P(A|W) Computing Approaches
     - Dynamic Time Warping (DTW)
     - Hidden Markov Model (HMM)
     - Artificial Neural Network (ANN)
     - Hybrid Systems

  19.–22. Dynamic Time Warping (figures)

  23. Dynamic Time Warping
     Search limitations:
     - First and last interval (endpoint constraints)
     - Global limitation
     - Local limitation

  24. Dynamic Time Warping: Global Limitation (figure)

  25. Dynamic Time Warping: Local Limitation (figure)
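A minimal DTW sketch under common assumptions: 1-D feature sequences, absolute-difference local distance, the basic (i-1,j), (i,j-1), (i-1,j-1) local constraint, and fixed endpoints; the global and local band limitations shown in the figures are omitted for brevity.

```python
import numpy as np

def dtw(x, y):
    """Accumulated DTW distance between sequences x and y via
    dynamic programming over the alignment grid."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0                              # fixed start point
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])    # local distance (scalar frames)
            D[i, j] = cost + min(D[i - 1, j],  # insertion
                                 D[i, j - 1],  # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]                             # fixed end point

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: perfect alignment despite stretching
```

Adding a global limitation would simply restrict j to a band around i in the inner loop.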

  26. Artificial Neural Network
     Simple computational element of a neural network:
     inputs x_0 ... x_{N-1}, weights w_0 ... w_{N-1}
     y = Σ_{i=0}^{N-1} w_i x_i
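The computational element above can be sketched as follows; the sigmoid nonlinearity f is an assumption here, since the slide shows only the weighted sum.

```python
import math

def neuron(x, w, f=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """y = f(sum_i w_i * x_i): the simple computational element.
    f defaults to a sigmoid (an assumption, not fixed by the slide)."""
    return f(sum(wi * xi for wi, xi in zip(w, x)))

print(neuron([1.0, 1.0], [0.0, 0.0]))  # 0.5: sigmoid of a zero sum
```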

  27. Artificial Neural Network (Cont'd)
     Neural network types:
     - Perceptron
     - Time Delay Neural Network (TDNN)
     (figure: TDNN computational element)

  28. Artificial Neural Network (Cont'd)
     Single-layer perceptron (figure): inputs x_0 ... x_{N-1}, outputs y_0 ... y_{M-1}

  29. Artificial Neural Network (Cont'd)
     Three-layer perceptron (figure)
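A multilayer perceptron forward pass is just the single element repeated layer by layer; sigmoid activations and the layer sizes in the usage below are assumptions, since the figures fix neither.

```python
import numpy as np

def mlp_forward(x, weights):
    """Forward pass through a fully connected multilayer perceptron.
    `weights` is a list of (W, b) pairs, one per layer; each layer
    applies a sigmoid to its weighted sums."""
    a = np.asarray(x, dtype=float)
    for W, b in weights:
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))
    return a

# Usage: 2 inputs -> hidden layer of 3 -> 1 output (illustrative sizes).
layers = [(np.zeros((3, 2)), np.zeros(3)), (np.zeros((1, 3)), np.zeros(1))]
print(mlp_forward([0.2, 0.4], layers))  # all-zero weights give sigmoid(0) = 0.5
```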

  30. 2.5.4.2 Neural Network Topologies (figure)

  31. TDNN (figure)

  32.–33. 2.5.4.6 Neural Network Structures for Speech Recognition (figures)

  34. Hybrid Methods
     Hybrid neural network and matched filter for recognition (figure):
     Speech Features -> Delays -> Pattern Classifier -> Acoustic Output Units

  35. Neural Network Properties
     - The system is simple, but training requires many iterations
     - Does not presuppose a specific structure
     - Despite its simplicity, the results are good
     - The training set is large, so training should be done offline

  36. Pre-processing
     - Different preprocessing techniques are employed as the front end of speech recognition systems
     - The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.

  37.–42. (figures)

  43. MFCC
     - The MFCC method is based on how the human ear perceives sounds.
     - Compared to other methods, MFCC handles the characteristics of noisy environments better.
     - MFCC was introduced mainly for speech recognition applications, but it also performs well in speaker recognition.
     - The Mel scale, the perceptual pitch unit of the human ear, is obtained from frequency by the relation mel(f) = 2595 log10(1 + f/700).
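The standard Mel mapping and its inverse, as code:

```python
import math

def hz_to_mel(f):
    """Standard Mel-scale mapping: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # close to 1000 mel, by construction of the scale
```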

  44. Steps of the MFCC Method
     Step 1: Map the signal from the time domain to the frequency domain using the short-time FFT:
       X(m) = Σ_{n=0}^{F-1} z(n) w(n) W_F^{mn},  W_F = e^{-j2π/F},  m = 0, ..., F-1
     - z(n): the speech signal
     - w(n): a window function, such as a Hamming window
     - F: the length of a speech frame
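Step 1 can be sketched as framing, windowing, and an FFT per frame; the frame length, step size, and Hamming window below are assumptions (the slide names a Hamming-like window but fixes no sizes).

```python
import numpy as np

def short_time_fft(z, frame_len=256, step=128):
    """Window each frame (Hamming assumed) and take its FFT, mapping
    the signal from the time domain to the frequency domain."""
    w = np.hamming(frame_len)
    frames = [z[i:i + frame_len] * w
              for i in range(0, len(z) - frame_len + 1, step)]
    return np.array([np.fft.fft(f) for f in frames])

S = short_time_fft(np.ones(1000))
print(S.shape)  # (number of frames, frame_len)
</```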

  45. Steps of the MFCC Method
     Step 2: Find the energy of each filter-bank channel.
     - M, the number of filter banks, is based on the Mel scale.
     - The filter functions W_k(j), k = 0, 1, ..., M-1, are the triangular filters of the filter bank.
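The triangular Mel filter bank of step 2 can be sketched as follows; the filter count, FFT size, and sample rate are illustrative assumptions.

```python
import numpy as np

def mel_filterbank(n_filters=20, n_fft=256, sr=8000):
    """Triangular filters with center frequencies spaced uniformly on the
    Mel scale. Returns an (n_filters, n_fft//2 + 1) weight matrix."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edge frequencies, uniform in mel, converted back to Hz.
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(n_filters):
        l, c, r = bins[k], bins[k + 1], bins[k + 2]
        for j in range(l, c):          # rising slope
            fb[k, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):          # falling slope
            fb[k, j] = (r - j) / max(r - c, 1)
    return fb
```

Channel energies are then the dot products of this matrix with the power spectrum of each frame.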

  46. Distribution of filters based on the Mel scale (figure)

  47. Steps of the MFCC Method
     Step 4: Compression: apply the DCT to the spectrum to obtain the MFCC coefficients.
     - n: the order of the MFCC coefficient, n = 0, ..., L

  48. MFCC pipeline (figure):
     Framing -> |FFT|^2 -> Mel-scaling -> Logarithm -> IDCT (keep low-order coefficients) -> Cepstra -> Differentiator -> Delta & Delta-Delta Cepstra
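The Logarithm and (I)DCT boxes of this pipeline, which turn filter-bank energies into cepstral coefficients, can be sketched as follows; the unnormalized DCT-II basis is used here, keeping only the low-order coefficients as the figure indicates.

```python
import numpy as np

def mfcc_from_energies(E, n_ceps=13):
    """Log of the M filter-bank energies followed by a DCT,
    returning the first n_ceps cepstral coefficients."""
    M = len(E)
    logE = np.log(np.asarray(E, dtype=float))
    n = np.arange(n_ceps)[:, None]
    k = np.arange(M)[None, :]
    basis = np.cos(np.pi * n * (k + 0.5) / M)   # DCT-II basis rows
    return basis @ logE
```

Flat (constant) channel energies produce only a DC cepstral term, which is why the higher coefficients capture spectral shape rather than overall level.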

  49. Mel Frequency Cepstral Coefficients (MFCC)

  50. Properties of the Mel Cepstrum (MFCC)
     - Maps the filter-bank energies in the direction where their variance is maximum (using the DCT)
     - Makes the speech features relatively, though not completely, independent of one another (effect of the DCT)
     - Good performance in clean environments
     - Reduced performance in noisy environments

  51. Time-Frequency Analysis
     Short-term Fourier Transform: the standard way of frequency analysis, decomposing the incoming signal into its constituent frequency components.
     - w(n): windowing function
     - N: frame length
     - p: step size

  52. Critical Band Integration
     - Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
     - Frequency components within a critical band are not resolved: the auditory system interprets the signals within a critical band as a whole.

  53. Bark Scale (figure)
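The slide shows only a figure; one common closed-form approximation of the Bark scale (due to Zwicker and Terhardt) is sketched below, with no claim that it is the exact curve in the figure.

```python
import math

def hz_to_bark(f):
    """Zwicker & Terhardt approximation of the Bark scale:
    z = 13*arctan(0.00076 f) + 3.5*arctan((f/7500)^2)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

print(hz_to_bark(1000.0))  # roughly 8.5 Bark
```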

  54. Feature Orthogonalization
     - Spectral values in adjacent frequency channels are highly correlated.
     - The correlation leads to a Gaussian model with many parameters: all the elements of the covariance matrix must be estimated.
     - Decorrelation is useful to improve the parameter estimation.
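A small synthetic sketch of why decorrelation helps: channel features built from a shared component are strongly correlated, and a DCT concentrates that component into one coefficient, shrinking the off-diagonal correlations. The data here are synthetic, not from any speech corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated "adjacent channel" features: a shared component plus small noise.
base = rng.normal(size=(1000, 1))
X = base + 0.1 * rng.normal(size=(1000, 8))

M = X.shape[1]
n = np.arange(M)[:, None]
k = np.arange(M)[None, :]
C = np.cos(np.pi * n * (k + 0.5) / M)       # DCT-II basis
Y = X @ C.T                                  # transformed features

# Largest off-diagonal entry of the correlation matrix.
off = lambda A: np.abs(np.corrcoef(A, rowvar=False) - np.eye(A.shape[1])).max()
print(off(X), off(Y))  # the DCT features are far less correlated
```

With near-diagonal correlations, a diagonal-covariance Gaussian becomes a reasonable model, which is the parameter saving the slide refers to.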
