
Speech recognition (Computer Literacy 1, Lecture 22, 10/11/2008)


  1. Topics
     • Definition of speech recognition
     • Brief history
     • Technology
     • How does speech recognition work
     • Speaker recognition
     • Problems of speech and speaker recognition

     Definition
     • Can also be called automatic speech recognition or computer speech recognition
     • Definition: speech recognition converts spoken words into machine-readable input, i.e. binary data that a computer can process

     History - Homer Dudley
     • In the 1930s Homer Dudley created the first human voice synthesizer at Bell Labs
     • He had started experimenting with electromechanical devices to produce analogues of human speech in the 1920s
     • His findings led to the patent for the "Vocoder" (voice + encoder)
     • The Vocoder is a method of reproducing speech through electronic means, allowing it to be transmitted over distances (e.g. telephone lines)

  2. Speech recognition - Voice recognition
     • Speech recognition and voice recognition can refer to the same technology
     • So you can treat these terms as synonyms
     • BUT there is also speaker recognition (which falls into the area of speech/voice recognition)

     The Vocoder
     • Originally developed as a speech coder for telecommunication
     • Primary use was secure radio communication, where the voice has to be encrypted before it is transmitted
     • Was used in the SIGSALY system for high-level communications during WW-II
     • Additionally, the Vocoder's hardware and software have been used as an electronic music instrument (Robert Moog, Kraftwerk, Pink Floyd)

     Technology
     • A speech signal is recorded by a microphone and captured with a sound card

     More Technology
     • The speech signal now has to pass through various stages
     • Here various mathematical and statistical methods are applied

  3. Inside the computer
     • The voice input is first captured on your sound card
     • The digital audio output of your card is processed using an FFT (Fast Fourier Transform)
     • This pre-processed signal is then further processed by an HMM (Hidden Markov Model)

     Fast Fourier Transforms (FFT)
     • The Fourier Transform is, in mathematics, an operation that transforms one function of a real variable into another
     • It works similarly to the way a chord of music that we hear can be transcribed as the individual notes being played
     • The FFT is an algorithm to compute the Discrete Fourier Transform (DFT), which is one form of Fourier analysis

     Hidden Markov Model (HMM)
     • Simply said: an HMM figures out when speech starts and stops
     • It is a statistical model
     • An HMM can be considered the simplest dynamic Bayesian network

     HMM
     • [Figure: HMM diagram] Legend: x = states; y = possible observations; a = state transition probabilities; b = output probabilities
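The FFT step described above can be sketched in a few lines. This is a minimal illustration, not the code any particular speech recogniser uses: it assumes a recursive radix-2 FFT whose input length is a power of two, and the names `fft` and `dominant_bin` as well as the test tone are made up for this example.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if n == 1:
        return [complex(samples[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(samples[0::2])  # DFT of the even-indexed samples
    odd = fft(samples[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_bin(samples):
    """Index of the strongest frequency bin in the lower half of the spectrum."""
    spectrum = fft(samples)
    half = len(samples) // 2
    return max(range(1, half), key=lambda k: abs(spectrum[k]))


# Hypothetical input: 64 samples of a pure tone completing 5 cycles.
N = 64
tone = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]
print(dominant_bin(tone))  # prints 5
```

Like picking one note out of a chord, the strongest bin identifies the frequency that dominates the sampled signal.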

  4. Sound
     • Sound itself is analogue, which is why the signal has to be translated into a digital signal that speech recognition software can read
     • The sound card digitises the signal; the FFT then transforms the incoming digital signal into bands of frequencies
     • When this is done, the next step is recognising these bands

     How does this work?
     • The speech recognition software has a database containing thousands of frequencies: the phonemes
     • A phoneme is the smallest unit of speech in a language or dialect
     • The sound of one phoneme is usually different from that of another, and swapping one for another can change the meaning of a word
     • E.g. the sound 'b' in bat versus 'r' in rat
     • The phoneme database is matched against the audio frequency bands that were sampled
     • Each phoneme is tagged with a feature number

     How does it figure out the right sound?
     • The software has to use complex techniques to approximate the sound and figure out which phonemes were used
     • One way of identifying the relevant phonemes is to train your speech recognition software
     • Another is to let the software prune its guesses for a particular utterance

     Pruning
     • When pruning, the software generates several hypotheses about what could have been spoken
     • It then generates scores for these hypotheses and goes for the one with the highest score
     • The ones with the lower scores get pruned out
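The pruning idea can be shown with a short sketch. It assumes the hypotheses already carry scores; the function name `prune`, the `beam_width` parameter, and the example scores are all invented for illustration and do not come from any real recogniser.

```python
def prune(hypotheses, beam_width):
    """Keep the beam_width best-scoring hypotheses, best first."""
    ranked = sorted(hypotheses.items(), key=lambda item: item[1], reverse=True)
    return ranked[:beam_width]


# Made-up scores for three competing guesses about one utterance.
scores = {
    "recognise speech": 0.62,
    "wreck a nice beach": 0.27,
    "wreck an ice peach": 0.11,
}

survivors = prune(scores, beam_width=2)  # the lowest-scoring guess is pruned out
best_text, best_score = survivors[0]     # the highest score wins
print(best_text)  # prints "recognise speech"
```

A real system would typically prune repeatedly while the utterance is still being decoded, but the principle of keeping only the best-scoring candidates is the same.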

  5. Train your Speech Recogniser
     • So far your software has assigned feature numbers to frequency bands
     • Now it uses statistics to figure out the probability of a particular feature number appearing in a phoneme
     • The feature number with the highest probability would correspond to the phoneme you've spoken

     More training
     • When you train your software, you feed it many variations of the same phoneme, and the software analyses all of these with statistical methods (e.g. using an HMM)
     • With the help of this large amount of training data, the software again assigns feature numbers to specific frequency bands

     Speaker recognition
     • Speaker recognition = WHO is speaking
     • Speech recognition = WHAT is said
     • Identifying the characteristics of one voice
     • Characteristics of a voice are e.g. pitch, melody, hoarse vs soft, frequency

     The 2 phases of speaker recognition
     • The speaker's voice is recorded, and a number of individual features (characteristics) of the voice are used to make a voice print
     • In speaker verification this print is compared to a previously recorded template to verify your voice
     • In speaker identification your voice print is compared to multiple voice prints in order to determine the best match
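The difference between verification and identification can be sketched as follows. This is a toy illustration under strong assumptions: a voice print is reduced to three made-up features scaled to the range 0 to 1, similarity is plain Euclidean distance, and the names `verify`, `identify`, the threshold, and the enrolled speakers are all hypothetical.

```python
import math


def verify(voice_print, template, threshold=0.15):
    """Verification: one-to-one check against a single stored template."""
    return math.dist(voice_print, template) <= threshold


def identify(voice_print, enrolled):
    """Identification: one-to-many search for the closest stored voice print."""
    return min(enrolled, key=lambda name: math.dist(voice_print, enrolled[name]))


# Hypothetical voice prints: [pitch, hoarseness, melody], each scaled to 0..1.
enrolled = {
    "alice": [0.30, 0.80, 0.20],
    "bob": [0.70, 0.30, 0.60],
}
sample = [0.32, 0.75, 0.25]

print(verify(sample, enrolled["alice"]))  # prints True
print(verify(sample, enrolled["bob"]))    # prints False
print(identify(sample, enrolled))         # prints alice
```

Verification answers a yes/no question about one claimed identity, while identification always returns the best match among all enrolled speakers, which is why real systems usually add a rejection threshold to identification as well.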

  6. Possible Problems of Speech and Speaker Recognition
     • Speech recognition can't work perfectly: people speak in different dialects and use all kinds of different pronunciations, and HMMs can't always distinguish when speech starts and ends because background noise can be confused with speech, etc.
     • Speaker recognition fails as soon as your voice quality differs from your sample, e.g. when you have a cold; ageing can also have an effect on your voice, etc.

     Key points
     • The Vocoder, the first speech synthesizer
     • Speech recognition and its technology
     • Fast Fourier Transform
     • The Hidden Markov Model
     • Train and prune your recogniser
     • Speaker recognition involves verification and identification
     • We all speak so differently, and our voices change throughout life, which makes it very hard to be a good speech recogniser
