Speech recognition
Computer Literacy 1
Lecture 22
10/11/2008

Topics
Definition of speech recognition
Brief history
Technology
How does speech recognition work
Speaker recognition
Problems of speech and speaker recognition

Definition
Can also be called automatic speech recognition or computer speech recognition.
Definition: speech recognition converts spoken words into machine-readable input, i.e. binary code.

History - Homer Dudley
In the 1930s Homer Dudley created the first human voice synthesizer at Bell Labs.
He started experimenting with electromechanical devices to produce analogues of human speech in the 1920s.
His findings led to the patent for the “Vocoder” (voice + encoder), a method of reproducing speech through electronic means and allowing it to be transmitted over distances (e.g. telephone lines).
Speech recognition - Voice recognition
What you can already see is that speech recognition and voice recognition can refer to the same technology.
So you can treat these terms as synonyms.
BUT there is also speaker recognition (which falls into the area of speech/voice recognition).

The Vocoder
Originally developed as a speech coder for telecommunication.
Its primary use was secure radio communication, where voice has to be encrypted before being transmitted.
It was used in the SIGSALY system for high-level communications during WW II.
Additionally, vocoder hardware and software have been used as an electronic music instrument (Robert Moog, Kraftwerk, Pink Floyd).

Technology
A speech signal is recorded by a microphone and captured with a sound card.

More Technology
The speech signal now has to pass through various stages.
Here various mathematical and statistical methods are applied.
Inside the computer
After the voice input is captured on your sound card, the digital audio output of your card is processed using an FFT (Fast Fourier Transform).
This pre-processed signal is then further processed by an HMM (Hidden Markov Model).

Fast Fourier Transforms (FFT)
The Fourier Transform is, in mathematics, an operation that transforms one function of a real variable into another.
It works similarly to the way a chord of music we hear can be transcribed as the notes being played.
The FFT is an algorithm to compute the Discrete Fourier Transform (DFT), which is one form of Fourier analysis.

Hidden Markov Model (HMM)
Simply said: an HMM figures out when speech starts and stops.
It is a statistical model.
An HMM can be considered as the simplest dynamic Bayesian network.

HMM
x = states; y = possible observations; a = state transition probabilities; b = output probabilities
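A minimal sketch of how an HMM could mark where speech starts and stops, assuming a hypothetical two-state model (silence, speech) that only observes whether each audio frame is quiet or loud. All probabilities are made up for illustration, and the names STATES, START, TRANS, EMIT and viterbi are not from the lecture. The Viterbi algorithm is used here as one standard way to find the most likely sequence of hidden states; a real recogniser works on far richer features.

```python
# Hypothetical two-state HMM; every number below is invented for illustration.
STATES = ("silence", "speech")               # x: hidden states
START = {"silence": 0.8, "speech": 0.2}      # probability of starting in each state
TRANS = {                                    # a: state transition probabilities
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
EMIT = {                                     # b: output probabilities
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.3, "loud": 0.7},
}


def viterbi(observations):
    """Return the most probable hidden state sequence for the observations."""
    # best[s] = (probability, path) of the best path so far that ends in state s
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the best previous state to arrive from.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda candidate: candidate[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])[1]


# y: one observation per audio frame
frames = ["quiet", "quiet", "loud", "loud", "loud", "quiet", "quiet"]
labels = viterbi(frames)
print(labels)
```

With these invented numbers, the frames labelled "speech" give the estimated start and end of the spoken part.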
Sound
Sound itself is analogue, so the signal first has to be translated into a digital signal that speech recognition software can read.
The sound card does this translation; the FFT then transforms the incoming digital signal into bands of frequencies.
When this is done, the next step is recognising these bands.

How does this work?
The speech recognition software has a database containing thousands of frequencies.
Phonemes:
A phoneme is the smallest unit of speech in a language or dialect that can distinguish one word from another.
Swapping one phoneme for another can change the meaning of a word, e.g. the sound ‘b’ in bat vs. ‘r’ in rat.
The phoneme database is matched against the audio frequency bands that were sampled.
Each phoneme is tagged with a feature number.

How does it figure out the right sound?
The software has to use complex techniques to approximate the sound and figure out which phonemes are used.
One way of identifying relevant phonemes is to train your speech recognition software.
Or you could let the software prune its guesses for a particular piece of speech.

Pruning
When pruning, the software generates several hypotheses on what could have been spoken.
It then generates scores for these hypotheses and decides to go for the one with the highest score.
The ones with the lower scores get pruned out.
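The pruning step can be sketched in a few lines, assuming the hypotheses and their scores are already available as a dictionary. The words, the scores and the beam_width parameter are hypothetical and chosen only to illustrate the idea; how a real recogniser computes the scores is not shown here.

```python
def prune(hypotheses, beam_width):
    """Keep only the beam_width highest-scoring hypotheses."""
    ranked = sorted(hypotheses.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:beam_width])


# Hypothetical scores for what might have been spoken (made-up numbers).
scores = {"bat": 0.62, "rat": 0.21, "pat": 0.12, "mat": 0.05}

survivors = prune(scores, beam_width=2)    # the lower-scoring ones are pruned out
best = max(survivors, key=survivors.get)   # final decision: the highest score

print(survivors)
print(best)
```

Keeping more than one survivor lets a recogniser revise its guess as more of the utterance arrives; with a beam width of 1 it simply commits to the top hypothesis.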
Train your Speech Recogniser
So your software has applied feature numbers to frequency bands.
Now it uses statistics to figure out the probability of a particular feature number appearing in a phoneme.
The feature number with the highest probability would correspond to the phoneme you’ve spoken.

More training
When you train your software, you feed it many variations of the same phoneme, and your software analyses all of these through statistical methods (e.g. using an HMM).
With the help of this large amount of training phonemes, your software again assigns feature numbers to specific frequency bands.

Speaker recognition
Speaker recognition = WHO is speaking
Speech recognition = WHAT is said
Identifying characteristics of one voice.
Characteristics of voice are e.g. pitch, melody, hoarse vs soft, frequency.

The 2 phases of speaker recognition
The speaker’s voice is recorded and a number of individual features (characteristics) of the voice are used to make a voice print.
In speaker verification this print is compared to a previously recorded template to verify your voice.
In speaker identification your voice print is compared to multiple voice prints in order to determine the best match.
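The difference between verification and identification can be sketched as follows, assuming a voice print is just a short list of numbers. The two features (average pitch in Hz and a hoarseness score), the enrolled names, the values and the threshold are all hypothetical; real systems use many more features and normalise them before comparing.

```python
import math


def distance(print_a, print_b):
    """Euclidean distance between two voice prints of equal length."""
    return math.dist(print_a, print_b)


def verify(sample, template, threshold):
    """Speaker verification: does the sample match ONE claimed speaker's template?"""
    return distance(sample, template) <= threshold


def identify(sample, enrolled):
    """Speaker identification: which of MANY enrolled voice prints is the best match?"""
    return min(enrolled, key=lambda name: distance(sample, enrolled[name]))


# Hypothetical voice prints: [average pitch in Hz, hoarseness 0..1], made-up values.
enrolled = {"alice": [210.0, 0.2], "bob": [120.0, 0.6]}
sample = [205.0, 0.3]

is_alice = verify(sample, enrolled["alice"], threshold=10.0)
who = identify(sample, enrolled)

print(is_alice)
print(who)
```

Verification answers a yes/no question against one template, so it needs a threshold; identification always returns the closest enrolled print, even when none of them is a good match.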
Possible Problems of Speech and Speaker Recognition
Speech recognition can’t work perfectly since people speak in different dialects and use all kinds of different pronunciations, HMMs can’t always distinguish when speech starts and ends since background noise can be confused with speech, etc.
Speaker recognition fails as soon as your voice quality is different from your sample, e.g. when you have a cold, or because ageing has an effect on your voice, etc.

Key points
The Vocoder, first speech synthesizer
Speech recognition and its technology
Fast Fourier Transform
The Hidden Markov Model
Train and prune your recogniser
Speaker recognition involves verification and identification
We all speak so differently, and our voices change throughout life, which makes it very hard to be a good speech recogniser