Mobile Speech Processing
David Huggins-Daines
Language Technologies Institute
Carnegie Mellon University
September 19, 2008
Outline
◮ Mobile Devices
  ◮ What are they?
  ◮ What would we like to do with them?
◮ Mobile Speech Applications
◮ Mobile Speech Technologies
◮ Current Research
Mobile Devices
◮ What is a “mobile device”?
  ◮ A hammer is a device, and you can carry it around with you!
  ◮ But no, that’s not what we mean here
Mobile Devices
◮ What is a “mobile device”?
  ◮ A device that goes everywhere with you
  ◮ ... which provides some or all of the functions of a computer
  ◮ ... and some things it doesn’t, such as a cell phone or GPS
Speech on Mobile Devices
◮ Why do we care about speech processing on these devices?
  ◮ Because they are the future of computers
  ◮ Because speech is actually a useful way to interact with them, unlike with full-sized computers
◮ What kind of speech processing do we care about?
  ◮ Speech coding to improve voice quality for cellular and VoIP
  ◮ Speech recognition for hands-free input to apps
  ◮ Speech synthesis for eyes-free output from apps
◮ In some cases, speech is a natural and convenient modality
◮ In other cases, it is a necessity (e.g. in-car navigation)
Speech on Mobiles vs. Mobile Speech
◮ None of this necessarily implies doing actual speech processing (aside from coding) on the device itself
◮ Telephone dialog systems are “mobile” by any definition
  ◮ Let’s Go - bus scheduling information
  ◮ HealthLine - medical information for rural health workers
  ◮ But all synthesis and recognition is done on a server
  ◮ This can be a good thing, especially in the latter case
  ◮ You can’t run a speech recognizer on a Motofone or a Nokia 1010
◮ Speech processing on the device is useful for:
  ◮ Multimodal applications
  ◮ Disconnected applications
  ◮ Access to local data
Some Mobile Speech Applications
◮ GPS navigation
  ◮ Older systems used a small number of recorded prompts (“turn left”, “100 metres”, etc.)
  ◮ More recently, TTS has been used to speak street names
  ◮ Even more recently, ASR is used for input
◮ Voice dialing
  ◮ Old systems used DTW and required training
  ◮ Newer ones build models from your address book
  ◮ Cactus for iPhone - uses CMU Flite and Sphinx
◮ Voice-driven search (local, web, etc.)
  ◮ Nuance, Vlingo, TellMe, Microsoft are all doing this
◮ Voice-to-text
  ◮ Typically server-based, requires a data connection
  ◮ “on-line”, ASR-based: Vlingo, Nuance
  ◮ “off-line”, human-assisted: SpinVox, Jott, ReQall
◮ Speech-to-speech translation
Mobile Speech Technologies
◮ Speech Coding
  ◮ Efficient digital representation of speech signals
  ◮ Fundamental for 2G and 3G cell networks and VoIP
◮ Speech Synthesis
  ◮ Speech output for commands, directions
  ◮ Text-to-speech for messages, books, other content
◮ Speech Recognition
  ◮ Command and control (“voice control”)
  ◮ Dictation (speech-to-text for e-mail, SMS)
  ◮ Search input (questions, keywords)
◮ Dialogue
Speech Coding
◮ A fairly mature technology (work started in the 1960s)
  ◮ Early versions were mostly for military applications
  ◮ Digital cell phone networks changed this dramatically
◮ Almost universally based on linear prediction and the source-filter model (see the sketch below)
  ◮ Each sample is predicted as a weighted sum of the P previous samples
  ◮ The weights are linear prediction coefficients (LPCs), calculated to minimize the mean squared prediction error
  ◮ Conveniently enough, this is actually a good model of the frequency response of the vocal tract (given enough LPCs)
  ◮ An “excitation function” models the glottal source
◮ Everything else is just tweaking:
  ◮ Better excitation functions (CELP)
  ◮ Variable bit rates (AMR)
  ◮ Compression tricks (VAD + comfort noise)
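As a concrete illustration, here is a minimal sketch in C of LPC analysis using the autocorrelation method and the Levinson-Durbin recursion. The frame length, prediction order, and toy input signal are illustrative assumptions, not taken from any particular codec.

    /* Minimal sketch of LPC analysis via the Levinson-Durbin recursion.
     * P and FRAME_LEN are illustrative choices, not from any standard. */
    #include <stdio.h>

    #define P 10          /* prediction order */
    #define FRAME_LEN 160 /* 20 ms at 8 kHz   */

    /* Compute autocorrelation r[0..P] of one frame. */
    static void autocorrelate(const float *x, int n, float *r)
    {
        for (int lag = 0; lag <= P; ++lag) {
            r[lag] = 0.0f;
            for (int i = lag; i < n; ++i)
                r[lag] += x[i] * x[i - lag];
        }
    }

    /* Levinson-Durbin: solve for the LPCs a[1..P] that minimize the
     * mean squared prediction error, given the autocorrelation. */
    static void levinson_durbin(const float *r, float *a)
    {
        float err = r[0];
        float tmp[P + 1] = {0};

        for (int i = 1; i <= P; ++i) {
            float k = r[i];              /* reflection coefficient */
            for (int j = 1; j < i; ++j)
                k -= a[j] * r[i - j];
            k /= err;
            a[i] = k;
            for (int j = 1; j < i; ++j)  /* update previous LPCs */
                tmp[j] = a[j] - k * a[i - j];
            for (int j = 1; j < i; ++j)
                a[j] = tmp[j];
            err *= 1.0f - k * k;         /* remaining prediction error */
        }
    }

    int main(void)
    {
        float frame[FRAME_LEN], r[P + 1], a[P + 1] = {0};
        /* Toy input: a decaying pulse train standing in for voiced speech. */
        for (int i = 0; i < FRAME_LEN; ++i)
            frame[i] = (i % 40 == 0) ? 1.0f : frame[i - 1] * 0.9f;
        autocorrelate(frame, FRAME_LEN, r);
        levinson_durbin(r, a);
        for (int i = 1; i <= P; ++i)
            printf("a[%d] = %f\n", i, a[i]);
        return 0;
    }

A real coder would go on to quantize the LPCs (e.g. as line spectral pairs) and transmit them along with the excitation parameters.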
Mobile Speech Synthesis
◮ Two traditional categories, one new one:
  ◮ Synthesis by rule, e.g. formant synthesis
  ◮ Concatenative synthesis, e.g. diphone, unit selection
  ◮ Statistical-parametric synthesis (“HMM synthesis”)
◮ We have had very efficient (often hardware-based) implementations of TTS for decades
  ◮ They sound terrible (but are often quite intelligible)
◮ The challenges for mobile devices are:
  ◮ Achieving natural-sounding speech
  ◮ Dealing with very large, irregular vocabularies
  ◮ Dealing with raw and diverse input text
Mobile Speech Synthesis
◮ Unit selection currently gives the most natural output
◮ But it is very ill-suited to mobile implementations
  ◮ The best systems use gigabytes of speech data
  ◮ But, you say... I have an 8GB microSD card in my phone!
  ◮ Storage capacity is not the real problem:
    ◮ Search time: finding the right units of speech
    ◮ Access time: loading them from the storage medium
  ◮ Signal generation can also be time-consuming if not efficiently implemented
◮ Some ways to improve efficiency:
  ◮ Compress the speech database
  ◮ Prune the speech database by discarding units that are infrequently or never used (see the sketch below)
  ◮ Approximate search algorithms (much like ASR)
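A minimal sketch of the usage-based pruning idea, assuming we have already counted how often each unit was selected when synthesizing a large text corpus; all names and numbers here are hypothetical, not from any real system.

    /* Minimal sketch of usage-based pruning for a unit database: keep
     * only units selected at least MIN_USES times over a large corpus. */
    #include <stdio.h>

    #define N_UNITS 8
    #define MIN_USES 2

    int main(void)
    {
        /* Usage counts gathered from a synthesis run over a big corpus. */
        int uses[N_UNITS] = { 0, 7, 1, 12, 0, 3, 1, 25 };
        int kept = 0;

        for (int u = 0; u < N_UNITS; ++u)
            if (uses[u] >= MIN_USES) {
                /* In a real system: copy this unit's waveform into the
                 * pruned database and remap its index. */
                ++kept;
            }
        printf("kept %d of %d units\n", kept, N_UNITS);
        return 0;
    }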
Mobile Speech Synthesis
◮ Statistical-parametric synthesis is quite promising
  ◮ Models are quite small (1-2MB)
  ◮ The search problem is nonexistent
◮ Parameter and waveform generation are currently the most time-consuming parts
  ◮ It requires higher-dimensionality parameterizations than concatenative synthesis
  ◮ Output parameters are smoothed using an iterative algorithm (similar to EM)
  ◮ Waveform generation from mel-cepstra (MCEP) is much slower than from LPC
◮ Dictionary compression and text normalization
  ◮ The dictionary can be compressed by building letter-to-sound models and listing only the exceptions (see the sketch below)
  ◮ Efficient finite-state transducer representations can be created for pronunciation and text-processing rules
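A minimal sketch of the exception-listing idea: only words that the letter-to-sound rules get wrong are stored, and everything else falls back to the rules. The lts_predict function below is a stand-in for a real trained model (decision trees or a WFST); the names and phone strings are illustrative, not Flite's API.

    /* Minimal sketch of exception-based dictionary compression. */
    #include <stdio.h>
    #include <string.h>

    /* Stand-in for a real trained letter-to-sound (LTS) model. */
    static const char *lts_predict(const char *word)
    {
        static char phones[64];
        /* ... real systems use decision trees or WFSTs here ... */
        snprintf(phones, sizeof(phones), "rules(%s)", word);
        return phones;
    }

    /* Exception list: only entries where lts_predict() would be wrong. */
    static const struct { const char *word, *phones; } exceptions[] = {
        { "colonel", "K ER N AH L" },
        { "czar",    "Z AA R" },
    };

    static const char *lookup(const char *word)
    {
        for (size_t i = 0; i < sizeof(exceptions) / sizeof(exceptions[0]); ++i)
            if (strcmp(exceptions[i].word, word) == 0)
                return exceptions[i].phones;
        return lts_predict(word); /* fall back to the compact rule model */
    }

    int main(void)
    {
        printf("%s\n", lookup("colonel")); /* from the exception list */
        printf("%s\n", lookup("cat"));     /* from the LTS rules      */
        return 0;
    }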
Mobile Speech Recognition
◮ Challenges for mobile devices:
  ◮ Variable and noisy acoustic environments
  ◮ Large vocabularies
  ◮ Open-domain dictation input
◮ As with speech synthesis, simple ASR is not very resource-intensive, although it has not been as widely implemented
◮ Even with large vocabularies, ASR can be done efficiently
  ◮ The most important factor is the complexity of the grammar
  ◮ Commercial systems achieve impressive performance based on very constrained grammars
  ◮ Systems tend to be extensively tuned for a given application
Mobile Speech Recognition: Acoustic Issues
◮ How do you talk to a device?
  ◮ This depends on the application, user, and environment
  ◮ Acoustic feature vectors can look very different
  ◮ Microphones may not be optimized for all positions
◮ Noisy environments
  ◮ Mobile devices are more likely to be used in noisy environments
  ◮ Worse, they are more likely to be used in difficult ones
    ◮ Non-stationary noise, crosstalk, human babble
  ◮ Array processing is not well suited to handheld devices
◮ On the bright side:
  ◮ Usually a mobile device has only one user
  ◮ Speaker adaptation can improve acoustic modeling
  ◮ Speaker identification can be used to filter out babble and crosstalk
Mobile Speech Recognition: Computational Issues
◮ Acoustic feature extraction
  ◮ Efficient, as long as it is implemented properly
  ◮ Fixed-point arithmetic, data-parallel processing (see the sketch below)
◮ Most processing time is consumed, in roughly equal amounts, by:
  ◮ Acoustic model evaluation
  ◮ Search (hypothesis generation and evaluation)
◮ These can be made computationally efficient, but they must also be made memory-efficient, search in particular
◮ This necessarily involves tuning heuristics, because a complete solution is intractable
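A minimal sketch of Q15 fixed-point arithmetic of the kind used for feature extraction on devices without floating-point hardware: a toy dot product, as might be used to apply one mel filter to a power spectrum. This is illustrative, not code from any particular recognizer.

    /* Minimal sketch of Q15 fixed-point arithmetic (15 fractional bits). */
    #include <stdint.h>
    #include <stdio.h>

    #define Q15(x) ((int16_t)((x) * 32768.0f)) /* float constant to Q15 */

    /* Multiply two Q15 values: widen to 32 bits, then shift back down. */
    static int16_t q15_mul(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * b) >> 15);
    }

    /* Dot product with a 32-bit accumulator to avoid overflow;
     * Q15 * Q15 products are Q30, so shift back down at the end. */
    static int32_t q15_dot(const int16_t *x, const int16_t *w, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc += (int32_t)x[i] * w[i];
        return acc >> 15; /* back to Q15 */
    }

    int main(void)
    {
        int16_t x[3] = { Q15(0.5f), Q15(0.25f), Q15(0.25f) };
        int16_t w[3] = { Q15(0.5f), Q15(0.5f),  Q15(0.25f) };
        /* 0.5*0.5 + 0.25*0.5 + 0.25*0.25 = 0.4375 -> 14336 in Q15 */
        printf("dot = %d (expect 14336)\n", (int)q15_dot(x, w, 3));
        printf("mul = %d (expect 8192)\n",  (int)q15_mul(Q15(0.5f), Q15(0.5f)));
        return 0;
    }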
Mobile Speech Recognition: Acoustic Modeling
◮ Exact acoustic model evaluation is intractable:
  \[ P(o \mid s_i, \lambda) = \sum_{k=1}^{K} \frac{w_{ik}}{\sqrt{(2\pi)^D \lvert \Sigma_{ik} \rvert}} \exp\left( -\sum_{d=1}^{D} \frac{(o_d - \mu_{ikd})^2}{2\sigma_{ikd}^2} \right) \]
◮ Typical continuous-density acoustic model:
  ◮ 5000 tied states, each with
  ◮ 32 Gaussian densities, of
  ◮ 39 dimensions
◮ Complete evaluation of all log-likelihoods for one 10ms frame (see the sketch below):
  ◮ 155,000 log-additions
  ◮ 12,480,000 subtractions
  ◮ 12,480,000 multiplications
◮ That’s 2500 million operations per second!
  ◮ Your new MacBook Pro can do that, but just barely
  ◮ (yes, its video card can do it easily)
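To make the cost concrete, here is a minimal sketch of evaluating one tied state's log-likelihood with diagonal-covariance Gaussian mixtures, using the sizes from the slide; the data structures and names are illustrative, not any real decoder's. Note the two subtractions and two multiplications per dimension per density, and the K−1 log-additions per state, which is where the operation counts above come from.

    /* Minimal sketch of per-state GMM log-likelihood evaluation. */
    #include <math.h>
    #include <stdio.h>

    #define K 32 /* Gaussians per state */
    #define D 39 /* feature dimensions  */

    typedef struct {
        float logw[K];    /* log mixture weights                  */
        float mean[K][D]; /* means                                */
        float prec[K][D]; /* 1 / (2 * sigma^2), precomputed       */
        float lognorm[K]; /* -0.5 * log((2*pi)^D * |Sigma_k|)     */
    } state_t;

    /* log(exp(a) + exp(b)) without overflow: the "log-addition"
     * counted on the slide. */
    static float log_add(float a, float b)
    {
        if (a < b) { float t = a; a = b; b = t; }
        return a + log1pf(expf(b - a));
    }

    static float state_loglik(const state_t *s, const float *o)
    {
        float total = -INFINITY;
        for (int k = 0; k < K; ++k) {
            float ll = s->logw[k] + s->lognorm[k];
            for (int d = 0; d < D; ++d) {
                float diff = o[d] - s->mean[k][d]; /* 1st subtraction   */
                ll -= diff * diff * s->prec[k][d]; /* 2nd sub, 2 muls   */
            }
            total = log_add(total, ll); /* K-1 effective log-additions */
        }
        return total;
    }

    int main(void)
    {
        static state_t s; /* zero-initialized toy model */
        float o[D] = {0};
        for (int k = 0; k < K; ++k)
            s.logw[k] = logf(1.0f / K);
        printf("log-likelihood = %f\n", state_loglik(&s, o));
        return 0;
    }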
Mobile Speech Recognition: Acoustic Modeling
◮ How do we make this fast enough?
  ◮ Only evaluate densities for “active” phones in the search
  ◮ Predict which densities will score highly using a smaller, approximate model set, and only evaluate those
  ◮ Use fewer densities, and either:
    ◮ share them between all HMM states (semi-continuous HMM)
    ◮ or between all the states for some phonetic class (phonetically-tied HMM)
  ◮ Make density computation faster by quantizing acoustic features and parameters
  ◮ Skip some frames in the input (see the sketch below), either by:
    ◮ blindly computing only multiples of N (usually 2 or 3), or
    ◮ detecting “interesting” regions in the input and only computing densities there (landmark detection)
◮ Every ASR system in existence uses some combination of these
◮ However, too many approximations can make the system slower
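A minimal sketch of the blind frame-skipping trick: densities are evaluated only on every Nth frame and the cached scores are reused in between. The eval_state function is a stand-in for the expensive mixture evaluation above; everything here is illustrative.

    /* Minimal sketch of blind frame skipping (multiples-of-N trick). */
    #include <stdio.h>

    #define N_SKIP 2   /* evaluate every 2nd frame */
    #define N_STATES 4 /* toy model size           */

    /* Stand-in for the expensive GMM evaluation of the previous sketch. */
    static float eval_state(int state, int frame)
    {
        return -(float)(state + frame); /* fake log-likelihood */
    }

    int main(void)
    {
        float cached[N_STATES] = {0};
        for (int t = 0; t < 6; ++t) {
            if (t % N_SKIP == 0)                  /* real evaluation */
                for (int s = 0; s < N_STATES; ++s)
                    cached[s] = eval_state(s, t);
            /* Search consumes cached[] every frame, fresh or reused. */
            printf("frame %d: state0 score %.1f%s\n", t, cached[0],
                   (t % N_SKIP) ? " (reused)" : "");
        }
        return 0;
    }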
Mobile Speech Recognition: Search
◮ Search is not arithmetically intensive
  ◮ It largely consists of adding up scores and comparing them to other scores (see the sketch below)
◮ However, it is very memory-intensive
◮ The search module in an ASR system touches:
  ◮ Acoustic scores
  ◮ Language model scores
  ◮ Dictionary entries
  ◮ Viterbi path scores and backpointers
  ◮ Backpointer table entries
◮ In other words, pretty much every piece of memory except the acoustic model parameters
◮ Worse yet, there are sequential dependencies between all of these memory accesses
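A minimal sketch of the Viterbi update at the heart of the search, showing that it is all additions, comparisons, and scattered memory traffic (scores read, backpointers written). This is a toy HMM, not any real decoder's data structures.

    /* Minimal sketch of Viterbi search with a backpointer table. */
    #include <stdio.h>

    #define N 3 /* states */
    #define T 4 /* frames */

    int main(void)
    {
        /* Toy log-domain scores; larger is better. */
        float trans[N][N] = {{-1,-2,-9},{-9,-1,-2},{-9,-9,-1}};
        float emit[T][N]  = {{-1,-5,-5},{-4,-1,-5},{-5,-2,-2},{-5,-5,-1}};
        float score[T][N];
        int   bp[T][N]; /* backpointer table: best predecessor per state */

        for (int j = 0; j < N; ++j) {
            score[0][j] = emit[0][j];
            bp[0][j] = -1;
        }
        for (int t = 1; t < T; ++t)
            for (int j = 0; j < N; ++j) {
                float best = -1e30f; int from = 0;
                for (int i = 0; i < N; ++i) {
                    float s = score[t-1][i] + trans[i][j]; /* add ...     */
                    if (s > best) { best = s; from = i; }  /* ... compare */
                }
                score[t][j] = best + emit[t][j];
                bp[t][j] = from; /* one of many memory writes per frame */
            }

        /* Backtrace from the best final state. */
        int j = 0;
        for (int i = 1; i < N; ++i)
            if (score[T-1][i] > score[T-1][j]) j = i;
        for (int t = T - 1; t >= 0; --t) {
            printf("t=%d state=%d\n", t, j);
            j = (t > 0) ? bp[t][j] : j;
        }
        return 0;
    }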