mobile speech processing

Mobile Speech Processing David Huggins-Daines Language Technologies Institute


  1. Mobile Speech Processing David Huggins-Daines Language Technologies Institute Carnegie Mellon University September 19, 2008

  2. Outline ◮ Mobile Devices ◮ What are they? ◮ What would we like to do with them? ◮ Mobile Speech Applications ◮ Mobile Speech Technologies ◮ Current Research

  3. Mobile Devices ◮ What is a “mobile device”? ◮ A hammer is a device, and you can carry it around with you! ◮ But no, that’s not what we mean here

  4. Mobile Devices ◮ What is a “mobile device”? ◮ A device that goes everywhere with you ◮ ... which provides some or all of the functions of a computer ◮ ... and some functions a computer doesn’t, such as those of a cell phone or GPS.

  5. Speech on Mobile Devices ◮ Why do we care about speech processing on these devices? ◮ Because they are the future of computers ◮ Because speech is actually a useful way to interact with them, unlike full-sized computers ◮ What kind of speech processing do we care about? ◮ Speech coding to improve voice quality for cellular and VoIP ◮ Speech recognition for hands-free input to apps ◮ Speech synthesis for eyes-free output from apps ◮ In some cases, speech is a natural and convenient modality ◮ In other cases, it is a necessity (e.g. in-car navigation)

  6. Speech on Mobiles vs. Mobile Speech ◮ None of this necessarily implies doing actual speech processing (aside from coding) on the device itself ◮ Telephone dialog systems are “mobile” by any definition ◮ Let’s Go - bus scheduling information ◮ HealthLine - medical information for rural health workers ◮ But all synthesis and recognition is done on a server ◮ This can be a good thing, especially in the latter case ◮ You can’t run a speech recognizer on a Motofone or a Nokia 1010 ◮ Speech processing on the device is useful for: ◮ Multimodal applications ◮ Disconnected applications ◮ Access to local data

  7. Some Mobile Speech Applications ◮ GPS navigation ◮ Older systems used a small number of recorded prompts (“turn left”, “100 metres”, etc) ◮ More recently, TTS has been used to speak street names ◮ Even more recently, ASR is used for input ◮ Voice dialing ◮ Old systems used DTW and required training ◮ Newer ones build models from your address book ◮ Cactus for iPhone - uses CMU Flite and Sphinx ◮ Voice-driven search (local, web, etc) ◮ Nuance, Vlingo, TellMe, Microsoft are all doing this ◮ Voice-to-text ◮ Typically server-based, requires a data connection ◮ “on-line”, ASR-based: Vlingo, Nuance ◮ “off-line”, human-assisted: SpinVox, Jott, ReQall ◮ Speech to Speech Translation

  8. Mobile Speech Technologies ◮ Speech Coding ◮ Efficient digital representation of speech signals ◮ Fundamental for 2G and 3G cell networks and VoIP ◮ Speech Synthesis ◮ Speech output for commands, directions ◮ Text-to-speech for messages, books, other content ◮ Speech Recognition ◮ Command and control (“voice control”) ◮ Dictation (Speech-to-text for e-mail, SMS) ◮ Search input (questions, keywords) ◮ Dialogue

  9. Speech Coding ◮ A fairly mature technology (started in the 1960s) ◮ Early versions were mostly for military applications ◮ Digital cell phone networks changed this dramatically ◮ Almost universally based on linear prediction and the source-filter model ◮ Each sample is predicted as a weighted sum of the P previous samples ◮ The weights are linear prediction coefficients (LPCs), calculated to minimize the mean squared prediction error ◮ Conveniently enough, this is actually a good model of the frequency response of the vocal tract (given enough LPCs) ◮ An “excitation function” models the glottal source ◮ Everything else is just tweaking ◮ Better excitation functions (CELP) ◮ Variable bit rates (AMR) ◮ Compression tricks (VAD + comfort noise)
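The linear prediction step on this slide can be sketched in a few lines. This is a minimal illustration, assuming the autocorrelation method and the Levinson-Durbin recursion, which is one standard way to find the weights that minimize mean squared prediction error; the slide does not say which method a given codec uses. The function names are made up for this example.

```python
def autocorrelation(signal, max_lag):
    """r[lag] = sum over t of signal[t] * signal[t - lag], for lag = 0..max_lag."""
    n = len(signal)
    return [sum(signal[t] * signal[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelations r[0..order].

    Returns (coeffs, error), where coeffs[j - 1] is the weight on the sample
    j steps in the past, so the prediction of x[n] is
    sum(coeffs[j - 1] * x[n - j] for j in 1..order).
    """
    a = [0.0] * (order + 1)   # a[0] is unused; a[1..order] hold the weights
    error = r[0]              # prediction error energy for order 0
    for i in range(1, order + 1):
        if error == 0.0:
            break             # signal is already perfectly predicted
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error


# Toy signal (an assumption for illustration): x[n] = 0.9 * x[n - 1].
signal = [0.9 ** n for n in range(200)]
lpc, residual_energy = levinson_durbin(autocorrelation(signal, 2), 2)
```

For this toy signal the first weight comes out very close to 0.9 and the second very close to zero, since one previous sample already predicts the next one almost exactly.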

  10. Mobile Speech Synthesis ◮ Two traditional categories, one new one ◮ Synthesis by rule, e.g. formant synthesis ◮ Concatenative synthesis, e.g. diphone, unit selection ◮ Statistical-parametric synthesis (“HMM synthesis”) ◮ We have had very efficient (often hardware-based) implementations of TTS for decades ◮ They sound terrible (but are often quite intelligible) ◮ The challenges for mobile devices are: ◮ Achieving natural-sounding speech ◮ Dealing with very large, irregular vocabularies ◮ Dealing with raw and diverse input text

  11. Mobile Speech Synthesis ◮ Unit selection currently gives the most natural output ◮ But it is very ill-suited to mobile implementations ◮ Best systems use gigabytes of speech data ◮ But, you say... I have an 8GB microSD card in my phone! ◮ Search time: finding the right units of speech ◮ Access time: loading them from the storage medium ◮ Signal generation can also be time-consuming if not efficiently implemented ◮ Some ways to improve efficiency: ◮ Compress the speech database ◮ Prune the speech database by discarding units that are infrequently or never used ◮ Approximate search algorithms (much like ASR)

  12. Mobile Speech Synthesis ◮ Statistical-parametric synthesis is quite promising ◮ Models are quite small (1-2MB) ◮ The search problem is nonexistent ◮ Parameter and waveform generation are currently the most time-consuming parts ◮ Requires higher dimensionality parameterizations than concatenative synthesis ◮ Output parameters are smoothed using an iterative algorithm (similar to EM) ◮ Waveform generation from mel-cepstral coefficients (mcep) is much slower than from LPC ◮ Dictionary compression and text normalization ◮ Dictionary can be compressed by building letter-to-sound models and listing only the exceptions ◮ Efficient finite-state transducer representations can be created for pronunciation and text processing rules
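The dictionary compression idea can be illustrated with a toy sketch: keep a letter-to-sound model, and store only the words it gets wrong. Everything here is hypothetical, including the one-letter-to-one-phone rule table, the phone symbols, and the two-word lexicon; a real letter-to-sound model would be trained from the full dictionary.

```python
# Hypothetical one-letter-to-one-phone rules, standing in for a trained model.
TOY_RULES = {"c": "k", "a": "ae", "t": "t", "o": "ow", "n": "n", "e": "eh"}


def toy_letter_to_sound(word):
    """Predict a pronunciation by mapping each letter to a single phone."""
    return [TOY_RULES[letter] for letter in word]


def build_exception_list(lexicon, letter_to_sound):
    """Keep only the entries whose pronunciation the model predicts wrongly."""
    return {word: phones for word, phones in lexicon.items()
            if letter_to_sound(word) != phones}


def pronounce(word, exceptions, letter_to_sound):
    """Look up the exception list first, then fall back to the model."""
    if word in exceptions:
        return exceptions[word]
    return letter_to_sound(word)


# Toy lexicon (an assumption for illustration).
lexicon = {"cat": ["k", "ae", "t"], "one": ["w", "ah", "n"]}
exceptions = build_exception_list(lexicon, toy_letter_to_sound)
```

Only "one" needs to be stored here, because the rules already produce the right phones for "cat". The saving grows with how well the model covers the regular part of the vocabulary.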

  13. Mobile Speech Recognition ◮ Challenges for mobile devices are: ◮ Variable and noisy acoustic environments ◮ Large vocabularies ◮ Open domain dictation input ◮ As with speech synthesis, simple ASR is not very resource intensive, although it has not been as widely implemented ◮ Even with large vocabularies, ASR can be done efficiently ◮ The most important factor is the complexity of the grammar ◮ Commercial systems achieve impressive performance based on very constrained grammars ◮ Systems tend to be extensively tuned for a given application

  14. Mobile Speech Recognition: Acoustic Issues ◮ How do you talk to a device? ◮ This depends on the application, user, and environment ◮ Acoustic feature vectors can look very different ◮ Microphones may not be optimized for all positions ◮ Noisy environments ◮ Mobile devices are more likely to be used in noisy environments ◮ Worse, they are more likely to be used in difficult ones ◮ Non-stationary noise, crosstalk, human babble ◮ Array processing is not well suited to handheld devices ◮ On the bright side: ◮ Usually a mobile device has only one user ◮ Speaker adaptation can improve acoustic modeling ◮ Speaker identification can be used to filter out babble and crosstalk

  15. Mobile Speech Recognition: Computational Issues ◮ Acoustic feature extraction ◮ Efficient, as long as it is implemented properly ◮ Fixed-point arithmetic, data-parallel processing ◮ Most processing time is consumed, in roughly equal amounts, by: ◮ Acoustic model evaluation ◮ Search (hypothesis generation and evaluation) ◮ These can be made computationally efficient, but they must also be made memory efficient, search in particular ◮ This necessarily involves tuning heuristics, because a complete solution is intractable

  16. Mobile Speech Recognition: Acoustic Modeling ◮ Exact acoustic model evaluation is intractable: $P(o \mid s_i, \lambda) = \sum_{k=1}^{K} w_{ik} \frac{1}{\sqrt{(2\pi)^D |\Sigma_{ik}|}} \exp\left( \sum_{d=1}^{D} \frac{-(o_d - \mu_{ikd})^2}{2\sigma_{ikd}^2} \right)$ ◮ Typical continuous-density acoustic model: ◮ 5000 tied states, each with ◮ 32 Gaussian densities, of ◮ 39 dimensions ◮ Complete evaluation of all log-likelihoods for one 10ms frame: ◮ 155000 log-additions ◮ 12480000 subtractions ◮ 12480000 multiplications ◮ That’s 2500 million operations per second! ◮ Your new MacBook Pro can do that, but just barely ◮ (yes, its video card can do it easily)
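The operation counts on this slide can be reproduced with a short sketch. It assumes diagonal covariances and evaluation in the log domain, with two subtractions and two multiplications per dimension per density. That per-dimension cost is a reading inferred from the slide's totals, not something the slide states. The function names are made up for this example.

```python
import math


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving the log domain."""
    if a < b:
        a, b = b, a
    if b == -math.inf:
        return a
    return a + math.log1p(math.exp(b - a))


def state_log_likelihood(obs, weights, means, variances):
    """Log-likelihood of one frame under one state's diagonal Gaussian mixture.

    A tuned implementation would precompute the normalizer and 1 / (2 * var);
    they are computed inline here for clarity.
    """
    total = -math.inf
    for w, mu, var in zip(weights, means, variances):
        log_density = math.log(w) - 0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        for o, m, v in zip(obs, mu, var):
            diff = o - m
            log_density -= diff * diff / (2.0 * v)
        total = log_add(total, log_density)
    return total


def ops_per_frame(n_states, n_densities, n_dims):
    """Operation counts for scoring every state on a single frame."""
    per_dim = n_states * n_densities * n_dims
    return {
        "log_additions": n_states * (n_densities - 1),
        "subtractions": 2 * per_dim,      # assumed: 2 per dimension
        "multiplications": 2 * per_dim,   # assumed: 2 per dimension
    }


counts = ops_per_frame(n_states=5000, n_densities=32, n_dims=39)
frames_per_second = 100  # one frame every 10 ms
ops_per_second = sum(counts.values()) * frames_per_second
```

With these assumptions the totals match the slide: 155000 log-additions and 12480000 each of subtractions and multiplications per frame, or about 2.5 billion operations per second at 100 frames per second.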

  17. Mobile Speech Recognition: Acoustic Modeling ◮ How do we make this fast enough? ◮ Only evaluate densities for “active” phones in search ◮ Predict which densities will score highly using a smaller, approximate model set, and only evaluate those ◮ Use fewer densities and: ◮ Share them between all HMM states (semi-continuous HMM) ◮ or all the states for some phonetic class (phonetically-tied HMM) ◮ Make density computation faster by quantizing acoustic features and parameters ◮ Skip some frames in the input, either by ◮ Blindly computing only every Nth frame (N is usually 2 or 3) ◮ Detecting “interesting” regions in the input and only computing densities there (landmark detection) ◮ Every ASR system in existence uses some combination of these ◮ However, too many approximations can make the system slower
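Blind frame skipping is the simplest of these techniques to illustrate. The sketch below scores only every Nth frame and reuses the most recent scores for the frames in between. Reusing the previous scores is an assumption made for this example, since the slide does not say how skipped frames are filled in. `score_frame` is a hypothetical callback standing in for full acoustic model evaluation.

```python
def score_with_frame_skipping(frames, score_frame, skip=2):
    """Evaluate score_frame on every `skip`-th frame and reuse the result between."""
    scores = []
    last = None
    for t, frame in enumerate(frames):
        if t % skip == 0:
            last = score_frame(frame)  # the expensive acoustic model evaluation
        scores.append(last)            # skipped frames reuse the latest scores
    return scores


# Count how many full evaluations actually happen on 7 toy frames.
calls = []


def toy_score(frame):
    calls.append(frame)
    return frame * 10


skipped_scores = score_with_frame_skipping(list(range(7)), toy_score, skip=3)
```

With `skip=3`, only frames 0, 3 and 6 are evaluated, so the acoustic model cost drops to roughly a third while search still sees a score for every frame.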

  18. Mobile Speech Recognition: Search ◮ Search is not arithmetically intensive ◮ It largely consists of adding up scores and comparing them to other scores ◮ However, it is very memory intensive ◮ The search module in an ASR system touches: ◮ Acoustic scores ◮ Language model scores ◮ Dictionary entries ◮ Viterbi path scores and backpointers ◮ Backpointer table entries ◮ In other words, pretty much every piece of memory except the acoustic model parameters ◮ Worse yet, there are sequential dependencies between all these memory accesses
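A bare-bones Viterbi pass shows why search is mostly additions, comparisons and memory traffic. This is a minimal sketch over a tiny fully connected HMM with log-domain scores. It leaves out the language model, dictionary and pruning that a real decoder needs, and all names and numbers are made up for illustration.

```python
def viterbi(acoustic_scores, transitions, initial):
    """Best state sequence under log-domain scores.

    acoustic_scores[t][s]: log score of state s at frame t
    transitions[p][s]:     log score of moving from state p to state s
    initial[s]:            log score of starting in state s
    """
    n_states = len(initial)
    path_scores = [initial[s] + acoustic_scores[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(acoustic_scores)):
        new_scores = []
        frame_backpointers = []
        for s in range(n_states):
            # Add and compare: pick the best predecessor for state s.
            best_prev = max(range(n_states),
                            key=lambda p: path_scores[p] + transitions[p][s])
            new_scores.append(path_scores[best_prev] + transitions[best_prev][s]
                              + acoustic_scores[t][s])
            frame_backpointers.append(best_prev)
        # Each frame depends on the previous frame's scores (sequential dependency).
        path_scores = new_scores
        backpointers.append(frame_backpointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: path_scores[s])
    best_score = path_scores[state]
    path = [state]
    for frame_backpointers in reversed(backpointers):
        state = frame_backpointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Two states, three frames; the acoustic scores favor states 0, 0, then 1.
toy_acoustic = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0]]
toy_transitions = [[0.0, 0.0], [0.0, 0.0]]
best_path, best_score = viterbi(toy_acoustic, toy_transitions, [0.0, 0.0])
```

The inner loop does no multiplication at all. Its cost comes from reading path scores, transition scores and acoustic scores, and from writing a backpointer for every state at every frame.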

