PocketSphinx: Open-Source Speech Recognition for Hand-held and Embedded Devices


1. PocketSphinx: Open-Source Speech Recognition for Hand-held and Embedded Devices
   David Huggins-Daines (dhuggins@cs.cmu.edu), Mohit Kumar (mohitkum@cs.cmu.edu), Arthur Chan (archan@cs.cmu.edu), Alan W Black (awb@cs.cmu.edu), Mosur Ravishankar (rkm@cs.cmu.edu), Alexander I. Rudnicky (air@cs.cmu.edu)
   Language Technologies Institute, Carnegie Mellon University, 05/18/06

2. What is PocketSphinx?
   ● Based on Sphinx-II
     – Open-source code under an MIT-style license
     – Widely used at CMU and elsewhere
     – Mature and stable API
   ● Design goals
     – Statistical language model support (finite-state grammars also available)
     – Medium-to-large vocabulary (1-10k words)
     – Make it go faster

3. Why do we need it?
   ● Typical desktop/workstation of 2006
     – 128-bit memory bus (6-10 GB/sec)
     – 1.8-3 GHz processor (5000 MIPS)
     – ATA, SATA, or SCSI storage (100-300 MB/sec)
   ● Typical PDA/SoC/smartphone of 2006
     – 16- or 32-bit memory bus (100-400 MB/sec)
     – 200-600 MHz processor (200-700 MIPS)
     – SD/MMC or CF storage (1-16 MB/sec)
     – No FPU or vector unit (sometimes a DSP...)

4. ASR bottlenecks
   ● Wait, you say:
     – My cell phone is pretty darn fast!
     – At least as fast as that DEC we had a real-time 20k-word system on back in 1996!
   ● However: ASR is limited by system bandwidth
     – Sphinx benchmarks favor large caches and high memory bandwidth (Intel) [benchmark chart omitted; source: techreport.com]
     – Search, LM, and dictionary look-up are highly memory-intensive
     – We will have to deal with them

5. Scaling: Hand-held vs Desktop
   [Chart: decoding speed in xRT (0 to 2.25) vs. vocabulary size (10, 1000, 5000 words), one curve for hand-held and one for desktop]

6. How to make it go faster
   ● Low-hanging fruit
     – Front-end optimizations (fixed-point, logarithm)
     – Speeding up GMM computation
     – Old-fashioned beam tuning
   ● Non-speech-related work
     – Memory optimization (+ model compression)
     – Machine-level optimization (assembly code)
   ● What's left?
     – Search optimization: dynamic beam tuning
     – Language model compression and optimization

7. Front-End Optimizations
   ● Fixed-point calculations (see the sketch below)
     – 32-bit, in 16.16 or 18.14 format
     – Using the 64-bit multiply (SMULL) on ARM, 16.16 multiply-accumulate on DSPs
     – MFCC calculated in the log domain, using a base-2 log lookup with conversion to base 1.0001
   ● Audio downsampling
     – Allows a smaller-order FFT and MFCC
     – Not as useful for large-vocabulary systems
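As an illustration of the first bullet, here is a minimal sketch of Q16.16 fixed-point arithmetic of the kind the slide describes. The type and macro names (fixed16, fixmul16, FLOAT2FIX) are illustrative, not PocketSphinx's actual API; the widening 32x32-to-64-bit multiply is what the ARM SMULL instruction provides.

```c
/* Minimal Q16.16 fixed-point sketch; names are illustrative only. */
#include <stdint.h>
#include <stdio.h>

typedef int32_t fixed16;               /* Q16.16: 16 integer, 16 fraction bits */
#define FLOAT2FIX(f)  ((fixed16)((f) * 65536.0))
#define FIX2FLOAT(x)  ((double)(x) / 65536.0)

/* 32x32 -> 64-bit multiply, then shift back to Q16.16; on ARM the
 * compiler emits SMULL for the widening multiply mentioned on the slide. */
static fixed16 fixmul16(fixed16 a, fixed16 b)
{
    return (fixed16)(((int64_t)a * (int64_t)b) >> 16);
}

int main(void)
{
    fixed16 a = FLOAT2FIX(3.25), b = FLOAT2FIX(-1.5);
    printf("3.25 * -1.5 = %f\n", FIX2FLOAT(fixmul16(a, b)));  /* -4.875 */
    return 0;
}
```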

8. GMM Optimizations
   ● Top-N Gaussian selection (Mosur 96)
     – Use the previous frame's top codewords to select the current frame's
     – Standard Sphinx-II technique
   ● Partial frame-based downsampling (Woszczyna 98)
     – Only update the top-N every Mth frame (sketched below)
     – Can significantly affect accuracy
   ● kd-tree based Gaussian selection (Fritsch 96)
     – Approximate nearest-neighbor search in k dimensions using stable partition trees
     – 10% speedup, little or no effect on accuracy
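A schematic sketch of the partial frame-based downsampling idea, assuming a flat codebook and a toy scoring function; none of these names come from the PocketSphinx sources.

```c
/* Partial frame-based downsampling: the full codebook is rescored only
 * every DOWNSAMPLE-th frame; in between, only the shortlist of the
 * previous top-N codewords is rescored. Illustrative code, not
 * PocketSphinx's. */
#include <limits.h>
#include <stdio.h>

#define N_CODEWORDS 256
#define TOP_N       4
#define DOWNSAMPLE  3   /* full re-ranking every 3rd frame */

/* Stand-in for the real Gaussian (Mahalanobis) scoring of one codeword. */
static int score_codeword(int cw, int frame)
{
    return -((cw * 7 + frame * 13) % 1000);
}

int main(void)
{
    int topn[TOP_N] = {0, 1, 2, 3};   /* shortlist of best codeword indices */

    for (int frame = 0; frame < 9; ++frame) {
        if (frame % DOWNSAMPLE == 0) {
            /* Expensive pass: score every codeword, keep the best TOP_N. */
            int scores[N_CODEWORDS];
            for (int cw = 0; cw < N_CODEWORDS; ++cw)
                scores[cw] = score_codeword(cw, frame);
            for (int i = 0; i < TOP_N; ++i) {
                int best = 0;
                for (int cw = 1; cw < N_CODEWORDS; ++cw)
                    if (scores[cw] > scores[best])
                        best = cw;
                topn[i] = best;
                scores[best] = INT_MIN;   /* exclude from further selection */
            }
        }
        /* Cheap pass (every frame): rescore only the shortlist. */
        for (int i = 0; i < TOP_N; ++i)
            printf("frame %d: codeword %3d score %d\n",
                   frame, topn[i], score_codeword(topn[i], frame));
    }
    return 0;
}
```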

9. Search Optimizations
   ● Absolute pruning
     – Approximations in the front end and GMM increase the effective beam width, paradoxically decreasing performance
     – We would like to enforce a hard limit on the number of states or word exits evaluated per frame; how?
   ● Histogram pruning (Ney 1996)
     – Partition the beam width into bins
     – Dynamically recompute the beam based on bin occupancy counts (see the sketch below)
     – 30% speedup with 10% relative degradation in WER
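The bin-counting idea behind histogram pruning fits in a few lines of C. This is an illustrative reconstruction, not the decoder's actual code; constants like N_BINS, BEAM_WIDTH, and MAX_ACTIVE stand in for tunable parameters.

```c
/* Histogram pruning sketch (after Ney): bucket active-state scores into
 * bins below the frame's best score, then tighten the beam so that at
 * most MAX_ACTIVE states survive the frame. */
#include <stdio.h>

#define N_BINS     64
#define BEAM_WIDTH 6400      /* widest allowed beam, in log-score units */
#define MAX_ACTIVE 5         /* hard cap on states evaluated per frame */

static int histogram_beam(const int *scores, int n, int best)
{
    int bins[N_BINS] = {0};
    int bin_size = BEAM_WIDTH / N_BINS;

    for (int i = 0; i < n; ++i) {
        int d = best - scores[i];             /* distance below best */
        if (d < BEAM_WIDTH)
            bins[d / bin_size]++;
    }
    /* Walk outward from the best bin until the occupancy budget is spent. */
    int count = 0;
    for (int b = 0; b < N_BINS; ++b) {
        count += bins[b];
        if (count > MAX_ACTIVE)
            return (b + 1) * bin_size;        /* tightened beam */
    }
    return BEAM_WIDTH;                        /* budget never exceeded */
}

int main(void)
{
    int scores[] = { -10, -50, -300, -900, -1200, -2500, -4000, -6000 };
    int n = sizeof scores / sizeof scores[0];
    printf("beam this frame: %d\n", histogram_beam(scores, n, -10));
    return 0;
}
```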

10. Memory Optimizations
    ● Read-only model files
      – mmap(2)-able, shareable between processes
      – Leverage OS-level caching (virtual memory); see the sketch below
    ● Precompiled (binary) LM
      – Inherited from Sphinx-II
      – Adapted for memory-mapping
      – 5000+ word vocabulary in <32 MB of RAM
    ● Read-only binary model definition file
      – Pre-built radix tree mapping triphones to senones
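A minimal illustration of the read-only model-file idea: map the file with mmap(2) so its pages come straight from the OS page cache and are shared between decoder processes. The file name is a placeholder, and the real binary formats are of course structured rather than raw bytes.

```c
/* Map a model file read-only and shared; pages are demand-loaded and
 * shared by every process that maps the same file. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("model.bin", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_SHARED: no private copy is made, so several
     * decoders can reuse one in-memory copy of the model. */
    const unsigned char *model =
        mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (model == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mapped %lld bytes; first byte = 0x%02x\n",
           (long long)st.st_size, model[0]);

    munmap((void *)model, st.st_size);
    close(fd);
    return 0;
}
```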

11. Performance

    Task          Vocabulary   Perplexity   xReal-Time   Word Error
    TIDIGITS              10        13.86         0.5         0.87%
    RM1                  994        46.79         0.71       13.11%
    WSJ devel5k         4989        143.5         0.96       18.50%

    ● Test platform: iPaq 3670
      – 206 MHz StrongARM running Linux (FPU emulation in the kernel)
    ● Also running on:
      – Other embedded Linux platforms
      – Analog Devices Blackfin, uClinux
      – WinCE using the GNU toolchain (untested)

12. How to get it
    ● Web site: http://www.speech.cs.cmu.edu/pocketsphinx/
    ● Compiles with GCC for i386, ARM, PowerPC, and Blackfin
    ● Cross-compiles for Windows CE using an arm-wince-pe toolchain (available in various Linux distributions)
    ● Compatible with the Sphinx2 fbs.h interface (sketched below)
    ● Good (fast) acoustic models forthcoming
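For flavor, a rough sketch of batch decoding through the legacy fbs.h interface. The call names follow the Sphinx-II API as I recall it (fbs_init, uttproc_begin_utt, uttproc_rawdata, uttproc_end_utt, uttproc_result, fbs_end); treat the exact signatures as approximate and consult the shipped fbs.h for the authoritative prototypes.

```c
/* Approximate sketch of decoding one raw-audio file via the legacy
 * Sphinx-II fbs.h interface; signatures may differ by version. */
#include <stdio.h>
#include "fbs.h"   /* legacy Sphinx-II interface, retained by PocketSphinx */

int main(int argc, char *argv[])
{
    int16 buf[4096];
    int32 n, frames;
    char *hyp;
    FILE *raw = fopen("utt.raw", "rb");   /* placeholder: 16-bit raw audio */
    if (raw == NULL) { perror("fopen"); return 1; }

    fbs_init(argc, argv);                 /* models and config from argv */
    uttproc_begin_utt(NULL);
    while ((n = fread(buf, sizeof(int16), 4096, raw)) > 0)
        uttproc_rawdata(buf, n, 1);       /* block until chunk is processed */
    uttproc_end_utt();
    if (uttproc_result(&frames, &hyp, 1) >= 0)
        printf("HYP: %s\n", hyp);
    fbs_end();
    fclose(raw);
    return 0;
}
```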

13. Future work
    ● Improve accuracy
      – Remove Sphinx-II codebook limitations
    ● Optimize the language model and dictionary
      – Statistical profiling of LM access patterns
    ● Investigate dynamic search strategies
    ● Remove various legacy code
    ● Fast speaker and channel adaptation

14. Thank you
    ● Any questions?
    This work was supported by DARPA grant NBCH-D-03-0010. The content of the information in this publication does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.
