Computer Speech Recognition: Mimicking the Human System


  1. Computer Speech Recognition: Mimicking the Human System. Li Deng, Microsoft Research, Redmond. July 24, 2005, Banff/BIRS

  2. Fundamental Equations • Enhancement (denoising): • Recognition: Ŵ = argmax_W P(W | x) = argmax_W P(x | W) P(W) • Importance of speech modeling
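A minimal sketch of this MAP decision rule in Python; the functions `acoustic_likelihood` and `language_prior` and the toy probabilities are illustrative stand-ins, not part of the talk:

```python
import math

# MAP recognition rule: W_hat = argmax_W P(x | W) P(W),
# computed in the log domain for numerical stability.
def map_decode(x, hypotheses, acoustic_likelihood, language_prior):
    return max(
        hypotheses,
        key=lambda W: math.log(acoustic_likelihood(x, W))
                      + math.log(language_prior(W)),
    )

# Toy usage with made-up probabilities.
hyps = ["ten themes", "den seems"]
print(map_decode(
    x=None,
    hypotheses=hyps,
    acoustic_likelihood=lambda x, W: {"ten themes": 0.02, "den seems": 0.01}[W],
    language_prior=lambda W: {"ten themes": 0.6, "den seems": 0.4}[W],
))
```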

  3. Speech Recognition --- Introduction • Converting naturally uttered speech into text and meaning • Conventional technology --- statistical modeling and estimation (HMM) • Limitations: noisy acoustic environments, rigid speaking style, constrained tasks, unrealistic demands for training data, huge model sizes, etc.; performance remains far below human speech recognition • Trend: incorporate key aspects of human speech processing mechanisms

  4. Segment-Level Speech Dynamics

  5. Production & Perception: Closed-Loop Chain
  [Diagram: speaker and listener in a closed-loop communication chain. Speaker side: message → motor/articulators → speech acoustics. Listener side: ear/auditory reception → internal message model → decoded message.]

  6. Encoder: Two-Stage Production Mechanisms
  Phonology (higher level):
  • Symbolic encoding of the linguistic message
  • Discrete representation by phonological features
  • Loosely-coupled multiple feature tiers
  • Overcomes the beads-on-a-string phone model
  • Theories of distinctive features, feature geometry & articulatory phonology
  • Accounts for partial/full sound deletion/modification in casual speech
  Phonetics (lower level):
  • Converts discrete linguistic features to continuous acoustics
  • Mediated by motor control & articulatory dynamics
  • Mapping from articulatory variables to VT area function to acoustics
  • Accounts for coarticulation and reduction (target undershoot), etc.

  7. Encoder: Phonological Modeling
  Computational phonology:
  • Represents pronunciation variation as a constrained factorial Markov chain
  • Constraints come from articulatory phonology
  • Language-universal representation
  [Diagram: overlapping feature tiers for "ten themes" /t ε n θ i: m z/: LIPS (labial closure); TT, tongue tip (alveolar closure, dental constriction, alveolar constriction, alveolar closure); TB, tongue body (high/front gesture for iy, mid/front gesture for eh); VEL (nasality); GLO (voicing, aspiration).]
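A minimal sketch of the factorial-chain idea, assuming each tier evolves as an independent Markov chain with a simple self-transition constraint; the tier names follow the slide, but the state inventories and probabilities are illustrative:

```python
import random

# Factorial Markov chain over phonological feature tiers (sketch).
# Each tier runs its own Markov chain; the product of tier states
# at each time step encodes pronunciation variation.
TIERS = {
    "LIPS": ["none", "labial-closure"],
    "TT":   ["none", "alveolar-closure", "dental-constr"],
    "TB":   ["none", "high-front", "mid-front"],
    "VEL":  ["oral", "nasal"],
    "GLO":  ["voiced", "unvoiced", "aspirated"],
}

def step(state, stay_prob=0.8):
    """Advance every tier independently: stay with prob stay_prob, else jump."""
    new = {}
    for tier, value in state.items():
        if random.random() < stay_prob:
            new[tier] = value
        else:
            new[tier] = random.choice(TIERS[tier])
    return new

state = {tier: values[0] for tier, values in TIERS.items()}
for _ in range(5):
    state = step(state)
    print(state)
```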

  8. Encoder: Phonetic Modeling
  Computational phonetics:
  • Segmental factorial HMM for sequential targets in the articulatory or vocal tract resonance domain
  • Switching trajectory model for target-directed articulatory dynamics
  • Switching nonlinear state-space model for dynamics in speech acoustics
  • Illustration: see the sketch below and the computation graph on the next slide
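A minimal one-dimensional sketch of target-directed dynamics under a switching trajectory model, assuming the form z[t+1] = a·z[t] + (1 − a)·target(s) + noise, so the articulation exponentially approaches the target of the current segment; the segment targets and constants are illustrative:

```python
import random

# Target-directed articulatory dynamics (switching trajectory model, sketch):
# z[t+1] = a * z[t] + (1 - a) * target(s_t) + noise.
def simulate(segments, a=0.8, noise_std=0.02, z0=0.0):
    """segments: list of (target_value, duration) pairs, one per phone segment."""
    z, traj = z0, []
    for target, dur in segments:
        for _ in range(dur):
            z = a * z + (1 - a) * target + random.gauss(0.0, noise_std)
            traj.append(z)
    return traj

# Three segments with different articulatory targets.
print(simulate([(1.0, 20), (-0.5, 20), (0.3, 20)])[:5])
```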

  9. Phonetic Encoder: Computation
  [DBN diagram: discrete phonological states S_1 ... S_K generate targets t_1 ... t_K, which drive articulation z_1 ... z_K, producing distortion-free acoustics o_1 ... o_K; distortion factors n_1 ... n_K (parameters N_1 ... N_K, channel h) yield the distorted acoustics y_1 ... y_K, with feedback to articulation.]
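A minimal sketch of one time slice of the generative chain in the diagram, S → t → z → o → y; the specific mappings (the nonlinear z-to-o function, the noise model) are illustrative stand-ins for the model's components:

```python
import random

# One time slice of the generative chain (sketch):
# S (discrete state) -> t (target) -> z (articulation)
#   -> o (clean acoustics) -> y (distorted acoustics).
def generate_slice(S, z_prev, targets, a=0.8):
    t = targets[S]                      # target selected by phonological state S
    z = a * z_prev + (1 - a) * t        # target-directed articulatory dynamics
    o = 2.0 * z + 0.1 * z ** 3          # illustrative nonlinear mapping z -> o
    n = random.gauss(0.0, 0.05)         # distortion factor
    y = o + n                           # distorted observation
    return z, o, y

targets = {0: 1.0, 1: -0.5}
z = 0.0
for S in [0, 0, 0, 1, 1, 1]:
    z, o, y = generate_slice(S, z, targets)
    print(f"S={S} z={z:.3f} o={o:.3f} y={y:.3f}")
```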

  10. Decoder I: Auditory Reception
  • Converts speech acoustic waves into an efficient & robust auditory representation
  • This processing is largely independent of phonological units
  • Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, ..., all the way to A1 cortex
  • Principal roles: 1) combat environmental acoustic distortion; 2) detect relevant speech features; 3) provide temporal landmarks to aid decoding
  • Key properties: 1) critical-band frequency scale, logarithmic compression; 2) adaptive frequency selectivity, cross-channel correlation; 3) sharp response to transient sounds; 4) modulation in independent frequency bands; 5) binaural noise suppression, etc.
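A minimal sketch of two of the properties listed above, a critical-band (mel-spaced) frequency scale with logarithmic compression; this is a standard filterbank front end written for illustration, not the auditory model described in the talk:

```python
import numpy as np

# Critical-band-style front end (sketch): mel-spaced triangular filters
# over a power spectrum, followed by logarithmic compression.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sr=16000):
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

frame = np.random.randn(512)                         # stand-in for a speech frame
power = np.abs(np.fft.rfft(frame)) ** 2
features = np.log(mel_filterbank() @ power + 1e-10)  # log compression
print(features.shape)                                # (24,) auditory-like features
```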

  11. Decoder II: Cognitive Perception
  • Cognitive process: recovery of the linguistic message
  • Relies on: 1) the "internal" model, i.e. structural knowledge of the encoder (production system); 2) a robust auditory representation of features; 3) temporal landmarks
  • Child speech acquisition is a process that gradually establishes the "internal" model
  • Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model (see the sketch below)
  • No motor theory: the above strategy requires no articulatory recovery from speech acoustics
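A minimal sketch of the analysis-by-synthesis strategy: score each hypothesized message by synthesizing its predicted acoustics from an internal forward model and comparing against the observation; the internal model, templates, and Gaussian error assumption here are illustrative:

```python
import numpy as np

# Analysis by synthesis (sketch): choose the hidden hypothesis whose
# synthesized acoustics best match the observed acoustics, under a
# Gaussian observation-error assumption.
def analysis_by_synthesis(observation, hypotheses, internal_model, sigma=1.0):
    """internal_model(h) -> predicted acoustic vector for hypothesis h."""
    def log_likelihood(h):
        pred = internal_model(h)
        return -np.sum((observation - pred) ** 2) / (2.0 * sigma ** 2)
    return max(hypotheses, key=log_likelihood)

# Toy internal model: each hypothesis synthesizes a template vector.
templates = {"ten": np.array([1.0, 0.2]), "den": np.array([0.8, 0.6])}
obs = np.array([0.95, 0.3])
print(analysis_by_synthesis(obs, templates, lambda h: templates[h]))
```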

  12. Speaker-Listener Interaction • On-line modification of the speaker's articulatory behavior (speaking effort, rate, clarity, etc.) based on the listener's "decoding" performance (i.e. discrimination) • Especially important for conversational speech recognition and understanding • On-line adaptation of "encoder" parameters • Novel criterion: maximize discrimination while minimizing articulation effort • In this closed-loop model, "effort" is quantified as the "curvature" of the temporal sequence of the articulatory vector z_t (see the sketch below) • No such concept of "effort" exists in conventional HMM systems
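A minimal sketch of effort as trajectory curvature, approximating the curvature of the articulatory sequence z_t by the norm of its discrete second difference; the trajectories and dimensionality are illustrative:

```python
import numpy as np

# Articulatory "effort" as trajectory curvature (sketch): smoother
# (lower-effort) trajectories have smaller second differences.
def articulation_effort(z):
    """z: (T, D) array of articulatory vectors; returns summed curvature."""
    second_diff = z[2:] - 2.0 * z[1:-1] + z[:-2]   # discrete d^2 z / dt^2
    return float(np.sum(np.linalg.norm(second_diff, axis=1)))

t = np.linspace(0.0, 1.0, 50)[:, None]
smooth = t * np.ones((1, 3))                 # straight-line trajectory: zero effort
wiggly = smooth + 0.1 * np.sin(20.0 * t)     # same endpoints, higher curvature
print(articulation_effort(smooth), "<", articulation_effort(wiggly))
```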

  13. Model synthesis in FT

  14. Model synthesis in cepstra
  [Plot: model vs. data for cepstral coefficients C1 and C2 over frames 0-250.]

  15. Procedure --- N-best Evaluation
  [Diagram: test data → LPCC feature extraction; a triphone HMM system produces an N-best list (N = 1000); each hypothesis is rescored via table lookup, FIR filtering, nonlinear mapping, and a Gaussian scorer on the residual.]
  H* = argmax { P(H_1), P(H_2), ..., P(H_1000) }
  Each hypothesis has a phonetic transcript & time alignment; free parameters include γ, μ, σ², T^(k)
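A minimal sketch of the N-best rescoring step, H* = argmax over the hypothesis list; the `rescore` function below is an illustrative stand-in for the table-lookup / FIR / nonlinear-mapping / Gaussian-scorer pipeline on the slide:

```python
# N-best rescoring (sketch): re-score each HMM hypothesis with the new
# model and return the best one.
def nbest_decode(nbest_list, rescore):
    """nbest_list: [(hypothesis, hmm_score), ...]; returns H* = argmax P(H_i)."""
    return max(nbest_list, key=lambda item: rescore(*item))[0]

# Toy usage: interpolate the HMM score with a stand-in new-model score.
def rescore(hyp, hmm_score, weight=0.5):
    new_model_score = -len(hyp)    # placeholder for the trajectory-model score
    return weight * hmm_score + (1.0 - weight) * new_model_score

print(nbest_decode([("ten themes", -10.0), ("den seems", -9.5)], rescore))
```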

  16. Results (recognition accuracy %) (work with Dong Yu)
  Lattice decode: new model 75.1, HMM system 72.5
  [Plot: accuracy (%) vs. N in the N-best list (N = 1 to 10001), new model compared against the HMM baseline.]

  17. Summary & Conclusion • Human speech production and perception viewed as synergistic elements in a closed-loop communication chain • They function as encoding & decoding of linguistic messages, respectively • In humans, the speech "encoder" (production system) consists of phonological (symbolic) and phonetic (numeric) levels • The current HMM approach approximates these two levels crudely: – phone-based phonological model ("beads-on-a-string") – multiple Gaussians as the phonetic model for acoustics directly – very weak hidden structure

  18. Summary & Conclusion (cont'd) • "Linguistic message recovery" (decoding) is formulated as: – auditory reception for an efficient & robust speech representation and for providing temporal landmarks for phonological features – cognitive perception using "encoder" knowledge, the "internal model", to perform probabilistic analysis by synthesis or pattern matching • A dynamic Bayes network was developed as a computational tool for constructing the encoder and decoder • Speaker-listener interaction (in addition to poor acoustic environments) causes substantial changes in articulation behavior and acoustic patterns

  19. Issues for discussion • Differences and similarities in processing/analysis techniques for audio/speech and image/video processing • Integrated processing vs. modular processing: Ŵ = argmax_W P(W | x) = argmax_W P(x | W) P(W) • Feature extraction vs. classification • Use of semantics (class) information for feature extraction (dim reduction, discriminative features, etc.) • Arbitrary signal vs. structured signal (e.g. face image, human body motion, speech, music)
