  1. Auditory System For a Mobile Robot PhD Thesis Jean-Marc Valin Department of Electrical Engineering and Computer Engineering Université de Sherbrooke, Québec, Canada Jean-Marc.Valin@USherbrooke.ca

  2. Motivations ● Robots need information about their environment in order to be intelligent ● Artificial vision has been popular for a long time, but artificial audition is new ● Robust audition is essential for human-robot interaction (the cocktail party effect)

  3. Approaches To Artificial Audition ● Single microphone – Human-robot interaction – Unreliable ● Two microphones (binaural audition) – Imitate human auditory system – Limited localisation and separation ● Microphone array audition – More information available – Simpler processing

  4. Objectives ● Localise and track simultaneous moving sound sources ● Separate sound sources ● Perform automatic speech recognition ● Remain within robotics constraints – complexity, algorithmic delay – robustness to noise and reverberation – weight/space/adaptability – moving sources, moving robot

  5. Experimental Setup ● Eight microphones on the Spartacus robot ● Two configurations: cube (C1) and shell (C2) ● Noisy conditions ● Two environments ● Reverberation time – Lab (E1): 350 ms – Hall (E2): 1 s

  6. Sound Source Localisation

  7. Approaches to Sound Source Localisation ● Binaural – Interaural phase difference (delay) – Interaural intensity difference ● Microphone array – Estimation through TDOAs – Subspace methods (MUSIC) – Direct search (steered beamformer) ● Post-processing – Kalman filtering – Particle filtering

  8. Steered Beamformer ● Delay-and-sum beamformer ● Maximise output energy ● Frequency domain computation
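
A rough numpy sketch of the idea: compute the delay-and-sum output energy for one candidate direction in the frequency domain. The function name, the far-field delay model, and all parameters are illustrative assumptions, not the thesis implementation; the localiser evaluates this over a grid of directions and keeps the maxima.

```python
import numpy as np

def steered_energy(X, mic_pos, direction, freqs, c=343.0):
    """Delay-and-sum output energy for one steering direction.

    X         : (n_mics, n_bins) complex STFT of the current frame
    mic_pos   : (n_mics, 3) microphone positions in metres
    direction : (3,) unit vector toward the candidate source (far field)
    freqs     : (n_bins,) bin centre frequencies in Hz
    """
    # Far-field TDOAs: project each microphone onto the look direction
    delays = mic_pos @ direction / c                        # (n_mics,)
    # Undo the propagation delays by phase rotation, then sum and take energy
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.sum(np.abs(np.sum(X * steering, axis=0)) ** 2)
```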

  9. Spectral Weighting ● Normal cross-correlation peaks are very wide ● PHAse Transform (PHAT) has narrow peaks ● Apply weighting – Weight according to noise and reverberation – Models the precedence effect ● Sensitivity is decreased after a loud sound
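
A minimal sketch of the PHAT idea for one microphone pair, assuming plain numpy and a fixed FFT length. The thesis replaces the pure 1/|X| weight with a weighting that also accounts for noise and reverberation (the precedence-effect model); that refinement is not reproduced here.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024):
    """Generalized cross-correlation with PHAT weighting for one mic pair."""
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    # PHAT: discard magnitude, keep only phase; this narrows the peak
    cross /= np.abs(cross) + 1e-12
    r = np.fft.irfft(cross, n_fft)
    # After fftshift, (argmax - n_fft // 2) is the TDOA in samples
    return np.fft.fftshift(r)
```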

  10. Direction Search ● Finding directions with highest energy ● Fixed number of sources Q=4 ● Lookup-and-sum algorithm ● 25 times less complex
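
The lookup-and-sum step can be sketched as follows: for every direction on a precomputed spherical grid, store the TDOA lag of each microphone pair, then score a direction by summing the pairwise cross-correlations at those lags. The names, the data layout, and the greedy peak handling are illustrative assumptions.

```python
import numpy as np

def search_directions(R, pairs, lag_table, n_sources=4):
    """Greedy search for the n_sources highest-energy grid directions.

    R         : dict {(i, j): 1-D cross-correlation array} per mic pair
    pairs     : list of mic pairs; column p of lag_table matches pairs[p]
    lag_table : (n_dirs, n_pairs) integer TDOA lag per direction and pair
    """
    # Lookup-and-sum: direction energy = sum of pair correlations at its lags
    energy = np.zeros(lag_table.shape[0])
    for p, pair in enumerate(pairs):
        energy += R[pair][lag_table[:, p]]
    found = []
    for _ in range(n_sources):
        best = int(np.argmax(energy))
        found.append(best)
        energy[best] = -np.inf  # crude; the real search suppresses a neighbourhood
    return found
```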

  11. Post-Processing: Particle Filtering ● Need to track sources over time ● Steered beamformer output is noisy ● Representing pdf as particles ● One set of (1000) particles per source ● State=[position, speed]

  12. Particle Filtering Steps 1) Prediction 2) Instantaneous probabilities estimation – As a function of steered beamformer energy

  13. Particle Filtering Steps (cont.) 3) Source-observation assignment – Need to know which observation is related to which tracked source – For each observation q, compute: ● the probability that q is a false alarm ● the probability that q corresponds to tracked source j ● the probability that q is a new source

  14. Particle Filtering Steps (cont.) 4) Particle weights update – Merging past and present information – Taking into account source-observation assignment 5) Addition or removal of sources 6) Estimation of source positions – Weighted mean of the particle positions 7) Resampling
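
A compressed numpy sketch of steps 1, 4, 6 and 7 for a single tracked source. The excitation-damping constants and the Gaussian observation model below are placeholders, not the thesis's exact dynamics or probability model, and the source-observation assignment of step 3 is assumed already done.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                             # particles per source, as on slide 11
pos = rng.standard_normal((N, 3))    # particle positions
vel = np.zeros((N, 3))               # particle velocities
w = np.full(N, 1.0 / N)              # particle weights

def predict(pos, vel, dt=0.04, a=0.9, b=0.1):
    """1) Propagate with damped velocity plus random excitation."""
    vel = a * vel + b * rng.standard_normal(vel.shape)
    return pos + dt * vel, vel

def update(w, pos, obs, p_src=0.9, sigma=0.1):
    """4) Reweight by how well each particle explains the assigned observation."""
    d2 = np.sum((pos - obs) ** 2, axis=1)
    like = p_src * np.exp(-d2 / (2 * sigma ** 2)) + (1 - p_src)  # false-alarm floor
    w = w * like
    return w / w.sum()

def estimate(w, pos):
    """6) Source position = weighted mean of the particle positions."""
    return w @ pos

def resample(w, pos, vel):
    """7) Draw a fresh, equally weighted particle set."""
    idx = rng.choice(len(w), size=len(w), p=w)
    return np.full(len(w), 1.0 / len(w)), pos[idx], vel[idx]
```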

  15. Localisation Results (E1) [Figures: detection accuracy as a function of distance; localisation accuracy]

  16. Tracking Results ● Two sources crossing with C2 ● Video [Tracking shown for E1 and E2]

  17. Tracking Results (cont.) ● Four moving sources with C2 [Tracks shown for E1 and E2]

  18. Sound Source Separation & Speech Recognition

  19. Overview of Sound Source Separation ● Frequency domain processing – Simple, low complexity ● Linear source separation ● Non-linear post-filter [Block diagram: microphone signals X_n(k,l) → geometric source separation, guided by the tracking information → Y_m(k,l) → post-filter → separated sources S_m(k,l)]

  20. Geometric Source Separation ● Frequency domain: y(k,l) = W(k,l) x(k,l) ● Constrained optimization – Minimize correlation of the outputs: J(W(k)) = ||R_yy(k) - diag(R_yy(k))||^2, with R_yy(k) = E[y(k,l) y(k,l)^H] – Subject to geometric constraint: W(k) A(k) = I, where A(k) holds the steering vectors toward the tracked sources ● Modifications to original GSS algorithm – Instantaneous computation of correlations – Regularisation
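
A sketch of one GSS adaptation step at a single frequency bin, following the standard decorrelation-plus-geometric-constraint formulation with the instantaneous correlation estimate mentioned above. The step size, the mixing weight lam, and the energy normalisation are placeholder choices, not the thesis's exact regularisation.

```python
import numpy as np

def gss_step(W, x, A, mu=0.01, lam=0.5):
    """One GSS update for a single frequency bin.

    W : (n_src, n_mics) current demixing matrix
    x : (n_mics,) microphone spectra at this bin and frame
    A : (n_mics, n_src) steering matrix built from the tracked positions
    """
    y = W @ x                                   # separated outputs
    Ryy = np.outer(y, np.conj(y))               # instantaneous correlation
    E = Ryy - np.diag(np.diag(Ryy))             # off-diagonal residual crosstalk
    # Gradient of the decorrelation cost ||E||^2 with Rxx = x x^H
    grad_sep = 4 * np.outer(E @ y, np.conj(x))
    # Gradient of the geometric constraint ||W A - I||^2
    grad_geo = 2 * (W @ A - np.eye(W.shape[0])) @ np.conj(A).T
    # Normalising by the input energy keeps the step size stable
    norm = np.sum(np.abs(x) ** 2) ** 2 + 1e-12
    return W - mu * (grad_sep / norm + lam * grad_geo), y
```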

  21. Multi-Source Post-Filter

  22. Interference Estimation ● Source separation leaks – Incomplete adaptation – Inaccuracy in localization – Reverberation/diffraction – Imperfect microphones ● Estimation from other separated sources

  23. Reverberation Estimation ● Exponential decay model ● Example: 500 Hz frequency bin
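
Slides 22 and 23 can be sketched together: the noise estimate for output m combines stationary noise, leakage from the other separated sources, and a first-order exponential reverberation decay. The leakage factor eta and the decay constants gamma and delta are placeholders; the real system also runs the recursion on the previous frame's post-filtered output.

```python
import numpy as np

def postfilter_noise(Y2, lam_rev_prev, lam_stat, m, eta=0.1, gamma=0.95, delta=10.0):
    """Noise estimate for separated output m at one frame.

    Y2           : (n_src, n_bins) |spectrum|^2 of the GSS outputs
    lam_rev_prev : (n_bins,) previous reverberation estimate for source m
    lam_stat     : (n_bins,) stationary-noise estimate, assumed given (e.g. MCRA)
    """
    # Interference: a fraction eta of the energy of the *other* separated sources
    leak = eta * (Y2.sum(axis=0) - Y2[m])
    # Reverberation: exponential decay of past energy at each frequency bin
    lam_rev = gamma * lam_rev_prev + ((1 - gamma) / delta) * Y2[m]
    return lam_stat + leak + lam_rev, lam_rev
```

The total estimate then drives a per-bin spectral gain on each separated output, which is the non-linear part of the post-filter.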

  24. Results (SNR) ● Three speakers ● C2 (shell), E1 (lab) [Bar chart: SNR (dB) for Sources 1-3 after each stage: input, delay-and-sum, GSS, GSS + single-source post-filter, GSS + multi-source post-filter]

  25. Speech Recognition Accuracy (Nuance) ● Proposed post-filter reduces errors by 50% ● Reverberation removal helps in E2 only ● No significant difference between C1 and C2 ● Digit recognition – 3 speakers: 83% word correct – 2 speakers: 90% word correct [Bar chart (E2, C2, 3 speakers): word correct (%) for the Right/Front/Left speakers, comparing the unprocessed microphone signal with GSS only, post-filter without dereverberation, and the proposed system]

  26. Man vs. Machine ● How does a human compare? [Bar chart: word correct (%) for Listeners 1-5 and the proposed system] ● Is it fair? – Yes and no!

  27. Real-Time Application ● Video from AAAI conference

  28. Speech Recognition With Missing Feature Theory ● Speech is transformed into features (~12) ● Not all features are reliable ● MFT = ignore unreliable features – Compute missing feature mask – Use the mask to compute probabilities
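
A minimal sketch of the marginalisation, assuming a diagonal-covariance GMM acoustic model and a precomputed boolean mask (for instance thresholding a local SNR estimate); unreliable feature dimensions simply drop out of the likelihood.

```python
import numpy as np

def mft_log_likelihood(feat, mask, means, variances, weights):
    """Frame log-likelihood under a GMM, marginalising unreliable features.

    feat             : (d,) feature vector (e.g. ~12 spectral features)
    mask             : (d,) boolean, True where the feature is reliable
    means, variances : (n_mix, d) diagonal-Gaussian parameters
    weights          : (n_mix,) mixture weights
    """
    diff = feat[None, :] - means
    # Only reliable dimensions contribute; the rest are integrated out
    ll = -0.5 * ((diff ** 2 / variances
                  + np.log(2 * np.pi * variances)) * mask).sum(axis=1)
    return np.logaddexp.reduce(np.log(weights) + ll)
```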

  29. Missing Feature Mask ● Interference: unreliable ● Stationary noise: reliable [Spectrogram mask: black = reliable, white = unreliable]

  30. Results (MFT) ● Japanese isolated word recognition (SIG2 robot, CTK) – 3 simultaneous sources – 200-word vocabulary – 30, 60, 90 degrees separation [Bar chart: word correct (%) for the Right/Front/Left sources with GSS, GSS + post-filter, and GSS + post-filter + MFT]

  31. Summary of the System

  32. Conclusion ● What have we achieved? – Localisation and tracking of sound sources – Separation of multiple sources – Robust basis for human-robot interaction ● What are the main innovations? – Frequency-domain steered beamformer – Particle filtering source-observation assignment – Separation post-filtering for multiple sources and reverberation – Integration with missing feature theory

  33. Where From Here? ● Future work – Complete dialogue system – Echo cancellation for the robot's own voice – Use human-inspired techniques – Environmental sound recognition – Embedded implementation ● Other applications – Video-conference: automatically follow speaker with a camera – Automatic transcription

  34. Questions? Comments?
