

  1. Microphone Array Processing: M4 Progress Report
     Iain McCowan, January 28, 2003

  2. Objective and Aims
     Objective
     • to demonstrate the viability and advantages of microphone arrays for speech acquisition in meetings
     Aims
     1. measurement and analysis of speaker turns
     2. benchmark the microphone array against close-talking microphones for speech recognition
     3. precise tracking of people

  3. Progress in the past 6 months
     1. measurement and analysis of speaker turns
        • location-based speaker segmentation
     2. speech recognition evaluation
        • comparison between lapel, array and single distant microphones on a small-vocabulary task
     3. audio-visual speaker tracking (Daniel)

  4. measuring and analysing speaker turns
     speaker turn segmentation is important for
     • selecting audio for playback
     • speaker recognition
     • speaker adaptation for recognition
     • segmenting speech transcriptions
     also...
     • analysis of speaker turns could be useful to detect higher-level dialogue actions (monologues, general discussion, ...)
     but traditional techniques struggle in meetings
     • multiple speakers, and a significant proportion of overlapping speech (~15% of words)

  5. location-based speaker segmentation
     Assumptions
     • distinct source locations can be associated with distinct speakers
     • speech sounds dominate others in meetings
     Proposed Technique
     • Measurement: the source location of the principal sound is represented by microphone-pair time delays, used as features (a vector with one value per microphone pair)
     • Model: a GMM characterising the centroid of each known speaker location (set manually from the vector of theoretical delay values)
     • System: the GMMs are incorporated into a minimum-duration HMM over all (4) speakers, and segmentation is performed by Viterbi decoding
     • to appear in Lathoud, ICASSP 03
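The measurement step above can be sketched in code. The slide does not name a delay estimator; GCC-PHAT is a common choice for estimating microphone-pair time delays, so the following is illustrative only (all function names are mine):

```python
import numpy as np

def gcc_phat_delay(x, y, fs, max_delay=None):
    """Delay (in seconds) by which y lags x, estimated via GCC-PHAT."""
    n = len(x) + len(y)
    cross = np.fft.rfft(y, n=n) * np.conj(np.fft.rfft(x, n=n))
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_delay is None else int(max_delay * fs)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_feature_vector(frames, fs):
    """One delay value per microphone pair, as on the slide."""
    n_mics = len(frames)
    return np.array([gcc_phat_delay(frames[i], frames[j], fs)
                     for i in range(n_mics)
                     for j in range(i + 1, n_mics)])
```

For an 8-element array this yields a 28-dimensional delay vector per frame, one dimension per pair.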

  6. location-based speaker segmentation
     experiments
     • data: 20 minutes, including 5 minutes from each of 4 distinct speaker locations, spliced together to give segments of between 5 and 20 seconds
     • evaluation:
       • frame accuracy (FA, % of correctly labelled frames)
       • precision, recall, F-measure
     • results

       system    FA     precision  recall  F
       location  99.1%  0.98       0.98    0.98
       LPCC      88.3%  0.81       0.73    0.77
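The metrics in the table can be computed as below; this is a generic sketch (treating one designated label as silence for precision/recall is my assumption, not something stated on the slide):

```python
def segmentation_scores(ref, hyp, silence=0):
    """Frame accuracy over all frames, plus precision / recall / F-measure
    over non-silence (speaker-active) frames."""
    assert len(ref) == len(hyp)
    fa = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
    tp = sum(r == h != silence for r, h in zip(ref, hyp))   # correct speaker frames
    hyp_pos = sum(h != silence for h in hyp)
    ref_pos = sum(r != silence for r in ref)
    precision = tp / hyp_pos if hyp_pos else 0.0
    recall = tp / ref_pos if ref_pos else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return fa, precision, recall, f
```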

  7. location-based speaker segmentation
     extension to dual-speaker overlap segments
     • manually constructed HMM of alternating short speaker turns: 6 additional overlap classes in the HMM (+ 4 individual speaker classes)
     • data: as before, but each speaker change had 5-15 seconds of overlap
     • results

       test set    FA             precision  recall  F
       no overlap  99.1%          0.98       0.98    0.98
       overlap     94.1% (85.5%)  0.94       0.86    0.90
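One common way to realise a minimum-duration HMM with Viterbi decoding is to replicate each class into a chain of sub-states that must be traversed before a class switch is allowed. The sketch below illustrates that mechanism; the sub-state structure and fixed switch penalty are my assumptions, not the system of Lathoud, ICASSP 03:

```python
import numpy as np

def viterbi_min_duration(log_lik, min_dur, switch_penalty=-2.0):
    """Viterbi decoding with a minimum-duration constraint, implemented by
    replicating each class into min_dur sub-states traversed in order."""
    T, K = log_lik.shape
    S = K * min_dur                         # expanded state space
    delta = np.full(S, -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[::min_dur] = log_lik[0]           # start in the entry sub-state of any class
    for t in range(1, T):
        new = np.full(S, -np.inf)
        for k in range(K):
            base = k * min_dur
            last = base + min_dur - 1
            # entry sub-state: reachable only from the last sub-state of another class
            cands = [(j * min_dur + min_dur - 1,
                      delta[j * min_dur + min_dur - 1] + switch_penalty)
                     for j in range(K) if j != k]
            prev, score = max(cands, key=lambda c: c[1])
            new[base], back[t, base] = score, prev
            # intermediate sub-states: forced advance from the previous sub-state
            for d in range(1, min_dur):
                new[base + d], back[t, base + d] = delta[base + d - 1], base + d - 1
            # the last sub-state may also self-loop (turns longer than min_dur)
            if delta[last] > new[last]:
                new[last], back[t, last] = delta[last], last
            new[base:base + min_dur] += log_lik[t, k]
        delta = new
    s = int(np.argmax(delta))
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    return [int(p) // min_dur for p in reversed(path)]
```

The minimum duration suppresses the rapid label flickering that a plain frame-by-frame classifier would produce.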

  8. measuring and analysing speaker turns
     extensions...
     • the current system measures the activity of each speaker location
     • simpler detection of overlap
     ongoing work
     • remove limiting assumptions
       • remove a priori knowledge of locations: automatic clustering of locations
       • allow a many-to-many relationship between speakers and locations
       • couple with speaker clustering/identification based on standard LPCC features
       • not all sounds are speech: classify detected segments as speech/non-speech
     • analysis of speaker turns
       • recognition of higher-level structure, such as overlap, dialogues, monologues, discussions, etc...
       • to be discussed further in the meeting segmentation work tomorrow...
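The "automatic clustering of locations" idea could be prototyped by clustering the per-frame delay vectors, e.g. with a plain k-means; the slide does not specify an algorithm, so this sketch is purely illustrative:

```python
import numpy as np

def cluster_delay_vectors(X, k, iters=50):
    """Naive k-means over per-frame delay vectors (rows of X); each
    centroid found is a candidate speaker location."""
    centroids = X[:k].astype(float)        # naive init: first k frames
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # squared distance of every frame to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```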

  9. speech recognition evaluation
     data collection
     • re-recorded the Numbers 95 corpus in the meeting room, across a circular 8-element microphone array and 3 lapel microphones
     • loudspeakers were used (lapel microphones attached to material just below the speaker)
     • scenarios
       1. single speaker (~ 20 dB)
       2. one overlapping speaker, at 2 different locations (~ 0 dB)
       3. two overlapping speakers (~ -3 dB)
     • will be made publicly available in conjunction with OGI
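For intuition, the quoted levels (~20 dB, ~0 dB, ~-3 dB) describe the target-to-interferer energy ratio. Digitally mixing at a given SNR would look like the sketch below; note the corpus itself was re-recorded acoustically through loudspeakers, not mixed in software:

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale the interfering signal so that the mixture has the requested
    target-to-interferer energy ratio in dB."""
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2)
    gain = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
    return target + gain * interferer
```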

  10. speech recognition evaluation
     • GMM/HMM recognition system (HTK)
     • in each case, adaptation from clean models using the development set
     • first results (baseline on the clean test set: 6.3% WER)
       • single speaker in normal conditions
         • lapel microphone and microphone array both give 7% word error rate
         • single table-top microphone gives 10% word error rate
       • with a competing speaker (overlapping speech) at the same level
         • lapel microphone gives 27% word error rate
         • microphone array gives 19% word error rate
         • single table-top microphone gives 60% word error rate
       • two competing speakers
         • lapel 35%, array 26%, single table-top 74%
     • indicates that the array can be as good as, or better than, lapel microphones for speech recognition
     • but, comparing with unenhanced lapel at this point...
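The word error rates quoted above are the standard Levenshtein-based measure: substitutions, deletions and insertions relative to the reference word sequence. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via dynamic-programming edit distance over words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                          # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # match / substitution
    return d[len(r)][len(h)] / len(r)
```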

  11. speech recognition evaluation

  12. speech recognition
     for more details see Moore, ICASSP 03
     future work
     • benchmark against lapel microphones on a large-vocabulary task (M4 data, ICSI data???)
     • additional (post-beamforming) enhancement in the case of detected overlapping speech (dual-channel techniques)
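The array signal fed to the recogniser comes from beamforming. A minimal delay-and-sum sketch, assuming the per-channel propagation delays toward the desired speaker are known; the actual M4 front end may use a different (e.g. filter-and-sum) beamformer:

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Delay-and-sum beamformer: advance each channel by its propagation
    delay to the desired source (fractional delays applied as a phase
    shift in the frequency domain), then average the aligned channels."""
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for sig, tau in zip(signals, delays):
        spec = np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * tau)
        out += np.fft.irfft(spec, n=n)
    return out / len(signals)
```

Aligning and averaging reinforces the desired speaker while uncorrelated noise and competing speakers from other directions are attenuated.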

  13. audio-visual speaker tracking
     use audio source localisation to help a visual tracker to
     • initialise
     • recover from tracking errors / visual occlusion
     see Daniel's presentation...

  14. summary
     microphone arrays are proving to be useful devices for recording and analysing meetings
     • facilitate accurate speaker turn segmentation (esp. multi-speaker overlap)
     • comparable speech recognition performance to (unenhanced) lapel microphones in the ideal case, and better in the case of noise (e.g. overlapping speech)
     • accurate tracking of speakers
     also...
     • developing a prototype stand-alone real-time system (8 inputs, 8 outputs, Analog Devices TigerSHARC, FireWire o/p)

  15. relevance to M4 partners
     collaboration
     • sharing of location-based speech activity features to facilitate multi-modal research
     • provide a 'mixed' single audio channel as an alternative to simple addition of the lapel channels
     comparison
     • array vs close-talking microphone speaker turn segmentation
     • provide the array output signal for comparison with lapels on the (eventual) large-vocabulary recognition system
     • compare with close-talking microphone enhancement (??) during overlap segments on the Numbers task
