Microphone Array Processing: M4 Progress Report
Iain McCowan, January 28, 2003
Objective and Aims

Objective
• to demonstrate the viability and advantages of microphone arrays for speech acquisition in meetings

Aims
1. measurement and analysis of speaker turns
2. benchmark microphone array against close-talking microphones for speech recognition
3. precise tracking of people
Progress in past 6 months

1. measurement and analysis of speaker turns
   • location-based speaker segmentation
2. speech recognition evaluation
   • comparison between lapel, array and single distant microphone on a small-vocabulary task
3. audio-visual speaker tracking (Daniel)
measuring and analysing speaker turns

speaker turn segmentation is important for
• selecting audio for playback
• speaker recognition
• speaker adaptation for recognition
• segmenting speech transcriptions

also...
• analysis of speaker turns could be useful to detect higher-level dialogue actions (monologues, general discussion, ...)

but traditional techniques struggle in meetings
• multiple speakers, significant proportion of overlapping speech (~15% of words)
location based speaker segmentation

Assumptions
• distinct source locations can be associated with distinct speakers
• speech sounds dominate others in meetings

Proposed Technique
• Measurement: source location of principal sounds represented by microphone-pair time delays as features (vector with 1 value per microphone pair)
• Model: GMM distribution characterising the centroid of a known speaker location (set manually from the vector of theoretical delay values)
• System: incorporate the GMMs into a minimum-duration HMM over all (4) speakers; segment using Viterbi decoding
• to appear in Lathoud, ICASSP 03
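The proposed technique can be sketched as a minimum-duration Viterbi decode over delay features. This is a simplified illustration only: it uses a single isotropic Gaussian per speaker location in place of the GMMs, and `sigma`, `min_dur` and `switch_cost` are made-up parameter names and values, not those of the actual system.

```python
import numpy as np

def segment_by_location(delays, centroids, sigma=1.0, min_dur=3, switch_cost=1.0):
    """Location-based segmentation sketch: per-frame microphone-pair delay
    features scored against one Gaussian per known speaker location (a
    single-component stand-in for the GMMs), decoded with a minimum-duration
    HMM via Viterbi. Returns per-frame speaker indices."""
    delays = np.atleast_2d(np.asarray(delays, float))
    centroids = np.atleast_2d(np.asarray(centroids, float))
    T, S, D = len(delays), len(centroids), min_dur
    # negative log-likelihood (up to a constant) of each frame per speaker
    cost = ((delays[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) / (2 * sigma**2)

    n = S * D                      # expanded states: D sub-states per speaker
    score = np.full(n, np.inf)
    back = np.zeros((T, n), int)
    score[::D] = cost[0]           # may start in any speaker's first sub-state
    for t in range(1, T):
        new = np.full(n, np.inf)
        for s in range(S):
            for d in range(D):
                i = s * D + d
                # allowed predecessors in the left-to-right duration chain
                cands = []
                if d == 0:         # enter speaker s from another speaker's exit
                    cands += [(score[s2 * D + D - 1] + switch_cost, s2 * D + D - 1)
                              for s2 in range(S) if s2 != s]
                else:              # advance within the same speaker
                    cands.append((score[i - 1], i - 1))
                if d == D - 1:     # final sub-state self-loops (stay >= min_dur)
                    cands.append((score[i], i))
                best, arg = min(cands)
                new[i], back[t, i] = best + cost[t, s], arg
        score = new
    labels = np.empty(T, int)      # backtrack, mapping sub-states to speakers
    i = int(np.argmin(score))
    for t in range(T - 1, -1, -1):
        labels[t], i = i // D, back[t, i]
    return labels
```

The minimum-duration topology is what makes the approach robust: a single-frame outlier delay cannot trigger a speaker change, because any visit to a speaker must last at least `min_dur` frames.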
location based speaker segmentation

Experiments
• data: 20 minutes, including 5 minutes from each of 4 distinct speaker locations, spliced together to give segments between 5-20 seconds
• evaluation:
  • frame accuracy (FA, % of correctly labeled frames)
  • precision, recall, F-measure
• results:

  system    FA     precision  recall  F
  location  99.1%  0.98       0.98    0.98
  LPCC      88.3%  0.81       0.73    0.77
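For reference, the metrics in the table can be computed from per-frame labels as below. This is a minimal sketch of the standard definitions; the exact scoring protocol used for the slide's numbers may differ.

```python
def segmentation_metrics(ref, hyp, speaker):
    """Frame accuracy, plus precision/recall/F-measure for one speaker.

    ref, hyp: per-frame speaker labels (reference and system output).
    """
    assert len(ref) == len(hyp)
    fa = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
    # frames claimed for / belonging to the target speaker
    tp = sum(r == speaker and h == speaker for r, h in zip(ref, hyp))
    fp = sum(r != speaker and h == speaker for r, h in zip(ref, hyp))
    fn = sum(r == speaker and h != speaker for r, h in zip(ref, hyp))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return fa, precision, recall, f
```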
location based speaker segmentation

Extension to dual-speaker overlap segments
• manually constructed HMM of alternating short speaker turns: 6 additional overlap classes in the HMM (in addition to the 4 individual speaker classes)
• data: same, but each speaker change had 5-15 seconds of overlap
• results:

  test set    FA             precision  recall  F
  no overlap  99.1%          0.98       0.98    0.98
  overlap     94.1% (85.5%)  0.94       0.86    0.90
measuring and analysing speaker turns

Extensions...
• current system measures activity of each speaker location
  • simpler detection of overlap

Ongoing work
• remove limiting assumptions
  • remove a priori knowledge of locations
    • automatic clustering of locations
    • allow many-to-many relationship between speakers and locations
    • couple with speaker clustering/identification based on standard LPCC features
  • not all sounds are speech
    • classify detected segments as speech/non-speech
• analysis of speaker turns
  • recognition of higher-level structure, such as overlap, dialogues, monologues, discussions, etc...
  • to be discussed more in meeting segmentation work tomorrow...
speech recognition evaluation

Data collection
• re-recorded the Numbers 95 corpus in the meeting room, across a circular 8-element microphone array and 3 lapel microphones
• loudspeakers used (lapel microphones attached to material just below the speaker)
• scenarios:
  1. single speaker (~20 dB)
  2. one overlapping speaker (2 different locations) (~0 dB)
  3. two overlapping speakers (~ -3 dB)
• will be made publicly available in conjunction with OGI
speech recognition evaluation

• GMM/HMM recognition system (HTK)
• in each case, adaptation from clean models using the development set
• first results (baseline on clean test set: 6.3% WER)
  • single speaker in normal conditions:
    • lapel microphone and microphone array give 7% word error rate
    • single table-top microphone gives 10% word error rate
  • one competing speaker (overlapping speech) at the same level:
    • lapel microphone gives 27% word error rate
    • microphone array gives 19% word error rate
    • single table-top microphone gives 60% word error rate
  • two competing speakers:
    • lapel 35%, array 26%, single table-top 74%
• indicates that the array can be as good as, or better than, lapel microphones for speech recognition
• but note: comparing against the unenhanced lapel at this point...
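For completeness, the word error rate quoted throughout is the standard Levenshtein-alignment metric. The sketch below illustrates the definition; it is not the HTK scoring tool itself.

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by dynamic-programming edit-distance alignment of word lists."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i               # deleting i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j               # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```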
speech recognition

For more details see Moore, ICASSP 03

Future work
• benchmark against lapel microphones on a large-vocabulary task (M4 data, ICSI data???)
• additional (post-beamforming) enhancement in case of detected overlapping speech (dual-channel techniques)
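The beamforming front end referred to above can be illustrated by the simplest variant, delay-and-sum. This sketch is not the system's actual beamformer: it uses integer-sample steering delays and zero-fill shifts, whereas a real front end would use fractional-delay filtering and a calibrated array geometry.

```python
import numpy as np

def _shift(x, d):
    # advance x by d samples, zero-filling (no wrap-around)
    out = np.zeros_like(x)
    if d >= 0:
        out[: len(x) - d] = x[d:]
    else:
        out[-d:] = x[: len(x) + d]
    return out

def delay_and_sum(channels, steering_delays):
    """Delay-and-sum beamformer sketch.

    channels:        (M, N) array, one row of N samples per microphone
    steering_delays: integer delays (in samples) that time-align the
                     desired source across the M channels
    Each channel is advanced by its steering delay and the channels are
    averaged, so the target source adds coherently while off-axis
    interference and noise add incoherently.
    """
    channels = np.asarray(channels, float)
    M = len(channels)
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, steering_delays):
        out += _shift(ch, d)
    return out / M
```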
audio-visual speaker tracking

Use audio source localisation to help a visual tracker to
• initialise
• recover from tracking errors / visual occlusion

See Daniel's presentation...
summary

Microphone arrays are proving to be useful devices for recording and analysing meetings
• facilitate accurate speaker turn segmentation (especially multi-speaker overlap)
• comparable speech recognition performance to (unenhanced) lapel microphones in the ideal case, better in the case of noise (e.g. overlapping speech)
• accurate tracking of speakers

Also...
• developing a prototype stand-alone real-time system (8 inputs, 8 outputs, Analog Devices TigerSHARC, Firewire o/p)
relevance to M4 partners

Collaboration
• sharing of location-based speech activity features to facilitate multi-modal research
• provide a 'mixed' single audio channel as an alternative to simple addition of the lapel channels

Comparison
• array vs close-talking microphone speaker turn segmentation
• provide the array output signal for comparison with lapels on the (eventual) large-vocabulary recognition system
• compare with close-talking microphone enhancement (??) during overlap segments on the Numbers task