Meeting Recorder: Audio Processing
Dan Ellis <dpwe@ee.columbia.edu>
LabROSA, Columbia University and ICSI, Berkeley

Outline
1 ICSI Meeting Recorder
2 Close-mics: cancellation & turn estimation
3 Tabletop mics: turns & speaker location
4 Visualization tools
5 Future Work

Meeting Audio - Dan Ellis 2002-08-29 - 1/11

1 ICSI Meeting Recorder data (with UW, SRI, IBM, Columbia)
• Microphones in conventional meetings
  - for summarization/retrieval/behavior analysis
  - informal, overlapped speech
• Data collection (ICSI, UW, ...):
  - 100 hours collected, ongoing transcription
• NSF ‘Mapping Meetings’ project
  - also interest from NIST, DARPA

Data from the ICSI project
[Figure: ICSI Meeting Recorder Room Audio Setup, 2000-05-05. Labeled components: lapel mic and wireless headsets (TX1-TX5), wireless RX, ambient mics (PZM1-PZM4), dummy PDA, computer headsets, Jimlets, JimBox (PSU & breakout), Mackie mixer (MainL/R, Aux1/2), A/D 1 and A/D 2, ADAT lightpipe, STUDI/O PCI card, audio PC]
Note: The JimBox and the Jimlets are the custom electronics manufactured at ICSI to interface PC-style headsets to pro-audio XLR.
• 16 channels @ 16 kHz, 16 bit
• Preprocessing
  - high-pass filter!
  - 64 sample skew!
[Figure: Avg spec, 20s, mr-2000-11-02-1440-chanE (pzm); level / dB vs. freq / Hz]

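The two preprocessing items can be illustrated with a minimal sketch. The slide does not say which filter was used or in which direction the 64-sample skew runs, so everything here is an assumption: a one-pole high-pass with a hypothetical 50 Hz cutoff, and a helper that advances a late channel by a fixed number of samples. The names `high_pass` and `remove_skew` are illustrative only.

```python
import math

def high_pass(x, fs=16000.0, fc=50.0):
    """One-pole high-pass filter (assumed design; cutoff fc is hypothetical)."""
    if not x:
        return []
    rc = 1.0 / (2.0 * math.pi * fc)
    a = rc / (rc + 1.0 / fs)
    y = []
    prev_x, prev_y = x[0], 0.0
    for s in x:
        # y[n] = a * (y[n-1] + x[n] - x[n-1]) removes DC and slow drift
        prev_y = a * (prev_y + s - prev_x)
        prev_x = s
        y.append(prev_y)
    return y

def remove_skew(channel, skew=64):
    """Advance a channel that lags by `skew` samples, zero-padding the end."""
    return channel[skew:] + [0.0] * skew

# Usage: a 1 kHz tone riding on a DC offset, sampled at 16 kHz
tone = [5.0 + math.sin(2.0 * math.pi * 1000.0 * n / 16000.0) for n in range(1600)]
filtered = high_pass(tone)
```

After filtering, the DC offset is gone while the 1 kHz tone passes almost unchanged. A real pipeline would more likely use a higher-order filter from a signal-processing library.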
2 Close-mic channels
[Figure: mr-2000-06-30-1600; level/dB vs. time / secs (120-155 s) for channels Spkr A, Spkr B, Spkr C, Spkr D, Spkr E and Tabletop. Annotations: speaker active, backchannel, floor seizure (signals desire to regain floor?), speaker B cedes floor, interruptions, breath noise, crosstalk]
• Crosstalk
• Speaker activity detection

Impulse response coupling
• Cross-correlation recovers impulse response
[Figure: Example cross coupling response, chan3 to chan0; amplitude and E / dB (8 pt hann) vs. time / s (0-0.07 s)]
• Coupling to each mic gives motion
[Figure: participant movement, mr-2000-11-02-1440; Spkr C (freq / chan), Coupling C → A (delay / samples) and Coupling C → Tabletop (delay / samples) vs. time / sec (1020-1024.5 s)]

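A small sketch of why cross-correlation recovers a coupling response, under the assumption that the source is roughly white: the cross-correlation between source and coupled channel, divided by the source energy, approximates the impulse response tap by tap. The signal, the response `true_h`, and the function names are made up for illustration and are not taken from the recordings.

```python
import random

def convolve(x, h):
    """Direct-form convolution of signal x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def estimate_coupling(x, y, n_lags):
    """Estimate h[k] as xcorr(x, y)[k] / energy(x); assumes x is near-white."""
    energy = sum(v * v for v in x)
    est = []
    for k in range(n_lags):
        r = sum(x[n] * y[n + k] for n in range(len(x)) if n + k < len(y))
        est.append(r / energy)
    return est

# Synthetic example: white-noise "speech" leaking into another mic
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(4000)]
true_h = [0.0, 0.0, 0.02, -0.01, 0.005]   # hypothetical coupling response
coupled = convolve(source, true_h)
est_h = estimate_coupling(source, coupled, len(true_h))
```

Real speech is far from white, so a practical estimator would need whitening or a least-squares fit. Tracking how the peak delay of such an estimate moves over time is the kind of cue the movement plot refers to.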
Speaker Activity Detection (with Sam Keene)
• Noisy crosstalk model: m = C ⋅ s + n
• Estimate subband C_xA from A’s peak energy
  - i.e. ‘sparsity’ assumption
  - ... then linear inversion to recover speaker act.
• 20 subband crosstalk gains for each spkr x mic
[Figure: mr-2000-06-30-1600; subband energy (frq chan) vs. time/s (125-160 s) for chan0, chanB and chanA, with crosstalk gain panels 0B, 0A, B0, BA, A0, AB]

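The crosstalk model can be sketched for two speakers, two close mics, and a single subband. This is an illustration under stated assumptions, not the system described on the slide: energies are treated as additive, the noise term n is ignored, and the frame where a speaker's own mic peaks is assumed to contain only that speaker (the sparsity assumption). All numbers and function names are hypothetical.

```python
def mix(C, s):
    """Mic energies m = C . s for one frame (noise term omitted)."""
    return [sum(C[i][j] * s[j] for j in range(len(s))) for i in range(len(C))]

def estimate_gains(frames):
    """Estimate C[x][a]: gain from speaker a into mic x, with unit diagonal.

    Uses the frame where mic a is loudest and assumes only speaker a
    is active there.
    """
    n = len(frames[0])
    C = [[0.0] * n for _ in range(n)]
    for a in range(n):
        peak = max(frames, key=lambda m: m[a])
        for x in range(n):
            C[x][a] = peak[x] / peak[a]
    return C

def invert_2x2(C, m):
    """Recover speaker energies s from mic energies m, clipped at zero."""
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    s0 = (C[1][1] * m[0] - C[0][1] * m[1]) / det
    s1 = (C[0][0] * m[1] - C[1][0] * m[0]) / det
    return [max(s0, 0.0), max(s1, 0.0)]

# Synthetic frames: A alone, B alone, then overlap
C_true = [[1.0, 0.1], [0.2, 1.0]]
activity = [[10.0, 0.0], [6.0, 0.0], [0.0, 8.0], [0.0, 5.0], [4.0, 3.0]]
frames = [mix(C_true, s) for s in activity]
C_est = estimate_gains(frames)
recovered = [invert_2x2(C_est, m) for m in frames]
```

With 20 subbands, the same estimate would be repeated per subband, and with more speakers the 2x2 inverse would be replaced by a general linear solve.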
3 Tabletop mics: Turn detection
• 4 mics ~ 1m separated along center of table
  - 3 timing differences
  - slight L/R offset to disambiguate
[Figure: plan view of the layout, mics numbered 1-4]
• Hi-res cross-correlation for timings
  - use normalized peak value for confidence
  - cluster results
[Figure: mr-2000-11-02-1440: PZM xcorr lags; lag 3-4 / ms vs. lag 1-2 / ms, and 100xR skew/samps vs. time / s (0-300 s); panel title "Example cross coupling response, chan3 to chan0"]

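One way to picture the timing step is a normalized cross-correlation whose peak location gives the lag and whose peak height serves as a confidence value. The sketch below is an assumption-laden illustration: the slide does not say how the high-resolution estimate was obtained, so parabolic interpolation around the peak stands in for it here, and `xcorr_lag` is a hypothetical name.

```python
import math
import random

def xcorr_lag(a, b, max_lag):
    """Return (lag, confidence) where b[n] ~ a[n - lag].

    Confidence is the normalized cross-correlation peak (1.0 = identical
    up to delay). Sub-sample lag comes from parabolic interpolation.
    """
    norm = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))

    def r(d):
        lo, hi = max(0, -d), min(len(a), len(b) - d)
        return sum(a[n] * b[n + d] for n in range(lo, hi)) / norm

    lags = list(range(-max_lag, max_lag + 1))
    vals = [r(d) for d in lags]
    i = max(range(len(vals)), key=vals.__getitem__)
    frac = 0.0
    if 0 < i < len(vals) - 1:
        y0, y1, y2 = vals[i - 1], vals[i], vals[i + 1]
        denom = y0 - 2.0 * y1 + y2
        if denom != 0.0:
            frac = 0.5 * (y0 - y2) / denom
    return lags[i] + frac, vals[i]

# Synthetic check: second mic hears the same noise 5 samples later
rng = random.Random(1)
mic1 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mic2 = [0.0] * 5 + mic1[:-5]
lag, conf = xcorr_lag(mic1, mic2, 20)
lag_ms = 1000.0 * lag / 16000.0          # at the 16 kHz sampling rate

unrelated = [rng.gauss(0.0, 1.0) for _ in range(2000)]
_, noise_conf = xcorr_lag(mic1, unrelated, 20)
```

Frames with a low peak value can be dropped before clustering the lag pairs, since an unrelated pair of signals yields a peak far below 1.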
Speaker localization (with Huan Wei Hee)
• Timing differences → speaker positions (x,y,z)
  - gradient descent on implied Δt's
[Figure: Inferred talker positions (x = mic); axes x / m, y / m, z / m]
• Ambiguity:
  - mic positions not fixed
  - speaker motions
• Iterative estimation of speaker, mic locations

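The gradient-descent step can be sketched as minimizing the squared mismatch between observed and predicted path-length differences (timing differences multiplied by the speed of sound). Everything concrete below is an assumption for illustration: the mic coordinates, the talker position, the starting guess, and the backtracking step rule are all hypothetical, and mic positions are treated as known even though the slide notes they are not fixed.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def predicted_diffs(p, mics):
    """Path-length differences (metres) of mics 1.. relative to mic 0."""
    d0 = dist(p, mics[0])
    return [dist(p, m) - d0 for m in mics[1:]]

def cost(p, mics, observed):
    return sum((pd - od) ** 2
               for pd, od in zip(predicted_diffs(p, mics), observed))

def gradient(p, mics, observed):
    d0 = dist(p, mics[0])
    u0 = [(a - b) / d0 for a, b in zip(p, mics[0])]
    g = [0.0, 0.0, 0.0]
    for m, od in zip(mics[1:], observed):
        di = dist(p, m)
        err = (di - d0) - od
        for k in range(3):
            g[k] += 2.0 * err * ((p[k] - m[k]) / di - u0[k])
    return g

def localize(mics, observed, start, iters=500):
    """Gradient descent with step halving so the cost never increases."""
    p = list(start)
    for _ in range(iters):
        g = gradient(p, mics, observed)
        c, step = cost(p, mics, observed), 0.5
        while step > 1e-9:
            trial = [pk - step * gk for pk, gk in zip(p, g)]
            if cost(trial, mics, observed) < c:
                p = trial
                break
            step *= 0.5
    return p

# Hypothetical geometry: 4 mics ~1 m apart with a slight left/right offset
mics = [(-1.5, 0.1, 0.0), (-0.5, -0.1, 0.0), (0.5, 0.1, 0.0), (1.5, -0.1, 0.0)]
talker = (0.8, 1.0, 0.4)
observed = predicted_diffs(talker, mics)   # stands in for c * measured delays
start = (0.0, 0.5, 0.5)
estimate = localize(mics, observed, start)
```

With nearly collinear mics in one plane, the fit constrains position along the table much better than height, and the mirror solution below the table fits equally well. This is one reason the starting guess matters.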
4 Visualization: transPlotter
• Speaker turn patterns are informative
• Browser for ‘high-level’ view, quick examination
  - snack, iwidgets based
  - public release

Meeting IR tool
• IR on (ASR) transcripts from meetings
  - repurposed from Thisl project

5 Future work
• Speaker turns
  - evaluation of close-mic system
  - speaker characteristics for tabletop mics
• Nonspeech events
  - unsupervised clustering of audio
  - finding the feature space...
• Speech fragment recognition
  - missing-data recognition based on ‘good’ signal
  - recognition of overlapping voices
• High-level browsing
  - the ‘meeting map’ concept
  - summarization