The ICSI Meeting Corpus



  1. The ICSI Meeting Corpus
     Barbara Peskin [on behalf of ICSI’s MeetingRecorder Team]
     International Computer Science Institute, Berkeley, CA
     M4 Meeting, Sheffield, 29-30 January 2003

  2. Basic Facts
     • 75 “natural” meetings collected at ICSI, 2000-2002
       – regular weekly meetings of ICSI working teams (mostly)
       – 3-10 participants per meeting, averaging ~6
       – roughly 1 hour each (17-103 minutes; 72 hours total)
       – 4 main meeting types, 53 unique talkers
     • Simultaneous multi-channel recordings, using both close-talking and far-field microphones
     • Audio only (no video), plus complete transcriptions
     • “Digits task”: a small-vocab read-speech subtask
     • Supports a wealth of research possibilities
     • Available through the LDC this summer (and to research partners soon, direct from us)

  3. Recording Set-up
     All meetings were recorded at ICSI in the same conference room, using the same set-up
     • Close-talking microphones for each speaker
       – mostly head-mounted
       – some lapel mics in early meetings
     • 6 tabletop microphones
       – 4 high-quality omnidirectional PZM’s arrayed down the center of the table
       – 2 inexpensive microphone elements mounted on a “PDA mock-up”
     • All channels recorded separately and simultaneously
     • Collected at 48 kHz, downsampled on the fly to 16 kHz
     • Audio files are 16-bit linear, in compressed NIST SPHERE format
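The sampling figures on this slide allow a rough storage estimate. The sketch below is only an illustration: it assumes uncompressed 16-bit mono audio per channel, and the 12-channel figure (6 tabletop microphones plus about 6 close-talking microphones, taking the average participant count from the Basic Facts slide) is an assumption, not a number stated in the talk. The released files are compressed, so real sizes will be smaller.

```python
# Rough storage arithmetic for one meeting, before compression.
# Function names and the channel count are illustrative assumptions.

CAPTURE_RATE_HZ = 48_000   # rate at collection time (from the slide)
TARGET_RATE_HZ = 16_000    # rate after on-the-fly downsampling (from the slide)
BYTES_PER_SAMPLE = 2       # 16-bit linear samples


def decimation_factor(capture_hz: int, target_hz: int) -> int:
    """Integer ratio between the capture rate and the stored rate."""
    if capture_hz % target_hz != 0:
        raise ValueError("capture rate is not an integer multiple of target rate")
    return capture_hz // target_hz


def uncompressed_bytes(duration_s: int, channels: int,
                       rate_hz: int = TARGET_RATE_HZ,
                       bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Bytes needed to store the given number of mono channels uncompressed."""
    return duration_s * rate_hz * bytes_per_sample * channels


factor = decimation_factor(CAPTURE_RATE_HZ, TARGET_RATE_HZ)
one_channel_hour = uncompressed_bytes(3600, channels=1)
assumed_meeting = uncompressed_bytes(3600, channels=12)  # assumed channel count

print(f"decimation factor: {factor}")
print(f"one channel, one hour: {one_channel_hour / 1e6:.1f} MB")
print(f"12 channels, one hour: {assumed_meeting / 1e9:.2f} GB")
```

Under these assumptions a single channel takes about 115 MB per hour, which is why per-channel compression matters for a corpus of roughly 72 hours.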

  4. Meeting Types, Meeting Participants
     A few main meeting types with a slowly changing mix of speakers and content (number of meetings in brackets)
     • Meeting Recorder Project [29]
     • Robustness (signal processing for robust ASR) [23]
     • Even Deeper Understanding (natural language understanding) [15]
     • Network Services & Applications [3]
     • Other sporadic types, incl. 2 transcription team meetings [5]
     53 unique talkers in the corpus
     • Speakers may appear in more than one meeting type
     • Significant proportion of non-native English speakers
     • Demographic info on sex, age, education level, dialect, etc. (all optional) collected on enrollment
     • For non-native speakers, info available on native tongue and time in English-speaking country
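As a quick consistency check, the bracketed counts on this slide can be tallied against the 75 meetings quoted under Basic Facts. The short labels used as dictionary keys are abbreviations chosen here for illustration.

```python
# Tally of meetings per type, using the bracketed counts from the slide.
meetings_by_type = {
    "Meeting Recorder Project": 29,
    "Robustness": 23,
    "Even Deeper Understanding": 15,
    "Network Services & Applications": 3,
    "Other": 5,
}

total = sum(meetings_by_type.values())

# Share of the corpus contributed by each meeting type.
shares = {name: count / total for name, count in meetings_by_type.items()}

for name, count in meetings_by_type.items():
    print(f"{name:35s} {count:3d}  {shares[name]:.1%}")
print(f"{'Total':35s} {total:3d}")
```

The counts sum to 75, and the three largest series account for 67 of those meetings.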

  5. The Digits Task
     At most meetings, participants read digit strings (similar to TIDIGITS) at start or end of meeting
     • Same speakers, same mics, same room as for spontaneous speech collection
     • Allows factorization of speech challenges offered by corpus:
       – Tackle spontaneous multi-party ASR using high-quality channel
       – Explore far-field acoustics on simpler speech task
     • Digits usually read by each speaker in turn, but there are some interesting exceptions:
       – Occasionally, digits read by all speakers simultaneously
       – Once, all speakers used same digits script and read in unison (more or less)

  6. Transcriptions
     All meetings are fully transcribed at the word level
     • Uses simple conventions, favoring standard orthography
     • Includes word fragments, mangled prons, disfluencies
     • Includes vocal (breath, laugh, …) and nonvocal (door slam, coffee mug clinks, mic noise, …) nonspeech sounds, and contextual comments
     • Produced from close-talking channels, permitting careful transcription of overlapping speech, soft-spoken backchannels, etc.
     Transcripts were post-processed into a simple XML format
     • Headers include meeting date, time, participant, mic, etc. info (plus a free-form notes field)
     • XML format designed specifically for this corpus
     • Software provided for translating from our format to other common ones
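To illustrate how time-stamped XML transcripts support work on overlapping speech, here is a minimal sketch. The element and attribute names (`Meeting`, `Transcript`, `Segment`, `StartTime`, `EndTime`, `Participant`), the speaker IDs, and the sample text are all hypothetical stand-ins; the actual structure is defined by the corpus's own “Meetings DTD”, which the slides do not spell out.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Toy transcript. All tag names, attributes and speaker IDs are hypothetical.
SAMPLE = """<Meeting>
  <Transcript>
    <Segment StartTime="0.0" EndTime="2.5" Participant="spkA">so shall we start</Segment>
    <Segment StartTime="2.0" EndTime="2.4" Participant="spkB">mm-hmm</Segment>
    <Segment StartTime="3.0" EndTime="5.0" Participant="spkA">okay</Segment>
  </Transcript>
</Meeting>"""


def load_segments(xml_text):
    """Return (start, end, speaker, text) tuples for every Segment element."""
    root = ET.fromstring(xml_text)
    return [
        (float(seg.get("StartTime")), float(seg.get("EndTime")),
         seg.get("Participant"), (seg.text or "").strip())
        for seg in root.iter("Segment")
    ]


def overlapping_pairs(segments):
    """Pairs of segments from different speakers whose time spans intersect."""
    pairs = []
    for a, b in combinations(segments, 2):
        different_speakers = a[2] != b[2]
        spans_intersect = a[0] < b[1] and b[0] < a[1]
        if different_speakers and spans_intersect:
            pairs.append((a, b))
    return pairs


segments = load_segments(SAMPLE)
overlaps = overlapping_pairs(segments)

for first, second in overlaps:
    print(f"{first[2]} [{first[0]}-{first[1]}] overlaps {second[2]} [{second[0]}-{second[1]}]")
```

The pairwise comparison is quadratic in the number of segments, which is fine for a sketch; a sweep over segments sorted by start time would scale better on a full meeting.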

  7. Additional Information
     • Released corpus will also include
       – Speaker table of demographic info
       – Transcription guidelines
       – XML documentation, including our “Meetings DTD”
     • Additional annotation of Meeting subsets underway
       – Prosodic feature database
       – Dialogue act annotations
     • For further information, consult
       – ICSI’s many publications on Meetings work, incl.
         • corpus overview: A. Janin et al., Proc. ICASSP’03
         • research updates: N. Morgan et al., Proc. HLT-2001 and Proc. ICASSP’03
         • [see our website for other listings]
       – Our website: http://www.icsi.berkeley.edu/Speech/mr/
