  1. The ICSI corpus; Browsing meetings
     NLSSD – natural language and speech system design
     Steve Renals, s.renals@ed.ac.uk

  2. Overview
     • The ICSI meetings corpus
     • Browsing meetings
     • Useful reading: Janin et al (2003; 2004), Kazman et al (1996), Tucker and Whittaker (2004)
     Web: http://www.inf.ed.ac.uk/teaching/courses/nlssd/

  3. The ICSI Meetings Corpus
     • Collected at the International Computer Science Institute, Berkeley, from 2000–2002
     • Natural (rather than scenario-driven) meetings, mainly of ICSI research groups
     • Audio only, recorded using close-talking head-mounted mics and 4 desktop PZM mics
     • Audio stored as 16 kHz, 16-bit linear

  4. Meeting types
     • Meetings of 3–12 people
     • 75 meetings total:
       ◦ Bmr – the meeting recorder group – 29 meetings
       ◦ Bro – the robust ASR group – 23 meetings
       ◦ Bed – the EDU group (NLP) – 15 meetings
     • Meetings were typically about 1 hour long
     • All meetings in English

  5. Meeting participants
     • 53 unique speakers:
       ◦ 40 male, 13 female
       ◦ Most under 40
       ◦ 28 native English speakers, 12 German, 5 Spanish, 8 others
     • Some ethical issues to consider in this research; the corpus is somewhat anonymised

  6. Speech transcription
     • The entire corpus has been transcribed at the word level
     • Also includes word fragments, restarts, filled pauses, back-channels, non-lexical events (cough, laugh, etc.)
     • Overlap information is available through time stamps on each utterance
     • Each utterance is marked with the participant ID
     • Speech recognition transcriptions (29% word error rate) are also available (done “fairly” by training on 3/4 of the corpus, testing on the remaining 1/4, and rotating 4 times; see the sketch below)
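     Since the ASR transcripts depend on this rotating split, here is a minimal Python sketch of the idea. The meeting IDs and the exact fold boundaries are illustrative, not the actual partition used for the corpus.

```python
# Sketch of the rotating split described above: train on 3/4 of the
# meetings, decode the remaining 1/4, and rotate four times so every
# meeting is transcribed by a recogniser that never saw it in training.
# Meeting IDs here are hypothetical placeholders.
meetings = [f"meeting{i:03d}" for i in range(75)]

def four_fold_splits(items):
    """Yield (train, test) pairs covering each quarter exactly once."""
    n = len(items)
    fold = (n + 3) // 4  # ceil(n / 4)
    for k in range(4):
        test = items[k * fold:(k + 1) * fold]
        train = items[:k * fold] + items[(k + 1) * fold:]
        yield train, test

for train, test in four_fold_splits(meetings):
    print(len(train), "training meetings,", len(test), "meetings to decode")
```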

  7. Dialogue Act (MRDA) Annotations
     • Dialogue act (DA) annotations: 11 general tags (eg “statement”) and 39 further modifiers (eg “joke”, “disagreement”)
     • Includes automatic time alignment of each word and segmentation into DA units (obtained using forced alignment)
     • Adjacency pairs also marked (eg question–answer pairs)
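     A minimal sketch of how a composite MRDA-style label decomposes into a general tag plus modifiers. The “^” separator follows the published MRDA convention, but the example labels below are illustrative.

```python
# Split a composite MRDA-style dialogue-act label into its general tag
# and any modifiers. The '^' separator follows the MRDA annotation
# convention; the specific labels used here are illustrative.
def split_da_label(label: str):
    general, *modifiers = label.split("^")
    return general, modifiers

print(split_da_label("s^j"))  # ('s', ['j'])  eg a statement marked as a joke
print(split_da_label("qy"))   # ('qy', [])    eg a yes/no question, no modifiers
```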

  8. Other annotations
     • Topic segmentation (together with brief descriptions of each topic)
     • Summarization (human-written abstracts, together with links to the meeting extracts that support the abstract)
     • Annotations of “hot spots” and involvement

  9. Browsing meetings
     • Tucker and Whittaker categorize meeting browsers as:
       ◦ Audio browsers (with and without visual feedback)
       ◦ Video browsers (also include audio, but video is the focus)
       ◦ “Artefact” browsers (browsing based on other material, eg notes, slides, whiteboard)
       ◦ Discourse browsers (based on derived elements – mainly transcripts, but also speaker activity and involvement)
     • Typically these browsers index into the audio or video, based on discourse or artefact information

  10. Approaches to indexing
     Kazman et al identified four indexing approaches:
     • Indexing by intended content (eg the agenda)
     • Indexing by actual content (eg ASR transcript)
     • Indexing by temporal structure (eg speaker turns)
     • Indexing by application record (eg artefacts such as notes, slides, PC interaction)
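     Whatever its source, each index entry ultimately maps onto a time span in the recording, so the browser can seek playback there. A minimal sketch of that common pattern, with made-up agenda data:

```python
# Sketch of the pattern shared by these indexing approaches: entries from
# any source (agenda items, speaker turns, slide changes, ASR segments)
# map to time spans in the recording. All data below is illustrative.
from bisect import bisect_right

# (start_seconds, end_seconds, label) triples, sorted by start time
index = [
    (0.0,    312.5,  "agenda: project status"),
    (312.5,  904.0,  "agenda: evaluation plans"),
    (904.0,  1500.0, "agenda: any other business"),
]
starts = [start for start, _, _ in index]

def entry_at(t: float):
    """Return the index entry covering time t, if any."""
    i = bisect_right(starts, t) - 1
    if i >= 0 and index[i][1] > t:
        return index[i]
    return None

print(entry_at(400.0))  # -> (312.5, 904.0, 'agenda: evaluation plans')
```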

  11. Browser focus
     • Indexes based on agenda, slide changes, participant activity
     • Search and retrieval based on ASR transcript (see the sketch below)
     • Structure based on topic segmentation and tracking
     • Preview based on summarization or keyword spotting
     • Archive filtering
     • Browsers for limited resources, eg phones, PDAs
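     A minimal sketch of transcript-based search as in the second bullet: scan a timestamped ASR word list for a query term and return the hit times the browser can cue playback to. The transcript data is invented.

```python
# Search the ASR transcript for a keyword and return the time offset of
# each hit, so the browser can jump straight to those points.
transcript = [  # (seconds, word) pairs; illustrative only
    (12.3, "we"), (12.5, "should"), (12.8, "revise"),
    (13.1, "the"), (13.2, "agenda"), (58.0, "agenda"),
]

def keyword_hits(words, query):
    q = query.lower()
    return [t for t, w in words if w.lower() == q]

print(keyword_hits(transcript, "agenda"))  # -> [13.2, 58.0]
```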

  12. Example browser: Ferret

  13. Browser evaluation
     • How can browsers and browsing techniques be compared objectively?
     • Most browsers have no real evaluation; how do you know if what you did was useful?
     • Browser Evaluation Test (BET): finding the maximum number of observations of interest in the minimum time
     • Observers make the observations of interest, expressed as two contrasting statements, one of which is true, eg: “Jo thought that there were too many items on the agenda” vs “Jo thought the agenda was about the right length”
     • Subjects browse the meeting to decide which observations are true
     • Note that this only applies to browsing a single meeting (in its current form)
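     The slide gives the BET goal but not a scoring formula. Assuming the “maximum observations in minimum time” phrasing, one natural summary statistic is correct judgements per minute of browsing; the official BET scoring may differ.

```python
# One plausible BET-style summary statistic, assuming the goal as stated
# on the slide: correct true/false judgements per minute of browsing.
# This is a sketch, not the official BET scoring.
def bet_rate(correct_judgements: int, browsing_seconds: float) -> float:
    """Correct observation judgements per minute of browsing."""
    return correct_judgements / (browsing_seconds / 60.0)

print(bet_rate(8, 600.0))  # 8 correct in 10 minutes -> 0.8 per minute
```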

  14. Next two sessions
     • Tuesday 18 January: Component technologies and software. Useful reading: Galley et al (2003), Zechner (2002), Wrede and Shriberg (2003)
     • Friday 21 January: NITE XML toolkit (Jean Carletta), and division into groups. Useful reading: Carletta and Kilgour (2004)
     All readings downloadable from http://www.inf.ed.ac.uk/teaching/courses/nlssd/readings.html
