
Dynamic language model adaptation using presentation slides for lecture speech recognition. Conference Paper, January 2007. https://www.researchgate.net/publication/221489599



INTERSPEECH 2007, August 27-31, Antwerp, Belgium

Dynamic Language Model Adaptation Using Presentation Slides for Lecture Speech Recognition

Hiroki Yamazaki, Koji Iwano, Koichi Shinoda, Sadaoki Furui and Haruo Yokota
Department of Computer Science, Tokyo Institute of Technology, Japan
yamazaki@ks.cs.titech.ac.jp, {iwano, shinoda, furui, yokota}@cs.titech.ac.jp

Abstract

We propose a dynamic language model adaptation method that uses the temporal information from lecture slides for lecture speech recognition. The proposed method consists of two steps. First, the language model is adapted with the text information extracted from all the slides of a given lecture. Next, the text information of a given slide is extracted based on temporal information and used for local adaptation. Hence, the language model used to recognize the speech associated with a given slide changes dynamically from one slide to the next. We evaluated the proposed method with speech data from four Japanese lecture courses. Our experiments show the effectiveness of the proposed method, especially for keyword detection. The F-measure error rate for lecture keywords was reduced by 2.4%.

Index Terms: language model adaptation, speech recognition, classroom lecture speech.

1. Introduction

Recent advancements in computer and storage technology enable the archiving of large multimedia databases. Databases of classroom lectures in universities and colleges are particularly useful knowledge resources, and they are expected to be used in education systems.

Recently, much effort has been made to construct educational systems that use the multimedia content of classroom lectures to support distance learning [1, 2, 3, 4, 5]. Among the various kinds of content related to lectures, the transcription of speech data is expected to be the most important for indexing and searching lecture contents [2, 6]. Therefore, a high-accuracy speech recognition engine for lectures is required. Lecture speech recognition has been studied extensively. Many research projects on lecture transcription have been conducted, such as the European project CHIL (Computers in the Human Interaction Loop) [8] and the American iCampus Spoken Lecture Processing project [9]. Trancoso et al. [7] investigated the automatic transcription of classroom lectures in Portuguese.

Large databases of conference presentations, such as the Corpus of Spontaneous Japanese (CSJ) [10, 11] and the TED corpus [12], have been collected to improve speech recognition accuracy. With the use of these databases, state-of-the-art speech recognition systems for conference presentations achieve an accuracy of 70-80%. Hence, the recognition results provided by these systems are good enough to be used for speech summarization and speech indexing [13]. The speaking style of classroom lectures is, however, quite different from that of talks in meetings or conferences. Classroom lectures are not always practiced in advance, and the same phrases are repeated many times for emphasis. The lecture speaking style is closer to that of dialogue because lecturers are always ready to be interrupted by questions from students. The spontaneity of this kind of speech is much higher than in other kinds of presentations; the lectures are characterized by strong coarticulation effects, non-grammatical constructions, hesitations, repetitions, and filled pauses. For these reasons, speech recognition for classroom lectures is generally more difficult than for speeches in conferences or meetings; its recognition accuracy is around 40-60%. Furthermore, no large database of classroom lecture speech is available for training acoustic and language models.

In classrooms, lecturers often use various materials, e.g., textbooks or slides, to help their students understand. Since those materials include many keywords that also appear in lecture speech, they are expected to be useful for language modeling in speech recognition. Several language model adaptation methods using such content have already been proposed for lecture speech recognition. For example, Togashi et al. [14] proposed a method that uses the text information in presentation slides.

If lecture speech is accompanied by slides, a strong correlation can be observed between slides and speech. In particular, the speech corresponding to a given slide contains most of the text information presented in that slide. We expect that this relation between the speech and the text of a slide can improve model adaptation for lecture speech recognition. We propose a dynamic adaptation method for language modeling that applies text information from slides. In this method, a slide-dependent language model is constructed for each slide, and this model is then used to recognize the speech associated with that slide. The language model changes dynamically as the lecture progresses.

This paper is organized as follows. In Section 2, the base system applied in our studies is introduced. In Section 3, the proposed language model adaptation method is explained, and in Section 4, the effectiveness of the proposed method is discussed.

2. UPRISE: Unified Presentation Contents Retrieval by Impression Search Engine

UPRISE (Unified Presentation Contents Retrieval by Impression Search Engine) [1, 15] is a lecture presentation system to support distance learning. It stores many types of multimedia materials, such as texts, pictures, graphs, images, sounds, voices, and videos, and provides a unified presentation view (Figure 1) as a lecture video retrieval system. The retrieval system returns the lecture video scenes that match given keywords. Since the speech information in lectures is used to narrow down the search candidates [6], a high level of speech recognition accuracy is strongly required.
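As an illustration of the two-step adaptation outlined in the abstract, the following Python sketch builds one language model per slide and selects the active model from slide timing. It is only a minimal sketch under stated assumptions: the unigram models, the linear interpolation, the weights, and all function names (`unigram`, `interpolate`, `slide_dependent_models`, `slide_index_at`) are hypothetical choices made for this example and are not taken from the paper, whose actual adaptation procedure is given in Section 3.

```python
from bisect import bisect_right
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(base, adapt, weight):
    """Linear interpolation (assumed here): (1 - weight) * base + weight * adapt."""
    vocab = set(base) | set(adapt)
    return {
        w: (1 - weight) * base.get(w, 0.0) + weight * adapt.get(w, 0.0)
        for w in vocab
    }


def slide_dependent_models(background, slides, global_weight=0.3, local_weight=0.3):
    """Return one adapted model per slide (weights are illustrative assumptions)."""
    # Step 1: global adaptation with the text of all slides in the lecture.
    all_tokens = [w for slide in slides for w in slide]
    global_lm = interpolate(background, unigram(all_tokens), global_weight)
    # Step 2: local adaptation with the text of each individual slide.
    # A slide without text simply falls back to the globally adapted model.
    return [
        interpolate(global_lm, unigram(slide), local_weight) if slide else global_lm
        for slide in slides
    ]


def slide_index_at(time_sec, slide_start_times):
    """Index of the slide shown at time_sec, given sorted slide start times."""
    return max(bisect_right(slide_start_times, time_sec) - 1, 0)


# Toy data, invented for this example.
background = unigram("the model is trained on the data".split())
slides = [["language", "model"], ["speech", "recognition"]]
start_times = [0.0, 95.0]  # seconds at which each slide appears

models = slide_dependent_models(background, slides)
active = models[slide_index_at(120.0, start_times)]  # model for the second slide
```

In this toy setup, a word that appears on the current slide receives more probability mass in that slide's model than in the models of the other slides, which is the intuition behind switching models as the lecture progresses.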
