Convergences: Bitext + morph = IGT

Fieldwork as a Computational Problem. The Human Language Project: Uniting Computational Linguistics with Documentary Linguistics. Steven Bird.


  1. Fieldwork as a Computational Problem
     The Human Language Project: Uniting Computational Linguistics with Documentary Linguistics
     Steven Bird, University of Melbourne & University of Pennsylvania
     http://www.ldc.upenn.edu/sb/fieldwork/
     • three data types
     • three kinds of metadata
     • relations
     • computational challenge
     • this isn't computational linguistics

     Convergences / Convergences: Bitext + morph = IGT
     • concern with data
     • bilingual text
     • bilingual lexicons
     • use of speech data
     • bilingual text
     • morphologically analyzed text
     • comparative wordlists
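The "Bitext + morph = IGT" slogan can be illustrated with a small data structure: a free translation (the bitext half) attached to a morphologically analyzed line gives interlinear glossed text. The sketch below is only one possible representation. The class names `Morph` and `IGTLine` and the sample forms are hypothetical and invented for illustration; they are not from the talk or from any real language.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morph:
    form: str   # surface morph (stem or affix)
    gloss: str  # its gloss in the contact language


@dataclass
class IGTLine:
    words: List[List[Morph]]  # each word is a sequence of morphs
    translation: str          # free translation: the "bitext" half

    def morph_tier(self) -> str:
        # Morphs within a word are joined by hyphens, words by spaces.
        return " ".join("-".join(m.form for m in w) for w in self.words)

    def gloss_tier(self) -> str:
        return " ".join("-".join(m.gloss for m in w) for w in self.words)

    def render(self) -> str:
        # Three-line interlinear display: morphs, glosses, translation.
        return "\n".join(
            [self.morph_tier(), self.gloss_tier(), f"'{self.translation}'"]
        )


# Invented placeholder data, not attested in any language.
line = IGTLine(
    words=[[Morph("kata", "dog")], [Morph("mi", "run"), Morph("ta", "PST")]],
    translation="The dog ran.",
)
print(line.render())
```

Keeping the translation and the morph analysis in one record means either half can be missing at collection time and filled in later.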

  2. Documentation types: Interlinear text
     Guwamu, Peter Austin (2010)

     Documentation types: Lexicons
     Kröger, F. Buli-English dictionary: With an Introductory Grammar and an Index. Münster: Lit, 1992.

     Documentary and Descriptive Linguistics
     Nikolaus Himmelmann (1998) "Documentary and Descriptive Linguistics" Linguistics 36:161-195

     Use of Computation
     • documentarists
     • innovation, tool development
     • descriptivists
     • Evans, Hyman

  3. Karaim CD-ROM
     Eva Csato and David Nathan
     Nathan, D. (1998) The spoken Karaim CD: Sound, text, lexicon and "active morphology" for language learning multimedia, Proceedings of the Ninth Annual Conference on Turkish Linguistics.

     Where's the science?
     After years of neglect in which linguistics lost sight of the value of empirical field research, new life has finally been breathed into this fundamentally important component of our discipline. But in the process, linguistic fieldwork has ironically lost sight of linguistics! That is, if by linguistics one means the scientific study of language, fieldwork ideology and practice have gone askew. The major movements and individuals that we can thank for the resurgence of interest in linguistic fieldwork all promote (in words or deeds) approaches to field research that fall far short of the tenets of science. Examples of such misguided directions include (a) the endangered languages movement, (b) language documentation, and (c) the "Dixon school". In my talk, I expose the failings of these non-scientific approaches to linguistic field research and set out what would be required for linguistic fieldwork to qualify as truly scientific and thus be entitled to recognition as an essential subfield within linguistics per se.
     Paul Newman -- Linguistic Fieldwork as a Scientific Enterprise, International Conference on Language

     Key Questions
     • What does computational linguistics offer to the problem of documenting and describing the world's languages?
     • How can CL help improve the descriptive value of language documentation?

     Basic Oral Language Documentation
     • three places where this might happen

  4. Synopsis of 1 week: Pilot project in Moife
     1. Discussions re orthography, literacy
     2. training, practice, listening, tone orthography experiment
     3. training in oral transcription and translation; gave out recorders
     4. re-assigned recorders
     5. (Saturday)
     6. oral transcription, vitality survey, orthography recommendations
     7. more oral transcription

  5. Main Phase

  6. Preparation
     • Batteries
     • Date
     • Identifiers

     Training

  7. Basic Oral Language Documentation
     Overview of one week's activity...

     Oral Annotation Protocol

  8. Transcription
     Cross Checking

     Evaluation
     • What is the quality of the collected materials?
     • Can we correctly establish the phonemic inventory of the language from the recorded materials?
     • What semantic domains are covered?
     • What can trained linguists get from the raw transcripts?

  9. Axioms
     • Limited funding, but costs for local participation are negligible
     • Cannot assume continuous presence of a linguist: primary collection work is "unsupervised"
     • Cannot assume an orthography
     • Can give training in documentation, but not description
     • Contact language has every conceivable resource
     • No time limit

     Back to the computational questions...

     Transcription / MT to help with eliciting morphology?
     • contact-language orthography: issues with normalisation
     • problems with recording and translating isolated words
     • lexical inventory, diphone inventory
     • short complete sentences with translations
     • sense tagging
     • fix nouns and vary the form of the verb?
     • multiple instances of one story
     • bilingual texts as the key means a user would train the system
     • ASR?
     • resegmentation
     • active learning in interlinear text glossing
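The "lexical inventory, diphone inventory" bullet can be made concrete with a short counting sketch. It assumes transcripts arrive as lists of word tokens and, as a strong simplification, that each character is one segment, which will not hold for orthographies with digraphs or tone marks. The function names and the sample words are hypothetical and invented for illustration.

```python
from collections import Counter
from typing import Iterable, List


def lexical_inventory(sentences: Iterable[List[str]]) -> Counter:
    """Count how often each word form occurs across all transcripts."""
    return Counter(word for sentence in sentences for word in sentence)


def diphone_inventory(words: Iterable[str]) -> Counter:
    """Count adjacent segment pairs in each word; '#' marks word edges.

    Assumption: one character equals one segment.
    """
    counts: Counter = Counter()
    for word in words:
        padded = ["#"] + list(word) + ["#"]
        counts.update(zip(padded, padded[1:]))
    return counts


# Invented placeholder transcripts, not real data.
transcripts = [["ba", "tiba"], ["tiba", "ba"]]

lexicon = lexical_inventory(transcripts)
diphones = diphone_inventory(lexicon.keys())  # count over word types

print(lexicon.most_common())
print(diphones.most_common())
```

Counting diphones over word types rather than tokens shows which segment transitions are attested at all, which is the more useful view when deciding what still needs to be recorded.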

  10. MT as the measure of adequacy?
      • inspect MT output to see what is lost
      • supply a corrected version when it gets something wrong
      • supply other examples, much as you would do with a child

      Data mining
      Bird (1999) Multidimensional exploration of online linguistic field data. NELS 29: 33-50.
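The inspect-and-correct loop in the bullets above can be sketched with a deliberately trivial stand-in for MT: a word-for-word lexicon lookup. This is not a real translation system and is not the method proposed in the talk. It only shows the shape of the loop: translate, see what was lost, supply a correction, translate again. All names and sample words are hypothetical.

```python
from typing import Dict, List, Tuple

UNKNOWN = "???"


def translate(sentence: str, lexicon: Dict[str, str]) -> Tuple[str, List[str]]:
    """Word-for-word lookup; returns the output and the words that were lost."""
    output: List[str] = []
    lost: List[str] = []
    for word in sentence.split():
        if word in lexicon:
            output.append(lexicon[word])
        else:
            output.append(UNKNOWN)
            lost.append(word)
    return " ".join(output), lost


def correct(lexicon: Dict[str, str], corrections: Dict[str, str]) -> Dict[str, str]:
    """Return a new lexicon with the user's corrections added."""
    updated = dict(lexicon)
    updated.update(corrections)
    return updated


# Invented placeholder data, not from any real language.
lexicon = {"kata": "dog"}

first, lost = translate("kata mita", lexicon)          # inspect what is lost
lexicon = correct(lexicon, {"mita": "ran"})            # supply a correction
second, still_lost = translate("kata mita", lexicon)   # translate again

print(first, lost)
print(second, still_lost)
```

In this toy setting the list of lost words is the adequacy signal: when it is empty for new sentences, the collected material covers them.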
