Blackfoot Corpus Joel Dunham UBC Overview Blackfoot language - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Treebanking a Blackfoot Corpus

Joel Dunham UBC

slide-2
SLIDE 2

Overview

  • Blackfoot language
  • Online Linguistic Database (OLD)
  • Blackfoot OLD (BOLD)
  • BOLD Annotation/treebanking
slide-3
SLIDE 3

Blackfoot language

  • Algonquian (Plains): Alberta & Montana
  • Endangered: < 5000 speakers
  • Fieldwork: UBC, UCalgary, UMontana
slide-4
SLIDE 4

Blackfoot language

  • Salient properties:
  • Direct-inverse system
  • Grammatical animacy
  • Agglutinative
slide-5
SLIDE 5

Blackfoot language

  • Agglutinative:
  • kimaaksawohpokooyimasi
  • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi
  • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ
  • 'Why don't you eat with her?'
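
The segmentation and gloss lines above pair up one morpheme to one gloss. A minimal sketch of that alignment follows; the function name `align_gloss` is hypothetical and is not part of the OLD.

```python
def align_gloss(segmentation: str, gloss: str, sep: str = "-"):
    """Pair each morpheme with its gloss; raise if the two lines differ in length."""
    morphemes = segmentation.split(sep)
    glosses = gloss.split(sep)
    if len(morphemes) != len(glosses):
        raise ValueError(f"{len(morphemes)} morphemes vs {len(glosses)} glosses")
    return list(zip(morphemes, glosses))


# Segmentation and gloss lines taken from the slide.
pairs = align_gloss("k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
                    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ")
for morpheme, gloss in pairs:
    print(f"{morpheme:<8}{gloss}")
```

A length check like this is a cheap way to catch a gloss line that was truncated or mis-segmented during data entry.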
slide-6
SLIDE 6

OLD

  • Online Linguistic Database
  • www.onlinelinguisticdatabase.org
  • Web application for documenting and analyzing languages

slide-7
SLIDE 7

OLD

  • Open source (GPL): Python (Pylons), MySQL, HTML/JS
  • Powerful search capability: regex, boolean

  • Multi-user, web-based, collaborative
  • Multi-media: audio, video, images, text
  • Auto-linking of morphemes
slide-8
SLIDE 8

Blackfoot OLD

  • OLD web application for Blackfoot (BLAOLD; funded by SSHRC)

  • http://blaold.webfactional.com/
  • Other OLD web apps:
  • Okanagan OLD (OKAOLD)
  • Plains Cree OLD (CRKOLD)
  • etc.
slide-9
SLIDE 9

BLAOLD

slide-10
SLIDE 10

BLAOLD

  • Forms (morphemes & sentences): 21,788 (2011-07-25)

  • morphemes: 5,094
  • sentences: 3,193
  • unclassified: 13,501
  • (word tokens: 20,577)
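
As a quick arithmetic check, the three categories on this slide add up to the stated total. The sketch below uses only the figures from the slide; the variable names are illustrative.

```python
# Counts as reported on the slide (2011-07-25).
counts = {"morphemes": 5094, "sentences": 3193, "unclassified": 13501}
total = 21788

# The three categories should account for every Form.
assert sum(counts.values()) == total

# Print each category with its share of the total.
for name, n in counts.items():
    print(f"{name:<13}{n:>7,}  {n / total:6.1%}")
```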
slide-11
SLIDE 11

BLAOLD

  • Sources:
  • textual: 16,209 forms
  • field work: 5,569 forms (and growing...)

slide-12
SLIDE 12

BLAOLD

  • Collections
  • texts created by ordered references to forms
  • 135 Collections at present
  • E.g., Creation Story:
  • http://blaold.webfactional.com/creationstory
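
The idea of a Collection as an ordered list of references to Forms can be sketched as a small data structure. This is an illustrative model only, not the OLD's actual schema: the class names, field names and ID numbers are assumptions, and the sample transcriptions are simply reused from elsewhere in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    form_id: int          # hypothetical ID
    transcription: str


@dataclass
class Collection:
    title: str
    form_ids: List[int] = field(default_factory=list)  # order defines the text

    def render(self, forms: Dict[int, Form]) -> List[str]:
        """Resolve the references, in order, against the shared pool of Forms."""
        return [forms[i].transcription for i in self.form_ids]


# A shared pool of Forms, keyed by hypothetical IDs.
forms = {
    101: Form(101, "kimaaksawohpokooyimasi"),
    102: Form(102, "oma aakííwa iihpóma ónnikii"),
}

# The same Form can be referenced by any number of Collections.
text = Collection("Example text", form_ids=[102, 101])
print(text.render(forms))
```

Storing references rather than copies means a correction to a Form shows up in every text that cites it.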
slide-13
SLIDE 13

BLAOLD

  • ...

Collection (text) created by referencing Forms entered into the BLAOLD.

slide-14
SLIDE 14

BLAOLD

  • Files:
  • Associate Forms, Collections & Files
  • 2,159 files (2011-07-25)
  • 1,744 audio
  • 259 image
  • 148 text
  • 4 video
slide-15
SLIDE 15

Form with morphemic analysis

  • Associated WAV file (tagged as an object language utterance)
  • Associated JPG (used as a stimulus in elicitation)
  • Morpheme segmentation and morpheme gloss lines. Blue text indicates links to morphemic Form entries found by the system
  • POS string auto-generated: “prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num”

slide-16
SLIDE 16

BLAOLD: Goal

  • Improve efficiency of data collection, dissemination & analysis

  • automate subtasks & improve search
  • morphological parsing
  • treebanking?
slide-17
SLIDE 17

Morphological Parser

  • 'A morphological parser for Blackfoot' (Dunham, 2010; WAIL)
  • input = transcription:
  • kimaaksawohpokooyimasi
  • output = <segmentation, morph glosses, POSes>:
  • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi
  • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ
  • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
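
One way to picture how the POS line relates to the other two output lines is a lookup of each (morpheme, gloss) pair in a lexicon. The sketch below is an assumption about the general approach, not the parser's actual code; the toy `LEXICON` contains only the nine pairs from the example above, and `pos_string` is a hypothetical name.

```python
# Toy lexicon built from the example on this slide: (morpheme, gloss) -> POS tag.
LEXICON = {
    ("k", "2"): "agra",
    ("máak", "why"): "adt",
    ("sa", "NEG"): "oth",
    ("ohpook", "with"): "adt",
    ("ooyi", "eat"): "vai",
    ("m", "TA"): "fin",
    ("yii", "DIR"): "thm",
    ("wa", "3SG"): "agrb",
    ("hsi", "CONJ"): "agrb",
}


def pos_string(segmentation, gloss, lexicon, unknown="?"):
    """Look up each morpheme/gloss pair; unmatched pairs get a placeholder tag."""
    pairs = zip(segmentation.split("-"), gloss.split("-"))
    return "-".join(lexicon.get(pair, unknown) for pair in pairs)


result = pos_string("k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
                    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
                    LEXICON)
print(result)
```

Keying on the pair rather than the morpheme alone keeps homophonous morphemes with different glosses apart.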
slide-18
SLIDE 18

Morphological Parser

  • FST: Phonology + Morphotactics (lexicon)
  • input: kimaaksawohpokooyimasi
  • output:
  • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi
  • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ
  • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
  • Morphotactics & lexicon extracted programmatically from the BLAOLD
  • Phonology (from a grammar) hand-coded into FST
  • Accuracy: ca. 70%
  • Challenges:
  • variations in transcription
  • no hard and fast spelling rules
  • researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
  • POS/morphemic N-grams used to select most probable parse
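
The slide does not say which N-gram order or smoothing the parser uses. As a rough illustration of the idea, here is a bigram model over POS tags with add-one smoothing that ranks candidate parses. The training strings are per-word POS strings that appear in these slides; the competing candidate is invented for the example, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(sequences):
    """Count tag bigrams (with start/end padding) over lists of POS tags."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)


def log_prob(seq, model):
    """Add-one smoothed log-probability of a tag sequence."""
    bigrams, contexts, v = model
    padded = ["<s>"] + seq + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
               for a, b in zip(padded, padded[1:]))


def best_parse(candidates, model):
    """Return the candidate POS string with the highest score."""
    return max(candidates, key=lambda c: log_prob(c.split("-"), model))


# Per-word POS strings that appear in these slides.
training = ["agra-adt-oth-adt-vai-fin-thm-agrb-agrb", "prev-asp-vta", "drt-num",
            "nan", "agra-nan", "adt-asp-vai-oth-num", "adt-asp-fin-fin-thm",
            "nan-nin", "und"]
model = train_bigrams([s.split("-") for s in training])

attested = "agra-adt-oth-adt-vai-fin-thm-agrb-agrb"
competitor = "agra-adt-oth-adt-vai-thm-fin-agrb-agrb"  # hypothetical wrong parse
print(best_parse([competitor, attested], model))
```

The competitor loses because its tag transitions (for instance `vai` followed by `thm`) never occur in the training strings, so they only receive the smoothing mass.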

slide-19
SLIDE 19

Morphological Parser

  • Benefits of a morphological parse(r):
  • parse online in real time (i.e., during data entry): save researcher time

  • create more data to improve searching
slide-20
SLIDE 20

Morphological Parser

  • Search example: find all sentences with an overt subject and an overt object
  • Regex on POS string for 2 nominal roots:

  • /n[ai][nr].*n[ai][nr].*/
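
The regex above can be tried on POS strings that appear elsewhere in these slides. The sketch below uses Python's `re` module; the dictionary keys are just labels chosen for this example, and the stricter pattern is an illustrative variant, not something from the presentation.

```python
import re

# The pattern from the slide (the trailing ".*" is not needed with search()).
NOMINAL_PAIR = re.compile(r"n[ai][nr].*n[ai][nr]")

# Variant that requires the two nominal tags to sit in different words.
STRICT_PAIR = re.compile(r"\bn[ai][nr]\b\S*\s+.*\bn[ai][nr]\b")

pos_strings = {
    "form with two nouns": "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num",
    "single verb": "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "compound noun": "drt-num adt-asp-fin-fin-thm drt-num nan-nin und adt-adt-asp-fin-fin-thm-agrb",
}

hits = [name for name, pos in pos_strings.items() if NOMINAL_PAIR.search(pos)]
strict_hits = [name for name, pos in pos_strings.items() if STRICT_PAIR.search(pos)]
print(hits)
print(strict_hits)
```

The slide's pattern also fires on `nan-nin`, where both nominal tags belong to a single word, which is one way a search like this can return unwanted results.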
slide-21
SLIDE 21

Morphological Parser

/n[ai][nr].*n[ai][nr].*/

Search results: Good | Bad

slide-22
SLIDE 22

Treebank

(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))

TGrep: 'S < (NP $. (VP < NP))'
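
To make the query concrete: in TGrep, `A < B` means A immediately dominates B, and `A $. B` means A is a sister of B and immediately precedes it. The sketch below hand-codes that one pattern over the bracketed tree from the slide. It is a toy reimplementation for illustration, not TGrep itself; the function names and the second, intransitive tree are invented for the example.

```python
import re


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                       # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                       # consume ")"
        return (label, children)

    return node()


def label(n):
    return n[0] if isinstance(n, tuple) else None


def subtrees(n):
    if isinstance(n, tuple):
        yield n
        for child in n[1]:
            yield from subtrees(child)


def find_transitive(tree):
    """S nodes with a child NP immediately followed by a sister VP that has an NP child."""
    found = []
    for s in subtrees(tree):
        if label(s) != "S":
            continue
        kids = s[1]
        for left, right in zip(kids, kids[1:]):
            if (label(left) == "NP" and label(right) == "VP"
                    and any(label(c) == "NP" for c in right[1])):
                found.append(s)
                break
    return found


tree = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))")
no_object = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))")  # hypothetical
print(len(find_transitive(tree)), len(find_transitive(no_object)))
```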

slide-23
SLIDE 23

Treebank

  • Assuming a flat morphological structure, the syntactic phrase structure parsing of Blackfoot may actually be easy relative to English
  • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words

slide-24
SLIDE 24

Treebank

ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áyi

drt-num adt-asp-fin-fin-thm drt-num nan-nin und adt-adt-asp-fin-fin-thm-agrb-oth

'He is building that house and he is still building it.'

Tree node labels: DEM, VBZ, DEM, NN, VBZ, CC, NP, NP, VP, S, VP, S, S

slide-25
SLIDE 25

Treebank

  • Worth it to treebank Blackfoot?

Cons:

  • lots of researcher hours & money
  • time might be better spent elsewhere, e.g., elicitation

Pros:

  • might significantly improve search, and therefore research efficiency
  • automated parsing may be relatively easy

slide-26
SLIDE 26

Nitsííkoohtaahsi'taki