Treebanking a Blackfoot Corpus • Joel Dunham • UBC
Overview • Blackfoot language • Online Linguistic Database (OLD) • Blackfoot OLD (BOLD) • BOLD Annotation/treebanking
Blackfoot language • Algonquian (Plains): Alberta & Montana • Endangered: < 5000 speakers • Fieldwork: UBC, UCalgary, UMontana
Blackfoot language • Salient properties: • Direct-inverse system • Grammatical animacy • Agglutinative
Blackfoot language • Agglutinative: • kimaaksawohpokooyimasi • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • 'Why don't you eat with her?'
OLD • Online Linguistic Database • www.onlinelinguisticdatabase.org • Web application for documenting and analyzing languages
OLD • Open source (GPL): Python (Pylons), MySQL, HTML/JS • Powerful search capability: regex, boolean • Multi-user, web-based, collaborative • Multi-media: audio, video, images, text • Auto-linking of morphemes
Blackfoot OLD • OLD web application for Blackfoot (BLAOLD; funded by SSHRC) • http://blaold.webfactional.com/ • Other OLD web apps: • Okanagan OLD (OKAOLD) • Plains Cree OLD (CRKOLD) • etc.
BLAOLD
BLAOLD • Forms (morphemes & sentences): 21,788 (2011-07-25) • morphemes: 5,094 • sentences: 3,193 • unclassified: 13,501 • (word tokens: 20,577)
BLAOLD • Sources: • textual: 16,209 forms • field work: 5,569 forms (and growing...)
BLAOLD • Collections • texts created by ordered references to forms • 135 Collections at present • E.g., Creation Story: • http://blaold.webfactional.com/creationstory
BLAOLD Collection (text) created by referencing Forms entered into the BLAOLD. • ...
BLAOLD • Files: • Associate Forms, Collections & Files • 2,159 files (2011-07-25) • 1,744 audio • 259 image • 148 text • 4 video
Form with morpheme segmentation and morpheme gloss lines (morphemic analysis) • Blue text indicates links to morphemic Form entries found by the system • POS string auto-generated: "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num" • Associated WAV file (tagged as an object language utterance) • Associated JPG (used as a stimulus in elicitation)
BLAOLD: Goal • Improve efficiency of data collection, dissemination & analysis • automate subtasks & improve search • morphological parsing • treebanking?
Morphological Parser • 'A morphological parser for Blackfoot' (Dunham, 2010; WAIL) • input = transcription: • kimaaksawohpokooyimasi • output = <segmentation, morph glosses, POSes>: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
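The output triple above can be read as three hyphen-aligned tiers. A minimal Python sketch follows, assuming (the slide does not say so explicitly) that each tier carries exactly one hyphen-delimited item per morpheme. The `Parse` class and its field names are hypothetical and are not part of the OLD or of the parser itself.

```python
from dataclasses import dataclass


@dataclass
class Parse:
    """Hypothetical container for one parser output triple."""
    segmentation: str  # e.g. "k-máak-sa-..."
    glosses: str       # e.g. "2-why-NEG-..."
    pos: str           # e.g. "agra-adt-oth-..."

    def morphemes(self):
        """Return (morpheme, gloss, POS) triples, one per morpheme."""
        tiers = [t.split("-") for t in (self.segmentation, self.glosses, self.pos)]
        # Assumption: all three tiers have the same number of items.
        if len({len(t) for t in tiers}) != 1:
            raise ValueError("tiers are not aligned")
        return list(zip(*tiers))


# The example from the slide.
example = Parse(
    segmentation="k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    glosses="2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
    pos="agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
)
aligned = example.morphemes()
for morpheme, gloss, tag in aligned:
    print(f"{morpheme}\t{gloss}\t{tag}")
```

Keeping the tiers aligned in this way is what lets a search over the POS tier be traced back to the morphemes and glosses it corresponds to.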
Morphological Parser • Input: kimaaksawohpokooyimasi • FST: • Phonology FST: phonology (from a grammar) hand-coded • Morphotactics (lexicon): morphotactics & lexicon extracted programmatically from the BLAOLD • POS/morphemic N-grams used to select most probable parse • Output: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb • Accuracy: ca. 70% • Challenges: • variations in transcription • no hard and fast spelling rules • researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
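The N-gram selection step can be illustrated with a small sketch. The slide does not give the N-gram order or the smoothing method, so the bigram model with add-one smoothing below is an assumption chosen for brevity, not a description of the actual parser. The training strings are word-level POS strings that appear in these slides; the second candidate parse is invented purely for contrast.

```python
import math
from collections import Counter


def tag_bigrams(pos_string):
    """Split a hyphenated POS string into bigrams with boundary symbols."""
    tags = ["<s>"] + pos_string.split("-") + ["</s>"]
    return list(zip(tags, tags[1:]))


def train(pos_strings):
    """Count bigrams and their left contexts."""
    context_counts, bigram_counts = Counter(), Counter()
    for s in pos_strings:
        for prev, cur in tag_bigrams(s):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def log_prob(pos_string, context_counts, bigram_counts, vocab_size):
    """Add-one smoothed bigram log-probability of a POS string."""
    return sum(
        math.log((bigram_counts[(p, c)] + 1) / (context_counts[p] + vocab_size))
        for p, c in tag_bigrams(pos_string)
    )


# Word-level POS strings taken from the slides (a toy training set).
training = [
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "adt-asp-fin-fin-thm",
    "adt-adt-asp-fin-fin-thm-agrb-oth",
]
# Candidate parses for one word; the second is a made-up competitor.
candidates = [
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "agra-adt-oth-adt-nan-fin-thm-agrb-agrb",
]

context_counts, bigram_counts = train(training)
vocab = {cur for s in training + candidates for _, cur in tag_bigrams(s)}
best = max(
    candidates,
    key=lambda c: log_prob(c, context_counts, bigram_counts, len(vocab)),
)
print(best)
```

With these toy counts the attested sequence outscores the invented one, because the competitor relies on two bigrams that were never observed in training.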
Morphological Parser • Benefits of a morphological parse(r): • parse online in real time (i.e., during data entry): save researcher time • create more data to improve searching
Morphological Parser • Search example: find all sentences with an overt subject and an overt object • Regex on POS string for 2 nominal roots: • /n[ai][nr].*n[ai][nr].*/
Morphological Parser • /n[ai][nr].*n[ai][nr].*/ • Good matches • Bad matches
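The POS-string search can be tried with Python's `re` module. This is a sketch only: the three POS strings are copied from elsewhere in these slides, while the keys `form_a` to `form_c` and the helper name are hypothetical. How the OLD itself executes regex searches is not shown here.

```python
import re

# Pattern from the slide: two nominal-root tags anywhere in the POS string.
TWO_NOMINALS = re.compile(r"n[ai][nr].*n[ai][nr].*")

# POS strings taken from the slides; the keys are made-up identifiers.
pos_strings = {
    "form_a": "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num",
    "form_b": "drt-num adt-asp-fin-fin-thm drt-num nan-nin und "
              "adt-adt-asp-fin-fin-thm-agrb-oth",
    "form_c": "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
}


def has_two_nominal_roots(pos_string):
    """True if the pattern matches somewhere in the POS string."""
    return TWO_NOMINALS.search(pos_string) is not None


matches = [key for key, pos in pos_strings.items() if has_two_nominal_roots(pos)]
print(matches)
```

Here `form_b` matches only because `nan-nin` is a single compound word containing two nominal roots, which may be one source of bad matches: the pattern counts roots, not noun phrases.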
Treebank • (S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii))) • TGrep: 'S < (NP $. (VP < NP))' • [Tree diagram: S → NP VP; NP → DT NP; VP → VBD NP]
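The TGrep pattern reads: an S that immediately dominates an NP whose immediately following sister is a VP, where that VP immediately dominates an NP. The sketch below hand-codes just this one pattern over a bracketed tree in plain Python. It is not TGrep and not a general query engine, and all function names are hypothetical. The second tree is a made-up variant of the slide's example with the object NP removed.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def label(node):
    """Label of a subtree, or None for a leaf word."""
    return node[0] if isinstance(node, tuple) else None


def is_transitive_clause(node):
    """Hand-coded check for: S < (NP $. (VP < NP))."""
    if label(node) != "S":
        return False
    kids = node[1]
    for left, right in zip(kids, kids[1:]):   # adjacent sisters
        if (label(left) == "NP" and label(right) == "VP"
                and any(label(c) == "NP" for c in right[1])):
            return True
    return False


def find_matches(node):
    """Collect every subtree that satisfies the pattern."""
    found = [node] if is_transitive_clause(node) else []
    for child in node[1]:
        if isinstance(child, tuple):
            found.extend(find_matches(child))
    return found


transitive = parse_tree(
    "(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))")
no_object = parse_tree("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))")
print(len(find_matches(transitive)), len(find_matches(no_object)))
```

Unlike the POS-string regex, this query is stated over phrase structure, so it can distinguish a subject NP and an object NP from two nominal roots inside one word.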
Treebank • Assuming a flat morphological structure, syntactic phrase-structure parsing of Blackfoot may actually be easy relative to English • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words
Treebank • [Tree diagram; phrase nodes: S, S, S, VP, VP, NP, NP] • Word categories: DEM VBZ DEM NN CC VBZ • POS: drt-num adt-asp-fin-fin-thm drt-num nan-nin und adt-adt-asp-fin-fin-thm-agrb-oth • ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áyi • 'He is building that house and he is still building it.'
Treebank • Worth it to treebank Blackfoot? • Cons: • lots of researcher hours & money • time might be better spent elsewhere, e.g., elicitation • Pros: • might significantly improve search ∴ research efficiency • automated parsing may be relatively easy
Nitsííkoohtaahsi'taki