Treebanking a Blackfoot Corpus Joel Dunham UBC
Overview • Blackfoot language • Online Linguistic Database (OLD) • Blackfoot OLD (BOLD) • BOLD Annotation/treebanking
Blackfoot language • Algonquian (Plains): Alberta & Montana • Endangered: < 5000 speakers • Fieldwork: UBC, UCalgary, UMontana
Blackfoot language • Salient properties: • Direct-inverse system • Grammatical animacy • Agglutinative
Blackfoot language • Agglutinative: • kimaaksawohpokooyimasi • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • 'Why don't you eat with her?'
OLD • Online Linguistic Database • www.onlinelinguisticdatabase.org • Web application for documenting and analyzing languages
OLD • Open source (GPL): Python (Pylons), MySQL, HTML/JS • Powerful search capability: regex, boolean • Multi-user, web-based, collaborative • Multi-media: audio, video, images, text • Auto-linking of morphemes
Blackfoot OLD • OLD web application for Blackfoot (BLAOLD; funded by SSHRC) • http://blaold.webfactional.com/ • Other OLD web apps: • Okanagan OLD (OKAOLD) • Plains Cree OLD (CRKOLD) • etc.
BLAOLD
BLAOLD • Forms (morphemes & sentences): 21,788 (2011-07-25) • morphemes: 5,094 • sentences: 3,193 • unclassified: 13,501 • (word tokens: 20,577)
BLAOLD • Sources: • textual: 16,209 forms • field work: 5,569 forms (and growing...)
BLAOLD • Collections • texts created by ordered references to forms • 135 Collections at present • E.g., Creation Story: • http://blaold.webfactional.com/creationstory
BLAOLD Collection (text) created by referencing Forms entered into the BLAOLD. • ...
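The idea of a Collection as an ordered list of references to Forms can be sketched as a small data structure. This is only an illustration and not the OLD's actual schema: the class names, field names, and ID numbers are hypothetical, and the two transcriptions are taken from examples elsewhere in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    # Hypothetical minimal Form record: an ID plus a transcription.
    form_id: int
    transcription: str


@dataclass
class Collection:
    # A text is stored as an ordered list of Form IDs, not as copied text.
    title: str
    form_ids: list[int] = field(default_factory=list)

    def render(self, forms: dict[int, Form]) -> str:
        """Build the text by looking up each referenced Form in order."""
        return "\n".join(forms[i].transcription for i in self.form_ids)


# Illustrative IDs only; the real database assigns its own.
forms = {
    f.form_id: f
    for f in [
        Form(1, "oma aakííwa iihpóma ónnikii"),
        Form(2, "kimaaksawohpokooyimasi"),
    ]
}
text = Collection("Example text", [2, 1])
print(text.render(forms))
```

Because the Collection holds references rather than copies, a correction to a Form would show up in every text that cites it.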
BLAOLD • Files: • Associate Forms, Collections & Files • 2,159 files (2011-07-25) • 1,744 audio • 259 image • 148 text • 4 video
Form with morpheme segmentation and morpheme gloss lines (morphemic analysis) • Blue text indicates links to morphemic Form entries found by the system • POS string auto-generated: "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num" • Associated WAV file (tagged as an object language utterance) • Associated JPG (used as a stimulus in elicitation)
BLAOLD: Goal • Improve efficiency of data collection, dissemination & analysis • automate subtasks & improve search • morphological parsing • treebanking?
Morphological Parser • 'A morphological parser for Blackfoot' (Dunham, 2010; WAIL) • input = transcription: • kimaaksawohpokooyimasi • output = <segmentation, morph glosses, POSes>: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
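The output triple is three hyphen-delimited lines that line up morpheme by morpheme. A minimal sketch of that alignment follows, assuming the hyphen is the only delimiter; the `Morpheme` type and `align_parse` helper are hypothetical names, not part of the parser itself.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    shape: str  # morpheme from the segmentation line
    gloss: str  # corresponding morpheme gloss
    pos: str    # corresponding POS/category tag


def align_parse(segmentation: str, glosses: str, poses: str, sep: str = "-"):
    """Split the three output lines on the delimiter and zip them together."""
    parts = [line.split(sep) for line in (segmentation, glosses, poses)]
    if len({len(p) for p in parts}) != 1:
        raise ValueError("the three lines have different morpheme counts")
    return [Morpheme(*triple) for triple in zip(*parts)]


parse = align_parse(
    "k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
)
for m in parse:
    print(f"{m.shape}\t{m.gloss}\t{m.pos}")
```

The length check is a cheap sanity test on a parse: a segmentation with nine morphemes should come with nine glosses and nine tags.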
Morphological Parser • Input: kimaaksawohpokooyimasi • FST: - Phonology FST: phonology (from a grammar) hand-coded - Morphotactics (lexicon) FST: morphotactics & lexicon extracted programmatically from the BLAOLD • POS/morphemic N-grams used to select most probable parse • Output: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb • Accuracy: ca. 70% • Challenges: - variations in transcription - no hard and fast spelling rules - researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
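The N-gram disambiguation step can be illustrated with a toy bigram model over POS tags. This is a sketch under stated assumptions, not the parser's actual model: the slides do not give the N-gram order or smoothing method, so bigrams with add-one smoothing are used here, the training data is just the three verb POS strings that appear in these slides, and the second candidate is invented for contrast.

```python
import math
from collections import Counter


def train_bigrams(pos_sequences):
    """Count tag unigrams and bigrams, with start/end padding."""
    unigrams, bigrams = Counter(), Counter()
    for seq in pos_sequences:
        padded = ["<s>"] + seq.split("-") + ["</s>"]
        unigrams.update(padded[:-1])           # contexts only
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(seq, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of one POS sequence."""
    padded = ["<s>"] + seq.split("-") + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


def best_parse(candidates, unigrams, bigrams):
    """Pick the candidate POS sequence the model scores highest."""
    vocab_size = len(unigrams) + 1  # rough vocabulary size, +1 for </s>
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams, vocab_size))


# Toy training data: POS strings that appear in these slides.
unigrams, bigrams = train_bigrams([
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "adt-asp-fin-fin-thm",
    "adt-adt-asp-fin-fin-thm-agrb-oth",
])
candidates = [
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "agra-adt-oth-adt-nan-fin-thm-agrb-agrb",  # invented competing parse
]
print(best_parse(candidates, unigrams, bigrams))
```

In a real setting the candidates would be the alternative analyses produced by the FST for one transcription, and the counts would come from the analysed forms already in the database.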
Morphological Parser • Benefits of a morphological parse(r): • parse online in real time (i.e., during data entry): save researcher time • create more data to improve searching
Morphological Parser • Search example: find all sentences with an overt subject and an overt object • Regex on POS string for 2 nominal roots: • /n[ai][nr].*n[ai][nr].*/
Morphological Parser • /n[ai][nr].*n[ai][nr].*/ • [Figure: search results for this regex, with matches labelled Good and Bad]
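The regex search can be tried directly on POS strings. The sketch below applies the same pattern with Python's `re` module; the first test string is the auto-generated POS string from the BLAOLD form slide, while the other two are shortened, made-up strings for contrast. The helper name is hypothetical.

```python
import re

# Same pattern as on the slide: two nominal-root tags anywhere in the POS string.
NOMINAL_PAIR = re.compile(r"n[ai][nr].*n[ai][nr].*")


def has_two_nominal_roots(pos_string: str) -> bool:
    """True if the POS string contains at least two tags matching n[ai][nr]."""
    return NOMINAL_PAIR.search(pos_string) is not None


examples = [
    "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num",
    "drt-num nan adt-asp-vai",      # made-up: only one nominal root
    "drt-num adt-asp-fin-fin-thm",  # made-up: no nominal root
]
for pos in examples:
    print(has_two_nominal_roots(pos), pos)
```

Note that the pattern only counts nominal tags; it does not check that they sit in separate words or that they are the subject and object, so two nominal roots inside a single word would also match.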
Treebank • (S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii))) • TGrep: 'S < (NP $. (VP < NP))'
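To make the TGrep pattern concrete: `A < B` means A immediately dominates B, and `A $. B` means A is a sister of B and immediately precedes it. The sketch below hard-codes just this one pattern over a tiny bracket parser; it is not TGrep itself, and the function names are hypothetical.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        node_label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume ")"
        return (node_label, children)

    return read()


def label(node):
    """Label of a phrase node, or None for a leaf word."""
    return node[0] if isinstance(node, tuple) else None


def subtrees(node):
    """Yield every phrase node in the tree, top-down."""
    if isinstance(node, tuple):
        yield node
        for child in node[1]:
            yield from subtrees(child)


def has_subject_and_object(tree):
    """Hard-coded check equivalent to: S < (NP $. (VP < NP))."""
    for node in subtrees(tree):
        if label(node) != "S":
            continue
        kids = node[1]
        for left, right in zip(kids, kids[1:]):  # adjacent sisters
            if (label(left) == "NP" and label(right) == "VP"
                    and any(label(c) == "NP" for c in right[1])):
                return True
    return False


tree = parse_tree("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))")
print(has_subject_and_object(tree))
```

Unlike the regex on the POS string, this query looks at structure: the two NPs must be the sister of the VP and a daughter of the VP respectively.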
Treebank • Assuming a flat morphological structure, syntactic phrase-structure parsing of Blackfoot may actually be easy relative to English • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words
Treebank • Word tags: DEM VBZ DEM NN CC VBZ • POS strings: drt-num adt-asp-fin-fin-thm drt-num nan-nin und adt-adt-asp-fin-fin-thm-agrb-oth • Morphemes: ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áyi • 'He is building that house and he is still building it.' • [Tree diagram; phrasal node labels: S, S, S, VP, VP, NP, NP]
Treebank • Worth it to treebank Blackfoot? • Cons: - lots of researcher hours & money - time might be better spent elsewhere, e.g., elicitation • Pros: - might significantly improve search ∴ research efficiency - automated parsing may be relatively easy
Nitsííkoohtaahsi'taki