blackfoot corpus

Treebanking a Blackfoot Corpus, Joel Dunham, UBC



  1. Treebanking a Blackfoot Corpus Joel Dunham UBC

  2. Overview • Blackfoot language • Online Linguistic Database (OLD) • Blackfoot OLD (BOLD) • BOLD Annotation/treebanking

  3. Blackfoot language • Algonquian (Plains): Alberta & Montana • Endangered: < 5000 speakers • Fieldwork: UBC, UCalgary, UMontana

  4. Blackfoot language • Salient properties: • Direct-inverse system • Grammatical animacy • Agglutinative

  5. Blackfoot language • Agglutinative: • kimaaksawohpokooyimasi • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • “Why don't you eat with her?”

  6. OLD • Online Linguistic Database • www.onlinelinguisticdatabase.org • Web application for documenting and analyzing languages

  7. OLD • Open source (GPL): Python (Pylons), MySQL, HTML/JS • Powerful search capability: regex, boolean • Multi-user, web-based, collaborative • Multi-media: audio, video, images, text • Auto-linking of morphemes

  8. Blackfoot OLD • OLD web application for Blackfoot (BLAOLD; funded by SSHRC) • http://blaold.webfactional.com/ • Other OLD web apps: • Okanagan OLD (OKAOLD) • Plains Cree OLD (CRKOLD) • etc.

  9. BLAOLD

  10. BLAOLD • Forms (morphemes & sentences): 21,788 (2011-07-25) • morphemes: 5,094 • sentences: 3,193 • unclassified: 13,501 • (word tokens: 20,577)

  11. BLAOLD • Sources: • textual: 16,209 forms • field work: 5,569 forms (and growing...)

  12. BLAOLD • Collections • texts created by ordered references to forms • 135 Collections at present • E.g., Creation Story: • http://blaold.webfactional.com/creationstory
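
The idea of a Collection as an ordered list of references to Forms can be sketched as below. This is only an illustration of the data relationship, not the OLD's actual schema: the class names, field names, and id numbers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Form:
    # Hypothetical minimal Form record; the real OLD stores many more fields.
    id: int
    transcription: str
    translation: Optional[str] = None


@dataclass
class Collection:
    # A text is stored as an ordered list of Form ids, not as copied text.
    title: str
    form_ids: List[int] = field(default_factory=list)

    def render(self, forms_by_id: Dict[int, Form]) -> List[str]:
        """Resolve the references, in order, into transcription lines."""
        return [forms_by_id[form_id].transcription for form_id in self.form_ids]


# The ids below are invented for illustration.
forms = {
    101: Form(101, "kimaaksawohpokooyimasi", "Why don't you eat with her?"),
    102: Form(102, "oma aakííwa iihpóma ónnikii"),
}
text = Collection(title="Example text", form_ids=[102, 101])
lines = text.render(forms)
print("\n".join(lines))
```

Because the Collection only holds references, a correction made to a Form would show up in every text that cites it.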

  13. BLAOLD Collection (text) created by referencing Forms entered into the BLAOLD. • ...

  14. BLAOLD • Files: • Associate Forms, Collections & Files • 2,159 files (2011-07-25) • 1,744 audio • 259 image • 148 text • 4 video

  15. Form with morphemic analysis • Morpheme segmentation and morpheme gloss lines • Blue text indicates links to morphemic Form entries found by the system • POS string auto-generated: “prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num” • Associated WAV file (tagged as an object language utterance) • Associated JPG (used as a stimulus in elicitation)

  16. BLAOLD: Goal • Improve efficiency of data collection, dissemination & analysis • automate subtasks & improve search • morphological parsing • treebanking?

  17. Morphological Parser • “A morphological parser for Blackfoot” (Dunham, 2010; WAIL) • input = transcription: • kimaaksawohpokooyimasi • output = <segmentation, morph glosses, POSes>: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
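
The three output tiers line up morpheme by morpheme, which a short sketch can make concrete. The function name `align_tiers` is an assumption for illustration and is not part of the parser described here; the three strings are the ones given on the slide.

```python
def align_tiers(segmentation, glosses, pos_tags, delimiter="-"):
    """Split each tier on the delimiter and zip them into per-morpheme triples.

    Raises ValueError if the tiers do not have the same number of morphemes.
    """
    tiers = [tier.split(delimiter) for tier in (segmentation, glosses, pos_tags)]
    lengths = [len(tier) for tier in tiers]
    if len(set(lengths)) != 1:
        raise ValueError(f"tiers differ in length: {lengths}")
    return list(zip(*tiers))


triples = align_tiers(
    "k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
)
for morpheme, gloss, pos in triples:
    print(f"{morpheme:8} {gloss:6} {pos}")
```

A length check of this kind is a cheap way to catch a parse whose segmentation and gloss lines have drifted out of step.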

  18. Morphological Parser • Input: kimaaksawohpokooyimasi • FST: • Phonology FST: phonology (from a grammar) hand-coded • Morphotactics (lexicon) FST: morphotactics & lexicon extracted programmatically from the BLAOLD • POS/morphemic N-grams used to select most probable parse • Output: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb • Accuracy: ca. 70% • Challenges: • variations in transcription • no hard and fast spelling rules • researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
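
The N-gram selection step can be illustrated with a toy bigram model over POS tags. This is a minimal sketch under stated assumptions, not the parser's actual model: the training strings are a tiny made-up sample assembled from POS strings shown in these slides, the add-one smoothing is a choice made for the example, and the second candidate is a deliberately scrambled, hypothetical analysis.

```python
import math
from collections import Counter

# Tiny illustrative training sample (assumption: not the real BLAOLD counts).
training = [
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "adt-asp-fin-fin-thm",
    "drt-num",
    "nan-nin",
]


def bigrams(tags):
    """Return the tag bigrams of a sequence, padded with start/end markers."""
    padded = ["<s>"] + tags + ["</s>"]
    return list(zip(padded, padded[1:]))


bigram_counts = Counter()
context_counts = Counter()
vocab = {"</s>"}
for pos_string in training:
    tags = pos_string.split("-")
    vocab.update(tags)
    for prev, cur in bigrams(tags):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def log_prob(pos_string):
    """Log probability of a POS string under the add-one-smoothed bigram model."""
    total = 0.0
    for prev, cur in bigrams(pos_string.split("-")):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


attested = "agra-adt-oth-adt-vai-fin-thm-agrb-agrb"
scrambled = "oth-agra-adt-adt-vai-fin-thm-agrb-agrb"  # hypothetical competitor
best = max([scrambled, attested], key=log_prob)
print(best)
```

With real data, the candidates would be the alternative analyses produced by the FST for one transcription, and the counts would come from analyzed forms already in the database.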

  19. Morphological Parser • Benefits of a morphological parse(r): • parse online in real time (i.e., during data entry): save researcher time • create more data to improve searching

  20. Morphological Parser • Search example: find all sentences with an overt subject and an overt object • Regex on POS string for 2 nominal roots: • /n[ai][nr].*n[ai][nr].*/
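
The POS-string search can be tried out directly with Python's `re` module. The sketch below applies the slide's pattern to POS strings that appear elsewhere in these slides; the helper `count_nominal_words` is an added assumption showing one possible stricter check, not something the BLAOLD is stated to provide.

```python
import re

# The pattern from the slide: two nominal-root tags anywhere in the POS string.
NOMINAL_PAIR = re.compile(r"n[ai][nr].*n[ai][nr].*")
NOMINAL_TAG = re.compile(r"n[ai][nr]")


def has_two_nominal_roots(pos_string):
    """True if the slide's regex finds two nominal-root tags in the string."""
    return NOMINAL_PAIR.search(pos_string) is not None


def count_nominal_words(pos_string):
    """Count whitespace-separated words that contain a nominal-root tag."""
    count = 0
    for word in pos_string.split():
        if any(NOMINAL_TAG.fullmatch(tag) for tag in word.split("-")):
            count += 1
    return count


pos_strings = {
    "slide 15": "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num",
    "slide 24": "drt-num adt-asp-fin-fin-thm drt-num nan-nin und "
                "adt-adt-asp-fin-fin-thm-agrb-oth",
    "slide 17": "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
}
results = {
    name: (has_two_nominal_roots(s), count_nominal_words(s))
    for name, s in pos_strings.items()
}
for name, (matched, nominal_words) in results.items():
    print(name, matched, nominal_words)
```

The slide 24 string matches the regex even though both nominal tags sit inside a single word (`nan-nin`), so a regex hit alone does not guarantee two separate nominals; counting words that contain a nominal tag is one way to filter such cases.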

  21. Morphological Parser • /n[ai][nr].*n[ai][nr].*/ • [Screenshots of search results: a good match and a bad match]

  22. Treebank • (S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii))) • TGrep: 'S < (NP $. (VP < NP))' • [Tree diagram: S dominates NP and VP; NP dominates DT and NP; VP dominates VBD and NP]
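
To show what the TGrep pattern asks for, the sketch below parses the bracketed tree with a small hand-written reader and checks the same configuration: an S that immediately dominates an NP whose immediately following sister is a VP that immediately dominates an NP. It is an illustration only, not TGrep itself, and every function name is an assumption; the second tree is a modified, hypothetical variant with the object NP removed.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected '('")
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a leaf word
                position += 1
        position += 1  # consume ')'
        return (label, children)

    return read_node()


def label_of(node):
    """Return the node label, or None for leaf words."""
    return node[0] if isinstance(node, tuple) else None


def is_transitive_clause(node):
    """Check 'S < (NP $. (VP < NP))' at this node only."""
    if label_of(node) != "S":
        return False
    children = node[1]
    for left, right in zip(children, children[1:]):
        if label_of(left) == "NP" and label_of(right) == "VP":
            if any(label_of(child) == "NP" for child in right[1]):
                return True
    return False


def find_matches(node):
    """Yield every node in the tree that satisfies the pattern."""
    if not isinstance(node, tuple):
        return
    if is_transitive_clause(node):
        yield node
    for child in node[1]:
        yield from find_matches(child)


tree = parse_tree("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))")
no_object = parse_tree("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))")
print(len(list(find_matches(tree))), len(list(find_matches(no_object))))
```

Unlike the regex on a POS string, a structural query of this kind distinguishes a subject NP and an object NP by their positions in the tree.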

  23. Treebank • Assuming a flat morphological structure, the syntactic phrase structure parsing of Blackfoot may actually be easy relative to English • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words

  24. Treebank • [Tree diagram; node labels: S, S, S, VP, VP, NP, NP] • Word tags: DEM VBZ DEM NN CC VBZ • POS strings: drt-num adt-asp-fin-fin-thm drt-num nan-nin und adt-adt-asp-fin-fin-thm-agrb-oth • ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áyi • “He is building that house and he is still building it.”

  25. Treebank • Worth it to treebank Blackfoot? • Pros: • might significantly improve search and therefore research efficiency • automated parsing may be relatively easy • Cons: • lots of researcher hours & money • time might be better spent elsewhere, e.g., elicitation

  26. Nitsííkoohtaahsi'taki
