Unsupervised discovery of Construction Grammar representations for under-resourced languages
Bogdan Babych
University of Leeds, Centre for Translation Studies (CTS)
http://www.comp.leeds.ac.uk/bogdan
b.babych@leeds.ac.uk
Corpus annotation for under-resourced languages
• Getting a language on a 'technology map'
• Morphosyntactic annotation & generation
  – Part-of-speech taggers, lemmatisers, paradigms
  – Dependency / constituency parsing, chunking
  – Annotated general-purpose & domain-specific corpora, treebanks
• Starting point for computational applications
  – Addressing data sparseness for inflected languages
  – Language models (for Speech Recognition, MT)
  – Text normalization (Text-to-speech)
Technological value of morphosyntax: MT for under-resourced languages
• Neural MT: generation of lemma sequences + morphological tagging (Conforti et al., 2018)
• Factored SMT: addresses data sparseness & disambiguation (Koehn, 2009)
  – Häuser → Haus | NN.plur.nom.neut → house | N.plur → houses (sketched below)
• RBMT: Analysis, Generation & Transfer pipelines
  – Successful morphological disambiguation → correct translation equivalents
    • Their weight changes (VERB.3pers.sing) every day
    • Some people record their weight changes (NOUN.plur) every day
  – Cascaded disambiguation: morphological ambiguities resolved at the syntax level
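A minimal sketch of the factored idea above, using the slide's Häuser example; the three one-entry dictionaries are illustrative stand-ins for real analysis, transfer and generation models, not an actual system:

```python
# Factored representation: surface form -> (lemma, morphological tag).
SRC_ANALYSIS = {"Häuser": ("Haus", "NN.plur.nom.neut")}   # analysis factor
LEMMA_TRANSFER = {"Haus": "house"}                        # lemma translation
TARGET_GEN = {("house", "plur"): "houses",                # generation factor
              ("house", "sing"): "house"}

def translate_factored(word):
    """Translate via lemma + morphology factors instead of surface forms."""
    lemma, tag = SRC_ANALYSIS[word]
    number = "plur" if "plur" in tag.split(".") else "sing"
    return TARGET_GEN[(LEMMA_TRANSFER[lemma], number)]

print(translate_factored("Häuser"))   # -> houses
```

Translating lemma and morphology separately is what lets unseen inflected forms reuse counts from other forms of the same lemma, which is the data-sparseness point of the slide.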
Corpus annotation practice vs. theoretical lexicogrammar
• Annotation schemes have traditionally relied on theory-neutral, consensual decisions (Leech, 1993; Straka and Straková, 2017)
  – Theoretically sensitive decisions (Garside et al., 1997)
  – Possibility of linguistically unsound, ad-hoc or contradictory solutions
  – Potential errors reduce the usefulness of the annotation
  – Conservative view of linguistic material, missing recent theoretical developments
• Traditionally two separate development stages: the grammar and the lexicon
  – Tagsets & morphosyntactic features; disambiguated tags in a sub-corpus
  – Emission tags for word forms, paradigm classes for lemmas
Corpus annotation practice vs. theoretical lexicogrammar
• Limitation: morphological disambiguation depends on lexical features, e.g. in PPs:
  – [Prep (Adj.Case+Num)? N.Case+Num]PP
  – в (Prep_Case: Gen|Acc|Loc) книжки (Gen+Sing | Nom+Plur | Acc+Plur) (with a book; into books)
  – на (Prep_Case: Acc|Loc) книжки (Gen+Sing | Nom+Plur | Acc+Plur) (onto books)
  – до (Prep_Case: Gen) книжки (Gen+Sing | Nom+Plur | Acc+Plur) (to a book)
• The need for lexicalized morphosyntactic representations (a toy disambiguation sketch follows below)
  – A systematic, lexicalized theoretical framework
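A minimal sketch of what such lexicalized disambiguation amounts to, hard-coding the case-government facts from the examples above; the dictionaries are illustrative stand-ins for a real morphosyntactic lexicon:

```python
# Case government of three Ukrainian prepositions (from the slide).
PREP_GOVERNS = {
    "в":  {"Gen", "Acc", "Loc"},
    "на": {"Acc", "Loc"},
    "до": {"Gen"},
}

# Ambiguous case+number readings of the noun form "книжки".
NOUN_READINGS = {"книжки": {("Gen", "Sing"), ("Nom", "Plur"), ("Acc", "Plur")}}

def disambiguate_pp(prep, noun_form):
    """Keep only the noun readings whose case the preposition can govern."""
    allowed = PREP_GOVERNS[prep]
    return {r for r in NOUN_READINGS[noun_form] if r[0] in allowed}

print(disambiguate_pp("до", "книжки"))   # {('Gen', 'Sing')}: fully resolved
print(disambiguate_pp("в", "книжки"))    # still ambiguous: Gen+Sing / Acc+Plur
```

The point of the slide follows directly: the case filter is a property of the individual preposition, so disambiguation cannot be stated over tags alone and has to live in a lexicalized representation.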
Unsupervised linguistic annotation of under-resourced languages
• Supervised methods need manual annotation
  – Not available for under-resourced languages
• Unsupervised & weakly-supervised methods:
  – More suitable for under-resourced scenarios
  – Smaller but more highly qualified development effort
  – Strong assumptions about expected linguistic structures
  – Models of expected variation (phonological, morphological, syntactic …)
Context: experience of the HyghTra project (FP7 MC IAPP)
• RBMT core architecture (Lingenio GmbH)
  – Transfer-based; syntactic dependencies + semantic features for selectional restrictions
• Corpus-based resource creation & disambiguation
  – Faster development of new translation directions
  – Exploiting similarities between closely related languages (nl→de; pt,es→fr; uk→ru)
  – Alignment of richly annotated, morphologically and syntactically disambiguated corpora
• Under-specified representations: morphological, syntactic, semantic
Lingenio's RBMT lexicon
Ukrainian news corpus
• Low-resource scenario: ~250 million words, not balanced
• News texts collected via targeted crawling
  – Part-of-speech annotation via transfer learning (Babych & Sharoff, 2016)
  – Coverage of the tag-emission & lemmatization lexicon: ~15k words (~91% coverage on news texts)
  – Accuracy: 93% on known & 72% on unknown words
• Available at: http://corpus.leeds.ac.uk/internet.html
• Tasks for unsupervised learning:
  – T1: Discovery of Construction Grammar representations
  – T2: Induction of a wide-coverage morphological lexicon
T1. Discovery of Construction Grammar representations in a Ukrainian corpus
• Construction Grammar framework (Kay & Fillmore, 1999; Fillmore, 2002)
  – Lexicalized morphosyntactic representations
    • Specify syntactic relations, valencies and semantics for the associated linguistic structures (cf. Fillmore, 2013: 112)
    • Cover different levels: morphosyntactic, lexical, phraseological
    • Have underspecified slots for lexical or grammatical valencies that are lexically or morphologically restricted
    • Examples: What's X doing Y; (to) look forward to X
  – Single-stage induction of a morphosyntactic lexicon
    • Syntax is lexicalised = the lexicon carries morphosyntactic annotation
  – Unified framework for single- and multiword expressions (MWEs)
    • Words are not elementary units; MWEs have structure
    • Explains valencies & syntactic variation (lexicalised TAG); a data-structure sketch follows below
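As referenced above, a toy encoding of such entries might look as follows; the class and field names are illustrative assumptions, not the talk's actual representation:

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Slot:
    """One position in a construction: fixed word, open lemma, or open PoS."""
    fixed: Optional[str] = None    # lexically fixed material ("forward", "to")
    lemma: Optional[str] = None    # lemma fixed, inflection left open
    pos: Optional[str] = None      # purely morphological restriction

@dataclass
class Construction:
    name: str
    slots: List[Slot] = field(default_factory=list)

# "(to) look forward to V-ing": mixed lexical and underspecified slots.
look_forward = Construction(
    "(to) look forward to V-ing",
    [Slot(lemma="look"),           # inflects: look / looked / looking
     Slot(fixed="forward"),
     Slot(fixed="to"),
     Slot(pos="V-ing")],           # open valency slot, morphologically restricted
)
```

The mixture of fixed, lemma-restricted and PoS-restricted slots is what makes the same formalism cover both single words and MWEs.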
(to) look forward to V-ing
• Representations of lexicalized structures (CWB format)
• Modeling variation:
  – I look forward to receiving President Tadic
  – He looked forward to arguing the case in court
  – I'm looking forward to being able to see his talk online
    • (an overlap with the "(to) be able to X" construction)
  – Hawking looks forward to knowing (metaphorically, of course) the "mind of God"
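The slides do not reproduce the actual CWB entries; below is a guess at a CQP-style query for this construction (CLAWS-style tags assumed, with VVG/VBG marking -ing forms), plus a minimal pure-Python matcher over (word, lemma, pos) triples:

```python
# Hypothetical CQP query: lemma "look", then "forward to", a short gap
# for parentheticals, then an -ing form filling the open slot.
CQP_QUERY = '[lemma="look"] [word="forward"] [word="to"] []{0,3} [pos="V.G"]'

def match_look_forward(tokens, max_gap=3):
    """tokens: list of (word, lemma, pos). Returns the matched span or None."""
    for i in range(len(tokens) - 3):
        if (tokens[i][1] == "look" and tokens[i + 1][0] == "forward"
                and tokens[i + 2][0] == "to"):
            for j in range(i + 3, min(i + 4 + max_gap, len(tokens))):
                if tokens[j][2] in ("VVG", "VBG"):   # -ing form fills the slot
                    return tokens[i:j + 1]
    return None

sent = [("I", "I", "PNP"), ("looked", "look", "VVD"),
        ("forward", "forward", "AV0"), ("to", "to", "PRP"),
        ("arguing", "argue", "VVG"), ("the", "the", "AT0")]
print(match_look_forward(sent))   # matches "looked forward to arguing"
```

Matching on the lemma rather than the surface form is what captures the look/looked/looking variation in the examples above.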
TAG representations: syntactic variation (initial & auxiliary trees)
[Three slides of tree diagrams, not reproduced; a toy adjunction sketch follows below]
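As a rough stand-in for the missing diagrams, this sketch models the adjunction operation the slides illustrate: an auxiliary tree, whose root and foot node share a label, is spliced into an initial tree at a matching node. The tuple encoding, the labels and the example trees are all illustrative assumptions:

```python
# A tree is (label, children); a foot-node label ends with "*".
def adjoin(tree, aux, target):
    """Splice auxiliary tree `aux` in at every node labelled `target`."""
    label, children = tree
    if label == target:
        return plug_foot(aux, tree)          # original subtree goes to the foot
    return (label, [adjoin(c, aux, target) for c in children])

def plug_foot(aux, subtree):
    label, children = aux
    if label.endswith("*"):                  # foot node: re-attach the subtree
        return subtree
    return (label, [plug_foot(c, subtree) for c in children])

# Initial tree for "look forward to V-ing"; an auxiliary "being able to"
# tree adjoined at the V-ing slot models "looking forward to being able to see".
initial = ("VP", [("V", [("look", [])]),
                  ("Adv", [("forward", [])]),
                  ("PP", [("P", [("to", [])]), ("VPing", [])])])
aux = ("VPing", [("being able to", []), ("VPing*", [])])
print(adjoin(initial, aux, "VPing"))
```

Adjunction is what lets the "(to) be able to X" construction overlap with "look forward to V-ing" without either entry listing the combination explicitly.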
Unsupervised discovery of lexicalized constructions
• Methodology ~ discovering multiword expressions in PoS-annotated corpora (Justeson & Katz, 1995; Babych & Hartley, 2009)
  – Collecting & sorting lexical N-grams and skip-grams
  – Filtering: frequency & lexical salience (see the sketch below)
    • Frequency threshold (>4); association measures (log-likelihood, Mutual Information …)
    • PoS configurations: positive vs. negative filters, statistical tf.idf filters
      – *user interface of; *with user interface
• Generalizing the methods to multilevel annotation
  – [word, lemma, PoS, subclasses, syntactic dependencies …]
  – Computationally intensive: dealing with "longer" N-grams
  – Recurring feature patterns across annotation levels
  – Underspecified representations: partially filled positions
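A minimal sketch of the collection-and-filtering step for plain bigrams, using pointwise mutual information as the association measure; the frequency threshold follows the slide (>4), while the function name and the choice of PMI rather than log-likelihood are illustrative:

```python
import math
from collections import Counter

def salient_bigrams(tokens, min_freq=5, top=20):
    """Rank adjacent word pairs by PMI, after a frequency cut-off."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), f in bigrams.items():
        if f < min_freq:                  # frequency threshold (> 4)
            continue
        # PMI = log2( P(w1,w2) / (P(w1) * P(w2)) )
        pmi = math.log2(f * n / (unigrams[w1] * unigrams[w2]))
        scored.append((round(pmi, 2), f, w1, w2))
    return sorted(scored, reverse=True)[:top]

# Negative PoS filters from the slide (e.g. *user interface of) would be
# applied afterwards, discarding candidates that start or end badly.
```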
Underspecified N-grams: construction candidates & selected lexical classes
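The candidate table itself is not reproduced in this extraction. A minimal sketch of how such candidates can be generated: every position of a PoS-tagged N-gram either keeps its word or is abstracted to its tag, enumerating all partially filled patterns. The string encoding is an illustrative assumption:

```python
from itertools import product

def underspecify(tagged_ngram):
    """tagged_ngram: list of (word, pos) -> all partially lexicalized patterns."""
    options = [(f"{pos} {word}", f"{pos} *") for word, pos in tagged_ngram]
    return [" ".join(combo) for combo in product(*options)]

# 2**3 = 8 patterns, from the fully lexical "NN point IN of NN view"
# down to the fully schematic "NN * IN * NN *".
for pattern in underspecify([("point", "NN"), ("of", "IN"), ("view", "NN")]):
    print(pattern)
```

The fully lexicalized end of this spectrum is exactly what the frequency table on the next slide shows.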
Fully lexicalized constructions (pattern NN IN NN; corpus frequency + instance)
2393 NN point IN of NN view
2104 NN sort IN of NN thing
1272 NN cup IN of NN tea
1014 NN way IN of NN life
 865 NN period IN of NN time
 841 NN lot IN of NN money
 710 NN value IN for NN money
 692 NN kind IN of NN thing
 595 NN quality IN of NN life
 566 NN piece IN of NN paper
 551 NN sense IN of NN humour
 524 NN length IN of NN time
 521 NN division IN of NN labour
 519 NN side IN by NN side
 518 NN lot IN of NN time
 513 NN rate IN of NN interest
 510 NN amount IN of NN money
 477 NN cup IN of NN coffee
 454 NN waste IN of NN time
 449 NN member IN of NN staff
 437 NN amount IN of NN time
 424 NN time IN of NN year
 419 NN rate IN of NN inflation
 417 NN course IN of NN action
 405 NN head IN of NN state
 384 NN matter IN of NN fact
 342 NN lot IN of NN work
 336 NN person IN per NN night
 318 NN sheet IN of NN paper
 301 NN work IN of NN art
 296 NN rule IN of NN law
 294 NN state IN of NN emergency
 286 NN balance IN of NN power
 281 NN breach IN of NN contract
 277 NN sum IN of NN money
 277 NN state IN of NN mind
 277 NN rate IN of NN return
 269 NN hand IN in NN hand
 262 NN duty IN of NN care
 255 NN time IN of NN day
 255 NN secretary IN of NN state
 250 NN source IN of NN information
 250 NN rate IN of NN growth
 250 NN friend IN of NN mine
 247 NN cause IN of NN death
 242 NN sort IN of NN person