Leaving no token behind: comprehensive (and delicious) annotation of MWEs and supersenses Nathan Schneider nert Georgetown University LAW-MWE-CxG • 25 August 2018 • Santa Fe, NM 1
• Goal: corpora in a language annotated with some form of lexical semantics (for NLP, corpus linguistics)
• How to achieve this with good coverage, quality, scalability? 2
Traditional Strategy
Start with a general lexicon like WordNet, apply it to a corpus.
LIMITATIONS: coverage (esp. for MWEs), granularity, language-specificity, cost 3
Lexicon-Free Lexical Semantics
Annotate general categories/criteria at the token level, and identify types as you go. Some options:
1. Focus on a syntactic domain of interest, and annotate at the token level with general criteria [most work on MWEs, e.g. PARSEME Shared Tasks 1.0, 1.1 focusing on VMWE subclasses; Savary et al. 2017, Ramisch et al. 2018]. Next session!
2. Focus on a semantic domain of interest, and annotate at the token level while populating a lexicon or constructicon of types [Dunietz et al. 2015, 2017]. Lori’s keynote tomorrow!
3. Comprehensively annotate at the token level with coarse-grained categories. This talk! 4
Claim
Comprehensive annotation of MWEs and supersenses (semantic classes), without starting from a lexicon, is
‣ structurally simple (labeled segments)
‣ intuitive to annotate (reasonable agreement)
‣ robust to cover long tail of types & constructions
‣ robust to gappy expressions
‣ scalable/cost-effective
‣ conducive to NLP 5
Roadmap
The story of the STREUSLE corpus:
‣ MWEs
  ✦ comprehensive annotation
  ✦ data exploration: notable MWEs/constructions
‣ Supersenses
  ✦ nouns & verbs
  ✦ prepositions 6
STREUSLE: Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions tiny.cc/streusle 7
Comprehensive MWE annotation in STREUSLE [Schneider et al., LREC 2014] 8
Scene: An AMR design meeting in June 2012
"We should annotate all kinds of MWEs in AMR!"
"LOL good luck getting annotators to agree on what’s an MWE"
"CHALLENGE ACCEPTED" 9
MWE Definition
Multiword expression (MWE): 2 or more orthographic words that are tightly associated
• Strong MWEs: idiomatic = not fully predictable in form and/or function
  ‣ non- or semi-compositional: ice cream, daddy longlegs, pay attention
  ‣ unusual morphosyntax: Me/*Him neither; by and large; plural of daddy longlegs?
• Weak MWEs: statistically collocated or formulaic
  ‣ p(heavy rain) > p(strong rain); highly recommended; no amount of … can … 10
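The collocation criterion for weak MWEs, p(heavy rain) > p(strong rain), can be made concrete with a minimal sketch. It assumes bigram probability is estimated as simple relative frequency; the function name `bigram_probs` and the four-sentence toy corpus are invented for illustration and are not part of STREUSLE or its guidelines.

```python
from collections import Counter

def bigram_probs(sentences):
    """Relative frequency of each adjacent word pair in a tokenized corpus."""
    counts = Counter()
    for tokens in sentences:
        # zip pairs each token with its right neighbor
        counts.update(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()}

# Invented toy data, for illustration only.
toy_corpus = [
    "heavy rain fell all night".split(),
    "we expect heavy rain today".split(),
    "strong rain is rare".split(),
    "strong wind and heavy rain".split(),
]

probs = bigram_probs(toy_corpus)
p_heavy = probs.get(("heavy", "rain"), 0.0)
p_strong = probs.get(("strong", "rain"), 0.0)
print(p_heavy > p_strong)
```

On real data, an association measure that controls for word frequency would likely be a better signal than raw bigram frequency; the raw comparison is used here only because it mirrors the inequality on the slide.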
MWE Challenges
• Not superficially apparent in text
  ‣ Syntactic variability
• Number/frequency
  ‣ Too many expressions to list all of them
  ‣ Individually rare, but frequent in aggregate
• Diversity
  ‣ Many different construction types
  ‣ Semantically unrestricted
  ‣ Can be gappy 11
Noam Chomsky
daddy longlegs, hot dog
dry out; dry out the clothes
depend on, come across
pay attention (to); no attention was paid (to); pay close attention (to)
put up with, give in (to)
under the weather
cut and dry
in spite of
pick up where __ left off; pick up where they left off
easy as pie
You’re welcome.
To each his own.
The structure of this paper is as follows. 12
Some syntactically-focused English MWE datasets
• NNCs [Reddy et al. 2011]
• VNCs [Venkatapathy & Joshi 2005, Cook et al. 2008]
• LVCs [Fazly et al. 2007, Hwang et al. 2010, Tu & Roth 2011]
• VPCs [Bannard et al. 2003, McCarthy et al. 2003, Baldwin 2005]
• VPCs + PVs [Tu & Roth 2012]
• PARSEME VMWEs (LVC, VPC, VID, …) [Savary et al. 2017, Ramisch et al. 2018]
• Functional MWEs (complex prepositions, adverbs, …) [Shigeto et al. 2013] 13
English corpora with several kinds of MWEs
• SemCor [Miller et al. 1993]
  ‣ Lexical annotation with WordNet synsets
  ‣ NEs, compound nominals, some phrasal verbs
• Prague CEDT [Hajič et al. 2012]
  ‣ NEs, light verb constructions, phrasal idioms, multiword tlemmas
• Wiki50 [Vincze et al. 2011]
  ‣ NEs, compound nominals, LVCs, VPCs, phrasal idioms 14
Comprehensive Approach
• Radical new approach
  1. Teach annotators the concept of MWE (with examples of many kinds)
  2. Give them sentences
  3. Ask for all the MWEs
[Schneider et al., LREC 2014] 15
Noam_Chomsky refused to give_in_to the vicious daddy_longlegs . 16
Lexical segmentation
Noam_Chomsky refused to give_in_to the vicious daddy_longlegs .
Alan_Black refused to give_in_to the vicious daddy_longlegs . 20
Real Example, Gappy MWE My wife had taken_ her '07_Ford_Fusion _in for a routine oil_change . 21
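The underscore markup in these examples can be read mechanically into lexical units. The following is a minimal sketch, not the STREUSLE tooling: the function name `parse_mwe_markup` is hypothetical, and it assumes that `_` inside a chunk joins words of one MWE, that a trailing `_` opens a gap, and that a leading `_` resumes the most recently opened gappy MWE. Weak-MWE joiners such as `~` are not handled.

```python
def parse_mwe_markup(sentence):
    """Split underscore-marked text into lexical units.

    Each unit is a list of (token_index, word) pairs. Single words are
    units of length 1; MWEs have length 2 or more.
    """
    units = []       # all units, ordered by their first token
    open_gappy = []  # stack of MWEs still waiting for their post-gap part
    index = 0
    for chunk in sentence.split():
        core = chunk.strip("_")
        # A chunk made only of underscores is treated as an ordinary token.
        words = core.split("_") if core else [chunk]
        closes_gap = bool(core) and chunk.startswith("_")
        opens_gap = bool(core) and chunk.endswith("_")

        indexed = []
        for word in words:
            indexed.append((index, word))
            index += 1

        if closes_gap and open_gappy:
            unit = open_gappy.pop()   # continue the MWE interrupted by the gap
            unit.extend(indexed)
        else:
            unit = indexed
            units.append(unit)
        if opens_gap:
            open_gappy.append(unit)
    return units

example = "My wife had taken_ her '07_Ford_Fusion _in for a routine oil_change ."
units = parse_mwe_markup(example)
mwes = [u for u in units if len(u) > 1]
for mwe in mwes:
    print(mwe)
```

A unit whose token indices are not consecutive is gappy; here that is taken … in, with the contiguous MWE '07_Ford_Fusion sitting inside its gap.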
The corpus
• The entire Reviews subsection of the English Web Treebank (Bies et al. 2012), fully annotated for MWEs
  ‣ 723 reviews
  ‣ 3,800 sentences
  ‣ 55,000 words
  ‣ found 3,500 MWE instances [original version]
• Every sentence: negotiated consensus between at least 2 annotators
  ‣ IAA between pairs: ~77% 24
Remarks
• More on resulting guidelines shortly
• Annotators were not shown syntax
  ‣ But the sentences are treebanked, part of English-EWT corpus which has gold Universal Dependencies (included in STREUSLE release)
• Recently added PARSEME VMWE subtypes (single annotator)
• HAMSTER [Chen et al. 2017] 25
STREUSLE: Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions
✦ 55k words of English web reviews
  ✴ 3,000 strong MWE mentions
    ‣ 900 VMWEs with PARSEME subtypes
  ✴ 700 weak MWE mentions
tiny.cc/streusle 26
What a joy to stroll off historic Canyon_Road in Santa_Fe into a gallery with a gorgeous diversity of art 27
I googled restaurants in the area and Fuji_Sushi came_up and reviews were great so I made_ a carry_out _order 28
Kinds of MWEs & other notable constructions in STREUSLE 29
LONG TAIL! white-nosed coati 30
POS pattern | contig. MWEs | gappy MWEs | most frequent types (lowercased lemmas) and their counts
N_N | 331 | 1 | customer service: 31, oil change: 9, wait staff: 5, garage door: 4
^_^ | 325 | 1 | santa fe: 4, dr. shady: 4
V_P | 217 | 44 | work with: 27, deal with: 16, look for: 12, have to: 12, ask for: 8
V_T | 149 | 42 | pick up: 15, check out: 10, show up: 9, end up: 6, give up: 5
V_N | 31 | 107 | take time: 7, give chance: 5, waste time: 5, have experience: 5
A_N | 133 | 3 | front desk: 6, top notch: 6, last minute: 5
V_R | 103 | 30 | come in: 12, come out: 8, take in: 7, stop in: 6, call back: 5
D_N | 83 | 1 | a lot: 30, a bit: 13, a couple: 9
P_N | 67 | 8 | on time: 10, in town: 9, in fact: 7
R_R | 72 | 1 | at least: 10, at best: 7, as well: 6, of course: 5, at all: 5
V_D_N | 46 | 21 | take the time: 11, do a job: 8
V~N | 7 | 56 | do job: 9, waste time: 4
^_^_^ | 63 | | home delivery service: 3, lake forest tots: 3
R~V | 49 | | highly recommend: 43, well spend: 1, pleasantly surprise: 1
P_D_N | 33 | 6 | over the phone: 4, on the side: 3, at this point: 2, on a budget: 2
A_P | 39 | | pleased with: 7, happy with: 6, interested in: 5
P_P | 39 | | out of: 10, due to: 9, because of: 7
V_O | 38 | | thank you: 26, get it: 2, trust me: 2
V_V | 8 | 30 | get do: 8, let know: 5, have do: 4
N~N | 34 | 1 | channel guide: 2, drug seeker: 2, room key: 1, bus route: 1
A~N | 31 | | hidden gem: 3, great job: 2, physical address: 2, many thanks: 2, great guy: 1
V_N_P | 16 | 15 | take care of: 14, have problem with: 5
N_V | 18 | 10 | mind blow: 2, test drive: 2, home make: 2
^_$ | 28 | | bj s: 2, fraiser ’s: 2, ham s: 2, alan ’s: 2, max ’s: 2
D_A | 28 | | a few: 13, a little: 11
R_P | 25 | 1 | all over: 3, even though: 3, instead of: 2, even if: 2
V_A | 19 | 6 | make sure: 7, get busy: 3, get healthy: 2, play dumb: 1
V_P_N | 14 | 6 | go to school: 2, put at ease: 2, be in hands: 2, keep in mind: 1
32
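A tally of this kind can be sketched by grouping MWE instances on their POS-tag sequence and checking whether the token indices are consecutive. This is an illustrative sketch only: `tabulate_patterns`, the `(token_index, lemma, pos)` tuple layout, and the toy instances are assumptions rather than the STREUSLE release format, and the sketch joins tags with `_` only, without distinguishing the `~` patterns shown in the table.

```python
from collections import Counter, defaultdict

def tabulate_patterns(mwes):
    """Count contiguous vs. gappy MWE instances per POS pattern.

    Each MWE is a list of (token_index, lemma, pos) tuples sorted by index.
    """
    table = defaultdict(lambda: {"contig": 0, "gappy": 0, "types": Counter()})
    for mwe in mwes:
        indices = [i for i, _, _ in mwe]
        pattern = "_".join(pos for _, _, pos in mwe)
        # Gappy if the span covered is longer than the number of member tokens.
        is_gappy = (indices[-1] - indices[0] + 1) != len(indices)
        row = table[pattern]
        row["gappy" if is_gappy else "contig"] += 1
        row["types"][" ".join(lemma.lower() for _, lemma, _ in mwe)] += 1
    return dict(table)

# Invented toy instances, for illustration only.
toy_mwes = [
    [(0, "customer", "N"), (1, "service", "N")],
    [(4, "customer", "N"), (5, "service", "N")],
    [(12, "oil", "N"), (13, "change", "N")],
    [(3, "take", "V"), (8, "in", "R")],
]

table = tabulate_patterns(toy_mwes)
for pattern, row in table.items():
    print(pattern, row["contig"], row["gappy"], row["types"].most_common(2))
```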
Names, Dates, Values Dr._Lori_Levin 10 % Harry_,_Prince_of_Wales 3 x the speed Canyon_Road 3 x 4 = 12 Santa_Fe~,~NM A_+ #_1 5_star review 2002_Toyota_Camry 100 square_miles macOS_10.13.6 Jan. 1 , 1980 north_east Fourth_of_July north_-_northeast 33
Natural Kinds, Food Dishes green_tea ice_cream Indian_elephant chicken_salad sandwich my dog is a yellow_lab General_Tso_’s_chicken furcifer_pardalis macaroni_and_cheese brown dog cheese~and~crackers spaghetti~with~meatballs turkey sandwich strawberry banana milkshake 34
Quantifiers, Determiners a_few cats half_a million dollars a_lot of cats half_a mile away several cats half of a mile away plenty of cats Our shirts are the_same 35
Slogans Just_Do_It 36
Constructions
Construction Grammar: Framework positing continuity between lexicon and grammar
LEXICAL … GRAMMATICAL: cats, ice cream; kick the bucket, spill the beans; the Xer, the Yer; SVO
construction = conventionalized form/function pairing of any grammatical shape, level of abstractness
constructicon = structured inventory of constructions characterizing knowledge of a language 37