
Exploring PropBanks for English and Hindi



  1. Exploring PropBanks for English and Hindi Ashwini Vaidya Dept of Linguistics University of Colorado, Boulder

  2. Why is semantic information important? • Imagine an automatic question answering system • Who created the first effective polio vaccine? • Two possible choices: – Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine – The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

  3. Question Answering • Who created the first effective polio vaccine? – Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine – The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

  4. Question Answering • Who created the first effective polio vaccine? – [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine – [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh

  5. Question Answering • Who created the first effective polio vaccine? – [agent Becton Dickinson] created the [theme first disposable syringe] for use with the mass administration of the first effective polio vaccine – [theme The first effective polio vaccine] was created in 1952 by [agent Jonas Salk] at the University of Pittsburgh

  6. Question Answering • We need semantic information to prefer the right answer • The theme of create should be ‘the first effective polio vaccine’ • The theme in the first sentence was ‘the first disposable syringe’ • We can filter out the wrong answer
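The filtering step on this slide can be sketched in Python. This is a minimal illustration, assuming the two candidate sentences have already been labelled with agent and theme roles as on the previous slide. The `Proposition` class and the `answer_who` function are hypothetical names used only for this example; they are not part of any PropBank tool.

```python
from dataclasses import dataclass


@dataclass
class Proposition:
    """One labelled predicate with its agent and theme (hand-labelled here)."""
    predicate: str
    agent: str
    theme: str


# Role labels copied by hand from the two candidate sentences on the slides.
candidates = [
    Proposition("create", "Becton Dickinson", "first disposable syringe"),
    Proposition("create", "Jonas Salk", "The first effective polio vaccine"),
]


def answer_who(predicate, theme, propositions):
    """Return the agents of propositions whose predicate and theme both match."""
    wanted = theme.lower()
    return [
        p.agent
        for p in propositions
        if p.predicate == predicate and p.theme.lower() == wanted
    ]


answers = answer_who("create", "the first effective polio vaccine", candidates)
print(answers)
```

Both sentences mention the polio vaccine, but only the second has it as the theme of create, so the syringe sentence is filtered out.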

  7. We need semantic information • To find out about events and their participants • To capture semantic information across syntactic variation

  8. Semantic information • Semantic information about verbs and participants expressed through semantic roles • Agent, Experiencer, Theme, Result etc. • However, difficult to have a standard set of thematic roles

  9. Proposition Bank • Proposition Bank (PropBank) provides a way to carry out general purpose Semantic role labelling • A PropBank is a large annotated corpus of predicate-argument information • A set of semantic roles is defined for each verb • A syntactically parsed corpus is then tagged with verb-specific semantic role information

  10. Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels

  11. Proposition Bank • The first (English) PropBank was created on a syntactically parsed Wall Street Journal corpus of 1 million words • PropBank annotation has also been done on different genres, e.g. web text, biomedical text • Arabic, Chinese & Hindi PropBanks have been created

  12. English PropBank • English PropBank envisioned as the next level of Penn Treebank (Kingsbury & Palmer, 2003) • Added a layer of predicate-argument information to the Penn Treebank • Broad in its coverage: covers every instance of a verb and its semantic arguments in the corpus • Amenable to collecting representative statistics

  13. English PropBank Annotation • Two steps are involved in annotation – Choose a sense ID for the predicate – Annotate the arguments of that predicate with semantic roles • This requires two components: frame files and PropBank tagset

  14. PropBank Frame files • PropBank defines semantic roles on a verb-by-verb basis • This is defined in a verb lexicon consisting of frame files • Each predicate will have a set of roles associated with a distinct usage • A polysemous predicate can have several rolesets within its frame file

  15. An example • John rings the bell
      ring.01: Make sound of bell
        Arg0: Causer of ringing
        Arg1: Thing rung
        Arg2: Ring for

  16. An example • John rings the bell • Tall aspen trees ring the lake
      ring.01: Make sound of bell
        Arg0: Causer of ringing
        Arg1: Thing rung
        Arg2: Ring for
      ring.02: To surround
        Arg1: Surrounding entity
        Arg2: Surrounded entity

  17. An example • [John] rings [the bell] (ring.01) • [Tall aspen trees] ring [the lake] (ring.02)
      ring.01: Make sound of bell
        Arg0: Causer of ringing
        Arg1: Thing rung
        Arg2: Ring for
      ring.02: To surround
        Arg1: Surrounding entity
        Arg2: Surrounded entity

  18. An example • [Arg0 John] rings [Arg1 the bell] (ring.01) • [Arg1 Tall aspen trees] ring [Arg2 the lake] (ring.02)
      ring.01: Make sound of bell
        Arg0: Causer of ringing
        Arg1: Thing rung
        Arg2: Ring for
      ring.02: To surround
        Arg1: Surrounding entity
        Arg2: Surrounded entity
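The two rolesets for ring can be written down as a small lookup structure, which makes the two annotation steps (choose a sense ID, then label arguments against that roleset) concrete. This is only a sketch: the `FRAMES` dictionary and the `describe` function are hypothetical, and real PropBank frame files are XML documents with more fields than shown here.

```python
# Hypothetical in-memory form of the frame file for "ring".
# Roleset glosses and role descriptions are copied from the slide.
FRAMES = {
    "ring": {
        "ring.01": {
            "gloss": "Make sound of bell",
            "roles": {
                "Arg0": "Causer of ringing",
                "Arg1": "Thing rung",
                "Arg2": "Ring for",
            },
        },
        "ring.02": {
            "gloss": "To surround",
            "roles": {
                "Arg1": "Surrounding entity",
                "Arg2": "Surrounded entity",
            },
        },
    }
}


def describe(lemma, roleset_id, labelled_args):
    """Pair each labelled argument with the role description from its roleset."""
    roles = FRAMES[lemma][roleset_id]["roles"]
    return {text: f"{label}: {roles[label]}" for label, text in labelled_args}


bell = describe("ring", "ring.01", [("Arg0", "John"), ("Arg1", "the bell")])
lake = describe("ring", "ring.02", [("Arg1", "Tall aspen trees"), ("Arg2", "the lake")])
print(bell)
print(lake)
```

The same label (Arg1) means "thing rung" under ring.01 but "surrounding entity" under ring.02, which is why the sense ID has to be chosen before the arguments are labelled.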

  19. Frame files • The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea, Kingsbury, 2005) • Most frequently occurring verb: say • Small number of verbs had several framesets e.g. go, come, take, make • Most others had only one frameset per file

  20. PropBank annotation pane in Jubilee

  21. English PropBank Tagset • Numbered arguments Arg0, Arg1, and so on until Arg4 • Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose) • Modifiers give additional information about when, where or how the event occurred

  22. PropBank tagset

      Numbered argument   Description
      Arg0                Agent, causer, experiencer
      Arg1                Theme, patient
      Arg2                Instrument, benefactive, attribute
      Arg3                Starting point, benefactive, attribute
      Arg4                Ending point

      • Correspond to the valency requirements of the verb • Or, those that occur with high frequency with that verb

  23. PropBank tagset

      Modifier    Description
      ArgM-LOC    Location
      ArgM-TMP    Time
      ArgM-GOL    Goal
      ArgM-MNR    Manner
      ArgM-CAU    Cause
      ArgM-ADV    Adverbial

      • 15 modifier labels for English PropBank • [Arg0 He] studied [Arg1 economic growth] [ArgM-LOC in India]
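As a rough illustration of how the two kinds of labels differ, the example sentence can be split into numbered arguments and modifiers. The `MODIFIERS` table holds only the six tags listed on the slide, and `split_labels` is a hypothetical helper, not a PropBank API.

```python
import re

# The six modifier tags listed on the slide (English PropBank defines 15 in total).
MODIFIERS = {
    "ArgM-LOC": "Location",
    "ArgM-TMP": "Time",
    "ArgM-GOL": "Goal",
    "ArgM-MNR": "Manner",
    "ArgM-CAU": "Cause",
    "ArgM-ADV": "Adverbial",
}

# Numbered arguments are "Arg" followed by a single digit.
NUMBERED = re.compile(r"^Arg\d$")


def split_labels(labelled_spans):
    """Separate verb-specific numbered arguments from general modifiers."""
    core, modifiers = [], []
    for text, label in labelled_spans:
        if NUMBERED.match(label):
            core.append((label, text))
        elif label in MODIFIERS:
            modifiers.append((label, MODIFIERS[label], text))
        else:
            raise ValueError(f"unknown label: {label}")
    return core, modifiers


# "He studied economic growth in India", labelled as on the slide.
example = [("He", "Arg0"), ("economic growth", "Arg1"), ("in India", "ArgM-LOC")]
core, modifiers = split_labels(example)
print(core)
print(modifiers)
```

Numbered arguments need the verb's frame file to be interpreted, while a modifier such as ArgM-LOC carries the same meaning with every verb.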

  24. PropBank tagset • Verb specific and more generalized • Arg0 and Arg1 correspond to Dowty’s Proto-Roles • Leverage the commonalities among semantic roles • Agents, causers, experiencers – Arg0 • Undergoers, patients, themes – Arg1

  25. PropBank tagset • While annotating Arg0 and Arg1: – Unaccusative verbs take Arg1 as their subject argument • [Arg1 The window] broke – Unergatives will take Arg0 • [Arg0 John] sang • Distinction is also made between internally caused events (blush: Arg0) & externally caused events (redden: Arg1)
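The rule on this slide amounts to a lookup from verb class to the label given to an intransitive subject. The sketch below uses only the four example verbs from the slide; the table and function names are hypothetical, and a real lexicon would record this information per roleset in the frame files.

```python
# Label for the subject of an intransitive use, by verb class (from the slide).
SUBJECT_LABEL = {
    "unaccusative": "Arg1",
    "unergative": "Arg0",
    "internally caused": "Arg0",
    "externally caused": "Arg1",
}

# Example verbs from the slide; illustrative only, not an exhaustive lexicon.
VERB_CLASS = {
    "break": "unaccusative",
    "sing": "unergative",
    "blush": "internally caused",
    "redden": "externally caused",
}


def intransitive_subject_label(verb):
    """Return the PropBank label for the subject of an intransitive use of verb."""
    return SUBJECT_LABEL[VERB_CLASS[verb]]


labels = {verb: intransitive_subject_label(verb) for verb in VERB_CLASS}
print(labels)
```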

  26. PropBank tagset • How might these map to the more familiar thematic roles? • Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles

  27. • More frequent Arg0 and Arg1 (85%) are learnt more easily by automatic systems • Arg2 is less frequent, maps to more than one thematic role • Arg3-5 are even more infrequent

  28. Using PropBank • As a computational resource – Train semantic role labellers (Pradhan et al., 2005) – Question answering systems (with FrameNet) – Project semantic roles onto a parallel corpus in another language (Pado & Lapata, 2005) • For linguists, to study various phenomena related to predicate-argument structure

  29. Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels

  30. Developing PropBank for Hindi-Urdu • Hindi-Urdu PropBank is part of a project to develop a multi-layered and multi-representational treebank for Hindi-Urdu – Hindi Dependency Treebank – Hindi PropBank – Hindi Phrase Structure Treebank • Ongoing project at CU-Boulder

  31. Hindi-Urdu PropBank • Corpus of 400,000 words for Hindi • Smaller corpus of 150,000 words for Urdu • Hindi corpus consists of newswire text from ‘Amar Ujala’ • So far: – 220 verb frames – ~100K words annotated

  32. Developing Hindi PropBank • Making a PropBank resource for a new language – Linguistic differences • Capturing relevant language-specific phenomena – Annotation practices • Maintain similar annotation practices – Consistency across PropBanks

  33. Developing Hindi PropBank • PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees • Hindi PropBank is annotated on dependency trees

  34. Dependency tree • Represent relations that hold between constituents (chunks) • Karaka labels show the relations between head verb and its dependents

      दिये (gave)
        k1: राम ने (Raam erg)
        k2: पैसे (money)
        k4: औरत को (woman dat)
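The tree on this slide can be held in a very small data structure: a head chunk with a list of labelled dependents. The `Chunk` class and `relations` function below are hypothetical names for illustration, not the format used by the Hindi Dependency Treebank.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A chunk in the dependency tree with its gloss and labelled dependents."""
    text: str
    gloss: str
    children: list = field(default_factory=list)  # (karaka label, Chunk) pairs

    def attach(self, label, child):
        """Add child as a dependent of this chunk under the given karaka label."""
        self.children.append((label, child))
        return self


# Head verb and its three dependents, as drawn on the slide.
root = Chunk("दिये", "gave")
root.attach("k1", Chunk("राम ने", "Raam erg"))
root.attach("k2", Chunk("पैसे", "money"))
root.attach("k4", Chunk("औरत को", "woman dat"))


def relations(head):
    """List (label, dependent gloss, head gloss) for every dependent of head."""
    return [(label, child.gloss, head.gloss) for label, child in head.children]


rels = relations(root)
for label, dependent, head in rels:
    print(f"{label}: {dependent} -> {head}")
```

Because each karaka label sits on the edge between the verb and a dependent, PropBank roles for Hindi can be attached to the same edges rather than to phrase structure nodes.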

  35. Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling

  36. Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling • Both frame creation and labelling require new strategies for Hindi

  37. Hindi PropBank • Hindi frame files were adapted to include – Morphological causatives – Unaccusative verbs – Experiencers • Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs
