Building a Discourse-annotated Dutch Text Corpus
Nynke van der Vliet, Ildikó Berzlánovich, Gosse Bouma, Markus Egg, Gisela Redeker
Beyond Semantics DGfS Workshop, Göttingen, February 23-25, 2011
Overview: Introduction; MTO Project; Corpus design; Text selection; Segmentation; Annotation; Discourse structure; Genre structure; Discourse connectives; Lexical cohesion; Preliminary results and future work
Introduction
Modeling Textual Organization (MTO) Program
Build a Dutch text corpus, annotated for discourse structure, genre structure, lexical cohesion, coreference, and discourse connectives
Project Goals:
• Investigate the genre-dependent interaction between discourse structure and lexical cohesion (Project 1, Ildikó Berzlánovich)
• Investigate the mechanisms that establish coherence in text and develop algorithms for discourse parsing (Project 2, Nynke van der Vliet)
http://www.let.rug.nl/mto/
Corpus design
Provide a reliably annotated “gold standard” resource covering a range of genres
Core corpus: 80 texts (length: 190-400 words)
• expository texts: 20 encyclopedia texts, 20 popular scientific news texts
• persuasive texts: 20 fundraising letters, 20 advertisements
Text Selection (1)
Preparation: selection of text material, stripping off ‘text-external’ elements
• Exclude pictures and picture captions
• Exclude genre-specific elements that are not related to rhetorical choices
Text Selection (2) Example
Segmentation (1)
EDU ~ simple sentence: Each donation is valuable!
EDU ~ finite clause: You can build with us by donating, ][ but you can also build with us literally.
EDU ~ fragment functioning as complete utterance: Nice gadgets.
EDU ~ non-restrictive relative clause: This gap is caused by one of the moons of Saturn, Mimas, ][ which disturbs the rings.
Segmentation (2)
EDU ~ embedded discourse unit: However during the night, [ which can last for months on Mercury, ] the temperature decreases to about -185 degrees Celsius.
EDU ~ coordinated VP: At a young age a cataract in her eye was diagnosed ][ and treated.
EDU ~ elliptical clause: The planet turns around its axis in 58.6 days ][ and around the sun in 88.0 days.
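The boundary notation in the segmentation examples can be turned into explicit EDU lists. The following is a minimal sketch, assuming that `][` marks a boundary between adjacent EDUs and that `[ ... ]` encloses an EDU embedded in another one. The function name `split_edus` is hypothetical and is not part of the project's tooling.

```python
import re


def split_edus(text):
    """Split a sentence written in the slides' notation into EDUs.

    Assumed notation: '][' separates adjacent EDUs, and '[ ... ]' wraps an
    EDU embedded inside a host EDU. For an embedded unit, the host EDU is
    returned first (with its two halves rejoined), then the embedded EDU.
    """
    if "][" in text:
        return [part.strip() for part in text.split("][")]
    match = re.search(r"\[(.*?)\]", text)
    if match:
        host = text[:match.start()] + text[match.end():]
        host = re.sub(r"\s+", " ", host).strip()
        return [host, match.group(1).strip()]
    return [text.strip()]


print(split_edus("At a young age a cataract in her eye was diagnosed ][ and treated."))
print(split_edus("However during the night, [ which can last for months on Mercury, ] "
                 "the temperature decreases to about -185 degrees Celsius."))
```

This only reads boundaries that an annotator has already marked; it does not find them automatically.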
Discourse Structure (1)
Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)
Full hierarchical text structure
Extended Classic RST (30 relations)
Semantic and pragmatic relations
Non-binary trees
Discourse Structure (2)
(37) P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
(38) Help us now in our fight against malaria
(39) by donating today
(40) - within an hour more than 120 children will die needlessly from this deadly disease.
(41) Give generously
(42) and do this today,
(43) so that we can help more children
(44) before it is too late.
Discourse Structure (3)
Multi-satellite construction (non-binary tree)
[Figure: RST tree over units 1-4. Span 1-4 has three children: unit 1, span 2-3, and unit 4, with the relations Motivation and Justify; span 2-3 joins units 2 and 3 by Means.]
1 P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
2 Help us now in our fight against malaria
3 by donating today
4 - within an hour more than 120 children will die needlessly from this deadly disease.
Restriction to binary trees yields implausible analyses
[Figure: two binary alternatives over the same units. (a) Span 1-4 (Motivation) splits into unit 1 and span 2-4; span 2-4 (Justify) splits into span 2-3 and unit 4. (b) Span 1-4 (Justify) splits into span 1-3 and unit 4; span 1-3 (Motivation) splits into unit 1 and span 2-3. In both, span 2-3 joins units 2 and 3 by Means.]
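The multi-satellite analysis of units 1-4 can be represented with a tree type that allows any number of children per node. This is a minimal sketch, not the project's data format. The `Node` class and `max_branching` helper are hypothetical. Attaching Motivation to unit 1 and Justify to unit 4, and treating unit 3 as the Means satellite, are assumptions read off the slide layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A span in an RST tree; leaves are EDUs, inner nodes group children."""
    span: str                       # e.g. "2-3" or "1"
    relation: Optional[str] = None  # relation a satellite holds to its nucleus
    role: str = "nucleus"           # "nucleus" or "satellite"
    children: List["Node"] = field(default_factory=list)


def max_branching(node):
    """Return the largest number of children found at any node."""
    if not node.children:
        return 0
    return max(len(node.children), *(max_branching(c) for c in node.children))


# Multi-satellite construction: two satellites share the nucleus span 2-3.
# The relation-to-unit assignment below is an assumption (see lead-in).
tree = Node("1-4", children=[
    Node("1", relation="Motivation", role="satellite"),
    Node("2-3", children=[
        Node("2"),
        Node("3", relation="Means", role="satellite"),
    ]),
    Node("4", relation="Justify", role="satellite"),
])

print(max_branching(tree))  # a value above 2 means the tree is non-binary
```

A binary-only representation would have to nest one satellite inside the other, which is the choice between the two alternatives shown on the slide.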
Genre structure (1)
Move analysis (Upton and Cohen, 2009)
• Moves = functional components in the text
• Each genre has a particular set of move types
• The moves create a linear, non-hierarchical partition of the text
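The "linear, non-hierarchical partition" property can be stated as a simple check: moves are contiguous, do not overlap, and together cover every EDU. The sketch below assumes moves are stored as `(label, first_edu, last_edu)` tuples in text order. That storage format, the function name, and the EDU spans in the example are invented for illustration; only the move labels come from the fundraising-letter moves listed in this deck.

```python
def is_linear_partition(moves, n_edus):
    """Check that moves cover EDUs 1..n_edus contiguously without overlap.

    `moves` is a list of (label, first_edu, last_edu) tuples in text order.
    """
    expected_start = 1
    for _label, first, last in moves:
        if first != expected_start or last < first:
            return False  # gap, overlap, or empty move
        expected_start = last + 1
    return expected_start == n_edus + 1


# Hypothetical spans, for illustration only.
moves = [
    ("Get attention", 1, 2),
    ("Introduce the cause", 3, 6),
    ("Solicit response", 7, 9),
]
print(is_linear_partition(moves, n_edus=9))
```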
Genre structure (2)
Encyclopedia texts:
• Name
• Define
• Describe
Fundraising letters:
• Get attention
• Introduce the cause
• Establish credentials of organization
• Solicit response
• Offer incentive
• Reference insert
• Express gratitude
• Conclude with pleasantries
Genre structure (3)
Genre structure (4) Mapping the move structure onto RST structure
Discourse Connectives (1)
Why annotate connectives?
• At least at intra-sentential level (but probably also across sentences), connectives should be valuable cues to coherence relations.
• Frequencies of connectives may differ between genres and thus provide a cue for genre classification.
• Genre information may help the parser by biasing the disambiguation of multifunctional connectives, e.g., toward a semantic meaning for expository texts and a pragmatic one for persuasive texts.
Discourse Connectives (2)
(16) With the help of research much has already been achieved.
(17) But to protect you and others from the consequences of diabetes
(18) more research is needed.
(19) That is why we keep asking for your support.
Lexical cohesion (1)
• Lexical cohesive items build up graph structures in the text
• For each lexical item, lexical links to items in preceding and following EDUs are identified
EDU5 [ After the forming of the sun and the solar system, our star began its long existence as a so-called dwarf star ]
EDU6 [ In the dwarf phase of its life, the energy that the sun gives off is generated in its core through the fusion of hydrogen into helium. ]
EDU7 [ The sun is about five billion years ]
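The idea of linking each lexical item to items in other EDUs can be sketched for the simplest case, exact repetition. This is an illustration under stated assumptions, not the project's annotation scheme. The function name is hypothetical, the item lists for EDU5-EDU7 are a hand-picked guess at the relevant items, and other cohesive relations would need a lexical resource that is not modeled here.

```python
from itertools import combinations


def repetition_links(items_by_edu):
    """Link identical lexical items that occur in different EDUs.

    Returns (earlier_edu, later_edu, item) triples. Only exact repetition
    is handled; this is a simplification for illustration.
    """
    links = []
    for (edu_a, items_a), (edu_b, items_b) in combinations(sorted(items_by_edu.items()), 2):
        for item in sorted(set(items_a) & set(items_b)):
            links.append((edu_a, edu_b, item))
    return links


# Hand-picked items per EDU (an assumption, not the annotated data).
items = {
    5: ["sun", "solar system", "star", "dwarf star"],
    6: ["dwarf phase", "life", "energy", "sun", "core", "fusion", "hydrogen", "helium"],
    7: ["sun", "years"],
}
print(repetition_links(items))
```

The resulting triples are the edges of a graph whose nodes are the items in each EDU, which matches the "graph structures" wording of the slide.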
Annotations (1)
Segmentation
• Detailed manual with rules and examples
• Reliability: 25% of the material, K = 0.98 (fundraising letters and encyclopedia)
Coherence analysis (RST)
• Relation definitions as published on the RST website
• Consensus procedure: each final analysis is based on at least two independent first versions and intensive team discussion (Berzlánovich, Redeker, van der Vliet)
• Reliability: K = 0.88 for the spans, 0.82 for nuclearity, and 0.57 for labeling
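The slides report agreement as K without naming the statistic. Assuming K is Cohen's kappa for two annotators, it corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below uses made-up boundary labels ("B" = boundary, "N" = no boundary), not the project's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Made-up labels for four candidate boundary positions.
annotator_1 = ["B", "N", "B", "N"]
annotator_2 = ["B", "N", "N", "N"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (p_o = 0.75) and chance agreement is p_e = 0.5, which gives kappa = 0.5.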
Annotations (2)
Genre structure (move analysis)
• Detailed manual
• Final analysis by consensus among two coders (Berzlánovich, Redeker)
• Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
Lexical cohesion
• Detailed manual, training of the coders
• Final analysis by consensus among two coders (Berzlánovich with Rensema/Wagenaar)
• Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
Coherence, Cohesion, and Genre
Preliminary results on genre-sensitivity of coherence and cohesion (comparing encyclopedia texts and fundraising letters)
Genre difference in coherence
• Presentational relations are much more frequent in persuasive texts than in expository texts.
Genre difference in cohesion
• Different discourse connectives in expository and persuasive texts.
• Systematic semantic relations are more frequent in expository texts than in persuasive texts.
Genre-specific interaction of coherence and cohesion
• Coherence and cohesion are closely aligned in expository texts, but not in persuasive texts.
Future plans
Automatic discourse parsing
- automatic segmentation (basic program already achieves good precision (0.72) and recall (0.75))
- determine the validity of genre, connectives, coreference, and lexical cohesion relations as cues for the recognition of RST relations (using machine learning)
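Precision and recall for a segmenter are typically computed by comparing predicted boundary positions with gold-standard ones. The sketch below assumes boundaries are compared as sets of token offsets. This is a common evaluation setup, not necessarily the one used in the project, and the function names and example offsets are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def boundary_scores(predicted, gold):
    """Precision, recall and F1 over sets of boundary positions."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical token offsets of EDU boundaries in one text.
print(boundary_scores(predicted=[3, 7, 12], gold=[3, 7, 10, 12]))

# F1 derived from the precision and recall reported on the slide.
print(round(f_score(0.72, 0.75), 2))
```

Under this definition, the reported precision of 0.72 and recall of 0.75 correspond to an F1 of about 0.73. That figure is derived here and is not reported in the slides.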
Thank you