Ranking the annotators: An agreement study on argumentation structure
Andreas Peldszus, Manfred Stede
Applied Computational Linguistics, University of Potsdam
The 7th Linguistic Annotation Workshop: Interoperability with Discourse
ACL Workshop, Sofia, August 8-9, 2013
Introduction

Classic reliability study:
• 2 or 3 annotators
• authors, field experts, at least motivated and experienced annotators
• measure agreement, identify sources of disagreement

Crowd-sourced corpus:
• 100-x annotators
• crowd workers
• bias correction [Snow et al., 2008], outlier identification and finding systematic differences [Bhardwaj et al., 2010], spammer detection [Raykar and Yu, 2012]

Classroom annotation:
• 20-30 annotators
• students with different ability and motivation, obligatory participation
• do both: test reliability & identify and group characteristic annotation behaviour
Outline
1 Introduction
2 Experiment
3 Evaluation
4 Ranking and clustering the annotators
Experiment
Task: Argumentation Structure

Scheme based on Freeman [1991, 2011]:
• node types = argumentative role
  - proponent (presents and defends claims)
  - opponent (critically questions)
• link types = argumentative function
  - support own claims (normal support, or support by example)
  - attack the other's claims (rebut, undercut)

This annotation is tough!
• fully connected discourse structure
• unitizing ADUs from EDUs is already a complex text-understanding task
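To make the scheme concrete, here is a minimal data-model sketch in Python (not the authors' annotation tool; all class and field names are illustrative): each segment receives a role, a function, and, except for the central claim, a target segment that it supports or attacks.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Role(Enum):
    PROPONENT = "proponent"   # presents and defends claims
    OPPONENT = "opponent"     # critically questions them

class Function(Enum):
    THESIS = "thesis"         # the central claim, no outgoing link
    SUPPORT = "support"       # normal support of one's own claim
    EXAMPLE = "example"       # support by example
    REBUT = "rebut"           # attack on a claim itself
    UNDERCUT = "undercut"     # attack on the inference to a claim

@dataclass
class Segment:
    index: int                    # 1-based position in the micro-text
    role: Role
    function: Function
    target: Optional[int] = None  # index of the supported/attacked segment

# one possible annotation of a single segment: an opponent segment
# rebutting the central claim made in segment 3
seg4 = Segment(index=4, role=Role.OPPONENT, function=Function.REBUT, target=3)
```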
Experiment
Data: Micro-Texts

Thus, we use micro-texts:
• 23 short, constructed, German texts
• each text exactly 5 segments long
• each segment is argumentatively relevant
• covering different argumentative configurations

A (translated) example:
[Energy-saving light bulbs contain a considerable amount of toxic substances.]1
[A customary lamp can for instance contain up to five milligrams of quicksilver.]2
[For this reason, they should be taken off the market,]3
[unless they are virtually unbreakable.]4
[This, however, is simply not the case.]5
Experiment
Setup: Classroom Annotation

Obligatory annotation in class with 26 undergraduate students:
• minimal training
  - 5 min. introduction
  - 30 min. reading guidelines (6 p.)
  - very brief question answering
• 45 min. annotation

Annotation in three steps:
• identify the central claim / thesis
• decide on the argumentative role for each segment
• decide on the argumentative function for each segment
Evaluation: Preparation

Rewrite graphs as a list of (relational) segment labels:
1:PSNS(3) 2:PSES(1) 3:PT() 4:OARS(3) 5:PARS(4)
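The flattening can be pictured as follows: each segment's role, function, and target are packed into one label string. The sketch below uses an assumed reading of the letter codes (e.g. "4:OARS(3)" as opponent, attack by rebutting, single premise, targeting segment 3); the mapping tables are illustrative, not necessarily the paper's exact conventions.

```python
# Assumed abbreviation tables; illustrative, not necessarily the paper's exact codes.
ROLE_CODE = {"proponent": "P", "opponent": "O"}
FUNCTION_CODE = {"thesis": "T", "support": "SN", "example": "SE",
                 "rebut": "AR", "undercut": "AU"}

def segment_label(index, role, function, target=None, linked=False):
    """Flatten one segment's annotation into a relational label string."""
    code = ROLE_CODE[role] + FUNCTION_CODE[function]
    if function != "thesis":
        # assumed: linked vs. single premise, corresponding to the "comb" level
        code += "L" if linked else "S"
    return f"{index}:{code}({'' if target is None else target})"

# one annotator's structure for the example micro-text
labels = [segment_label(1, "proponent", "support", 3),
          segment_label(2, "proponent", "example", 1),
          segment_label(3, "proponent", "thesis"),
          segment_label(4, "opponent", "rebut", 3),
          segment_label(5, "proponent", "rebut", 4)]
print(" ".join(labels))   # -> 1:PSNS(3) 2:PSES(1) 3:PT() 4:OARS(3) 5:PARS(4)
```

Running the sketch reproduces the label list shown above for the example micro-text.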
Evaluation: Results

level                   #cats    A_O     A_E     κ
role                      2      0.78    0.55    0.521
typegen                   3      0.72    0.33    0.579
type                      5      0.61    0.26    0.469
comb                      2      0.73    0.50    0.458
target                   (9)     0.58    0.17    0.490
role+typegen              5      0.66    0.25    0.541
role+type                 9      0.56    0.20    0.450
role+type+comb           15      0.49    0.16    0.392
role+type+comb+target   (71)     0.44    0.08    0.384

Unweighted scores in κ [Fleiss, 1971], weighted scores in α [Krippendorff, 1980].

• low agreement for the full task
• varying difficulty on the simple levels
• other complex levels: target identification has only a small impact
• hierarchically weighted IAA yields slightly better results
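The unweighted coefficient above is Fleiss' multi-rater κ, computed from the observed agreement A_O and the expected agreement A_E. Below is a minimal, self-contained sketch of that computation (not the authors' evaluation code; it assumes a complete rating matrix, i.e. every segment labelled by every annotator, and all function and variable names are illustrative).

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list per item, each containing the category label
    assigned by every annotator (same number of annotators per item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])

    p_i_sum = 0.0                 # sum of per-item agreements P_i
    category_counts = Counter()   # how often each category was used overall
    for item in ratings:
        counts = Counter(item)
        category_counts.update(counts)
        p_i_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))

    a_o = p_i_sum / n_items       # observed agreement A_O
    total = n_items * n_raters
    a_e = sum((n / total) ** 2    # expected agreement A_E by chance
              for n in category_counts.values())
    return (a_o - a_e) / (1 - a_e)

# toy usage: 3 segments, 4 annotators labelling argumentative role
print(fleiss_kappa([
    ["prop", "prop", "prop", "opp"],
    ["prop", "opp", "opp", "opp"],
    ["prop", "prop", "prop", "prop"],
]))   # -> 0.25
```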