DUC 2006 Pyramid Evaluation, Rebecca J. Passonneau


  1. DUC 2006 Pyramid Evaluation
     Rebecca J. Passonneau
     Center for Computational Learning Systems, Columbia University

  2. Acknowledgments
     - Hoa Dang
     - Columbia University (Kathy McKeown)
     - Guideline contributors, testers (Lucy Vanderwende, Adam Goodkind, Guy LaPalme, . . . )
     - Pyramid Creators (Adam Goodkind, Sergey Sigelman, Lucy Vanderwende, Inderjeet Mani, Qui Long)
     - Participants (21 sites)
     June 8, 2006, DUC Workshop

  3. Pyramid Overview
     - Human summarizers select overlapping content
     - A pyramid represents and quantifies the overlap of Summary Content Units (SCUs) found in multiple model summaries
     - Two pyramid scores based on SCU annotations
       - Original: Precision
       - Modified: Recall
     - Manual annotation reliability assessment
       - Pyramid annotations (LREC 2006)
       - Peer annotations (DUC 2005)
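
To make the precision-like "original" score concrete, here is a minimal sketch. It assumes the commonly described formulation, in which the weight a peer summary earns is divided by the best weight attainable with the same number of SCUs as the peer contains. The function names and all numbers are hypothetical and are not taken from the slides or from DUCView.

```python
def max_weight(pyramid_weights, n_scus):
    """Highest total weight attainable with n_scus SCUs drawn from the pyramid."""
    return sum(sorted(pyramid_weights, reverse=True)[:n_scus])


def original_score(pyramid_weights, matched_weights, n_peer_scus):
    """Observed weight divided by the ideal weight for a summary of the same size.

    Assumption: n_peer_scus counts every SCU in the peer, including SCUs that
    match nothing in the pyramid (those contribute weight 0).
    """
    ideal = max_weight(pyramid_weights, n_peer_scus)
    return sum(matched_weights) / ideal if ideal else 0.0


# Hypothetical pyramid: one SCU of weight 4, two of weight 3,
# three of weight 2, four of weight 1.
pyramid = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1]

# Hypothetical peer with 5 SCUs, 3 of which match pyramid SCUs
# of weight 4, 2, and 1.
score = original_score(pyramid, [4, 2, 1], 5)
print(score)
```

Because the denominator depends on the peer's own size, a short summary made up only of high-weight SCUs can score well, which is why this variant behaves like precision.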

  4. Sample SCU from D0631
     Label: [The Concorde crossed the Atlantic in less than 4 hours]
     Sum1 <making the transatlantic flight in 3 and ½ hrs>
     Sum2 <The Concorde could make the flight in between New York and London or Paris in less than four hours>
     Sum3 <completing its journey from London to New York in about 3 hours, 30 minutes>
     Sum4 <took less than 4 hrs to cross the Atlantic>

  5. Building a Pyramid from Model Summaries (N=4)
     [Figure: pyramid with tiers labeled W=4, W=3, W=2, W=1]
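
The tier structure can be sketched in a few lines, assuming that an SCU's weight is the number of model summaries that express it. That gives weights from 1 to 4 when N=4. The data structure and function name are hypothetical. Only the Concorde label comes from the sample SCU on slide 4; the other two SCUs are invented for illustration.

```python
from collections import defaultdict


def build_pyramid(scu_contributors):
    """Return (weights, tiers) for a mapping of SCU label -> contributing summaries."""
    # Weight = number of distinct model summaries that express the SCU.
    weights = {label: len(set(sums)) for label, sums in scu_contributors.items()}
    # Group SCU labels into tiers keyed by weight.
    tiers = defaultdict(list)
    for label, weight in weights.items():
        tiers[weight].append(label)
    return weights, dict(tiers)


annotations = {
    "The Concorde crossed the Atlantic in less than 4 hours":
        ["Sum1", "Sum2", "Sum3", "Sum4"],
    "hypothetical SCU A": ["Sum1", "Sum3"],  # invented for illustration
    "hypothetical SCU B": ["Sum2"],          # invented for illustration
}

weights, tiers = build_pyramid(annotations)
for weight in sorted(tiers, reverse=True):
    print(f"W={weight}: {tiers[weight]}")
```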

  6. 2006 Pyramid effort
     - New version of DUCView, annotation guidelines
     - Pyramids for 20 of the document sets
       - High clarity ratings
       - Even distribution of assessors (summary writers)
     - Pyramid annotation
       - 6 individuals at 3 sites, 2 with prior experience
     - Peer annotation: 21 peers plus the baseline
       - New procedure: “peer” review
     - Only modified pyramid score (normalized to average # SCUs per model for each pyramid)
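
The normalization in the last bullet can be sketched as follows. The sketch assumes that the denominator is the best weight attainable with as many SCUs as the model summaries contain on average. Rounding that average to a whole number is a further assumption made here, not something the slide states. All names and numbers are hypothetical.

```python
def modified_score(pyramid_weights, matched_weights, scus_per_model):
    """Observed weight divided by the ideal weight for an average-sized model summary."""
    # Average number of SCUs per model summary (rounding is an assumption).
    target_size = round(sum(scus_per_model) / len(scus_per_model))
    # Ideal weight: the target_size heaviest SCUs in the pyramid.
    ideal = sum(sorted(pyramid_weights, reverse=True)[:target_size])
    return sum(matched_weights) / ideal if ideal else 0.0


# Hypothetical pyramid built from 4 model summaries.
pyramid = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
# Hypothetical SCU counts for the 4 model summaries (average 6).
scus_per_model = [7, 6, 5, 6]
# Hypothetical peer matching pyramid SCUs of weight 4, 2, and 1.
score = modified_score(pyramid, [4, 2, 1], scus_per_model)
print(score)
```

Since the denominator no longer depends on the peer's length, every peer for a given pyramid is measured against the same ideal, which makes the score recall-like.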

  7. Brief Comparison with 2005
     - Same characteristics for document clusters
     - 4 instead of 7 model summaries
       - 2005: mean of mean SCU weight = 1.9
       - 2006: mean of mean SCU weight = 1.56
     - Possibly simpler task (cf. Litowski, DUC 2006)
     - Possibly more coherent pyramids
     - Improved systems
       - 19/25 (76%) beat the baseline in 2005
       - 17/21 (81%) beat the baseline in 2006
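
The arithmetic behind these bullets is easy to check. The sketch below reproduces the two baseline percentages from the counts on the slide. It also shows one plausible reading of "mean of mean SCU weight": average the SCU weights within each pyramid, then average those means across pyramids. That reading is an assumption, and the example pyramids are hypothetical.

```python
from statistics import mean


def percent_beating_baseline(n_better, n_systems):
    """Share of systems scoring above the baseline, as a whole-number percentage."""
    return round(100 * n_better / n_systems)


def mean_of_mean_scu_weight(pyramids):
    """Average, over pyramids, of each pyramid's mean SCU weight."""
    return mean(mean(weights) for weights in pyramids)


print(percent_beating_baseline(19, 25))  # 2005 counts from the slide
print(percent_beating_baseline(17, 21))  # 2006 counts from the slide

# Two hypothetical pyramids, each given as a list of SCU weights.
print(mean_of_mean_scu_weight([[4, 1, 1], [2, 2]]))
```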

  8. ANOVA Results
     - Dependent variable: modified score
     - 9 Factors:
       - Peerid (p~0)
       - Setid (p~0)
       - 5 LingQuality ratings
       - Content responsiveness (p=0.0001)
       - Overall responsiveness (includes readability)
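
The analysis on this slide uses nine factors. As a much simpler illustration of the underlying idea, the sketch below computes a one-way ANOVA F statistic for a single factor (peer). It is a deliberate simplification, not the analysis reported here, and the scores are hypothetical.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of score groups."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical modified scores for three peers over four document sets each.
scores_by_peer = [
    [0.10, 0.12, 0.11, 0.13],
    [0.18, 0.20, 0.17, 0.21],
    [0.24, 0.25, 0.22, 0.26],
]
print(one_way_anova_f(scores_by_peer))
```

A large F means that differences between peers are large relative to the variation within each peer's scores. The p-values on the slide would come from comparing such a statistic with the F distribution.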

  9. System Differences (Tukey’s HSD)

     | Peers | Significantly greater than peers |
     |---|---|
     | 1, 17, 18, 25, 35 (N=5) | NIL |
     | 22, 29, 32 (N=3) | 1 |
     | 19, 24, 33 (N=3) | 1, 35, 17, 18 (N=4) |
     | 2, 3, 6, 14, 15 (N=5) | 1, 35, 17, 18, 25 (N=5) |
     | 28 | 1, 35, 17, 18, 25, 29 (N=6) |
     | 27 | 1, 35, 17, 18, 25, 29, 32, 22 (N=8) |
     | 8 | 1, 35, 17, 18, 25, 29, 32, 22, 14 (N=9) |
     | 10, 23 | 1, 35, 17, 18, 25, 29, 32, 22, 14, 19, 5, 33, 24, 3, 6, 2, 15 (N=17) |

  10. For Illustration: Group Means

      | Peers | Mean modified score |
      |---|---|
      | 1, 17, 18, 25, 35 (N=5) | .113 (Δ ~ .06) |
      | 22, 29, 32 (N=3) | .169 |
      | 19, 24, 33 (N=3) | .176 |
      | 2, 3, 6, 14, 15 (N=5) | .199 |
      | 28 | .205 |
      | 27 | .210 |
      | 8 | .214 |
      | 10, 23 | .241 (Δ ~ .03) |

  11. Docset Differences

      | Docsets | Mean pyramid score |
      |---|---|
      | 5 | .065 (Δ ~ .06) |
      | 1, 3, 8, 15, 47 | .133 |
      | 50 | .135 |
      | 45, 30 | .158 |
      | 28 | .164 |
      | 16, 17, 20, 29 | .172 |
      | 27 | .197 |
      | 14 | .229 (Δ ~ .03) |
      | 43 | .252 |
      | 40 | .269 |
      | 24 | .286 |
      | 31 | .357 (Δ ~ .07) |

  12. Content Evaluation
      - Perfect correlation with mean pyramid score per content level

      | Content Assessment | Mean Pyr Score |
      |---|---|
      | 1 | .12 |
      | 2 | .17 |
      | 3 | .19 |
      | 4 | .21 |
      | 5 | .22 |
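
The slide does not say which correlation coefficient is meant. The sketch below applies two standard ones to the five values on this slide: Pearson's r on the raw values and Spearman's rho, computed as Pearson's r on the ranks. The helper names are hypothetical, and the rank helper assumes there are no tied values.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank positions starting at 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


content_level = [1, 2, 3, 4, 5]
mean_pyramid_score = [0.12, 0.17, 0.19, 0.21, 0.22]

print(spearman(content_level, mean_pyramid_score))
print(pearson(content_level, mean_pyramid_score))
```

The mean score rises with every content level, so the rank correlation is exactly 1. The Pearson value is slightly lower because the increase is not linear.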

  13. Comparison with DUC 2005
      - Many more significant differences among peers using Tukey
        - 2005: 2 distinct comparison sets
        - 2006: 8 distinct comparison sets
      - Better correlation with responsiveness
        - 2 assessors in 2005, r=.81; .90
        - 1 assessor in 2006, r=1

  14. Factors Affecting System Scores
      - Differences in document set difficulty/coherence
      - Pyramid characteristics
        - Mean SCU weight
        - Pyramid size and proportion of weight 1 SCUs
      - Score variability
        - 2005: sd = .14
        - 2006: sd = .09
      - Better systems
        - 2005 mean system score range: .20 to .06
        - 2006 mean system score range: .24 to .11

  15. Semantics of Pyramids
      - More highly weighted SCUs
        - more general
        - less dependent on meaning of other SCUs

  16. Generality of Highly Weighted SCUs
      - W=4
        - D0603: Wetlands help control floods
        - D0605: Exercise helps arthritis
      - W=1
        - D0603: In underdeveloped countries the increase of rice-planting has negative impacts on wetlands
        - D0605: Arthroscopic knee surgery appears to reduce pain, for unknown reasons

  17. Semantic Independence of Highly Weighted SCUs
      - W=4
        - D0640: The Kursk sank in the Barents Sea
        - D0617: Egypt Air Flight 990 crashed
      - W=1
        - D0640: The escape hatch [of *] was too badly damaged to dock in 7 attempts
        - D0617: Tail elevators [of *] were in an uneven position, indicating a possible malfunction
