Bridging the ROUGE/Human Evaluation Gap in Multi-Document Summarization




  1. Bridging the ROUGE/Human Evaluation Gap in Multi-Document Summarization. John M. Conroy, Judith D. Schlesinger, IDA Center for Computing Sciences, USA; Dianne P. O’Leary, University of Maryland, College Park, USA

  2. Outline • CLASSY 07 – Main task: System 24. – Update task: System 44. • Gaps in performance and metrics. • Comparison with MSE 2006 (panel tomorrow). • Better metrics? (panel tomorrow)

  3. CLASSY (Clustering, Linguistics, And Statistics for Summarization Yield) • Linguistic preprocessing. – Shallow parsing – Find sentences and shorten them. • Sentence Scoring. – Approximate Oracle. • Redundancy Removal. – Select a subset of sentences. – LSI and L1-norm QR. • Ordering – TSP

  4. Processing: Structural and Linguistic • Use SGML tags to remove datelines and bylines and to harvest headlines. • Use heuristic patterns to find phrases/clauses/words to eliminate. – Finding sentence boundaries. – Shallow processing. • Removed lead-pronoun sentences and question sentences for 2007.

  5. Linguistic Processing • Eliminations – Gerund phrases – Relative clause appositives – Attributions – Lead adverbs and phrases • For example, On the other hand, … – Medial adverbs • too, however, …
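
A minimal regex sketch of this kind of pattern-based trimming; the patterns below are illustrative examples only, not the authors' actual rules:

```python
import re

# Illustrative trimming patterns (NOT the authors' actual rules): lead adverbial
# phrases, medial adverbs, and simple attributions of the kind listed above.
TRIM_PATTERNS = [
    r"^(For example|On the other hand|However|Moreover),\s*",
    r",\s*(however|too)\s*,",
    r",\s*(he|she|they|officials?)\s+said,?",
]

def shorten_sentence(sentence: str) -> str:
    """Apply each trimming pattern in turn and tidy up whitespace."""
    for pattern in TRIM_PATTERNS:
        sentence = re.sub(pattern, " ", sentence, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", sentence).strip()

print(shorten_sentence("On the other hand, the deal, officials said, was delayed."))
# -> "the deal was delayed."
```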

  6. An Oracle and Average Jo • An oracle might tell us Pr( t ), the probability that a human will choose term t to be included in a summary. • If we had human summaries, we could estimate Pr( t ) from the data – e.g., 0, 1/4, 1/2, 3/4, or 1 if 4 human summaries are provided. – “Average Jo” oracle score: fraction of expected abstract terms (vector space model).
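
A minimal Python sketch of this “Average Jo” score, assuming each human summary and candidate sentence has already been reduced to its normalized terms:

```python
from collections import Counter

def term_probabilities(human_summaries):
    """Pr(t): fraction of the human summaries whose term set contains t."""
    counts = Counter()
    for summary in human_summaries:
        counts.update(set(summary))
    return {t: c / len(human_summaries) for t, c in counts.items()}

def oracle_score(sentence_terms, pr):
    """'Average Jo' score of a sentence: average of Pr(t) over its terms
    (terms never used by any human summarizer get probability 0)."""
    terms = list(sentence_terms)
    if not terms:
        return 0.0
    return sum(pr.get(t, 0.0) for t in terms) / len(terms)

# With four human summaries, Pr(t) takes values in {0, 1/4, 1/2, 3/4, 1}.
pr = term_probabilities([["peace", "talks"], ["peace", "deal"],
                         ["peace", "talks"], ["talks"]])
print(oracle_score(["peace", "talks", "collapse"], pr))  # (0.75 + 0.75 + 0) / 3 = 0.5
```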

  7. The Oracle Pleases Everyone!

  8. Signature Terms • Term: a stemmed (lemmatized), space-delimited string of characters from {a, b, c, …, z}, after the text is lower-cased and all other characters are removed; stop words are NOT removed. • Need to restrict our attention to indicative terms ( signature terms ): terms that occur more often than expected.

  9. Signature Terms • Terms that occur more often than expected relative to the AQUAINT collection. • Based on a 2 × 2 contingency table of relevance counts. • Log-likelihood ratio; equivalent to mutual information. • Dunning 1993; Lin & Hovy 2000.
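
A sketch of this log-likelihood signature-term test, assuming term counts for the topic documents and the background collection are available; the 10.83 cutoff is an illustrative chi-square value, not necessarily the one CLASSY uses:

```python
import math

def log_likelihood(k1, n1, k2, n2):
    """Dunning (1993) log-likelihood ratio for a 2x2 contingency table:
    k1 occurrences of a term among n1 topic-document tokens vs.
    k2 occurrences among n2 background (e.g. AQUAINT) tokens."""
    def ll(k, n, p):
        # binomial log-likelihood, with 0*log(0) treated as 0
        s = 0.0
        if k > 0:
            s += k * math.log(p)
        if n - k > 0:
            s += (n - k) * math.log(1 - p)
        return s
    p = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2 * (ll(k1, n1, k1 / n1) + ll(k2, n2, k2 / n2)
                - ll(k1, n1, p) - ll(k2, n2, p))

def signature_terms(topic_counts, n_topic, bg_counts, n_bg, threshold=10.83):
    """Terms that occur more often than expected in the topic documents.
    threshold=10.83 is the chi-square(1 df) critical value at the 0.001 level."""
    return {t for t, k in topic_counts.items()
            if k / n_topic > bg_counts.get(t, 0) / n_bg
            and log_likelihood(k, n_topic, bg_counts.get(t, 0), n_bg) > threshold}
```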

  10. A Simple Approximation of P( t | τ ) • We approximate P( t | τ ) by P( t | τ ) ≈ (1/4) s( t ) + (1/4) q( t ) + (1/2) ρ( t ), where s( t ) = 1 if t is a signature term (0 otherwise), q( t ) = 1 if t is a query term (0 otherwise), and ρ( t ) = probability that t occurs in a sentence considered for selection. • The score of a sentence is the sum of these term probabilities taken over its terms, divided by its length.
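
A minimal sketch of this scoring rule, assuming the signature and query term sets have been built as above; per the slide, ρ(t) is taken as the fraction of candidate sentences containing t:

```python
from collections import Counter

def rho(candidate_sentences):
    """rho(t): fraction of the candidate sentences in which term t occurs."""
    counts = Counter()
    for sentence in candidate_sentences:
        counts.update(set(sentence))
    return {t: c / len(candidate_sentences) for t, c in counts.items()}

def approx_term_prob(t, signature, query, rho_t):
    """Approximate oracle: P(t|tau) ~= 1/4 s(t) + 1/4 q(t) + 1/2 rho(t)."""
    return (0.25 * float(t in signature) + 0.25 * float(t in query)
            + 0.5 * rho_t.get(t, 0.0))

def sentence_score(sentence_terms, signature, query, rho_t):
    """Sum of approximate term probabilities divided by sentence length."""
    terms = list(sentence_terms)
    if not terms:
        return 0.0
    return sum(approx_term_prob(t, signature, query, rho_t)
               for t in terms) / len(terms)
```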

  11. Correlation with Oracle

  12. Smoothing and Redundancy Removal • Use the approximate oracle to select candidate sentences (~750 words). • Terms as sentence features: build a term–sentence matrix A = [a_ij] whose rows correspond to terms { t 1 , …, t m } and whose columns correspond to sentences { s 1 , …, s n }. • Scaling: each column is scaled by its sentence score. • LSI to reduce the rank to about 0.5 n. • L1 pivoted QR to select sentences.
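
A sketch of this step using NumPy/SciPy; note that SciPy's standard (ℓ2) pivoted QR stands in here for the L1-norm variant named on the slide:

```python
import numpy as np
from scipy.linalg import qr

def select_sentences(A, scores, k):
    """Redundancy-removal sketch.
    A: m x n term-sentence matrix (rows = terms, columns = candidate sentences).
    scores: approximate-oracle score of each sentence (length n).
    k: number of sentences to select."""
    # Scale each column by its sentence score.
    B = A * np.asarray(scores)[np.newaxis, :]
    # LSI: truncate to rank ~ 0.5 n via the SVD.
    r = max(1, B.shape[1] // 2)
    U, S, Vt = np.linalg.svd(B, full_matrices=False)
    B_r = (U[:, :r] * S[:r]) @ Vt[:r, :]
    # Pivoted QR on the reduced matrix; the first k pivot columns correspond to
    # the selected (high-scoring, mutually non-redundant) sentences.
    _, _, piv = qr(B_r, pivoting=True)
    return sorted(piv[:k])
```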

  13. Ordering Sentences • Approximate TSP to increase flow. • Start with the worst: order the lowest-scoring sentence last. • Order the other sentences so that the sum of the distances between adjacent sentences is minimized (TSP). • b_ij = number of words sentences i and j have in common; similarity c_ij = b_ij / √( b_ii b_jj ).
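
A sketch of this ordering step using the c_ij similarity defined above; a greedy nearest-neighbour pass stands in for the approximate TSP solver:

```python
import math

def overlap_similarity(sentences):
    """c_ij = b_ij / sqrt(b_ii * b_jj), where b_ij is the number of distinct
    words sentences i and j share (so b_ii is sentence i's vocabulary size)."""
    sets = [set(s) for s in sentences]
    n = len(sets)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(len(sets[i]) * len(sets[j]))
            c[i][j] = len(sets[i] & sets[j]) / denom if denom else 0.0
    return c

def order_sentences(sentences, scores):
    """Pin the lowest-scoring sentence last, then repeatedly prepend the
    remaining sentence most similar to the current head of the ordering
    (a greedy stand-in for minimizing total adjacent distance)."""
    c = overlap_similarity(sentences)
    order = [min(range(len(sentences)), key=lambda i: scores[i])]  # worst goes last
    remaining = set(range(len(sentences))) - set(order)
    while remaining:
        head = order[0]
        nxt = max(remaining, key=lambda i: c[i][head])
        order.insert(0, nxt)
        remaining.remove(nxt)
    return order
```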

  14. DUC 2007: Main Task

  15. Why the Gap? • Should evaluators = human summarizers? • Advantage: – The person writing a summary judges all summaries. • Disadvantage: – Personal interest (bias?) affects assessment. • Mean human score in DUC 07 was 4.9. – Removing self-assessments, it was 4.7; a t-test indicates humans like their own summaries more than other human summaries. • Do we aim to target every human’s ideal or find a middle ground (ROUGE) to please the masses? Come to the panel discussion…

  16. Linguistics vs. Responsiveness • Evaluators liked summaries ending with a period. [Lucy] (2.8 vs. 2.5, significant at 96% confidence). • But no significant difference in ROUGE-2. • Responsiveness in DUC 07 was supposed to reflect content only, not overall quality. • However, …

  17. Correlating Linguistics and Responsiveness

      Question                Content Resp. 06   Overall Resp. 06   Content Resp. 07
      Grammar                       0.32               0.50               0.60
      Non-Redundancy               -0.37              -0.24              -0.43
      Referential Clarity           0.24               0.53               0.59
      Focus                         0.39               0.62               0.71
      Structure & Coherence         0.13               0.46               0.49
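
The numbers above are correlations between linguistic-quality scores and responsiveness; computing such a correlation (and the paired t-test mentioned on slide 15) is routine with SciPy. The scores below are hypothetical placeholders, not the DUC data:

```python
from scipy import stats

# Hypothetical per-summary scores for illustration only.
grammar        = [3, 4, 2, 5, 4, 3, 2, 5]
responsiveness = [2, 4, 3, 5, 4, 2, 3, 5]

# Pearson correlation, as in the table above.
r, p = stats.pearsonr(grammar, responsiveness)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# Paired t-test of the kind used on slide 15 to compare self-assessed
# vs. other-assessed human summary scores.
self_scores  = [5, 5, 4, 5, 5]
other_scores = [4, 5, 4, 4, 5]
t, p = stats.ttest_rel(self_scores, other_scores)
print(f"t = {t:.2f} (p = {p:.3f})")
```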

  18. Adaptations for Update • Sub-task A: run CLASSY 07 on 10 docs. • Sub-task B: – Use docs A and B to generate signature terms. – Project term-sentence matrix to orthogonal complement of submitted summary. – Select sentences from 8 new documents. • Sub-task C: analogous to sub-task B submission.
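
A sketch of the sub-task B/C projection step, assuming the term vectors of candidate sentences and of the already-submitted summary are stored as matrix columns; the function name is illustrative:

```python
import numpy as np

def deflate_against_summary(A, summary_cols):
    """Project the term-sentence matrix A onto the orthogonal complement of the
    span of the submitted summary's term vectors, so content already covered by
    the earlier summary contributes nothing to the new sentence selection.
    A: m x n matrix of candidate-sentence term vectors (columns).
    summary_cols: m x k matrix of term vectors of the submitted summary."""
    # Orthonormal basis Q for the summary subspace.
    Q, _ = np.linalg.qr(summary_cols)
    # A minus its component in the summary subspace: (I - Q Q^T) A.
    return A - Q @ (Q.T @ A)
```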

  19. Update: Sub-task A

  20. Update: Sub-task B

  21. Update: Sub-task C

  22. Conclusions • CLASSY 07 did extremely well in the ROUGE evaluation for the main task and well in the human evaluation. • A gap between humans and machines still exists. • A gap between ROUGE and responsiveness still exists. • Both human and automatic evaluation should be rethought. (Stay tuned for the panel discussion tomorrow.) • Looking forward to more update evaluations.
