NLG Evaluation
Ehud Reiter (Abdn Uni and Arria/Data2text)
Ehud Reiter, Computing Science, University of Aberdeen

Structure
● Evaluation Concepts
● Specifics: controlled ratings-based eval

Purpose of Evaluation
● What do we want to know?
● Audience?

Example: BabyTalk
● Goal: Summarise clinical data about premature babies in neonatal ICU
● Input: sensor data; records of actions/observations by medical staff
● Output: multi-paragraph texts that summarise the data
  » BT45: 45 mins data, for doctors
  » BT-Nurse: 12 hrs data, for nurses
  » BT-Family: 24 hrs data, for parents

BabyTalk: Neonatal ICU

BabyTalk Input: Sensor Data

BT-Nurse text (extract)

Respiratory Support

Current Status
Currently, the baby is on CMV in 27 % O2. Vent RR is 55 breaths per minute. Pressures are 20/4 cms H2O. Tidal volume is 1.5. SaO2 is variable within the acceptable range and there have been some desaturations.
…

Events During the Shift
A blood gas was taken at around 19:45. Parameters were acceptable. pH was 7.18. CO2 was 7.71 kPa. BE was -4.8 mmol/L.
…

BabyTalk eval: goals
● BabyTalk evaluation goals
  » Medics want to know if BabyTalk summaries enhance patient outcome
    – Deploy BabyTalk on ward and measure outcome (RCT)
  » Psychologists want to know if BabyTalk texts are an effective decision support tool
    – Controlled “off ward” study of decision effectiveness
  » Developers want to know how to improve the system
    – Qualitative feedback often most useful
  » Software house wants to know if profitable
    – Business model (costs and revenue)

Which Goal?
● Depends!
  » Publish NLG research papers
    – Usually focus on “psychologist” goals
  » Publish NLP research paper
    – Usually performance on standard data set
    – Very dubious in my opinion…
● But other goals also important

Types of NLG Evaluation
● Task Performance
● Human Ratings
● Metric (comparison to gold standard)
● Controlled vs Real-World

Task-Performance Eval
● Measure whether NLG system achieves its communicative goal
  » Typically helping user perform a task
  » Other possibilities, e.g. behaviour change
● Evaluate in real-world or in controlled experiment

Real world: STOP smoking
● STOP system generates personalised smoking-cessation letters
● Recruited 2553 smokers
  » Sent 1/3 STOP letters
  » Sent 1/3 fixed (non-tailored) letter
  » Sent 1/3 simple “thank you” letter
● Waited 6 months, and compared smoking cessation rates between the groups

Results: STOP
● 6-Month cessation rate
  » STOP letter: 3.5%
  » Non-tailored letter: 4.4%
  » Thank-you letter: 2.6%
● Note:
  » More heavy smokers in STOP group
  » Heavy smokers less likely to quit
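
As a rough illustration of how the cessation rates above might be compared, the sketch below runs a two-sided two-proportion z-test in Python. The group sizes and quitter counts are assumptions: the slide only says that 2553 smokers were split into thirds, so each group is taken to be 851 people and the counts are back-computed from the rounded percentages. The function name two_proportion_z is hypothetical, and this is not necessarily the analysis used in the published study.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). Uses the pooled-proportion standard error.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# ASSUMPTION: equal thirds of the 2553 recruits; exact group sizes are not on the slide.
group_size = 2553 // 3

# ASSUMPTION: quitter counts back-computed from the rounded rates on the slide.
rates = {"STOP": 0.035, "non-tailored": 0.044, "thank-you": 0.026}
quitters = {name: round(rate * group_size) for name, rate in rates.items()}

z_stop_vs_fixed, p_stop_vs_fixed = two_proportion_z(
    quitters["STOP"], group_size, quitters["non-tailored"], group_size
)
print(f"STOP vs non-tailored: z = {z_stop_vs_fixed:.2f}, p = {p_stop_vs_fixed:.3f}")
```

A test like this ignores the imbalance noted on the slide (more heavy smokers in the STOP group), so it should be read only as a first look at the raw rates.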

Negative result
● Should be published!
● Don’t ignore or “tweak stats” until you get the “right” answer
● E Reiter, R Robertson, and L Osman (2003). Lessons from a Failure: Generating Tailored Smoking Cessation Letters. Artificial Intelligence 144:41-58.

Controlled exper: BT45
● BabyTalk BT-45 system (short reports)
● Chose 24 data sets (scenarios)
  » From historical data (5 years old)
● Created 3 presentations of each scenario
  » BT45 text, Human text, Visualisation
● Asked 35 subjects (medics) to look at presentations and decide on intervention
  » In experiment room, not in ward
  » Compared intervention to gold standard
● Computed likelihood of correct decision

Results: BT45
● Correct decision made
  » BT45 text: 34%
  » Human text: 39%
  » Visualisation: 33%
● Note:
  » BT45 texts mostly as good as human, but were pretty bad in scenarios where target action was “no action” or “sensor error”

Reference
● F Portet, E Reiter, A Gatt, J Hunter, S Sripada, Y Freer, C Sykes (2009). Automatic Generation of Textual Summaries from Neonatal Intensive Care Data. Artificial Intelligence 173:789-816.
● M van der Meulen, R Logie, Y Freer, C Sykes, N McIntosh, J Hunter (2008). When a graph is poorer than 100 words: A comparison of computerised natural language generation, human generated descriptions and graphical displays in neonatal intensive care. Applied Cognitive Psychology 24:77-89.

Task-based evaluations
● Most respected
  » Especially outwith NLG/NLP community
● Very expensive and time-consuming
● Eval is of specific system, not generic algorithm or idea
  » Small changes to either STOP or BT45 would probably change the eval result

Human Ratings
● Ask human subjects to assess texts
  » Readability (linguistic quality)
  » Accuracy (content quality)
  » Usefulness
● Can assess control/baseline as well
● Usually use Likert scale
  » Strongly agree, agree, undecided, disagree, strongly disagree (5 pt scale)
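
To make the 5-point Likert scale above concrete, here is a minimal Python sketch of one way such ratings could be coded and summarised. The numeric coding (1 to 5), the function name summarise_ratings, and the sample responses are all assumptions made for illustration; they are not data from any study in these slides.

```python
from collections import Counter
from statistics import median

# ASSUMPTION: conventional 1-5 coding of the five response labels on the slide.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "undecided": 3,
    "agree": 4,
    "strongly agree": 5,
}


def summarise_ratings(responses):
    """Summarise a list of Likert responses (label strings) for one question."""
    scores = [LIKERT[r.lower()] for r in responses]
    counts = Counter(scores)
    return {
        "n": len(scores),
        # Likert data is ordinal, so the median is reported rather than the mean.
        "median": median(scores),
        "percent_agree": 100 * sum(1 for s in scores if s >= 4) / len(scores),
        "distribution": {label: counts.get(value, 0) for label, value in LIKERT.items()},
    }


# Hypothetical readability ratings for a single generated text.
readability = ["agree", "strongly agree", "undecided", "disagree", "agree"]
summary = summarise_ratings(readability)
print(summary)
```

The same function can be applied separately to readability, accuracy, and usefulness ratings, and to the control or baseline texts, so the summaries can be compared side by side.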

Real world: BT-Nurse
● Deployed BT-Nurse on ward
● Nurses used it on real patients
  » Both beginning and end of shift
  » Vetted to remove content that could damage care
    – No content removed
● Nurses gave scores (3-pt scale) on each text
  » Understandable, accurate, helpful
  » Agree, neutral, disagree
● Also free-text comments

Results: BT-Nurse
● Numerical results
  » 90% of texts understandable
  » 70% of texts accurate
  » 60% of texts helpful
  » [no texts damaged care]
● Many comments
  » More content
  » Software bugs
  » A few “really helped me” comments

Reference
● J Hunter, Y Freer, A Gatt, E Reiter, S Sripada, C Sykes, D Westwater (2011). BT-Nurse: Computer Generation of Natural Language Shift Summaries from Complex Heterogeneous Medical Data. Journal of the American Medical Informatics Association 18:621-624.
● J Hunter, Y Freer, A Gatt, E Reiter, S Sripada, C Sykes (2012). Automatic generation of natural language nursing shift summaries in neonatal intensive care: BT-Nurse. Artificial Intelligence in Medicine 56:157-172.

Controlled exper: SumTime
● Marine weather forecasts
● Chose 5 weather data sets (scenarios)
● Created 3 presentations of each scenario
  » SumTime text
  » Human texts
  » Hybrid: Human content, SumTime language
● Asked 73 subjects (readers of marine forecasts) to give preference
  » Each saw 2 of the 3 possible variants of a scenario
  » Most readable, most accurate, most appropriate

Results: SumTime

Question              ST     Human   same   p value
SumTime vs. human texts
  More appropriate?   43%    27%     30%    0.021
  More accurate?      51%    33%     15%    0.011
  Easier to read?     41%    36%     23%    >0.1
Hybrid vs. human texts
  More appropriate?   38%    28%     34%    >0.1
  More accurate?      45%    36%     19%    >0.1
  Easier to read?     51%    17%     33%    <0.0001

Reference
● E Reiter, S Sripada, J Hunter, J Yu, and I Davy (2005). Choosing Words in Computer-Generated Weather Forecasts. Artificial Intelligence 167:137-169.

Human ratings evaluation
● Probably most common type in NLG
  » Well accepted in academic literature
● Easier/quicker than task-based
  » For controlled eval, may be able to use Mechanical Turk
  » Can answer questions which are hard to fit into a task-based evaluation
    – Can ask people to generalise

Metric-based evaluation
● Create a gold standard set
  » Input data for NLG system (scenarios)
  » Desired output text (usually human-written)
    – Sometimes multiple “reference” texts specified
● Run NLG system on above data sets
● Compare output to gold standard output
  » Various metrics, such as BLEU
● Widely used in machine translation
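
The comparison step above can be sketched with a simplified BLEU-style score in Python. This is a teaching sketch under stated assumptions, not the official BLEU implementation: it scores a single sentence, uses only unigrams and bigrams by default, applies no smoothing, and the function names (ngrams, modified_precision, simple_bleu) and the example token lists are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def simple_bleu(candidate, references, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty (no smoothing)."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Reference length closest to the candidate length (shorter one on ties).
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(log_avg)


# Hypothetical system output and two hypothetical human-written reference texts.
system_output = "wind backing southerly and easing by evening".split()
reference_texts = [
    "wind backing southerly and easing later".split(),
    "southerly wind easing by evening".split(),
]
score = simple_bleu(system_output, reference_texts)
print(f"simplified BLEU: {score:.3f}")
```

Allowing several reference texts matters here because the clipping step credits an n-gram if any reference contains it, which is why gold standards sometimes specify more than one reference per scenario.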

Example: SumTime input data

day/hour   wind-dir   speed   gust
05/06      SSW        18      22
05/09      S          16      20
05/12      S          14      17
05/15      S          14      17
05/18      SSE        12      15
05/21      SSE        10      12
06/00      VAR        6       7
