

  1. Automatic Quality Estimation for Natural Language Generation: Ranting (Jointly Rating and Ranking)
     Ondřej Dušek, Karin Sevegnani, Ioannis Konstas & Verena Rieser
     Charles University, Prague / Heriot-Watt University, Edinburgh
     INLG, Tokyo, 31 Oct 2019

  2. Our Task(s)
     • Quality estimation: checking NLG output quality
       • given only the input MR & the NLG system output
       • no human reference texts for the NLG output
       • supervised training from a few human-annotated instances
       • well established for MT, much less so in data-to-text NLG
     • Rating: given an NLG output, check how good it is (on a 1-6 scale)
     • Ranking: given multiple NLG outputs, which one is the best?
     Rating example:
       MR: inform_only_match(name='hotel drisco', area='pacific heights')
       NLG output: the only match i have for you is the hotel drisco in the pacific heights area.
       Rating: 4 (on a 1-6 scale)
     Ranking example:
       MR: inform(name='The Cricketers', eatType='coffee shop', rating=high, familyFriendly=yes, near='Café Sicilia')
       NLG 1 (better): The Cricketers is a children friendly coffee shop near Café Sicilia with a high customer rating.
       NLG 2 (worse): The Cricketers can be found near the Café Sicilia. Customers give this coffee shop a high rating. It's family friendly.

  3. Why Quality Estimation?
     • BLEU et al. don't work very well – can we do better?
       • evaluated via correlation with human judgments
     • We can do without human references – wider usage:
       • evaluation & tuning (same as BLEU)
       • inference time – improving running NLG systems (see the sketch below):
         • rating: don't show outputs rated below a threshold; use a backoff or a human instead
         • ranking: select the best system output from an n-best list
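Both inference-time uses reduce to thresholding or comparing the estimator's scores. A minimal Python sketch, assuming a trained model wrapped in an `estimator` object with a `rate(mr, text)` method – both names are hypothetical stand-ins, and the threshold value is an assumption:

    RATING_THRESHOLD = 4.0  # assumed cutoff on the 1-6 scale

    def filter_output(estimator, mr, text, backoff_text):
        """Rating use: suppress outputs rated below a threshold."""
        if estimator.rate(mr, text) < RATING_THRESHOLD:
            return backoff_text  # fall back to a template or a human
        return text

    def rerank_nbest(estimator, mr, nbest):
        """Ranking use: pick the best-scoring output from an n-best list."""
        return max(nbest, key=lambda text: estimator.rate(mr, text))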

  4. Old Model (Dušek, Novikova & Rieser, 2017)
     • Ratings only
     • Dual encoder:
       • MR encoder + NLG output encoder
       • fully connected layers + a linear output
     • trained with squared error
     • the final score is rounded
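A minimal PyTorch sketch of a dual-encoder rater in this spirit; the GRU encoders, shared embeddings, and layer sizes are illustrative assumptions, not the published configuration:

    import torch
    import torch.nn as nn

    class DualEncoderRater(nn.Module):
        """Rates an NLG output given its MR: two encoders, fully connected
        layers, and a linear output (hyperparameters are illustrative)."""

        def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.mr_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.fc = nn.Sequential(
                nn.Linear(2 * hid_dim, hid_dim), nn.Tanh(),
                nn.Linear(hid_dim, 1),  # linear layer producing the raw score
            )

        def forward(self, mr_ids, out_ids):
            _, mr_h = self.mr_enc(self.embed(mr_ids))     # MR encoder final state
            _, out_h = self.out_enc(self.embed(out_ids))  # output encoder final state
            joint = torch.cat([mr_h[-1], out_h[-1]], dim=-1)
            return self.fc(joint).squeeze(-1)  # raw rating; rounded at prediction time

    # Training minimizes squared error against human ratings, e.g.:
    # loss = torch.nn.functional.mse_loss(model(mr_ids, out_ids), gold_ratings)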

  5. Our Model
     • Ranking extension:
       • a 2nd copy of the NLG output encoder + fully connected + linear layers
       • weights shared with the 1st copy
       • trained with a hinge ranking loss on the difference of the two ratings
     • Can learn ranking & rating jointly
       • training instances are mixed & the respective losses are masked
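A sketch of the joint objective under these assumptions: the shared-weight model scores both candidates, the hinge margin of 1.0 is an illustrative choice, and a boolean mask selects which loss applies to each instance:

    import torch

    def joint_loss(score_a, score_b, gold_rating, is_rating, margin=1.0):
        """Masked joint rating + ranking loss (margin value is an assumption).

        score_a, score_b : model scores for two outputs of the same MR
                           (for rating instances, only score_a is used)
        gold_rating      : human rating, for rating instances
        is_rating        : boolean mask, True where the instance is a rating one
        """
        rate_mask = is_rating.float()
        rank_mask = 1.0 - rate_mask
        # squared error on the rating instances
        rating_loss = rate_mask * (score_a - gold_rating) ** 2
        # hinge loss on the score difference: output A should beat output B
        ranking_loss = rank_mask * torch.clamp(margin - (score_a - score_b), min=0.0)
        return (rating_loss + ranking_loss).mean()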

  6. Synthetic Data (Dušek, Novikova & Rieser, 2017)
     • Adding more training instances by introducing artificial errors, randomly:*
       • removing words
       • replacing words with random ones
       • duplicating words
       • inserting random words
     • For rating data: lower the rating by 1 for each introduced error (e.g. 6 → 4 with 2 errors)
     • This can be applied to NLG systems' training data, too
       • assume 6 (the maximum) as the original instances' rating
     [Figure: random corruptions applied to the sentence "name is a restaurant ."]
     * articles and punctuation are dispreferred
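A sketch of this error-injection procedure with the slide's four operations and the "-1 rating per error" rule; the uniform sampling of operations and positions is an assumption (the dispreference for articles and punctuation is omitted here):

    import random

    def inject_errors(tokens, vocab, num_errors):
        """Apply `num_errors` random corruptions to a token list."""
        tokens = list(tokens)
        for _ in range(num_errors):
            op = random.choice(['remove', 'replace', 'duplicate', 'insert'])
            pos = random.randrange(len(tokens))
            if op == 'remove' and len(tokens) > 1:
                del tokens[pos]
            elif op == 'replace':
                tokens[pos] = random.choice(vocab)
            elif op == 'duplicate':
                tokens.insert(pos, tokens[pos])
            elif op == 'insert':
                tokens.insert(pos, random.choice(vocab))
        return tokens

    def synthetic_rating_instance(tokens, vocab, orig_rating=6, num_errors=2):
        """Corrupted text plus degraded rating: -1 per error, floored at 1."""
        return inject_errors(tokens, vocab, num_errors), max(1, orig_rating - num_errors)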

  7. Synthetic Ranking Pairs
     • Different numbers of errors introduced into the same NLG output
     • The version with fewer errors should rank better
     • Ranking pairs are useful even when the system is trained to rate, too!
     Example: two corruptions of "X-name serves Chinese food ." – the version with 1 error ranks better than the one with 2 errors
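Building a synthetic ranking pair is then a small extension of the previous sketch (it reuses inject_errors() from above; the error counts are illustrative):

    def synthetic_ranking_pair(tokens, vocab, fewer=1, more=2):
        """Return (better, worse): fewer corruptions should rank higher."""
        return inject_errors(tokens, vocab, fewer), inject_errors(tokens, vocab, more)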

  8. Results: Rating (Novikova et al., EMNLP 2017)
     • Small 1-6 Likert-scale dataset (2,460 instances): https://aclweb.org/anthology/D17-1238
       • 3 systems, 3 datasets (hotels & restaurants)
       • 5-fold cross-validation
     • Much better correlations than BLEU et al., despite not needing references
     • Synthetic data help a lot (statistically significant)
     • A correlation of 0.37 is still not ideal – noise in the human data?
     • Absolute differences (MAE/RMSE) are not so great

     System                                    Pearson   Spearman   MAE     RMSE
     Constant                                  -         -          1.013   1.233
     BLEU (needs human references)             0.074     0.061      2.264   2.731
     Our previous (Dušek et al., 2017)         0.330     0.287      0.909   1.208
     Our base                                  0.253     0.252      0.917   1.221
     + synthetic rating instances              0.332     0.308      0.924   1.241
     + synthetic ranking instances             0.347     0.320      0.936   1.261
     + synthetic from systems' training data   0.369     0.295      0.925   1.250
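For reference, the four metrics in the table can be computed as follows (a standalone sketch using numpy/scipy on toy data):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def evaluate(pred, gold):
        pred, gold = np.asarray(pred, float), np.asarray(gold, float)
        return {
            'pearson': pearsonr(pred, gold)[0],    # linear correlation
            'spearman': spearmanr(pred, gold)[0],  # rank correlation
            'mae': np.abs(pred - gold).mean(),     # mean absolute error
            'rmse': np.sqrt(((pred - gold) ** 2).mean()),
        }

    print(evaluate(pred=[4, 5, 3, 6, 2], gold=[4, 6, 3, 5, 1]))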

  9. Results: Ranking (Dušek et al., CS&L 59)
     • Using E2E human ranking data (quality): 15,001 instances – https://arxiv.org/abs/1901.07931
       • 21 systems, 1 domain
       • 5-way rankings converted to pairwise, leaving out ties
       • 8:1:1 train-dev-test split, no MR overlap
     • Our system is much better than random in pairwise ranking accuracy
     • Synthetic ranking instances help: +4% absolute, statistically significant
     • Training on both datasets doesn't help – different text style, different systems

     System                                    P@1/Acc
     Random                                    0.500
     Our base                                  0.708
     + synthetic ranking instances             0.732
     + synthetic from systems' training data   0.740
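A sketch of the 5-way-to-pairwise conversion described above; the `ranked` dict (output text -> rank, 1 = best) is an assumed input format:

    from itertools import combinations

    def to_pairwise(ranked):
        """Turn one n-way ranking into (better, worse) pairs, leaving out ties."""
        pairs = []
        for (text_a, rank_a), (text_b, rank_b) in combinations(ranked.items(), 2):
            if rank_a == rank_b:
                continue  # leave out ties
            if rank_a < rank_b:
                pairs.append((text_a, text_b))
            else:
                pairs.append((text_b, text_a))
        return pairs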

  10. Conclusions
     • Trained quality estimation can do much better than BLEU & co.
       • Pearson correlation with humans: 0.37 vs. ~0.06-0.10
       • synthetic ranking instances help
     • The results so far aren't ideal (we want more than 0.37 / 74%)
       • domain/system generalization is still a problem
     • Future work:
       • improving the model, e.g. using pretrained LMs
       • obtaining "cleaner" user scores
       • more realistic synthetic errors
       • influence of error type on user ratings

  11. Thanks
     • Code & link to data + paper: http://bit.ly/ratpred
     • Contact: odusek@ufal.mff.cuni.cz | http://bit.ly/odusek | @tuetschek
     • Paper links:
       • this paper: arXiv:1910.04731
       • previous model: arXiv:1708.01759
       • datasets used: ACL D17-1238, arXiv:1901.07931
