

1. Neural Generation for Czech: Data and Baselines
Ondřej Dušek & Filip Jurčíček
Institute of Formal and Applied Linguistics, Charles University, Prague
INLG, Tokyo, 31 Oct 2019

2–3. Task & Motivation
• Task: data-to-text generation from flat MRs
  • as in dialogue systems: dialogue act type + attributes/slots + values → sentence
  • English example: inform(name=The Red Lion, food=British) → "The Red Lion serves British food."
  • Czech example: inform(name=Na Růžku, food=Czech) → "Na Růžku podávají česká jídla." ("They serve Czech dishes at Na Růžku.")
• Motivation: most data-to-text NLG targets English only
  • non-English systems are mostly handcrafted
  • (surface realization is a different task)
  • few non-English data-to-text NLG datasets are available
  • English has little morphology – a possible bias?
  • Czech has rich morphology, is used a lot in MT research, and has NLP tools readily available
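The flat MRs above are just strings; below is a minimal sketch of how one might parse them into a dialogue-act structure. The parse_mr helper and its regular expression are illustrative assumptions, not code from the paper.

```python
import re

def parse_mr(mr: str):
    """Split a flat MR like 'inform(name=Na Růžku, food=Czech)'
    into a dialogue act type and a slot→value dict (illustrative only)."""
    act_type, args = re.fullmatch(r"(\w+)\((.*)\)", mr).groups()
    slots = {}
    for pair in re.split(r",\s*(?=\w+=)", args):  # split on commas that start a new slot
        slot, value = pair.split("=", 1)
        slots[slot.strip()] = value.strip()
    return act_type, slots

print(parse_mr("inform(name=Na Růžku, food=Czech)"))
# → ('inform', {'name': 'Na Růžku', 'food': 'Czech'})
```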

4–5. Delexicalization
• Delexicalization = replacing slot values with placeholders
  • used heavily in NLG systems (not just data-driven ones)
  • helps fight data sparsity
• Lexicalization = putting concrete values back
  • easy in English – values can be inserted verbatim (for noun phrases)
  • not easy in Czech and other languages with rich morphology
  • need to find the proper surface form to fit the sentence
• Example: inform(name=Baráčnická rychta, area=Malá Strana)

  case           <name>                <area>
  nominative     Baráčnická rychta     Malá Strana
  genitive       Baráčnické rychty     Malé Strany
  dative         Baráčnické rychtě     Malé Straně
  accusative     Baráčnickou rychtu    Malou Stranu
  locative       Baráčnické rychtě     Malé Straně
  instrumental   Baráčnickou rychtou   Malou Stranou

  • template "<name> je na <area>" ("<name> is in <area>")
    needs nominative <name> + locative <area>:
    → Baráčnická rychta je na Malé Straně
  • template "<name> najdete v oblasti <area>" ("you-find <name> in the-area of-<area>")
    needs accusative <name> + genitive <area>:
    → Baráčnickou rychtu najdete v oblasti Malé Strany
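A minimal sketch of this delexicalize/lexicalize round trip, assuming a hand-built form table like the one above; FORMS, delexicalize, and lexicalize are our illustration, not the authors' implementation.

```python
# Surface forms per slot value and case, as in the table above (illustrative data).
FORMS = {
    "Baráčnická rychta": {"nominative": "Baráčnická rychta",
                          "accusative": "Baráčnickou rychtu"},
    "Malá Strana": {"nominative": "Malá Strana",
                    "locative": "Malé Straně",
                    "genitive": "Malé Strany"},
}

def delexicalize(sentence: str, mr_slots: dict) -> str:
    """Replace any known surface form of each slot value with a placeholder."""
    for slot, value in mr_slots.items():
        for form in FORMS.get(value, {"base": value}).values():
            sentence = sentence.replace(form, f"<{slot}>")
    return sentence

def lexicalize(template: str, mr_slots: dict, case_for_slot: dict) -> str:
    """Fill placeholders back in, using the case the template requires."""
    for slot, value in mr_slots.items():
        template = template.replace(f"<{slot}>", FORMS[value][case_for_slot[slot]])
    return template

mr = {"name": "Baráčnická rychta", "area": "Malá Strana"}
print(delexicalize("Baráčnická rychta je na Malé Straně", mr))
# → <name> je na <area>
print(lexicalize("<name> je na <area>", mr,
                 {"name": "nominative", "area": "locative"}))
# → Baráčnická rychta je na Malé Straně
```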

6. Creating a Czech NLG Dataset
• Crowdsourcing was not an option for Czech
  • no Czech speakers on the platforms
• We opted for translating an existing dataset
  • easier than in-house collection
  • translators are easy to hire and require no training
• SFRest (Wen et al., EMNLP 2015)
  • manageable size + shown to work with neural NLG
• We localized the set before translation
  • restaurants, landmarks, addresses: San Francisco → Prague
  • local names sound more natural
  • using various types of names (some inflected, some not):
    Ananta – feminine, inflected
    BarBar – masculine inanimate, inflected
    Café Savoy – neuter, not inflected
    Místo – neuter, inflected
    U Konšelů – prepositional phrase, not inflected
• We kept track of all possible inflection forms for slot values

7. Data Statistics
• The result is more complex than SFRest:
  • more distinct lemmas (base forms)
  • >2× more distinct surface word forms (not counting restaurant names)
  • 3.84 different lexical forms per slot value on average
• the train/dev/test split is not random – we ensure no MR overlap

                                             SFRest   CS-Rest
  Number of instances                         5,192     5,192
  Unique delexicalized instances              2,648     2,752
  Unique delexicalized MRs                      248       248
  Unique lemmas (in delexicalized set)          399       532
  Unique word forms (in delexicalized set)      455       962
  Average lexicalizations per slot value          1      3.84

8. Model
• Base model: TGen
  • seq2seq with attention (encoder reads the input MR, attention-based decoder generates the sentence)
  • beam reranking by MR classification: an MR classifier is run on each output beam hypothesis, and any differences w.r.t. the input MR are penalized (penalty = number of differences from the input MR)
• Base setup:
  • direct word-form generation
  • delexicalized input MRs
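A minimal sketch of the beam-reranking idea: hypotheses whose classified slots differ from the input MR are pushed down. The classify_slots callable and the penalty weight are illustrative assumptions, not TGen's actual API.

```python
def rerank_beam(beam, input_slots, classify_slots, penalty_weight=100.0):
    """Re-score beam hypotheses: subtract a penalty for every slot the
    MR classifier finds missing from or added to the output.
    `beam` is a list of (log_prob, output_tokens) pairs;
    `classify_slots` stands in for the trained MR classifier and
    returns the set of (slot, value) pairs detected in an output."""
    def score(hypothesis):
        log_prob, tokens = hypothesis
        predicted = classify_slots(tokens)
        n_diff = len(predicted ^ input_slots)   # symmetric difference: missing + extra
        return log_prob - penalty_weight * n_diff
    return max(beam, key=score)
```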

9. TGen extensions
• Lemma-tag generation mode
  • generate an interleaved sequence of lemmas & morphological tags
  • postprocess using a morphological generator (dictionary-based)
  • addresses data sparsity and limits the possible inflection forms for slot values
  • example – "hledáte restauraci na <good-for-meal>?" ("are you looking for a restaurant for <meal>?") is generated as:

    hledat           VB-P---2P-AA---   verb, 2nd person, present, formal
    restaurace       NNFS4-----A----   noun, feminine singular accusative
    na               RR--4----------   preposition + accusative
    <good-for-meal>  NNFS4-----A----   slot placeholder, fem. sg. accusative
    ?                Z:-------------   final punctuation

• Lexicalized inputs
  • still generate delexicalized outputs, but input lexicalized MRs
  • some values require different treatment,
    e.g. "in <area>" needs different prepositions for different values: na Smíchově vs. v Karlíně
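A minimal sketch of the lemma-tag postprocessing step, assuming a dictionary-based generator callable generate_form(lemma, tag) (standing in for a tool such as MorphoDiTa; the name and signature are our assumption).

```python
def postprocess(lemma_tag_seq, generate_form):
    """Turn an interleaved [lemma, tag, lemma, tag, ...] sequence into surface
    word forms; slot placeholders such as <good-for-meal> pass through as-is."""
    tokens = []
    for lemma, tag in zip(lemma_tag_seq[0::2], lemma_tag_seq[1::2]):
        if lemma.startswith("<"):
            tokens.append(lemma)   # placeholder: lexicalized in a later step
        else:
            # dictionary-based generation, e.g. ('hledat', 'VB-P---2P-AA---') → 'hledáte'
            tokens.append(generate_form(lemma, tag))
    return " ".join(tokens)
```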

10–11. Lexicalization
• New additional generation step
• Baseline: always select the most frequent form found in the training data
• Non-trivial alternative: RNN LM ranking
  • process the sentence up to the slot placeholder using an LSTM RNN LM
  • get LM probabilities for all possible surface forms of the given slot value
  • select the most probable one
• Example: inform(name=Baráčnická rychta, area=Malá Strana)
  "Baráčnická rychta je na <area>" → LM probabilities for <area>:

  0.10   Malá Strana      nominative
  0.07   Malé Strany      genitive
  0.60   Malé Straně      dative, locative
  0.10   Malou Stranu     accusative
  0.03   Malou Stranou    instrumental

  → Baráčnická rychta je na Malé Straně ("Baráčnická rychta is in Malá Strana")
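A minimal sketch of the RNN LM ranking step, assuming a helper lm_logprob(prefix_tokens, form_tokens) that scores a continuation under the trained LSTM LM; the helper is our illustration, not the paper's code.

```python
def pick_form(prefix_tokens, candidate_forms, lm_logprob):
    """Choose the slot-value surface form the LSTM LM finds most likely
    after the already-generated prefix (e.g. 'Baráčnická rychta je na')."""
    return max(candidate_forms,
               key=lambda form: lm_logprob(prefix_tokens, form.split()))

# pick_form("Baráčnická rychta je na".split(),
#           ["Malá Strana", "Malé Strany", "Malé Straně",
#            "Malou Stranu", "Malou Stranou"],
#           lm_logprob)
# → "Malé Straně" (given the probabilities on the slide above)
```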

12. Evaluation
• BLEU + other E2E metrics
  • single reference → all scores are lower than usual
• Slot error rate (SER; counting placeholders before lexicalization)
• Manually counting errors of different types
  • outputs of each configuration on 100 randomly selected MRs
Results
• Outputs are readable, but not perfect
  • 49% of the manually evaluated sentences contain some error(s)
  • most problems appear with unusual MRs
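For concreteness, a sketch of a slot error rate computed on delexicalized outputs; the exact formula below (missing + superfluous placeholders over MR slots) is our assumption based on common NLG practice, not taken from the paper.

```python
from collections import Counter

def slot_error_rate(mr_slots, output_slots):
    """Assumed SER: (missing + superfluous slot placeholders) / slots in the MR,
    counted on the delexicalized output, i.e. before lexicalization."""
    want, have = Counter(mr_slots), Counter(output_slots)
    missing = sum((want - have).values())
    extra = sum((have - want).values())
    return 100.0 * (missing + extra) / max(len(mr_slots), 1)

print(slot_error_rate(["name", "area"], ["name"]))    # → 50.0  (area missing)
print(slot_error_rate(["name"], ["name", "price"]))   # → 100.0 (price hallucinated)
```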

13. Results
(automatic metrics: BLEU, NIST, SER; manual evaluation: error counts on 100 outputs per system)

  Input DAs      Generator mode  Lexicalizer    BLEU   NIST   SER   Semantic  Repeating  Fluency
                                                                    errors    content    errors
  Delexicalized  Word forms      Most frequent  20.28  4.519  0.70      8         0         73
  Delexicalized  Word forms      RNN LM         20.74  4.510  0.70      8         0         41
  Delexicalized  Lemma-tag       Most frequent  21.21  4.690  1.85     12         2         61
  Delexicalized  Lemma-tag       RNN LM         21.96  4.772  1.85     12         2         22
  Lexicalized    Word forms      Most frequent  19.73  4.562  2.30     14         5         54
  Lexicalized    Word forms      RNN LM         20.48  4.606  2.30     14         5         30
  Lexicalized    Lemma-tag       Most frequent  19.44  4.445  3.08     15         4         44
  Lexicalized    Lemma-tag       RNN LM         20.42  4.546  3.08     15         4         14

• RNN LM for lexicalization helps
  • BLEU improvement is statistically significant
• Lexicalized input & lemma-tag mode help fluency, but hurt accuracy
  • BLEU higher, # fluency errors lower
  • SER + # semantic errors higher

14. Conclusions
• 1st(?) non-English neural data-to-text NLG dataset + baselines
• Czech is harder than English due to slot value inflection
  • using an RNN LM for lexicalization helps
• Czech may need more data than English
Future work
• pretrain a language model on similar domains
• use MT to create synthetic data

15. Thanks
• Get the code: http://bit.ly/tgen-nlg
• Get the data: http://bit.ly/cs-rest
• Get the paper: arXiv:1910.05298
• Contact: odusek@ufal.mff.cuni.cz · http://bit.ly/odusek · @tuetschek


  17. Output examples
