

1. General Online Research Conference GOR 18, 28 February to 2 March 2018, TH Köln – University of Applied Sciences, Cologne, Germany
Christopher Harms, SKOPOS GmbH & Co. KG
Sebastian Schmidt, SKOPOS GmbH & Co. KG
Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions
Contact: christopher.harms@skopos.de
Suggested citation: Harms, Christopher, & Schmidt, Sebastian. 2018. "Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions." General Online Research (GOR) Conference, Cologne.
This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/)

2. #LearningFromAllAnswers
GOR 2018 | Cologne | March 1, 2018
Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions
Christopher Harms, Consultant Research & Development
Sebastian Schmidt, Director Research & Development

3. Initial Scenario
"What can we do to improve our service for you?"

4. Initial Scenario
How can we extract information from open-ended questions?
• Word cloud
• Qualitative summary
• Code plan
  – Manual coding
  – Automatic coding through supervised learning
Can we improve on this through unsupervised machine learning?

5. Methods Overview
• Naïve keyword extraction
• Latent Dirichlet Allocation (LDA; Blei et al., 2003)
• Embedding-based topic modelling (ETM; Qiang et al., 2016)

6. Methods Overview: Naïve Keyword Extraction
Example response: "Working from home for me means freedom and independence. I can just go for a walk when there is sunny weather and I need a break."
• Nouns indicate topics
• Extraction through a pre-trained POS tagger (e.g. spaCy)
• Catch different forms of the same word: lemmatization or stemming
• Word cloud of resulting terms, highlighting relative frequency
Extracted terms: home, freedom, independence, walk, weather, break
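
As a rough illustration (not from the slides) of how this extraction step might look with spaCy, assuming the en_core_web_sm model is installed:

```python
# Minimal sketch of naive keyword extraction, assuming spaCy and the
# en_core_web_sm model are installed (python -m spacy download en_core_web_sm).
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

response = ("Working from home for me means freedom and independence. I can just "
            "go for a walk when there is sunny weather and I need a break.")

doc = nlp(response)
# Keep nouns only; the lemma collapses different forms of the same word.
terms = [token.lemma_.lower() for token in doc if token.pos_ == "NOUN"]

# Relative frequencies of the extracted terms would feed the word cloud.
print(Counter(terms).most_common())
```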

7. Methods Overview: Latent Dirichlet Allocation
• Bayesian generative probabilistic model
• Each topic is a probability distribution over words
• Inference: find the relationship between words and topics for a given corpus
[Plate diagram of the LDA graphical model: hyperparameters α and β; a document's distribution over topics; topic assignments z and words w in a plate of size L; K per-topic distributions over words]
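
For reference, the generative process behind the plate diagram can be written out in standard LDA notation (added here; the slide shows only the diagram):

```latex
% Generative process of LDA for K topics and a document d of length L:
\begin{aligned}
\phi_k   &\sim \operatorname{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha) & & \\
z_{d,i}  &\sim \operatorname{Multinomial}(\theta_d), & i &= 1, \dots, L \\
w_{d,i}  &\sim \operatorname{Multinomial}(\phi_{z_{d,i}})
\end{aligned}
```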

8. Methods Overview: Latent Dirichlet Allocation
Benefits:
• Co-occurring words are grouped into a topic
• Readily available programming packages (e.g. gensim)
Disadvantages:
• Number of topics has to be chosen a priori
• Large corpus needed for reasonable results
• No knowledge about relationships between different words (e.g. "buffet" and "restaurant")
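
A minimal gensim sketch of this workflow on toy data (the tokenized responses and the topic count are placeholders; as noted above, num_topics must be fixed a priori):

```python
# Sketch of LDA with gensim on toy data; in practice `texts` would hold the
# tokenized open-ended responses and num_topics is chosen a priori.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    ["hotel", "breakfast", "buffet", "good"],
    ["flight", "delayed", "bad", "service"],
    ["breakfast", "buffet", "cold", "hotel"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Top words per topic, analogous to the Top-5 tables later in the deck.
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [word for word, _ in words])
```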

9. Methods Overview: Word Embeddings
king - man + woman = queen
breakfast + lunch = brunch
• Embeddings contain information about word relationships
• Trained on a very large corpus of texts
• Each word becomes a multidimensional vector
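
These analogies can be tried with pre-trained vectors, e.g. through gensim's downloader API; the GloVe model named below is one arbitrary choice, and the exact nearest neighbours depend on the embedding model:

```python
# Sketch of the word-analogy arithmetic with pre-trained GloVe vectors,
# fetched through gensim's downloader (requires internet access once).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# breakfast + lunch ~ brunch
print(vectors.most_similar(positive=["breakfast", "lunch"], topn=1))
```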

10. Methods Overview: Embedding-based Topic Modelling
• Extension of the LDA model:
  1. Aggregate short texts into pseudo-documents
  2. Make similar words more likely to be assigned to the same topic
• Word embeddings are used to measure the similarity of documents and words

11. Methods Overview: Embedding-based Topic Modelling
• Undirected edges between the topic assignments of similar words (binary potential): similar words should be more likely to belong to the same topic
• The resulting graphical model is a Markov Random Field (MRF-LDA; Xie et al., 2015)
• The binary potential carries a weight; if the weight is 0, the model reduces to LDA
[Plate diagram of the MRF-LDA model: as for LDA, with undirected edges added between the topic assignments z of similar words]
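
Roughly, in Xie et al.'s (2015) formulation (our paraphrase, not shown on the slide), the prior over document d's topic assignments gains an MRF term that rewards identical assignments for the similar-word pairs P_d, with weight λ and normalizer A; setting λ = 0 recovers plain LDA:

```latex
% MRF-LDA prior over topic assignments (A is a normalizing constant,
% P_d the set of similar-word pairs in document d, lambda the edge weight):
p(\mathbf{z}_d \mid \theta_d)
  = \frac{1}{A} \left( \prod_{i=1}^{L} \theta_{d, z_{d,i}} \right)
    \exp\!\left( \frac{\lambda}{|P_d|}
      \sum_{(i,j) \in P_d} \mathbb{1}\!\left[ z_{d,i} = z_{d,j} \right] \right)
```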

12. Methods Overview: Embedding-based Topic Modelling
Benefits:
• Knowledge of word relationships is incorporated (pre-trained embeddings)
• k-means improves topic modelling of short texts
Disadvantages:
• Number of pseudo-documents and topics has to be chosen a priori
• Computationally expensive
• Requires a large corpus for reasonable results
• No ready-made software packages available (see the sketch below)
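
Because no ready-made package exists, the following is only a sketch of the first ETM step, aggregating short texts into pseudo-documents via k-means over averaged word vectors; the function names and cluster count are our own illustration, and the MRF-LDA inference itself is not shown:

```python
# Sketch of ETM's pseudo-document aggregation only: short texts are clustered
# by their averaged word embeddings and each cluster is merged into one
# pseudo-document, which would then be fed into the (MRF-)LDA step.
# `vectors` is assumed to be a gensim KeyedVectors object (see earlier sketch).
import numpy as np
from sklearn.cluster import KMeans

def average_vector(tokens, vectors):
    """Average the embeddings of all in-vocabulary tokens (zeros if none)."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(vectors.vector_size)

def build_pseudo_documents(texts, vectors, n_pseudo_docs=50):
    """Cluster tokenized short texts and concatenate each cluster's tokens."""
    X = np.vstack([average_vector(tokens, vectors) for tokens in texts])
    labels = KMeans(n_clusters=n_pseudo_docs, n_init=10,
                    random_state=42).fit_predict(X)
    pseudo_docs = [[] for _ in range(n_pseudo_docs)]
    for tokens, label in zip(texts, labels):
        pseudo_docs[label].extend(tokens)
    return pseudo_docs
```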

13. Proof-of-Concept: Datasets
Twitter (Sentiment140):
• 10,000 tweets in English
• Purely observational
Survey responses:
• 10,000 survey responses in German
• Responses to three different questions concerning travel

14. Proof-of-Concept: Resulting Topics with Top-5 Words (excerpt)

English dataset (Twitter):

    LDA                             ETM
    Topic #1  Topic #2  Topic #3    Topic #1  Topic #2  Topic #3
    hope      twitter   morning     new       sad       sleep
    better    phone     good        cold      house     time
    sick      use       cold        better    watching  night
    feeling   site      snow        damn      night     hours
    feel      tweets    car         need      thank     bed

German dataset (survey responses):

    LDA                                       ETM
    Topic #1     Topic #2       Topic #3      Topic #1      Topic #2     Topic #3
    gut          super          immer         super         geklappt     service
    geklappt     einfach        zufrieden     einfach       reibungslos  organisation
    organisiert  nein           buchen        stimmt        vielen       hotel
    gefallen     unkompliziert  gerne         tolle         dank         hotels
    reise        schnell        reisen        funktioniert  perfekt      information

15. Expert Review
• Classical machine-learning metrics are not informative for real research projects
• Our question of interest: Can our (human) colleagues work with the results provided by the algorithms?
• Are the resulting topics coherent? That is, can the words associated with a topic indeed be grouped into a sensible topic?

16. Results: Expert Review (English Dataset)

    LDA          ETM
    3.54 (1.04)  2.70 (1.15)
    3.23 (1.10)  2.25 (1.19)

Mean coherence ratings, standard deviations in parentheses.

17. Results: Expert Review (German Dataset)

    LDA          ETM
    4.09 (0.85)  4.06 (0.76)
    4.09 (0.90)  3.72 (0.98)

Mean coherence ratings, standard deviations in parentheses.

18. Expert Review: Summary
• English: LDA results rated more coherent than ETM results
• German: ETM and LDA rated equally coherent
• But: results are highly dependent on topic selection

19. Summary: Our Learnings
• Proof of concept – needs further development
• Fine-tuning of hyperparameters and techniques required
• Pre-trained word vectors provide valuable information
• Lots of data required for best results (> 1,000 responses)
• Open question: What metric captures usefulness in a real-world environment?

20. Thank You For Your Attention!
Further questions? Let's talk!
Christopher Harms, Consultant Research & Development, christopher.harms@skopos.de, @chrisharms
Sebastian Schmidt, Director Research & Development, sebastian.schmidt@skopos.de

21. References
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
- Qiang, J., Chen, P., Wang, T., & Wu, X. (2016). Topic Modeling over Short Texts by Incorporating Word Embeddings. CEUR Workshop Proceedings, 1828, 53–59. Retrieved from http://arxiv.org/abs/1609.08496
- Xie, P., Yang, D., & Xing, E. P. (2015). Incorporating Word Correlation Knowledge into Topic Modeling. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (pp. 725–734). Retrieved from http://www.cs.cmu.edu/~pengtaox/papers/naacl15_mrflda.pdf
