

1. General Online Research Conference GOR 18, 28 February to 2 March 2018, TH Köln – University of Applied Sciences, Cologne, Germany
Christopher Harms, SKOPOS GmbH & Co. KG
Sebastian Schmidt, SKOPOS GmbH & Co. KG
Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions
Contact: christopher.harms@skopos.de
Suggested citation: Harms, Christopher, & Schmidt, Sebastian. 2018. "Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions." General Online Research (GOR) Conference, Cologne.
This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/)

2. #LearningFromAllAnswers
GOR 2018 | Cologne | March 1, 2018
Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions
Christopher Harms, Consultant Research & Development
Sebastian Schmidt, Director Research & Development

3. Initial Scenario
"What can we do to improve our service for you?"

4. Initial Scenario
How can we extract information from open-ended questions?
• Word cloud
• Qualitative summary
• Code plan
  – Manual coding
  – Automatic coding through supervised learning
Can we improve on this through unsupervised machine learning?

5. Methods Overview
• Naïve keyword extraction
• Latent Dirichlet Allocation (LDA; Blei et al., 2003)
• Embedding-based topic modelling (ETM; Qiang et al., 2016)

6. Methods Overview: Naïve Keyword Extraction
Example response: "Working from home for me means freedom and independence. I can just go for a walk when there is sunny weather and I need a break."
• Nouns indicate topics
• Extraction through a pre-trained POS tagger (e.g. spaCy)
• Catch different forms of the same word: lemmatization or stemming
• Word cloud of resulting terms, highlighting relative frequency
Extracted terms: home, freedom, independence, walk, weather, break
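
As a rough illustration (not from the slides) of how this extraction step might look with spaCy, assuming the en_core_web_sm model is installed:

```python
# Minimal sketch of naive keyword extraction, assuming spaCy and the
# en_core_web_sm model are installed (python -m spacy download en_core_web_sm).
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

response = ("Working from home for me means freedom and independence. I can just "
            "go for a walk when there is sunny weather and I need a break.")

doc = nlp(response)
# Keep nouns only; the lemma collapses different forms of the same word.
terms = [token.lemma_.lower() for token in doc if token.pos_ == "NOUN"]

# Relative frequencies of the extracted terms would feed the word cloud.
print(Counter(terms).most_common())
```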

7. Methods Overview: Latent Dirichlet Allocation
• Bayesian generative probabilistic model
• Each topic is a probability distribution over words
• Inference: find the relationship between words and topics for a given corpus
[Plate diagram of the LDA graphical model: hyperparameters α and β; a document's distribution over topics; topic assignments z and words w in a plate of size L; K per-topic distributions over words]
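
For reference, the generative process behind the plate diagram can be written out in standard LDA notation (added here; the slide shows only the diagram):

```latex
% Generative process of LDA for K topics and a document d of length L:
\begin{aligned}
\phi_k   &\sim \operatorname{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha) & & \\
z_{d,i}  &\sim \operatorname{Multinomial}(\theta_d), & i &= 1, \dots, L \\
w_{d,i}  &\sim \operatorname{Multinomial}(\phi_{z_{d,i}})
\end{aligned}
```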

8. Methods Overview: Latent Dirichlet Allocation
Benefits:
• Co-occurring words are grouped into a topic
• Readily available programming packages (e.g. gensim)
Disadvantages:
• Number of topics has to be chosen a priori
• Large corpus needed for reasonable results
• No knowledge about relationships between different words (e.g. "buffet" and "restaurant")
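
A minimal gensim sketch of this workflow on toy data (the tokenized responses and the topic count are placeholders; as noted above, num_topics must be fixed a priori):

```python
# Sketch of LDA with gensim on toy data; in practice `texts` would hold the
# tokenized open-ended responses and num_topics is chosen a priori.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    ["hotel", "breakfast", "buffet", "good"],
    ["flight", "delayed", "bad", "service"],
    ["breakfast", "buffet", "cold", "hotel"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Top words per topic, analogous to the Top-5 tables later in the deck.
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [word for word, _ in words])
```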

9. Methods Overview: Word Embeddings
king - man + woman = queen
breakfast + lunch = brunch
• Embeddings contain information about word relationships
• Trained on a very large corpus of texts
• Each word becomes a multidimensional vector
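
These analogies can be tried with pre-trained vectors, e.g. through gensim's downloader API; the GloVe model named below is one arbitrary choice, and the exact nearest neighbours depend on the embedding model:

```python
# Sketch of the word-analogy arithmetic with pre-trained GloVe vectors,
# fetched through gensim's downloader (requires internet access once).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# breakfast + lunch ~ brunch
print(vectors.most_similar(positive=["breakfast", "lunch"], topn=1))
```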

10. Methods Overview: Embedding-based Topic Modelling
• Extension of the LDA model:
  1. Aggregate short texts into pseudo-documents
  2. Make similar words more likely to be assigned to the same topic
• Word embeddings are used to measure the similarity of documents and words

11. Methods Overview: Embedding-based Topic Modelling
• Undirected edges between the topic assignments of similar words (binary potential): similar words should be more likely to belong to the same topic
• The resulting graphical model is a Markov Random Field (MRF-LDA; Xie et al., 2015)
• The binary potential carries a weight; if the weight is 0, the model reduces to LDA
[Plate diagram of the MRF-LDA model: as for LDA, with undirected edges added between the topic assignments z of similar words]
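
Roughly, in Xie et al.'s (2015) formulation (our paraphrase, not shown on the slide), the prior over document d's topic assignments gains an MRF term that rewards identical assignments for the similar-word pairs P_d, with weight λ and normalizer A; setting λ = 0 recovers plain LDA:

```latex
% MRF-LDA prior over topic assignments (A is a normalizing constant,
% P_d the set of similar-word pairs in document d, lambda the edge weight):
p(\mathbf{z}_d \mid \theta_d)
  = \frac{1}{A} \left( \prod_{i=1}^{L} \theta_{d, z_{d,i}} \right)
    \exp\!\left( \frac{\lambda}{|P_d|}
      \sum_{(i,j) \in P_d} \mathbb{1}\!\left[ z_{d,i} = z_{d,j} \right] \right)
```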

12. Methods Overview: Embedding-based Topic Modelling
Benefits:
• Knowledge of word relationships is incorporated (pre-trained embeddings)
• k-means improves topic modelling of short texts
Disadvantages:
• Number of pseudo-documents and topics has to be chosen a priori
• Computationally expensive
• Requires a large corpus for reasonable results
• No ready-made software packages available (see the sketch below)
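
Because no ready-made package exists, the following is only a sketch of the first ETM step, aggregating short texts into pseudo-documents via k-means over averaged word vectors; the function names and cluster count are our own illustration, and the MRF-LDA inference itself is not shown:

```python
# Sketch of ETM's pseudo-document aggregation only: short texts are clustered
# by their averaged word embeddings and each cluster is merged into one
# pseudo-document, which would then be fed into the (MRF-)LDA step.
# `vectors` is assumed to be a gensim KeyedVectors object (see earlier sketch).
import numpy as np
from sklearn.cluster import KMeans

def average_vector(tokens, vectors):
    """Average the embeddings of all in-vocabulary tokens (zeros if none)."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(vectors.vector_size)

def build_pseudo_documents(texts, vectors, n_pseudo_docs=50):
    """Cluster tokenized short texts and concatenate each cluster's tokens."""
    X = np.vstack([average_vector(tokens, vectors) for tokens in texts])
    labels = KMeans(n_clusters=n_pseudo_docs, n_init=10,
                    random_state=42).fit_predict(X)
    pseudo_docs = [[] for _ in range(n_pseudo_docs)]
    for tokens, label in zip(texts, labels):
        pseudo_docs[label].extend(tokens)
    return pseudo_docs
```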

13. Proof-of-Concept: Datasets
Twitter (Sentiment140):
• 10,000 tweets in English
• Purely observational
Survey responses:
• 10,000 survey responses in German
• Responses to three different questions concerning travel

14. Proof-of-Concept: Resulting Topics with Top-5 Words (excerpt)

English dataset (Twitter):

    LDA                             ETM
    Topic #1  Topic #2  Topic #3    Topic #1  Topic #2  Topic #3
    hope      twitter   morning     new       sad       sleep
    better    phone     good        cold      house     time
    sick      use       cold        better    watching  night
    feeling   site      snow        damn      night     hours
    feel      tweets    car         need      thank     bed

German dataset (survey responses):

    LDA                                       ETM
    Topic #1     Topic #2       Topic #3      Topic #1      Topic #2     Topic #3
    gut          super          immer         super         geklappt     service
    geklappt     einfach        zufrieden     einfach       reibungslos  organisation
    organisiert  nein           buchen        stimmt        vielen       hotel
    gefallen     unkompliziert  gerne         tolle         dank         hotels
    reise        schnell        reisen        funktioniert  perfekt      information

15. Expert Review
• Classical machine-learning metrics are not informative for real research projects
• Our question of interest: Can our (human) colleagues work with the results provided by the algorithms?
• Are the resulting topics coherent? That is, can the words associated with a topic indeed be grouped into a sensible topic?

16. Results: Expert Review (English Dataset)

    LDA          ETM
    3.54 (1.04)  2.70 (1.15)
    3.23 (1.10)  2.25 (1.19)

Mean coherence ratings, standard deviations in parentheses.

17. Results: Expert Review (German Dataset)

    LDA          ETM
    4.09 (0.85)  4.06 (0.76)
    4.09 (0.90)  3.72 (0.98)

Mean coherence ratings, standard deviations in parentheses.

18. Expert Review: Summary
• English: LDA results rated more coherent than ETM results
• German: ETM and LDA rated equally coherent
• But: results are highly dependent on topic selection

19. Summary: Our Learnings
• Proof of concept – needs further development
• Fine-tuning of hyperparameters and techniques required
• Pre-trained word vectors provide valuable information
• Lots of data required for best results (> 1,000 responses)
• Open question: What metric captures usefulness in a real-world environment?

20. Thank You For Your Attention!
Further questions? Let's talk!
Christopher Harms, Consultant Research & Development, christopher.harms@skopos.de, @chrisharms
Sebastian Schmidt, Director Research & Development, sebastian.schmidt@skopos.de

21. References
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
- Qiang, J., Chen, P., Wang, T., & Wu, X. (2016). Topic Modeling over Short Texts by Incorporating Word Embeddings. CEUR Workshop Proceedings, 1828, 53–59. Retrieved from http://arxiv.org/abs/1609.08496
- Xie, P., Yang, D., & Xing, E. P. (2015). Incorporating Word Correlation Knowledge into Topic Modeling. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (pp. 725–734). Retrieved from http://www.cs.cmu.edu/~pengtaox/papers/naacl15_mrflda.pdf
