
Text Mining on Mailing Lists: Sentiment Analysis. Gordon Heiczman, B. Sc.

Text Mining on Mailing Lists: Sentiment Analysis. Gordon Heiczman, B. Sc. October 13, 2017. Chair of Network Architectures and Services, Department of Informatics, Technical University of Munich.


  1. Text Mining on Mailing Lists: Sentiment Analysis. Gordon Heiczman, B. Sc. October 13, 2017. Chair of Network Architectures and Services, Department of Informatics, Technical University of Munich

  2. Introduction What is Sentiment Analysis? G. Heiczman — Sentiment Analysis 2

  3. Introduction: What is Sentiment Analysis?

  4. Introduction: What is Sentiment Analysis?

  5. Introduction: Problems of today
     • Too much information
     • Too little time

  6. Introduction: Agenda
     • Text Mining summary
     • Example of practical application
     • Presentation of results
     • Conclusion and Lessons Learned

  7. Text Mining: Feature Selection
     Main purpose: extract valuable information and get rid of redundant features, using the 'Bag of Words' approach.
     Most common selection steps:
     • Removal of stop words (the, is, at ...)
     • Removal of plurals (dogs -> dog)
     • Word / n-gram frequency
     • Part of Speech (POS) tagging (adjectives)
     • Opinion words (like, hate, love ...)
     • Detection of negation (not good -> bad)
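
The selection steps on the Feature Selection slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the stop-word and negation lists are tiny hypothetical samples, plural removal is a naive "drop the trailing s" rule, and negation is only marked (not good -> not_good) rather than mapped to an antonym.

```python
import re
from collections import Counter

# Illustrative lists only (assumption); real pipelines use far larger resources.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "this", "are"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def strip_plural(word):
    """Naive plural removal: drop a trailing 's' (dogs -> dog)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def mark_negation(tokens):
    """Merge a negation word into the token that follows it (not good -> not_good)."""
    result, negate = [], False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        result.append("not_" + token if negate else token)
        negate = False
    return result


def extract_features(text):
    """Return a bag of words (token -> frequency) after the selection steps."""
    tokens = mark_negation(tokenize(text))
    tokens = [strip_plural(t) for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)


if __name__ == "__main__":
    print(extract_features("The dogs are not good, the dogs bark"))
```

POS tagging and n-gram counting are left out here because they need either a trained tagger or more context than a short sketch allows.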

  8. Text Mining: Sentiment Classification
     Three main categories:
     • Machine Learning
     • Lexicon-based
     • Hybrid
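
Of the three categories, the lexicon-based one is the easiest to show in a few lines. The sketch below assumes a hypothetical five-word lexicon with made-up scores and averages the scores of the opinion words it finds, flipping the sign of a word that directly follows a negation. It illustrates the idea only and is not the scoring used by any particular library.

```python
import re

# Hypothetical mini-lexicon (assumption): word -> polarity score in [-1.0, 1.0].
LEXICON = {"like": 0.5, "love": 0.8, "good": 0.7, "hate": -0.8, "bad": -0.7}
NEGATIONS = {"not", "never", "no"}


def lexicon_polarity(text):
    """Average the scores of known opinion words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            score = LEXICON[token]
            scores.append(-score if negate else score)
        negate = False
    return sum(scores) / len(scores) if scores else 0.0


def classify(text):
    """Map the polarity score to a coarse label."""
    polarity = lexicon_polarity(text)
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ("I love this draft", "This is not good", "See attached agenda"):
        print(sentence, "->", classify(sentence))
```

A machine-learning classifier would replace the fixed lexicon with weights learned from labelled examples, and a hybrid approach would combine both.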

  9. Text Mining: Pitfalls
     • Named Entity Recognition, i.e. "What is the topic?"
     • Anaphora Resolution (reference word resolution): "What is 'it' referring to?"
     • Sarcasm
     • Abbreviations, poor grammar / punctuation / spelling

  10. Practical Application
      • Dataset
      • Language
      • Email retrieval
      • Content retrieval
      • Sentiment value retrieval

  11. Practical Application: Dataset
      A collection of emails from the IETF (Internet Engineering Task Force), whose task is to set Internet standards.

  12. Practical Application: Language
      C# or Python? There are not enough comprehensive, completely free tools.
      Notable C# tools:
      • VaderSharp (free but primitive)
      • Aylien (paid)
      • Watson D.C. (paid)
      • Vivekn (free but no documentation)
      Python tool: TextBlob

  13. Practical Application
      Multiple values obtained through sentiment analysis (SA):
      • Polarity (-1.0 to 1.0)
      • Subjectivity (0.0 to 1.0)
      • Most used word
      • Sentence count
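
Polarity and subjectivity come from the sentiment library, but the other two values can be approximated with the standard library alone. The following sketch is an assumption about how such counts could be produced, not the code from the talk: sentences are split on ., ! or ? followed by whitespace, and the `ignore` parameter is a hypothetical hook for excluding stop words from the word ranking.

```python
import re
from collections import Counter


def sentence_count(text):
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def most_used_word(text, ignore=frozenset()):
    """Return the most frequent word, skipping any words listed in `ignore`."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in ignore]
    if not words:
        return None
    return Counter(words).most_common(1)[0][0]


if __name__ == "__main__":
    sample = "This draft is good. Is the draft final? The draft looks fine!"
    print(sentence_count(sample), most_used_word(sample))
```

The regular-expression split is deliberately crude: abbreviations such as "e.g." or quoted email headers would inflate the count, so a proper sentence tokenizer is preferable on real mail.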

  14. Practical Application: TextBlob example
      from textblob import TextBlob

      blob = TextBlob("I think this presentation is really, really good!")
      print(blob.sentiment)              # Gives both polarity and subjectivity around 1.0
      print(blob.words.count('really'))  # Gives 2
      print(blob.noun_phrases)           # Gives nouns, in this case presentation

  15. Practical Application
      Figure 1: Example of email with polarity 1.0
      • Filename: /home/.../geopriv/2007-12.mail
      • Key: 251
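
The filename and key on the Figure 1 slide suggest one archive file per group and month, with messages addressed by key. The slides do not say how the files were read, so the sketch below rests on a clearly labelled assumption: that the `.mail` files are in mbox format and can be opened with Python's standard `mailbox` module. The helper names and the demo message are hypothetical.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def plain_text_body(message):
    """Return the first text/plain part of a message as a string."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def read_archive(path):
    """Map each mailbox key to (subject, body) for one mbox file (assumed format)."""
    box = mailbox.mbox(path)
    try:
        return {key: (box[key]["Subject"], plain_text_body(box[key]))
                for key in box.keys()}
    finally:
        box.close()


def write_demo_archive(path):
    """Create a one-message mbox file so the sketch runs without real data."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["Subject"] = "Draft review"
    msg.set_content("I think this draft is really good!")
    box = mailbox.mbox(path)
    box.add(msg)
    box.flush()
    box.close()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.mail")
    write_demo_archive(demo_path)
    for key, (subject, body) in read_archive(demo_path).items():
        print(key, subject, body.strip())
```

Each extracted body could then be passed to the sentiment library, with the (filename, key) pair kept alongside the scores so that an outlier can be traced back to its email.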

  16. Practical Application: Program flow

  17. Practical Application: Program flow

  18. Statistics
      [Figure 2: Top 10 groups using the most sentences (bar chart; vertical axis from 0 to 80; group labels not recoverable)]
      Even distribution.
      Indication of in-depth discussion or off-topic rambling?

  19. Statistics
      [Figure 3: Top 10 most positive groups (bar chart; vertical axis from 0.00 to 0.60; bar values between 0.25 and 0.5; group labels not recoverable)]
      Logarithmic distribution.
      Notable group: "iaoc-scribes"

  20. Statistics
      [Figure 4: Top 10 most negative groups (bar chart; vertical axis from 0.00 to -0.50; group labels not recoverable)]
      Stronger logarithmic distribution.
      Notable group: "ietf-sailors"
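
Rankings like those in Figures 3 and 4 can be produced by aggregating per-email scores by group. The slides do not state how a group's score was computed, so the sketch below assumes the arithmetic mean of the email polarities. The function name and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def top_groups(records, n=10, most_positive=True):
    """Rank groups by mean polarity.

    records: iterable of (group_name, polarity) pairs, one per email.
    Returns up to n (group_name, mean_polarity) pairs, best-ranked first.
    """
    by_group = defaultdict(list)
    for group, polarity in records:
        by_group[group].append(polarity)
    averages = {group: mean(values) for group, values in by_group.items()}
    return sorted(averages.items(), key=lambda item: item[1],
                  reverse=most_positive)[:n]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    sample = [("a", 0.5), ("a", 0.5), ("b", -1.0), ("c", 0.25), ("c", 0.0)]
    print(top_groups(sample, n=2))
    print(top_groups(sample, n=1, most_positive=False))
```

A mean is sensitive to groups with very few emails: a single message scored -1.0 can put a small group at the top of the negative ranking, so a minimum message count per group may be worth adding.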

  21. Statistics
      [Figure 5: Top 10 most subjective groups (bar chart; vertical axis from 0.0 to 1.2; group labels not recoverable)]
      Surprising top scores.
      Discussion groups.

  22. Statistics
      Of the 7 entries with the most negative polarity (-1.0), 6 belong to the group 'eos'.
      All of them are in Spanish (?)

  23. Conclusion
      Useful, but not universally.
      Lessons learned:
      • Filter the dataset intelligently
      • Don't try to solve everything with one library
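
One way to act on the first lesson, given the apparently Spanish emails among the -1.0 entries, is to drop texts that do not look English before scoring them. The slides do not describe any such filter, so this is a hypothetical heuristic: it measures the share of words that are common English stop words, with a hand-picked word list and a guessed threshold.

```python
import re

# Small hand-picked sample of English stop words (assumption, not exhaustive).
ENGLISH_STOP_WORDS = {
    "the", "is", "at", "of", "and", "to", "in", "that",
    "it", "for", "this", "on", "with", "be", "are",
}


def looks_english(text, threshold=0.2):
    """Return True if at least `threshold` of the words are English stop words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_STOP_WORDS)
    return hits / len(words) >= threshold


if __name__ == "__main__":
    print(looks_english("The draft is ready for review and it is in the repository"))
    print(looks_english("Este mensaje no tiene ninguna relación con el grupo de trabajo"))
```

A dedicated language-identification library would be more reliable, which fits the second lesson: the sentiment library does not have to solve this part of the problem.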
