Chair of Network Architectures and Services
Department of Informatics
Technical University of Munich

Text Mining on Mailing Lists: Sentiment Analysis
Gordon Heiczman, B. Sc.
October 13, 2017

Introduction What is Sentiment Analysis? G. Heiczman — Sentiment Analysis 2
Introduction
Problems of today:
• Too much information
• Too little time

Introduction
Agenda
• Text Mining summary
• Example of practical application
• Presentation of results
• Conclusion and Lessons Learned

Text Mining
Feature Selection
Main purpose: extract valuable information and get rid of redundant features
'Bag of Words' approach
Most common selection steps:
• Removal of stop words (the, is, at ...)
• Removal of plurals (dogs -> dog)
• Word / n-gram frequency
• Part of Speech (POS) tagging (adjectives)
• Opinion words (like, hate, love ...)
• Detection of negation (not good -> bad)

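Some of the selection steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the tooling used in the talk: the word lists are tiny placeholders, the function names are hypothetical, the plural removal is deliberately naive, and negation is only marked (not good -> not_good) rather than mapped to an antonym as on the slide.

```python
import re
from collections import Counter

# Tiny illustrative word lists (assumptions); real projects use fuller resources.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "this"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def strip_plural(word):
    """Naive plural removal: drop one trailing 's' (dogs -> dog)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def mark_negations(tokens):
    """Join a negation word with the token that follows it (not good -> not_good)."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def bigrams(tokens):
    """Return consecutive token pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))


def extract_features(text):
    """Return a bag of words (token -> count) after the selection steps."""
    tokens = mark_negations(tokenize(text))
    tokens = [strip_plural(t) for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)
```

Calling `extract_features("The dogs are not good at this")` keeps only `dog`, `are`, and `not_good`, each with a count of one.
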
Text Mining
Sentiment Classification
Three main categories:
• Machine Learning
• Lexicon-based
• Hybrid

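To make the lexicon-based category concrete, here is a minimal sketch. The lexicon, its scores, and the function names are invented for illustration; real lexicon-based tools ship far larger word lists and handle intensifiers, punctuation, and longer negation scopes.

```python
import re

# Hypothetical mini-lexicon (assumption): word -> polarity score in [-1.0, 1.0].
LEXICON = {"like": 0.5, "love": 0.8, "good": 0.7, "hate": -0.8, "bad": -0.7}
NEGATIONS = {"not", "never", "no"}


def lexicon_polarity(text):
    """Average the scores of known opinion words.

    A negation word flips the sign of the immediately following word only.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            score = LEXICON[token]
            scores.append(-score if negate else score)
        negate = False  # the negation scope ends after one word
    return sum(scores) / len(scores) if scores else 0.0


def classify(text):
    """Map the averaged polarity to a coarse label."""
    polarity = lexicon_polarity(text)
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"
```

A machine learning classifier would replace the fixed lexicon with weights learned from labeled examples, and a hybrid approach combines both.
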
Text Mining
Pitfalls
• Named Entity Recognition, i.e. "What is the topic?"
• Anaphora Resolution (reference word resolution): "What is 'it' referring to?"
• Sarcasm
• Abbreviations, poor grammar / punctuation / spelling

Practical Application
• Dataset
• Language
• Email retrieval
• Content retrieval
• Sentiment value retrieval

Practical Application
Dataset
Collection of emails from IETF mailing lists. The task of the IETF (Internet Engineering Task Force) is to set Internet standards.

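The email and content retrieval steps might look roughly like the following sketch. It assumes each email is available as a raw RFC 5322 message string, since the slides do not describe the storage format in detail. `extract_text` and the sample message are hypothetical, and dropping quoted reply lines is an assumption rather than something the talk states.

```python
from email import message_from_string, policy


def extract_text(raw_email):
    """Return (subject, plain-text body) for one raw email message."""
    msg = message_from_string(raw_email, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    # Assumption: quoted reply lines (starting with '>') are dropped so that
    # each email is scored only on what its own author wrote.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return str(msg["subject"]), "\n".join(lines).strip()


# Hypothetical sample message, not taken from the IETF dataset.
RAW = """\
From: alice@example.org
To: list@example.org
Subject: Re: draft review

> I think section 3 is unclear.
I agree, the wording should be improved.
"""

subject, text = extract_text(RAW)
print(subject)
print(text)
```

The resulting text is what would then be passed on to the sentiment value retrieval step.
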
Practical Application
Language
C# or Python?
Not enough tools that are both comprehensive and completely free
Notable C# tools:
• VaderSharp (free but primitive)
• Aylien (paid)
• Watson D.C. (paid)
• Vivekn (free but no documentation)
Python tool: TextBlob

Practical Application
Multiple values obtained through sentiment analysis (SA):
• Polarity (-1.0 to 1.0)
• Subjectivity (0.0 to 1.0)
• Most used word
• Sentence count

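Polarity and subjectivity come from the sentiment library, while the other two values are simple counts. As a rough standard-library sketch of those counts (the function names and the splitting rules are assumptions, and a real tokenizer handles abbreviations and quotations far better):

```python
import re
from collections import Counter


def sentence_count(text):
    """Count sentences by splitting on '.', '!' or '?' followed by whitespace or the end."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def most_used_word(text):
    """Return the most frequent word (case-insensitive), or None for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return None
    return Counter(words).most_common(1)[0][0]
```

In practice stop words would usually be removed first, otherwise the most used word is almost always something like "the".
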
Practical Application
TextBlob example
    from textblob import TextBlob

    blob = TextBlob("I think this presentation is really, really good!")
    print(blob.sentiment)              # Gives both polarity and subjectivity, around 1.0
    print(blob.words.count('really'))  # Gives 2
    print(blob.noun_phrases)           # Gives noun phrases, in this case "presentation"

Practical Application
Figure 1: Example of email with polarity 1.0
• Filename: /home/.../geopriv/2007-12.mail
• Key: 251

Practical Application
Program flow

Statistics
Figure 2: Top 10 groups who use the most sentences (bar chart, y-axis from 0 to 80; group labels not recoverable)
Even distribution
Indication of in-depth discussion or off-topic rambling?

Statistics
Figure 3: Top 10 most positive groups (bar chart, y-axis from 0.00 to 0.60; group labels not recoverable)
Bar values: 0.5, 0.375854, 0.307836, 0.303693, 0.29163, 0.276323, 0.253274, 0.251799, 0.25, 0.25
Logarithmic distribution
Notable group: "iaoc-scribes"

Statistics
Figure 4: Top 10 most negative groups (bar chart, y-axis from 0.00 to -0.50; group labels not recoverable)
Stronger logarithmic distribution
Notable group: "ietf-sailors"

Statistics
Figure 5: Top 10 most subjective groups (bar chart, y-axis from 0.0 to 1.2; group labels not recoverable)
Surprising top scores
Discussion groups

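Rankings like the ones in Figures 3 to 5 can be produced by aggregating the per-email values by group. The sketch below assumes that each group's score is the mean over its emails, which the slides do not state explicitly. `top_groups` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def top_groups(records, n=10, most_positive=True):
    """Rank groups by their mean value.

    records: iterable of (group_name, value) pairs, one per email.
    Returns up to n (group_name, mean_value) pairs, highest first when
    most_positive is True and lowest first otherwise.
    """
    by_group = defaultdict(list)
    for group, value in records:
        by_group[group].append(value)
    averages = {group: mean(values) for group, values in by_group.items()}
    ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=most_positive)
    return ranked[:n]


# Made-up sample data for illustration only.
sample = [("a", 0.5), ("a", 0.0), ("b", -0.5), ("c", 1.0)]
print(top_groups(sample, n=2))
print(top_groups(sample, n=1, most_positive=False))
```

The same function works for subjectivity or sentence counts by passing those values instead of polarity.
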
Statistics
Of the 7 most negative entries (polarity -1.0), 6 belong to the group 'eos'
All of them are in Spanish (?)

Conclusion
Useful, but not universally applicable
Lessons learned:
• Filter the dataset intelligently
• Don't try to solve everything with one library

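One way to act on the first lesson is to drop emails that are probably not in English before scoring them. The heuristic below is only a sketch: the stop word list, the threshold, and the function name are assumptions, and a dedicated language detection library would be more reliable.

```python
import re

# Small list of common English function words (assumption, not exhaustive).
ENGLISH_STOP_WORDS = {"the", "is", "and", "of", "to", "in", "that", "it", "for", "this"}


def looks_english(text, threshold=0.1):
    """Return True if at least `threshold` of the words are English stop words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_STOP_WORDS)
    return hits / len(words) >= threshold
```

Emails for which `looks_english` returns False could be skipped or routed to a tool that supports the detected language.
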