
  1. Social Media & Text Analysis, lecture 3 - Language Identification
     (supervised learning and Naive Bayes algorithm)
     CSE 5539-0010, Ohio State University
     Instructor: Alan Ritter
     Website: socialmedia-class.org

  2. In-class Presentation
     • a 10-minute presentation plus a 2-minute Q&A (20 points)
       - topic: a social media platform or an NLP researcher
       - pairing up (two students collaborate)
     • Sign up now!
     Alan Ritter ◦ socialmedia-class.org

  3. Reading #1

  5. Reading #2

  6. Natural Language Processing: Language Technology (Dan Jurafsky)
     • mostly solved
       - Spam detection: ✓ "Let’s go to Agra!" ✗ "Buy V1AGRA …"
       - Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." → ADJ ADJ NOUN VERB ADV
       - Named entity recognition (NER): "Einstein met with UN officials in Princeton" → PERSON, ORG, LOC
     • making good progress
       - Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
       - Coreference resolution: "Carter told Mubarak he shouldn’t run again."
       - Word sense disambiguation (WSD): "I need new batteries for my mouse."
       - Parsing: "I can see Alcatraz from the window!"
       - Machine translation (MT): "The 13th Shanghai International Film Festival…"
       - Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" → Party, May 27, add
     • still really hard
       - Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
       - Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
       - Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"
       - Dialog: "Where is Citizen Kane playing in SF?" "Castro Theatre at 7:30. Do you want a ticket?"

  7. Domain/Genre
     • NLP is often designed for one domain (in-domain) and may not work well for other domains (out-of-domain).
     • Why?
     Domains: News, Blogs, Wikipedia, Forums, Comments, Twitter, …

  8. Domain/Genre
     • How different?
     Source: Baldwin et al., "How Noisy Social Media Text, How Diffrnt Social Media Sources?", IJCNLP 2013

  9. Domain/Genre
     • How different?
       - out-of-vocabulary
     Source: Baldwin et al., "How Noisy Social Media Text, How Diffrnt Social Media Sources?", IJCNLP 2013

  10. Domain/Genre
      • How similar?
        Twitter ≡ Comments < Forums < Blogs < BNC < Wikipedia
      Source: Baldwin et al., "How Noisy Social Media Text, How Diffrnt Social Media Sources?", IJCNLP 2013

  11. Domain/Genre
      • What to do?
        - robust tools/models that work across domains
        - specific tools/models for Twitter data only; many of the techniques/algorithms are useful elsewhere
        (we will see examples of both in the class)

  12. Domain/Genre
      • Why so much Twitter?
        - publicly available (vs. SMS, emails)
        - large amount of data
        - large demand for research/commercial purposes
        - too different from well-edited text (which most NLP tools have been made for)

  13. NLP Pipeline

  14. NLP Pipeline
      • Language Identification
      • Tokenization
      • Part-of-Speech (POS) Tagging
      • Shallow Parsing (Chunking)
      • Named Entity Recognition (NER)
      • Stemming
      • Normalization

  16. Language Identification (a.k.a. Language Detection)

  17. LangID: why needed?
      • Twitter is highly multilingual
      • But NLP is often monolingual

  19. Known as the “Chinese Twitter”: 120 million posts per day

  20. LangID: Google Translate

  21. LangID: Twitter API
      • introduced in March 2013
      • uses two-letter ISO 639-1 codes

  22. LangID Tool: langid.py

  24. LangID: A Classification Problem
      • Input:
        - a document $d$
        - a fixed set of classes $C = \{c_1, c_2, \dots, c_j\}$
      • Output:
        - a predicted class $c \in C$

  25. Classification Method: Hand-crafted Rules
      • Keyword-based approaches do not work well for language identification:
        - poor recall
        - expensive to build large dictionaries for all the different languages
        - cognate words
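
The weaknesses listed above can be seen in a minimal keyword-based sketch. The keyword lists and the function name `keyword_langid` are hypothetical and deliberately tiny; they are not from the lecture.

```python
# Hypothetical, deliberately tiny keyword lists; a real system would need
# large dictionaries for every language it supports.
KEYWORDS = {
    "en": {"the", "and", "is", "you"},
    "es": {"el", "la", "y", "es"},
    "fr": {"le", "la", "et", "est"},
}

def keyword_langid(text):
    """Return the language whose keyword list matches the most tokens, or None."""
    tokens = text.lower().split()
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(keyword_langid("the cat is here"))  # two English keywords match
print(keyword_langid("gr8 c u l8r"))      # no keyword matches at all: poor recall
print(keyword_langid("la casa"))          # "la" is listed for both es and fr: a tie
```

Informal text often contains none of the listed keywords, and words shared between related languages produce ties that the rules cannot break.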

  26. Classification Method: Supervised Machine Learning
      • Input:
        - a document $d$
        - a fixed set of classes $C = \{c_1, c_2, \dots, c_j\}$
        - a training set of $m$ hand-labeled documents $(d_1, c_1), \dots, (d_m, c_m)$
      • Output:
        - a learned classifier $\delta : d \to c$
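
The input/output contract above can be sketched in code. The learner used here, which always predicts the most frequent training label, is only a placeholder chosen for brevity and is not a method from the lecture; the type aliases and the toy training set are also assumptions.

```python
from collections import Counter
from typing import Callable, List, Tuple

Document = str
Label = str
Classifier = Callable[[Document], Label]

def train_majority(training_set: List[Tuple[Document, Label]]) -> Classifier:
    """Learn the simplest possible classifier: always predict the most common label."""
    majority_label, _ = Counter(label for _, label in training_set).most_common(1)[0]
    return lambda document: majority_label

# Hypothetical hand-labeled examples (d_1, c_1), ..., (d_m, c_m).
training_set = [
    ("good morning", "en"),
    ("see you soon", "en"),
    ("buenos dias", "es"),
]
classify = train_majority(training_set)  # the learned classifier: d -> c
print(classify("bonjour"))
```

Any supervised method has this same shape: labeled pairs go in, and a function from documents to classes comes out.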

  27. Classification Method: Supervised Machine Learning
      Source: NLTK Book

  29. Classification Method: Supervised Machine Learning
      • Naïve Bayes
      • Logistic Regression
      • Support Vector Machines (SVM)
      • …

  31. Naïve Bayes
      • a family of simple probabilistic classifiers based on Bayes’ theorem with strong (naive) independence assumptions between the features.
      • Bayes’ Theorem:
        $$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
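
As a small worked example of Bayes' theorem, the sketch below computes a posterior for two classes. All numbers are made up for illustration and do not come from the lecture.

```python
# Illustrative numbers only: two classes and one observed document d.
prior = {"en": 0.7, "es": 0.3}          # P(c)
likelihood = {"en": 0.001, "es": 0.02}  # P(d | c)

# P(d) is the sum over classes of P(d | c) P(c).
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Bayes' theorem: P(c | d) = P(d | c) P(c) / P(d)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

Even though "en" has the larger prior, the much larger likelihood under "es" makes "es" the more probable class for this document.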

  32. Naïve Bayes
      • For a document $d$, find the most probable class $c$ (MAP: maximum a posteriori):
        $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d)$$
      Source: adapted from Dan Jurafsky

  33. Naïve Bayes
      • For a document $d$, find the most probable class $c$:
        $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d)$$
        $$= \underset{c \in C}{\operatorname{argmax}}\; \frac{P(d \mid c)\,P(c)}{P(d)} \quad \text{(Bayes rule)}$$
      Source: adapted from Dan Jurafsky

  34. Naïve Bayes
      • For a document $d$, find the most probable class $c$:
        $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d)$$
        $$= \underset{c \in C}{\operatorname{argmax}}\; \frac{P(d \mid c)\,P(c)}{P(d)} \quad \text{(Bayes rule)}$$
        $$= \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c) \quad \text{(drop the denominator)}$$
      Source: adapted from Dan Jurafsky
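
Dropping the denominator is safe because $P(d)$ is the same positive constant for every class, so it cannot change which class scores highest. A quick numerical check, using made-up probabilities and a hypothetical helper `argmax`:

```python
# Illustrative scores only (not from the lecture).
prior = {"en": 0.5, "es": 0.3, "fr": 0.2}             # P(c)
likelihood = {"en": 0.002, "es": 0.010, "fr": 0.004}  # P(d | c)

def argmax(scores):
    """Return the key with the highest value."""
    return max(scores, key=scores.get)

# Numerator of Bayes' rule for each class: P(d | c) P(c)
unnormalized = {c: likelihood[c] * prior[c] for c in prior}

# Dividing by P(d) rescales every class by the same positive constant...
p_d = sum(unnormalized.values())
posterior = {c: score / p_d for c, score in unnormalized.items()}

# ...so both versions select the same class.
print(argmax(posterior), argmax(unnormalized))
```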

  35. Naïve Bayes
      • document $d$ represented as features $t_1, t_2, \dots, t_n$:
        $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c)$$
        $$= \underset{c \in C}{\operatorname{argmax}}\; P(t_1, t_2, \dots, t_n \mid c)\,P(c)$$
      Source: adapted from Dan Jurafsky

  36. Naïve Bayes
      • document $d$ represented as features $t_1, t_2, \dots, t_n$:
        $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(t_1, t_2, \dots, t_n \mid c)\,P(c)$$
      • prior $P(c)$: how often does this class occur? (a simple count)
      Source: adapted from Dan Jurafsky
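
Estimating the prior by simple counting might look like the following sketch. The function name `estimate_priors` and the label list are hypothetical.

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(c) as the fraction of training documents labeled c."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

# Hypothetical labels of five training documents.
labels = ["en", "en", "es", "en", "fr"]
priors = estimate_priors(labels)
print(priors)
```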

  37. Naïve Bayes
      • document $d$ represented as features $t_1, t_2, \dots, t_n$:
        $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; \underbrace{P(t_1, t_2, \dots, t_n \mid c)}_{\text{likelihood}}\;\underbrace{P(c)}_{\text{prior}}$$
      • likelihood: $O(|T|^n \cdot |C|)$ parameters, where n = number of unique n-gram tokens; need to make a simplifying assumption
      Source: adapted from Dan Jurafsky

  38. Naïve Bayes
      • Conditional Independence Assumption: the feature probabilities $P(t_i \mid c)$ are independent given the class $c$
        $$P(t_1, t_2, \dots, t_n \mid c) = P(t_1 \mid c) \cdot P(t_2 \mid c) \cdots P(t_n \mid c)$$
      Source: adapted from Dan Jurafsky
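
To get a feel for why the independence assumption matters, the sketch below compares rough parameter counts with and without it. It assumes each feature takes one of $|T|$ values, that a document has $n$ features, and that the per-feature distributions are shared across positions; the sizes are hypothetical.

```python
def full_joint_parameters(num_feature_values, num_features, num_classes):
    """Rough count for modeling P(t_1, ..., t_n | c) with no assumption: |T|^n * |C|."""
    return num_feature_values ** num_features * num_classes

def naive_bayes_parameters(num_feature_values, num_classes):
    """Rough count when only P(t | c) is stored for each value and class: |T| * |C|."""
    return num_feature_values * num_classes

# Hypothetical sizes: 1,000 distinct n-grams, 5 features per document, 10 languages.
print(full_joint_parameters(1000, 5, 10))
print(naive_bayes_parameters(1000, 10))
```

Under these assumed sizes the full joint model is far too large to estimate from data, while the factored model stays small.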

  39. Naïve Bayes
      • For a document $d$, find the most probable class $c$:
        $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(t_1, t_2, \dots, t_n \mid c)\,P(c)$$
        $$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\; P(c) \prod_{t_i \in d} P(t_i \mid c)$$
      Source: adapted from Dan Jurafsky
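
A minimal sketch of the $c_{NB}$ decision rule for language identification follows. Several choices here are assumptions rather than the lecture's method: character bigrams as the features $t_i$, add-one smoothing for unseen bigrams, summing log probabilities instead of multiplying probabilities (to avoid numerical underflow), and the toy training sentences.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (the features t_i)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_nb(training_set, n=2):
    """Count classes and per-class n-grams from (document, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in training_set:
        class_counts[label] += 1
        grams = char_ngrams(doc, n)
        feature_counts[label].update(grams)
        vocab.update(grams)
    return class_counts, feature_counts, vocab

def classify_nb(doc, model, n=2):
    """Return argmax over c of log P(c) + sum_i log P(t_i | c)."""
    class_counts, feature_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        denom = sum(feature_counts[label].values()) + len(vocab)
        for gram in char_ngrams(doc, n):
            # Add-one smoothing (an assumption here) avoids log(0) for unseen n-grams.
            score += math.log((feature_counts[label][gram] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
training_set = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("this is the house that we want", "en"),
    ("el perro come en la casa de la playa", "es"),
    ("esta es la casa que queremos", "es"),
]
model = train_nb(training_set)
print(classify_nb("the dog is in the house", model))
print(classify_nb("la playa es de el perro", model))
```

Taking logarithms does not change the argmax, since the logarithm is monotonically increasing.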

  40. Naïve Bayes
      $$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\; P(c) \prod_{t_i \in d} P(t_i \mid c)$$
      • Probabilistic Graphical Model: a class node $c$ with an edge to each feature node $t_1, t_2, \dots, t_n$

  41. Variations of Naïve Bayes
      $$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c)$$
      • different assumptions on the distributions of features:
        - Multinomial: discrete features
        - Bernoulli: binary features
        - Gaussian: continuous features
      Source: adapted from Dan Jurafsky
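
The difference between the multinomial and Bernoulli views is mostly in how a document is turned into features. A small sketch, with hypothetical function names and tokens (a Gaussian model would instead use real-valued features, for example an average word length):

```python
from collections import Counter

def multinomial_features(tokens):
    """Multinomial view: how many times each token occurs."""
    return dict(Counter(tokens))

def bernoulli_features(tokens, vocabulary):
    """Bernoulli view: whether each vocabulary item occurs at all."""
    present = set(tokens)
    return {term: term in present for term in vocabulary}

# Hypothetical tokenized document and vocabulary.
tokens = ["la", "casa", "de", "la", "playa"]
vocabulary = ["la", "casa", "the"]
print(multinomial_features(tokens))
print(bernoulli_features(tokens, vocabulary))
```

Note that the Bernoulli view also records the absence of vocabulary items, which the multinomial view ignores.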
