Social Media & Text Analysis, Lecture 3: Language Identification (supervised learning and the Naive Bayes algorithm). CSE 5539-0010, Ohio State University. Instructor: Alan Ritter. Website: socialmedia-class.org
In-class Presentation • a 10-minute presentation plus 2-minute Q&A (20 points) - on a social media platform or an NLP researcher - pairing up (2-student collaboration) • Sign up now! Alan Ritter ◦ socialmedia-class.org
Reading #1 Alan Ritter ◦ socialmedia-class.org
Reading #2 Alan Ritter ◦ socialmedia-class.org
Natural Language Processing (slide adapted from Dan Jurafsky): the state of language technology.
• Mostly solved: Spam detection ("Let's go to Agra!" ✓ vs. "Buy V1AGRA …" ✗); Part-of-speech (POS) tagging ("Colorless green ideas sleep furiously." → ADJ ADJ NOUN VERB ADV); Named entity recognition (NER) ("Einstein met with UN officials in Princeton" → PERSON, ORG, LOC).
• Making good progress: Sentiment analysis ("Best roast chicken in San Francisco!" vs. "The waiter ignored us for 20 minutes."); Coreference resolution ("Carter told Mubarak he shouldn't run again."); Word sense disambiguation (WSD) ("I need new batteries for my mouse."); Parsing ("I can see Alcatraz from the window!"); Machine translation (MT) (Chinese source → "The 13th Shanghai International Film Festival…"); Information extraction (IE) ("You're invited to our dinner party, Friday May 27 at 8:30" → add "Party" on May 27 to the calendar).
• Still really hard: Question answering (QA) ("Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"); Paraphrase ("XYZ acquired ABC yesterday" ≈ "ABC has been taken over by XYZ"); Summarization ("The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"); Dialog ("Where is Citizen Kane playing in SF?" → "Castro Theatre at 7:30. Do you want a ticket?").
Domain/Genre • NLP is often designed for one domain (in-domain) and may not work well for other domains (out-of-domain). • Why? News, Blogs, Wikipedia, Forums, Comments, Twitter, … Alan Ritter ◦ socialmedia-class.org
Domain/Genre • How different? Source: Baldwin et al. "How Noisy Social Media Text, How Diffrnt Social Media Sources?" IJCNLP 2013 Alan Ritter ◦ socialmedia-class.org
Domain/Genre • How different? (e.g., in terms of out-of-vocabulary words) Source: Baldwin et al. "How Noisy Social Media Text, How Diffrnt Social Media Sources?" IJCNLP 2013 Alan Ritter ◦ socialmedia-class.org
Domain/Genre • How similar? Twitter ≡ Comments < Forums < Blogs < BNC < Wikipedia Source: Baldwin et al. "How Noisy Social Media Text, How Diffrnt Social Media Sources?" IJCNLP 2013 Alan Ritter ◦ socialmedia-class.org
Domain/Genre • What to do? - build robust tools/models that work across domains - build specific tools/models for Twitter data only (many of the techniques/algorithms are useful elsewhere; we will see examples of both in this class) Alan Ritter ◦ socialmedia-class.org
Domain/Genre • Why so much Twitter? - publicly available (vs. SMS, emails) - large amount of data - large demand for research/commercial purposes - very different from well-edited text (for which most NLP tools were built) Alan Ritter ◦ socialmedia-class.org
NLP Pipeline Alan Ritter ◦ socialmedia-class.org
NLP Pipeline: Language Identification → Tokenization → Normalization → Part-of-Speech (POS) Tagging → Stemming → Named Entity Recognition (NER) → Shallow Parsing (Chunking) Alan Ritter ◦ socialmedia-class.org
Language Identification (a.k.a Language Detection) Alan Ritter ◦ socialmedia-class.org
LangID: why needed? • Twitter is highly multilingual • But NLP is often monolingual Alan Ritter ◦ socialmedia-class.org
Sina Weibo, known as the "Chinese Twitter": 120 million posts / day Alan Ritter ◦ socialmedia-class.org
LangID: Google Translate Alan Ritter ◦ socialmedia-class.org
LangID: Twitter API • introduced in March 2013 • uses two-letter ISO 639-1 codes Alan Ritter ◦ socialmedia-class.org
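As a small illustrative sketch (not from the slides): tweets returned by the Twitter API carry a lang field with the detected language code. The tweet JSON below is invented, but reading the field looks like this:

```python
import json

# Invented example of a tweet object; real tweets returned by the
# Twitter API include a "lang" field holding the detected language code.
tweet_json = '{"id": 123, "text": "Bonjour tout le monde !", "lang": "fr"}'

tweet = json.loads(tweet_json)
print(tweet["lang"])  # -> "fr" (two-letter ISO 639-1 code)
```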
LangID Tool: langid.py Alan Ritter ◦ socialmedia-class.org
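A quick, hedged sketch of how langid.py is typically called (based on the library's documented interface; the example sentences are invented):

```python
import langid  # pip install langid

# classify() returns a (language code, score) pair
print(langid.classify("This is an English sentence."))    # e.g. ('en', ...)
print(langid.classify("Este es un ejemplo en español."))  # e.g. ('es', ...)

# Optionally restrict the set of candidate languages
langid.set_languages(['en', 'es', 'fr'])
print(langid.classify("Je ne parle pas français."))       # e.g. ('fr', ...)
```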
LangID: A Classification Problem • Input: - a document d - a fixed set of classes C = {c1, c2, …, cj} • Output: - a predicted class c ∈ C Alan Ritter ◦ socialmedia-class.org
Classification Method: Hand-crafted Rules • Keyword-based approaches do not work well for language identification: - poor recall - expensive to build large dictionaries for all the different languages - cognate words shared across languages cause confusion Alan Ritter ◦ socialmedia-class.org
Classification Method: Supervised Machine Learning • Input: - a document d - a fixed set of classes C = {c1, c2, …, cj} - a training set of m hand-labeled documents (d1, c1), …, (dm, cm) • Output: - a learned classifier δ : d → c Alan Ritter ◦ socialmedia-class.org
Classification Method: Supervised Machine Learning Source: NLTK Book Alan Ritter ◦ socialmedia-class.org
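As a concrete, hedged sketch of this supervised setup in the style of the NLTK book (not taken from the slides): a feature extractor maps each document to a feature dictionary, and NLTK's built-in Naive Bayes trainer learns the classifier. The toy training sentences are invented for illustration:

```python
import nltk

def doc_features(text):
    """Represent a document by its character bigrams (a common LangID feature)."""
    text = text.lower()
    return {f"bigram({text[i:i+2]})": True for i in range(len(text) - 1)}

# Tiny hand-labeled training set (invented examples)
train_docs = [
    ("the cat sat on the mat", "en"),
    ("where is the train station", "en"),
    ("el gato está en la casa", "es"),
    ("dónde está la estación de tren", "es"),
]

train_set = [(doc_features(d), c) for (d, c) in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(doc_features("the dog sat on the sofa")))   # likely 'en'
print(classifier.classify(doc_features("el perro está en el sofá")))  # likely 'es'
```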
Classification Method: Supervised Machine Learning • Naïve Bayes • Logistic Regression • Support Vector Machines (SVM) • … Alan Ritter ◦ socialmedia-class.org
Naïve Bayes • a family of simple probabilistic classifiers based on Bayes' theorem, with strong (naive) independence assumptions between the features. • Bayes' Theorem: P(c | d) = P(d | c) P(c) / P(d) Alan Ritter ◦ socialmedia-class.org
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(c | d) ("MAP" = maximum a posteriori) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c) / P(d) (Bayes Rule) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c) / P(d) (Bayes Rule) = argmax_{c ∈ C} P(d | c) P(c) (drop the denominator P(d), which is the same for every class) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • document d represented as features t1, t2, …, tn: c_MAP = argmax_{c ∈ C} P(d | c) P(c) = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • document d represented as features t1, t2, …, tn: c_MAP = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c), where P(c) is the prior: how often does this class occur? Estimated by a simple count. Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • document d represented as features t1, t2, …, tn: c_MAP = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c), where P(t1, t2, …, tn | c) is the likelihood and P(c) is the prior. Estimating the full joint likelihood would take O(|T|^n · |C|) parameters (n features drawn from the set T of n-gram tokens), so we need to make a simplifying assumption. Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • Conditional Independence Assumption: the features ti are independent of each other given the class c: P(t1, t2, …, tn | c) = P(t1 | c) · P(t2 | c) · … · P(tn | c) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c), which under the independence assumption becomes c_NB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes c_NB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c) • As a probabilistic graphical model: the class c generates each feature t1, t2, …, tn independently. Alan Ritter ◦ socialmedia-class.org
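To make the formulas concrete, here is a small, hedged sketch (not from the slides) of a multinomial Naive Bayes language identifier over character bigrams, using add-one smoothing and log probabilities to avoid underflow; the toy training data is invented:

```python
import math
from collections import Counter, defaultdict

def char_bigrams(text):
    """Features t_i: character bigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i+2] for i in range(len(text) - 1)]

# Tiny invented training set of (document, language) pairs
train = [
    ("the cat sat on the mat", "en"),
    ("where is the train station", "en"),
    ("el gato está en la casa", "es"),
    ("dónde está la estación de tren", "es"),
]

# Estimate the prior P(c) and the likelihoods P(t_i | c) by counting
class_counts = Counter(c for _, c in train)
feature_counts = defaultdict(Counter)  # feature_counts[c][t] = count of bigram t in class c
vocab = set()
for doc, c in train:
    for t in char_bigrams(doc):
        feature_counts[c][t] += 1
        vocab.add(t)

def predict(doc):
    """c_NB = argmax_c  log P(c) + sum_i log P(t_i | c), with add-one smoothing."""
    scores = {}
    for c in class_counts:
        log_prob = math.log(class_counts[c] / len(train))  # log P(c)
        total = sum(feature_counts[c].values())
        for t in char_bigrams(doc):
            # add-one (Laplace) smoothing so unseen bigrams don't zero out the product
            log_prob += math.log((feature_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = log_prob
    return max(scores, key=scores.get)

print(predict("the dog sat on the sofa"))    # likely 'en'
print(predict("el perro está en el sofá"))   # likely 'es'
```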
Variations of Naïve Bayes c_MAP = argmax_{c ∈ C} P(d | c) P(c) • different assumptions on the distributions of the features: - Multinomial: discrete features - Bernoulli: binary features - Gaussian: continuous features Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
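As a hedged illustration of these variants (assuming scikit-learn, which the slides do not mention), the same toy language-ID task can be run with MultinomialNB (count features) or BernoulliNB (binary presence/absence features); GaussianNB would be the choice for continuous features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["the cat sat on the mat", "where is the train station",
        "el gato está en la casa", "dónde está la estación de tren"]
labels = ["en", "en", "es", "es"]

# Character-bigram counts as discrete features
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)

multinomial = MultinomialNB().fit(X, labels)  # uses counts of each bigram
bernoulli = BernoulliNB().fit(X, labels)      # uses presence/absence of each bigram

X_test = vectorizer.transform(["el perro está en el sofá"])
print(multinomial.predict(X_test))  # likely ['es']
print(bernoulli.predict(X_test))    # likely ['es']
```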