Social Media & Text Analysis, Lecture 3: Language Identification (supervised learning and the Naive Bayes algorithm). CSE 5539-0010, Ohio State University. Instructor: Alan Ritter. Website: socialmedia-class.org
In-class Presentation • a 10-minute presentation plus 2-minute Q&A (20 points) - on a social media platform or an NLP researcher - pairing up (2-student collaboration) • Sign up now! Alan Ritter ◦ socialmedia-class.org
Reading #1 Alan Ritter ◦ socialmedia-class.org
Reading #2 Alan Ritter ◦ socialmedia-class.org
Natural Language Processing (slide adapted from Dan Jurafsky): the state of language technology.
• Mostly solved: Spam detection ("Let's go to Agra!" ✓ vs. "Buy V1AGRA …" ✗); Part-of-speech (POS) tagging ("Colorless green ideas sleep furiously." → ADJ ADJ NOUN VERB ADV); Named entity recognition (NER) ("Einstein met with UN officials in Princeton" → PERSON, ORG, LOC).
• Making good progress: Sentiment analysis ("Best roast chicken in San Francisco!" vs. "The waiter ignored us for 20 minutes."); Coreference resolution ("Carter told Mubarak he shouldn't run again."); Word sense disambiguation (WSD) ("I need new batteries for my mouse."); Parsing ("I can see Alcatraz from the window!"); Machine translation (MT) (Chinese source → "The 13th Shanghai International Film Festival…"); Information extraction (IE) ("You're invited to our dinner party, Friday May 27 at 8:30" → add "Party" on May 27 to the calendar).
• Still really hard: Question answering (QA) ("Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"); Paraphrase ("XYZ acquired ABC yesterday" ≈ "ABC has been taken over by XYZ"); Summarization ("The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"); Dialog ("Where is Citizen Kane playing in SF?" → "Castro Theatre at 7:30. Do you want a ticket?").
Domain/Genre • NLP is often designed for one domain (in-domain) and may not work well for other domains (out-of-domain). • Why? News, Blogs, Wikipedia, Forums, Comments, Twitter, … Alan Ritter ◦ socialmedia-class.org
Domain/Genre • How different? Source: Baldwin et al. "How Noisy Social Media Text, How Diffrnt Social Media Sources?" IJCNLP 2013 Alan Ritter ◦ socialmedia-class.org
Domain/Genre • How different? (e.g., in terms of out-of-vocabulary words) Source: Baldwin et al. "How Noisy Social Media Text, How Diffrnt Social Media Sources?" IJCNLP 2013 Alan Ritter ◦ socialmedia-class.org
Domain/Genre • How similar? Twitter ≡ Comments < Forums < Blogs < BNC < Wikipedia Source: Baldwin et al. "How Noisy Social Media Text, How Diffrnt Social Media Sources?" IJCNLP 2013 Alan Ritter ◦ socialmedia-class.org
Domain/Genre • What to do? - build robust tools/models that work across domains - build specific tools/models for Twitter data only (many of the techniques/algorithms are useful elsewhere; we will see examples of both in this class) Alan Ritter ◦ socialmedia-class.org
Domain/Genre • Why so much Twitter? - publicly available (vs. SMS, emails) - large amount of data - large demand for research/commercial purposes - very different from well-edited text (for which most NLP tools were built) Alan Ritter ◦ socialmedia-class.org
NLP Pipeline Alan Ritter ◦ socialmedia-class.org
NLP Pipeline: Language Identification → Tokenization → Normalization → Part-of-Speech (POS) Tagging → Stemming → Named Entity Recognition (NER) → Shallow Parsing (Chunking) Alan Ritter ◦ socialmedia-class.org
Language Identification (a.k.a Language Detection) Alan Ritter ◦ socialmedia-class.org
LangID: why needed? • Twitter is highly multilingual • But NLP is often monolingual Alan Ritter ◦ socialmedia-class.org
Sina Weibo, known as the "Chinese Twitter": 120 million posts / day Alan Ritter ◦ socialmedia-class.org
LangID: Google Translate Alan Ritter ◦ socialmedia-class.org
LangID: Twitter API • introduced in March 2013 • uses two-letter ISO 639-1 codes Alan Ritter ◦ socialmedia-class.org
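As a small illustrative sketch (not from the slides): tweets returned by the Twitter API carry a lang field with the detected language code. The tweet JSON below is invented, but reading the field looks like this:

```python
import json

# Invented example of a tweet object; real tweets returned by the
# Twitter API include a "lang" field holding the detected language code.
tweet_json = '{"id": 123, "text": "Bonjour tout le monde !", "lang": "fr"}'

tweet = json.loads(tweet_json)
print(tweet["lang"])  # -> "fr" (two-letter ISO 639-1 code)
```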
LangID Tool: langid.py Alan Ritter ◦ socialmedia-class.org
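A quick, hedged sketch of how langid.py is typically called (based on the library's documented interface; the example sentences are invented):

```python
import langid  # pip install langid

# classify() returns a (language code, score) pair
print(langid.classify("This is an English sentence."))    # e.g. ('en', ...)
print(langid.classify("Este es un ejemplo en español."))  # e.g. ('es', ...)

# Optionally restrict the set of candidate languages
langid.set_languages(['en', 'es', 'fr'])
print(langid.classify("Je ne parle pas français."))       # e.g. ('fr', ...)
```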
LangID: A Classification Problem • Input: - a document d - a fixed set of classes C = {c1, c2, …, cj} • Output: - a predicted class c ∈ C Alan Ritter ◦ socialmedia-class.org
Classification Method: Hand-crafted Rules • Keyword-based approaches do not work well for language identification: - poor recall - expensive to build large dictionaries for all the different languages - cognate words shared across languages cause confusion Alan Ritter ◦ socialmedia-class.org
Classification Method: Supervised Machine Learning • Input: - a document d - a fixed set of classes C = {c1, c2, …, cj} - a training set of m hand-labeled documents (d1, c1), …, (dm, cm) • Output: - a learned classifier δ : d → c Alan Ritter ◦ socialmedia-class.org
Classification Method: Supervised Machine Learning Source: NLTK Book Alan Ritter ◦ socialmedia-class.org
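As a concrete, hedged sketch of this supervised setup in the style of the NLTK book (not taken from the slides): a feature extractor maps each document to a feature dictionary, and NLTK's built-in Naive Bayes trainer learns the classifier. The toy training sentences are invented for illustration:

```python
import nltk

def doc_features(text):
    """Represent a document by its character bigrams (a common LangID feature)."""
    text = text.lower()
    return {f"bigram({text[i:i+2]})": True for i in range(len(text) - 1)}

# Tiny hand-labeled training set (invented examples)
train_docs = [
    ("the cat sat on the mat", "en"),
    ("where is the train station", "en"),
    ("el gato está en la casa", "es"),
    ("dónde está la estación de tren", "es"),
]

train_set = [(doc_features(d), c) for (d, c) in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(doc_features("the dog sat on the sofa")))   # likely 'en'
print(classifier.classify(doc_features("el perro está en el sofá")))  # likely 'es'
```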
Classification Method: Supervised Machine Learning • Naïve Bayes • Logistic Regression • Support Vector Machines (SVM) • … Alan Ritter ◦ socialmedia-class.org
Naïve Bayes • a family of simple probabilistic classifiers based on Bayes' theorem, with strong (naive) independence assumptions between the features. • Bayes' Theorem: P(c | d) = P(d | c) P(c) / P(d) Alan Ritter ◦ socialmedia-class.org
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(c | d) ("MAP" = maximum a posteriori) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c) / P(d) (Bayes Rule) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c) / P(d) (Bayes Rule) = argmax_{c ∈ C} P(d | c) P(c) (drop the denominator P(d), which is the same for every class) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • document d represented as features t1, t2, …, tn: c_MAP = argmax_{c ∈ C} P(d | c) P(c) = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • document d represented as features t1, t2, …, tn: c_MAP = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c), where P(c) is the prior: how often does this class occur? Estimated by a simple count. Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • document d represented as features t1, t2, …, tn: c_MAP = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c), where P(t1, t2, …, tn | c) is the likelihood and P(c) is the prior. Estimating the full joint likelihood would take O(|T|^n · |C|) parameters (n features drawn from the set T of n-gram tokens), so we need to make a simplifying assumption. Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • Conditional Independence Assumption: the features ti are independent of each other given the class c: P(t1, t2, …, tn | c) = P(t1 | c) · P(t2 | c) · … · P(tn | c) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes • For a document d, find the most probable class c: c_MAP = argmax_{c ∈ C} P(t1, t2, …, tn | c) P(c), which under the independence assumption becomes c_NB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c) Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
Naïve Bayes c_NB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c) • As a probabilistic graphical model: the class c generates each feature t1, t2, …, tn independently. Alan Ritter ◦ socialmedia-class.org
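To make the formulas concrete, here is a small, hedged sketch (not from the slides) of a multinomial Naive Bayes language identifier over character bigrams, using add-one smoothing and log probabilities to avoid underflow; the toy training data is invented:

```python
import math
from collections import Counter, defaultdict

def char_bigrams(text):
    """Features t_i: character bigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i+2] for i in range(len(text) - 1)]

# Tiny invented training set of (document, language) pairs
train = [
    ("the cat sat on the mat", "en"),
    ("where is the train station", "en"),
    ("el gato está en la casa", "es"),
    ("dónde está la estación de tren", "es"),
]

# Estimate the prior P(c) and the likelihoods P(t_i | c) by counting
class_counts = Counter(c for _, c in train)
feature_counts = defaultdict(Counter)  # feature_counts[c][t] = count of bigram t in class c
vocab = set()
for doc, c in train:
    for t in char_bigrams(doc):
        feature_counts[c][t] += 1
        vocab.add(t)

def predict(doc):
    """c_NB = argmax_c  log P(c) + sum_i log P(t_i | c), with add-one smoothing."""
    scores = {}
    for c in class_counts:
        log_prob = math.log(class_counts[c] / len(train))  # log P(c)
        total = sum(feature_counts[c].values())
        for t in char_bigrams(doc):
            # add-one (Laplace) smoothing so unseen bigrams don't zero out the product
            log_prob += math.log((feature_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = log_prob
    return max(scores, key=scores.get)

print(predict("the dog sat on the sofa"))    # likely 'en'
print(predict("el perro está en el sofá"))   # likely 'es'
```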
Variations of Naïve Bayes c_MAP = argmax_{c ∈ C} P(d | c) P(c) • different assumptions on the distributions of the features: - Multinomial: discrete features - Bernoulli: binary features - Gaussian: continuous features Alan Ritter ◦ socialmedia-class.org Source: adapted from Dan Jurafsky
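As a hedged illustration of these variants (assuming scikit-learn, which the slides do not mention), the same toy language-ID task can be run with MultinomialNB (count features) or BernoulliNB (binary presence/absence features); GaussianNB would be the choice for continuous features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["the cat sat on the mat", "where is the train station",
        "el gato está en la casa", "dónde está la estación de tren"]
labels = ["en", "en", "es", "es"]

# Character-bigram counts as discrete features
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)

multinomial = MultinomialNB().fit(X, labels)  # uses counts of each bigram
bernoulli = BernoulliNB().fit(X, labels)      # uses presence/absence of each bigram

X_test = vectorizer.transform(["el perro está en el sofá"])
print(multinomial.predict(X_test))  # likely ['es']
print(bernoulli.predict(X_test))    # likely ['es']
```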