Social Media & Text Analysis
Lecture 5 - POS/NE Tagging
CSE 5539-0010, Ohio State University
Instructor: Alan Ritter
Website: socialmedia-class.org

NLP Pipeline (summary so far)
• Language Identification
• Tokenization
• Part-of-Speech (POS) Tagging
• Shallow Parsing (Chunking)
• Named Entity Recognition (NER)
• Stemming
• Normalization
Techniques covered so far: classification (Naïve Bayes), regular expressions
Alan Ritter ◦ socialmedia-class.org

NLP Pipeline (next)
Stages: Language Identification, Tokenization, Part-of-Speech (POS) Tagging, Shallow Parsing (Chunking), Named Entity Recognition (NER), Stemming, Normalization
Next topic: Sequential Tagging

Challenge: Natural Language Processing Breaks
[Figure: example text with LOCATION and PERSON labels; bar chart of Stanford NER scores (axis 0.00 to 1.00) on Newswire vs. Twitter, annotated "~50% Drop"]

Part-of-Speech (POS) Tagging
Cant/MD wait/VB for/IN the/DT ravens/NNP game/NN tomorrow/NN …/: go/VB ray/NNP rice/NNP !!!!!!!/.

Penn Treebank POS Tags
[Figure: table of Penn Treebank POS tags]

Part-of-Speech (POS) Tagging
• Words often have more than one POS:
- The back door = JJ
- On my back = NN
- Win the voters back = RB
- Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
Source: adapted from Chris Manning
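The ambiguity of "back" can be made concrete with a small sketch. This is only an illustration, not the tagger used in the lecture: it builds a most-frequent-tag lookup from the four phrases on the slide. Only the tags for "back" come from the slide; the tags for the other words are assumed Penn Treebank tags, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus built from the four slide phrases.
# Only the tags for "back" come from the slide; the remaining tags are
# assumed Penn Treebank tags added for illustration.
TAGGED = [
    [("The", "DT"), ("back", "JJ"), ("door", "NN")],
    [("On", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("Win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("Promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
]


def tag_counts(sentences):
    """Count how often each (lowercased) word appears with each tag."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return counts


def most_frequent_tag_tagger(counts, default="NN"):
    """Return a tagger that ignores context and always picks one tag per word."""
    # Ties are broken alphabetically so the result is deterministic.
    best = {w: min(c, key=lambda t: (-c[t], t)) for w, c in counts.items()}

    def tag(words):
        return [(w, best.get(w.lower(), default)) for w in words]

    return tag


def accuracy_on(word, sentences, tagger):
    """Accuracy of the tagger on every occurrence of a single word."""
    hits = total = 0
    for sent in sentences:
        predicted = tagger([w for w, _ in sent])
        for (w, gold), (_, pred) in zip(sent, predicted):
            if w.lower() == word:
                total += 1
                hits += pred == gold
    return hits / total


counts = tag_counts(TAGGED)
tagger = most_frequent_tag_tagger(counts)
print(sorted(counts["back"]))               # the tags observed for "back"
print(accuracy_on("back", TAGGED, tagger))  # context-free accuracy on "back"
```

Because the lookup assigns one tag to every instance of "back", it can be right on at most one of the four phrases here, which is why tagging has to look at the surrounding words.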

Twitter-specific Tags
• #hashtag
• @mention
• url
• email address
• emoticon
• discourse marker
• symbols
• …
Source: Gimpel et al., “Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments”, ACL 2011

Noisy Text: Challenges
• Lexical Variation (misspellings, abbreviations)
- `2m', `2ma', `2mar', `2mara', `2maro', `2marrow', `2mor', `2mora', `2moro', `2morow', `2morr', `2morro', `2morrow', `2moz', `2mr', `2mro', `2mrrw', `2mrw', `2mw', `tmmrw', `tmo', `tmoro', `tmorrow', `tmoz', `tmr', `tmro', `tmrow', `tmrrow', `tmrrw', `tmrw', `tmrww', `tmw', `tomaro', `tomarow', `tomarro', `tomarrow', `tomm', `tommarow', `tommarrow', `tommoro', `tommorow', `tommorrow', `tommorw', `tommrow', `tomo', `tomolo', `tomoro', `tomorow', `tomorro', `tomorrw', `tomoz', `tomrw', `tomz'
• Unreliable Capitalization
- “The Hobbit has FINALLY started filming! I cannot wait!”
• Unique Grammar
- “watchng american dad.”
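One simple way to handle lexical variation is a lookup table that maps known variants to a standard form. The sketch below is an assumption-laden illustration, not a method from the lecture: it uses a handful of the "tomorrow" variants listed above, and the names `NORMALIZATION` and `normalize` are hypothetical.

```python
# A subset of the "tomorrow" variants listed on the slide.
TOMORROW_VARIANTS = {
    "2m", "2moro", "2morrow", "2mrw", "tmr", "tmrw", "tmw",
    "tomoz", "tommorow", "tomorow", "tomorro",
}

# Hypothetical normalization table: variant -> standard form.
NORMALIZATION = {variant: "tomorrow" for variant in TOMORROW_VARIANTS}


def normalize(text, table=NORMALIZATION):
    """Replace whitespace-separated tokens found in the table; keep the rest."""
    return " ".join(table.get(tok.lower(), tok) for tok in text.split())


print(normalize("cant wait for the ravens game 2morrow"))
print(normalize("see u tmrw"))
```

A table like this only covers variants seen in advance, and a token with attached punctuation (for example `2morrow!`) would not match without tokenization first.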

Chunking
[VP Cant wait] [PP for] [NP the ravens game] [NP tomorrow] … [VP go] [NP ray rice] !!!!!!!

Chunking
• recovering phrases constructed by the part-of-speech tags
• a.k.a. shallow (partial) parsing:
- full parsing is expensive, and is not very robust
- partial parsing can be much faster, more robust, yet sufficient for many applications
- useful as input (features) for named entity recognition or full parser
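To illustrate how phrases can be recovered from part-of-speech tags alone, here is a minimal rule-based noun-phrase chunker. It is a sketch under stated assumptions, not the chunker used in the lecture: the rule (optional determiner or adjectives followed by one or more nouns) and the function name `np_chunks` are my own, and the input is the tagged tweet from the POS tagging slide.

```python
def np_chunks(tagged):
    """Group (DT|JJ)* NN+ runs into noun-phrase chunks using tags only."""
    chunks = []
    current = []      # words of the chunk being built
    has_noun = False  # whether the chunk already contains a noun

    def flush():
        nonlocal current, has_noun
        if has_noun:  # a run of determiners/adjectives without a noun is dropped
            chunks.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            current.append(word)
        else:
            flush()
            if tag in ("DT", "JJ"):  # a determiner after a noun starts a new chunk
                current.append(word)
    flush()
    return chunks


TWEET = [
    ("Cant", "MD"), ("wait", "VB"), ("for", "IN"), ("the", "DT"),
    ("ravens", "NNP"), ("game", "NN"), ("tomorrow", "NN"), ("…", ":"),
    ("go", "VB"), ("ray", "NNP"), ("rice", "NNP"), ("!!!!!!!", "."),
]

print(np_chunks(TWEET))
```

Note that this tag-only rule merges "tomorrow" into the preceding noun phrase, whereas the chunking slide treats it as a separate NP. Cases like this are one reason practical chunkers are usually learned from annotated data rather than written as a single rule.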

Named Entity Recognition (NER)
Cant wait for the [ORG ravens] game tomorrow … go [PER ray rice] !!!!!!! .
ORG: organization, PER: person, LOC: location

NER: Basic Classes
• ORG: organization
• PER: person
• LOC: location
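NER is commonly framed as tagging each token and then reading entities off the tag sequence. The sketch below assumes a BIO encoding (B- begins an entity, I- continues it, O is outside), which the slides do not state explicitly; the function name `bio_to_spans` is hypothetical. It decodes the example tweet into typed entity strings.

```python
def bio_to_spans(tokens, tags):
    """Convert a BIO tag sequence into a list of (label, entity text) pairs."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.append((label, " ".join(tokens[start:i])))
                start, label = None, None
            if tag.startswith(("B-", "I-")):  # a stray I- is treated as a start
                start, label = i, tag[2:]
    return spans


TOKENS = ["Cant", "wait", "for", "the", "ravens", "game", "tomorrow",
          "…", "go", "ray", "rice", "!!!!!!!"]
# Assumed BIO encoding of the slide's annotation (ravens = ORG, ray rice = PER).
TAGS = ["O", "O", "O", "O", "B-ORG", "O", "O",
        "O", "O", "B-PER", "I-PER", "O"]

print(bio_to_spans(TOKENS, TAGS))
```

Framing NER this way lets the same sequential tagging machinery be reused for POS tagging, chunking, and entity recognition.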

Noisy Text: NLP breaks
[Figure: POS, chunk, and NER output on an example written in a noisy style]

NER: Rich Classes
Source: Strauss, Toma, Ritter, de Marneffe, Xu, “Results of the WNUT16 Named Entity Recognition Shared Task” (WNUT@COLING 2016)

NER: Genre Differences
• PER
- News: Politicians, business leaders, journalists, celebrities
- Tweets: Sportsmen, actors, TV personalities, celebrities, names of friends
• LOC
- News: Countries, cities, rivers, and other places related to current affairs
- Tweets: Restaurants, bars, local landmarks/areas, cities, rarely countries
• ORG
- News: Public and private companies, government organisations
- Tweets: Bands, internet companies, sports clubs
Source: Kalina Bontcheva and Leon Derczynski, “Tutorial on Natural Language Processing for Social Media”, EACL 2014

Weakly Supervised NER
• Freebase / Wikipedia lists provide a source of supervision
• But these lists are highly ambiguous
• Example: China

[Ritter et al., EMNLP 2011]
Distant Supervision with Latent Variables
Latent variable model for Named Entity Categorization with constraints

[Ritter et al., EMNLP 2011]
Entities: Obama, Apple, JFK
Mentions of JFK:
- On my way to JFK early in the…
- JFK's bomber jacket sells for…
- JFK Airport’s Pan Am Worldport…
- Waiting at JFK for our ride…
- When JFK threw first pitch on…
Types: PERSON, FACILITY, PRODUCT
Word weights shown for the types: 's 0.04, announced 0.04, waiting 0.04, threw 0.02, ride 0.03, new 0.03, way 0.02, jacket 0.01, release 0.02, …
[Figure: bar chart for the entity (axis 0 to 5), with one entry marked X]
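The idea of constraining an ambiguous entity to the types its lists allow, then letting context words pick among them, can be sketched with a much simpler stand-in than the latent variable model of the paper. Everything below is an assumption for illustration: the type lists are toy stand-ins for Freebase lists, the grouping of the slide's word weights under each type is my own guess, and the log-score is a naive substitute for the model's inference, not the LLDA method itself.

```python
import math

# Toy stand-ins for Freebase type lists (hypothetical). "jfk" is ambiguous.
TYPE_LISTS = {
    "PERSON": {"obama", "jfk"},
    "FACILITY": {"jfk"},
    "PRODUCT": {"apple"},
}

# Per-type word weights. The numbers appear on the slide, but their
# grouping under each type here is an assumption.
WORD_WEIGHTS = {
    "PERSON": {"'s": 0.04, "threw": 0.02, "jacket": 0.01},
    "FACILITY": {"waiting": 0.04, "ride": 0.03, "way": 0.02},
    "PRODUCT": {"announced": 0.04, "new": 0.03, "release": 0.02},
}
SMOOTHING = 0.001  # weight given to words a type has no entry for


def candidate_types(entity):
    """Types whose list contains the entity; all types if it is unlisted."""
    found = [t for t, names in TYPE_LISTS.items() if entity.lower() in names]
    return found or list(TYPE_LISTS)


def categorize(entity, context):
    """Pick the allowed type whose word weights best explain the context."""
    words = [w.lower() for w in context.split() if w.lower() != entity.lower()]

    def score(entity_type):
        weights = WORD_WEIGHTS[entity_type]
        return sum(math.log(weights.get(w, SMOOTHING)) for w in words)

    return max(candidate_types(entity), key=score)


print(candidate_types("JFK"))
print(categorize("JFK", "Waiting at JFK for our ride"))
print(categorize("JFK", "When JFK threw first pitch on"))
```

The list lookup acts as the constraint (JFK can only be PERSON or FACILITY here), while the context decides between the allowed types for each mention.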

[Ritter et al., EMNLP 2011]
Example Type Lists
• KKTNY = Kourtney and Kim Take New York
• RHOBH = Real Housewives of Beverly Hills

[Ritter et al., EMNLP 2011]
Twitter NER: Classification Results
[Figure: bar chart of F1 (axis 0 to 0.7) for Majority Baseline, Freebase Baseline, Supervised Baseline, DL-Cotrain (Collins and Singer ’99), and LLDA; annotated "25% increase in F1"]

Tool: twitter_nlp

Results of the WNUT16 Named Entity Recognition Shared Task
Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe and Wei Xu

Need for Shared Evaluations
• Fast Moving Area: Papers published in the same year use different datasets and evaluation methodology
• Performance still behind what we would like
- ~0.6 - 0.7 F1 score (much lower than news)
• Explore new ideas & approaches

Related NER Evaluations
• MUC (Newswire)
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc_data_index.html
• CoNLL (Newswire)
- http://www.cnts.ua.ac.be/conll2002/ner/
- http://www.cnts.ua.ac.be/conll2003/ner/
• ACE (Newswire)
- https://catalog.ldc.upenn.edu/LDC2005T09
• Named Entity rEcognition and Linking (NEEL) Challenge (Microblogs)
- #Microposts workshop at WWW
- http://microposts2016.seas.upenn.edu/challenge.html

Twitter NER Evaluation Summary
• Re-run of the 2015 task
• 2 subtasks:
- Segmentation + 10-way classification
- Segmentation only (no classification)
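The difference between the two subtasks can be illustrated with entity-level F1. The sketch below is an assumption, not the official shared-task scorer: entities are (start, end, type) token spans, a prediction counts only on an exact match, and the segmentation-only score is obtained by dropping the type before matching. The example spans and type labels are hypothetical.

```python
def f1(gold, predicted):
    """Entity-level F1 over exact matches between two collections of items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def strip_types(entities):
    """Keep only the span boundaries, for segmentation-only scoring."""
    return {(start, end) for start, end, _ in entities}


# Hypothetical example: both spans are found, but one gets the wrong type.
gold = [(4, 5, "ORG"), (9, 11, "PER")]
predicted = [(4, 5, "LOC"), (9, 11, "PER")]

typed_f1 = f1(gold, predicted)                                  # segmentation + classification
segmentation_f1 = f1(strip_types(gold), strip_types(predicted))  # segmentation only
print(typed_f1, segmentation_f1)
```

Reporting both scores separates boundary errors from type errors: here the system segments perfectly but loses half its typed score to one misclassified entity.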
