Text analysis Natural Language Processing, or How to do cool stuff - PowerPoint PPT Presentation

Text analysis ≈ Natural Language Processing, or “ How to do cool stuff with words.” Emily Rae Sabo Data Camp | June 19, 2019

2 objectives for this session: ✓ What is NLP /T ext Analysis and why would I use it? ✓ What tools are out there for me to use?

What is NLP used for? Predicting Translating Finding patterns Measuring meaning language language in language in language

How to apply T ext Analysis Finding patterns in language Measuring meaning in language Vector space modeling with • • Change over time with Google Ngram word-embedded vectors like • Topic Modeling with Gensim, NLTK Word2Vec in Gensim or String matching and token extraction • GloVe in SpaCy with RegEx

Python’s basic elements & data structures Strings are the element class, or type, of focus for NLP (e.g. “cat”) Arrays, or vectors, are a list of elements. This is the data structure of focus for NLP (e.g. df = [‘apple’, ‘banana’])

4 TAKE-AWAYS 1. Google Ngram Viewer is a quick ‘n dirty tool for measuring word frequency change over time. 2. T opic modeling is a dimensionality reduction technique used to reveal “topics” in a document. 3. Regular Expressions (RegEx) is the syntax you use to do string matching, text cleaning, and token extraction. 4. Word-embedded vectors are decomposed matrices from a huge word matrix that tells you about word meaning.

How to measure changes in word frequency over time? • The founding tool of “ culturomics ” • Advantages vs. limitations? • Share one way you could imagine using this Google Ngram in your research. Viewer • Go and play! • https://books.google.com/ngrams • https://books.google.com/ngrams/info

• It’s a dimensionality reduction technique used to discover the hidden or abract What is T opic "topics“ that occur in a document or Modeling? collection of documents. • Techniques you may have heard of before: It is an unsupervised LSA (Latent Semantic Analysis) and LDA approach used for finding (Latent Dirichlet Allocation) and observing the bunch of words (called “topics”) in large clusters of texts.” Bansal (2016) Click here for a good starter on Topic Modeling in Python with NLTK and Gensim

• It’s essentially a highly specialized programming language embedded inside Python (through the re module) that you can use to search, match and extract text What are (A.K.A strings, tokens) Hi Emily, Regular It’s also a really nice way to do get into • The house code is 1468. I prefer Expressions, the nitty-gritty of improving your Python not to use Airbnb’s chat for literacy. communication, so please text me or RegEx? at xxx-xxx-xxxx. • Can you think of one way you might use this in your own research? \d{3}[-.] \d{3}[-.] \d{4} M(r|s|rs)\.?\s[A-Z]\w*

Examples: What are • Finding phone number patterns Regular \ d\ d\ d. \ d\ d\ d. \ d\ d\ d\ d Expressions, \ d\ d\ d[- .]\ d\ d\ d[-.]\ d\ d\ d\ d or RegEx? \ d{3}[- .] \ d{3}[ - .] \ d{4} • Literal s vs. meta ^ • What string pattern will this RegEx characters (e.g. ^s ) code match? • Wildcards s …. • Character sets [a-z] M(r|s|rs)\.?\s[A-Z]\w* • Character groups (a|z) • Quantifiers s*

2 options for you to explore RegEx: What are Work through a tutorial: • https://regexone.com/ https://www.tutorialspoint.com/python/python_ Regular reg_expressions.htm Expressions, Play in Jupyter, using your RegEx cheat • sheet handout as a guide. or RegEx? Start by creating your own mini-corpus (~20 words) and write RegEx code to match a string from your corpus. Pro-tip reminders: Be computational and creative in your approach. There are an infinite number of ways to accomplish a string matching task! Define your task clearly (functional level) then start coding.

Vector Space Modeling, Word-embedded vectors & Cosine Similarity

Quantifying word meaning

Now it’s your turn to drive. Start to finish. Your task : 1. Pick your package and word-embedded vectors – it’s between Gensim (Word2Vec) and SpaCy. 2. Write code to calculate the semantic similarity of two words (e.g. janky, ghetto ). “How similar in meaning?”

4 TAKE-AWAYS 1. Google Ngram Viewer is a quick ‘n dirty tool for measuring word frequency change over time. 2. T opic modeling is a dimensionality reduction technique used to reveal “topics” in a document. 3. Regular Expressions (RegEx) is the syntax you use to do string matching, text cleaning, and token extraction. 4. Word-embedded vectors are decomposed matrices from a huge word matrix that tells you about word meaning.

CHECK-IN: 1. So far, what is the most insightful thing you’ve learned during camp? 2. What is the one thing that’s still the muddiest for you?

Thank you! Come to a FREE Nerd Nite talk I’m doing about linguistics on Thursday, June 20th at LIVE, 7pm: The 13 Things You Need to Know about Language. Emily Rae Sabo @StandupLinguist

Text analysis Natural Language Processing, or How to do cool stuff - PowerPoint PPT Presentation

Text analysis Natural Language Processing, or How to do cool stuff with words. Emily Rae Sabo Data Camp | June 19, 2019 2 objectives for this session: What is NLP /T ext Analysis and why would I use it? What tools are out there

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

CONTENT TITLE Insert Subtitle Here Enter Text Here Enter Text Here Enter Text Here

Post-Conference Presentation Sunday Oladayo Oladejo Table of Content A Introduction B

Enhancing ICANN Text Accountability 26 June 2014 Text #ICANN50 Text #ICANN50 Text #ICANN50

Add Your Title Here Replace your text here! Replace your text here! Insert your title here 1

Text Text #ICANN51 15 October 2014 Text Text IDN Root Zone LGR Sarmad Hussain IDN Program

Text Text #ICANN51 Contractual Compliance Text Text Contractual Compliance Update

Text Text #ICANN50 Contractual Compliance Text Text GNSO Council Meeting Wednesday, Jun 25

God Rescues Daniel from the Lions Daniel 6 Here is some test text Here is some test text Here

5. Text CHAPTER HIGHLIGHTS Text tradition. Codes for computer text. C d f t t t

Stack Stack Heap Heap Data Data Text Text Program A Program B Stack Stack Text Heap

Business Proposal Infographic Style Your Text Here Your Text Here Your Text Here Your Text

How to Stay Faithful in Exile Daniel 1 Here is some test text Here is some test text Here is

Nehemiah Prays Nehemiah 1-2 Here is some test text Here is some test text Here is some test

Title of an article [16 pt] Introduction [14 pt] Text. Text. Text. Text. Text. Text. Text. Text.

50 th Anniversary Click here to add text. Click here to add text. July 2, 1964 July 2, 2014

Regression Albert Bifet May 2012 COMP423A/COMP523A Data Stream Mining Outline 1. Introduction

Shortcuts through Colocation Facilities Vasileios Kotronis 1 , George Nomikos 1 , Lefteris

From Variational to Deterministic Autoencoders or the joys of density estimation in latent spaces

Interference of Lexico-Syntactic Gender in Bilingual Spoken-Word Recognition: An Eye-Tracking

Dynamic Micro Targeting: Fitness- Based Approach to Predicting Individual Preferences Tianyi

Health Benefits Survey Release Slides October 3, 2018 Drew Altman President and CEO, KFF Gary

Business Perspective Moderator: Holly Emrick Svetz, Womble Carlyle Alison Brown, NAVSYS

A Tale of two Patients A Tale of two Patients 1) 71 yo with persistent AF since 2018. OSA