NLP!!! (Part 2) April 9, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter 1
Announcements • Viz Lab tomorrow afternoon (4pm? Check Piazza) • Project Grades/Pitches/Presentations 2
Today • More NLP! • Ngrams • Topic Models • Word Embeddings 3
N-Grams • N-length sequence of words (unigrams, bigrams, trigrams, 4-grams, …) • Provides some context (differentiating “cute dog” from “hot dog”) • Blows up size of vocabulary, increases sparsity • Usually vocab size cutoffs/min count thresholds apply to ngrams too 5
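A minimal sketch of applying a min count threshold to bigrams. The toy corpus and the threshold value are assumptions made up for illustration; they are not from the slides.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized (not from the slides).
corpus = [
    ["the", "hot", "dog"],
    ["a", "hot", "dog"],
    ["the", "cute", "dog"],
]

def bigrams(tokens):
    """Return adjacent word pairs, each joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Count every unigram and bigram type across the corpus.
unigram_counts = Counter(tok for doc in corpus for tok in doc)
bigram_counts = Counter(bg for doc in corpus for bg in bigrams(doc))

MIN_COUNT = 2  # assumed threshold, chosen only for this example

# Keep only the bigrams that occur at least MIN_COUNT times.
kept_bigrams = {bg: c for bg, c in bigram_counts.items() if c >= MIN_COUNT}

print(len(unigram_counts), len(bigram_counts))
print(kept_bigrams)
```

On a realistic corpus the number of distinct bigrams grows much faster than the number of distinct words, which is why a cutoff like this is usually applied.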
N-Grams
html does work . all webdev is awesome.
1gms: [‘html’, ‘does’, ‘work’, ‘.’, ‘all’, …]
2gms: [‘html does’, ‘does work’, ‘work .’, ‘. all’, …]
3gms: [‘html does work’, ‘does work .’, ‘work . all’, …]
skip-1gms: [‘html does’, ‘html work’, ‘does html’, ‘does work’, …]
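The lists on this slide can be reproduced with a short sketch. The tokenization is assumed (the final period is split off as its own token), and `skip_pairs` is a hypothetical helper that implements one reading of the skip-1gms example: each word is paired with every neighbor up to two positions away, in either direction.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_pairs(tokens, skip=1):
    """Pair each word with neighbors up to skip + 1 positions away.

    This is an assumed interpretation of the slide's skip-1gms list,
    which contains both 'html does' and 'does html'.
    """
    window = skip + 1
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append(f"{center} {tokens[j]}")
    return pairs

# Assumed tokenization of "html does work . all webdev is awesome."
tokens = ["html", "does", "work", ".", "all", "webdev", "is", "awesome", "."]

print(ngrams(tokens, 1)[:5])
print(ngrams(tokens, 2)[:4])
print(ngrams(tokens, 3)[:3])
print(skip_pairs(tokens, skip=1)[:4])
```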
Tagging • Parts of Speech: “fly” the noun or “fly” the verb? • Word Sense Disambiguation: “fly” as in “take an airplane” or “fly” as in “go fast”? • Named Entity Recognition: “Washington” the place or “Washington” the person? 8
Syntactic Relations: “Dependency Parsing” • Example: “today, despite the lockdown, i will get groceries” • Demo: https://explosion.ai/demos/displacy 9
Syntactic Relations: “Constituency Parsing” • Example: “all webdev is awesome.” • Demo: https://demo.allennlp.org/constituency-parsing 11
Today • More NLP! • Ngrams • Topic Models • Word Embeddings 13
Topic Models
Example posts:
• “When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html.”
• “Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…”
• “Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file”
Topics:
• instructions: stencil, instructions, part, step, rubric, handin…
• UI: html, javascript, debug, display, elements…
• systems: mac, windows, linux, chrome, firefox, os…
• fillers: I, you, when, the, and, a
Topic Models
Where do documents come from? “The generative story”
Topics:
• instructions: stencil, instructions, part, step, rubric, handin…
• UI: html, javascript, debug, display, elements…
• systems: mac, windows, linux, chrome, firefox, os…
• fillers: I, you, when, the, and, a
Repeat for each word in the document:
1. Sample a topic
2. Sample a word from that topic
Words sampled so far: You, javascript, handin
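The two-step generative story can be sketched as a small sampler. The topic word lists come from the slide; the uniform word probabilities within each topic, the `topic_weights` mixture, and the `generate_document` helper are assumptions introduced for illustration.

```python
import random

# Topic word lists from the slide. Treating every word in a topic as
# equally likely is an assumption made for this sketch.
topics = {
    "instructions": ["stencil", "instructions", "part", "step", "rubric", "handin"],
    "UI": ["html", "javascript", "debug", "display", "elements"],
    "systems": ["mac", "windows", "linux", "chrome", "firefox", "os"],
    "fillers": ["I", "you", "when", "the", "and", "a"],
}

# Hypothetical topic mixture for a single document.
topic_weights = {"instructions": 0.3, "UI": 0.3, "systems": 0.1, "fillers": 0.3}

def generate_document(n_words, rng):
    """Generate n_words by repeating: sample a topic, then sample a word."""
    names = list(topic_weights)
    weights = [topic_weights[name] for name in names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=weights, k=1)[0]  # 1. sample a topic
        words.append(rng.choice(topics[topic]))              # 2. sample a word from that topic
    return words

rng = random.Random(0)  # fixed seed so repeated runs give the same document
doc = generate_document(3, rng)
print(doc)
```

The output is a bag of words with no grammar, which matches the model's view of a document as draws from topics rather than as a sentence.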
Topic Models
• The topic is a “latent” variable (not observed)
• Words are determined by topic (and are conditionally independent of each other)
• Documents are a distribution over topics
• Set parameters to maximize probability of observations
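One common way to write these assumptions as a formula, offered as a sketch rather than the notation used in lecture. The symbols are introduced here for illustration: $\theta_{d,z}$ is the probability of topic $z$ in document $d$, $\phi_{z,w}$ is the probability of word $w$ under topic $z$, $K$ is the number of topics, and $N_d$ is the number of words in document $d$.

```latex
P(w_1, \dots, w_{N_d} \mid d) = \prod_{i=1}^{N_d} \sum_{z=1}^{K} \theta_{d,z} \, \phi_{z,w_i}
\qquad
(\hat{\theta}, \hat{\phi}) = \arg\max_{\theta, \phi} \prod_{d} P(w_1, \dots, w_{N_d} \mid d)
```

The product over words reflects conditional independence, the sum over $z$ reflects that the topic is not observed, and the arg max is the "maximize probability of observations" step.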
Topic Models
Example document: “part 2 html does not work”
[Figure: bar chart of the document's distribution over Topic1, Topic2, Topic3, Topic4, with bar charts of per-topic word weights for html, javascript, work, handin, part, stencil]
Clicker Question!
Which is the best parameter setting for the observed data?
Observed document: “part <NUM> html does not work”
[Figure: two candidate settings, (a) and (b), each showing the document's distribution over Topic 1 and Topic 2, next to per-topic word probabilities for part, <NUM>, html, does, not, work]
(a) Topic 1 = 50, Topic 2 = 50:
(0.3+0.2+0+0.1+0.1+0.2)×0.5 + (0+0.3+0.4+0.1+0.2)×0.5 = 0.45 + 0.5 = 0.95
(b) Topic 1 = 33, Topic 2 = 67:
(0.3+0.2+0+0.1+0.1+0.2)×0.33 + (0+0.3+0.4+0.1+0.2)×0.67 = 0.297 + 0.67 = 0.967
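A short sketch that reproduces the arithmetic printed on this slide for settings (a) and (b): each topic's word probabilities are summed and weighted by that topic's share of the document. The term lists are copied as printed (the Topic 2 list has five terms on the slide); `slide_score` is a hypothetical helper name.

```python
# Word probabilities as printed on the slide.
topic1_terms = [0.3, 0.2, 0, 0.1, 0.1, 0.2]
topic2_terms = [0, 0.3, 0.4, 0.1, 0.2]  # five terms, as printed on the slide

def slide_score(topic1_weight, topic2_weight):
    """Weighted sum of each topic's word probabilities, as on the slide."""
    return sum(topic1_terms) * topic1_weight + sum(topic2_terms) * topic2_weight

score_a = slide_score(0.5, 0.5)    # setting (a): 50 / 50
score_b = slide_score(0.33, 0.67)  # setting (b): 33 / 67

best = "a" if score_a > score_b else "b"
print(round(score_a, 3), round(score_b, 3), best)
```

The rounded values should match the 0.95 and 0.967 on the slide, so (b) scores higher under this calculation.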
Topic Models
LDA: Latent Dirichlet Allocation (latent = not directly observed; Dirichlet = prior follows a Dirichlet distribution)
• Generative Model
• Set parameters using EM or MCMC
LSA: Latent Semantic Analysis
• Discriminative Model
• Set parameters by factorizing the term-document matrix
Topic Models
Term-document matrix:
       the   US   UK   congress   parliament
doc1     1    1    1          1            0
doc2     1    0    1          0            1
doc3     1    1    0          1            0
doc4     1    0    1          0            1

Factorization into U, D, V:
U:
d1   -0.60  -0.39   0.70   0.00
d2   -0.48   0.50  -0.12  -0.71
d3   -0.43  -0.58  -0.69   0.00
d4   -0.48   0.50  -0.12   0.71
D:
      3.06   0.00   0.00   0.00   0.00
      0.00   1.81   0.00   0.00   0.00
      0.00   0.00   0.57   0.00   0.00
      0.00   0.00   0.00   0.00   0.00
V (columns: the, US, UK, congress, parliament):
     -0.65  -0.34  -0.51  -0.34  -0.31
      0.02  -0.54   0.34  -0.54   0.56
     -0.42   0.02   0.79   0.02  -0.44
     -0.63   0.27   0.00   0.37   0.63
     -0.04   0.73   0.00  -0.68   0.04
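A minimal sketch of this factorization using NumPy's singular value decomposition, assuming the columns of the term-document matrix on this slide read as the, US, UK, congress, parliament. The signs of individual singular vectors may differ from the slide, since an SVD is only unique up to sign, and keeping `k = 2` dimensions is an arbitrary choice for illustration.

```python
import numpy as np

terms = ["the", "US", "UK", "congress", "parliament"]

# Rows are doc1..doc4; a 1 means the term appears in the document.
X = np.array([
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
], dtype=float)

# Full SVD: U is 4x4, s holds the 4 singular values, Vt is 5x5.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Place the singular values on the diagonal of a 4x5 matrix D.
D = np.zeros(X.shape)
D[:len(s), :len(s)] = np.diag(s)

# Multiplying the three factors back together recovers X.
X_rebuilt = U @ D @ Vt

# Keep only the top k latent dimensions as document vectors.
k = 2  # arbitrary choice for this example
doc_vectors = U[:, :k] * s[:k]

print(np.round(s, 2))
print(np.round(doc_vectors, 2))
```

Because doc2 and doc4 contain the same terms, they end up with identical document vectors.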