NLP!!! April 7, 2020
Data Science CSCI 1951A, Brown University
Instructor: Ellie Pavlick
HTAs: Josh Levin, Diane Mutako, Sol Zitter

Announcements
• S/NC Option
• Special Topics
• Questions/Concerns?

Today 1990s


  1. Clicker Question!
     Query: “html does not work”
     doc 1: “When I try to display dots from part 2 the elements do not appear in the html.”
     doc 2: “Changes I make do not affect any of the html in after I load the nations html file”
     Which document is more relevant to the query, according to Jaccard?
     a) The first one b) The second one c) Yes
     doc 1: 2/(4 + 17) = 0.095
     doc 2: 2/(4 + 18) = 0.091
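The arithmetic in this clicker question can be checked with a short sketch. It assumes a naive whitespace tokenizer that lowercases and strips surrounding punctuation; the function names `tokens`, `slide_ratio`, and `jaccard` are illustrative, not from the course materials. `slide_ratio` mirrors the slide's numbers (shared words over the summed token counts), while `jaccard` is the usual set-based form, which divides by the size of the union and gives 2/18 and 2/17 on this example.

```python
import string

def tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in stripped if w]

def slide_ratio(query, doc):
    """Shared word types over summed token counts, as in the clicker arithmetic."""
    q, d = tokens(query), tokens(doc)
    return len(set(q) & set(d)) / (len(q) + len(d))

def jaccard(query, doc):
    """Set-based Jaccard: |intersection| / |union| of word types."""
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q | d) if q | d else 0.0

query = "html does not work"
doc1 = "When I try to display dots from part 2 the elements do not appear in the html."
doc2 = "Changes I make do not affect any of the html in after I load the nations html file"

for name, doc in [("doc 1", doc1), ("doc 2", doc2)]:
    print(name, round(slide_ratio(query, doc), 3), round(jaccard(query, doc), 3))
```

Because the two variants normalize differently, they need not rank the documents the same way, which is worth keeping in mind when comparing scores across tools.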

  3. Similarity Metrics
     • Edit Distance: minimal number of edits (inserts, deletes, substitutions) needed to transform string 1 into string 2.
     • Jaccard Similarity: words in common / total words
     • Cosine Similarity: by far the most popular metric
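Edit distance as defined above can be computed with dynamic programming. The following is a minimal sketch of the standard Levenshtein recurrence with unit cost for each insert, delete, and substitution; the function name and the example strings are illustrative assumptions.

```python
def edit_distance(s, t):
    """Minimum number of inserts, deletes, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletes
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.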

  4. Cosine Similarity
     [Figure: two documents drawn as vectors on axes counting “html” and “do”; θ is the angle between the two vectors.]
     • “When I try to display dots from part 2 …the elements do not appear in the html.”
     • “Changes I make do not affect any of the html in after I load the nations html file”
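For reference, the angle θ in the figure corresponds to the usual cosine formula. The symbols q and d (word-count vectors for the two texts) are notation assumed here, not taken from the slides.

```latex
\cos\theta
  = \frac{\mathbf{q}\cdot\mathbf{d}}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert}
  = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}
```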

  7. Clicker Question!
              html  does  not  work  at  all  webdev  is  awesome
      query      1     1    1     1   1    1       0   0        0
      doc 1      1     1    0     0   0    1       1   1        1
      doc 2      1     1    0     1   0    0       1   0        0
     Which document is more relevant to the query, according to cosine?
     a) doc1 b) doc2 c) Yes
     doc 1: 3/(√6 √6) = 0.5
     doc 2: 3/(√6 √4) = 0.6
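The two cosine scores on this slide can be reproduced with a few lines of plain Python. This is a sketch using the binary vectors exactly as they appear in the table; the function name `cosine` is an assumption for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = [1, 1, 1, 1, 1, 1, 0, 0, 0]
doc1 = [1, 1, 0, 0, 0, 1, 1, 1, 1]
doc2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]

print(round(cosine(query, doc1), 3))  # 3 / (sqrt(6) * sqrt(6))
print(round(cosine(query, doc2), 3))  # 3 / (sqrt(6) * sqrt(4))
```

Both documents share three words with the query, so the difference comes entirely from the length normalization: the shorter doc 2 scores higher.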

  12. Linguistic Preprocessing
      Language is ambiguous but also redundant
      • They freaked out when they found the bug in their apartment.
      • They freaked out when they found the bug in their apartment. They’ve always been terrified of anything crawly.
      • They freaked out when they found the bug in their apartment. They ran back to the CIT right away to tell everyone they’d finally figured it out.
      • They freaked out when they found the problem in their apartment. They ran back to the CIT right away to tell everyone they’d finally figured it out.

  19. Linguistic Preprocessing: Constant Tradeoff
      • Collapse! Try to treat more words as though they are the same (normalization, stemming)
      • Differentiate! Try to preserve as much differences/nuance as possible (tagging, collocations)

  23. Linguistic Preprocessing
      • Tokenization (Phrasal Collocations/Morphological Analysis?)
      • Punctuation: “okay…” vs. “okay!”
      • Normalization: “Trump” vs. “trump”
      • Stop words: “pb and jelly” vs. “pb or jelly”
      • Tagging: “fish fish fish fish fish”
      • Remove out-of-vocabulary (OOV)

      Running example, one step per slide:
      Original: I am trying to display dots from Part 2 on my mac (tried Chrome, Firefox, and Safari), but nothing is displayed (and the elements do not appear in the html).
      Tokenization: I am trying to display dots from Part 2 on my mac ( tried Chrome , Firefox , and Safari ) , but nothing is displayed ( and the elements do not appear in the html ) .
      Tokenization example without spaces: 日文章魚怎麼說? (“How to say octopus in Japanese?”)
         日文 章魚 怎麼 說 ? = Japanese octopus how say ?
      Punctuation: I am trying to display dots from Part 2 on my mac tried Chrome Firefox and Safari but nothing is displayed and the elements do not appear in the html
      Normalization (case): i am trying to display dots from part 2 on my mac tried chrome firefox and safari but nothing is displayed and the elements do not appear in the html
      Morphological analysis: i be try to display dot from part 2 on my mac try chrome firefox and safari but nothing be display and the element do not appear in the html
      Normalization (numbers): i be try to display dot from part <NUM> on my mac try chrome firefox and safari but nothing be display and the element do not appear in the html
      Stop words: try display dot part <NUM> mac try chrome firefox safari nothing display element not appear html
      Tagging: try_VB display_VB dot_NN part_NN <NUM>_NUM mac_NNP try_VB chrome_NNP firefox_NNP safari_NNP nothing_DT display_VB element_NNP not_RB appear_VB html_NN
      OOV: try_VB display_VB dot_NN part_NN <NUM>_NUM mac_NNP try_VB chrome_NNP <OOV> <OOV> nothing_DT display_VB element_NNP not_RB appear_VB html_NN
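Most of the steps in the running example can be sketched in plain Python. The sketch below covers tokenization, punctuation removal, lowercasing, number replacement, stop-word removal, and OOV replacement; it deliberately skips morphological analysis and tagging, which in practice need a library. `STOP_WORDS` and `VOCAB` are small hypothetical sets chosen only so the example is self-contained.

```python
import re

# Hypothetical toy resources, just large enough for the example sentence.
STOP_WORDS = {"i", "am", "to", "from", "on", "my", "and", "but", "is", "the", "do", "in"}
VOCAB = {"trying", "display", "dots", "part", "<NUM>", "mac", "tried", "chrome",
         "nothing", "displayed", "elements", "not", "appear", "html"}

def preprocess(text, stop_words=STOP_WORDS, vocab=VOCAB):
    tokens = re.findall(r"\w+|[^\w\s]", text)                  # tokenization
    tokens = [t for t in tokens if t[0].isalnum()]             # drop punctuation tokens
    tokens = [t.lower() for t in tokens]                       # normalization: case
    tokens = ["<NUM>" if t.isdigit() else t for t in tokens]   # normalization: numbers
    tokens = [t for t in tokens if t not in stop_words]        # stop words
    return [t if t in vocab else "<OOV>" for t in tokens]      # out-of-vocabulary

sentence = ("I am trying to display dots from Part 2 on my mac (tried Chrome, Firefox, "
            "and Safari), but nothing is displayed (and the elements do not appear in the html).")
print(" ".join(preprocess(sentence)))
```

Because lemmatization is skipped, forms such as "trying" and "tried" stay distinct here, which is one concrete instance of the collapse-versus-differentiate tradeoff.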

  37. Choosing a vocabulary (what goes on the columns)
      • Remove frequent words? (“stop words”)
      • Remove rare words? (unlikely to appear in test)
      • Remove uninteresting words? (tf-idf? pmi?)
      • Try to add a little syntax? (POS tags? ngrams? pmi?)

  39. Zipf’s Law
      [Figure: word frequency (y-axis) plotted against word rank (x-axis). Source: https://en.wikipedia.org/wiki/Zipf%27s_law]
      • The most frequent 0.2% of words make up 50% of occurrences.
      • “stop words”: a, the, of, and, … (or use nltk.corpus.stopwords…)

  44. Zipf’s Law
      [Figure: word frequency (y-axis) plotted against word rank (x-axis).]
      • Usually set some vocab size (around 30K) or some min count (around 3)
      • Seems arbitrary? That’s ’cause it is.
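Both cutoffs mentioned here are easy to apply with `collections.Counter`. The sketch below assumes a pre-tokenized corpus; `build_vocab`, `coverage`, and the toy token list are illustrative names and data, and the defaults simply echo the slide's rough numbers.

```python
from collections import Counter

def build_vocab(tokens, max_size=30_000, min_count=3):
    """Keep at most max_size of the most frequent words that occur at least min_count times."""
    counts = Counter(tokens)
    return [w for w, c in counts.most_common(max_size) if c >= min_count]

def coverage(tokens, k):
    """Fraction of all token occurrences accounted for by the k most frequent words."""
    counts = Counter(tokens)
    return sum(c for _, c in counts.most_common(k)) / len(tokens)

# Hypothetical toy corpus.
tokens = "the html does not work the html the page does not load the end".split()
print(build_vocab(tokens, min_count=3))
print(build_vocab(tokens, max_size=10, min_count=2))
print(round(coverage(tokens, 1), 3))
```

On a real corpus, plotting `coverage` against `k` is one way to see the Zipfian head and choose a cutoff with open eyes rather than by habit.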

  47. Tf-Idf
      • Term-Frequency Inverse-Document-Frequency
      • Assigns higher weights to words that differentiate this document from other documents
      • tf-idf(word, doc) = (# times word appears in doc) / (# of times word appears across all documents)
      • Can filter out low tf-idf words or else just reweight the term-document matrix accordingly

  48. Clicker Question!
              html  does  not  work  at  all  webdev  is  awesome
      doc 1      1     1    1     1   1    1       0   0        0
      doc 2      1     1    0     0   0    1       1   1        1
      doc 3      1     1    0     1   0    0       1   0        0
      df         3     3    1     2   1    2       2   1        1
      What is the tf-idf vector for doc 1?
      a) 1/3 1/3 1 1/3 0 1/2 1 0 1
      b) 1/2 1/3 1 1/3 1 1/2 0 1/2 1
      c) 1/3 1/3 1 1/2 1 1/2 0 0 0
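The clicker question can be worked through in code. This sketch follows the slide's simplified ratio (term count divided by document frequency) rather than the log-scaled idf that many libraries use. The column order in `TERMS` is an assumption taken from the order of the df list on the slide, and the names `TERMS`, `COUNTS`, `document_frequency`, and `tf_idf` are illustrative.

```python
from fractions import Fraction

# Assumed column order: the order in which the slide lists the df values.
TERMS = ["html", "does", "not", "work", "at", "all", "webdev", "is", "awesome"]
COUNTS = {
    "doc1": [1, 1, 1, 1, 1, 1, 0, 0, 0],
    "doc2": [1, 1, 0, 0, 0, 1, 1, 1, 1],
    "doc3": [1, 1, 0, 1, 0, 0, 1, 0, 0],
}

def document_frequency(counts):
    """Number of documents in which each term appears at least once."""
    return [sum(1 for row in counts.values() if row[j] > 0) for j in range(len(TERMS))]

def tf_idf(doc, counts):
    """Slide-style weighting: term count in the document divided by document frequency."""
    df = document_frequency(counts)
    return [Fraction(tf, d) if d else Fraction(0) for tf, d in zip(counts[doc], df)]

print(dict(zip(TERMS, document_frequency(COUNTS))))
print([str(w) for w in tf_idf("doc1", COUNTS)])
```

Under these assumptions the result for doc 1 matches option c: words that appear in every document are down-weighted to 1/3, while words unique to doc 1 keep weight 1.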

  53. PMI
      • Pointwise Mutual Information
      • Again: assigns higher weights to words that differentiate this document from other documents
      • PMI(word, doc) = log( P(word|doc) / P(word) )
      • Used more for finding word-label relationships or word-word collocations (more info in two seconds)
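The PMI formula above can be estimated from raw counts. The sketch below uses relative frequencies for both probabilities and the natural logarithm; the three toy documents and the function name `pmi` are hypothetical, and no smoothing is applied, so a word absent from the document gets negative infinity.

```python
import math

def pmi(word, doc_tokens, all_docs):
    """log( P(word | doc) / P(word) ), estimated from token counts."""
    p_word_given_doc = doc_tokens.count(word) / len(doc_tokens)
    total_tokens = sum(len(d) for d in all_docs)
    p_word = sum(d.count(word) for d in all_docs) / total_tokens
    if p_word_given_doc == 0:
        return float("-inf")  # word never occurs in this document
    return math.log(p_word_given_doc / p_word)

# Hypothetical toy documents.
docs = [
    "html does not work at all".split(),
    "all webdev is awesome".split(),
    "webdev html does work".split(),
]
for word in ["not", "html", "awesome"]:
    print(word, pmi(word, docs[0], docs))
```

A word concentrated in one document ("not") scores higher than one spread across the corpus ("html"), which is the differentiating behavior the slide describes.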

  55. N-Grams
      • N-length sequence of words (unigrams, bigrams, trigrams, 4-grams, …)
      • Provides some context (differentiating “cute dog” from “hot dog”)
      • Blows up size of vocabulary, increases sparsity

  56. N-Grams
      html does work . all webdev is awesome.
      1gms: [‘html’, ‘does’, ‘work’, ‘.’, ‘all’, …]
      2gms: [‘html does’, ‘does work’, ‘work .’, ‘. all’, …]
      3gms: [‘html does work’, ‘does work .’, ‘work . all’, …]
      skip-gms: [‘html does’, ‘html work’, ‘does html’, ‘does work’, ‘does .’, …]
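The lists on this slide can be generated with two small helpers. This is a sketch: `ngrams` is the usual sliding window, while `skip_bigrams` pairs each word with every other word within two positions on either side, a window size assumed here because it reproduces the slide's listing (definitions of skip-grams vary).

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, window=2):
    """Pair each word with every other word at most `window` positions away."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend(f"{word} {tokens[j]}" for j in range(lo, hi) if j != i)
    return pairs

tokens = "html does work . all webdev is awesome .".split()
print(ngrams(tokens, 2)[:4])
print(ngrams(tokens, 3)[:3])
print(skip_bigrams(tokens)[:5])
```

Even on this nine-token sentence the skip-bigram list is several times longer than the unigram list, which illustrates the vocabulary blow-up and sparsity noted on the previous slide.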

  57. Collocations
      • Try to find just the interesting phrases (e.g. hot dog) by finding words that occur together above chance
      • Often use PMI for this
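One way to score candidate collocations with PMI is to compare how often two adjacent words occur together against what their individual frequencies would predict. The sketch below assumes the word-word form PMI(x, y) = log( P(x, y) / (P(x) P(y)) ) with probabilities estimated from counts; the toy corpus and the name `bigram_pmi` are hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """PMI for each adjacent word pair that occurs at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_xy = count / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Hypothetical toy corpus.
tokens = "the hot dog and the cute dog ate the hot dog near the dog".split()
scores = bigram_pmi(tokens, min_count=1)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 3))
```

PMI tends to favor rare pairs, so a minimum count like `min_count` is a common guard; with it set to 1 here, one-off pairs can outrank "hot dog" even though "hot dog" still scores well above "the dog".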

  59. Topic Models
      • “When I try to display dots from part 2 on my mac (tried chrome, firefox, and safari), the elements do not appear in the html.”
      • “Can you elaborate on exactly what the directions are in part 2 step 3, the stencil code does not quite imply what we are supposed to do…”
      • “Changes I make to the nations.js file do not affect any of the html in after I load the nations.html file”

      instructions: stencil, instructions, part, step, rubric, handin…
      UI: html, javascript, debug, display, elements…
      systems: mac, windows, linux, chrome, firefox, os…
      fillers: I, you, when, the, and, a

  61. Topic Models
      Where do documents come from? “The generative story”
      instructions: stencil, instructions, part, step, rubric, handin…
      UI: html, javascript, debug, display, elements…
      systems: mac, windows, linux, chrome, firefox, os…
      fillers: I, you, when, the, and, a
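A generative story of this kind is often told as: pick a mixture of topics for the document, then for each word pick a topic from that mixture and a word from that topic. The sketch below assumes that simplified story (uniform word choice within a topic, a fixed seed for repeatability); the topic weights and the name `generate_document` are hypothetical, and the word lists are the ones shown on the slide.

```python
import random

TOPICS = {
    "instructions": ["stencil", "instructions", "part", "step", "rubric", "handin"],
    "UI": ["html", "javascript", "debug", "display", "elements"],
    "systems": ["mac", "windows", "linux", "chrome", "firefox", "os"],
    "fillers": ["I", "you", "when", "the", "and", "a"],
}

def generate_document(topic_weights, length, rng):
    """For each word slot: sample a topic by weight, then a word from that topic."""
    names = list(topic_weights)
    weights = [topic_weights[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights, k=1)[0]
        words.append(rng.choice(TOPICS[topic]))
    return words

rng = random.Random(0)  # seeded so repeated runs give the same document
doc = generate_document({"UI": 0.5, "systems": 0.3, "fillers": 0.2}, 12, rng)
print(" ".join(doc))
```

Fitting a topic model runs this story in reverse: given only the documents, it tries to infer topic word lists and per-document mixtures that would make the observed words likely.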
