

  1. Natural Language Processing Basics Yingyu Liang University of Wisconsin-Madison

  2. Natural Language Processing (NLP) • The processing of human languages by computers • One of the oldest AI tasks • One of the most important AI tasks • One of the hottest AI tasks nowadays

  3. Difficulty • Difficulty 1: language is ambiguous, with typically no formal description • Example: "We saw her duck." How many different meanings?

  4. Difficulty • Difficulty 1: language is ambiguous, with typically no formal description • Example: "We saw her duck." • 1. We looked at a duck that belonged to her. • 2. We watched her quickly squat down to avoid something. • 3. We used a saw to cut her duck.

  5. Difficulty • Difficulty 2: computers do not have human concepts • Example: "She likes little animals. For example, yesterday we saw her duck." • 1. We looked at a duck that belonged to her. • 2. We watched her quickly squat down to avoid something. • 3. We used a saw to cut her duck.

  6. Words • Preprocessing • Zipf's Law

  7. Preprocessing • Corpus: often a set of text documents • Tokenization or text normalization: turn the corpus into sequence(s) of tokens 1. Remove unwanted stuff: HTML tags, encoding tags 2. Determine word boundaries: usually whitespace and punctuation • Sometimes this can be tricky, e.g., Ph.D.

  8. Preprocessing • Tokenization or text normalization: turn the data into sequence(s) of tokens 1. Remove unwanted stuff: HTML tags, encoding tags 2. Determine word boundaries: usually whitespace and punctuation • Sometimes this can be tricky, e.g., Ph.D. 3. Remove stopwords: the, of, a, with, …

  9. Preprocessing • Tokenization or text normalization: turn the data into sequence(s) of tokens 1. Remove unwanted stuff: HTML tags, encoding tags 2. Determine word boundaries: usually whitespace and punctuation • Sometimes this can be tricky, e.g., Ph.D. 3. Remove stopwords: the, of, a, with, … 4. Case folding: lower-case all characters • Sometimes this can be tricky, e.g., US vs. us 5. Stemming/lemmatization (optional): looks, looked, looking → look
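
A minimal Python sketch of the preprocessing steps above, using only the standard library; the tiny stopword list, the regular expression, and the crude punctuation handling are illustrative assumptions, not the slides' prescription:

```python
import re

STOPWORDS = {"the", "of", "a", "with", "is", "an"}  # tiny illustrative list

def tokenize(raw_text, remove_stopwords=True, lowercase=True):
    """Turn a raw document into a list of tokens (steps 1-4 above)."""
    # 1. Remove unwanted stuff such as HTML tags.
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # 2. Determine word boundaries: split on whitespace, strip punctuation
    #    (a crude rule; cases like "Ph.D." would need special handling).
    tokens = [tok.strip(".,;:!?'\"()") for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    # 4. Case folding (done before stopword removal so the list matches).
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    # 3. Remove stopwords.
    if remove_stopwords:
        tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return tokens

print(tokenize("<p>We saw her duck, with a Ph.D. student.</p>"))
```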

  10. Vocabulary • Given the preprocessed text: • Word token: an occurrence of a word • Word type: a unique word as a dictionary entry (i.e., a unique token) • Vocabulary: the set of word types • Often 10k to 1 million word types on different corpora • Words that are too rare are often removed
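
A short sketch of counting word tokens vs. word types and trimming rare words; the min_count threshold is an illustrative assumption:

```python
from collections import Counter

def build_vocabulary(tokens, min_count=2):
    """Count word tokens, collect word types, and drop very rare words."""
    counts = Counter(tokens)              # word type -> number of tokens
    vocab = {w for w, c in counts.items() if c >= min_count}
    return counts, vocab

tokens = "this is a good sentence this is another good sentence".split()
counts, vocab = build_vocabulary(tokens, min_count=2)
print(len(tokens), "word tokens,", len(counts), "word types")
print("vocabulary after removing rare words:", sorted(vocab))
```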

  11. Zipf's Law • Word count f, word rank r • Zipf's law: f × r ≈ constant • [Figure: Zipf's law on the Tom Sawyer corpus]
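
A quick sketch for checking Zipf's law empirically: rank word types by count and inspect f × r. The toy sentence here is only for demonstration; the roughly constant product shows up on a real corpus such as Tom Sawyer.

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Rank word types by count and report count * rank for each."""
    counts = Counter(tokens)
    ranked = counts.most_common()       # sorted by count f, rank r = 1, 2, ...
    for r, (word, f) in enumerate(ranked[:top], start=1):
        print(f"rank {r:2d}  word {word:>10s}  count {f:3d}  f*r = {f * r}")

zipf_table("the dog ran after the other dog and the cat".split())
```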

  12. Text: Bag-of-Words Representation • Bag-of-Words • tf-idf

  13. Bag-of-Words • How to represent a piece of text (sentence/document) as numbers? • Let m denote the size of the vocabulary • Given a document d, let c(w, d) denote the number of occurrences of word w in d • Bag-of-Words representation of the document: v_d = [c(w_1, d), c(w_2, d), …, c(w_m, d)] / Z_d • Often the normalizer is Z_d = Σ_w c(w, d), the total number of tokens in d
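
A sketch of the Bag-of-Words vector under the notation above, assuming the normalizer Z_d is the number of in-vocabulary tokens of d:

```python
from collections import Counter

def bag_of_words(doc_tokens, vocabulary):
    """Return the normalized Bag-of-Words vector v_d for one document d."""
    counts = Counter(doc_tokens)                       # c(w, d)
    z = sum(counts[w] for w in vocabulary)             # Z_d: in-vocabulary tokens
    return [counts[w] / z if z else 0.0 for w in vocabulary]

vocabulary = ["a", "another", "good", "is", "sentence", "this"]
doc = "this is a good sentence this is another good sentence".split()
print(bag_of_words(doc, vocabulary))
```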

  14. Example • Preprocessed text: this is a good sentence this is another good sentence • BoW representation: [c('a', d)/Z_d, c('is', d)/Z_d, …, c('example', d)/Z_d] • What is Z_d? • What is c('a', d)/Z_d? • What is c('example', d)/Z_d?

  15. tf-idf • tf: normalized term frequency tf(w) = c(w, d) / max_v c(v, d) • idf: inverse document frequency idf(w) = log( total #documents / #documents containing w ) • tf-idf: tf-idf(w) = tf(w) × idf(w) • Representation of the document: v_d = [tf-idf(w_1), tf-idf(w_2), …, tf-idf(w_m)]
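
A sketch that follows these definitions literally (tf normalized by the most frequent term in the document, idf over the whole document collection); the toy corpus and vocabulary are illustrative assumptions:

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus, vocabulary):
    """tf-idf vector for one document, following the definitions above."""
    counts = Counter(doc_tokens)
    max_count = max(counts.values())                  # max_v c(v, d)
    n_docs = len(corpus)
    vec = []
    for w in vocabulary:
        tf = counts[w] / max_count
        df = sum(1 for d in corpus if w in d)         # #documents containing w
        idf = math.log(n_docs / df) if df else 0.0    # unseen words get 0
        vec.append(tf * idf)
    return vec

corpus = [["the", "dog", "ran", "away"],
          ["the", "cat", "sat"],
          ["a", "dog", "and", "a", "cat"]]
vocab = ["dog", "cat", "ran", "the"]
print(tf_idf(corpus[0], corpus, vocab))
```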

  16. Cosine Similarity • How to measure similarities between pieces of text? • Given the document vectors, can use any similarity notion on vectors • Commonly used in NLP: cosine of the angle between the two vectors: sim(x, y) = xᵀy / ( √(xᵀx) · √(yᵀy) )
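
A direct sketch of the cosine similarity formula above for plain Python lists:

```python
import math

def cosine_similarity(x, y):
    """sim(x, y) = x^T y / (sqrt(x^T x) * sqrt(y^T y))."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                      # convention for all-zero vectors
    return dot / (norm_x * norm_y)

print(cosine_similarity([0.1, 0.2, 0.0], [0.2, 0.1, 0.3]))
```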

  17. Text: Statistical Language Model • Statistical language model • n-gram • Smoothing

  18. Probabilistic view • Use a probability distribution to model the language • Dates back to Shannon (information theory; bits in the message)

  19. Statistical language model • Language model: probability distribution over sequences of tokens • Typically, tokens are words, and the distribution is discrete • Tokens can also be characters or even bytes • Sentence: "the quick brown fox jumps over the lazy dog" • Tokens: x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9

  20. Statistical language model • For simplification, consider a fixed-length sequence of tokens (a sentence): (x_1, x_2, x_3, …, x_{τ-1}, x_τ) • Probabilistic model: P[x_1, x_2, x_3, …, x_{τ-1}, x_τ]

  21. Unigram model • Unigram model: define the probability of the sequence as the product of the probabilities of the tokens in the sequence: P[x_1, x_2, …, x_τ] = ∏_{t=1}^{τ} P[x_t] • Independence!

  22. A simple unigram example • Sentence: "the dog ran away" • P̂[the dog ran away] = P̂[the] P̂[dog] P̂[ran] P̂[away] • How to estimate P̂[the] on the training corpus?

  23. A simple unigram example • Sentence: "the dog ran away" • P̂[the dog ran away] = P̂[the] P̂[dog] P̂[ran] P̂[away] • How to estimate P̂[the] on the training corpus? • Estimate by frequency: P̂[the] = count(the) / total number of tokens in the corpus
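
A sketch of the frequency estimate and of scoring a sentence under the unigram model; the toy training corpus is an illustrative assumption:

```python
from collections import Counter

def train_unigram(corpus_tokens):
    """Frequency estimates: P[w] = count(w) / total number of tokens."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return {w: c / total for w, c in counts.items()}

def unigram_prob(sentence_tokens, probs):
    """P[x_1, ..., x_tau] = product of P[x_t] (0 if a token was never seen)."""
    p = 1.0
    for tok in sentence_tokens:
        p *= probs.get(tok, 0.0)
    return p

corpus = "the dog ran away and the dog came back".split()
probs = train_unigram(corpus)
print(probs["the"])                                   # estimated P[the]
print(unigram_prob("the dog ran away".split(), probs))
```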

  24. n-gram model • n-gram: a sequence of n tokens • n-gram model: define the conditional probability of the n-th token given the preceding n-1 tokens: P[x_1, x_2, …, x_τ] = P[x_1, …, x_{n-1}] ∏_{t=n}^{τ} P[x_t | x_{t-n+1}, …, x_{t-1}]

  25. n-gram model • n-gram: a sequence of n tokens • n-gram model: define the conditional probability of the n-th token given the preceding n-1 tokens: P[x_1, x_2, …, x_τ] = P[x_1, …, x_{n-1}] ∏_{t=n}^{τ} P[x_t | x_{t-n+1}, …, x_{t-1}] • Markovian assumptions

  26. Typical n-gram models • n = 1: unigram • n = 2: bigram • n = 3: trigram

  27. Training an n-gram model • Straightforward counting: count the co-occurrences of the grams • For all grams (x_{t-n+1}, …, x_{t-1}, x_t): 1. count and estimate P̂[x_{t-n+1}, …, x_{t-1}, x_t] 2. count and estimate P̂[x_{t-n+1}, …, x_{t-1}] 3. compute P̂[x_t | x_{t-n+1}, …, x_{t-1}] = P̂[x_{t-n+1}, …, x_{t-1}, x_t] / P̂[x_{t-n+1}, …, x_{t-1}]
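
A counting sketch for an n-gram (here trigram) model; it divides raw counts directly, which matches the three steps above up to the nearly identical normalizers of the n-gram and (n-1)-gram estimates. The toy corpus is illustrative:

```python
from collections import Counter

def train_ngram(tokens, n=3):
    """Count n-grams and (n-1)-gram contexts from a token sequence."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, contexts

def cond_prob(word, context, ngrams, contexts):
    """Estimate P[x_t | preceding n-1 tokens] = count(context + word) / count(context)."""
    c_context = contexts[tuple(context)]
    if c_context == 0:
        return 0.0                       # sparsity: unseen context (see the next slides)
    return ngrams[tuple(context) + (word,)] / c_context

tokens = "the dog ran away and the dog came back".split()
ngrams, contexts = train_ngram(tokens, n=3)
print(cond_prob("ran", ["the", "dog"], ngrams, contexts))   # estimated P[ran | the dog]
```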

  28. A simple trigram example • Sentence: "the dog ran away" • P̂[the dog ran away] = P̂[the dog ran] P̂[away | dog ran] • P̂[away | dog ran] = P̂[dog ran away] / P̂[dog ran]

  29. Drawback • Sparsity issue: the estimates P̂[…] are most likely to be 0 • Bad case: "dog ran away" never appears in the training corpus, so P̂[dog ran away] = 0 • Even worse: "dog ran" never appears in the training corpus, so P̂[dog ran] = 0

  30. Rectify: smoothing • Basic method: adding non-zero probability mass to zero entries • Example: Laplace smoothing that adds one count to all n-grams: pseudocount[dog] = actualcount(dog) + 1

  31. Rectify: smoothing • Basic method: adding non-zero probability mass to zero entries • Example: Laplace smoothing that adds one count to all n-grams: pseudocount[dog] = actualcount(dog) + 1 • P̂[dog] = pseudocount[dog] / pseudo length of the corpus = pseudocount[dog] / (actual length of the corpus + |V|)

  32. Rectify: smoothing • Basic method: adding non-zero probability mass to zero entries • Example: Laplace smoothing that adds one count to all n-grams: pseudocount[dog ran away] = actualcount(dog ran away) + 1 • pseudocount[dog ran] = ?

  33. Rectify: smoothing • Basic method: adding non-zero probability mass to zero entries • Example: Laplace smoothing that adds one count to all n-grams: pseudocount[dog ran away] = actualcount(dog ran away) + 1 • pseudocount[dog ran] = actualcount(dog ran) + |V| • P̂[away | dog ran] ≈ pseudocount[dog ran away] / pseudocount[dog ran], since #bigrams ≈ #trigrams on the corpus
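
A sketch of the add-one (Laplace) smoothed conditional estimate; the toy corpus and the default choice |V| = number of observed word types are illustrative assumptions:

```python
from collections import Counter

def laplace_cond_prob(word, context, tokens, n=3, vocab_size=None):
    """Add-one smoothed estimate of P[word | context], following the slide:
    pseudocount(context + word) = count + 1, pseudocount(context) = count + |V|."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    if vocab_size is None:
        vocab_size = len(set(tokens))                  # |V|
    num = ngrams[tuple(context) + (word,)] + 1
    den = contexts[tuple(context)] + vocab_size
    return num / den

tokens = "the dog came back and the dog sat down".split()
# "dog ran away" never occurs, yet the smoothed estimate is non-zero:
print(laplace_cond_prob("away", ["dog", "ran"], tokens))
```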

  34. Example • Preprocessed text: this is a good sentence this is another good sentence • How many unigrams? • How many bigrams? • Estimate P̂[is | this] without using Laplace smoothing • Estimate P̂[is | this] using Laplace smoothing (|V| = 10000)
