  1. SFU NatLangLab CMPT 413/825: Natural Language Processing Text Classification Fall 2020 2020-09-18 Adapted from slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan 1

  2. Announcements • Remaining lectures on language modeling (LM) are on Canvas • Initial grades for the HW0 programming section have been released • You have until 11:59 Friday to resubmit / address any comments in the feedback • We will aim to have final grades for HW0 out next week • For those who do not have a group (single-student groups), we have created a Piazza group through which you can contact each other. 2

  3. Why classify? • Spam detection • Sentiment analysis • Authorship attribution • Language detection • News categorization 3

  4. Other Examples • Intent detection • Prepositional phrase attachment 4

  5. Classification: The Task
  • Inputs: a document d and a set of classes C = {c_1, c_2, c_3, …, c_m}
  • Output: predicted class c for document d
  • Examples: “Movie was terrible” → Classify → Negative; “Amazing acting” → Classify → Positive 5

  6. 
 
 Rule-based classification • Combinations of features on words in document, meta-data 
 IF there exists word w in document d such that w in [good, great, extra-ordinary, …], 
 THEN output Positive 
 IF email address ends in [ ithelpdesk.com, makemoney.com, spinthewheel.com, … ] 
 THEN output SPAM • Simple, can be very accurate • But: rules may be hard to define (and some even unknown to us!) • Expensive • Not easily generalizable 6
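  As a concrete illustration, here is a minimal sketch of the rule-based approach above in Python. The keyword list, the spam-domain list, and the label strings are assumptions taken from the slide's examples, not a real rule set.

```python
# Minimal sketch of a rule-based classifier (illustrative rules only).
POSITIVE_WORDS = {"good", "great", "extraordinary"}
SPAM_DOMAINS = ("ithelpdesk.com", "makemoney.com", "spinthewheel.com")

def classify_review(document: str) -> str:
    """Output Positive if any positive keyword appears in the document."""
    words = document.lower().split()
    return "Positive" if any(w in POSITIVE_WORDS for w in words) else "Unknown"

def classify_email(sender_address: str) -> str:
    """Output SPAM if the sender's address ends in a known spam domain."""
    return "SPAM" if sender_address.lower().endswith(SPAM_DOMAINS) else "NOT_SPAM"

print(classify_review("The acting was great"))   # Positive
print(classify_email("deals@makemoney.com"))     # SPAM
```

  Writing such rules by hand can be accurate, but as the slide notes, they are expensive to build and hard to generalize.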

  7. Supervised Learning: Let’s use statistics!
  • Data-driven approach: let the machine figure out the best patterns to use!
  • Inputs: a set of m classes C = {c_1, c_2, …, c_m} and a set of n ‘labeled’ documents {(d_1, c_1), (d_2, c_2), …, (d_n, c_n)}
  • Output: trained classifier, F : d → c
  • What form should F take? How do we learn F? 7

  8. Recall: general guidelines for model building Two steps to building a probability model: 1. Define the model (what form should F take? what independence assumptions do we make? what are the model parameters, i.e. probability values?) 2. Estimate the model parameters through training/learning (how do we learn F?) 8

  9. Types of supervised classifiers • Naive Bayes • Logistic regression • k-nearest neighbors • Support vector machines 9

  10. Naive Bayes Classifier: General setting
  • Let the input x be represented as r features f_j, 1 ≤ j ≤ r
  • Let y be the output classification
  • We can have a simple classification model using Bayes rule:
    P(y | x) = P(y) · P(x | y) / P(x)   (posterior ∝ prior · likelihood)
  • Make strong (naive) conditional independence assumptions:
    P(x | y) = ∏_{j=1}^{r} P(f_j | y), so P(y | x) ∝ P(y) · ∏_{j=1}^{r} P(f_j | y) 10

  11. Naive Bayes classifier for text classification
  • For text classification, the input x is a document d = (w_1, …, w_k)
  • Use as our features the words w_j, 1 ≤ j ≤ |V|, where V is our vocabulary
  • The output classification is the class c
  • Predicting the best class (maximum a posteriori, MAP, estimate):
    c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(c) P(d | c) / P(d) = argmax_{c ∈ C} P(c) P(d | c)
    where P(c) is the prior probability of class c and P(d | c) is the conditional probability of generating document d from class c 11

  12. How to represent P(d | c)?
  • Option 1: represent the entire sequence of words, P(w_1, w_2, w_3, …, w_k | c) (too many sequences!)
  • Option 2: Bag of words
  • Assume the position of each word is irrelevant (both absolute and relative)
  • P(w_1, w_2, w_3, …, w_k | c) = P(w_1 | c) P(w_2 | c) … P(w_k | c)
  • The probability of each word is conditionally independent given class c 12

  13. Bag of words
  Example review: “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!”
  Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, with 1, … 13
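  A bag-of-words representation takes only a few lines of Python. This is a minimal sketch; the tokenization (lowercasing and a simple regex) is an assumption for illustration, not the scheme used on the slide.

```python
import re
from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue "
          "is great and the adventure scenes are fun... It manages to be whimsical "
          "and romantic while laughing at the conventions of the fairy tale genre. "
          "I would recommend it to just about anyone. I've seen it several times, "
          "and I'm always happy to see it again whenever I have a friend who "
          "hasn't seen it yet!")

# Naive tokenization: lowercase, keep letters and apostrophes
# (so "it's" and "I've" stay single tokens).
tokens = re.findall(r"[a-z']+", review.lower())

# The bag of words: word order is thrown away, only per-word counts remain.
bag = Counter(tokens)
print(bag.most_common(6))  # the most frequent words and their counts
```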

  14. Predicting with Naive Bayes
  • Once we assume that the position of each word is irrelevant and that the words are conditionally independent given class c, we have:
    P(d | c) = P(w_1, w_2, w_3, …, w_k | c) = P(w_1 | c) P(w_2 | c) … P(w_k | c)
  • The maximum a posteriori (MAP) estimate is now:
    c_MAP = argmax_{c ∈ C} P̂(c) P̂(d | c) = argmax_{c ∈ C} P̂(c) ∏_{i=1}^{k} P̂(w_i | c)
    (P̂ is used to indicate an estimated probability)
  • Note that k is the number of tokens (words) in the document; the index i is the position of the token. 14
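  A sketch of how this decision rule might look in code, assuming the probability tables have already been estimated (the function and argument names are hypothetical). Log probabilities are summed instead of multiplying raw probabilities, a standard trick to avoid numerical underflow on long documents; the slides do not discuss this detail.

```python
import math

def predict(tokens, priors, cond_probs):
    """Return argmax_c P(c) * prod_i P(w_i | c), computed in log space.

    priors:     dict class -> P(c)
    cond_probs: dict class -> dict word -> P(w | c)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            score += math.log(cond_probs[c][w])  # assumes every token has an entry
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```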

  15. Naive Bayes as a generative model Generate the entire data set one document at a time 15

  16. Naive Bayes as a generative model • For document d1, first sample a category: c = Science, with probability P(c) 16

  17. Naive Bayes as a generative model • Then sample words: w_1 = Scientists, with probability P(w_1 | c) 17

  18. Naive Bayes as a generative model • Keep sampling words for d1: w_1 = Scientists, w_2 = have, w_3 = discovered, each with probability P(w_i | c) 18

  19. Naive Bayes as a generative model • Generate the entire data set one document at a time: d1 (c = Science) “Scientists have discovered …”, d2 (c = Environment) “Global warming has …” 19

  20. Estimating probabilities 20

  21. Data sparsity
  • What about when count(‘amazing’, positive) = 0? This implies P̂(‘amazing’ | positive) = 0
  • Given a review document d = “…. most amazing movie ever …”:
    c_MAP = argmax_{c ∈ C} P̂(c) ∏_{i=1}^{k} P̂(w_i | c) = argmax_{c ∈ C} P̂(c) · 0 = argmax_{c ∈ C} 0
  • Can’t determine the best class c! 21

  22. Solution: Smoothing! Laplace smoothing • Simple, easy to use • Effective in practice 22
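  A minimal sketch of add-one (Laplace) smoothed estimates, following the formula used in the worked example two slides below; the function and variable names are my own.

```python
from collections import Counter

def laplace_word_probs(docs_in_class, vocab, alpha=1):
    """Estimate P(w | c) = (count(w, c) + alpha) / (count(c) + alpha * |V|).

    docs_in_class: list of token lists, all belonging to class c
    vocab:         set of all words seen in training (across all classes)
    """
    counts = Counter(w for doc in docs_in_class for w in doc)
    total = sum(counts.values())           # count(c): total tokens in class c
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}
```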

  23. Variants and overall process
  Variants (named after the distribution assumed for the features, P(f_i | y) → P(w_i | c)):
  • Multinomial Naive Bayes: normal counts (0, 1, 2, …)
  • Binary (Multinomial) NB / Bernoulli NB: binarized counts (0/1)
  Overall process:
  • Input: set of annotated documents {(d_i, c_i)}, i = 1, …, n
  • A. Compute vocabulary V of all words
  • B. Calculate P̂(c) for each class c
  • C. Calculate P̂(w | c) for each word w and class c
  • D. (Prediction) Given a document d = (w_1, w_2, …, w_k), output c_MAP = argmax_{c ∈ C} P̂(c) ∏_{i=1}^{k} P̂(w_i | c) (see the training sketch below) 23
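  The overall process above can be sketched end to end in a few dozen lines of Python. This is a minimal multinomial Naive Bayes with add-one smoothing; the function and variable names are my own, and the whitespace tokenization is an assumption.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1):
    """Steps A-C: vocabulary, priors P(c), and smoothed conditionals P(w | c)."""
    tokenized = [d.split() for d in docs]            # naive whitespace tokenization
    vocab = {w for doc in tokenized for w in doc}    # A. vocabulary V

    priors = {c: labels.count(c) / len(labels) for c in set(labels)}  # B. P(c) = N_c / N

    cond = {}
    for c in priors:                                 # C. P(w | c) with add-one smoothing
        counts = Counter(w for doc, lab in zip(tokenized, labels) if lab == c for w in doc)
        denom = sum(counts.values()) + alpha * len(vocab)
        cond[c] = {w: (counts[w] + alpha) / denom for w in vocab}
    return vocab, priors, cond

def predict_naive_bayes(doc, vocab, priors, cond):
    """Step D: c_MAP = argmax_c P(c) * prod_i P(w_i | c); unknown test words are ignored."""
    tokens = [w for w in doc.split() if w in vocab]
    scores = {c: math.log(priors[c]) + sum(math.log(cond[c][w]) for w in tokens)
              for c in priors}
    return max(scores, key=scores.get)
```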

  24. Naive Bayes Example
  Training: Doc 1 “Chinese Beijing Chinese” (class c), Doc 2 “Chinese Chinese Shanghai” (class c), Doc 3 “Chinese Macao” (class c), Doc 4 “Tokyo Japan Chinese” (class j)
  Test: Doc 5 “Chinese Chinese Chinese Tokyo Japan” (class ?)
  Estimates: P̂(c) = N_c / N; smoothing with α = 1: P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)
  Priors: P̂(c) = 3/4, P̂(j) = 1/4
  Conditional probabilities:
  • P̂(Chinese | c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
  • P̂(Tokyo | c) = (0 + 1) / (8 + 6) = 1/14
  • P̂(Japan | c) = (0 + 1) / (8 + 6) = 1/14
  • P̂(Chinese | j) = (1 + 1) / (3 + 6) = 2/9
  • P̂(Tokyo | j) = (1 + 1) / (3 + 6) = 2/9
  • P̂(Japan | j) = (1 + 1) / (3 + 6) = 2/9
  Choosing a class:
  • P(c | d5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003
  • P(j | d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001 24
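  The two unnormalized scores above can be checked with a couple of lines of Python (just a sanity check of the arithmetic on the slide):

```python
# Unnormalized class scores for test document d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c = (3/4) * (3/7)**3 * (1/14) * (1/14)   # ≈ 0.0003
score_j = (1/4) * (2/9)**3 * (2/9) * (2/9)     # ≈ 0.0001
print(f"P(c|d5) ∝ {score_c:.4f}, P(j|d5) ∝ {score_j:.4f}")  # class c wins
```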

  25. Some details • Vocabulary is important • Tokenization matters: it can affect your vocabulary • Tokenization = how you break your sentence up into tokens / words • Make sure you are consistent with your tokenization! • Special multi-word tokens: NOT_happy 25

  26. Some details • Vocabulary is important • Tokenization matters: it can affect your vocabulary • Tokenization = how you break your sentence up into tokens / words • Make sure you are consistent with your tokenization! • Handling unknown words in test that are not in your training vocabulary? Remove them from your test document! Just ignore them. • Handling stop words (common words like a, the that may not be useful)? Remove them from the training data! • Better to use: modified counts (tf-idf) that down-weigh frequent, unimportant words, or better models! (see the sketch below) 26
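  As an illustration of the stop-word and tf-idf options mentioned above, here is a short sketch using scikit-learn's TfidfVectorizer. The toy documents are my own, and the slides do not prescribe this (or any) particular library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was great", "the movie was terrible", "great acting, great movie"]

# stop_words="english" drops common function words; tf-idf down-weighs words
# that appear in many documents and so carry little class information.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # remaining vocabulary
print(X.toarray().round(2))                # tf-idf weighted document vectors
```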

  27. Features • In general, Naive Bayes can use any set of features, not just words • URLs, email addresses, capitalization, … • Domain knowledge can be crucial to performance • (Figure: top features for spam detection) 27

  28. Naive Bayes and Language Models
  • If features = bag of words, NB gives a per-class unigram language model!
  • For class c, it assigns each word the probability P(w | c), and each sentence the probability P(s | c) = ∏_{w ∈ s} P(w | c)
  • Example with positive and negative sentiments: P(s | pos) = 0.0000005 28

  29. Naive Bayes as a language model
  • Which class assigns the higher probability to s = “I love this fun film”?
  Model pos: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1
  Model neg: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1
  P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 5 × 10⁻⁷ > P(s | neg) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 1 × 10⁻⁹ 29
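  A quick check of the comparison above; the per-word probability tables are copied from the slide.

```python
pos = {"I": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".split()

p_pos, p_neg = 1.0, 1.0
for w in sentence:
    p_pos *= pos[w]
    p_neg *= neg[w]

print(p_pos, p_neg, p_pos > p_neg)  # ≈ 5e-07, ≈ 1e-09, True
```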
