  1. SFU NatLangLab CMPT 413/825: Natural Language Processing Text Classification Fall 2020 2020-09-18 Adapted from slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan 1

  2. Announcements • Remaining lectures on language modeling (LM) are on Canvas • Initial grades for the HW0 programming section have been released • You have until 11:59 Friday to resubmit / address any comments in the feedback • We will aim to have final grades for HW0 out next week • For those who do not have a group (single-student groups), we have created a Piazza group through which you can contact each other. 2

  3. Why classify? • Spam detection • Sentiment analysis • Authorship attribution • Language detection • News categorization 3

  4. Other Examples • Intent detection • Prepositional phrase attachment 4

  5. Classification: The Task
  • Inputs: a document d and a set of classes C = {c_1, c_2, c_3, …, c_m}
  • Output: predicted class c for document d
  • Examples: “Movie was terrible” → Classify → Negative; “Amazing acting” → Classify → Positive 5

  6. 
 
 Rule-based classification • Combinations of features on words in document, meta-data 
 IF there exists word w in document d such that w in [good, great, extra-ordinary, …], 
 THEN output Positive 
 IF email address ends in [ ithelpdesk.com, makemoney.com, spinthewheel.com, … ] 
 THEN output SPAM • Simple, can be very accurate • But: rules may be hard to define (and some even unknown to us!) • Expensive • Not easily generalizable 6
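  As a concrete illustration, here is a minimal sketch of the rule-based approach above in Python. The keyword list, the spam-domain list, and the label strings are assumptions taken from the slide's examples, not a real rule set.

```python
# Minimal sketch of a rule-based classifier (illustrative rules only).
POSITIVE_WORDS = {"good", "great", "extraordinary"}
SPAM_DOMAINS = ("ithelpdesk.com", "makemoney.com", "spinthewheel.com")

def classify_review(document: str) -> str:
    """Output Positive if any positive keyword appears in the document."""
    words = document.lower().split()
    return "Positive" if any(w in POSITIVE_WORDS for w in words) else "Unknown"

def classify_email(sender_address: str) -> str:
    """Output SPAM if the sender's address ends in a known spam domain."""
    return "SPAM" if sender_address.lower().endswith(SPAM_DOMAINS) else "NOT_SPAM"

print(classify_review("The acting was great"))   # Positive
print(classify_email("deals@makemoney.com"))     # SPAM
```

  Writing such rules by hand can be accurate, but as the slide notes, they are expensive to build and hard to generalize.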

  7. Supervised Learning: Let’s use statistics!
  • Data-driven approach: let the machine figure out the best patterns to use!
  • Inputs: a set of m classes C = {c_1, c_2, …, c_m} and a set of n ‘labeled’ documents {(d_1, c_1), (d_2, c_2), …, (d_n, c_n)}
  • Output: trained classifier, F : d → c
  • What form should F take? How do we learn F? 7

  8. Recall: general guidelines for model building Two steps to building a probability model: 1. Define the model (what form should F take? what independence assumptions do we make? what are the model parameters, i.e. probability values?) 2. Estimate the model parameters through training/learning (how do we learn F?) 8

  9. Types of supervised classifiers • Naive Bayes • Logistic regression • k-nearest neighbors • Support vector machines 9

  10. Naive Bayes Classifier: General setting
  • Let the input x be represented as r features f_j, 1 ≤ j ≤ r
  • Let y be the output classification
  • We can have a simple classification model using Bayes rule:
    P(y | x) = P(y) · P(x | y) / P(x)   (posterior ∝ prior · likelihood)
  • Make strong (naive) conditional independence assumptions:
    P(x | y) = ∏_{j=1}^{r} P(f_j | y), so P(y | x) ∝ P(y) · ∏_{j=1}^{r} P(f_j | y) 10

  11. Naive Bayes classifier for text classification
  • For text classification, the input x is a document d = (w_1, …, w_k)
  • Use as our features the words w_j, 1 ≤ j ≤ |V|, where V is our vocabulary
  • The output classification is the class c
  • Predicting the best class (maximum a posteriori, MAP, estimate):
    c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(c) P(d | c) / P(d) = argmax_{c ∈ C} P(c) P(d | c)
    where P(c) is the prior probability of class c and P(d | c) is the conditional probability of generating document d from class c 11

  12. How to represent P(d | c)?
  • Option 1: represent the entire sequence of words, P(w_1, w_2, w_3, …, w_k | c) (too many sequences!)
  • Option 2: Bag of words
  • Assume the position of each word is irrelevant (both absolute and relative)
  • P(w_1, w_2, w_3, …, w_k | c) = P(w_1 | c) P(w_2 | c) … P(w_k | c)
  • The probability of each word is conditionally independent given class c 12

  13. Bag of words
  Example review: “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!”
  Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, with 1, … 13
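  A bag-of-words representation takes only a few lines of Python. This is a minimal sketch; the tokenization (lowercasing and a simple regex) is an assumption for illustration, not the scheme used on the slide.

```python
import re
from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue "
          "is great and the adventure scenes are fun... It manages to be whimsical "
          "and romantic while laughing at the conventions of the fairy tale genre. "
          "I would recommend it to just about anyone. I've seen it several times, "
          "and I'm always happy to see it again whenever I have a friend who "
          "hasn't seen it yet!")

# Naive tokenization: lowercase, keep letters and apostrophes
# (so "it's" and "I've" stay single tokens).
tokens = re.findall(r"[a-z']+", review.lower())

# The bag of words: word order is thrown away, only per-word counts remain.
bag = Counter(tokens)
print(bag.most_common(6))  # the most frequent words and their counts
```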

  14. Predicting with Naive Bayes
  • Once we assume that the position of each word is irrelevant and that the words are conditionally independent given class c, we have:
    P(d | c) = P(w_1, w_2, w_3, …, w_k | c) = P(w_1 | c) P(w_2 | c) … P(w_k | c)
  • The maximum a posteriori (MAP) estimate is now:
    c_MAP = argmax_{c ∈ C} P̂(c) P̂(d | c) = argmax_{c ∈ C} P̂(c) ∏_{i=1}^{k} P̂(w_i | c)
    (P̂ is used to indicate an estimated probability)
  • Note that k is the number of tokens (words) in the document; the index i is the position of the token. 14
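  A sketch of how this decision rule might look in code, assuming the probability tables have already been estimated (the function and argument names are hypothetical). Log probabilities are summed instead of multiplying raw probabilities, a standard trick to avoid numerical underflow on long documents; the slides do not discuss this detail.

```python
import math

def predict(tokens, priors, cond_probs):
    """Return argmax_c P(c) * prod_i P(w_i | c), computed in log space.

    priors:     dict class -> P(c)
    cond_probs: dict class -> dict word -> P(w | c)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            score += math.log(cond_probs[c][w])  # assumes every token has an entry
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```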

  15. Naive Bayes as a generative model Generate the entire data set one document at a time 15

  16. Naive Bayes as a generative model • For document d1, first sample a category: c = Science, with probability P(c) 16

  17. Naive Bayes as a generative model • Then sample words: w_1 = Scientists, with probability P(w_1 | c) 17

  18. Naive Bayes as a generative model • Keep sampling words for d1: w_1 = Scientists, w_2 = have, w_3 = discovered, each with probability P(w_i | c) 18

  19. Naive Bayes as a generative model • Generate the entire data set one document at a time: d1 (c = Science) “Scientists have discovered …”, d2 (c = Environment) “Global warming has …” 19

  20. Estimating probabilities 20

  21. Data sparsity
  • What about when count(‘amazing’, positive) = 0? This implies P̂(‘amazing’ | positive) = 0
  • Given a review document d = “…. most amazing movie ever …”:
    c_MAP = argmax_{c ∈ C} P̂(c) ∏_{i=1}^{k} P̂(w_i | c) = argmax_{c ∈ C} P̂(c) · 0 = argmax_{c ∈ C} 0
  • Can’t determine the best class c! 21

  22. Solution: Smoothing! Laplace smoothing • Simple, easy to use • Effective in practice 22
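  A minimal sketch of add-one (Laplace) smoothed estimates, following the formula used in the worked example two slides below; the function and variable names are my own.

```python
from collections import Counter

def laplace_word_probs(docs_in_class, vocab, alpha=1):
    """Estimate P(w | c) = (count(w, c) + alpha) / (count(c) + alpha * |V|).

    docs_in_class: list of token lists, all belonging to class c
    vocab:         set of all words seen in training (across all classes)
    """
    counts = Counter(w for doc in docs_in_class for w in doc)
    total = sum(counts.values())           # count(c): total tokens in class c
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}
```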

  23. Variants and overall process
  Variants (named after the distribution assumed for the features, P(f_i | y) → P(w_i | c)):
  • Multinomial Naive Bayes: normal counts (0, 1, 2, …)
  • Binary (Multinomial) NB / Bernoulli NB: binarized counts (0/1)
  Overall process:
  • Input: set of annotated documents {(d_i, c_i)}, i = 1, …, n
  • A. Compute vocabulary V of all words
  • B. Calculate P̂(c) for each class c
  • C. Calculate P̂(w | c) for each word w and class c
  • D. (Prediction) Given a document d = (w_1, w_2, …, w_k), output c_MAP = argmax_{c ∈ C} P̂(c) ∏_{i=1}^{k} P̂(w_i | c) (see the training sketch below) 23
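  The overall process above can be sketched end to end in a few dozen lines of Python. This is a minimal multinomial Naive Bayes with add-one smoothing; the function and variable names are my own, and the whitespace tokenization is an assumption.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1):
    """Steps A-C: vocabulary, priors P(c), and smoothed conditionals P(w | c)."""
    tokenized = [d.split() for d in docs]            # naive whitespace tokenization
    vocab = {w for doc in tokenized for w in doc}    # A. vocabulary V

    priors = {c: labels.count(c) / len(labels) for c in set(labels)}  # B. P(c) = N_c / N

    cond = {}
    for c in priors:                                 # C. P(w | c) with add-one smoothing
        counts = Counter(w for doc, lab in zip(tokenized, labels) if lab == c for w in doc)
        denom = sum(counts.values()) + alpha * len(vocab)
        cond[c] = {w: (counts[w] + alpha) / denom for w in vocab}
    return vocab, priors, cond

def predict_naive_bayes(doc, vocab, priors, cond):
    """Step D: c_MAP = argmax_c P(c) * prod_i P(w_i | c); unknown test words are ignored."""
    tokens = [w for w in doc.split() if w in vocab]
    scores = {c: math.log(priors[c]) + sum(math.log(cond[c][w]) for w in tokens)
              for c in priors}
    return max(scores, key=scores.get)
```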

  24. Naive Bayes Example
  Training: Doc 1 “Chinese Beijing Chinese” (class c), Doc 2 “Chinese Chinese Shanghai” (class c), Doc 3 “Chinese Macao” (class c), Doc 4 “Tokyo Japan Chinese” (class j)
  Test: Doc 5 “Chinese Chinese Chinese Tokyo Japan” (class ?)
  Estimates: P̂(c) = N_c / N; smoothing with α = 1: P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)
  Priors: P̂(c) = 3/4, P̂(j) = 1/4
  Conditional probabilities:
  • P̂(Chinese | c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
  • P̂(Tokyo | c) = (0 + 1) / (8 + 6) = 1/14
  • P̂(Japan | c) = (0 + 1) / (8 + 6) = 1/14
  • P̂(Chinese | j) = (1 + 1) / (3 + 6) = 2/9
  • P̂(Tokyo | j) = (1 + 1) / (3 + 6) = 2/9
  • P̂(Japan | j) = (1 + 1) / (3 + 6) = 2/9
  Choosing a class:
  • P(c | d5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003
  • P(j | d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001 24
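  The two unnormalized scores above can be checked with a couple of lines of Python (just a sanity check of the arithmetic on the slide):

```python
# Unnormalized class scores for test document d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c = (3/4) * (3/7)**3 * (1/14) * (1/14)   # ≈ 0.0003
score_j = (1/4) * (2/9)**3 * (2/9) * (2/9)     # ≈ 0.0001
print(f"P(c|d5) ∝ {score_c:.4f}, P(j|d5) ∝ {score_j:.4f}")  # class c wins
```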

  25. Some details • Vocabulary is important • Tokenization matters: it can affect your vocabulary • Tokenization = how you break your sentence up into tokens / words • Make sure you are consistent with your tokenization! • Special multi-word tokens: NOT_happy 25

  26. Some details • Vocabulary is important • Tokenization matters: it can affect your vocabulary • Tokenization = how you break your sentence up into tokens / words • Make sure you are consistent with your tokenization! • Handling unknown words in test that are not in your training vocabulary? Remove them from your test document! Just ignore them. • Handling stop words (common words like a, the that may not be useful)? Remove them from the training data! • Better to use: modified counts (tf-idf) that down-weigh frequent, unimportant words, or better models! (see the sketch below) 26
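  As an illustration of the stop-word and tf-idf options mentioned above, here is a short sketch using scikit-learn's TfidfVectorizer. The toy documents are my own, and the slides do not prescribe this (or any) particular library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was great", "the movie was terrible", "great acting, great movie"]

# stop_words="english" drops common function words; tf-idf down-weighs words
# that appear in many documents and so carry little class information.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # remaining vocabulary
print(X.toarray().round(2))                # tf-idf weighted document vectors
```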

  27. Features • In general, Naive Bayes can use any set of features, not just words • URLs, email addresses, capitalization, … • Domain knowledge can be crucial to performance • (Figure: top features for spam detection) 27

  28. Naive Bayes and Language Models
  • If features = bag of words, NB gives a per-class unigram language model!
  • For class c, it assigns each word the probability P(w | c), and each sentence the probability P(s | c) = ∏_{w ∈ s} P(w | c)
  • Example with positive and negative sentiments: P(s | pos) = 0.0000005 28

  29. Naive Bayes as a language model
  • Which class assigns the higher probability to s = “I love this fun film”?
  Model pos: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1
  Model neg: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1
  P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 5 × 10⁻⁷ > P(s | neg) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 1 × 10⁻⁹ 29
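  A quick check of the comparison above; the per-word probability tables are copied from the slide.

```python
pos = {"I": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".split()

p_pos, p_neg = 1.0, 1.0
for w in sentence:
    p_pos *= pos[w]
    p_neg *= neg[w]

print(p_pos, p_neg, p_pos > p_neg)  # ≈ 5e-07, ≈ 1e-09, True
```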
