Text Classification II
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani, Fall 2018
Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Outline
} Vector space classification
} Rocchio
} Linear classifiers
} SVM
} kNN
Features
} Supervised learning classifiers can use any sort of feature
} URL, email address, punctuation, capitalization, dictionaries, network features
} In the simplest bag-of-words view of documents:
} We use only word features
} We use all of the words in the text (not a subset)
The bag-of-words representation
} Example document (a movie review): "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet."
} The classifier maps the whole document to a class: γ(d) = c
The bag-of-words representation
} The document is reduced to a vector of word counts, and the classifier sees only this vector: γ(d) = c
great      2
love       2
recommend  1
laugh      1
happy      1
...        ...
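A minimal Python sketch of building such a count vector (the crude tokenizer and the tiny vocabulary here are illustrative assumptions, not part of the slides):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    # Crude tokenization: lowercase, split on whitespace, keep alphabetic tokens.
    tokens = [tok for tok in text.lower().split() if tok.isalpha()]
    counts = Counter(tokens)
    # The document becomes a vector indexed by the vocabulary.
    return [counts[word] for word in vocabulary]

vocab = ["great", "love", "recommend", "laugh", "happy"]
review = "love love great great recommend laugh happy"
print(bag_of_words(review, vocab))  # -> [2, 2, 1, 1, 1]
```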
Sec. 14.1
Recall: vector space representation
} Each doc is a vector
} One component for each term (= word)
} Terms are axes
} Usually normalize vectors to unit length
} High-dimensional vector space:
} 10,000+ dimensions, or even 100,000+
} Docs are vectors in this space
} How can we do classification in this space?
Sec. 14.1
Classification using vector spaces
} Training set: a set of docs, each labeled with its class (e.g., topic)
} This set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
} Premise 1: Docs in the same class form a contiguous region of space
} Premise 2: Docs from different classes don't overlap (much)
} We define surfaces to delineate classes in the space
Sec. 14.1
Documents in a vector space
[Figure: documents plotted in the vector space, clustered by class: Government, Science, Arts]
Sec. 14.1
Test document of what class?
[Figure: the three class regions (Government, Science, Arts) with an unlabeled test document]
Sec. 14.1
Test document of what class? Government
[Figure: the test document lies within the Government region]
} Is this similarity hypothesis true in general?
} Our main topic today is how to find good separators
Relevance feedback: relation to classification
} In relevance feedback, the user marks docs as relevant/non-relevant.
} Relevant/non-relevant can be viewed as classes or categories.
} For each doc, the user decides which of these two classes is correct.
} Relevance feedback is a form of text classification.
Sec. 14.2
Rocchio for text classification
} Relevance feedback methods can be adapted for text categorization
} Relevance feedback can be viewed as 2-class classification
} Use standard tf-idf weighted vectors to represent text docs
} For the training docs in each category, compute a prototype as the centroid of their vectors.
} Prototype = centroid of members of class
} Assign test docs to the category with the closest prototype vector, based on cosine similarity.
Sec. 14.2
Definition of centroid

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

} $D_c$: the docs that belong to class $c$
} $\vec{v}(d)$: the vector space representation of $d$
} The centroid will in general not be a unit vector, even when the inputs are unit vectors.
Rocchio algorithm
[Figure: pseudocode for training (compute one centroid per class) and applying (assign a doc to the class of the closest centroid) the Rocchio classifier]
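A minimal sketch of the two steps in Python, assuming documents are already given as NumPy term vectors; the function names are hypothetical:

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (prototype) per class.

    X: (n_docs, n_terms) array of document vectors
    y: length-n_docs array of class labels
    """
    classes = np.unique(y)
    return {c: X[y == c].mean(axis=0) for c in classes}

def apply_rocchio(centroids, doc):
    """Assign doc to the class whose centroid is most cosine-similar."""
    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda c: cos_sim(centroids[c], doc))
```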
Rocchio: example
} We will see that Rocchio finds linear boundaries between classes
[Figure: the three classes (Government, Science, Arts) separated by linear boundaries]
Sec. 14.2
Illustration of Rocchio: text classification
[Figure: class centroids and the resulting decision boundaries]
Sec. 14.2
Rocchio properties
} Forms a simple generalization of the examples in each class (a prototype).
} The prototype vector does not need to be normalized.
} Classification is based on similarity to class prototypes.
} Does not guarantee that classifications are consistent with the given training data.
Sec. 14.2
Rocchio anomaly
} Prototype models have problems with polymorphic (disjunctive) categories.
Sec. 14.2
Rocchio classification: summary
} Rocchio forms a simple representation for each class:
} Centroid/prototype
} Classification is based on similarity to the prototype
} It does not guarantee that classifications are consistent with the given training data
} It is little used outside text classification
} It has been used quite effectively for text classification
} But in general worse than many other classifiers
} Rocchio does not handle nonconvex, multimodal classes correctly.
Linear classifiers
} Assumption: the classes are linearly separable.
} Classification decision: $\sum_{i=1}^{M} w_i x_i + w_0 > 0$?
} First, we only consider binary classifiers.
} Geometrically, the decision boundary is a line (2D), a plane (3D), or a hyperplane (higher dimensionalities).
} Find the parameters $w_0, w_1, \ldots, w_M$ based on the training set.
} Methods for finding these parameters: Perceptron, Rocchio, …
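As a sketch, the decision rule is just the sign of a dot product plus a bias (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def linear_classify(w, w0, x):
    """True iff x lies on the positive side of the hyperplane w.x + w0 = 0."""
    return float(np.dot(w, x) + w0) > 0.0
```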
Sec. 14.4
Separation by hyperplanes
} A simplifying assumption is linear separability:
} in 2 dimensions, can separate classes by a line
} in higher dimensions, need hyperplanes
Sec. 14.2
Two-class Rocchio as a linear classifier
} Line or hyperplane defined by:

$$w_0 + \sum_{i=1}^{M} w_i d_i = w_0 + \vec{w}^{\,T} \vec{d} \ge 0$$

} For Rocchio, set:

$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2), \qquad w_0 = -\frac{1}{2}\left(\|\vec{\mu}(c_1)\|^2 - \|\vec{\mu}(c_2)\|^2\right)$$
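A small sketch of computing these parameters from two centroids (continuing the hypothetical NumPy representation used earlier); a point then satisfies $\vec{w}^T\vec{d} + w_0 \ge 0$ exactly when it is at least as close to $\vec{\mu}(c_1)$ as to $\vec{\mu}(c_2)$:

```python
import numpy as np

def rocchio_hyperplane(mu1, mu2):
    """Build the (w, w0) of the hyperplane equidistant from two centroids."""
    w = mu1 - mu2
    w0 = -0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))
    return w, w0  # assign to class 1 iff w . d + w0 >= 0
```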
Sec. 14.4
Linear classifier: example
} Class: "interest" (as in interest rate)
} Example features of a linear classifier:

  w_i     t_i             w_i     t_i
  0.70    prime          −0.71    dlrs
  0.67    rate           −0.35    world
  0.63    interest       −0.33    sees
  0.60    rates          −0.25    year
  0.46    discount       −0.24    group
  0.43    bundesbank     −0.24    dlr

} To classify, find the dot product of the feature vector and the weight vector
Linear classifier: example
} Class "interest" in Reuters-21578, with $w_0 = 0$
} $d_1$: "rate discount dlrs world"
} $d_2$: "prime dlrs"
} $\vec{w}^T \vec{d}_1 = 0.67 + 0.46 - 0.71 - 0.35 = 0.07 > 0$ ⇒ $d_1$ is assigned to the "interest" class
} $\vec{w}^T \vec{d}_2 = 0.70 - 0.71 = -0.01 < 0$ ⇒ $d_2$ is not assigned to this class
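A tiny sketch that reproduces these two scores; the sparse-dictionary representation is an assumption for readability:

```python
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71,
           "world": -0.35, "sees": -0.33, "year": -0.25,
           "group": -0.24, "dlr": -0.24}

def score(doc, w=weights, w0=0.0):
    # Dot product of the bag-of-words vector with the weight vector.
    return w0 + sum(w.get(term, 0.0) for term in doc.split())

print(round(score("rate discount dlrs world"), 2))  # 0.07  -> "interest"
print(round(score("prime dlrs"), 2))                # -0.01 -> not "interest"
```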
Naïve Bayes as a linear classifier
} Decide $c_1$ iff:

$$P(c_1) \prod_{i=1}^{M} P(t_i \mid c_1)^{\mathrm{tf}_{i,d}} > P(c_2) \prod_{i=1}^{M} P(t_i \mid c_2)^{\mathrm{tf}_{i,d}}$$

} Taking logs, this is equivalent to:

$$\log P(c_1) + \sum_{i=1}^{M} \mathrm{tf}_{i,d} \log P(t_i \mid c_1) > \log P(c_2) + \sum_{i=1}^{M} \mathrm{tf}_{i,d} \log P(t_i \mid c_2)$$

} i.e., a linear classifier with:

$$w_i = \log \frac{P(t_i \mid c_1)}{P(t_i \mid c_2)}, \qquad x_i = \mathrm{tf}_{i,d}, \qquad w_0 = \log \frac{P(c_1)}{P(c_2)}$$
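A sketch of this conversion, assuming fitted two-class multinomial NB parameters are given as arrays (with scikit-learn's MultinomialNB these would correspond to class_log_prior_ and feature_log_prob_):

```python
import numpy as np

def nb_to_linear(log_prior, log_likelihood):
    """Convert two-class multinomial Naive Bayes parameters to (w, w0).

    log_prior: length-2 array, log P(c)
    log_likelihood: (2, n_terms) array, log P(t_i | c)
    """
    w = log_likelihood[0] - log_likelihood[1]  # w_i = log P(t_i|c1)/P(t_i|c2)
    w0 = log_prior[0] - log_prior[1]           # w0  = log P(c1)/P(c2)
    return w, w0  # choose c1 iff w . tf + w0 > 0
```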
Sec. 14.4
Linear programming / Perceptron
} Find a, b, c such that:
ax + by > c for red points
ax + by < c for blue points
Perceptron
} Update rule, if $\mathbf{x}^{(i)}$ is misclassified: $\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta\, y^{(i)} \mathbf{x}^{(i)}$
} Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is guaranteed to find a solution in a finite number of steps.

Initialize $\mathbf{w}$, $t \leftarrow 0$
repeat
  $t \leftarrow t + 1$
  $i \leftarrow t \bmod N$
  if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y^{(i)} \mathbf{x}^{(i)}$
until all patterns are properly classified

} $\eta$ can be set to 1 and the proof still works.
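A runnable sketch of the single-sample perceptron above; the explicit bias term and the max_epochs safeguard (for accidentally non-separable data) are practical additions, not part of the slide:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    """Single-sample perceptron for (assumed) linearly separable data.

    X: (N, d) array of samples; y: length-N array of labels in {-1, +1}.
    Returns (w, w0) such that sign(w . x + w0) matches y on the training set.
    """
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if y[i] * (X[i] @ w + w0) <= 0:  # misclassified (or on boundary)
                w += eta * y[i] * X[i]       # the update rule from the slide
                w0 += eta * y[i]
                errors += 1
        if errors == 0:                      # all patterns properly classified
            break
    return w, w0
```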
Sec. 14.4
Linear classifiers
} Many common text classifiers are linear classifiers
} Classifiers more powerful than linear often don't perform better on text problems. Why?
} Despite the similarity of linear classifiers, there are noticeable performance differences between them:
} For separable problems, there is an infinite number of separating hyperplanes.
} Different training methods pick different hyperplanes.
} Also different strategies for non-separable problems
Sec. 14.4
Which hyperplane?
} In general, lots of possible solutions
[Figure: a linearly separable two-class dataset with several candidate separating hyperplanes]
Sec. 14.4
Which hyperplane?
} Lots of possible solutions
} Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
} Which points should influence optimality?
} All points
} E.g., Rocchio
} Only "difficult points" close to the decision boundary
} E.g., Support Vector Machine (SVM)
Ch. 15
Linear classifiers: which hyperplane?
} Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
} E.g., perceptron
} A Support Vector Machine (SVM) finds an optimal* solution.
} Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
} One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
Sec. 15.1
Support Vector Machine (SVM)
} SVMs maximize the margin around the separating hyperplane.
} A.k.a. large margin classifiers
} The decision function is fully specified by a subset of the training samples, the support vectors.
} Solving SVMs is a quadratic programming problem
} Seen by many as the most successful current text classification method*
  *but other discriminative methods often perform very similarly
[Figure: a maximum-margin separating hyperplane with its support vectors, contrasted with a narrower-margin alternative]
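As a practical sketch, a linear SVM over tf-idf vectors can be assembled with scikit-learn; the toy corpus and labels below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled corpus; any labeled document collection would do.
docs = ["rate discount bundesbank", "prime rates interest",
        "world group year", "dlrs dlr sees"]
labels = ["interest", "interest", "other", "other"]

# tf-idf vectors feed a large-margin linear classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["discount rate"]))  # expected: ['interest']
```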
Sec. 15.1
Another intuition
} If you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased