Text Classification II
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani, Fall 2017
Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Outline
- Vector space classification
- Rocchio
- Linear classifiers
- kNN
Ch. 13 Standing queries
The path from IR to text classification:
- You have an information need to monitor, say: unrest in the Niger delta region
- You want to rerun an appropriate query periodically to find new news items on this topic
- You will be sent new documents that are found
- I.e., it's not ranking but classification (relevant vs. not relevant)
Such queries are called standing queries
- Long used by "information professionals"
- A modern mass instantiation is Google Alerts
Standing queries are (hand-written) text classifiers
Sec. 14.1 Recall: vector space representation
- Each doc is a vector: one component for each term (= word)
- Terms are axes
- Usually normalize vectors to unit length
- High-dimensional vector space: 10,000+ dimensions, or even 100,000+
- Docs are vectors in this space
- How can we do classification in this space?
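Not from the slides: a minimal sketch, assuming tokenized docs, of building the unit-length tf-idf vectors described above (the function name and exact weighting details are illustrative).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns sparse tf-idf vectors (dicts term -> weight),
    normalized to unit length as on the slide."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors
```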
Sec. 14.1 Classification using vector spaces
- Training set: a set of docs, each labeled with its class (e.g., topic)
- This set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
- Premise 1: docs in the same class form a contiguous region of space
- Premise 2: docs from different classes don't overlap (much)
- We define surfaces to delineate classes in the space
Sec. 14.1 Documents in a vector space
[Figure: documents plotted in the vector space, grouped by class: Government, Science, Arts]
Sec. 14.1 Test document: of what class?
[Figure: the same space with an unlabeled test document added]
Sec. 14.1 Test document: of what class?
[Figure: the test document falls in the Government region]
- Is this similarity hypothesis true in general?
- Our main topic today is how to find good separators
Relevance feedback: relation to classification
- In relevance feedback, the user marks docs as relevant/non-relevant
- Relevant/non-relevant can be viewed as classes or categories
- For each doc, the user decides which of these two classes is correct
- Relevance feedback is a form of text classification
Sec. 14.2 Rocchio for text classification
- Relevance feedback methods can be adapted for text categorization
  - Relevance feedback can be viewed as 2-class classification
- Use standard tf-idf weighted vectors to represent text docs
- For training docs in each category, compute a prototype as the centroid of the vectors of the training docs in that category
  - Prototype = centroid of members of class
- Assign test docs to the category with the closest prototype vector, based on cosine similarity
Sec. 14.2 Definition of centroid

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{d}$$

- $D_c$: docs that belong to class $c$
- $\vec{d}$: vector space representation of $d$
- The centroid will in general not be a unit vector, even when the inputs are unit vectors.
Rocchio algorithm
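A minimal sketch of the train/apply steps of Rocchio, assuming docs are sparse tf-idf vectors (dicts from term to weight); the helper names are illustrative, not from the slides.

```python
import math

def centroid(vectors):
    """Component-wise mean of sparse vectors (dicts: term -> weight)."""
    mu = {}
    for v in vectors:
        for t, w in v.items():
            mu[t] = mu.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in mu.items()}

def train_rocchio(labeled_docs):
    """labeled_docs: list of (vector, class_label). Returns class -> prototype centroid."""
    by_class = {}
    for vec, c in labeled_docs:
        by_class.setdefault(c, []).append(vec)
    return {c: centroid(vecs) for c, vecs in by_class.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def apply_rocchio(prototypes, doc_vec):
    """Assign the doc to the class with the closest (most cosine-similar) prototype."""
    return max(prototypes, key=lambda c: cosine(prototypes[c], doc_vec))
```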
Rocchio: example
We will see that Rocchio finds linear boundaries between classes.
[Figure: centroids and the resulting boundaries for Government, Science, Arts]
Sec. 14.2 Illustration of Rocchio: text classification
[Figure: class centroids and decision boundaries]
Sec. 14.2 Rocchio properties
- Forms a simple generalization of the examples in each class (a prototype)
- The prototype vector does not need to be normalized
- Classification is based on similarity to class prototypes
- Does not guarantee that classifications are consistent with the given training data
Sec. 14.2 Rocchio anomaly
Prototype models have problems with polymorphic (disjunctive) categories.
Sec. 14.2 Rocchio classification: summary
- Rocchio forms a simple representation for each class: the centroid/prototype
- Classification is based on similarity to the prototype
- It does not guarantee that classifications are consistent with the given training data
- It is little used outside text classification
  - It has been used quite effectively for text classification
  - But in general it is worse than many other classifiers
- Rocchio does not handle nonconvex, multimodal classes correctly
Linear classifiers
- Assumption: the classes are linearly separable
- Classification decision: $\sum_{i=1}^{n} w_i x_i + w_0 > 0$?
  - First, we only consider binary classifiers
  - Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities) as the decision boundary
- Find the parameters $w_0, w_1, \ldots, w_n$ based on the training set
- Methods for finding these parameters: Perceptron, Rocchio, ...
Sec. 14.4 Separation by hyperplanes
- A simplifying assumption is linear separability:
  - in 2 dimensions, can separate classes by a line
  - in higher dimensions, need hyperplanes
Sec. 14.2 Two-class Rocchio as a linear classifier
- Line or hyperplane defined by: $w_0 + \sum_{i=1}^{M} w_i d_i = w_0 + \vec{w}^T \vec{d} \geq 0$
- For Rocchio, set:
$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$$
$$w_0 = \frac{1}{2}\left( \|\vec{\mu}(c_2)\|^2 - \|\vec{\mu}(c_1)\|^2 \right)$$
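A small sketch of this reduction, reusing the sparse-vector conventions from the Rocchio sketch above (function names are illustrative).

```python
def rocchio_linear(mu1, mu2):
    """Weights and bias of the Rocchio boundary between classes c1 and c2.
    mu1, mu2: class centroids as dicts (term -> weight)."""
    w = {t: mu1.get(t, 0.0) - mu2.get(t, 0.0) for t in set(mu1) | set(mu2)}
    sq_norm = lambda mu: sum(v * v for v in mu.values())
    w0 = 0.5 * (sq_norm(mu2) - sq_norm(mu1))
    return w, w0

def rocchio_decide(w, w0, doc_vec):
    """True -> assign to c1 (doc is closer to mu1), False -> c2."""
    return w0 + sum(wt * doc_vec.get(t, 0.0) for t, wt in w.items()) >= 0
```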
Sec. 14.4 Linear classifier: example
Class: "interest" (as in interest rate)
Example features of a linear classifier:

  w_i    t_i           w_i     t_i
  0.70   prime        −0.71   dlrs
  0.67   rate         −0.35   world
  0.63   interest     −0.33   sees
  0.60   rates        −0.25   year
  0.46   discount     −0.24   group
  0.43   bundesbank   −0.24   dlr

To classify, find the dot product of the feature vector and the weights.
Linear classifier: example
- $w_0 = 0$; class "interest" in Reuters-21578
- $d_1$: "rate discount dlrs world"
- $d_2$: "prime dlrs"
- $\vec{w}^T \vec{d}_1 = 0.07 \Rightarrow d_1$ is assigned to the "interest" class
- $\vec{w}^T \vec{d}_2 = -0.01 \Rightarrow d_2$ is not assigned to this class
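The same arithmetic as a quick check, using the weights from the table above (a hypothetical sketch; docs are treated as binary term vectors, as in the example).

```python
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
           "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}

def score(doc, w0=0.0):
    """Dot product of the doc's binary term vector with the weight vector, plus w0."""
    return w0 + sum(weights.get(t, 0.0) for t in doc.split())

print(score("rate discount dlrs world"))  # ≈ 0.07  -> assigned to "interest"
print(score("prime dlrs"))                # ≈ -0.01 -> not assigned
```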
Naïve Bayes as a linear classifier

Decide $c_1$ iff:
$$P(c_1) \prod_{i=1}^{M} P(t_i \mid c_1)^{tf_{i,d}} > P(c_2) \prod_{i=1}^{M} P(t_i \mid c_2)^{tf_{i,d}}$$

Taking logs:
$$\log P(c_1) + \sum_{i=1}^{M} tf_{i,d} \log P(t_i \mid c_1) > \log P(c_2) + \sum_{i=1}^{M} tf_{i,d} \log P(t_i \mid c_2)$$

So Naïve Bayes is a linear classifier with:
$$w_i = \log \frac{P(t_i \mid c_1)}{P(t_i \mid c_2)}, \qquad x_i = tf_{i,d}, \qquad w_0 = \log \frac{P(c_1)}{P(c_2)}$$
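A hedged sketch of reading these linear weights off an already-trained two-class Naïve Bayes model; the input structures (prior, cond) are assumptions, not an API from the slides.

```python
import math

def nb_as_linear(prior, cond, vocab):
    """prior: class -> P(c); cond: (term, class) -> P(t|c), for classes "c1" and "c2".
    Returns (w, w0) of the equivalent linear classifier."""
    w = {t: math.log(cond[(t, "c1")] / cond[(t, "c2")]) for t in vocab}
    w0 = math.log(prior["c1"] / prior["c2"])
    return w, w0

def nb_decide(w, w0, term_freqs):
    """term_freqs: term -> tf in the doc. True -> class c1."""
    return w0 + sum(tf * w.get(t, 0.0) for t, tf in term_freqs.items()) > 0
```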
Sec. 14.4 Linear programming / Perceptron
Find a, b, c such that:
- ax + by > c for red points
- ax + by < c for blue points
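A minimal perceptron-style sketch for finding such a separator in 2-D (assumes the points really are linearly separable; otherwise it stops after a fixed number of epochs).

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """X: (n, 2) array of points; y: +1 for red, -1 for blue.
    Returns (a, b, c) with a*x + b*y > c for the +1 class (if separable)."""
    w = np.zeros(2)  # (a, b)
    c = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi - c) <= 0:  # misclassified (or exactly on the boundary)
                w += lr * yi * xi       # nudge the separator toward this point
                c -= lr * yi
                errors += 1
        if errors == 0:                 # a full pass with no mistakes: done
            break
    return w[0], w[1], c
```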
Sec. 14.4 Which hyperplane?
In general, there are lots of possible solutions for a, b, c.
Sec. 14.4 Which hyperplane?
- Lots of possible solutions for a, b, c
- Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
- Which points should influence optimality?
  - All points: e.g., Rocchio
  - Only "difficult points" close to the decision boundary: e.g., Support Vector Machine (SVM)
Sec. 15.1 Support Vector Machine (SVM)
- SVMs maximize the margin around the separating hyperplane
  - A.k.a. large margin classifiers
- Solving SVMs is a quadratic programming problem
- Seen by many as the most successful current text classification method*
[Figure: support vectors define the maximized margin; a narrower margin is shown for contrast]
*but other discriminative methods often perform very similarly
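Not from the slides: a sketch of a large-margin linear text classifier using scikit-learn (assuming it is installed); the toy data is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data (illustrative only)
train_docs = ["interest rates rise", "discount rate cut by bundesbank",
              "world group sees good year", "dlrs fall against yen"]
train_labels = ["interest", "interest", "other", "other"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)  # tf-idf document vectors

clf = LinearSVC()                         # linear SVM: maximizes the margin
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["prime rate discount"])))
```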
Sec. 14.4 Linear classifiers
- Many common text classifiers are linear classifiers
- Classifiers more powerful than linear often don't perform better on text problems. Why?
- Despite the similarity of linear classifiers, there are noticeable performance differences between them
  - For separable problems, there is an infinite number of separating hyperplanes; different training methods pick different hyperplanes
  - There are also different strategies for non-separable problems
Sec. 14.4 Linear classifiers: binary and multi-class classification
- 2-class problems: deciding between two classes, e.g., government and non-government
- Multi-class:
  - How do we define (and find) the separating surface?
  - How do we decide which region a test doc is in?
Sec. 14.5 More than two classes
- One-of classification (multi-class classification)
  - Classes are mutually exclusive: each doc belongs to exactly one class
- Any-of classification
  - Classes are not mutually exclusive: a doc can belong to 0, 1, or >1 classes
  - Quite common for docs
  - For simplicity, decompose into K binary problems
Sec. 14.5 Set of binary classifiers: any-of
- Build a separator between each class and its complementary set (docs from all other classes)
- Given a test doc, evaluate it for membership in each class
- Apply the decision criterion of each classifier independently
- It works, although considering dependencies between categories may be more accurate
Sec. 14.5 Multi-class (one-of): set of binary classifiers
- Build a separator between each class and its complementary set (docs from all other classes)
- Given a test doc, evaluate it for membership in each class
- Assign the doc to the class with:
  - maximum score?
  - maximum confidence?
  - maximum probability?
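A sketch of both decompositions, assuming K trained class-vs-rest scorers (the scorer interface, a dict from class name to a scoring function, is an assumption).

```python
def any_of(scorers, doc_vec, threshold=0.0):
    """Any-of: apply each class-vs-rest classifier independently; return every accepted class."""
    return [c for c, score in scorers.items() if score(doc_vec) > threshold]

def one_of(scorers, doc_vec):
    """One-of: assign the single class whose class-vs-rest classifier scores highest."""
    return max(scorers, key=lambda c: scorers[c](doc_vec))
```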
Sec. 14.3 k Nearest Neighbor classification
- kNN = k Nearest Neighbor
- To classify a document d:
  - Define the k-neighborhood as the k nearest neighbors of d
  - Pick the majority class label in the k-neighborhood
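A minimal kNN sketch under cosine similarity, using the same assumed sparse-dict representation as the earlier examples.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(labeled_docs, doc_vec, k=3):
    """labeled_docs: list of (vector, label); vectors are dicts term -> weight.
    Returns the majority class among the k most cosine-similar training docs."""
    neighbors = sorted(labeled_docs, key=lambda dl: cosine(dl[0], doc_vec), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```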