VECTOR SPACE CLASSIFICATION
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press. Chapter 14.
Wei Wei, wwei@idi.ntnu.no
Lecture series
Vector Space Classification 1 TDT4215
RECALL: Naïve Bayes classifiers
• Classify based on the prior weight of the class and the conditional parameter for what each word says:

    c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]

• Training is done by counting and dividing:

    P(c_j) = N_{c_j} / N        P(x_k | c_j) = T_{c_j x_k} / Σ_{x_i ∈ V} T_{c_j x_i}

• Don't forget to smooth.
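The counting-and-dividing training above, with add-one (Laplace) smoothing, can be sketched as follows. The toy training set and the function names are illustrative, not from the slides; document vectors are just token lists here.

```python
from collections import Counter
from math import log

# Hypothetical toy training set: (tokens, class) pairs.
train = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["chinese", "macao"], "china"),
    (["tokyo", "japan", "chinese"], "japan"),
]

vocab = {t for tokens, _ in train for t in tokens}
classes = {c for _, c in train}

# Prior: P(c_j) = N_{c_j} / N
n = len(train)
prior = {c: sum(1 for _, cls in train if cls == c) / n for c in classes}

# Conditional parameters, smoothed:
# P(x_k | c_j) = (T_{c_j x_k} + 1) / (Σ_i T_{c_j x_i} + |V|)
counts = {c: Counter() for c in classes}
for tokens, c in train:
    counts[c].update(tokens)
cond = {}
for c in classes:
    total = sum(counts[c].values())
    cond[c] = {t: (counts[c][t] + 1) / (total + len(vocab)) for t in vocab}

def classify(tokens):
    # argmax_{c_j} [ log P(c_j) + Σ_i log P(x_i | c_j) ];
    # tokens outside the vocabulary are simply ignored.
    return max(
        classes,
        key=lambda c: log(prior[c]) + sum(log(cond[c][t]) for t in tokens if t in vocab),
    )
```

With this data, `classify(["chinese", "chinese", "chinese", "tokyo", "japan"])` returns "china": the three occurrences of "chinese" outweigh the two Japan-specific terms.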
Vector space text classification
• Today:
  – Vector space methods for text classification
    • Rocchio classification
    • k Nearest Neighbors
  – Linear classifiers and non-linear classifiers
  – Classification with more than two classes
Vector space text classification
Vector space methods for Text Classification
VECTOR SPACE CLASSIFICATION
• Vector Space Representation
  – Each document is a vector, one component for each term (= word).
  – Normally normalize vectors to unit length.
  – High-dimensional vector space:
    • Terms are axes
    • 10,000+ dimensions, or even 100,000+
    • Docs are vectors in this space
  – How can we do classification in this space?
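Normalizing to unit length, as mentioned above, can be sketched in a few lines; `unit` is a hypothetical helper name. After normalization, the dot product of two document vectors equals their cosine similarity.

```python
import math

def unit(vec):
    # Scale a term-weight vector to unit (Euclidean) length.
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm > 0 else vec

v = unit([3.0, 4.0])  # -> [0.6, 0.8], which has length 1
```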
VECTOR SPACE CLASSIFICATION
• As before, the training set is a set of documents, each labeled with its class (e.g., topic)
• In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
• Hypothesis 1: Documents in the same class form a contiguous region of space
• Hypothesis 2: Documents from different classes don't overlap
• We define surfaces to delineate classes in the space
Documents in a vector space
[Figure: documents plotted in a vector space, with classes Government, Science and Arts]
Test document: which class?
[Figure: the same space with an unlabeled test document added]
Test document = government
[Figure: the test document falls in the Government region]
Our main topic today is how to find good separators
Vector space text classification
Rocchio text classification
Rocchio text classification
• Rocchio Text Classification
  – Use standard tf-idf weighted vectors to represent text documents
  – For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category
    • Prototype = centroid of members of class
  – Assign test documents to the category with the closest prototype vector, based on cosine similarity
DEFINITION OF CENTROID

    μ(c) = (1/|D_c|) Σ_{d ∈ D_c} v(d)

• where D_c is the set of all documents that belong to class c, and v(d) is the vector space representation of d.
• Note that the centroid will in general not be a unit vector, even when the inputs are unit vectors.
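The centroid definition and the Rocchio rule around it can be sketched as below. The function names and the tiny 2-dimensional vectors are illustrative assumptions; real inputs would be tf-idf vectors over the full vocabulary.

```python
import math

def cosine(u, v):
    # Cosine similarity; insensitive to vector length, which is why the
    # centroid need not be re-normalized before classification.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vecs):
    # mu(c) = (1/|D_c|) * sum_{d in D_c} v(d)
    # Note: generally NOT unit length, even if every input is.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def train_rocchio(labeled):
    # labeled: {class: [document vectors]}; one prototype per class.
    return {c: centroid(vecs) for c, vecs in labeled.items()}

def rocchio_classify(prototypes, x):
    # Assign x to the class whose prototype is most cosine-similar.
    return max(prototypes, key=lambda c: cosine(prototypes[c], x))
```

For example, with `{"a": [[1.0, 0.0], [0.9, 0.1]], "b": [[0.0, 1.0], [0.1, 0.9]]}`, a test vector like `[0.8, 0.2]` lands with class "a".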
ROCCHIO TEXT CLASSIFICATION
[Figure: training points r1, r2, r3 (class r) and b1, b2 (class b); the test point t is assigned to class b]
ROCCHIO PROPERTIES
• Forms a simple generalization of the examples in each class (a prototype).
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
• Does not guarantee that classifications are consistent with the given training data.
ROCCHIO ANOMALY
• Prototype models have problems with polymorphic (disjunctive) categories.
[Figure: class r split into two clusters (r1, r2 and r3, r4) with class b points b1, b2 between them; the test point t, whose true class is r, falls nearer the b prototype]
Vector space text classification
k Nearest Neighbor Classification
K NEAREST NEIGHBOR CLASSIFICATION
• kNN = k Nearest Neighbor
• To classify document d into class c:
  – Define the k-neighborhood N as the k nearest neighbors of d
  – Count the number of documents i in N that belong to c
  – Estimate P(c|d) as i/k
  – Choose as class argmax_c P(c|d)  [= majority class]
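The steps above can be sketched as follows; the function names, the dot-product similarity, and the toy training data are illustrative assumptions (with unit-length vectors, the dot product equals cosine similarity).

```python
from collections import Counter

def dot(u, v):
    # Similarity stand-in: dot product (== cosine for unit vectors).
    return sum(a * b for a, b in zip(u, v))

def knn_classify(train, x, k, sim):
    # train: list of (vector, class) pairs; sim: similarity function.
    # 1. Take the k-neighborhood N: the k most similar training docs.
    neighbors = sorted(train, key=lambda vc: sim(vc[0], x), reverse=True)[:k]
    # 2./3. Count class members in N, i.e. estimate P(c|x) as i/k,
    # and return the majority class (argmax_c of that estimate).
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]
```

Usage: with `train = [([1.0, 0.0], "r"), ([0.9, 0.2], "r"), ([0.0, 1.0], "b"), ([0.2, 0.9], "b")]`, the call `knn_classify(train, [0.1, 1.0], 3, dot)` picks the majority class among the three nearest neighbors, here "b".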
K NEAREST NEIGHBOR CLASSIFICATION
• Unlike Rocchio, kNN classification determines the decision boundary locally.
• For 1NN (k=1), we assign each document to the class of its closest neighbor.
• For kNN, we assign each document to the majority class of its k closest neighbors. k here is a parameter.
• The rationale of kNN: the contiguity hypothesis.
  – We expect a test document d to have the same label as the training documents located nearby.
kNN: k=1
[Figure: kNN classification with k=1]
kNN: k = 1, 5, 10
[Figure: kNN classification with k = 1, 5 and 10]
kNN: weighted-sum voting
[Figure: kNN with neighbors voting by similarity weight rather than a flat vote]
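Weighted-sum voting replaces each neighbor's flat vote of 1 with its similarity score, so near neighbors count more than distant ones. A minimal sketch, with hypothetical names and a dot-product similarity as above:

```python
def dot(u, v):
    # Similarity stand-in: dot product (== cosine for unit vectors).
    return sum(a * b for a, b in zip(u, v))

def knn_weighted(train, x, k, sim):
    # Each of the k nearest neighbors votes with weight sim(neighbor, x);
    # the class with the largest summed weight wins.
    neighbors = sorted(train, key=lambda vc: sim(vc[0], x), reverse=True)[:k]
    scores = {}
    for vec, c in neighbors:
        scores[c] = scores.get(c, 0.0) + sim(vec, x)
    return max(scores, key=scores.get)
```

This can overturn a flat majority: with `train = [([1.0, 0.0], "r"), ([0.3, 0.1], "b"), ([0.2, 0.1], "b")]` and `x = [1.0, 0.0]`, k=3, the two "b" neighbors outnumber "r" but their summed similarity (0.5) loses to the single close "r" neighbor (1.0).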
K NEAREST NEIGHBOR CLASSIFICATION
[Figure: a test document in the space, with classes Government, Science and Arts]
K NEAREST NEIGHBOR CLASSIFICATION
• Learning is just storing the representations of the training examples in D.
• Testing instance x (under 1NN):
  – Compute the similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  – Case-based learning
  – Memory-based learning
  – Lazy learning
• Rationale of kNN: the contiguity hypothesis