Vector Space Classification. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, Chapter 14.


  1. Vector Space Classification. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, Chapter 14. Lecturer: Wei (wwei@idi.ntnu.no). Lecture series. Vector Space Classification 1 TDT4215

  2. Recall: Naïve Bayes classifiers
• Classify based on the prior weight of the class and a conditional parameter for what each word says:

    c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]

• Training is done by counting and dividing:

    P(c_j) = N_{c_j} / N
    P(x_k | c_j) = (T_{c_j, x_k} + α) / Σ_{x_i ∈ V} (T_{c_j, x_i} + α)

• Don't forget to smooth.
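As a concrete illustration of "counting and dividing", here is a minimal multinomial Naive Bayes sketch in Python with add-one smoothing (α = 1). The toy documents and the helper names (`train_nb`, `classify_nb`) are invented for this example, not part of the lecture.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train by counting and dividing.
    docs: list of (tokens, class) pairs.
    Returns log priors, smoothed log conditionals, and the vocabulary."""
    class_counts = Counter(c for _, c in docs)      # N_c: docs per class
    token_counts = defaultdict(Counter)             # T_{c,x}: token counts per class
    vocab = set()
    for tokens, c in docs:
        token_counts[c].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_cond = {}
    for c in class_counts:
        # add-one smoothing: (T_{c,x} + 1) / (sum_x' T_{c,x'} + |V|)
        total = sum(token_counts[c].values()) + len(vocab)
        log_cond[c] = {x: math.log((token_counts[c][x] + 1) / total) for x in vocab}
    return log_prior, log_cond, vocab

def classify_nb(tokens, log_prior, log_cond, vocab):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ]; unseen tokens are skipped."""
    def score(c):
        return log_prior[c] + sum(log_cond[c][x] for x in tokens if x in vocab)
    return max(log_prior, key=score)
```

Working in log space avoids floating-point underflow when many per-token probabilities are multiplied.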

  3. Vector space text classification
• Today:
– Vector space methods for text classification
• Rocchio classification
• k Nearest Neighbors
– Linear classifiers and non-linear classifiers
– Classification with more than two classes

  4. Vector space text classification
Vector space methods for text classification

  5. Vector Space Classification
• Vector space representation:
– Each document is a vector, one component for each term (= word).
– Normally normalize vectors to unit length.
– High-dimensional vector space:
• Terms are axes
• 10,000+ dimensions, or even 100,000+
• Docs are vectors in this space
– How can we do classification in this space?
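The unit-length normalization mentioned above can be sketched in a few lines; this is a minimal version (the helper name `unit_normalize` is ours, and zero vectors are left untouched by assumption):

```python
import numpy as np

def unit_normalize(v):
    """Scale a document vector to unit Euclidean length.
    Zero vectors (empty documents) are returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

After normalization, cosine similarity between two documents reduces to a plain dot product.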

  6. Vector Space Classification
• As before, the training set is a set of documents, each labeled with its class (e.g., topic).
• In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space.
• Hypothesis 1: documents in the same class form a contiguous region of space.
• Hypothesis 2: documents from different classes don't overlap.
• We define surfaces to delineate classes in the space.

  7. Documents in a vector space
[Figure: documents plotted in a vector space, grouped into Government, Science and Arts classes]

  8. Test document: which class?
[Figure: the Government, Science and Arts regions, plus an unlabeled test document]

  9. Test document = government
[Figure: the test document falls in the Government region]
Our main topic today is how to find good separators.

  10. Vector space text classification
Rocchio text classification

  11. Rocchio text classification
• Rocchio text classification:
– Use standard tf-idf weighted vectors to represent text documents.
– For the training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Prototype = centroid of members of the class
– Assign test documents to the category with the closest prototype vector, based on cosine similarity.
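The Rocchio procedure above can be sketched in a few lines of Python. This assumes documents are already represented as weighted vectors (the tf-idf weighting step is omitted for brevity), and the function names are invented for the sketch:

```python
import numpy as np

def train_rocchio(X, y):
    """One prototype per class: the centroid (mean) of that class's training vectors.
    X: (n_docs, n_terms) array of document vectors; y: list of class labels."""
    y = np.asarray(y)
    return {c: X[y == c].mean(axis=0) for c in set(y)}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify_rocchio(x, prototypes):
    """Assign x to the class whose prototype is most cosine-similar."""
    return max(prototypes, key=lambda c: cosine(x, prototypes[c]))
```

Because cosine similarity is length-invariant, it makes no difference whether the prototype is the sum or the mean of the class vectors.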

  12. Definition of centroid

    μ(c) = (1 / |D_c|) Σ_{d ∈ D_c} v(d)

• where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d.
• Note that the centroid will in general not be a unit vector, even when the inputs are unit vectors.
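A two-line check of the final remark, that the centroid of unit vectors is generally not a unit vector (the two example vectors are ours):

```python
import numpy as np

# Two unit-length document vectors in the plane...
d1 = np.array([1.0, 0.0])
d2 = np.array([0.0, 1.0])

# ...whose centroid mu(c) = (1/|D_c|) * sum v(d) is shorter than 1.
centroid = (d1 + d2) / 2
length = np.linalg.norm(centroid)   # sqrt(0.5), about 0.707
```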

  13. Rocchio text classification
[Figure: training points r1–r3 (class r) and b1–b2 (class b) with a test point t; Rocchio assigns t = b]

  14. Rocchio properties
• Forms a simple generalization of the examples in each class (a prototype).
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
• Does not guarantee that classifications are consistent with the given training data.

  15. Rocchio anomaly
• Prototype models have problems with polymorphic (disjunctive) categories.
[Figure: class r split into two clusters (r1, r2 and r3, r4) on either side of class b (b1, b2); the test point t belongs to class r (t = r), but the single r prototype lies between the clusters and misclassifies it]

  16. Vector space text classification
k Nearest Neighbor classification

  17. k Nearest Neighbor classification
• kNN = k Nearest Neighbor
• To classify document d into class c:
• Define the k-neighborhood N as the k nearest neighbors of d
• Count the number of documents i in N that belong to c
• Estimate P(c|d) as i/k
• Choose as class argmax_c P(c|d) [= majority class]
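The steps above translate almost directly into code. This sketch uses Euclidean distance to define "nearest" (any distance or similarity measure could be substituted), and the function name is ours:

```python
from collections import Counter
import numpy as np

def knn_classify(x, X, y, k=3):
    """Classify x by majority vote among its k nearest training points.
    Counting class members among the k neighbors estimates P(c|x) as i/k."""
    dists = np.linalg.norm(X - x, axis=1)   # distance from x to every training doc
    nearest = np.argsort(dists)[:k]         # indices of the k closest
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]       # argmax_c P(c|x)
```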

  18. k Nearest Neighbor classification
• Unlike Rocchio, kNN classification determines the decision boundary locally.
• For 1NN (k=1), we assign each document to the class of its closest neighbor.
• For kNN, we assign each document to the majority class of its k closest neighbors. k here is a parameter.
• The rationale of kNN: the contiguity hypothesis.
– We expect a test document d to have the same label as the training documents located nearby.

  19. kNN: k = 1

  20. kNN: k = 1, 5, 10

  21. kNN: weighted-sum voting
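The slide title refers to a refinement of majority voting: each neighbor's vote is weighted, and the class with the largest weight sum wins. The slide does not specify the weight; one common choice, sketched here under that assumption, is the neighbor's cosine similarity to the test document:

```python
from collections import defaultdict
import numpy as np

def knn_weighted(x, X, y, k=3):
    """Weighted-sum voting: each of the k most similar neighbors votes for
    its class with weight equal to its cosine similarity to x."""
    sims = (X @ x) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(-sims)[:k]         # k most similar training docs
    scores = defaultdict(float)
    for i in nearest:
        scores[y[i]] += sims[i]             # accumulate similarity per class
    return max(scores, key=scores.get)
```

Weighting makes close neighbors count more than distant ones, so a single far-away majority cannot outvote a tight cluster of the other class.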

  22. k Nearest Neighbor classification
[Figure: Government, Science and Arts regions with a test document classified by its nearest neighbors]

  23. k Nearest Neighbor classification
• Learning is just storing the representations of the training examples in D.
• Testing instance x (under 1NN):
– Compute the similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
– Case-based learning
– Memory-based learning
– Lazy learning
• Rationale of kNN: the contiguity hypothesis
