text categorization

Text Categorization P2P Security Datamining Semantic Web Case - PDF document



  1. Course Overview
     CSE 454
     • Info Extraction
     • Ecommerce
     • Web Services
     • Text Categorization
     • P2P
     • Security
     • Datamining
     • Semantic Web
     • Case Studies: Nutch, Google, Altavista
     • Information Retrieval
     • Crawler Architecture
     • Precision vs Recall
     • Synchronization & Monitors
     • Inverted Indices
     • Systems Foundation: Networking & Clusters

     Why is Learning Possible?
     Experience alone never justifies any conclusion about any unseen instance.
     Learning occurs when PREJUDICE meets DATA!
     Learning a “FOO”

     Bias
     • The nice word for prejudice is “bias”.
     • What kind of hypotheses will you consider?
       – What is the allowable range of functions you use when approximating?
     • What kind of hypotheses do you prefer?

     A Learning Problem

     Some Typical Bias: The World is Simple
     • Occam’s razor: “It is needless to do more when less will suffice” (William of Occam, died 1349 of the Black plague)
     • MDL: Minimum description length
     • Concepts can be approximated by
       – ... conjunctions of predicates
       – ... linear functions
       – ... short decision trees

  2. Hypothesis Spaces

     Terminology

     Two Strategies for ML
     • Restriction bias: use prior knowledge to specify a restricted hypothesis space.
       – Naïve Bayes
     • Preference bias: use a broad hypothesis space, but impose an ordering on the hypotheses.
       – Decision Trees

     Key Issues for ML

     Framework for Learning Algos

  3. Categorization (review)
     • Given:
       – A description of an instance, x ∈ X, where X is the instance language or instance space.
       – A fixed set of categories: C = {c_1, c_2, …, c_n}
     • Determine:
       – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

     Learning for Categorization
     • A training example is an instance x ∈ X, paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
     • Given a set of training examples, D.
     • Find a hypothesized categorization function, h(x), such that:
         ∀ <x, c(x)> ∈ D : h(x) = c(x)    (Consistency)

     Sample Category Learning Problem
     • Instance language: <size, color, shape>
       – size ∈ {small, medium, large}
       – color ∈ {red, blue, green}
       – shape ∈ {square, circle, triangle}
     • C = {positive, negative}
     • D:
         Example  Size   Color  Shape     Category
         1        small  red    circle    positive
         2        large  red    circle    positive
         3        small  red    triangle  negative
         4        large  blue   circle    negative

     More to the Point
     • C(X) = true if X is a Webcam page
     • Features
       – Words on page
       – ….
     • Hypothesis Language

     Text Categorization
     • Assigning documents to a fixed set of categories.
     • Applications:
       – Web pages
         • Categories in search (see microsoft.com)
         • Yahoo-like classification
       – Newsgroup Messages / News articles
         • Recommending
         • Personalized newspaper
       – Email messages
         • Routing
         • Prioritizing
         • Folderizing
         • Spam filtering

     Generalization
     • Hypotheses must generalize to correctly classify instances not in the training data.
       – Simply memorizing training examples gives a consistent hypothesis that does not generalize.
     • Occam’s razor:
       – Finding a simple hypothesis helps ensure generalization.
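To make consistency and generalization concrete, here is a minimal Python sketch over the four training examples of the sample category learning problem. The slides do not state a hypothesis, so the conjunction "red and circle" (`h_red_circle`) is an assumption chosen for illustration, as are all function names. Both hypotheses below are consistent with D, but only the conjunctive one says anything useful about an unseen instance.

```python
# Training set D from the sample problem: <size, color, shape> -> category.
D = [
    (("small", "red", "circle"), "positive"),
    (("large", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("large", "blue", "circle"), "negative"),
]


def h_red_circle(x):
    """Hypothetical conjunctive hypothesis: positive iff red and circle."""
    _size, color, shape = x
    return "positive" if color == "red" and shape == "circle" else "negative"


def h_memorize(x):
    """Rote memorization: look x up in D, default to negative when unseen."""
    return dict(D).get(x, "negative")


def is_consistent(h, examples):
    """True when h(x) == c(x) for every <x, c(x)> in examples."""
    return all(h(x) == c for x, c in examples)


# An instance that does not appear in D.
unseen = ("medium", "red", "circle")
```

Both `is_consistent(h_red_circle, D)` and `is_consistent(h_memorize, D)` hold, yet the two hypotheses disagree on `unseen`, which is why consistency alone cannot select a hypothesis.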

  4. General Learning Issues
     • Many hypotheses are often consistent with the training data.
     • Bias
       – Any criterion other than consistency with the training data that is used to select a hypothesis.
     • Classification accuracy
       – % of instances classified correctly
       – Measured on independent test data.
     • Training time
       – Efficiency of training algorithm
     • Testing time
       – Efficiency of subsequent classification

     Learning for Text Categorization
     • Manual development of text categorization functions is difficult.
     • Learning Algorithms:
       – Bayesian (naïve)
       – Neural network
       – Relevance Feedback (Rocchio)
       – Rule based (C4.5, Ripper, Slipper)
       – Nearest Neighbor (case based)
       – Support Vector Machines (SVM)

     Using Relevance Feedback (Rocchio)
     • Adapt relevance feedback for text categorization.
     • Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
     • For each category, compute a prototype vector by summing the vectors of the training documents in the category.
     • Assign test documents to the category with the closest prototype vector based on cosine similarity.

     Rocchio Text Categorization Algorithm (Training)
       Assume the set of categories is {c_1, c_2, …, c_n}
       For i from 1 to n let p_i = <0, 0, …, 0>    (init. prototype vectors)
       For each training example <x, c(x)> ∈ D
         Let d = frequency normalized TF/IDF term vector for doc x
         Let i = j : (c_j = c(x))
         (sum all the document vectors in c_i to get p_i)
         Let p_i = p_i + d

     Rocchio Text Categorization Algorithm (Test)
       Given test document x
       Let d be the TF/IDF weighted term vector for x
       Let m = -2    (init. maximum cosSim)
       For i from 1 to n:    (compute similarity to prototype vector)
         Let s = cosSim(d, p_i)
         if s > m
           let m = s
           let r = c_i    (update most similar class prototype)
       Return class r

     Illustration of Rocchio Text Categorization
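The Rocchio training and test procedures above can be sketched in Python as follows. This is an illustrative sketch, not the course's implementation: documents are assumed to be pre-tokenized lists of words, the weighting is assumed to be (tf / max tf) · log(N / df), and every function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def weigh(doc, idf):
    """TF-IDF vector for one tokenized doc; tf is normalized by the max tf."""
    tf = Counter(t for t in doc if t in idf)  # terms unseen in training are dropped
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {t: (f / max_tf) * idf[t] for t, f in tf.items()}


def tfidf_vectors(docs):
    """Return (vectors, idf) for a list of tokenized training documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / count) for t, count in df.items()}
    return [weigh(doc, idf) for doc in docs], idf


def cos_sim(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rocchio_train(vectors, labels):
    """Sum the document vectors of each category into its prototype p_i."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for d, c in zip(vectors, labels):
        for t, w in d.items():
            prototypes[c][t] += w
    return {c: dict(p) for c, p in prototypes.items()}


def rocchio_classify(d, prototypes):
    """Return the category whose prototype has the highest cosSim with d."""
    best, m = None, -2.0  # m starts below the minimum possible cosine
    for c, p in prototypes.items():
        s = cos_sim(d, p)
        if s > m:
            m, best = s, c
    return best


# Toy corpus (made up for illustration).
docs = [["webcam", "live", "view"], ["webcam", "camera", "live"],
        ["recipe", "cook", "pasta"], ["recipe", "bake", "bread"]]
labels = ["webcam", "webcam", "food", "food"]
vectors, idf = tfidf_vectors(docs)
prototypes = rocchio_train(vectors, labels)
```

With this toy corpus, `rocchio_classify(weigh(["live", "webcam", "stream"], idf), prototypes)` picks the webcam prototype, since the test document shares no terms with the food documents.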

  5. Rocchio Properties
     • Does not guarantee a consistent hypothesis.
     • Forms a simple generalization of the examples in each class (a prototype).
     • Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.
     • Classification is based on similarity to class prototypes.

     Rocchio Time Complexity
     • Note: The time to add two sparse vectors is proportional to the minimum number of non-zero entries in the two vectors.
     • Training Time: O(|D|(L_d + |V_d|)) = O(|D| L_d), where L_d is the average length of a document in D and |V_d| is the average vocabulary size for a document in D.
     • Test Time: O(L_t + |C||V_t|), where L_t is the average length of a test document and |V_t| is the average vocabulary size for a test document.
       – Assumes lengths of p_i vectors are computed and stored during training, allowing cosSim(d, p_i) to be computed in time proportional to the number of non-zero entries in d (i.e. |V_t|).

     Nearest-Neighbor Learning
     • Learning is just storing the representations of the training examples in D.
     • Testing instance x:
       – Compute similarity between x and all examples in D.
       – Assign x the category of the most similar example in D.
     • Does not explicitly compute a generalization or category prototypes.
     • Also called:
       – Case-based
       – Memory-based
       – Lazy learning

     K Nearest-Neighbor Algorithm
     • Using only the closest example to determine categorization is subject to errors due to:
       – A single atypical example.
       – Noise (i.e. error) in the category label of a single training example.
     • More robust alternative is to find the k most-similar examples and return the majority category of these k examples.
     • Value of k is typically odd to avoid ties; 3 and 5 are most common.

     3 Nearest Neighbor Illustration (Euclidean Distance)

     Similarity Metrics
     • Nearest neighbor method depends on a similarity (or distance) metric.
     • Simplest for continuous m-dimensional instance space is Euclidean distance.
     • Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ).
     • For text, cosine similarity of TF-IDF weighted vectors is typically most effective.
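A minimal sketch of the k nearest-neighbor rule with the two simplest distance metrics named above. The function names and the toy points are assumptions made for illustration; the point near the origin labeled "b" plays the role of a single noisy training label.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Hamming distance: number of feature values that differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(x, examples, k=3, distance=euclidean):
    """Return the majority category among the k examples closest to x."""
    neighbors = sorted(examples, key=lambda ex: distance(x, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data (made up): three "a" points near the origin, one mislabeled
# "b" point among them, and two genuine "b" points far away.
examples = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((0.2, 0.2), "b"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"),
]
query = (0.3, 0.3)
```

Here `knn_classify(query, examples, k=1)` follows the mislabeled neighbor, while `k=3` outvotes it, which is the robustness argument for k > 1.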

  6. Illustration of 3 Nearest Neighbor for Text

     K Nearest Neighbor for Text
       Training:
         For each training example <x, c(x)> ∈ D
           Compute the corresponding TF-IDF vector, d_x, for document x
       Test instance y:
         Compute TF-IDF vector d for document y
         For each <x, c(x)> ∈ D
           Let s_x = cosSim(d, d_x)
         Sort examples, x, in D by decreasing value of s_x
         Let N be the first k examples in D.    (get most similar neighbors)
         Return the majority class of examples in N

     Rocchio Anomaly
     • Prototype models have problems with polymorphic (disjunctive) categories.
     • Cause: strong bias of Rocchio learner

     3 Nearest Neighbor Comparison
     • Nearest Neighbor tends to handle polymorphic categories better.

     Nearest Neighbor Time Complexity
     • Training Time: O(|D| L_d) to compose TF-IDF vectors.
     • Testing Time: O(L_t + |D||V_t|) to compare to all training vectors.
       – Assumes lengths of d_x vectors are computed and stored during training, allowing cosSim(d, d_x) to be computed in time proportional to the number of non-zero entries in d (i.e. |V_t|).
     • Testing time can be high for large training sets.

     Nearest Neighbor with Inverted Index
     • Determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
     • Use standard VSR inverted index methods to find the k nearest neighbors.
     • Testing Time: O(B|V_t|), where B is the average number of training documents in which a test-document word appears.
     • Therefore, overall classification is O(L_t + B|V_t|).
       – Typically B << |D|
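The inverted-index variant of nearest neighbor can be sketched as below. This is a sketch under stated assumptions: the training and query documents are assumed to arrive already as sparse TF-IDF vectors (dicts from term to weight), and the names `build_index` and `knn_text` and the toy weights are hypothetical. Only training documents that share at least one term with the query are ever scored, which is where the O(B|V_t|) test time comes from.

```python
import math
from collections import Counter, defaultdict


def build_index(train_vectors):
    """Build term -> [(doc_id, weight)] postings and store each vector length."""
    index = defaultdict(list)
    norms = []
    for doc_id, vec in enumerate(train_vectors):
        norms.append(math.sqrt(sum(w * w for w in vec.values())))
        for term, w in vec.items():
            if w:  # zero weights contribute nothing to a dot product
                index[term].append((doc_id, w))
    return index, norms


def knn_text(query, index, norms, labels, k=3):
    """Majority class of the k training docs with highest cosSim to query."""
    scores = defaultdict(float)  # doc_id -> accumulated dot product
    for term, qw in query.items():
        if not qw:
            continue
        for doc_id, w in index.get(term, ()):
            scores[doc_id] += qw * w
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    ranked = sorted(scores, key=lambda i: scores[i] / (q_norm * norms[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0] if votes else None


# Toy TF-IDF vectors (weights made up for illustration).
train_vectors = [
    {"webcam": 1.0, "live": 0.5},
    {"webcam": 0.8, "camera": 1.0},
    {"recipe": 1.0, "pasta": 0.7},
    {"recipe": 0.9, "bread": 1.0},
    {"live": 1.0, "concert": 1.0},
]
labels = ["webcam", "webcam", "food", "food", "music"]
index, norms = build_index(train_vectors)
```

A query such as `{"webcam": 1.0, "live": 1.0}` touches only the postings for its two terms, so the two food documents are never visited. Returning `None` when no training document shares a term with the query is a design choice of this sketch, not something the slides specify.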
