CS 391L: Machine Learning
Text Categorization
Raymond J. Mooney
University of Texas at Austin

Text Categorization Applications
• Web pages
  – Recommending
  – Yahoo-like classification
• Newsgroup/Blog Messages
  – Recommending
  – Spam filtering
  – Sentiment analysis for marketing
• News articles
  – Personalized newspaper
• Email messages
  – Routing
  – Prioritizing
  – Folderizing
  – Spam filtering
  – Advertising on Gmail

Text Categorization Methods
• Representations of text are very high dimensional (one feature for each word).
• Vectors are sparse since most words are rare.
  – Zipf's law and heavy-tailed distributions
• High-bias algorithms that prevent overfitting in high-dimensional space are best.
  – SVMs maximize margin to avoid over-fitting in high dimensions.
• For most text categorization tasks, there are many irrelevant and many relevant features.
• Methods that sum evidence from many or all features (e.g. naïve Bayes, KNN, neural-net, SVM) tend to work better than ones that try to isolate just a few relevant features (decision-tree or rule induction).

Naïve Bayes for Text
• Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w_1, w_2, …, w_m} based on the probabilities P(w_j | c_i).
• Smooth probability estimates with Laplace m-estimates assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
  – Equivalent to a virtual sample of seeing each word in each category exactly once.

Naïve Bayes Generative Model for Text
[Figure: a bag of words for each category ("spam" and "legit"), containing category-typical words such as "Viagra", "lottery", "Nigeria", "deal", "win", "hot" for spam and "homework", "test", "score", "exam", "March", "May", "Friday", "PM", "computer", "science" for legit.]

Naïve Bayes Classification
[Figure: the same category bags used in reverse to classify the message "Win lottery $ !": which category's word distribution is more likely to have generated it?]
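To make the generative story above concrete, the following is a minimal Python sketch that "writes" a document by sampling words with replacement from a category-conditional distribution P(w | c). The toy vocabulary and probability values are invented for illustration only; they are not taken from the slides.

```python
import random

# Hypothetical toy vocabulary and category-conditional word probabilities P(w | c).
# The numbers are made up for illustration; each row sums to 1.
word_probs = {
    "spam":  {"viagra": 0.3, "lottery": 0.3, "deal": 0.2, "exam": 0.1, "homework": 0.1},
    "legit": {"viagra": 0.05, "lottery": 0.05, "deal": 0.1, "exam": 0.4, "homework": 0.4},
}

def generate_document(category, length, rng=random):
    """Generate a bag of words by drawing `length` words with replacement
    from the category's word distribution P(w | c)."""
    words = list(word_probs[category].keys())
    probs = list(word_probs[category].values())
    return rng.choices(words, weights=probs, k=length)

print(generate_document("spam", 5))   # e.g. ['lottery', 'viagra', 'deal', 'viagra', 'lottery']
print(generate_document("legit", 5))  # e.g. ['exam', 'homework', 'exam', 'deal', 'homework']
```

Classification (next slides) simply runs this model in reverse: it asks which category's distribution is more likely to have produced the observed bag of words.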

Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category c_i ∈ C
    Let D_i be the subset of documents in D in category c_i
    P(c_i) = |D_i| / |D|
    Let T_i be the concatenation of all the documents in D_i
    Let n_i be the total number of word occurrences in T_i
    For each word w_j ∈ V
        Let n_ij be the number of occurrences of w_j in T_i
        Let P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)

Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
    argmax_{c_i ∈ C}  P(c_i) ∏_{j=1}^{n} P(a_j | c_i)
where a_j is the word occurring in the jth position in X

Underflow Prevention
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• The class with the highest final un-normalized log probability score is still the most probable.

Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
  – Output probabilities are generally very close to 0 or 1.

Textual Similarity Metrics
• Measuring the similarity of two texts is a well-studied problem.
• Standard metrics are based on a "bag of words" model of a document that ignores word order and syntactic structure.
• May involve removing common "stop words" and stemming to reduce words to their root form.
• The vector-space model from Information Retrieval (IR) is the standard approach.
• Other metrics (e.g. edit distance) are also used.

The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These "orthogonal" terms form a vector space: Dimension = t = |vocabulary|.
• Each term i in a document or query j is given a real-valued weight w_ij.
• Both documents and queries are expressed as t-dimensional vectors:
    d_j = (w_1j, w_2j, …, w_tj)
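The following is a minimal Python sketch of the Naïve Bayes training and testing procedures above, combining the Laplace-smoothed estimates P(w_j | c_i) = (n_ij + 1) / (n_i + |V|) with the log-sum trick from the Underflow Prevention slide. The tiny corpus and the choice to skip test words that are outside V are illustrative assumptions, not part of the original algorithm statement.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (list_of_words, category) pairs.
    Returns priors P(c_i), Laplace-smoothed likelihoods P(w_j | c_i), and vocabulary V."""
    vocab = {w for words, _ in docs for w in words}
    doc_count = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)              # n_ij: occurrences of w_j in T_i
    for words, c in docs:
        word_counts[c].update(words)
    priors = {c: doc_count[c] / len(docs) for c in doc_count}
    likelihoods = {}
    for c in doc_count:
        n_i = sum(word_counts[c].values())          # total word occurrences in T_i
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (n_i + len(vocab))
                          for w in vocab}
    return priors, likelihoods, vocab

def classify(words, priors, likelihoods, vocab):
    """Return argmax_c of log P(c) + sum_j log P(a_j | c); summing logs avoids underflow."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in words:
            if w in vocab:                          # words never seen in training are ignored
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus for illustration only.
train = [("win lottery deal".split(), "spam"),
         ("viagra deal win".split(), "spam"),
         ("homework exam score".split(), "legit"),
         ("test exam march".split(), "legit")]
priors, likelihoods, vocab = train_naive_bayes(train)
print(classify("win lottery".split(), priors, likelihoods, vocab))   # -> 'spam'
```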

Graphic Representation
Example:
    D_1 = 2T_1 + 3T_2 + 5T_3
    D_2 = 3T_1 + 7T_2 + T_3
    Q   = 0T_1 + 0T_2 + 2T_3
[Figure: D_1, D_2, and Q drawn as vectors in the 3-dimensional term space with axes T_1, T_2, T_3.]
• Is D_1 or D_2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?

Document Collection
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document.

           T_1    T_2    …    T_t
    D_1    w_11   w_21   …    w_t1
    D_2    w_12   w_22   …    w_t2
     :      :      :           :
    D_n    w_1n   w_2n   …    w_tn

Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
    f_ij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
    tf_ij = f_ij / max_i{f_ij}

Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are less indicative of overall topic.
    df_i  = document frequency of term i
          = number of documents containing term i
    idf_i = inverse document frequency of term i
          = log2(N / df_i)
    (N: total number of documents)
• An indication of a term's discrimination power.
• Log used to dampen the effect relative to tf.

TF-IDF Weighting
• A typical combined term importance indicator is tf-idf weighting:
    w_ij = tf_ij · idf_i = tf_ij · log2(N / df_i)
• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.

Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• Inner product normalized by the vector lengths:
    CosSim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_{i=1}^{t} (w_ij · w_iq) / sqrt(Σ_{i=1}^{t} w_ij² · Σ_{i=1}^{t} w_iq²)
[Figure: D_1, D_2, and Q as vectors in term space, with θ_1 the angle between D_1 and Q and θ_2 the angle between D_2 and Q.]
Example:
    D_1 = 2T_1 + 3T_2 + 5T_3    CosSim(D_1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
    D_2 = 3T_1 + 7T_2 + 1T_3    CosSim(D_2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
    Q   = 0T_1 + 0T_2 + 2T_3
    D_1 is 6 times better than D_2 using cosine similarity but only 5 times better using inner product.
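A short Python sketch of the formulas above, assuming vectors are given as plain lists of weights. The cosine function reproduces the worked example from the Cosine Similarity Measure slide; the tf_idf helper simply codes w_ij = (f_ij / max_f) · log2(N / df_i) and is a hypothetical utility, not taken from the slides.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between vectors d and q:
    inner product normalized by the vector lengths."""
    dot = sum(di * qi for di, qi in zip(d, q))
    return dot / math.sqrt(sum(di * di for di in d) * sum(qi * qi for qi in q))

def tf_idf(freqs, doc_freqs, n_docs):
    """freqs: raw counts f_ij for one document, aligned with the document
    frequencies df_i. Returns w_ij = (f_ij / max_f) * log2(N / df_i)."""
    max_f = max(freqs)
    return [(f / max_f) * math.log2(n_docs / df) for f, df in zip(freqs, doc_freqs)]

# Worked example from the slides: weights over terms T1, T2, T3.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]
print(round(cosine_sim(D1, Q), 2))   # 0.81
print(round(cosine_sim(D2, Q), 2))   # 0.13
```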

Relevance Feedback in IR
• After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
• Use this feedback information to reformulate the query.
• Produce new results based on the reformulated query.
• Allows a more interactive, multi-pass process.

Relevance Feedback Architecture
[Figure: the IR system runs a query string against the document corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3); the user's relevance feedback (Doc1 ⇓, Doc2 ⇑, Doc3 ⇓) feeds query reformulation, and the revised query produces revised rankings and re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5).]

Using Relevance Feedback (Rocchio) for Text Categorization
• Relevance feedback methods can be adapted for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Assign test documents to the category with the closest prototype vector based on cosine similarity.

Illustration of Rocchio Text Categorization
[Figure only; no recoverable textual content.]

Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c_1, c_2, … c_n}
For i from 1 to n, let p_i = <0, 0, …, 0>   (init. prototype vectors)
For each training example <x, c(x)> ∈ D
    Let d be the frequency-normalized TF/IDF term vector for doc x
    Let i = j : (c_j = c(x))
    Let p_i = p_i + d   (sum all the document vectors in c_i to get p_i)

Rocchio Text Categorization Algorithm (Test)
Given test document x
Let d be the TF/IDF weighted term vector for x
Let m = –2   (init. maximum cosSim)
For i from 1 to n:
    Let s = cosSim(d, p_i)   (compute similarity to prototype vector)
    if s > m
        let m = s
        let r = c_i   (update most similar class prototype)
Return class r
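A minimal Python sketch of the Rocchio training and test algorithms above, with sparse vectors represented as {term: weight} dictionaries. The toy training vectors are invented for illustration; in practice they would be the frequency-normalized TF/IDF vectors described earlier.

```python
import math
from collections import defaultdict

def add_into(p, d):
    """p := p + d for sparse vectors stored as {term: weight} dicts."""
    for term, w in d.items():
        p[term] += w

def cos_sim(d, p):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * p.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    return dot / (norm_d * norm_p) if norm_d and norm_p else 0.0

def rocchio_train(examples):
    """examples: list of (term_vector, category).
    Returns one prototype vector p_i per category, the sum of its document vectors."""
    prototypes = defaultdict(lambda: defaultdict(float))   # p_i initialized to zero vectors
    for d, c in examples:
        add_into(prototypes[c], d)
    return prototypes

def rocchio_classify(d, prototypes):
    """Assign d to the category whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda c: cos_sim(d, prototypes[c]))

# Hypothetical toy vectors (would be frequency-normalized TF/IDF in practice).
train = [({"lottery": 1.0, "win": 0.5}, "spam"),
         ({"viagra": 1.0, "deal": 0.5}, "spam"),
         ({"exam": 1.0, "homework": 0.7}, "legit"),
         ({"score": 1.0, "test": 0.6}, "legit")]
prototypes = rocchio_train(train)
print(rocchio_classify({"win": 1.0, "lottery": 0.8}, prototypes))   # -> 'spam'
```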
