
Introduction to Information Retrieval: Text Classification & Naive Bayes (IIR 13)



  1. IIR 13: Text Classification & Naive Bayes. Hinrich Schütze, Institute for Natural Language Processing, Universität Stuttgart, 2008.06.10. http://informationretrieval.org

     Outline: (1) Text classification, (2) Naive Bayes, (3) Evaluation of TC, (4) NB independence assumptions.

     Relevance feedback. In relevance feedback, the user marks a number of documents as relevant or nonrelevant, and we then use this information to return better search results. This is a form of text classification with two "classes": relevant and nonrelevant. For each document, we decide whether it is relevant or nonrelevant. The problem space that relevance feedback belongs to is called classification. The notion of classification is very general and has many applications within and beyond information retrieval.

  2. From information retrieval to text classification: standing queries, e.g. Google Alerts.

     Another TC task: spam filtering. Example message:

       From: "" <takworlld@hotmail.com>
       Subject: real estate is the only way... gem oalvgkay
       Anyone can buy real estate with no money down. Stop paying rent TODAY!
       There is no need to spend hundreds or even thousands for similar courses.
       I am 22 years old and I have already purchased 6 properties using the
       methods outlined in this truly INCREDIBLE ebook. Change your life NOW!
       Click below to order: http://www.wholesaledaily.com/sales/nmd.htm

     How would you write a program that would automatically detect and delete this type of message?

     Formal definition of TC, training. Given: a document space X (documents are represented in this space, typically some type of high-dimensional space); a fixed set of classes C = {c_1, c_2, ..., c_J} (the classes are human-defined for the needs of an application, e.g. spam vs. non-spam); and a training set D of labeled documents, where each labeled document ⟨d, c⟩ ∈ X × C. Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes: γ : X → C.

     Formal definition of TC, application/testing. Given a description d ∈ X of a document, determine γ(d) ∈ C, that is, the class that is most appropriate for d. (A small code sketch of this setup follows below.)
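A minimal sketch of this setup, assuming documents are plain strings; the names `train` and `gamma` and the toy spam/nonspam data are hypothetical, and the placeholder decision rule stands in for a real learning method such as Naive Bayes.

```python
from typing import Callable, List, Tuple

Doc = str                             # a description d in the document space X (here: raw text)
Cls = str                             # a class label c in C
TrainingSet = List[Tuple[Doc, Cls]]   # D: labeled documents <d, c> in X x C

def train(D: TrainingSet, classes: List[Cls]) -> Callable[[Doc], Cls]:
    """Learning method: from labeled documents, learn a classifier gamma: X -> C."""
    def gamma(d: Doc) -> Cls:
        # Placeholder decision rule; a real learner (Naive Bayes, Rocchio, kNN)
        # would estimate parameters from D and use them here.
        return classes[0]
    return gamma

# Application/testing: for a new description d, determine gamma(d) in C.
gamma = train([("buy real estate with no money down", "spam"),
               ("minutes of the budget meeting", "nonspam")],
              ["spam", "nonspam"])
print(gamma("change your life NOW"))  # -> "spam" (only because of the placeholder rule)
```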

  3. Topic classification. Example: γ(d′) = China. The classes are regions, industries, and subject areas: UK, China, poultry, coffee, elections, sports. For each class, the slide shows a small training set of documents containing terms such as:

       UK: congestion, London, Parliament, Big Ben, Windsor, the Queen
       China: Olympics, Beijing, Great Wall, tourism, Mao, communist
       poultry: feed, chicken, pate, ducks, bird flu, turkey
       coffee: roasting, beans, arabica, robusta, Kenya, harvest
       elections: recount, votes, seat, run-off, TV ads, campaign
       sports: diamond, baseball, forward, soccer, team, captain

     The test set contains a document d′ with the terms "first", "private", "Chinese", "airline", which the classifier assigns to China. (A sketch of this data in the ⟨d, c⟩ format follows below.)

     Many search engine functionalities are based on classification. Examples?

     Applications of text classification in IR: language identification (classes: English vs. French etc.); the automatic detection of spam pages (spam vs. nonspam, example: googel.org); the automatic detection of sexually explicit content (sexually explicit vs. not); sentiment detection: is a movie or product review positive or negative (positive vs. negative); topic-specific or vertical search, restricting search to a "vertical" like "related to health" (relevant to vertical vs. not); machine-learned ranking functions in ad hoc retrieval (relevant vs. nonrelevant); Semantic Web: automatically adding semantic tags to non-tagged text (e.g., for each paragraph: relevant to a vertical like health or not).

     Classification methods: 1. Manual. Manual classification was used by Yahoo in the beginning of the web; also by ODP and PubMed. It is very accurate if the job is done by experts, and consistent when the problem size and the team are small, but it is difficult and expensive to scale. → We need automatic methods for classification.
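To connect this example to the formal definition, the diagram's content can be written as labeled documents ⟨d, c⟩. The grouping of terms into two-term training documents below is an assumption made for illustration, and the variable names are hypothetical.

```python
# Topic classification data in the <d, c> format of the formal definition (abridged).
D = [
    ("congestion London",  "UK"),        ("Parliament Big Ben", "UK"),
    ("Olympics Beijing",   "China"),     ("Great Wall tourism", "China"),
    ("feed chicken",       "poultry"),   ("roasting beans",     "coffee"),
    ("recount votes",      "elections"), ("diamond baseball",   "sports"),
]
classes = ["UK", "China", "poultry", "coffee", "elections", "sports"]

# Test document from the diagram; a learner trained on D should assign it to "China".
d_prime = "first private Chinese airline"
```

Fed to a learning method such as the `train` sketch above, or to the Naive Bayes classifier introduced later, this training set is what the classifier's parameters would be estimated from.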

  4. Classification methods: 2. Rule-based. A Verity topic is a complex classification rule. Our Google Alerts example was rule-based classification; often the rules are Boolean combinations of terms, as in Google Alerts. There are "IDE"-type development environments for writing very complex rules efficiently (e.g., Verity). Accuracy is very high if a rule has been carefully refined over time by a subject expert, but building and maintaining rule-based classification systems is expensive. (A small sketch of such a Boolean rule follows below.)

     Classification methods: 3. Statistical/probabilistic. As per our definition of the classification problem, this treats text classification as a learning problem: supervised learning of the classification function γ and its application to classifying new documents. We will look at a couple of methods for doing this: Naive Bayes, Rocchio, kNN. No free lunch: this requires hand-classified training data, but the manual classification can be done by non-experts.
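A minimal sketch of a rule-based classifier built from Boolean combinations of terms, in the spirit of a standing query; the rule syntax and the example "health" rule are hypothetical, not Verity's actual rule language.

```python
def matches(doc: str, all_of=(), any_of=(), none_of=()) -> bool:
    """Boolean rule: every term in all_of, at least one term in any_of, no term in none_of."""
    words = set(doc.lower().split())
    return (all(t in words for t in all_of)
            and (not any_of or any(t in words for t in any_of))
            and not any(t in words for t in none_of))

# A hand-written "vertical" rule, of the kind a subject expert would refine over time.
def relevant_to_health(doc: str) -> bool:
    return matches(doc, all_of=("health",),
                   any_of=("clinic", "treatment", "symptoms"),
                   none_of=("lottery",))

print(relevant_to_health("new treatment improves heart health"))  # True
print(relevant_to_health("win the health lottery today"))         # False
```

Refining such rules by hand is exactly what makes the approach accurate but expensive to build and maintain.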

  5. The Naive Bayes classifier. The Naive Bayes classifier is a probabilistic classifier. We compute the probability of a document d being in a class c as follows:

       P(c|d) ∝ P(c) · ∏_{1 ≤ k ≤ n_d} P(t_k|c)

     P(t_k|c) is the conditional probability of term t_k occurring in a document of class c; we interpret it as a measure of how much evidence t_k contributes that c is the correct class. P(c) is the prior probability of c: if a document's terms do not provide clear evidence for one class vs. another, we choose the class with the higher prior probability.

     Maximum a posteriori class. Our goal is to find the "best" class. The best class in Naive Bayes classification is the most likely or maximum a posteriori (MAP) class c_map:

       c_map = arg max_{c ∈ C} P̂(c|d) = arg max_{c ∈ C} P̂(c) · ∏_{1 ≤ k ≤ n_d} P̂(t_k|c)

     We write P̂ for P because these values are estimates from the training set.

     Taking the log. Multiplying lots of small probabilities can result in floating point underflow. Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities, and since log is a monotonic function, the class with the highest score does not change. So what we usually compute in practice is:

       c_map = arg max_{c ∈ C} [ log P̂(c) + ∑_{1 ≤ k ≤ n_d} log P̂(t_k|c) ]

     Naive Bayes classifier, simple interpretation of this classification rule: each conditional parameter log P̂(t_k|c) is a weight that indicates how good an indicator t_k is for c; the prior log P̂(c) is a weight that indicates the relative frequency of c; the sum of the log prior and the term weights is a measure of how much evidence there is for the document being in the class; and we select the class with the most evidence. (A code sketch of this decision rule follows below.)
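A minimal sketch of the log-space decision rule c_map = arg max_c [log P̂(c) + Σ_k log P̂(t_k|c)], assuming the estimates P̂ have already been computed. Parameter estimation and smoothing are not covered yet, so the estimates here are passed in as plain dictionaries with made-up values.

```python
import math

def apply_nb(doc_tokens, classes, log_prior, log_cond_prob):
    """Return the maximum a posteriori class for a tokenized document,
    summing log probabilities to avoid floating point underflow."""
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = log_prior[c]                   # log P^(c): relative frequency of c
        for t in doc_tokens:
            if t in log_cond_prob[c]:          # terms outside the vocabulary are ignored
                score += log_cond_prob[c][t]   # log P^(t|c): evidence t contributes for c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up estimates for a two-class example; a real system would learn these from D.
log_prior = {"China": math.log(0.5), "UK": math.log(0.5)}
log_cond_prob = {
    "China": {"chinese": math.log(0.4),  "beijing": math.log(0.3),  "london": math.log(0.05)},
    "UK":    {"chinese": math.log(0.05), "beijing": math.log(0.05), "london": math.log(0.4)},
}
print(apply_nb(["chinese", "beijing", "airline"], ["China", "UK"], log_prior, log_cond_prob))
# -> "China": the term weights for "chinese" and "beijing" outweigh the equal priors.
```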

