  1. Text Classification & Naïve Bayes CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Some slides by Dan Jurafsky & James Martin, Jacob Eisenstein

  2. Today • Text classification problems and their evaluation • Linear classifiers – Features & weights – Bag of words – Naïve Bayes (a topic at the intersection of machine learning, probability, and linguistics)

  3. TEXT CLASSIFICATION

  4. Is this spam? From: "Fabian Starr" <Patrick_Freeman@pamietaniepeerelu.pl> Subject: Hey! Sofware for the funny prices! Get the great discounts on popular software today for PC and Macintosh http://iiled.org/Cj4Lmx 70-90% Discounts from retail price!!! All sofware is instantly available to download - No Need Wait!

  5. Who wrote which Federalist papers? • 1787-8: anonymous essays by Jay, Madison, and Hamilton try to convince New York to ratify the U.S. Constitution • Authorship of 12 of the essays in dispute • 1963: solved by Mosteller and Wallace using Bayesian methods [Photos: James Madison, Alexander Hamilton]

  6. Positive or negative movie review? • unbelievably disappointing • Full of zany characters and richly applied satire, and some great plot twists • this is the greatest screwball comedy ever filmed • It was pathetic. The worst part about it was the boxing scenes.

  7. What is the subject of this article? Assign a MEDLINE article to a category in the MeSH subject hierarchy: • Antagonists and Inhibitors • Blood Supply • Chemistry • Drug Therapy • Embryology • Epidemiology • …

  8. Text Classification • Assigning subject categories, topics, or genres • Spam detection • Authorship identification • Age/gender identification • Language identification • Sentiment analysis • …

  9. Text Classification: definition • Input: – a document w – a fixed set of classes Y = {y1, y2, …, yJ} • Output: a predicted class y ∈ Y

  10. Classification Methods: Hand-coded rules • Rules based on combinations of words or other features – spam: black-list address OR ("dollars" AND "have been selected") • Accuracy can be high – if rules are carefully refined by an expert • But building and maintaining these rules is expensive

  11. Classification Methods: Supervised Machine Learning • Input – a document w – a fixed set of classes Y = {y1, y2, …, yJ} – a training set of m hand-labeled documents (w1, y1), …, (wm, ym) • Output – a learned classifier w → y

  12. Aside: getting examples for supervised learning • Human annotation – by experts or non-experts (crowdsourcing) – found data • Truth vs. gold standard • How do we know how good a classifier is? – Accuracy on held-out data

  13. Aside: evaluating classifiers • How do we know how good a classifier is? – Compare classifier predictions with human annotation – on held-out test examples – Evaluation metrics: accuracy, precision, recall

  14. The 2-by-2 contingency table

                    correct    not correct
      selected      tp         fp
      not selected  fn         tn

  15. Precision and recall • Precision: % of selected items that are correct = tp / (tp + fp) • Recall: % of correct items that are selected = tp / (tp + fn) (tp, fp, fn as in the contingency table above)
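
As a minimal sketch (not from the slides), both metrics follow directly from the table's counts; the function names and example numbers below are invented for illustration.

def precision(tp, fp):
    # Fraction of selected items that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of correct items that are selected.
    return tp / (tp + fn)

print(precision(tp=40, fp=10))  # 0.8
print(recall(tp=40, fn=20))     # 0.666...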

  16. A combined measure: F • A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean): F = 1 / (α(1/P) + (1−α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1−α)/α • People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½): – F1 = 2PR / (P + R)
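
A small sketch of the weighted F measure with the balanced case as the default; the function name and example values are invented. For instance, with P = 0.8 and R = 2/3, balanced F1 = 2(0.8)(2/3) / (0.8 + 2/3) ≈ 0.727.

def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall;
    # beta = 1 gives the balanced F1 = 2PR / (P + R).
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.8, 2 / 3))  # ~0.727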

  17. LINEAR CLASSIFIERS

  18. Bag of words
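
The figure on this slide did not survive extraction. As a minimal sketch of the idea: a bag-of-words representation keeps only the count of each word and discards word order; the example sentence is invented.

from collections import Counter

doc = "great plot twists and a great screwball comedy"
bag_of_words = Counter(doc.lower().split())
# Counter({'great': 2, 'plot': 1, 'twists': 1, 'and': 1,
#          'a': 1, 'screwball': 1, 'comedy': 1})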

  19. Defining features
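
This slide's content did not survive extraction. One common way to define features for linear classification, sketched here with invented names, is a joint feature function f(w, y) pairing each word count with a candidate class, plus a per-class bias feature.

def feature_vector(bag_of_words, y):
    # Joint features f(w, y): one (word, class) feature per word type,
    # valued by its count, plus a bias feature for the class.
    feats = {("__bias__", y): 1.0}
    for word, count in bag_of_words.items():
        feats[(word, y)] = float(count)
    return feats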

  20. Linear classification

  21. Linear Models for Classification • Feature function representation • Weights
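
Sketching the decision rule these two slides describe: a linear model scores each class as the dot product of a weight vector with the feature vector and predicts the highest-scoring class. The sparse-dict layout and the feature_vector helper are the invented sketches from above.

def score(weights, feats):
    # Dot product of two sparse vectors stored as dicts.
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def predict(weights, bag_of_words, classes):
    # y_hat = argmax over y of score(weights, f(w, y))
    return max(classes,
               key=lambda y: score(weights, feature_vector(bag_of_words, y)))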

  22. How can we learn weights? • By hand • Probability – Today: Naïve Bayes • Discriminative training – e.g., perceptron, support vector machines

  23. Generative Story for Multinomial Naïve Bayes • A hypothetical stochastic process describing how training examples are generated
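
A sketch of that stochastic process with invented variable names: first draw a class from the prior P(y), then draw each of the document's words independently from that class's word distribution P(word | y).

import random

def generate_document(priors, word_probs, length):
    # priors: {class: P(y)}; word_probs: {class: {word: P(word | y)}}.
    y = random.choices(list(priors), weights=list(priors.values()))[0]
    vocab = list(word_probs[y])
    weights = [word_probs[y][w] for w in vocab]
    words = random.choices(vocab, weights=weights, k=length)
    return y, words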

  24. Prediction with Naïve Bayes
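
The equations on this slide did not survive extraction. Prediction picks the class maximizing log P(y) plus the sum of log P(word | y) over the document's words; below is a sketch in log space (to avoid floating-point underflow) using the same invented dict layout as above. The tiny probability floor for unseen words stands in for the smoothing discussed two slides below.

import math

def predict_nb(words, priors, word_probs):
    def log_joint(y):
        return math.log(priors[y]) + sum(
            math.log(word_probs[y].get(w, 1e-12)) for w in words)
    return max(priors, key=log_joint)  # max over the class labels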

  25. Parameter Estimation • “count and normalize” • Parameters of a multinomial distribution – Relative frequency estimator – Formally: this is the maximum likelihood estimate • See CIML for derivation
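
A sketch of "count and normalize" under the assumption that training data arrives as a list of (word list, label) pairs; all names are invented.

from collections import Counter, defaultdict

def train_mle(labeled_docs):
    # Prior: P(y) = (# documents with label y) / (# documents).
    class_counts = Counter(y for _, y in labeled_docs)
    priors = {y: n / len(labeled_docs) for y, n in class_counts.items()}
    # Likelihood: P(word | y) = count(word, y) / (total word count in class y).
    word_counts = defaultdict(Counter)
    for words, y in labeled_docs:
        word_counts[y].update(words)
    word_probs = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())
        word_probs[y] = {w: c / total for w, c in counts.items()}
    return priors, word_probs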

  26. Smoothing
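
This slide's content did not survive extraction. The relative-frequency estimate above assigns probability zero to any word never seen with a class, which zeroes out the whole product at prediction time; add-one (Laplace) smoothing is the standard fix: P(word | y) = (count(word, y) + 1) / (count(y) + |V|). A sketch reusing the invented word_counts layout from the estimation sketch above:

def smoothed_prob(word, y, word_counts, vocab):
    # Add-one (Laplace) estimate of P(word | y).
    total = sum(word_counts[y].values())
    return (word_counts[y][word] + 1) / (total + len(vocab))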

  27. Naïve Bayes recap

  28. Today • Text classification problems and their evaluation • Linear classifiers – Features & weights – Bag of words – Naïve Bayes (a topic at the intersection of machine learning, probability, and linguistics)
