Naïve Bayes Classifiers

Review
• Let event D = the data we have observed.
• Let events H_1, …, H_k be events representing hypotheses we want to choose between.
• Use D to pick the "best" H.
• There are two "standard" ways to do this, depending on what information we have available.

Maximum likelihood hypothesis
• The maximum likelihood hypothesis (H_ML) is the hypothesis that maximizes the probability of the data given that hypothesis.
$$H_{ML} = \arg\max_i P(D \mid H_i)$$
• How to use it: compute P(D | H_i) for each hypothesis and select the one with the greatest value.
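
As a rough illustration of the "how to use it" step, the sketch below picks the hypothesis with the largest likelihood. The hypothesis names and the likelihood values are invented for the example; they are not from the slides.

```python
# Hypothetical likelihoods P(D | H_i) for three hypotheses (made-up numbers).
likelihoods = {"H1": 0.10, "H2": 0.35, "H3": 0.20}


def max_likelihood(likelihoods):
    """Return the hypothesis whose likelihood P(D | H_i) is largest."""
    return max(likelihoods, key=likelihoods.get)


h_ml = max_likelihood(likelihoods)
print(h_ml)
```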

Maximum a posteriori (MAP) hypothesis
• The MAP hypothesis is the hypothesis that maximizes the posterior probability:
$$H_{MAP} = \arg\max_i P(H_i \mid D) = \arg\max_i \frac{P(D \mid H_i)\,P(H_i)}{P(D)} = \arg\max_i P(D \mid H_i)\,P(H_i)$$
• The last step holds because P(D) does not depend on i: the posterior is proportional to P(D | H_i) P(H_i), so dropping the denominator does not change which hypothesis wins.
• The P(D | H_i) terms are now weighted by the hypothesis prior probabilities.
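
The sketch below shows how weighting by the priors can change the chosen hypothesis. All numbers are invented for illustration; with these values the maximum likelihood choice would be H2, while the MAP choice is H1 because H1 has a much larger prior.

```python
# Hypothetical likelihoods P(D | H_i) and priors P(H_i) (made-up numbers).
likelihoods = {"H1": 0.10, "H2": 0.35, "H3": 0.20}
priors = {"H1": 0.70, "H2": 0.10, "H3": 0.20}


def map_hypothesis(likelihoods, priors):
    """Return the hypothesis maximizing P(D | H_i) * P(H_i)."""
    return max(likelihoods, key=lambda h: likelihoods[h] * priors[h])


h_map = map_hypothesis(likelihoods, priors)
print(h_map)
```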

Combining evidence
• If we have multiple pieces of data/evidence (say 2), then we need to compute or estimate
$$P(D_1, D_2 \mid H_0)$$
which is often hard.
• Instead, we assume all pieces of evidence are conditionally independent given a hypothesis:
$$P(D_1, D_2 \mid H_0) = P(D_1 \mid H_0)\,P(D_2 \mid H_0)$$

Combining evidence
$$\begin{aligned}
P(H_0 \mid D_1, \dots, D_m) &= \frac{P(D_1, \dots, D_m \mid H_0)\,P(H_0)}{P(D_1, \dots, D_m)} \\
&= \frac{\left[P(D_1 \mid H_0) \cdots P(D_m \mid H_0)\right] P(H_0)}{P(D_1, \dots, D_m)} \\
&= \frac{\left[\prod_{j=1}^{m} P(D_j \mid H_0)\right] P(H_0)}{P(D_1, \dots, D_m)}
\end{aligned}$$
where
$$P(D_1, \dots, D_m) = \sum_{i=1}^{k} \left[\left(\prod_{j=1}^{m} P(D_j \mid H_i)\right) P(H_i)\right]$$
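
The derivation above can be sketched in a few lines of Python. This is only an illustration under the conditional independence assumption; the two hypotheses and their probabilities are made up, and `posterior` is a hypothetical helper name.

```python
from math import prod


def posterior(evidence_likelihoods, priors):
    """Posterior P(H | D_1, ..., D_m) for every hypothesis H.

    evidence_likelihoods[h] is the list [P(D_1 | h), ..., P(D_m | h)].
    Assumes the D_j are conditionally independent given each hypothesis.
    """
    # Numerator: product of per-evidence likelihoods times the prior.
    unnormalized = {
        h: prod(evidence_likelihoods[h]) * priors[h] for h in priors
    }
    # Denominator P(D_1, ..., D_m): sum of the numerators over all hypotheses.
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Made-up example with two hypotheses and two pieces of evidence.
priors = {"H0": 0.5, "H1": 0.5}
evidence_likelihoods = {"H0": [0.8, 0.5], "H1": [0.2, 0.5]}
post = posterior(evidence_likelihoods, priors)
print(post)
```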

Classification
• Classification is the problem of identifying which of a set of categories (called classes) a particular item belongs in.
• Lots of real-world problems are classification problems:
  – spam filtering (classes: spam/not-spam)
  – handwriting recognition & OCR (classes: one for each letter, number, or symbol)
  – text classification, image classification, music classification, etc.
• Almost any problem where you are assigning a label to items can be set up as a classification task.

Classification
• An algorithm that does classification is called a classifier. A classifier takes an item as input and outputs the class it thinks that item belongs to. That is, the classifier predicts a class for each item.
• Lots of classifiers are based on probabilities and statistical inference:
  – The classes become the hypotheses being tested.
  – The item being classified is turned into a collection of data called features. Useful features are attributes of the item that are strongly correlated with certain classes.
  – The classification algorithm is usually ML or MAP, depending on what data we have available.

Example: Spam classification
• New email arrives: is it spam or not spam?
• A useful set of features might be the presence or absence of various words in the email:
  – F1, ~F1: "Kirlin" appears/does not appear
  – F2, ~F2: "viagra" appears/does not appear
  – F3, ~F3: "cash" appears/does not appear
• Let's say our new email contains "Kirlin" and "cash," but not "viagra."
• The features for this email are F1, ~F2, and F3.
• Let's use MAP for classification.
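
One way to turn an email into these presence/absence features is sketched below. The tokenization (lowercasing and splitting on non-letter characters) and the sample message are assumptions made for the example; the slides do not specify how words are extracted.

```python
import re

# The three feature words from the example: F1, F2, F3.
FEATURE_WORDS = ["kirlin", "viagra", "cash"]


def extract_features(text):
    """Map each feature word to True (appears) or False (does not appear)."""
    # Assumed tokenization: lowercase, then keep runs of letters/apostrophes.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {word: (word in tokens) for word in FEATURE_WORDS}


# Hypothetical message containing "Kirlin" and "cash" but not "viagra".
features = extract_features("Dear Prof. Kirlin, send CASH now!")
print(features)
```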

Example: Spam classification
• Features: F1, ~F2, F3.
$$H_{MAP} = \arg\max_i P(D \mid H_i)\,P(H_i)$$
$$H_{MAP} = \arg\max_{i \in \{\text{spam},\,\text{not-spam}\}} P(F_1, \neg F_2, F_3 \mid H_i)\,P(H_i)$$
• But where do these probabilities come from?

Learning probabilities from data
• To use MAP, we need to calculate or estimate P(H_i) and P(F1, ~F2, F3 | H_i) for each i.
• In other words, we need to know:
  – P(spam)
  – P(not-spam)
  – P(F1, ~F2, F3 | spam)
  – P(F1, ~F2, F3 | not-spam)

Learning probabilities from data
• Let's assume we have access to a large number of old emails that are correctly labeled as spam/not-spam.
• How can we estimate P(spam)?
$$P(\text{spam}) = \frac{\#\text{ of emails labeled as spam}}{\text{total } \#\text{ of emails}}$$
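
The counting estimate above might look like this in Python. The list of labels is a tiny made-up stand-in for the "large number of old emails"; `estimate_prior` is a hypothetical helper name.

```python
# Made-up labels for five old emails.
labels = ["spam", "not-spam", "spam", "not-spam", "not-spam"]


def estimate_prior(labels, cls):
    """Estimate P(cls) as the fraction of emails carrying that label."""
    return labels.count(cls) / len(labels)


p_spam = estimate_prior(labels, "spam")
p_not_spam = estimate_prior(labels, "not-spam")
print(p_spam, p_not_spam)
```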

Learning probabilities from data
• Let's assume we have access to a large number of old emails that are correctly labeled as spam/not-spam.
• How can we estimate P(F1, ~F2, F3 | spam)?
$$P(F_1, \neg F_2, F_3 \mid \text{spam}) = \frac{\#\text{ of spam emails with those exact features}}{\text{total } \#\text{ of spam emails}}$$
• Why is this probably going to be a very rough estimate?

Conditional independence to the rescue!
• It is unlikely that our set of old emails contains many messages with that exact set of features.
• Let's make an assumption that all of our features are conditionally independent of each other, given the hypothesis (spam/not-spam).
$$P(F_1, \neg F_2, F_3 \mid \text{spam}) = P(F_1 \mid \text{spam}) \cdot P(\neg F_2 \mid \text{spam}) \cdot P(F_3 \mid \text{spam})$$
• These probabilities are easier to get good estimates for!
• A classifier that makes this assumption is called a Naïve Bayes classifier.
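
The factorization can be sketched as a product over features, using P(~F | spam) = 1 - P(F | spam) for a word that is absent. The per-word probabilities below are invented for illustration, and `joint_given_class` is a hypothetical helper name.

```python
def joint_given_class(word_probs, present):
    """P(features | class) under the conditional independence assumption.

    word_probs[w] is P(word w appears | class).
    present[w] is True if w appears in the email, False otherwise.
    """
    result = 1.0
    for word, p in word_probs.items():
        # Use P(F | class) if the word appears, else P(~F | class) = 1 - p.
        result *= p if present[word] else (1.0 - p)
    return result


# Made-up values of P(word appears | spam).
spam_word_probs = {"kirlin": 0.1, "viagra": 0.4, "cash": 0.5}
# The example email: F1, ~F2, F3.
present = {"kirlin": True, "viagra": False, "cash": True}

p_joint = joint_given_class(spam_word_probs, present)
print(p_joint)
```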

Learning probabilities from data
• So now we need to estimate P(F1 | spam) instead of P(F1, ~F2, F3 | spam).
• Equivalently, how can we estimate the probability of seeing "Kirlin" in an email given that the email is spam?
$$P(F_1 \mid \text{spam}) = \frac{\#\text{ of spam emails with the word Kirlin}}{\text{total } \#\text{ of spam emails}}$$
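
A minimal sketch of this per-word estimate follows. Each email is represented as a (set of words, label) pair; that representation and the five sample emails are assumptions made for the example. The function divides by the number of emails in the class, so it assumes the class has at least one email.

```python
# Made-up labeled emails: (set of words in the email, label).
emails = [
    ({"kirlin", "cash"}, "spam"),
    ({"viagra", "cash"}, "spam"),
    ({"cash"}, "spam"),
    ({"viagra"}, "spam"),
    ({"kirlin", "meeting"}, "not-spam"),
]


def estimate_word_given_class(emails, word, cls):
    """Estimate P(word appears | cls) by counting within that class."""
    in_class = [words for words, label in emails if label == cls]
    with_word = sum(1 for words in in_class if word in words)
    return with_word / len(in_class)


p_kirlin_spam = estimate_word_given_class(emails, "kirlin", "spam")
p_cash_spam = estimate_word_given_class(emails, "cash", "spam")
print(p_kirlin_spam, p_cash_spam)
```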

Another problem to handle…
• What if we see a word we've never encountered before? What happens to its probability estimate? (And why is this bad?)
$$P(F_j \mid \text{spam}) = \frac{\#\text{ of spam emails with word } F_j}{\text{total } \#\text{ of spam emails}}$$
$$P(\text{spam} \mid F_1, \dots, F_m) = \frac{\left[\prod_{j=1}^{m} P(F_j \mid \text{spam})\right] P(\text{spam})}{P(F_1, \dots, F_m)}$$
• Probability of zero destroys the entire calculation!

Another problem to handle…
• Fix the estimates:
$$P(F_j \mid \text{spam}) = \frac{(\#\text{ of spam emails with word } F_j) + 1}{(\text{total } \#\text{ of spam emails}) + 2}$$
• This is called smoothing. Removes the possibility of a zero probability wiping out the entire calculation.
• "Simulates" two additional spam emails, one with every word, and one with no words.
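
The smoothed estimate is a one-line change to the counting formula. The counts used below are made up; `smoothed_estimate` is a hypothetical helper name.

```python
def smoothed_estimate(count_with_word, total_in_class):
    """Smoothed P(word | class): add 1 to the count and 2 to the total."""
    return (count_with_word + 1) / (total_in_class + 2)


# A word never seen in 98 spam emails no longer gets probability zero.
p_unseen = smoothed_estimate(0, 98)
# A word seen in all 98 spam emails no longer gets probability one.
p_always = smoothed_estimate(98, 98)
print(p_unseen, p_always)
```

Because neither estimate can reach exactly 0 or 1, neither a word's presence nor its absence can zero out the product.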

Summary of Naïve Bayes
• Naïve Bayes classifies using MAP:
$$\begin{aligned}
H_{MAP} &= \arg\max_i P(D \mid H_i)\,P(H_i) \\
&= \arg\max_{i \in \{\text{spam},\,\text{not-spam}\}} P(F_1, \dots, F_m \mid H_i)\,P(H_i) \\
&= \arg\max_{i \in \{\text{spam},\,\text{not-spam}\}} \left[P(F_1 \mid H_i) \cdots P(F_m \mid H_i)\right] P(H_i) \\
&= \arg\max_{i \in \{\text{spam},\,\text{not-spam}\}} \left[\prod_{j=1}^{m} P(F_j \mid H_i)\right] P(H_i)
\end{aligned}$$
• Compute this for spam and for not-spam; see which is bigger.
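
Putting the pieces together, a small end-to-end sketch might look like the following. It is an illustration, not a reference implementation: the six training emails are invented, emails are represented as sets of words, the function names `train` and `classify` are hypothetical, and the add-1/add-2 smoothing from the earlier slide is applied to every word estimate.

```python
def train(emails, vocabulary):
    """Estimate P(class) and smoothed P(word | class) from labeled emails."""
    classes = {label for _, label in emails}
    model = {}
    for cls in classes:
        docs = [words for words, label in emails if label == cls]
        prior = len(docs) / len(emails)
        word_probs = {
            w: (sum(1 for d in docs if w in d) + 1) / (len(docs) + 2)
            for w in vocabulary
        }
        model[cls] = (prior, word_probs)
    return model


def classify(model, words, vocabulary):
    """Return the MAP class and the unnormalized score of each class."""
    scores = {}
    for cls, (prior, word_probs) in model.items():
        score = prior
        for w in vocabulary:
            p = word_probs[w]
            # Present word contributes P(F | class); absent word 1 - P(F | class).
            score *= p if w in words else (1.0 - p)
        scores[cls] = score
    return max(scores, key=scores.get), scores


# Made-up training data: (set of words, label).
emails = [
    ({"viagra", "cash"}, "spam"),
    ({"cash"}, "spam"),
    ({"viagra"}, "spam"),
    ({"kirlin"}, "not-spam"),
    ({"kirlin", "cash"}, "not-spam"),
    ({"kirlin"}, "not-spam"),
]
vocabulary = ["kirlin", "viagra", "cash"]

model = train(emails, vocabulary)
# The example email: contains "Kirlin" and "cash" but not "viagra".
label, scores = classify(model, {"kirlin", "cash"}, vocabulary)
print(label, scores)
```

With many features, the product of small probabilities can underflow; summing logarithms instead of multiplying is a common workaround, though it is not covered in these slides.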
