Naïve Bayes Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824
Administrative • HW 1 out today. Please start early! • Office hours • Chen: Wed 4pm-5pm • Shih-Yang: Fri 3pm-4pm • Location: Whittemore 266
Linear Regression
• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression (see the sketch below):
  Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
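As a concrete illustration of the update rule, a minimal NumPy sketch of batch gradient descent for this cost function (not from the slides; the learning rate and iteration count are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y: (m,) target vector
    Returns the learned parameter vector theta of shape (n+1,).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # h_theta(x^(i)) - y^(i) for all examples at once
        errors = X @ theta - y
        # Simultaneous update: theta_j := theta_j - alpha * (1/m) * sum(errors * x_j)
        theta -= alpha * (X.T @ errors) / m
    return theta
```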
Housing data example

x_0 | Size in feet^2 (x_1) | Number of bedrooms (x_2) | Number of floors (x_3) | Age of home in years (x_4) | Price in $1000's (y)
 1  | 2104                 | 5                        | 1                      | 45                         | 460
 1  | 1416                 | 3                        | 2                      | 40                         | 232
 1  | 1534                 | 3                        | 2                      | 30                         | 315
 1  |  852                 | 2                        | 1                      | 36                         | 178
... | ...                  | ...                      | ...                    | ...                        | ...

$y = (460, 232, 315, 178, \ldots)^\top$, and $\theta = (X^\top X)^{-1} X^\top y$
Slide credit: Andrew Ng
Least square solution
• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \| X\theta - y \|^2$
• Set $\frac{\partial}{\partial \theta} J(\theta) = 0$
• $\theta = (X^\top X)^{-1} X^\top y$ (see the sketch below)
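A minimal NumPy sketch of the closed-form solution (illustrative; it uses np.linalg.solve rather than forming an explicit inverse, a numerically safer way to evaluate the same formula):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution: theta = (X^T X)^{-1} X^T y.

    Solves the linear system (X^T X) theta = X^T y directly.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: the residual X @ theta - y is orthogonal to the column
# space of X, so X.T @ (X @ theta - y) should be numerically zero.
```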
Justification/interpretation 1
• Geometric interpretation: $X\theta$ vs. $y$
• The design matrix stacks the examples as rows, $X = \begin{bmatrix} 1 & x^{(1)\top} \\ 1 & x^{(2)\top} \\ \vdots & \vdots \\ 1 & x^{(m)\top} \end{bmatrix}$; equivalently, write its columns as $c_1, c_2, \ldots, c_n$
• $X\theta$ lies in the column space of $X$, i.e., $\mathrm{span}(\{c_1, c_2, \ldots, c_n\})$
• The residual $X\theta - y$ is orthogonal to the column space of $X$
• $X^\top (X\theta - y) = 0 \;\Rightarrow\; (X^\top X)\theta = X^\top y$
Justification/interpretation 2
• Probabilistic model
• Assume a linear model with Gaussian errors: $p_\theta(y^{(i)} \mid x^{(i)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \right)$
• Solving maximum likelihood (derivation below):
  $\operatorname*{argmax}_\theta \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \operatorname*{argmin}_\theta \sum_{i=1}^{m} \frac{1}{2\sigma^2} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$
Image credit: CS 446@UIUC
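Filling in the step between the two sides, assuming a fixed noise variance $\sigma^2$ (a standard derivation, not verbatim from the slide): taking the log turns the product into a sum, and dropping terms that do not depend on $\theta$ leaves the least-squares objective.

```latex
\begin{align*}
\hat{\theta}_{\mathrm{ML}}
  &= \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} p_\theta\!\left(y^{(i)} \mid x^{(i)}\right)
   = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_\theta\!\left(y^{(i)} \mid x^{(i)}\right) \\
  &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \left[ -\tfrac{1}{2}\log(2\pi\sigma^2)
     - \tfrac{1}{2\sigma^2}\left(y^{(i)} - \theta^\top x^{(i)}\right)^2 \right]
   = \operatorname*{argmin}_{\theta} \sum_{i=1}^{m} \left(y^{(i)} - \theta^\top x^{(i)}\right)^2
\end{align*}
```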
Justification/interpretation 3
• Loss minimization: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{ls}}\!\left( h_\theta(x^{(i)}), y^{(i)} \right)$
• $L_{\mathrm{ls}}(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2$: least squares loss
• Empirical Risk Minimization (ERM): minimize $\frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{ls}}\!\left( y^{(i)}, \hat{y}^{(i)} \right)$ (see the sketch below)
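A tiny NumPy sketch of these definitions (function names are illustrative, not from the slides):

```python
import numpy as np

def least_squares_loss(y, y_hat):
    """L_ls(y, y_hat) = 1/2 * (y - y_hat)^2, applied elementwise."""
    return 0.5 * (y - y_hat) ** 2

def empirical_risk(theta, X, y, loss=least_squares_loss):
    """Average loss of the linear predictor h_theta(x) = theta^T x over the data."""
    y_hat = X @ theta
    return np.mean(loss(y, y_hat))
```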
Gradient descent vs. normal equation ($m$ training examples, $n$ features)
• Gradient descent: need to choose $\alpha$; needs many iterations; works well even when $n$ is large
• Normal equation: no need to choose $\alpha$; no need to iterate; must compute $(X^\top X)^{-1}$; slow if $n$ is very large
Slide credit: Andrew Ng
Things to remember
• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression:
  Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naïve Bayes
Random variables • Outcome space S • Space of possible outcomes • Random variables • Functions that map outcomes to real numbers • Event E • Subset of S
Visualizing probability: $P(A)$
• The sample space has total area 1; $A$ is true inside the circle, false outside
• $P(A)$ = area of the blue circle
Visualizing probability: $P(A) + P(\sim A)$
• $P(A) + P(\sim A) = 1$
Visualizing probability: $P(A)$
• $P(A) = P(A, B) + P(A, \sim B)$
Visualizing conditional probability
• $P(A \mid B) = P(A, B) / P(B)$
• Corollary (the chain rule): $P(A, B) = P(A \mid B)\, P(B)$
Bayes rule (Thomas Bayes)
• $P(A \mid B) = \dfrac{P(A, B)}{P(B)} = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
• Corollary (the chain rule): $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$
Other forms of Bayes rule
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
• $P(A \mid B, X) = \dfrac{P(B \mid A, X)\, P(A, X)}{P(B, X)}$
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \sim A)\, P(\sim A)}$
Applying Bayes rule
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \sim A)\, P(\sim A)}$
• A = you have the flu, B = you just coughed
• Assume: $P(A) = 0.05$, $P(B \mid A) = 0.8$, $P(B \mid \sim A) = 0.2$
• What is P(flu | cough) = P(A|B)?
• $P(A \mid B) = \dfrac{0.8 \times 0.05}{0.8 \times 0.05 + 0.2 \times 0.95} \approx 0.17$ (see the sketch below)
Slide credit: Tom Mitchell
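The same computation as a short Python sketch (the numbers come from the slide; the function name is illustrative):

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes rule with the denominator expanded by the law of total probability."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# P(flu | cough) with P(A) = 0.05, P(B|A) = 0.8, P(B|~A) = 0.2
print(posterior(0.05, 0.8, 0.2))  # ~0.174
```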
Why are we learning this?
• Given an input $x$, a hypothesis $h$ predicts a label $y$
• Goal: learn $P(Y \mid X)$
Joint distribution
• Making a joint distribution of M variables:
  1. Make a truth table listing all combinations of values
  2. For each combination of values, say how probable it is
  3. The probabilities must sum to 1

  A B C | Prob
  0 0 0 | 0.30
  0 0 1 | 0.05
  0 1 0 | 0.10
  0 1 1 | 0.05
  1 0 0 | 0.05
  1 0 1 | 0.10
  1 1 0 | 0.25
  1 1 1 | 0.10
Slide credit: Tom Mitchell
Using the joint distribution
• Can ask for any logical expression involving these variables (see the sketch below)
• $P(E) = \sum_{\text{rows matching } E} P(\text{row})$
• $P(E_1 \mid E_2) = \dfrac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$
Slide credit: Tom Mitchell
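A short Python sketch of these two queries against the joint table above (the dictionary layout and helper names are illustrative):

```python
# Joint distribution over (A, B, C) from the table above.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): sum the probabilities of all rows matching the event predicate."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(event1, event2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda r: event1(r) and event2(r)) / prob(event2)

# Example queries: P(A=1) and P(A=1 | B=1)
print(prob(lambda r: r[0] == 1))                             # 0.50
print(cond_prob(lambda r: r[0] == 1, lambda r: r[1] == 1))   # 0.35 / 0.50 = 0.70
```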
The solution to learning $P(Y \mid X)$?
• Main problem: learning $P(Y \mid X)$ may require more data than we have
• Say we want a joint distribution over 100 attributes:
  • Number of rows in this table? $2^{100} \geq 10^{30}$
  • Number of people on earth? About $10^9$
Slide credit: Tom Mitchell
What should we do? 1. Be smart about how we estimate probabilities from sparse data • Maximum likelihood estimates (ML) • Maximum a posteriori estimates (MAP) 2. Be smart about how to represent joint distributions • Bayes network, graphical models (more on this later) Slide credit: Tom Mitchell
Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori (MAP) • Naive Bayes
Estimating the probability
• $X = 1$: heads, $X = 0$: tails
• Flip the coin repeatedly, observing: it turns up heads $\alpha_1$ times and tails $\alpha_0$ times
• Your estimate for $P(X = 1)$ is?
• Case A: 100 flips: 51 heads ($X = 1$), 49 tails ($X = 0$). $P(X = 1) = ?$
• Case B: 3 flips: 2 heads ($X = 1$), 1 tail ($X = 0$). $P(X = 1) = ?$
Slide credit: Tom Mitchell
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes the probability of the observed data
  $\hat{\theta}_{\mathrm{MLE}} = \operatorname*{argmax}_\theta P(\mathrm{Data} \mid \theta)$
• Maximum a posteriori estimation (MAP): choose $\theta$ that is most probable given the prior probability and the data
  $\hat{\theta}_{\mathrm{MAP}} = \operatorname*{argmax}_\theta P(\theta \mid \mathrm{Data}) = \operatorname*{argmax}_\theta \dfrac{P(\mathrm{Data} \mid \theta)\, P(\theta)}{P(\mathrm{Data})}$
Slide credit: Tom Mitchell
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes $P(\mathrm{Data} \mid \theta)$
  $\hat{\theta}_{\mathrm{MLE}} = \dfrac{\alpha_1}{\alpha_1 + \alpha_0}$
• Maximum a posteriori estimation (MAP): choose $\theta$ that maximizes $P(\theta \mid \mathrm{Data})$
  $\hat{\theta}_{\mathrm{MAP}} = \dfrac{\alpha_1 + \#\text{hallucinated 1s}}{(\alpha_1 + \#\text{hallucinated 1s}) + (\alpha_0 + \#\text{hallucinated 0s})}$
Slide credit: Tom Mitchell
Maximum likelihood estimate
• Each flip yields a Boolean value for $X$: $X \sim \mathrm{Bernoulli}(\theta)$, i.e., $P(X) = \theta^X (1 - \theta)^{1 - X}$, so $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$
• A data set $D$ of independent, identically distributed (iid) flips produces $\alpha_1$ ones and $\alpha_0$ zeros:
  $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
• $\hat{\theta} = \operatorname*{argmax}_\theta P(D \mid \theta) = \dfrac{\alpha_1}{\alpha_1 + \alpha_0}$ (derivation below)
Slide credit: Tom Mitchell
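The closed form follows from maximizing the log-likelihood; a standard derivation (not spelled out on the slide):

```latex
\begin{align*}
\log P(D \mid \theta) &= \alpha_1 \log \theta + \alpha_0 \log(1 - \theta) \\
\frac{d}{d\theta} \log P(D \mid \theta) &= \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1 - \theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_1}{\alpha_1 + \alpha_0}
\end{align*}
```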
Beta prior distribution $P(\theta)$
• $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \dfrac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$
Slide credit: Tom Mitchell
Maximum a posteriori (MAP) estimate
• A data set $D$ of iid flips produces $\alpha_1$ ones and $\alpha_0$ zeros: $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
• Assume a Beta prior (a conjugate prior: the posterior has a closed-form representation of the same family):
  $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \dfrac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$
• $\hat{\theta} = \operatorname*{argmax}_\theta P(D \mid \theta)\, P(\theta) = \dfrac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$ (see the sketch below)
Slide credit: Tom Mitchell
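A small Python sketch contrasting the MLE and MAP estimates on the earlier coin-flip cases (the Beta(3, 3) prior is an assumed example, corresponding to two hallucinated heads and two hallucinated tails):

```python
def mle(n_heads, n_tails):
    """Maximum likelihood estimate of P(X=1) for a Bernoulli coin."""
    return n_heads / (n_heads + n_tails)

def map_estimate(n_heads, n_tails, beta1, beta0):
    """MAP estimate with a Beta(beta1, beta0) prior on theta."""
    return (n_heads + beta1 - 1) / ((n_heads + beta1 - 1) + (n_tails + beta0 - 1))

# Case A: 100 flips, 51 heads / 49 tails; Case B: 3 flips, 2 heads / 1 tail.
for heads, tails in [(51, 49), (2, 1)]:
    print(mle(heads, tails), map_estimate(heads, tails, beta1=3, beta0=3))
# Case A: MLE 0.51 vs. MAP ~0.51; Case B: MLE ~0.67 vs. MAP ~0.57
# The prior pulls the small-sample estimate toward 0.5.
```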