Naïve Bayes Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824
Administrative • HW 1 out today. Please start early! • Office hours • Chen: Wed 4pm-5pm • Shih-Yang: Fri 3pm-4pm • Location: Whittemore 266
Linear Regression
• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression (see the sketch below):
  Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
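As a concrete illustration of the update rule, a minimal NumPy sketch of batch gradient descent for this cost function (not from the slides; the learning rate and iteration count are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y: (m,) target vector
    Returns the learned parameter vector theta of shape (n+1,).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # h_theta(x^(i)) - y^(i) for all examples at once
        errors = X @ theta - y
        # Simultaneous update: theta_j := theta_j - alpha * (1/m) * sum(errors * x_j)
        theta -= alpha * (X.T @ errors) / m
    return theta
```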
Housing data example

x_0 | Size in feet^2 (x_1) | Number of bedrooms (x_2) | Number of floors (x_3) | Age of home in years (x_4) | Price in $1000's (y)
 1  | 2104                 | 5                        | 1                      | 45                         | 460
 1  | 1416                 | 3                        | 2                      | 40                         | 232
 1  | 1534                 | 3                        | 2                      | 30                         | 315
 1  |  852                 | 2                        | 1                      | 36                         | 178
... | ...                  | ...                      | ...                    | ...                        | ...

$y = (460, 232, 315, 178, \ldots)^\top$, and $\theta = (X^\top X)^{-1} X^\top y$
Slide credit: Andrew Ng
Least square solution
• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \| X\theta - y \|^2$
• Set $\frac{\partial}{\partial \theta} J(\theta) = 0$
• $\theta = (X^\top X)^{-1} X^\top y$ (see the sketch below)
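A minimal NumPy sketch of the closed-form solution (illustrative; it uses np.linalg.solve rather than forming an explicit inverse, a numerically safer way to evaluate the same formula):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution: theta = (X^T X)^{-1} X^T y.

    Solves the linear system (X^T X) theta = X^T y directly.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: the residual X @ theta - y is orthogonal to the column
# space of X, so X.T @ (X @ theta - y) should be numerically zero.
```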
Justification/interpretation 1
• Geometric interpretation: $X\theta$ vs. $y$
• The design matrix stacks the examples as rows, $X = \begin{bmatrix} 1 & x^{(1)\top} \\ 1 & x^{(2)\top} \\ \vdots & \vdots \\ 1 & x^{(m)\top} \end{bmatrix}$; equivalently, write its columns as $c_1, c_2, \ldots, c_n$
• $X\theta$ lies in the column space of $X$, i.e., $\mathrm{span}(\{c_1, c_2, \ldots, c_n\})$
• The residual $X\theta - y$ is orthogonal to the column space of $X$
• $X^\top (X\theta - y) = 0 \;\Rightarrow\; (X^\top X)\theta = X^\top y$
Justification/interpretation 2
• Probabilistic model
• Assume a linear model with Gaussian errors: $p_\theta(y^{(i)} \mid x^{(i)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \right)$
• Solving maximum likelihood (derivation below):
  $\operatorname*{argmax}_\theta \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \operatorname*{argmin}_\theta \sum_{i=1}^{m} \frac{1}{2\sigma^2} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$
Image credit: CS 446@UIUC
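Filling in the step between the two sides, assuming a fixed noise variance $\sigma^2$ (a standard derivation, not verbatim from the slide): taking the log turns the product into a sum, and dropping terms that do not depend on $\theta$ leaves the least-squares objective.

```latex
\begin{align*}
\hat{\theta}_{\mathrm{ML}}
  &= \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} p_\theta\!\left(y^{(i)} \mid x^{(i)}\right)
   = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_\theta\!\left(y^{(i)} \mid x^{(i)}\right) \\
  &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \left[ -\tfrac{1}{2}\log(2\pi\sigma^2)
     - \tfrac{1}{2\sigma^2}\left(y^{(i)} - \theta^\top x^{(i)}\right)^2 \right]
   = \operatorname*{argmin}_{\theta} \sum_{i=1}^{m} \left(y^{(i)} - \theta^\top x^{(i)}\right)^2
\end{align*}
```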
Justification/interpretation 3
• Loss minimization: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{ls}}\!\left( h_\theta(x^{(i)}), y^{(i)} \right)$
• $L_{\mathrm{ls}}(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2$: least squares loss
• Empirical Risk Minimization (ERM): minimize $\frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{ls}}\!\left( y^{(i)}, \hat{y}^{(i)} \right)$ (see the sketch below)
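A tiny NumPy sketch of these definitions (function names are illustrative, not from the slides):

```python
import numpy as np

def least_squares_loss(y, y_hat):
    """L_ls(y, y_hat) = 1/2 * (y - y_hat)^2, applied elementwise."""
    return 0.5 * (y - y_hat) ** 2

def empirical_risk(theta, X, y, loss=least_squares_loss):
    """Average loss of the linear predictor h_theta(x) = theta^T x over the data."""
    y_hat = X @ theta
    return np.mean(loss(y, y_hat))
```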
Gradient descent vs. normal equation ($m$ training examples, $n$ features)
• Gradient descent: need to choose $\alpha$; needs many iterations; works well even when $n$ is large
• Normal equation: no need to choose $\alpha$; no need to iterate; must compute $(X^\top X)^{-1}$; slow if $n$ is very large
Slide credit: Andrew Ng
Things to remember
• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression:
  Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naïve Bayes
Random variables • Outcome space S • Space of possible outcomes • Random variables • Functions that map outcomes to real numbers • Event E • Subset of S
Visualizing probability: $P(A)$
• The sample space has total area 1; $A$ is true inside the circle, false outside
• $P(A)$ = area of the blue circle
Visualizing probability: $P(A) + P(\sim A)$
• $P(A) + P(\sim A) = 1$
Visualizing probability: $P(A)$
• $P(A) = P(A, B) + P(A, \sim B)$
Visualizing conditional probability
• $P(A \mid B) = P(A, B) / P(B)$
• Corollary (the chain rule): $P(A, B) = P(A \mid B)\, P(B)$
Bayes rule (Thomas Bayes)
• $P(A \mid B) = \dfrac{P(A, B)}{P(B)} = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
• Corollary (the chain rule): $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$
Other forms of Bayes rule
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
• $P(A \mid B, X) = \dfrac{P(B \mid A, X)\, P(A, X)}{P(B, X)}$
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \sim A)\, P(\sim A)}$
Applying Bayes rule
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \sim A)\, P(\sim A)}$
• A = you have the flu, B = you just coughed
• Assume: $P(A) = 0.05$, $P(B \mid A) = 0.8$, $P(B \mid \sim A) = 0.2$
• What is P(flu | cough) = P(A|B)?
• $P(A \mid B) = \dfrac{0.8 \times 0.05}{0.8 \times 0.05 + 0.2 \times 0.95} \approx 0.17$ (see the sketch below)
Slide credit: Tom Mitchell
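The same computation as a short Python sketch (the numbers come from the slide; the function name is illustrative):

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes rule with the denominator expanded by the law of total probability."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# P(flu | cough) with P(A) = 0.05, P(B|A) = 0.8, P(B|~A) = 0.2
print(posterior(0.05, 0.8, 0.2))  # ~0.174
```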
Why are we learning this?
• Given an input $x$, a hypothesis $h$ predicts a label $y$
• Goal: learn $P(Y \mid X)$
Joint distribution
• Making a joint distribution of M variables:
  1. Make a truth table listing all combinations of values
  2. For each combination of values, say how probable it is
  3. The probabilities must sum to 1

  A B C | Prob
  0 0 0 | 0.30
  0 0 1 | 0.05
  0 1 0 | 0.10
  0 1 1 | 0.05
  1 0 0 | 0.05
  1 0 1 | 0.10
  1 1 0 | 0.25
  1 1 1 | 0.10
Slide credit: Tom Mitchell
Using the joint distribution
• Can ask for any logical expression involving these variables (see the sketch below)
• $P(E) = \sum_{\text{rows matching } E} P(\text{row})$
• $P(E_1 \mid E_2) = \dfrac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$
Slide credit: Tom Mitchell
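A short Python sketch of these two queries against the joint table above (the dictionary layout and helper names are illustrative):

```python
# Joint distribution over (A, B, C) from the table above.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): sum the probabilities of all rows matching the event predicate."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(event1, event2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda r: event1(r) and event2(r)) / prob(event2)

# Example queries: P(A=1) and P(A=1 | B=1)
print(prob(lambda r: r[0] == 1))                             # 0.50
print(cond_prob(lambda r: r[0] == 1, lambda r: r[1] == 1))   # 0.35 / 0.50 = 0.70
```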
The solution to learning $P(Y \mid X)$?
• Main problem: learning $P(Y \mid X)$ may require more data than we have
• Say we want a joint distribution over 100 attributes:
  • Number of rows in this table? $2^{100} \geq 10^{30}$
  • Number of people on earth? About $10^9$
Slide credit: Tom Mitchell
What should we do? 1. Be smart about how we estimate probabilities from sparse data • Maximum likelihood estimates (ML) • Maximum a posteriori estimates (MAP) 2. Be smart about how to represent joint distributions • Bayes network, graphical models (more on this later) Slide credit: Tom Mitchell
Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori (MAP) • Naive Bayes
Estimating the probability
• $X = 1$: heads, $X = 0$: tails
• Flip the coin repeatedly, observing: it turns up heads $\alpha_1$ times and tails $\alpha_0$ times
• Your estimate for $P(X = 1)$ is?
• Case A: 100 flips: 51 heads ($X = 1$), 49 tails ($X = 0$). $P(X = 1) = ?$
• Case B: 3 flips: 2 heads ($X = 1$), 1 tail ($X = 0$). $P(X = 1) = ?$
Slide credit: Tom Mitchell
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes the probability of the observed data
  $\hat{\theta}_{\mathrm{MLE}} = \operatorname*{argmax}_\theta P(\mathrm{Data} \mid \theta)$
• Maximum a posteriori estimation (MAP): choose $\theta$ that is most probable given the prior probability and the data
  $\hat{\theta}_{\mathrm{MAP}} = \operatorname*{argmax}_\theta P(\theta \mid \mathrm{Data}) = \operatorname*{argmax}_\theta \dfrac{P(\mathrm{Data} \mid \theta)\, P(\theta)}{P(\mathrm{Data})}$
Slide credit: Tom Mitchell
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes $P(\mathrm{Data} \mid \theta)$
  $\hat{\theta}_{\mathrm{MLE}} = \dfrac{\alpha_1}{\alpha_1 + \alpha_0}$
• Maximum a posteriori estimation (MAP): choose $\theta$ that maximizes $P(\theta \mid \mathrm{Data})$
  $\hat{\theta}_{\mathrm{MAP}} = \dfrac{\alpha_1 + \#\text{hallucinated 1s}}{(\alpha_1 + \#\text{hallucinated 1s}) + (\alpha_0 + \#\text{hallucinated 0s})}$
Slide credit: Tom Mitchell
Maximum likelihood estimate
• Each flip yields a Boolean value for $X$: $X \sim \mathrm{Bernoulli}(\theta)$, i.e., $P(X) = \theta^X (1 - \theta)^{1 - X}$, so $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$
• A data set $D$ of independent, identically distributed (iid) flips produces $\alpha_1$ ones and $\alpha_0$ zeros:
  $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
• $\hat{\theta} = \operatorname*{argmax}_\theta P(D \mid \theta) = \dfrac{\alpha_1}{\alpha_1 + \alpha_0}$ (derivation below)
Slide credit: Tom Mitchell
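The closed form follows from maximizing the log-likelihood; a standard derivation (not spelled out on the slide):

```latex
\begin{align*}
\log P(D \mid \theta) &= \alpha_1 \log \theta + \alpha_0 \log(1 - \theta) \\
\frac{d}{d\theta} \log P(D \mid \theta) &= \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1 - \theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_1}{\alpha_1 + \alpha_0}
\end{align*}
```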
Beta prior distribution $P(\theta)$
• $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \dfrac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$
Slide credit: Tom Mitchell
Maximum a posteriori (MAP) estimate
• A data set $D$ of iid flips produces $\alpha_1$ ones and $\alpha_0$ zeros: $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
• Assume a Beta prior (a conjugate prior: the posterior has a closed-form representation of the same family):
  $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \dfrac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$
• $\hat{\theta} = \operatorname*{argmax}_\theta P(D \mid \theta)\, P(\theta) = \dfrac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$ (see the sketch below)
Slide credit: Tom Mitchell
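A small Python sketch contrasting the MLE and MAP estimates on the earlier coin-flip cases (the Beta(3, 3) prior is an assumed example, corresponding to two hallucinated heads and two hallucinated tails):

```python
def mle(n_heads, n_tails):
    """Maximum likelihood estimate of P(X=1) for a Bernoulli coin."""
    return n_heads / (n_heads + n_tails)

def map_estimate(n_heads, n_tails, beta1, beta0):
    """MAP estimate with a Beta(beta1, beta0) prior on theta."""
    return (n_heads + beta1 - 1) / ((n_heads + beta1 - 1) + (n_tails + beta0 - 1))

# Case A: 100 flips, 51 heads / 49 tails; Case B: 3 flips, 2 heads / 1 tail.
for heads, tails in [(51, 49), (2, 1)]:
    print(mle(heads, tails), map_estimate(heads, tails, beta1=3, beta0=3))
# Case A: MLE 0.51 vs. MAP ~0.51; Case B: MLE ~0.67 vs. MAP ~0.57
# The prior pulls the small-sample estimate toward 0.5.
```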