

  1. Naïve Bayes Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Administrative • HW 1 out today. Please start early! • Office hours • Chen: Wed 4pm-5pm • Shih-Yang: Fri 3pm-4pm • Location: Whittemore 266

  3. Linear Regression
• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression: repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
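
A minimal sketch of the batch gradient descent update above, assuming NumPy and a design matrix whose first column is all ones; the names X, y, alpha, and num_iters are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) vector of targets.
    Returns the learned parameter vector theta of shape (n+1,).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # Gradient of J(theta) = (1/2m)||X theta - y||^2 is (1/m) X^T (X theta - y)
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta
```

In practice the learning rate and iteration count need tuning, and feature scaling speeds up convergence.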

  4. Housing data example

    x_0   Size in feet^2 (x_1)   Number of bedrooms (x_2)   Number of floors (x_3)   Age of home in years (x_4)   Price ($) in 1000's (y)
    1     2104                   5                          1                        45                           460
    1     1416                   3                          2                        40                           232
    1     1534                   3                          2                        30                           315
    1     852                    2                          1                        36                           178
    ...

$y = (460, 232, 315, 178)^\top$, and the normal equation gives $\theta = (X^\top X)^{-1} X^\top y$

Slide credit: Andrew Ng
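
For concreteness, a NumPy sketch of the normal equation on the four rows shown above. With only four examples and five parameters, $X^\top X$ is singular, so the pseudo-inverse (the minimum-norm least-squares solution) stands in for the explicit inverse:

```python
import numpy as np

# The four example rows from the table: x0 (intercept), size, bedrooms, floors, age
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# theta = (X^T X)^{-1} X^T y; pinv handles the rank-deficient case here
theta = np.linalg.pinv(X) @ y
print(theta)       # learned parameters
print(X @ theta)   # fitted prices (exact here, since the system is underdetermined)
```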

  5. Least square solution
• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \left\| X\theta - y \right\|^2$
• Set $\frac{\partial}{\partial \theta} J(\theta) = 0$
• $\theta = (X^\top X)^{-1} X^\top y$
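
Filling in the omitted step, setting the gradient to zero (standard matrix calculus, assuming $X^\top X$ is invertible):

$$\frac{\partial}{\partial \theta} J(\theta) = \frac{\partial}{\partial \theta} \frac{1}{2m} (X\theta - y)^\top (X\theta - y) = \frac{1}{m} X^\top (X\theta - y) = 0 \;\Rightarrow\; (X^\top X)\theta = X^\top y \;\Rightarrow\; \theta = (X^\top X)^{-1} X^\top y$$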

  6. Justification/interpretation 1
• Geometric interpretation (figure: the vector $y$, its projection $X\theta$ onto the column space of $X$, and the residual $X\theta - y$)
• $X$ stacks the examples as rows and can equally be read as columns: $X = \begin{bmatrix} x^{(1)\top} \\ \vdots \\ x^{(m)\top} \end{bmatrix} = \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}$, where each $x^{(i)}$ includes the constant feature $x_0 = 1$
• $X\theta$ lies in the column space of $X$, i.e., $\mathrm{span}(\{c_1, c_2, \cdots, c_n\})$
• The residual $X\theta - y$ is orthogonal to the column space of $X$
• $X^\top (X\theta - y) = 0 \;\Rightarrow\; (X^\top X)\theta = X^\top y$
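
A quick numerical sanity check of the orthogonality claim, assuming NumPy; the data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])  # intercept + 3 features
y = rng.normal(size=50)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
residual = X @ theta - y

# The residual is (numerically) orthogonal to every column of X
print(X.T @ residual)  # all entries close to 0
```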

  7. Justification/interpretation 2
• Probabilistic model: assume a linear model with Gaussian errors
$$p_\theta\left(y^{(i)} \mid x^{(i)}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \right)$$
• Solving maximum likelihood
$$\hat{\theta} = \arg\max_\theta \prod_{i=1}^{m} p_\theta\left(y^{(i)} \mid x^{(i)}\right) = \arg\min_\theta \left( -\log \prod_{i=1}^{m} p_\theta\left(y^{(i)} \mid x^{(i)}\right) \right) = \arg\min_\theta \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$$
Image credit: CS 446@UIUC
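
The step between the two expressions, spelled out: taking the negative log turns the product of Gaussians into a sum of squared errors plus a constant that does not depend on $\theta$,

$$-\log \prod_{i=1}^{m} p_\theta\left(y^{(i)} \mid x^{(i)}\right) = \frac{m}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2,$$

so maximizing the likelihood over $\theta$ is the same as minimizing the least-squares cost.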

  8. Justification/interpretation 3
• Loss minimization
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{ls}}\left( h_\theta(x^{(i)}), y^{(i)} \right)$$
• $\ell_{\mathrm{ls}}(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2$: least squares loss
• Empirical Risk Minimization (ERM): $\frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{ls}}\left( y^{(i)}, \hat{y}^{(i)} \right)$

  9. Gradient descent vs. normal equation ($m$ training examples, $n$ features)
Gradient descent:
• Need to choose $\alpha$
• Needs many iterations
• Works well even when $n$ is large
Normal equation:
• No need to choose $\alpha$
• No need to iterate
• Need to compute $(X^\top X)^{-1}$
• Slow if $n$ is very large
Slide credit: Andrew Ng

  10. Things to remember
• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression: repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$

  11. Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naïve Bayes

  12. Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naïve Bayes

  13. Random variables • Outcome space S • Space of possible outcomes • Random variables • Functions that map outcomes to real numbers • Event E • Subset of S

  14. Visualizing probability
• Sample space: total area = 1
• A is true inside the blue circle; A is false outside
• $P(A)$ = area of the blue circle

  15. Visualizing probability
• A is true inside the circle; A is false outside
• $P(A) + P(\sim A) = 1$

  16. Visualizing probability
• The region where A is true splits into A ∧ B and A ∧ ~B
• $P(A) = P(A, B) + P(A, \sim B)$

  17. Visualizing conditional probability
• $P(A \mid B) = P(A, B) / P(B)$
• Corollary (the chain rule): $P(A, B) = P(A \mid B)\, P(B)$

  18. Bayes rule (Thomas Bayes)
• $P(A \mid B) = \dfrac{P(A, B)}{P(B)} = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
• Corollary (the chain rule): $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$

  19. Other forms of Bayes rule
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
• $P(A \mid B, X) = \dfrac{P(B \mid A, X)\, P(A, X)}{P(B, X)}$
• $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \sim A)\, P(\sim A)}$

  20. Applying Bayes rule
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \sim A)\, P(\sim A)}$$
• A = you have the flu, B = you just coughed
• Assume: $P(A) = 0.05$, $P(B \mid A) = 0.8$, $P(B \mid \sim A) = 0.2$
• What is P(flu | cough) = P(A|B)?
$$P(A \mid B) = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.2 \times 0.95} \approx 0.17$$
Slide credit: Tom Mitchell
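
The same computation as a tiny Python sketch; the function name and argument names are illustrative only:

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) via Bayes rule for a binary event A."""
    numerator = p_b_given_a * prior_a
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return numerator / evidence

# P(flu) = 0.05, P(cough | flu) = 0.8, P(cough | no flu) = 0.2
print(posterior(0.05, 0.8, 0.2))  # ~0.174
```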

  21. Why are we learning this?
• Hypothesis: $x \rightarrow h \rightarrow y$
• Learn $P(Y \mid X)$

  22. Joint distribution
• Making a joint distribution of M variables:
1. Make a truth table listing all combinations of values
2. For each combination of values, say how probable it is
3. The probabilities must sum to 1

    A B C  Prob
    0 0 0  0.30
    0 0 1  0.05
    0 1 0  0.10
    0 1 1  0.05
    1 0 0  0.05
    1 0 1  0.10
    1 1 0  0.25
    1 1 1  0.10

Slide credit: Tom Mitchell
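
The table above written as a plain Python dictionary so it can be queried programmatically (variable names are illustrative); the query helpers after the next slide reuse it:

```python
# Joint distribution over (A, B, C): maps each assignment to its probability
joint = {
    (0, 0, 0): 0.30,
    (0, 0, 1): 0.05,
    (0, 1, 0): 0.10,
    (0, 1, 1): 0.05,
    (1, 0, 0): 0.05,
    (1, 0, 1): 0.10,
    (1, 1, 0): 0.25,
    (1, 1, 1): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # probabilities must sum to 1
```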

  23. Using the joint distribution
• Can ask for any logical expression involving these variables
• $P(E) = \sum_{\text{rows matching } E} P(\text{row})$
• $P(E_1 \mid E_2) = \dfrac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$
Slide credit: Tom Mitchell
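
A sketch of the two queries, continuing with the `joint` dictionary from the sketch above; an event is passed as a predicate over a row, and all names are illustrative:

```python
def prob(joint, event):
    """P(E) = sum of P(row) over rows matching the event predicate."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(joint, e1, e2):
    """P(E1 | E2) = P(E1, E2) / P(E2)."""
    return prob(joint, lambda r: e1(r) and e2(r)) / prob(joint, e2)

# Example: P(A=1) and P(A=1 | B=1) from the table above
print(prob(joint, lambda r: r[0] == 1))                            # 0.50
print(cond_prob(joint, lambda r: r[0] == 1, lambda r: r[1] == 1))  # 0.35 / 0.50 = 0.70
```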

  24. The solution to learning $P(Y \mid X)$?
• Main problem: learning $P(Y \mid X)$ may require more data than we have
• Say we learn a joint distribution over 100 attributes
• # of rows in this table? $2^{100} \geq 10^{30}$
• # of people on earth? About $10^9$
Slide credit: Tom Mitchell

  25. What should we do?
1. Be smart about how we estimate probabilities from sparse data
• Maximum likelihood estimates (ML)
• Maximum a posteriori estimates (MAP)
2. Be smart about how to represent joint distributions
• Bayes networks, graphical models (more on this later)
Slide credit: Tom Mitchell

  26. Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori (MAP) • Naïve Bayes

  27. Estimating the probability
• Flip a coin repeatedly, observing:
• It turns up heads ($X = 1$) $\alpha_1$ times
• It turns up tails ($X = 0$) $\alpha_0$ times
• Your estimate for $P(X = 1)$ is?
• Case A: 100 flips: 51 heads ($X = 1$), 49 tails ($X = 0$). $P(X = 1) = ?$
• Case B: 3 flips: 2 heads ($X = 1$), 1 tail ($X = 0$). $P(X = 1) = ?$
Slide credit: Tom Mitchell

  28. Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes the probability of the observed data
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta P(\mathrm{Data} \mid \theta)$$
• Maximum a posteriori estimation (MAP): choose $\theta$ that is most probable given the prior probability and the data
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid \mathrm{Data}) = \arg\max_\theta \frac{P(\mathrm{Data} \mid \theta)\, P(\theta)}{P(\mathrm{Data})}$$
Slide credit: Tom Mitchell

  29. Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes $P(\mathrm{Data} \mid \theta)$
$$\hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$$
• Maximum a posteriori estimation (MAP): choose $\theta$ that maximizes $P(\theta \mid \mathrm{Data})$
$$\hat{\theta}_{\mathrm{MAP}} = \frac{\alpha_1 + \#\text{hallucinated 1s}}{(\alpha_1 + \#\text{hallucinated 1s}) + (\alpha_0 + \#\text{hallucinated 0s})}$$
Slide credit: Tom Mitchell
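
A sketch that plugs the coin-flip counts from slide 27 into both formulas; the "hallucinated" pseudo-counts used here (10 and 10) are illustrative values, not from the slides:

```python
def mle(n_heads, n_tails):
    """Maximum likelihood estimate of P(X=1) for a Bernoulli coin."""
    return n_heads / (n_heads + n_tails)

def map_estimate(n_heads, n_tails, pseudo_heads, pseudo_tails):
    """MAP estimate: add 'hallucinated' pseudo-counts before normalizing."""
    return (n_heads + pseudo_heads) / (n_heads + pseudo_heads + n_tails + pseudo_tails)

print(mle(51, 49))                 # Case A: 0.51
print(mle(2, 1))                   # Case B: ~0.67
print(map_estimate(2, 1, 10, 10))  # Case B pulled toward 0.5 by the prior: ~0.52
```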

  30. Maximum likelihood estimate
• Each flip yields a Boolean value for $X$: $X \sim \mathrm{Bernoulli}(\theta)$, with $P(X = 1) = \theta$, $P(X = 0) = 1 - \theta$, i.e. $P(X) = \theta^X (1 - \theta)^{1 - X}$
• A data set $D$ of independent, identically distributed (iid) flips produces $\alpha_1$ ones and $\alpha_0$ zeros
$$P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$$
$$\hat{\theta} = \arg\max_\theta P(D \mid \theta) = \frac{\alpha_1}{\alpha_1 + \alpha_0}$$
Slide credit: Tom Mitchell
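
The maximization step, made explicit: take the log, differentiate, and set to zero (a standard derivation, assuming $0 < \theta < 1$),

$$\frac{d}{d\theta} \log P(D \mid \theta) = \frac{d}{d\theta}\left[ \alpha_1 \log\theta + \alpha_0 \log(1-\theta) \right] = \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{\alpha_1}{\alpha_1 + \alpha_0}.$$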

  31. Beta prior distribution $P(\theta)$
$$P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$$
Slide credit: Tom Mitchell

  32. Maximum a posteriori (MAP) estimate
• A data set $D$ of iid flips produces $\alpha_1$ ones and $\alpha_0$ zeros
$$P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$$
• Assume a Beta prior (a conjugate prior: it gives a closed-form representation of the posterior)
$$P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$$
$$\hat{\theta} = \arg\max_\theta P(D \mid \theta)\, P(\theta) = \frac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$$
Slide credit: Tom Mitchell
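
The reason the Beta prior is conjugate, spelled out: multiplying the likelihood by the prior gives another Beta-shaped function of $\theta$, whose mode is the MAP estimate,

$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta) \propto \theta^{\alpha_1 + \beta_1 - 1} (1-\theta)^{\alpha_0 + \beta_0 - 1} \;\sim\; \mathrm{Beta}(\alpha_1 + \beta_1,\; \alpha_0 + \beta_0).$$

Setting the derivative of the log-posterior to zero gives the mode $\hat{\theta}_{\mathrm{MAP}} = \dfrac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$.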
