Statistical Learning
CS 786, University of Waterloo
Lecture 6: May 17, 2012

Decision Tree Predictions
• Can make deterministic and probabilistic predictions
  – Deterministic rule: CS485 ∧ STAT231 ⟹ CS786
  – Probabilistic rule: Pr(CS786 | CS485 ∧ STAT231) = 0.9
• Probabilistic rule is a conditional distribution… could we use Bayes nets instead of decision trees?

CS786 Lecture Slides (c) 2012 P. Poupart

Bayesian Network Predictions
• Inference queries can be used to make probabilistic predictions
• Advantages:
  – Predict any variable
  – Prediction based on partial evidence
• Question: how do we learn the parameters of a Bayesian network?

Statistical Learning
• Three common approaches
  – Bayesian learning
  – Maximum a posteriori
  – Maximum likelihood
    • Conditional maximum likelihood

Candy Example
• Favorite candy sold in two flavors:
  – Lime (ugh)
  – Cherry (yum)
• Same wrapper for both flavors
• Sold in bags with different ratios:
  – 100% cherry
  – 75% cherry + 25% lime
  – 50% cherry + 50% lime
  – 25% cherry + 75% lime
  – 100% lime

Candy Example
• You bought a bag of candy but don't know its flavor ratio
• After eating k candies:
  – What's the flavor ratio of the bag?
  – What will be the flavor of the next candy?

Statistical Learning
• Hypothesis H: probabilistic theory of the world
  – h_1: 100% cherry
  – h_2: 75% cherry + 25% lime
  – h_3: 50% cherry + 50% lime
  – h_4: 25% cherry + 75% lime
  – h_5: 100% lime
• Data D: evidence about the world
  – d_1: 1st candy is cherry
  – d_2: 2nd candy is lime
  – d_3: 3rd candy is lime
  – …

Bayesian Learning
• Prior: Pr(H)
• Likelihood: Pr(d|H)
• Evidence: d = <d_1, d_2, …, d_n>
• Bayesian learning amounts to computing the posterior using Bayes' theorem:
  Pr(H|d) = k Pr(d|H) Pr(H), where k is a normalizing constant

Bayesian Prediction
• Suppose we want to make a prediction about an unknown quantity X (e.g., the flavor of the next candy)
• Pr(X|d) = Σ_i Pr(X|d,h_i) P(h_i|d) = Σ_i Pr(X|h_i) P(h_i|d)
• Predictions are weighted averages of the predictions of the individual hypotheses
• Hypotheses serve as "intermediaries" between raw data and prediction

Candy Example
• Assume prior P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
• Assume candies are i.i.d. (independent and identically distributed)
  – P(d|h) = Π_j P(d_j|h)
• Suppose first 10 candies all taste lime:
  – P(d|h_5) = 1^10 = 1
  – P(d|h_3) = 0.5^10 ≈ 0.00098
  – P(d|h_1) = 0^10 = 0

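The posterior and prediction formulas above can be sketched for the candy example as follows. This is a minimal illustration, assuming the prior and the five lime ratios from the slides; the function names `posterior` and `predict_lime` are hypothetical and introduced only for this example.

```python
# Candy example: exact Bayesian updating over the five hypotheses.
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1), ..., P(h_5) from the slide
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i) for each bag type


def posterior(n_limes, n_cherries=0):
    """Return [P(h_i | d)] for i.i.d. data with the given flavor counts."""
    unnorm = [
        p * (q ** n_limes) * ((1.0 - q) ** n_cherries)
        for p, q in zip(PRIOR, P_LIME)
    ]
    total = sum(unnorm)  # equals 1/k in Pr(H|d) = k Pr(d|H) Pr(H)
    return [u / total for u in unnorm]


def predict_lime(n_limes, n_cherries=0):
    """Return P(next = lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    post = posterior(n_limes, n_cherries)
    return sum(q * w for q, w in zip(P_LIME, post))


if __name__ == "__main__":
    for n in (0, 1, 2, 3, 10):
        print(n, [round(w, 4) for w in posterior(n)], round(predict_lime(n), 4))
```

With no data the prediction is just the prior-weighted average of the lime ratios; as more limes are observed, the weight shifts toward h_5.
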
Posterior
[Figure: "Posteriors given data generated from h_5". Curves P(h_1|E) through P(h_5|E); x-axis: number of samples (0 to 10); y-axis: P(h_i|e_1...e_t) (0 to 1).]

Prediction
[Figure: "Bayes predictions with data generated from h_5". x-axis: number of samples (0 to 10); y-axis: probability that next candy is lime (0.4 to 1).]

Bayesian Learning
• Bayesian learning properties:
  – Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
  – No overfitting (all hypotheses weighted and considered)
• There is a price to pay:
  – When the hypothesis space is large, Bayesian learning may be intractable
  – i.e., the sum (or integral) over hypotheses is often intractable
• Solution: approximate Bayesian learning

Maximum a posteriori (MAP)
• Idea: make prediction based on most probable hypothesis h_MAP
  – h_MAP = argmax_{h_i} P(h_i|d)
  – P(X|d) ≈ P(X|h_MAP)
• In contrast, Bayesian learning makes prediction based on all hypotheses weighted by their probability

Candy Example (MAP)
• Prediction after
  – 1 lime: h_MAP = h_3, Pr(lime|h_MAP) = 0.5
  – 2 limes: h_MAP = h_4, Pr(lime|h_MAP) = 0.75
  – 3 limes: h_MAP = h_5, Pr(lime|h_MAP) = 1
  – 4 limes: h_MAP = h_5, Pr(lime|h_MAP) = 1
  – …
• After only 3 limes, it correctly selects h_5

Candy Example (MAP)
• But what if correct hypothesis is h_4?
  – h_4: P(lime) = 0.75 and P(cherry) = 0.25
• After 3 limes
  – MAP incorrectly predicts h_5
  – MAP yields P(lime|h_MAP) = 1
  – Bayesian learning yields P(lime|d) ≈ 0.8

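The MAP selections listed above can be reproduced with a short sketch, assuming the prior <0.1, 0.2, 0.4, 0.2, 0.1> and the five lime ratios from the candy example. The helper names (`weights`, `map_index`, `map_prediction`, `bayes_prediction`) are hypothetical.

```python
# Compare MAP and full Bayesian predictions after observing n limes.
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1), ..., P(h_5)
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)


def weights(n_limes):
    """Unnormalized posterior P(h_i) * P(d | h_i) after n_limes limes."""
    return [p * (q ** n_limes) for p, q in zip(PRIOR, P_LIME)]


def map_index(n_limes):
    """0-based index of h_MAP (index 4 corresponds to h_5)."""
    w = weights(n_limes)
    return max(range(len(w)), key=w.__getitem__)


def map_prediction(n_limes):
    """P(lime | h_MAP): prediction from the single best hypothesis."""
    return P_LIME[map_index(n_limes)]


def bayes_prediction(n_limes):
    """P(lime | d): prediction averaged over all hypotheses."""
    w = weights(n_limes)
    return sum(q * wi for q, wi in zip(P_LIME, w)) / sum(w)


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, "h_%d" % (map_index(n) + 1), map_prediction(n),
              round(bayes_prediction(n), 3))
```

After 3 limes the MAP prediction is already 1, while the Bayesian prediction is roughly 0.8 because h_3 and h_4 still carry substantial posterior weight.
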
MAP properties
• MAP prediction less accurate than Bayesian prediction since it relies only on one hypothesis h_MAP
• But MAP and Bayesian predictions converge as data increases
• Controlled overfitting (prior can be used to penalize complex hypotheses)
• Finding h_MAP may be intractable:
  – h_MAP = argmax_h P(h|d)
  – Optimization may be difficult

MAP computation
• Optimization:
  – h_MAP = argmax_h P(h|d)
          = argmax_h P(h) P(d|h)
          = argmax_h P(h) Π_i P(d_i|h)
• Product induces non-linear optimization
• Take the log to linearize optimization
  – h_MAP = argmax_h log P(h) + Σ_i log P(d_i|h)

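The log form of the MAP objective can be sketched as below for the candy example. This is an illustration under stated assumptions: the prior and lime ratios come from the earlier slides, the data are given as a list of "lime"/"cherry" strings, and `safe_log`, `log_score` and `map_hypothesis` are hypothetical names. Zero probabilities are mapped to negative infinity, since `math.log(0.0)` raises an error in Python.

```python
import math

PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1), ..., P(h_5)
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)


def safe_log(p):
    """log(p), with log(0) treated as -infinity instead of an error."""
    return math.log(p) if p > 0.0 else float("-inf")


def log_score(prior, p_lime, data):
    """log P(h) + sum_i log P(d_i | h) for one hypothesis."""
    score = safe_log(prior)
    for flavor in data:
        p = p_lime if flavor == "lime" else 1.0 - p_lime
        score += safe_log(p)
    return score


def map_hypothesis(data):
    """0-based index of the hypothesis maximizing the log score."""
    scores = [log_score(p, q, data) for p, q in zip(PRIOR, P_LIME)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    print(map_hypothesis(["lime"] * 3))             # index of h_MAP after 3 limes
    print(map_hypothesis(["lime", "cherry"] * 1500))  # long mixed sequence
```

Besides turning the product into a sum, working in log space avoids floating-point underflow: with 3000 candies the raw product of likelihoods rounds to zero for every hypothesis, whereas the log scores remain comparable.
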
Maximum Likelihood (ML)
• Idea: simplify MAP by assuming uniform prior (i.e., P(h_i) = P(h_j) ∀ i,j)
  – h_MAP = argmax_h P(h) P(d|h)
  – h_ML = argmax_h P(d|h)
• Make prediction based on h_ML only:
  – P(X|d) ≈ P(X|h_ML)

Candy Example (ML)
• Prediction after
  – 1 lime: h_ML = h_5, Pr(lime|h_ML) = 1
  – 2 limes: h_ML = h_5, Pr(lime|h_ML) = 1
  – …
• Frequentist: "objective" prediction since it relies only on the data (i.e., no prior)
• Bayesian: prediction based on data and uniform prior (since no prior is equivalent to a uniform prior)

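To make the difference between ML and MAP concrete, the following sketch selects a hypothesis with and without the prior. It assumes the candy-example prior and lime ratios from the earlier slides; `likelihood`, `ml_index` and `map_index` are hypothetical helper names.

```python
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1), ..., P(h_5)
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)


def likelihood(q, n_limes, n_cherries):
    """P(d | h) for i.i.d. candies when P(lime | h) = q."""
    return (q ** n_limes) * ((1.0 - q) ** n_cherries)


def ml_index(n_limes, n_cherries=0):
    """0-based index of h_ML = argmax_h P(d | h); the prior is ignored."""
    scores = [likelihood(q, n_limes, n_cherries) for q in P_LIME]
    return max(range(len(scores)), key=scores.__getitem__)


def map_index(n_limes, n_cherries=0):
    """0-based index of h_MAP = argmax_h P(h) P(d | h)."""
    scores = [p * likelihood(q, n_limes, n_cherries)
              for p, q in zip(PRIOR, P_LIME)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # After a single lime, ML already commits to h_5 while MAP stays at h_3.
    print("ML: h_%d" % (ml_index(1) + 1), "MAP: h_%d" % (map_index(1) + 1))
```
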
ML properties
• ML prediction less accurate than Bayesian and MAP predictions since it ignores prior info and relies only on one hypothesis h_ML
• But ML, MAP and Bayesian predictions converge as data increases
• Subject to overfitting (no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
• Finding h_ML is often easier than h_MAP
  – h_ML = argmax_h Σ_i log P(d_i|h)

Statistical Learning
• Use Bayesian Learning, MAP or ML
• Complete data:
  – When data has multiple attributes, all attributes are known
  – Easy
• Incomplete data:
  – When data has multiple attributes, some attributes are unknown
  – Harder

Simple ML example
• Hypothesis h_θ:
  – P(cherry) = θ and P(lime) = 1-θ
• Data d:
  – c cherries and l limes
• ML hypothesis:
  – θ is relative frequency of observed data
  – θ = c/(c+l)
  – P(cherry) = c/(c+l) and P(lime) = l/(c+l)

ML computation
• 1) Likelihood expression
  – P(d|h_θ) = θ^c (1-θ)^l
• 2) Log likelihood
  – log P(d|h_θ) = c log θ + l log(1-θ)
• 3) Log likelihood derivative
  – d(log P(d|h_θ))/dθ = c/θ - l/(1-θ)
• 4) ML hypothesis
  – c/θ - l/(1-θ) = 0 ⇒ θ = c/(c+l)

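The closed-form result θ = c/(c+l) can be checked numerically. The sketch below compares it with a brute-force search over a grid of θ values; the function names and the grid resolution are assumptions made for illustration, and the counts used in the demo are hypothetical.

```python
import math


def log_likelihood(theta, c, l):
    """log P(d | h_theta) = c log(theta) + l log(1 - theta), 0 < theta < 1."""
    return c * math.log(theta) + l * math.log(1.0 - theta)


def ml_theta(c, l):
    """Closed-form ML estimate; requires c + l > 0."""
    return c / (c + l)


def grid_argmax(c, l, steps=1000):
    """Brute-force maximizer over theta in {1/steps, ..., (steps-1)/steps}."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, c, l))


if __name__ == "__main__":
    c, l = 3, 7  # hypothetical counts: 3 cherries, 7 limes
    print(ml_theta(c, l), grid_argmax(c, l))
```

The grid deliberately excludes θ = 0 and θ = 1, where the log likelihood is undefined whenever both flavors appear in the data.
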
More complicated ML example
• Hypothesis: h_{θ,θ_1,θ_2}
• Data:
  – c cherries
    • g_c green wrappers
    • r_c red wrappers
  – l limes
    • g_l green wrappers
    • r_l red wrappers

ML computation
• 1) Likelihood expression
  – P(d|h_{θ,θ_1,θ_2}) = θ^c (1-θ)^l θ_1^{r_c} (1-θ_1)^{g_c} θ_2^{r_l} (1-θ_2)^{g_l}
• …
• 4) ML hypothesis
  – c/θ - l/(1-θ) = 0 ⇒ θ = c/(c+l)
  – r_c/θ_1 - g_c/(1-θ_1) = 0 ⇒ θ_1 = r_c/(r_c+g_c)
  – r_l/θ_2 - g_l/(1-θ_2) = 0 ⇒ θ_2 = r_l/(r_l+g_l)

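The three estimates above depend only on the four wrapper counts. The sketch below computes them, reading θ_1 as P(red wrapper | cherry) and θ_2 as P(red wrapper | lime), which is how they enter the likelihood expression. It assumes every candy has exactly one wrapper, so c = r_c + g_c and l = r_l + g_l; the function names and the counts in the demo are hypothetical.

```python
import math


def ml_parameters(r_c, g_c, r_l, g_l):
    """Return (theta, theta_1, theta_2) as relative frequencies.

    Assumes each flavor occurs at least once in the data.
    """
    c = r_c + g_c                 # number of cherries
    l = r_l + g_l                 # number of limes
    theta = c / (c + l)           # P(cherry)
    theta_1 = r_c / (r_c + g_c)   # P(red | cherry)
    theta_2 = r_l / (r_l + g_l)   # P(red | lime)
    return theta, theta_1, theta_2


def log_likelihood(theta, theta_1, theta_2, r_c, g_c, r_l, g_l):
    """Log of the likelihood expression; all parameters strictly in (0, 1)."""
    c, l = r_c + g_c, r_l + g_l
    return (c * math.log(theta) + l * math.log(1 - theta)
            + r_c * math.log(theta_1) + g_c * math.log(1 - theta_1)
            + r_l * math.log(theta_2) + g_l * math.log(1 - theta_2))


if __name__ == "__main__":
    counts = (6, 2, 3, 9)  # hypothetical r_c, g_c, r_l, g_l
    params = ml_parameters(*counts)
    print(params, log_likelihood(*params, *counts))
```

Because the log likelihood is a sum of three terms that each involve a single parameter, the three estimates can be found independently of one another.
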
Naïve Bayes model
• Want to predict a class C based on attributes A_i
[Figure: network with nodes C and A_1, A_2, A_3, …, A_n]
• Parameters:
  – θ = P(C=true)
  – θ_i1 = P(A_i=true|C=true)
  – θ_i2 = P(A_i=true|C=false)
• Assumption: A_i's are independent given C

Naïve Bayes model for Restaurant Problem
• Data: [table of training examples for the restaurant problem]
• ML sets
  – θ to relative frequencies of wait and ~wait
  – θ_i1, θ_i2 to relative frequencies of each attribute value given wait and ~wait

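As an illustration of ML learning for a naïve Bayes model with Boolean attributes, the sketch below estimates θ, θ_i1 and θ_i2 by relative frequencies and then predicts P(C=true | attributes). The data set is a small hypothetical one, not the restaurant data, and the function names are assumptions. The code assumes both classes appear in the training data and that the queried attribute values have non-zero probability under at least one class.

```python
def fit_naive_bayes(examples):
    """examples: list of (attributes, label); attributes is a tuple of bools.

    Returns (theta, theta_1, theta_2) where theta_1[i] = P(A_i=true | C=true)
    and theta_2[i] = P(A_i=true | C=false), all as relative frequencies.
    """
    n_attr = len(examples[0][0])
    pos = [a for a, label in examples if label]
    neg = [a for a, label in examples if not label]
    theta = len(pos) / len(examples)
    theta_1 = [sum(a[i] for a in pos) / len(pos) for i in range(n_attr)]
    theta_2 = [sum(a[i] for a in neg) / len(neg) for i in range(n_attr)]
    return theta, theta_1, theta_2


def predict_true(model, attrs):
    """P(C=true | attrs) under the conditional independence assumption."""
    theta, theta_1, theta_2 = model
    p_true, p_false = theta, 1.0 - theta
    for a, p1, p2 in zip(attrs, theta_1, theta_2):
        p_true *= p1 if a else 1.0 - p1
        p_false *= p2 if a else 1.0 - p2
    return p_true / (p_true + p_false)


# Hypothetical training data: two Boolean attributes and a Boolean class.
DATA = [
    ((True, True), True),
    ((True, False), True),
    ((False, True), True),
    ((True, True), False),
    ((False, False), False),
]

if __name__ == "__main__":
    model = fit_naive_bayes(DATA)
    print(model, predict_true(model, (True, True)))
```
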
Naïve Bayes model vs decision trees
• Wait prediction for restaurant problem
[Figure: proportion correct on test set (0.4 to 1) versus training set size (0 to 100), with one curve for the decision tree and one for naive Bayes.]
• Why is naïve Bayes less accurate than decision tree?

Bayesian network parameter learning (ML)
• Parameters θ_{V,pa(V)=v}:
  – CPTs: θ_{V,pa(V)=v} = P(V|pa(V)=v)
• Data d:
  – d_1: <V_1=v_{1,1}, V_2=v_{2,1}, …, V_n=v_{n,1}>
  – d_2: <V_1=v_{1,2}, V_2=v_{2,2}, …, V_n=v_{n,2}>
  – …
• Maximum likelihood:
  – Set θ_{V,pa(V)=v} to the relative frequencies of the values of V given the values v of the parents of V

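The relative-frequency rule for Bayesian network parameters can be sketched as a counting procedure over complete data. In this illustration the network structure is given as a dictionary mapping each variable to a tuple of its parents, and each record is a dictionary of variable values; this representation, the function name `learn_cpts` and the two-variable flavor/wrapper network are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def learn_cpts(parents, records):
    """Return cpts[var][parent_values][value] = relative frequency.

    parents: dict mapping each variable to a tuple of its parent variables.
    records: list of dicts giving a value for every variable (complete data).
    """
    cpts = {}
    for var, pa in parents.items():
        counts = defaultdict(Counter)
        for rec in records:
            key = tuple(rec[p] for p in pa)  # joint value v of the parents
            counts[key][rec[var]] += 1
        cpts[var] = {
            key: {val: n / sum(ctr.values()) for val, n in ctr.items()}
            for key, ctr in counts.items()
        }
    return cpts


# Hypothetical structure and data: Flavor -> Wrapper.
PARENTS = {"Flavor": (), "Wrapper": ("Flavor",)}
RECORDS = (
    [{"Flavor": "cherry", "Wrapper": "red"}] * 6
    + [{"Flavor": "cherry", "Wrapper": "green"}] * 2
    + [{"Flavor": "lime", "Wrapper": "red"}] * 3
    + [{"Flavor": "lime", "Wrapper": "green"}] * 9
)

if __name__ == "__main__":
    for var, table in learn_cpts(PARENTS, RECORDS).items():
        print(var, table)
```

Parent configurations that never occur in the data get no entry in the table, which is one practical limitation of plain maximum likelihood with limited data.
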