Statistical Learning
[RN2 Sec 20.1-20.2] [RN3 Sec 20.1-20.2]
CS 486/686, University of Waterloo
Lecture 15: Oct 30, 2012
CS486/686 Lecture Slides (c) 2012 P. Poupart

Outline
• Statistical learning
  – Bayesian learning
  – Maximum a posteriori
  – Maximum likelihood
• Learning from complete data

Statistical Learning
• View: we have uncertain knowledge of the world
• Idea: learning simply reduces this uncertainty

Candy Example
• Favorite candy sold in two flavors:
  – Lime (ugh)
  – Cherry (yum)
• Same wrapper for both flavors
• Sold in bags with different ratios:
  – 100% cherry
  – 75% cherry + 25% lime
  – 50% cherry + 50% lime
  – 25% cherry + 75% lime
  – 100% lime

Candy Example
• You bought a bag of candy but don’t know its flavor ratio
• After eating k candies:
  – What’s the flavor ratio of the bag?
  – What will be the flavor of the next candy?

Statistical Learning
• Hypothesis H: probabilistic theory of the world
  – $h_1$: 100% cherry
  – $h_2$: 75% cherry + 25% lime
  – $h_3$: 50% cherry + 50% lime
  – $h_4$: 25% cherry + 75% lime
  – $h_5$: 100% lime
• Data D: evidence about the world
  – $d_1$: 1st candy is cherry
  – $d_2$: 2nd candy is lime
  – $d_3$: 3rd candy is lime
  – …

Bayesian Learning
• Prior: $\Pr(H)$
• Likelihood: $\Pr(\mathbf{d}|H)$
• Evidence: $\mathbf{d} = \langle d_1, d_2, \ldots, d_n \rangle$
• Bayesian learning amounts to computing the posterior using Bayes’ Theorem:
  $\Pr(H|\mathbf{d}) = k \Pr(\mathbf{d}|H) \Pr(H)$, where $k = 1/\Pr(\mathbf{d})$ is the normalizing constant

Bayesian Prediction
• Suppose we want to make a prediction about an unknown quantity X (e.g., the flavor of the next candy)
• $\Pr(X|\mathbf{d}) = \sum_i \Pr(X|\mathbf{d},h_i) \Pr(h_i|\mathbf{d}) = \sum_i \Pr(X|h_i) \Pr(h_i|\mathbf{d})$
• Predictions are weighted averages of the predictions of the individual hypotheses
• Hypotheses serve as “intermediaries” between raw data and prediction

Candy Example
• Assume prior $P(H) = \langle 0.1, 0.2, 0.4, 0.2, 0.1 \rangle$
• Assume candies are i.i.d. (independent and identically distributed)
  – $P(\mathbf{d}|h) = \prod_j P(d_j|h)$
• Suppose first 10 candies all taste lime:
  – $P(\mathbf{d}|h_5) = 1^{10} = 1$
  – $P(\mathbf{d}|h_3) = 0.5^{10} \approx 0.00098$
  – $P(\mathbf{d}|h_1) = 0^{10} = 0$

Posterior
[Figure: posteriors $P(h_i|e_1 \ldots e_t)$ for $h_1$ to $h_5$ given data generated from $h_5$, plotted against number of samples (0 to 10)]
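
The posterior and the Bayesian prediction for the candy example can be reproduced with a few lines of Python. This is a minimal sketch, assuming the prior and the five hypotheses stated above; the names `P_LIME`, `PRIOR`, `posterior` and `predict_lime` are illustrative and not part of the lecture.

```python
# Probability that a single candy is lime under h_1 .. h_5 (from the slides).
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]
# Prior P(H) from the slides.
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]


def posterior(num_limes, num_cherries=0):
    """Return [P(h_1|d), ..., P(h_5|d)] for i.i.d. candies with these counts."""
    # Unnormalized posterior: P(h) * P(d|h), with P(d|h) a product over candies.
    unnorm = [
        p_h * (q ** num_limes) * ((1.0 - q) ** num_cherries)
        for p_h, q in zip(PRIOR, P_LIME)
    ]
    total = sum(unnorm)  # plays the role of 1/k in Bayes' theorem
    return [u / total for u in unnorm]


def predict_lime(num_limes, num_cherries=0):
    """Bayesian prediction: sum_i P(lime|h_i) * P(h_i|d)."""
    post = posterior(num_limes, num_cherries)
    return sum(w * q for w, q in zip(post, P_LIME))


if __name__ == "__main__":
    for t in range(0, 11):
        print(t, [round(p, 4) for p in posterior(t)], round(predict_lime(t), 4))
```

With no data the prediction equals the prior average of 0.5, and as limes accumulate the posterior mass shifts toward $h_5$.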

Prediction
[Figure: Bayes predictions with data generated from $h_5$; probability that next candy is lime plotted against number of samples (0 to 10)]

Bayesian Learning
• Bayesian learning properties:
  – Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
  – No overfitting (all hypotheses weighted and considered)
• There is a price to pay:
  – When the hypothesis space is large, Bayesian learning may be intractable
  – i.e., the sum (or integral) over hypotheses is often intractable
• Solution: approximate Bayesian learning

Maximum a posteriori (MAP)
• Idea: make prediction based on most probable hypothesis $h_{MAP}$
  – $h_{MAP} = \arg\max_{h_i} P(h_i|\mathbf{d})$
  – $P(X|\mathbf{d}) \approx P(X|h_{MAP})$
• In contrast, Bayesian learning makes prediction based on all hypotheses weighted by their probability

Candy Example (MAP)
• Prediction after
  – 1 lime: $h_{MAP} = h_3$, $\Pr(\text{lime}|h_{MAP}) = 0.5$
  – 2 limes: $h_{MAP} = h_4$, $\Pr(\text{lime}|h_{MAP}) = 0.75$
  – 3 limes: $h_{MAP} = h_5$, $\Pr(\text{lime}|h_{MAP}) = 1$
  – 4 limes: $h_{MAP} = h_5$, $\Pr(\text{lime}|h_{MAP}) = 1$
  – …
• After only 3 limes, it correctly selects $h_5$
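
The sequence of MAP hypotheses listed above can be checked directly. This is a small sketch under the same prior and hypotheses as the candy example; `map_hypothesis` is an illustrative name, and ties (which do not occur here) would be broken in favor of the lower index.

```python
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime|h_i) for h_1 .. h_5
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_i)


def map_hypothesis(num_limes):
    """Return the 1-based index i of h_MAP after observing only limes."""
    # The normalizing constant does not change the argmax, so it is omitted.
    scores = [p_h * q ** num_limes for p_h, q in zip(PRIOR, P_LIME)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1


if __name__ == "__main__":
    for n in range(1, 5):
        i = map_hypothesis(n)
        print(f"{n} lime(s): h_MAP = h_{i}, Pr(lime|h_MAP) = {P_LIME[i - 1]}")
```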

Candy Example (MAP)
• But what if the correct hypothesis is $h_4$?
  – $h_4$: P(lime) = 0.75 and P(cherry) = 0.25
• After 3 limes
  – MAP incorrectly selects $h_5$
  – MAP yields $P(\text{lime}|h_{MAP}) = 1$
  – Bayesian learning yields $P(\text{lime}|\mathbf{d}) \approx 0.8$

MAP properties
• MAP prediction is less accurate than Bayesian prediction since it relies on only one hypothesis $h_{MAP}$
• But MAP and Bayesian predictions converge as data increases
• Controlled overfitting (prior can be used to penalize complex hypotheses)
• Finding $h_{MAP}$ may be intractable:
  – $h_{MAP} = \arg\max_h P(h|\mathbf{d})$
  – Optimization may be difficult
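
The gap between the MAP and Bayesian predictions after 3 limes can be computed side by side. This is a sketch assuming the candy-example prior and hypotheses; the variable names are illustrative only.

```python
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime|h_i) for h_1 .. h_5
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_i)

num_limes = 3

# Posterior P(h_i|d) after observing three limes.
unnorm = [p_h * q ** num_limes for p_h, q in zip(PRIOR, P_LIME)]
total = sum(unnorm)
post = [u / total for u in unnorm]

# Bayesian prediction: average over all hypotheses, weighted by the posterior.
bayes_pred = sum(w * q for w, q in zip(post, P_LIME))

# MAP prediction: use only the single most probable hypothesis.
map_index = max(range(len(post)), key=post.__getitem__)
map_pred = P_LIME[map_index]

print("Bayesian P(lime|d):", round(bayes_pred, 3))
print("MAP      P(lime|h_MAP):", map_pred)
```

The Bayesian value comes out at about 0.796, which the slide rounds to 0.8, while MAP commits fully to $h_5$ and predicts 1.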

MAP computation
• Optimization:
  – $h_{MAP} = \arg\max_h P(h|\mathbf{d}) = \arg\max_h P(h) P(\mathbf{d}|h) = \arg\max_h P(h) \prod_i P(d_i|h)$
• Product induces non-linear optimization
• Take the log to linearize optimization
  – $h_{MAP} = \arg\max_h \left[ \log P(h) + \sum_i \log P(d_i|h) \right]$

Maximum Likelihood (ML)
• Idea: simplify MAP by assuming uniform prior (i.e., $P(h_i) = P(h_j)\ \forall i,j$)
  – $h_{MAP} = \arg\max_h P(h) P(\mathbf{d}|h)$
  – $h_{ML} = \arg\max_h P(\mathbf{d}|h)$
• Make prediction based on $h_{ML}$ only:
  – $P(X|\mathbf{d}) \approx P(X|h_{ML})$
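
The log-space forms of the MAP and ML objectives can be sketched as follows, again on the candy example. The helper names (`safe_log`, `map_and_ml`) and the string encoding of the data are assumptions made for illustration; a zero probability is mapped to negative infinity so that hypotheses ruled out by the data never win the argmax.

```python
import math

P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime|h_i) for h_1 .. h_5
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_i)


def safe_log(p):
    """log(p), with log(0) represented as -infinity instead of an error."""
    return math.log(p) if p > 0.0 else float("-inf")


def map_and_ml(data):
    """Return 1-based indices (h_MAP, h_ML) for a list of 'lime'/'cherry'."""
    map_scores, ml_scores = [], []
    for p_h, q in zip(PRIOR, P_LIME):
        # Sum of log-likelihoods replaces the product of likelihoods.
        loglik = sum(safe_log(q if d == "lime" else 1.0 - q) for d in data)
        ml_scores.append(loglik)                    # ML: likelihood only
        map_scores.append(safe_log(p_h) + loglik)   # MAP: add the log prior
    h_map = max(range(5), key=map_scores.__getitem__) + 1
    h_ml = max(range(5), key=ml_scores.__getitem__) + 1
    return h_map, h_ml


if __name__ == "__main__":
    print(map_and_ml(["lime"]))
    print(map_and_ml(["lime", "lime"]))
```

The only difference between the two objectives is the `safe_log(p_h)` term, which is constant across hypotheses when the prior is uniform.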

Candy Example (ML)
• Prediction after
  – 1 lime: $h_{ML} = h_5$, $\Pr(\text{lime}|h_{ML}) = 1$
  – 2 limes: $h_{ML} = h_5$, $\Pr(\text{lime}|h_{ML}) = 1$
  – …
• Frequentist: “objective” prediction since it relies only on the data (i.e., no prior)
• Bayesian: prediction based on data and uniform prior (since having no prior is equivalent to a uniform prior)

ML properties
• ML prediction is less accurate than Bayesian and MAP predictions since it ignores prior information and relies on only one hypothesis $h_{ML}$
• But ML, MAP and Bayesian predictions converge as data increases
• Subject to overfitting (no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
• Finding $h_{ML}$ is often easier than finding $h_{MAP}$
  – $h_{ML} = \arg\max_h \sum_i \log P(d_i|h)$

Statistical Learning
• Use Bayesian Learning, MAP or ML
• Complete data:
  – When data has multiple attributes, all attributes are known
  – Easy
• Incomplete data:
  – When data has multiple attributes, some attributes are unknown
  – Harder

Simple ML example
• Hypothesis $h_\theta$:
  – $P(\text{cherry}) = \theta$ and $P(\text{lime}) = 1-\theta$
• Data $\mathbf{d}$:
  – c cherries and l limes
• ML hypothesis:
  – $\theta$ is the relative frequency of observed data
  – $\theta = c/(c+l)$
  – $P(\text{cherry}) = c/(c+l)$ and $P(\text{lime}) = l/(c+l)$
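
The claim that the ML value of $\theta$ is the relative frequency $c/(c+l)$ can be checked numerically. This is a rough sketch that maximizes the log likelihood $c \log\theta + l \log(1-\theta)$ over a grid; the grid search and the function names are illustrative choices, not the method used in the lecture.

```python
import math


def log_likelihood(theta, c, l):
    """log P(d|h_theta) for c cherries and l limes, with 0 < theta < 1."""
    return c * math.log(theta) + l * math.log(1.0 - theta)


def grid_argmax(c, l, steps=1000):
    """Return the grid point in (0, 1) with the highest log likelihood."""
    grid = [i / steps for i in range(1, steps)]  # excludes 0 and 1
    return max(grid, key=lambda t: log_likelihood(t, c, l))


if __name__ == "__main__":
    c, l = 3, 1  # hypothetical counts
    print("grid maximum:      ", grid_argmax(c, l))
    print("relative frequency:", c / (c + l))
```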

ML computation
• 1) Likelihood expression
  – $P(\mathbf{d}|h_\theta) = \theta^c (1-\theta)^l$
• 2) Log likelihood
  – $\log P(\mathbf{d}|h_\theta) = c \log\theta + l \log(1-\theta)$
• 3) Log likelihood derivative
  – $\frac{d}{d\theta} \log P(\mathbf{d}|h_\theta) = \frac{c}{\theta} - \frac{l}{1-\theta}$
• 4) ML hypothesis
  – $\frac{c}{\theta} - \frac{l}{1-\theta} = 0 \Rightarrow \theta = \frac{c}{c+l}$

More complicated ML example
• Hypothesis: $h_{\theta,\theta_1,\theta_2}$
• Data:
  – c cherries
    • $g_c$ green wrappers
    • $r_c$ red wrappers
  – l limes
    • $g_l$ green wrappers
    • $r_l$ red wrappers

ML computation
• 1) Likelihood expression
  – $P(\mathbf{d}|h_{\theta,\theta_1,\theta_2}) = \theta^c (1-\theta)^l \, \theta_1^{r_c} (1-\theta_1)^{g_c} \, \theta_2^{r_l} (1-\theta_2)^{g_l}$
• …
• 4) ML hypothesis
  – $\frac{c}{\theta} - \frac{l}{1-\theta} = 0 \Rightarrow \theta = \frac{c}{c+l}$
  – $\frac{r_c}{\theta_1} - \frac{g_c}{1-\theta_1} = 0 \Rightarrow \theta_1 = \frac{r_c}{r_c+g_c}$
  – $\frac{r_l}{\theta_2} - \frac{g_l}{1-\theta_2} = 0 \Rightarrow \theta_2 = \frac{r_l}{r_l+g_l}$

Laplace Smoothing
• An important case of overfitting happens when there is no sample for a certain outcome
  – E.g., no cherries eaten so far
  – $P(\text{cherry}) = \theta = c/(c+l) = 0$
  – Zero probabilities are dangerous: they rule out outcomes
• Solution: Laplace (add-one) smoothing
  – Add 1 to all counts
  – $P(\text{cherry}) = \theta = (c+1)/(c+l+2) > 0$
  – Much better results in practice
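
The three relative-frequency estimates and the add-one correction can be sketched together. The likelihood above implies that $\theta_1$ and $\theta_2$ are the probabilities of a red wrapper given cherry and given lime. The counts below are hypothetical, and applying Laplace smoothing to the wrapper parameters as well as to $\theta$ is an assumption of this sketch (the slide only shows it for P(cherry)).

```python
def ml_estimate(k, other):
    """Relative frequency k / (k + other); undefined when both counts are 0."""
    return k / (k + other)


def laplace_estimate(k, other):
    """Add-one smoothing for a two-outcome variable: (k + 1) / (k + other + 2)."""
    return (k + 1) / (k + other + 2)


def fit_candy(r_c, g_c, r_l, g_l, estimate=ml_estimate):
    """Return (theta, theta_1, theta_2) from wrapper counts per flavor."""
    c = r_c + g_c   # total cherries
    l = r_l + g_l   # total limes
    theta = estimate(c, l)        # P(cherry)
    theta_1 = estimate(r_c, g_c)  # P(red wrapper | cherry)
    theta_2 = estimate(r_l, g_l)  # P(red wrapper | lime)
    return theta, theta_1, theta_2


if __name__ == "__main__":
    # Hypothetical counts: 6 red + 2 green cherries, 1 red + 3 green limes.
    print(fit_candy(6, 2, 1, 3))
    # No cherries observed: plain ML would give P(cherry) = 0.
    print(ml_estimate(0, 5), laplace_estimate(0, 5))
    print(fit_candy(0, 0, 3, 2, estimate=laplace_estimate))
```

With no cherries observed, plain ML is also undefined for $\theta_1$ (zero divided by zero), whereas the smoothed estimate falls back to 1/2.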

Naïve Bayes model
• Want to predict a class C based on attributes $A_i$
[Figure: network with class node C as the parent of attribute nodes $A_1, A_2, A_3, \ldots, A_n$]
• Parameters:
  – $\theta = P(C = \text{true})$
  – $\theta_{i1} = P(A_i = \text{true}|C = \text{true})$
  – $\theta_{i2} = P(A_i = \text{true}|C = \text{false})$
• Assumption: the $A_i$’s are independent given C

Naïve Bayes model for Restaurant Problem
• Data: [Table: restaurant training examples]
• ML sets
  – $\theta$ to the relative frequencies of wait and ~wait
  – $\theta_{i1}$, $\theta_{i2}$ to the relative frequencies of each attribute value given wait and ~wait
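
ML fitting of a naïve Bayes model with Boolean attributes can be sketched as below. The tiny dataset is made up for illustration and is not the restaurant data; the function names are likewise hypothetical, and the sketch assumes both class values occur at least once in the training rows.

```python
def fit_naive_bayes(rows):
    """rows: list of (attrs, label); attrs is a tuple of bools, label a bool.

    Returns (theta, theta_i1, theta_i2) as relative frequencies.
    """
    n_attrs = len(rows[0][0])
    pos = [attrs for attrs, label in rows if label]
    neg = [attrs for attrs, label in rows if not label]
    theta = len(pos) / len(rows)  # P(C = true)
    # P(A_i = true | C = true) and P(A_i = true | C = false)
    theta_i1 = [sum(a[i] for a in pos) / len(pos) for i in range(n_attrs)]
    theta_i2 = [sum(a[i] for a in neg) / len(neg) for i in range(n_attrs)]
    return theta, theta_i1, theta_i2


def predict_true(params, attrs):
    """Return P(C = true | attrs) under the naive independence assumption."""
    theta, theta_i1, theta_i2 = params
    p_true, p_false = theta, 1.0 - theta
    for a, q1, q2 in zip(attrs, theta_i1, theta_i2):
        p_true *= q1 if a else 1.0 - q1
        p_false *= q2 if a else 1.0 - q2
    return p_true / (p_true + p_false)


# Hypothetical training data with two Boolean attributes.
ROWS = [
    ((True, True), True),
    ((True, False), True),
    ((False, True), True),
    ((False, False), False),
    ((True, False), False),
    ((False, False), False),
]

if __name__ == "__main__":
    params = fit_naive_bayes(ROWS)
    print(params)
    print(predict_true(params, (True, False)))
```

In this toy data the second attribute is never true when the class is false, so its ML estimate is exactly 0; this is the zero-count situation that Laplace smoothing addresses.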

Naïve Bayes model vs decision trees
• Wait prediction for restaurant problem
[Figure: proportion correct on test set plotted against training set size (0 to 100) for decision tree and naïve Bayes]
• Why is naïve Bayes less accurate than decision tree?

Bayesian network parameter learning (ML)
• Parameters $\theta_{V,pa(V)=\mathbf{v}}$:
  – CPTs: $\theta_{V,pa(V)=\mathbf{v}} = P(V|pa(V)=\mathbf{v})$
• Data $\mathbf{d}$:
  – $d_1$: $\langle V_1 = v_{1,1}, V_2 = v_{2,1}, \ldots, V_n = v_{n,1} \rangle$
  – $d_2$: $\langle V_1 = v_{1,2}, V_2 = v_{2,2}, \ldots, V_n = v_{n,2} \rangle$
  – …
• Maximum likelihood:
  – Set $\theta_{V,pa(V)=\mathbf{v}}$ to the relative frequencies of the values of V given the values $\mathbf{v}$ of the parents of V
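
Setting each CPT entry to a relative frequency amounts to counting over the complete data. The sketch below assumes each data point is a dictionary from variable name to value; `learn_cpt`, the variable names and the four example records are hypothetical.

```python
from collections import Counter


def learn_cpt(data, var, parents):
    """ML estimate of P(var | parents) from complete data.

    Returns {(parent_values, value): probability}, covering only the
    parent/value combinations that actually occur in the data.
    """
    joint = Counter()          # counts of (parent values, value of var)
    parent_counts = Counter()  # counts of parent values alone
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[var])] += 1
        parent_counts[pa] += 1
    return {key: n / parent_counts[key[0]] for key, n in joint.items()}


# Hypothetical complete data for a two-node network Flavor -> Wrapper.
DATA = [
    {"Flavor": "cherry", "Wrapper": "red"},
    {"Flavor": "cherry", "Wrapper": "red"},
    {"Flavor": "cherry", "Wrapper": "green"},
    {"Flavor": "lime", "Wrapper": "green"},
]

if __name__ == "__main__":
    print(learn_cpt(DATA, "Flavor", []))
    print(learn_cpt(DATA, "Wrapper", ["Flavor"]))
```

Combinations that never occur in the data are simply absent from the returned table, which corresponds to the zero-probability issue discussed under Laplace smoothing.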