CSCE 478/878 Lecture 7: Bayesian Learning

Stephen D. Scott (adapted from Tom Mitchell's slides)


  1. CSCE 478/878 Lecture 7: Bayesian Learning
     Stephen D. Scott (adapted from Tom Mitchell's slides)
     October 31, 2006

  2. Bayesian Methods
     Not all hypotheses are created equal (even if they are all consistent with the training data).
     We might have reasons (domain information) to favor some hypotheses over others a priori.
     Bayesian methods work with probabilities, and have two main roles:
     1. Provide practical learning algorithms:
        • Naïve Bayes learning
        • Bayesian belief network learning
        • Combine prior knowledge (prior probabilities) with observed data
        • Requires prior probabilities
     2. Provide a useful conceptual framework:
        • Provides a "gold standard" for evaluating other learning algorithms
        • Additional insight into Occam's razor

  3. Outline
     • Bayes theorem
     • MAP, ML hypotheses
     • MAP learners
     • Minimum description length principle
     • Bayes optimal classifier/Gibbs algorithm
     • Naïve Bayes classifier
     • Bayesian belief networks

  4. Bayes Theorem
     In general, an identity for conditional probabilities.
     For our work, we want to know the probability that a particular $h \in H$ is the correct hypothesis given that we have seen training data $D$ (examples and labels). Bayes theorem lets us do this:
     $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
     • $P(h)$ = prior probability of hypothesis $h$ (might include domain information)
     • $P(D)$ = probability of training data $D$
     • $P(h \mid D)$ = probability of $h$ given $D$
     • $P(D \mid h)$ = probability of $D$ given $h$
     Note that $P(h \mid D)$ increases with $P(D \mid h)$ and $P(h)$ and decreases with $P(D)$.
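
As a brief aside, the identity can be recovered from the product rule (listed on slide 7). The following is a sketch of that step, assuming $P(D) > 0$:

```latex
\begin{align*}
P(h \wedge D) &= P(h \mid D)\,P(D) = P(D \mid h)\,P(h) \\
\Rightarrow\quad P(h \mid D) &= \frac{P(D \mid h)\,P(h)}{P(D)} \qquad (P(D) > 0)
\end{align*}
```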

  5. Choosing Hypotheses
     $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
     Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis $h_{MAP}$:
     $$h_{MAP} = \operatorname*{argmax}_{h \in H} P(h \mid D) = \operatorname*{argmax}_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \operatorname*{argmax}_{h \in H} P(D \mid h)\,P(h)$$
     If we assume $P(h_i) = P(h_j)$ for all $i, j$, then we can simplify further and choose the maximum likelihood (ML) hypothesis:
     $$h_{ML} = \operatorname*{argmax}_{h_i \in H} P(D \mid h_i)$$
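
To make the difference between the two criteria concrete, here is a minimal Python sketch. The three hypotheses and all probability values are invented for illustration and are not from the lecture; the point is only that a non-uniform prior can make the MAP and ML choices differ.

```python
# Hypothetical priors P(h) and likelihoods P(D | h); values are illustrative only.
priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihoods = {"h1": 0.2, "h2": 0.5, "h3": 0.6}

# h_ML maximizes the likelihood P(D | h) alone.
h_ml = max(likelihoods, key=likelihoods.get)

# h_MAP maximizes P(D | h) P(h); P(D) does not depend on h, so it is dropped.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

print("h_ML  =", h_ml)
print("h_MAP =", h_map)
```

With these numbers the products $P(D \mid h)P(h)$ are 0.14, 0.10, and 0.06, so the strong prior on h1 outweighs the higher likelihood of h3.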

  6. Bayes Theorem Example
     Does the patient have cancer or not?
     A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
     $P(cancer)$ = ____            $P(\neg cancer)$ = ____
     $P(+ \mid cancer)$ = ____        $P(- \mid cancer)$ = ____
     $P(+ \mid \neg cancer)$ = ____       $P(- \mid \neg cancer)$ = ____
     Now consider a new patient for whom the test is positive. What is our diagnosis?
     $P(+ \mid cancer)\,P(cancer)$ = ____
     $P(+ \mid \neg cancer)\,P(\neg cancer)$ = ____
     So $h_{MAP}$ = ____
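
One way to work through the exercise is sketched below in Python. The probabilities come directly from the problem statement (complements are derived as 1 minus the stated value); the variable names are my own.

```python
# Quantities stated in the problem, with complements derived from them.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer               # 0.992
p_pos_given_cancer = 0.98
p_neg_given_cancer = 1 - p_pos_given_cancer       # 0.02
p_neg_given_not_cancer = 0.97
p_pos_given_not_cancer = 1 - p_neg_given_not_cancer   # 0.03

# Unnormalized posteriors for a positive test result.
score_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 = 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.03 * 0.992 = 0.02976

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(f"P(+|cancer)P(cancer)   = {score_cancer:.5f}")
print(f"P(+|~cancer)P(~cancer) = {score_not_cancer:.5f}")
print("h_MAP =", h_map)
```

Even with a positive test, the low prior on cancer makes "not cancer" the MAP hypothesis.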

  7. Basic Formulas for Probabilities
     • Product rule: probability $P(A \wedge B)$ of a conjunction of two events $A$ and $B$:
       $$P(A \wedge B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
     • Sum rule: probability of a disjunction of two events $A$ and $B$:
       $$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$
     • Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
       $$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$
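
As an illustration of the theorem of total probability, the sketch below reuses the numbers from the cancer test on slide 6 to compute $P(+)$ and then the normalized posterior $P(cancer \mid +)$. The helper name `total_probability` is hypothetical.

```python
def total_probability(cond_probs, priors):
    """Return P(B) = sum_i P(B | A_i) P(A_i) for mutually exclusive, exhaustive A_i."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    return sum(c * p for c, p in zip(cond_probs, priors))

# A_1 = cancer, A_2 = not cancer; B = positive test (numbers from slide 6).
p_pos = total_probability([0.98, 0.03], [0.008, 0.992])

# Bayes theorem with the normalizer P(+) made explicit.
p_cancer_given_pos = 0.98 * 0.008 / p_pos

print(f"P(+) = {p_pos:.4f}")
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")
```

The normalized posterior is roughly 0.21, which agrees with the MAP diagnosis of "not cancer".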

  8. Brute Force MAP Hypothesis Learner
     1. For each hypothesis $h$ in $H$, calculate the posterior probability
        $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
     2. Output the hypothesis $h_{MAP}$ with the highest posterior probability
        $$h_{MAP} = \operatorname*{argmax}_{h \in H} P(h \mid D)$$
     Problem: what if $H$ is exponentially or infinitely large?
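
The two steps above can be sketched for a small finite $H$ as follows. The coin-bias hypothesis space, the data, and the function names are assumptions made up for this example; they are not part of the lecture.

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Enumerate H, compute every posterior P(h | D), and return the argmax."""
    unnormalized = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    p_data = sum(unnormalized.values())  # P(D), by the theorem of total probability
    if p_data == 0:
        raise ValueError("data has zero probability under every hypothesis")
    posterior = {h: u / p_data for h, u in unnormalized.items()}
    return max(posterior, key=posterior.get), posterior

# Hypothetical example: each hypothesis is a coin's probability of heads.
hypotheses = [0.1, 0.5, 0.9]
data = "HHTH"

def uniform_prior(h):
    return 1 / len(hypotheses)

def coin_likelihood(flips, h):
    result = 1.0
    for flip in flips:
        result *= h if flip == "H" else 1 - h
    return result

h_map, posterior = brute_force_map(hypotheses, uniform_prior, coin_likelihood, data)
print("h_MAP =", h_map)
```

The cost is linear in $|H|$, which is exactly why the approach breaks down for exponentially or infinitely large hypothesis spaces.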

  9. Relation to Concept Learning
     Consider our usual concept learning task: instance space $X$, hypothesis space $H$, training examples $D$.
     Consider the Find-S learning algorithm (outputs the most specific hypothesis from the version space $VS_{H,D}$).
     What would the brute-force MAP learner output as the MAP hypothesis?
     Does Find-S output a MAP hypothesis?

  10. Relation to Concept Learning (cont'd)
     Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$.
     Assume $D$ is the set of classifications $D = \langle c(x_1), \ldots, c(x_m) \rangle$.
     Assume no noise and $c \in H$, so choose
     $$P(D \mid h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \text{ for all } d_i \in D \\ 0 & \text{otherwise} \end{cases}$$
     Choose $P(h) = 1/|H| \;\; \forall h \in H$, i.e. the uniform distribution.
     If $h$ is inconsistent with $D$, then
     $$P(h \mid D) = (0 \cdot P(h)) / P(D) = 0$$
     If $h$ is consistent with $D$, then
     $$P(h \mid D) = (1 \cdot 1/|H|) / P(D) = (1/|H|) \,/\, \left( |VS_{H,D}| / |H| \right) = 1 / |VS_{H,D}|$$
     (see the theorem of total probability, slide 7)
     Thus if $D$ is noise-free, $c \in H$, and $P(h)$ is uniform, every consistent hypothesis is a MAP hypothesis.
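
The result can be checked by enumeration on a tiny problem. In this sketch the instance space (two Boolean attributes), the hypothesis space (all 16 Boolean functions on it), and the two training labels are all assumptions chosen for illustration; exact fractions are used so the comparison with $1/|VS_{H,D}|$ is exact.

```python
from fractions import Fraction
from itertools import product

# Hypothetical instance space: all settings of two Boolean attributes.
instances = [(0, 0), (0, 1), (1, 0), (1, 1)]

# H: every Boolean function on the instances, stored as {instance: label}.
H = [dict(zip(instances, labels)) for labels in product([0, 1], repeat=len(instances))]

# Noise-free training labels for two of the instances (illustrative).
D = {(0, 0): 0, (1, 1): 1}

def likelihood(h):
    """P(D | h): 1 if h agrees with every training label, else 0."""
    return 1 if all(h[x] == d for x, d in D.items()) else 0

prior = Fraction(1, len(H))                        # uniform P(h) = 1/|H|
p_data = sum(likelihood(h) * prior for h in H)     # equals |VS| / |H|
posteriors = [likelihood(h) * prior / p_data for h in H]
version_space = [h for h in H if likelihood(h) == 1]

print("|H| =", len(H), " |VS| =", len(version_space), " P(D) =", p_data)
print("posterior of each consistent h =", max(posteriors))
```

Every consistent hypothesis receives the same posterior, so each one is a MAP hypothesis under these assumptions.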

  11. Characterizing Learning Algorithms by Equivalent MAP Learners
     [Figure: two block diagrams.
      Inductive system: training examples $D$ and hypothesis space $H$ feed the Candidate Elimination algorithm, which outputs hypotheses.
      Equivalent Bayesian inference system: training examples $D$ and hypothesis space $H$ feed a brute-force MAP learner, which outputs hypotheses, with the prior assumptions made explicit: $P(h)$ uniform; $P(D \mid h) = 0$ if inconsistent, $= 1$ if consistent.]
     So we can characterize algorithms in a Bayesian framework even though they don't directly manipulate probabilities.
     Other priors will allow Find-S, etc. to output a MAP hypothesis; e.g. a $P(h)$ that favors more specific hypotheses.

  12. Learning A Real-Valued Function
     Consider any real-valued target function $f$.
     Training examples $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value:
     • $d_i = f(x_i) + e_i$
     • $e_i$ is a random variable (noise) drawn independently for each $x_i$ according to some Gaussian distribution with mean $\mu_{e_i} = 0$
     Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors, e.g. a linear unit trained with GD/EG:
     $$h_{ML} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} \left( d_i - h(x_i) \right)^2$$

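  13. Learning A Real-Valued Function (cont'd)
     $$\begin{aligned}
     h_{ML} &= \operatorname*{argmax}_{h \in H} p(D \mid h) = \operatorname*{argmax}_{h \in H} p(d_1, \ldots, d_m \mid h) \\
     &= \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) \quad \text{(if the } d_i \text{'s are cond. indep.)} \\
     &= \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \left( \frac{d_i - h(x_i)}{\sigma} \right)^2 \right) \quad \left( \mu_{e_i} = 0 \Rightarrow E[d_i \mid h] = h(x_i) \right)
     \end{aligned}$$
     Maximize the natural log instead:
     $$\begin{aligned}
     h_{ML} &= \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2} \left( \frac{d_i - h(x_i)}{\sigma} \right)^2 \right] \\
     &= \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} -\frac{1}{2} \left( \frac{d_i - h(x_i)}{\sigma} \right)^2 \\
     &= \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} -\left( d_i - h(x_i) \right)^2 \\
     &= \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} \left( d_i - h(x_i) \right)^2
     \end{aligned}$$
     Thus we have a Bayesian justification for minimizing squared error (under certain assumptions).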
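
The equivalence can be checked numerically on a small grid of candidate hypotheses. In this sketch the data points, the noise standard deviation `sigma`, and the hypothesis family $h_w(x) = w x$ are all assumptions chosen for illustration.

```python
import math

# Hypothetical noisy observations of f(x) = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ds = [0.1, 1.9, 4.2, 5.8, 8.1]
sigma = 0.5  # assumed (known) standard deviation of the Gaussian noise

# Candidate hypotheses h_w(x) = w * x for slopes w = 0.0, 0.1, ..., 4.0.
candidates = [k / 10 for k in range(41)]

def sse(w):
    """Sum of squared errors of h_w on the training data."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w):
    """ln p(D | h_w) under independent zero-mean Gaussian noise."""
    return sum(
        math.log(1 / math.sqrt(2 * math.pi * sigma ** 2))
        - 0.5 * ((d - w * x) / sigma) ** 2
        for x, d in zip(xs, ds)
    )

w_ml = max(candidates, key=log_likelihood)  # maximum likelihood slope
w_ls = min(candidates, key=sse)             # least-squares slope
print("w_ML =", w_ml, " w_LS =", w_ls)
```

Changing `sigma` rescales the log-likelihood but does not change which candidate wins, matching the step in the derivation where the constant factors are dropped.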

  14. Learning to Predict Probabilities
     Consider predicting survival probability from patient data.
     Training examples $\langle x_i, d_i \rangle$, where $d_i$ is 1 or 0 (assume the label is [or appears] probabilistically generated).
     We want to train a neural network to output the probability that $x_i$ has label 1, not the label itself.
     Using an approach similar to the previous slide (p. 169), one can show
     $$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\left( 1 - h(x_i) \right)$$
     i.e. find the $h$ minimizing cross-entropy.
     For a single sigmoid unit, use the update rule
     $$w_j \leftarrow w_j + \eta \sum_{i=1}^{m} \left( d_i - h(x_i) \right) x_{ij}$$
     to find $h_{ML}$ (can also derive an EG rule).
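
A minimal pure-Python sketch of the update rule for a single sigmoid unit follows. The training set, learning rate `eta`, and epoch count are invented for illustration; the first input component is a constant 1.0 acting as a bias term, which is also my assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """h(x): output of a single sigmoid unit with weights w."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def cross_entropy(w, X, d):
    """Negative of the log-likelihood objective from the slide."""
    return -sum(
        di * math.log(predict(w, xi)) + (1 - di) * math.log(1 - predict(w, xi))
        for xi, di in zip(X, d)
    )

def train(X, d, eta=0.05, epochs=500):
    """Batch rule: w_j <- w_j + eta * sum_i (d_i - h(x_i)) * x_ij."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = [di - predict(w, xi) for xi, di in zip(X, d)]
        w = [wj + eta * sum(e * xi[j] for e, xi in zip(errors, X))
             for j, wj in enumerate(w)]
    return w

# Hypothetical data: x = (bias, feature); labels are not linearly separable.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0), (1.0, 5.0)]
d = [0, 0, 1, 0, 1, 1]

w = train(X, d)
print("weights:", w)
print("cross-entropy before/after:", cross_entropy([0.0, 0.0], X, d), cross_entropy(w, X, d))
```

The unit's outputs can then be read as estimated probabilities of label 1 rather than hard labels.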

  15. Minimum Description Length Principle
     Occam's razor: prefer the shortest hypothesis.
     MDL: prefer the hypothesis $h$ that satisfies
     $$h_{MDL} = \operatorname*{argmin}_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$
     where $L_C(x)$ is the description length of $x$ under encoding $C$.
     Example: $H$ = decision trees, $D$ = training data labels
     • $L_{C_1}(h)$ is the number of bits to describe tree $h$
     • $L_{C_2}(D \mid h)$ is the number of bits to describe $D$ given $h$
       – Note $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by $h$. Need only describe exceptions
     • Hence $h_{MDL}$ trades off tree size for training errors
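
The trade-off can be illustrated with a toy calculation. Everything in this sketch is hypothetical: the three candidate trees, their sizes in bits, their error counts, and in particular the encoding $C_2$, which here is assumed to spend $\lceil \log_2 m \rceil$ bits per exception to identify the misclassified example (binary labels, so the correct label is implied).

```python
import math

m = 100  # number of training examples (hypothetical)

# (name, bits to encode the tree under C1, training errors) -- invented values.
candidates = [
    ("stump", 10, 20),
    ("medium tree", 40, 3),
    ("full tree", 300, 0),
]

def data_bits(n_errors, m):
    """Assumed encoding C2: ceil(log2 m) bits to point at each exception."""
    return n_errors * math.ceil(math.log2(m))

def description_length(tree_bits, n_errors):
    """L_C1(h) + L_C2(D | h) for one candidate."""
    return tree_bits + data_bits(n_errors, m)

h_mdl = min(candidates, key=lambda c: description_length(c[1], c[2]))
for name, bits, errs in candidates:
    print(name, description_length(bits, errs))
print("h_MDL =", h_mdl[0])
```

Under these made-up numbers the medium tree wins: the stump pays too much for exceptions and the full tree pays too much for its own description.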

  16. Minimum Description Length Principle: Bayesian Justification
     $$\begin{aligned}
     h_{MAP} &= \operatorname*{argmax}_{h \in H} P(D \mid h)\,P(h) \\
     &= \operatorname*{argmax}_{h \in H} \log_2 P(D \mid h) + \log_2 P(h) \\
     &= \operatorname*{argmin}_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h) \qquad (1)
     \end{aligned}$$
     Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability $p$ is $-\log_2 p$ bits.
     So interpret (1):
     • $-\log_2 P(h)$ is the length of $h$ under the optimal code
     • $-\log_2 P(D \mid h)$ is the length of $D$ given $h$ under the optimal code
     → prefer the hypothesis that minimizes length($h$) + length(misclassifications)
     Caveat: $h_{MDL} = h_{MAP}$ doesn't apply for arbitrary encodings (need $P(h)$ and $P(D \mid h)$ to be optimal); merely a guide.
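
The correspondence in (1) can be checked on a small example. The priors and likelihoods below are invented (powers of 1/2, so the code lengths come out as whole bits); `code_length` is a hypothetical helper for the $-\log_2 p$ rule.

```python
import math

def code_length(p):
    """Optimal code length in bits for an event of probability p."""
    return -math.log2(p)

# Hypothetical P(h) and P(D | h).
priors = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
likelihoods = {"h1": 0.125, "h2": 0.5, "h3": 0.25}

# MAP: maximize P(D | h) P(h).
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# MDL with optimal codes: minimize -log2 P(D | h) - log2 P(h).
h_mdl = min(priors, key=lambda h: code_length(likelihoods[h]) + code_length(priors[h]))

for h in priors:
    print(h, "total bits:", code_length(likelihoods[h]) + code_length(priors[h]))
print("h_MAP =", h_map, " h_MDL =", h_mdl)
```

The two criteria agree here because the code lengths are derived from the same probabilities; with an arbitrary encoding they need not.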

  17. Bayes Optimal Classifier
     • So far we've sought the most probable hypothesis given the data $D$, i.e. $h_{MAP}$
     • But given a new instance $x$, $h_{MAP}(x)$ is not necessarily the most probable classification!
     • Consider three possible hypotheses:
       $$P(h_1 \mid D) = 0.4, \quad P(h_2 \mid D) = 0.3, \quad P(h_3 \mid D) = 0.3$$
       Given a new instance $x$,
       $$h_1(x) = +, \quad h_2(x) = -, \quad h_3(x) = -$$
     • $h_{MAP}(x)$ = ____
     • What's the most probable classification of $x$?

  18. Bayes Optimal Classifier (cont'd)
     Bayes optimal classification:
     $$\operatorname*{argmax}_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$
     where $V$ is the set of possible labels (e.g. $\{+, -\}$).
     Example:
     $$\begin{aligned}
     P(h_1 \mid D) &= 0.4, & P(- \mid h_1) &= 0, & P(+ \mid h_1) &= 1 \\
     P(h_2 \mid D) &= 0.3, & P(- \mid h_2) &= 1, & P(+ \mid h_2) &= 0 \\
     P(h_3 \mid D) &= 0.3, & P(- \mid h_3) &= 1, & P(+ \mid h_3) &= 0
     \end{aligned}$$
     therefore
     $$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = 0.4$$
     $$\sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = 0.6$$
     and
     $$\operatorname*{argmax}_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = -$$
     On average, no other classifier using the same prior and the same hypothesis space can outperform Bayes optimal!
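
The example can be reproduced with a few lines of Python. The posteriors and predictions are the ones given in the example; the function and variable names are my own, and the hypotheses are assumed deterministic, so $P(v \mid h)$ is 1 when $h$ predicts $v$ and 0 otherwise.

```python
# Posteriors P(h | D) and each hypothesis's prediction for the new instance x.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
labels = ["+", "-"]

def label_prob(v, h):
    """P(v | h) for a deterministic hypothesis."""
    return 1.0 if predictions[h] == v else 0.0

def bayes_optimal(posteriors, label_prob, labels):
    """Return argmax_v sum_h P(v | h) P(h | D), plus the per-label scores."""
    scores = {
        v: sum(label_prob(v, h) * p for h, p in posteriors.items())
        for v in labels
    }
    return max(scores, key=scores.get), scores

h_map = max(posteriors, key=posteriors.get)
best_label, scores = bayes_optimal(posteriors, label_prob, labels)

print("h_MAP =", h_map, "predicts", predictions[h_map])
print("Bayes optimal label =", best_label, scores)
```

Here $h_{MAP} = h_1$ predicts $+$, while the posterior-weighted vote over all hypotheses selects $-$.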
