Bayesian Learning

• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm
Two Roles for Bayesian Methods

Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provides useful conceptual framework:
• Provides "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor
Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h | D) = probability of h given D
• P(D | h) = probability of D given h
Choosing Hypotheses

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Generally we want the most probable hypothesis given the training data.

Maximum a posteriori (MAP) hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

If we assume $P(h_i) = P(h_j)$ for all $i, j$, we can simplify further and choose the maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$
Bayes Theorem

Does the patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+ | cancer) = 0.98     P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03    P(− | ¬cancer) = 0.97
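As a check on the question above, a minimal Python sketch (not part of the original slides) that plugs these numbers into Bayes theorem; the posterior P(cancer | +) comes out to roughly 0.21, so the MAP hypothesis is ¬cancer.

```python
# Minimal sketch: applying Bayes theorem to the cancer test example.
p_cancer = 0.008                 # prior P(cancer)
p_not_cancer = 1 - p_cancer      # P(¬cancer) = 0.992
p_pos_given_cancer = 0.98        # P(+ | cancer)
p_pos_given_not = 1 - 0.97       # P(+ | ¬cancer) = 0.03

# Theorem of total probability: P(+) = sum_i P(+ | A_i) P(A_i)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer

# Bayes theorem: P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"P(+) = {p_pos:.4f}")                        # ~0.0376
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")  # ~0.21  ->  h_MAP = ¬cancer
```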
Basic Formulas for Probabilities

• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  $$P(A \wedge B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
• Sum rule: probability of a disjunction of two events A and B:
  $$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$
• Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
  $$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
   $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
2. Output the hypothesis $h_{MAP}$ with the highest posterior probability
   $$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$
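A minimal sketch of this brute-force learner in Python; the `hypotheses`, `prior`, and `likelihood` arguments are placeholders supplied by the caller, not anything defined in the slides.

```python
# Minimal sketch of the brute-force MAP learner over a finite hypothesis set.

def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the hypothesis h maximizing P(h | D) ∝ P(D | h) P(h)."""
    # P(D) is the same for every h, so unnormalized posteriors can be compared.
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))
```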
Relation to Concept Learning

Consider our usual concept learning task:
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs the most specific hypothesis from the version space $VS_{H,D}$)

What would Bayes rule produce as the MAP hypothesis?

Does FindS output a MAP hypothesis?
Relation to Concept Learning

Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$.

Assume D is the set of classifications: $D = \langle c(x_1), \ldots, c(x_m) \rangle$.

Choose P(D | h):
Relation to Concept Learning

Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$.

Assume D is the set of classifications: $D = \langle c(x_1), \ldots, c(x_m) \rangle$.

Choose P(D | h):
• P(D | h) = 1 if h is consistent with D
• P(D | h) = 0 otherwise

Choose P(h) to be the uniform distribution:
• $P(h) = \frac{1}{|H|}$ for all h in H

Then,
$$P(h \mid D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\[6pt] 0 & \text{otherwise} \end{cases}$$
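To fill in the step behind the result above: P(D | h) is 1 exactly for the hypotheses in the version space and P(h) = 1/|H|, so

$$P(D) = \sum_{h_i \in H} P(D \mid h_i)\,P(h_i) = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}$$

and therefore, for any h consistent with D,

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$$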
Evolution of Posterior Probabilities

[Figure: three panels plotted over the hypotheses — (a) the prior P(h), (b) the posterior P(h | D1), (c) the posterior P(h | D1, D2).]
Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an inductive system (training examples D + hypothesis space H → Candidate Elimination Algorithm → output hypotheses) alongside an equivalent Bayesian inference system (training examples D + hypothesis space H → brute force MAP learner with P(h) uniform and P(D|h) = 1 if consistent, 0 if inconsistent → output hypotheses). Prior assumptions made explicit.]
Learning A Real Valued Function

[Figure: target function f and maximum likelihood hypothesis h_ML plotted against x, with noisy training points scattered around f by errors e.]

Consider any real-valued target function f.

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value:
• $d_i = f(x_i) + e_i$
• $e_i$ is a random variable (noise) drawn independently for each $x_i$ according to some Gaussian distribution with mean 0

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$
Learning A Real Valued Function

$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} p(D \mid h) \\
       &= \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) \\
       &= \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}
\end{aligned}$$

Maximize the natural log of this instead:

$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \right] \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2 \\
       &= \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
\end{aligned}$$
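A small numeric check of this equivalence, as a sketch with illustrative data and a toy finite hypothesis space (none of it from the slides): under Gaussian noise, ranking hypotheses by log-likelihood and by sum of squared errors selects the same $h_{ML}$.

```python
# Sketch: maximizing Gaussian log-likelihood == minimizing squared error.
import math
import random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                       # true target function
xs = [i / 10 for i in range(20)]
ds = [f(x) + random.gauss(0.0, 0.5) for x in xs]  # d_i = f(x_i) + e_i

# A small finite hypothesis space of candidate linear hypotheses.
H = [(lambda x, w=w, b=b: w * x + b)
     for w in (1.5, 2.0, 2.5) for b in (0.5, 1.0, 1.5)]

sigma = 0.5
def log_likelihood(h):
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (d - h(x))**2 / (2 * sigma**2) for x, d in zip(xs, ds))

def sse(h):
    return sum((d - h(x))**2 for x, d in zip(xs, ds))

best_by_ll = max(H, key=log_likelihood)
best_by_sse = min(H, key=sse)
print(best_by_ll is best_by_sse)   # True: same hypothesis either way
```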
Learning to Predict Probabilities

Consider predicting survival probability from patient data.

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is 1 or 0.

Want to train a neural network to output a probability given $x_i$ (not a 0 or 1).

In this case one can show
$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))$$

Weight update rule for a sigmoid unit: $w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$, where
$$\Delta w_{jk} = \eta \sum_{i=1}^{m} (d_i - h(x_i))\, x_{ijk}$$
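A minimal sketch of this update rule for a single sigmoid unit; the data layout (feature rows X, 0/1 labels d) and the learning rate eta are illustrative assumptions, not from the slides.

```python
# Sketch: batch gradient ascent on sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(X, d, eta=0.1, epochs=1000):
    n = len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        h = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
        # Delta w_j = eta * sum_i (d_i - h(x_i)) * x_ij   (the slide's rule)
        for j in range(n):
            w[j] += eta * sum((di - hi) * x[j] for x, di, hi in zip(X, d, h))
    return w
```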
Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes
$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$
where $L_C(x)$ is the description length of x under encoding C.

Example: H = decision trees, D = training data labels
• $L_{C_1}(h)$ is the number of bits to describe tree h
• $L_{C_2}(D \mid h)$ is the number of bits to describe D given h
  – Note $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by h; need only describe the exceptions
• Hence $h_{MDL}$ trades off tree size for training errors
Minimum Description Length Principle

$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(D \mid h)\,P(h) \\
        &= \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h) \\
        &= \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h) \qquad (1)
\end{aligned}$$

Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses $-\log_2 p$ bits.

So interpret (1):
• $-\log_2 P(h)$ is the length of h under the optimal code
• $-\log_2 P(D \mid h)$ is the length of D given h under the optimal code

→ prefer the hypothesis that minimizes length(h) + length(misclassifications)
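A toy illustration of the trade-off, with entirely hypothetical bit counts (nothing here comes from the slides): a small tree plus a few individually coded exceptions can have a shorter total description than a larger tree that fits the data perfectly.

```python
# Toy MDL comparison: L_C1(h) + L_C2(D | h), exceptions encoded one by one.
import math

def mdl_score(tree_bits, n_exceptions, bits_per_exception):
    return tree_bits + n_exceptions * bits_per_exception

# Encoding one exception = naming its index among m examples plus its corrected label.
m = 1000
bits_per_exception = math.log2(m) + 1

small_tree = mdl_score(tree_bits=60,  n_exceptions=12, bits_per_exception=bits_per_exception)
large_tree = mdl_score(tree_bits=400, n_exceptions=0,  bits_per_exception=bits_per_exception)
print(small_tree, large_tree)   # MDL prefers whichever total description is shorter
```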
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., $h_{MAP}$).

Given a new instance x, what is its most probable classification?
• $h_{MAP}(x)$ is not the most probable classification!

Consider:
• Three possible hypotheses: $P(h_1 \mid D) = 0.4$, $P(h_2 \mid D) = 0.3$, $P(h_3 \mid D) = 0.3$
• Given new instance x: $h_1(x) = +$, $h_2(x) = -$, $h_3(x) = -$
• What's the most probable classification of x?
Bayes Optimal Classifier

Bayes optimal classification:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example:
$$P(h_1 \mid D) = 0.4, \quad P(- \mid h_1) = 0, \quad P(+ \mid h_1) = 1$$
$$P(h_2 \mid D) = 0.3, \quad P(- \mid h_2) = 1, \quad P(+ \mid h_2) = 0$$
$$P(h_3 \mid D) = 0.3, \quad P(- \mid h_3) = 1, \quad P(+ \mid h_3) = 0$$

therefore
$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = 0.4$$
$$\sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = 0.6$$

and
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = -$$
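A minimal sketch reproducing this example: with deterministic hypotheses, the Bayes optimal rule reduces to a posterior-weighted vote (the dictionary names below are illustrative).

```python
# Sketch: Bayes optimal classification for the example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # deterministic h_i(x)

def bayes_optimal(values=("+", "-")):
    # P(v_j | h_i) is 1 if h_i predicts v_j, else 0, so the sum is a weighted vote.
    vote = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
            for v in values}
    return max(vote, key=vote.get), vote

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```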
Gibbs Classifier

The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.

Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h | D)
2. Use this to classify the new instance

Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:
$$E[error_{Gibbs}] \le 2\,E[error_{BayesOptimal}]$$

Suppose a correct, uniform prior distribution over H. Then:
• Pick any hypothesis from the version space, with uniform probability
• Its expected error is no worse than twice that of the Bayes optimal classifier
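A minimal sketch of the Gibbs classifier, reusing the illustrative posteriors and predictions dictionaries from the Bayes-optimal sketch above.

```python
# Sketch: sample one hypothesis according to P(h | D), then classify with it.
import random

def gibbs_classify(posteriors, predictions):
    hs = list(posteriors)
    h = random.choices(hs, weights=[posteriors[h] for h in hs], k=1)[0]
    return predictions[h]   # classification by the single sampled hypothesis
```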
Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.

When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given the classification

Successful applications:
• Diagnosis
• Classifying text documents
Naive Bayes Classifier

Assume a target function $f: X \to V$, where each instance x is described by attributes $\langle a_1, a_2, \ldots, a_n \rangle$.

Most probable value of f(x) is:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$$

Naive Bayes assumption:
$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

which gives the Naive Bayes classifier:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value $v_j$
    $\hat{P}(v_j) \leftarrow$ estimate $P(v_j)$
    For each attribute value $a_i$ of each attribute $a$
      $\hat{P}(a_i \mid v_j) \leftarrow$ estimate $P(a_i \mid v_j)$

Classify_New_Instance(x)
$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$$
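A minimal sketch of the two procedures above in Python, using plain frequency counts for the estimates $\hat{P}(v_j)$ and $\hat{P}(a_i \mid v_j)$ (no m-estimate smoothing); the data layout — each example as (attribute tuple, target value) — is an assumption for illustration.

```python
# Sketch of Naive_Bayes_Learn / Classify_New_Instance with frequency estimates.
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)           # (attr_index, v_j) -> counts of a_i
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, v)][a] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    def cond(i, a, v):                           # estimate of P(a_i | v_j)
        return attr_counts[(i, v)][a] / class_counts[v]
    return priors, cond

def classify_new_instance(x, priors, cond):
    def score(v):
        p = priors[v]
        for i, a in enumerate(x):
            p *= cond(i, a, v)
        return p
    return max(priors, key=score)                # v_NB
```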