Bayesian Learning

• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm
Two Roles for Bayesian Methods

Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provides useful conceptual framework:
• Provides "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor
Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h | D) = probability of h given D
• P(D | h) = probability of D given h
Choosing Hypotheses

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Generally we want the most probable hypothesis given the training data.

Maximum a posteriori (MAP) hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

If we assume $P(h_i) = P(h_j)$ for all $i, j$, we can simplify further and choose the maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$
Bayes Theorem

Does the patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+ | cancer) = 0.98     P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03    P(− | ¬cancer) = 0.97
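As a check on the question above, a minimal Python sketch (not part of the original slides) that plugs these numbers into Bayes theorem; the posterior P(cancer | +) comes out to roughly 0.21, so the MAP hypothesis is ¬cancer.

```python
# Minimal sketch: applying Bayes theorem to the cancer test example.
p_cancer = 0.008                 # prior P(cancer)
p_not_cancer = 1 - p_cancer      # P(¬cancer) = 0.992
p_pos_given_cancer = 0.98        # P(+ | cancer)
p_pos_given_not = 1 - 0.97       # P(+ | ¬cancer) = 0.03

# Theorem of total probability: P(+) = sum_i P(+ | A_i) P(A_i)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer

# Bayes theorem: P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"P(+) = {p_pos:.4f}")                        # ~0.0376
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")  # ~0.21  ->  h_MAP = ¬cancer
```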
Basic Formulas for Probabilities

• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  $$P(A \wedge B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
• Sum rule: probability of a disjunction of two events A and B:
  $$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$
• Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
  $$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
   $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
2. Output the hypothesis $h_{MAP}$ with the highest posterior probability
   $$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$
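A minimal sketch of this brute-force learner in Python; the `hypotheses`, `prior`, and `likelihood` arguments are placeholders supplied by the caller, not anything defined in the slides.

```python
# Minimal sketch of the brute-force MAP learner over a finite hypothesis set.

def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the hypothesis h maximizing P(h | D) ∝ P(D | h) P(h)."""
    # P(D) is the same for every h, so unnormalized posteriors can be compared.
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))
```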
Relation to Concept Learning

Consider our usual concept learning task:
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs the most specific hypothesis from the version space $VS_{H,D}$)

What would Bayes rule produce as the MAP hypothesis?

Does FindS output a MAP hypothesis?
Relation to Concept Learning

Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$.

Assume D is the set of classifications: $D = \langle c(x_1), \ldots, c(x_m) \rangle$.

Choose P(D | h):
Relation to Concept Learning

Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$.

Assume D is the set of classifications: $D = \langle c(x_1), \ldots, c(x_m) \rangle$.

Choose P(D | h):
• P(D | h) = 1 if h is consistent with D
• P(D | h) = 0 otherwise

Choose P(h) to be the uniform distribution:
• $P(h) = \frac{1}{|H|}$ for all h in H

Then,
$$P(h \mid D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\[6pt] 0 & \text{otherwise} \end{cases}$$
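To fill in the step behind the result above: P(D | h) is 1 exactly for the hypotheses in the version space and P(h) = 1/|H|, so

$$P(D) = \sum_{h_i \in H} P(D \mid h_i)\,P(h_i) = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}$$

and therefore, for any h consistent with D,

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$$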
Evolution of Posterior Probabilities

[Figure: three panels plotted over the hypotheses — (a) the prior P(h), (b) the posterior P(h | D1), (c) the posterior P(h | D1, D2).]
Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an inductive system (training examples D + hypothesis space H → Candidate Elimination Algorithm → output hypotheses) alongside an equivalent Bayesian inference system (training examples D + hypothesis space H → brute force MAP learner with P(h) uniform and P(D|h) = 1 if consistent, 0 if inconsistent → output hypotheses). Prior assumptions made explicit.]
Learning A Real Valued Function

[Figure: target function f and maximum likelihood hypothesis h_ML plotted against x, with noisy training points scattered around f by errors e.]

Consider any real-valued target function f.

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value:
• $d_i = f(x_i) + e_i$
• $e_i$ is a random variable (noise) drawn independently for each $x_i$ according to some Gaussian distribution with mean 0

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$
Learning A Real Valued Function

$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} p(D \mid h) \\
       &= \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) \\
       &= \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}
\end{aligned}$$

Maximize the natural log of this instead:

$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \right] \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2 \\
       &= \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
\end{aligned}$$
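A small numeric check of this equivalence, as a sketch with illustrative data and a toy finite hypothesis space (none of it from the slides): under Gaussian noise, ranking hypotheses by log-likelihood and by sum of squared errors selects the same $h_{ML}$.

```python
# Sketch: maximizing Gaussian log-likelihood == minimizing squared error.
import math
import random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                       # true target function
xs = [i / 10 for i in range(20)]
ds = [f(x) + random.gauss(0.0, 0.5) for x in xs]  # d_i = f(x_i) + e_i

# A small finite hypothesis space of candidate linear hypotheses.
H = [(lambda x, w=w, b=b: w * x + b)
     for w in (1.5, 2.0, 2.5) for b in (0.5, 1.0, 1.5)]

sigma = 0.5
def log_likelihood(h):
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (d - h(x))**2 / (2 * sigma**2) for x, d in zip(xs, ds))

def sse(h):
    return sum((d - h(x))**2 for x, d in zip(xs, ds))

best_by_ll = max(H, key=log_likelihood)
best_by_sse = min(H, key=sse)
print(best_by_ll is best_by_sse)   # True: same hypothesis either way
```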
Learning to Predict Probabilities

Consider predicting survival probability from patient data.

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is 1 or 0.

Want to train a neural network to output a probability given $x_i$ (not a 0 or 1).

In this case one can show
$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))$$

Weight update rule for a sigmoid unit: $w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$, where
$$\Delta w_{jk} = \eta \sum_{i=1}^{m} (d_i - h(x_i))\, x_{ijk}$$
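A minimal sketch of this update rule for a single sigmoid unit; the data layout (feature rows X, 0/1 labels d) and the learning rate eta are illustrative assumptions, not from the slides.

```python
# Sketch: batch gradient ascent on sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(X, d, eta=0.1, epochs=1000):
    n = len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        h = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
        # Delta w_j = eta * sum_i (d_i - h(x_i)) * x_ij   (the slide's rule)
        for j in range(n):
            w[j] += eta * sum((di - hi) * x[j] for x, di, hi in zip(X, d, h))
    return w
```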
Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes
$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$
where $L_C(x)$ is the description length of x under encoding C.

Example: H = decision trees, D = training data labels
• $L_{C_1}(h)$ is the number of bits to describe tree h
• $L_{C_2}(D \mid h)$ is the number of bits to describe D given h
  – Note $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by h; need only describe the exceptions
• Hence $h_{MDL}$ trades off tree size for training errors
Minimum Description Length Principle

$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(D \mid h)\,P(h) \\
        &= \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h) \\
        &= \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h) \qquad (1)
\end{aligned}$$

Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses $-\log_2 p$ bits.

So interpret (1):
• $-\log_2 P(h)$ is the length of h under the optimal code
• $-\log_2 P(D \mid h)$ is the length of D given h under the optimal code

→ prefer the hypothesis that minimizes length(h) + length(misclassifications)
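A toy illustration of the trade-off, with entirely hypothetical bit counts (nothing here comes from the slides): a small tree plus a few individually coded exceptions can have a shorter total description than a larger tree that fits the data perfectly.

```python
# Toy MDL comparison: L_C1(h) + L_C2(D | h), exceptions encoded one by one.
import math

def mdl_score(tree_bits, n_exceptions, bits_per_exception):
    return tree_bits + n_exceptions * bits_per_exception

# Encoding one exception = naming its index among m examples plus its corrected label.
m = 1000
bits_per_exception = math.log2(m) + 1

small_tree = mdl_score(tree_bits=60,  n_exceptions=12, bits_per_exception=bits_per_exception)
large_tree = mdl_score(tree_bits=400, n_exceptions=0,  bits_per_exception=bits_per_exception)
print(small_tree, large_tree)   # MDL prefers whichever total description is shorter
```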
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., $h_{MAP}$).

Given a new instance x, what is its most probable classification?
• $h_{MAP}(x)$ is not the most probable classification!

Consider:
• Three possible hypotheses: $P(h_1 \mid D) = 0.4$, $P(h_2 \mid D) = 0.3$, $P(h_3 \mid D) = 0.3$
• Given new instance x: $h_1(x) = +$, $h_2(x) = -$, $h_3(x) = -$
• What's the most probable classification of x?
Bayes Optimal Classifier

Bayes optimal classification:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example:
$$P(h_1 \mid D) = 0.4, \quad P(- \mid h_1) = 0, \quad P(+ \mid h_1) = 1$$
$$P(h_2 \mid D) = 0.3, \quad P(- \mid h_2) = 1, \quad P(+ \mid h_2) = 0$$
$$P(h_3 \mid D) = 0.3, \quad P(- \mid h_3) = 1, \quad P(+ \mid h_3) = 0$$

therefore
$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = 0.4$$
$$\sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = 0.6$$

and
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = -$$
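A minimal sketch reproducing this example: with deterministic hypotheses, the Bayes optimal rule reduces to a posterior-weighted vote (the dictionary names below are illustrative).

```python
# Sketch: Bayes optimal classification for the example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # deterministic h_i(x)

def bayes_optimal(values=("+", "-")):
    # P(v_j | h_i) is 1 if h_i predicts v_j, else 0, so the sum is a weighted vote.
    vote = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
            for v in values}
    return max(vote, key=vote.get), vote

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```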
Gibbs Classifier

The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.

Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h | D)
2. Use this to classify the new instance

Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:
$$E[error_{Gibbs}] \le 2\,E[error_{BayesOptimal}]$$

Suppose a correct, uniform prior distribution over H. Then:
• Pick any hypothesis from the version space, with uniform probability
• Its expected error is no worse than twice that of the Bayes optimal classifier
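A minimal sketch of the Gibbs classifier, reusing the illustrative posteriors and predictions dictionaries from the Bayes-optimal sketch above.

```python
# Sketch: sample one hypothesis according to P(h | D), then classify with it.
import random

def gibbs_classify(posteriors, predictions):
    hs = list(posteriors)
    h = random.choices(hs, weights=[posteriors[h] for h in hs], k=1)[0]
    return predictions[h]   # classification by the single sampled hypothesis
```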
Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.

When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given the classification

Successful applications:
• Diagnosis
• Classifying text documents
Naive Bayes Classifier

Assume a target function $f: X \to V$, where each instance x is described by attributes $\langle a_1, a_2, \ldots, a_n \rangle$.

Most probable value of f(x) is:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$$

Naive Bayes assumption:
$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

which gives the Naive Bayes classifier:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value $v_j$
    $\hat{P}(v_j) \leftarrow$ estimate $P(v_j)$
    For each attribute value $a_i$ of each attribute $a$
      $\hat{P}(a_i \mid v_j) \leftarrow$ estimate $P(a_i \mid v_j)$

Classify_New_Instance(x)
$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$$
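A minimal sketch of the two procedures above in Python, using plain frequency counts for the estimates $\hat{P}(v_j)$ and $\hat{P}(a_i \mid v_j)$ (no m-estimate smoothing); the data layout — each example as (attribute tuple, target value) — is an assumption for illustration.

```python
# Sketch of Naive_Bayes_Learn / Classify_New_Instance with frequency estimates.
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)           # (attr_index, v_j) -> counts of a_i
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, v)][a] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    def cond(i, a, v):                           # estimate of P(a_i | v_j)
        return attr_counts[(i, v)][a] / class_counts[v]
    return priors, cond

def classify_new_instance(x, priors, cond):
    def score(v):
        p = priors[v]
        for i, a in enumerate(x):
            p *= cond(i, a, v)
        return p
    return max(priors, key=score)                # v_NB
```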