Learning Objectives

At the end of the class you should be able to:
◮ derive Bayesian learning from first principles
◮ explain how the Beta and Dirichlet distributions are used for Bayesian learning.

© D. Poole and A. Mackworth 2019, Artificial Intelligence, Lecture 10.4
Model Averaging (Bayesian Learning)

We want to predict the output Y of a new case that has input X = x, given the training examples e:

P(Y | x ∧ e) = Σ_{m ∈ M} P(Y ∧ m | x ∧ e)
             = Σ_{m ∈ M} P(Y | m ∧ x ∧ e) × P(m | x ∧ e)
             = Σ_{m ∈ M} P(Y | m ∧ x) × P(m | e)

M is a set of mutually exclusive and covering models (hypotheses).
What assumptions are made here?
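A minimal sketch of this model averaging in Python, assuming a small finite hypothesis class M. The candidate models (coin biases), the uniform prior, and the example data are illustrative, not from the lecture:

```python
# Each model m asserts "P(y) = m" for a Boolean variable Y.
models = [0.2, 0.5, 0.8]          # hypothetical hypothesis class M
prior = {m: 1 / 3 for m in models}  # uniform prior P(m)

def likelihood(m, e):
    """P(e | m) for i.i.d. Boolean examples e (True = y, False = ¬y)."""
    p = 1.0
    for ei in e:
        p *= m if ei else (1 - m)
    return p

def predict(e):
    """P(y | e) = sum over m of P(y | m) * P(m | e)."""
    unnorm = {m: likelihood(m, e) * prior[m] for m in models}
    z = sum(unnorm.values())                  # P(e)
    return sum(m * (w / z) for m, w in unnorm.items())

e = [True, True, False]           # two instances of y, one of ¬y
print(predict(e))                 # model-averaged P(y | e), about 0.60
```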
Learning Under Uncertainty

The posterior probability of a model m given examples e:

P(m | e) = P(e | m) × P(m) / P(e)

The likelihood, P(e | m), is the probability that model m would have produced examples e.
The prior, P(m), encodes the learning bias.
P(e) is a normalizing constant so that the probabilities of the models sum to 1.
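A tiny sketch of Bayes' rule at the level of models; the likelihood and prior values below are made-up numbers, used only to show how P(e) normalizes the posterior:

```python
likelihood = {"m1": 0.02, "m2": 0.10, "m3": 0.001}   # P(e | m), assumed values
prior      = {"m1": 0.5,  "m2": 0.3,  "m3": 0.2}     # P(m), the learning bias

p_e = sum(likelihood[m] * prior[m] for m in prior)   # normalizer P(e)
posterior = {m: likelihood[m] * prior[m] / p_e for m in prior}

print(posterior)          # P(m | e); the values sum to 1
```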
Plate Notation

Examples e = [e_1, ..., e_k] are independent and identically distributed (i.i.d.) given m if

P(e | m) = ∏_{i=1}^{k} P(e_i | m)

[Figure: two equivalent Bayesian networks — one with m as the parent of e_1, e_2, ..., e_k, and the plate-notation version with m as the parent of e_i inside a plate indexed by i.]
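A short sketch of the i.i.d. factorization, computed in log space for numerical stability; the per-example probability function below is a placeholder assumption, not part of the slides:

```python
import math

def log_likelihood(examples, p_example):
    """log P(e | m) = sum over i of log P(e_i | m), under the i.i.d. assumption."""
    return sum(math.log(p_example(ei)) for ei in examples)

# e.g. a model that assigns P(y) = 0.7 to each Boolean example
p_example = lambda ei: 0.7 if ei else 0.3
print(math.exp(log_likelihood([True, False, True], p_example)))  # P(e | m)
```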
Bayesian Learning of Probabilities

Y has two outcomes, y and ¬y. We want the probability of y given training examples e.
We can treat the probability of y as a real-valued random variable on the interval [0, 1], called φ.
Bayes' rule gives:

P(φ=p | e) = P(e | φ=p) × P(φ=p) / P(e)

Suppose e is a sequence of n_1 instances of y and n_0 instances of ¬y:

P(e | φ=p) = p^{n_1} × (1 − p)^{n_0}

Uniform prior: P(φ=p) = 1 for all p ∈ [0, 1].
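A grid-based sketch of this posterior under the uniform prior; the counts n1, n0 and the grid resolution are arbitrary choices for illustration:

```python
import numpy as np

n1, n0 = 2, 1                          # illustrative counts of y and ¬y
p = np.linspace(0, 1, 1001)
unnorm = p**n1 * (1 - p)**n0           # P(e | φ=p) × P(φ=p), with P(φ=p) = 1
dp = p[1] - p[0]
posterior = unnorm / (unnorm.sum() * dp)   # normalize so it integrates to 1

print(p[np.argmax(posterior)])         # MAP estimate ≈ n1 / (n0 + n1) = 2/3
print((p * posterior).sum() * dp)      # expected value ≈ (n1+1)/(n0+n1+2) = 0.6
```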
Posterior Probabilities of φ for Different Training Examples (Beta distribution)

[Figure: posterior densities P(φ | e) plotted over φ ∈ [0, 1] for (n_0, n_1) = (0, 0), (1, 2), (2, 4), and (4, 8); the curves become more sharply peaked as the counts grow.]
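The curves above can be reproduced with a few lines, assuming SciPy and Matplotlib are available: with a uniform prior, P(φ | e) is the Beta(n_1 + 1, n_0 + 1) density.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

p = np.linspace(0, 1, 500)
for n0, n1 in [(0, 0), (1, 2), (2, 4), (4, 8)]:
    # SciPy's first shape parameter is the exponent on p, so a = n1 + 1
    plt.plot(p, beta.pdf(p, n1 + 1, n0 + 1), label=f"n0={n0}, n1={n1}")
plt.xlabel("φ")
plt.ylabel("P(φ | e)")
plt.legend()
plt.show()
```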
MAP Model

The maximum a posteriori probability (MAP) model is the model m that maximizes P(m | e). That is, it maximizes

P(e | m) × P(m)

and thus minimizes

(− log P(e | m)) + (− log P(m)),

which is the number of bits to send the examples e given the model m, plus the number of bits to send the model m.
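A small sketch of MAP selection phrased in bits; the two candidate models and their likelihoods and priors are invented for illustration:

```python
import math

# Each candidate maps to an assumed pair (P(e | m), P(m)).
candidates = {"simple": (0.05, 0.7), "complex": (0.20, 0.3)}

def description_length(p_e_given_m, p_m):
    """Bits to send the examples given m, plus bits to send m."""
    return -math.log2(p_e_given_m) - math.log2(p_m)

map_model = min(candidates, key=lambda m: description_length(*candidates[m]))
print(map_model)   # the model minimizing the total description length
```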
Averaging Over Models

Idea: rather than choosing the most likely model, average over all models, weighted by their posterior probabilities given the examples.
If you have observed a sequence of n_1 instances of y and n_0 instances of ¬y, with a uniform prior:
◮ the most likely value (MAP) is n_1 / (n_0 + n_1)
◮ the expected value is (n_1 + 1) / (n_0 + n_1 + 2)
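A sketch contrasting the two point estimates; the counts passed in are illustrative (the expected-value formula is Laplace's rule of succession):

```python
def map_estimate(n1, n0):
    """Most likely value of φ under a uniform prior (undefined when n0 = n1 = 0)."""
    return n1 / (n0 + n1)

def expected_value(n1, n0):
    """Expected value of φ under a uniform prior."""
    return (n1 + 1) / (n0 + n1 + 2)

print(map_estimate(8, 4), expected_value(8, 4))   # ≈ 0.667 vs ≈ 0.643
```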
Beta Distribution

Beta_{α_0,α_1}(p) = (1/K) × p^{α_1 − 1} × (1 − p)^{α_0 − 1}

where K is a normalizing constant and α_i > 0.
The uniform distribution on [0, 1] is Beta_{1,1}.
The expected value is α_1 / (α_0 + α_1).
If the prior probability of a Boolean variable is Beta_{α_0,α_1}, the posterior distribution after observing n_1 true cases and n_0 false cases is Beta_{α_0+n_0, α_1+n_1}.
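A sketch of the conjugate update; the prior pseudo-counts and observed counts are illustrative. Note that SciPy's Beta parameterization puts the exponent on p first, so its first argument corresponds to α_1 here:

```python
from scipy.stats import beta

a0, a1 = 1, 1                  # Beta(1,1) is the uniform prior
n1, n0 = 8, 4                  # assumed observed true / false cases

post_a0, post_a1 = a0 + n0, a1 + n1          # posterior is Beta(α0+n0, α1+n1)
print(post_a1 / (post_a0 + post_a1))         # expected value α1 / (α0 + α1) ≈ 0.643
print(beta.mean(post_a1, post_a0))           # same value via SciPy (a = α1, b = α0)
```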
Dirichlet Distribution

Suppose Y has k values. The Dirichlet distribution has two sorts of parameters:
◮ positive counts α_1, ..., α_k, where α_i is one more than the count of the i-th outcome
◮ probability parameters p_1, ..., p_k, where p_i is the probability of the i-th outcome

Dirichlet_{α_1,...,α_k}(p_1, ..., p_k) = (1/K) × ∏_{j=1}^{k} p_j^{α_j − 1}

where K is a normalizing constant.
The expected value of the i-th outcome is α_i / Σ_j α_j.
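A sketch of the corresponding Dirichlet update and expected values; the prior pseudo-counts and observed counts are illustrative:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # uniform prior over 3 outcomes
counts = np.array([5, 2, 1])          # assumed observed counts of each outcome

posterior_alpha = alpha + counts      # conjugate update: add counts to pseudo-counts
expected = posterior_alpha / posterior_alpha.sum()   # α_i / Σ_j α_j
print(expected)                       # ≈ [0.545, 0.273, 0.182]
```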
Hierarchical Bayesian Model

Where do the priors come from?
Example: S_XH is true when patient X is sick in hospital H. We want to learn the probability of Sick for each hospital. Where do the prior probabilities for the hospitals come from?

[Figure: (a) plate notation — shared hyperparameters α_1 and α_2 are parents of φ_H, which is a parent of S_XH inside plates over X and H; (b) the same model unrolled, with α_1 and α_2 as parents of φ_1, φ_2, ..., φ_k, and each φ_h a parent of that hospital's S variables (S_11, S_12, S_21, S_22, ..., S_1k).]
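A much-simplified sketch of the hierarchical idea: each hospital's φ_H shares a Beta(α_1, α_2) prior. A full Bayesian treatment would also infer α_1 and α_2; here they are fixed by hand (an empirical-Bayes-style shortcut), and all counts are invented:

```python
import numpy as np

sick     = np.array([3, 0, 40])       # assumed sick patients per hospital
patients = np.array([10, 2, 200])     # assumed total patients per hospital

a1, a2 = 2.0, 8.0                     # shared prior pseudo-counts (assumed, not inferred)

# Posterior mean of φ_H for each hospital: the observed rate is shrunk toward
# the shared prior mean α1 / (α1 + α2) = 0.2, most strongly for small hospitals.
phi = (sick + a1) / (patients + a1 + a2)
print(phi)                            # ≈ [0.25, 0.167, 0.2]
```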