Conditional Probability Estimation

Marco Cattaneo
School of Mathematics and Physical Sciences, University of Hull

PGM 2016, Lugano, Switzerland, 7 September 2016
MLE of conditional probability

◮ given: a probabilistic model P_θ with unknown θ, past data D, and events E, Q concerning some new (independent) data

◮ MLE of P_θ(Q | E) = P_θ(Q | D ∩ E):

      P_{θ̂_D}(Q | E)      with  θ̂_D = argmax_θ P_θ(D)            (wrong)

      P_{θ̂_{D∩E}}(Q | E)  with  θ̂_{D∩E} = argmax_θ P_θ(D ∩ E)    (right)

◮ when P_θ is a (generalized) regression model, and E, Q describe predictors and response, respectively, then there is no difference between (right) and (wrong)

◮ when P_θ is a Bayesian network, D is a training dataset, and E, Q concern some new instances, then the usual MLE is (wrong), and this partially explains the unsatisfactory performance of MLE for Bayesian networks

Marco Cattaneo @ University of Hull Conditional Probability Estimation 2/7
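The (wrong)/(right) distinction can be made concrete with a toy model not taken from the slides: a coin with unknown bias θ, past data D = 7 heads in 10 flips, and two further independent flips, with E = "at least one head" and Q = "both heads". The grid search below is a minimal sketch under these hypothetical numbers; maximizing P_θ(D) alone gives θ̂_D = 0.7, while maximizing P_θ(D ∩ E) gives a slightly larger value, so the two plug-in conditional probabilities differ.

```python
def lik_D(t):
    # likelihood of the past data D: 7 heads and 3 tails
    return t ** 7 * (1 - t) ** 3

def p_E(t):
    # P_theta(E): at least one head in the two new flips
    return 1 - (1 - t) ** 2

def cond_Q_given_E(t):
    # P_theta(Q | E): both heads, given at least one head
    return t ** 2 / p_E(t)

# fine grid over the open interval (0, 1)
grid = [i / 100000 for i in range(1, 100000)]

# (wrong): maximize P_theta(D) only, then plug into the conditional
theta_wrong = max(grid, key=lik_D)

# (right): maximize P_theta(D ∩ E) = P_theta(D) * P_theta(E),
# since the new flips are independent of D
theta_right = max(grid, key=lambda t: lik_D(t) * p_E(t))

print(theta_wrong, cond_Q_given_E(theta_wrong))
print(theta_right, cond_Q_given_E(theta_right))
```

Conditioning on E rewards larger θ (which makes E more likely), so θ̂_{D∩E} exceeds θ̂_D here; the gap is small in this example but nonzero, which is the whole point of the slide.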
conditional probability estimation in Bayesian networks

◮ given: a DAG with vertices v ∈ V representing categorical variables X_v, a complete training dataset D with counts n(·), and conjugate Dirichlet priors with parameters d(·)

◮ estimates of local probability models:

      p̂_D(x_v | x_{pa(v)}) = n(x_v, x_{pa(v)}) / n(x_{pa(v)})                                          (ML)

      p̂_D(x_v | x_{pa(v)}) = [n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})] / [n(x_{pa(v)}) + d(x_{pa(v)})]   (Bayes)

◮ estimates of probabilities concerning a new instance:

      p̂_D(x_Q) = Σ_{x_{V∖Q}} Π_{v∈V} p̂_D(x_v | x_{pa(v)})
               = Σ_{x_{V∖Q}} Π_{v∈V} n(x_v, x_{pa(v)}) / n(x_{pa(v)})                                   (ML)

      p̂_D(x_Q) = Σ_{x_{V∖Q}} Π_{v∈V} p̂_D(x_v | x_{pa(v)})
               = Σ_{x_{V∖Q}} Π_{v∈V} [n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})] / [n(x_{pa(v)}) + d(x_{pa(v)})]   (Bayes)
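As a sketch of the local estimates, take a hypothetical two-node network Y → X (so pa(X) = {Y}) with a handful of made-up records; the (ML) and (Bayes) formulas are then just count ratios, with the Laplace choice d(x_v, x_{pa(v)}) = 1:

```python
from collections import Counter

# toy complete dataset for the network Y -> X; records are (y-value, x-value)
data = [("y", "x"), ("y", "x"), ("y", "~x"), ("~y", "~x"), ("~y", "~x")]

n_pa = Counter(y for y, _ in data)   # n(x_pa(v)): counts of the parent value
n_joint = Counter(data)              # n(x_v, x_pa(v)): joint counts

def ml(x, y):
    # (ML): n(x_v, x_pa(v)) / n(x_pa(v))
    return n_joint[(y, x)] / n_pa[y]

def bayes(x, y, d=1.0):
    # (Bayes) with a symmetric Dirichlet prior, d(x_v, x_pa(v)) = d;
    # X is binary, so d(x_pa(v)) = 2 * d
    return (n_joint[(y, x)] + d) / (n_pa[y] + 2 * d)

print(ml("x", "y"), bayes("x", "y"))
```

With these counts, n(x, y) = 2 and n(y) = 3, so the ML estimate is 2/3 while the Bayes-Laplace estimate is (2+1)/(3+2) = 0.6, shrunk toward 1/2.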
conditional probability estimation in Bayesian networks

◮ estimates of conditional probabilities concerning a new instance:

      p̂_{D,x_E}(x_Q | x_E) = [ Σ_{x_{V∖(Q∪E)}} Π_{v∈V} p̂_D(x_v | x_{pa(v)}) ] / [ Σ_{x_{V∖Q}} Π_{v∈V} p̂_D(x_v | x_{pa(v)}) ]
                           = [ Σ_{x_{V∖(Q∪E)}} Π_{v∈V} n(x_v, x_{pa(v)}) / n(x_{pa(v)}) ] / [ Σ_{x_{V∖Q}} Π_{v∈V} n(x_v, x_{pa(v)}) / n(x_{pa(v)}) ]   (wrong ML)

      p̂_{D,x_E}(x_Q | x_E) = [ Σ_{x_{V∖(Q∪E)}} Π_{v∈V} p̂_{D,x_E}(x_v | x_{pa(v)}) ] / [ Σ_{x_{V∖Q}} Π_{v∈V} p̂_{D,x_E}(x_v | x_{pa(v)}) ]
                           = [ Σ_{x_{V∖(Q∪E)}} Π_{v∈V} (n(x_v, x_{pa(v)}) + ê_{D,x_E}(x_v, x_{pa(v)})) / (n(x_{pa(v)}) + ê_{D,x_E}(x_{pa(v)})) ] / [ Σ_{x_{V∖Q}} Π_{v∈V} (n(x_v, x_{pa(v)}) + ê_{D,x_E}(x_v, x_{pa(v)})) / (n(x_{pa(v)}) + ê_{D,x_E}(x_{pa(v)})) ]   (ML)

      p̂_{D,x_E}(x_Q | x_E) = [ Σ_{x_{V∖(Q∪E)}} Π_{v∈V} p̂_D(x_v | x_{pa(v)}) ] / [ Σ_{x_{V∖Q}} Π_{v∈V} p̂_D(x_v | x_{pa(v)}) ]
                           = [ Σ_{x_{V∖(Q∪E)}} Π_{v∈V} (n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})) / (n(x_{pa(v)}) + d(x_{pa(v)})) ] / [ Σ_{x_{V∖Q}} Π_{v∈V} (n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})) / (n(x_{pa(v)}) + d(x_{pa(v)})) ]   (Bayes)

◮ ê_{D,x_E}(·) are the MLE of the expected counts for the new instance, obtained from the EM algorithm
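For intuition, here is a minimal sketch of one way ê_{D,x_E}(·) can be computed (the counts and the naive Bayes structure Y → X_1, Y → X_2 are hypothetical, not taken from the talk): the new instance is treated as one extra record in which Y is missing, and EM alternates between re-estimating the local models from n(·) + ê(·) and recomputing ê(y) as the posterior of Y given the observed evidence.

```python
# counts n(.) from a complete training dataset D of size 100 (made-up numbers)
n_y = {0: 40, 1: 60}                                   # n(y)
n_x1y = {(0, 0): 36, (1, 0): 4, (0, 1): 6, (1, 1): 54} # n(x1, y)
n_x2y = {(0, 0): 30, (1, 0): 10, (0, 1): 12, (1, 1): 48}  # n(x2, y)

x1_obs, x2_obs = 1, 0          # evidence x_E on the new instance; Y is unobserved
N = sum(n_y.values()) + 1      # D plus the one incomplete new record

e = {0: 0.0, 1: 0.0}           # expected counts ê(y) for the new instance
for _ in range(200):
    post = {}
    for y in (0, 1):
        # M-step: local models from counts n(.) plus expected counts ê(.);
        # the observed x1_obs, x2_obs contribute ê(y) to the matching joint counts
        p_y = (n_y[y] + e[y]) / N
        p_x1 = (n_x1y[(x1_obs, y)] + e[y]) / (n_y[y] + e[y])
        p_x2 = (n_x2y[(x2_obs, y)] + e[y]) / (n_y[y] + e[y])
        post[y] = p_y * p_x1 * p_x2
    # E-step: ê(y) = P(y | x1_obs, x2_obs) under the current parameters
    z = post[0] + post[1]
    e = {y: post[y] / z for y in (0, 1)}

# at convergence, ê(y) is itself the incomplete-ML estimate of p(y | x_E)
print(e)
```

Note the self-referential flavor: the evidence x_E enters the counts used to estimate the very conditional probability being queried, which is exactly what separates (ML) from (wrong ML) on this slide.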
performance comparison: √MSE

◮ given: 3 binary variables X_1, X_2, Y with X_1 ⊥ X_2 | Y and p(x_1 | y) = p(¬x_1 | ¬y) = 99%, while p(¬x_2 | y) = p(x_2 | ¬y) = 99%

◮ estimate p(y | x_1, x_2) on the basis of a complete training dataset of size 100:

[figure: √MSE of the estimators as a function of p(y), together with the probability that the incomplete ML estimate exists; curves: incomplete ML (when it exists), complete ML (when incomplete ML exists), Bayes-Laplace (when incomplete ML exists), complete ML (unconditional), Bayes-Laplace (unconditional)]
performance comparison: √MSE

◮ given: 3 binary variables X_1, X_2, Y with X_1 ⊥ X_2 | Y and p(x_1 | y) = p(¬x_1 | ¬y) = 99%, while p(¬x_2 | y) = p(x_2 | ¬y) = 90%

◮ estimate p(y | x_1, x_2) on the basis of a complete training dataset of size 100:

[figure: √MSE of the estimators as a function of p(y), together with the probability that the incomplete ML estimate exists; curves: incomplete ML (when it exists), complete ML (when incomplete ML exists), Bayes-Laplace (when incomplete ML exists), complete ML (unconditional), Bayes-Laplace (unconditional)]
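The comparison can be reproduced in spirit with a small Monte Carlo sketch (all implementation details below are assumptions, not the authors' code): repeatedly sample complete training sets of size 100 from this scenario, form the complete-data ML (d = 0) and Bayes-Laplace (d = 1) plug-in estimates of p(y | x_1, x_2), and average the squared errors against the true conditional probability.

```python
import random

random.seed(1)

P_X1 = {True: 0.99, False: 0.01}  # p(X1=1 | y) and p(X1=1 | ~y)
P_X2 = {True: 0.10, False: 0.90}  # p(X2=1 | y) and p(X2=1 | ~y)

def true_cond(p_y):
    # exact p(y | X1=1, X2=1) by Bayes' rule
    num = p_y * P_X1[True] * P_X2[True]
    return num / (num + (1 - p_y) * P_X1[False] * P_X2[False])

def sample(p_y, n=100):
    # one complete training dataset of size n
    out = []
    for _ in range(n):
        y = random.random() < p_y
        out.append((y, random.random() < P_X1[y], random.random() < P_X2[y]))
    return out

def bn_estimate(data, d):
    # plug-in estimate of p(y | X1=1, X2=1); d = 0 is complete ML, d = 1 is Bayes-Laplace
    n = len(data)
    counts = {True: [0, 0, 0], False: [0, 0, 0]}  # per y: [n(y), n(x1=1, y), n(x2=1, y)]
    for y, x1, x2 in data:
        c = counts[y]
        c[0] += 1
        c[1] += x1
        c[2] += x2
    def joint(y):
        ny, nx1, nx2 = counts[y]
        return ((ny + d) / (n + 2 * d)) * ((nx1 + d) / (ny + 2 * d)) * ((nx2 + d) / (ny + 2 * d))
    try:
        num, den = joint(True), joint(True) + joint(False)
        return num / den if den > 0 else None  # the ML estimate can be undefined
    except ZeroDivisionError:
        return None

def root_mse(p_y, d, reps=300):
    t = true_cond(p_y)
    errs = [(e - t) ** 2
            for e in (bn_estimate(sample(p_y), d) for _ in range(reps))
            if e is not None]
    return (sum(errs) / len(errs)) ** 0.5

print(root_mse(0.5, 0), root_mse(0.5, 1))  # complete ML vs Bayes-Laplace at p(y) = 0.5
```

Sweeping p_y over a grid and adding the incomplete-ML (EM-based) estimator would recover curves of the kind shown in the figure; the sketch above covers only the two complete-data estimators.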
conclusion

◮ the following way of using Bayesian networks is in agreement with Bayes estimation, but not with ML estimation: estimate the local probability models of a Bayesian network from data, and then use the resulting global model to calculate conditional probabilities of future events