

  1. Bayesian Networks Part 2 CS 760@UW-Madison

  2. Goals for the lecture
you should understand the following concepts
• the parameter learning task for Bayes nets
• the structure learning task for Bayes nets
• maximum likelihood estimation
• Laplace estimates
• m-estimates
• missing data in machine learning
  • hidden variables
  • missing at random
  • missing systematically
• the EM approach to imputing missing values in Bayes net parameter learning
• the Chow-Liu algorithm for structure search

  3. Learning Bayes Networks: Parameters

  4. The parameter learning task
• Given: a set of training instances and the graph structure of a BN (the alarm network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)

B E A J M
f f f t f
f t f f f
f f t f t
…

• Do: infer the parameters of the CPDs

  5. The structure learning task
• Given: a set of training instances

B E A J M
f f f t f
f t f f f
f f t f t
…

• Do: infer the graph structure (and perhaps the parameters of the CPDs too)

  6. Parameter learning and MLE
• maximum likelihood estimation (MLE)
  • given a model structure (e.g. a Bayes net graph) G and a set of data D
  • set the model parameters θ to maximize P(D | G, θ)
  • i.e. make the data D look as likely as possible under the model

  7. Maximum likelihood estimation review
consider trying to estimate the parameter θ (probability of heads) of a biased coin from a sequence of flips (1 stands for heads)

x = {1, 1, 1, 0, 1, 0, 0, 1, 0, 1}

the likelihood function for θ is given by:

L(θ) = θ^(#heads) (1 − θ)^(#tails) = θ^6 (1 − θ)^4

What's the MLE of the parameter?
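
As a small illustration (not from the slides), the Python sketch below evaluates this likelihood on a grid and checks that the maximizer matches the closed-form MLE, the fraction of heads; the names `flips` and `likelihood` are my own.

```python
# Minimal sketch (my own, not from the lecture): MLE for a biased coin.
flips = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]   # the flip sequence from the slide (1 = heads)

n_heads = sum(flips)
n_tails = len(flips) - n_heads

def likelihood(theta):
    """L(theta) = theta^n_heads * (1 - theta)^n_tails"""
    return theta ** n_heads * (1 - theta) ** n_tails

# Closed-form MLE: fraction of heads.
theta_mle = n_heads / len(flips)

# Sanity check against a coarse grid search over theta.
grid = [i / 1000 for i in range(1001)]
theta_grid = max(grid, key=likelihood)

print(theta_mle, theta_grid)   # 0.6 0.6
```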

  8. MLE in a Bayes net

L(θ : D, G) = P(D | G, θ)
            = ∏_{d ∈ D} P(x_1^(d), x_2^(d), …, x_n^(d))
            = ∏_{d ∈ D} ∏_i P(x_i^(d) | Parents(x_i^(d)))
            = ∏_i [ ∏_{d ∈ D} P(x_i^(d) | Parents(x_i^(d))) ]

  9. MLE in a Bayes net

L(θ : D, G) = P(D | G, θ)
            = ∏_{d ∈ D} P(x_1^(d), x_2^(d), …, x_n^(d))
            = ∏_{d ∈ D} ∏_i P(x_i^(d) | Parents(x_i^(d)))
            = ∏_i [ ∏_{d ∈ D} P(x_i^(d) | Parents(x_i^(d))) ]

the last factorization gives an independent parameter learning problem for each CPD

  10. Maximum likelihood estimation
now consider estimating the CPD parameters for B and J in the alarm network given the following data set

B E A J M
f f f t f
f t f f f
f f f t t
t f f f t
f f t t f
f f t f t
f f t t t
f f t t t

P(b) = 1/8 = 0.125
P(¬b) = 7/8 = 0.875

  11. Maximum likelihood estimation
now consider estimating the CPD parameters for B and J in the alarm network given the following data set

B E A J M
f f f t f
f t f f f
f f f t t
t f f f t
f f t t f
f f t f t
f f t t t
f f t t t

P(b) = 1/8 = 0.125
P(¬b) = 7/8 = 0.875

P(j | a) = 3/4 = 0.75
P(¬j | a) = 1/4 = 0.25
P(j | ¬a) = 2/4 = 0.5
P(¬j | ¬a) = 2/4 = 0.5
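
To make the counting explicit, here is a hedged Python sketch (my own illustration; `data` and `mle` are hypothetical names) that recovers the estimates above: for each node, MLE just counts occurrences of the child value within each configuration of its parents.

```python
# Sketch: MLE for Bayes-net CPDs is counting within each parent configuration.
from collections import Counter

# Data set from the slide, as (B, E, A, J, M) tuples with 't'/'f' values.
data = [
    ('f', 'f', 'f', 't', 'f'),
    ('f', 't', 'f', 'f', 'f'),
    ('f', 'f', 'f', 't', 't'),
    ('t', 'f', 'f', 'f', 't'),
    ('f', 'f', 't', 't', 'f'),
    ('f', 'f', 't', 'f', 't'),
    ('f', 'f', 't', 't', 't'),
    ('f', 'f', 't', 't', 't'),
]
B, E, A, J, M = range(5)

def mle(child, parents, data):
    """Return MLE estimates P(child = 't' | parent values) for every
    observed parent configuration (pure counting, no smoothing)."""
    child_counts, parent_counts = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        parent_counts[key] += 1
        if row[child] == 't':
            child_counts[key] += 1
    return {key: child_counts[key] / parent_counts[key] for key in parent_counts}

print(mle(B, [], data))    # {(): 0.125}                  -> P(b) = 1/8
print(mle(J, [A], data))   # {('f',): 0.5, ('t',): 0.75}  -> P(j|¬a), P(j|a)
```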

  12. Maximum likelihood estimation
suppose instead, our data set was this…

B E A J M
f f f t f
f t f f f
f f f t t
f f f f t
f f t t f
f f t f t
f f t t t
f f t t t

P(b) = 0/8 = 0
P(¬b) = 8/8 = 1

do we really want to set this to 0?

  13. Laplace estimates
• instead of estimating parameters strictly from the data, we could start with some prior belief for each
• for example, we could use Laplace estimates, which add a pseudocount of 1 to each value:

P(X = x) = (n_x + 1) / Σ_{v ∈ Values(X)} (n_v + 1)

• where n_v represents the number of occurrences of value v
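
A minimal Python sketch of the Laplace estimate (my own illustration; the function name is hypothetical), applied to the B counts from the data set where b never occurs:

```python
# Sketch: Laplace (add-one) estimate for a discrete variable.
def laplace_estimate(counts):
    """counts: dict mapping each value of X to its observed count n_v.
    Returns P(X = x) = (n_x + 1) / sum_v (n_v + 1)."""
    total = sum(n + 1 for n in counts.values())
    return {x: (n + 1) / total for x, n in counts.items()}

# B was true in 0 of 8 instances; the estimate is no longer exactly 0.
print(laplace_estimate({'t': 0, 'f': 8}))   # {'t': 0.1, 'f': 0.9}
```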

  14. M-estimates
a more general form: m-estimates

P(X = x) = (n_x + p_x m) / ((Σ_{v ∈ Values(X)} n_v) + m)

• p_x : prior probability of value x
• m : number of “virtual” instances

  15. M-estimates example
now let’s estimate parameters for B using m = 4 and p_b = 0.25 (so p_¬b = 0.75)

B E A J M
f f f t f
f t f f f
f f f t t
f f f f t
f f t t f
f f t f t
f f t t t
f f t t t

P(b) = (0 + 0.25 × 4) / (8 + 4) = 1/12 ≈ 0.08
P(¬b) = (8 + 0.75 × 4) / (8 + 4) = 11/12 ≈ 0.92
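
The same computation in a short Python sketch (my own, with hypothetical names), reproducing the numbers above:

```python
# Sketch: m-estimate -- blend observed counts with m "virtual" prior instances.
def m_estimate(n_x, n_total, p_x, m):
    """P(X = x) = (n_x + p_x * m) / (n_total + m)"""
    return (n_x + p_x * m) / (n_total + m)

# Estimates for B with m = 4, p_b = 0.25 on the 8-instance data set (b never observed):
print(m_estimate(0, 8, 0.25, 4))   # 0.0833... ~ 0.08
print(m_estimate(8, 8, 0.75, 4))   # 0.9166... ~ 0.92
```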

  16. EM Algorithm

  17. Missing data
• Commonly in machine learning tasks, some feature values are missing
• some variables may not be observable (i.e. hidden) even for training instances
• values for some variables may be missing at random: what caused the data to be missing does not depend on the missing data itself
  • e.g. someone accidentally skips a question on a questionnaire
  • e.g. a sensor fails to record a value due to a power blip
• values for some variables may be missing systematically: the probability of a value being missing depends on the value
  • e.g. a medical test result is missing because a doctor was fairly sure of a diagnosis given earlier test results
  • e.g. the graded exams that go missing on the way home from school are those with poor scores

  18. Missing data
• hidden variables; values missing at random
  • these are the cases we’ll focus on
  • one solution: try to impute the values
• values missing systematically
  • it may be sensible to represent “missing” as an explicit feature value

  19. Imputing missing data with EM
Given:
• data set with some missing values
• model structure, initial model parameters
Repeat until convergence
• Expectation (E) step: using the current model, compute the expectation over the missing values
• Maximization (M) step: update the model parameters with those that maximize the probability of the data (MLE or MAP)

  20. Example: EM for parameter learning
suppose we’re given the following initial BN and training set

initial parameters:

P(B) = 0.1    P(E) = 0.2

B E | P(A)
t t | 0.9
t f | 0.6
f t | 0.3
f f | 0.2

A | P(J)      A | P(M)
t | 0.9       t | 0.8
f | 0.2       f | 0.1

training set (all values of A are missing):

B E A J M
f f ? f f
f f ? t f
t f ? t t
f f ? f t
f t ? t f
f f ? f t
t t ? t t
f f ? f f
f f ? t f
f f ? f t

  21. Example: E-step
using the current model, compute the posterior over the missing value of A for each training instance

B E A J M      P(A | observed values)
f f ? f f      t: 0.0069  f: 0.9931
f f ? t f      t: 0.2     f: 0.8
t f ? t t      t: 0.98    f: 0.02
f f ? f t      t: 0.2     f: 0.8
f t ? t f      t: 0.3     f: 0.7
f f ? f t      t: 0.2     f: 0.8
t t ? t t      t: 0.997   f: 0.003
f f ? f f      t: 0.0069  f: 0.9931
f f ? t f      t: 0.2     f: 0.8
f f ? f t      t: 0.2     f: 0.8

  22. Example: E-step
P(a | ¬b, ¬e, ¬j, ¬m)
  = P(¬b, ¬e, a, ¬j, ¬m) / [ P(¬b, ¬e, a, ¬j, ¬m) + P(¬b, ¬e, ¬a, ¬j, ¬m) ]
  = (0.9 × 0.8 × 0.2 × 0.1 × 0.2) / (0.9 × 0.8 × 0.2 × 0.1 × 0.2 + 0.9 × 0.8 × 0.8 × 0.8 × 0.9)
  = 0.00288 / 0.4176
  = 0.0069

P(a | ¬b, ¬e, j, ¬m)
  = P(¬b, ¬e, a, j, ¬m) / [ P(¬b, ¬e, a, j, ¬m) + P(¬b, ¬e, ¬a, j, ¬m) ]
  = (0.9 × 0.8 × 0.2 × 0.9 × 0.2) / (0.9 × 0.8 × 0.2 × 0.9 × 0.2 + 0.9 × 0.8 × 0.8 × 0.2 × 0.9)
  = 0.02592 / 0.1296
  = 0.2
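
The Python sketch below (my own illustration; names such as `posterior_a` are hypothetical) reproduces these E-step posteriors from the initial parameters of the example: score the joint with A = t and A = f and normalize.

```python
# Sketch: E-step for the alarm network -- posterior over the missing A value.
# Initial parameters from the example ('t'/'f' values).
p_b = {'t': 0.1, 'f': 0.9}
p_e = {'t': 0.2, 'f': 0.8}
p_a = {('t', 't'): 0.9, ('t', 'f'): 0.6, ('f', 't'): 0.3, ('f', 'f'): 0.2}  # P(A=t | B, E)
p_j = {'t': 0.9, 'f': 0.2}   # P(J=t | A)
p_m = {'t': 0.8, 'f': 0.1}   # P(M=t | A)

def bernoulli(p_true, value):
    return p_true if value == 't' else 1 - p_true

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) under the current parameters."""
    return (bernoulli(p_b['t'], b) * bernoulli(p_e['t'], e) *
            bernoulli(p_a[(b, e)], a) * bernoulli(p_j[a], j) * bernoulli(p_m[a], m))

def posterior_a(b, e, j, m):
    """P(A = t | b, e, j, m): normalize the joint over the two values of A."""
    num = joint(b, e, 't', j, m)
    return num / (num + joint(b, e, 'f', j, m))

print(round(posterior_a('f', 'f', 'f', 'f'), 4))   # 0.0069
print(round(posterior_a('f', 'f', 't', 'f'), 4))   # 0.2
print(round(posterior_a('t', 't', 't', 't'), 3))   # 0.997
```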

  23. Example: M-step
re-estimate probabilities using expected counts:

P(a | b, e) = E#(a, b, e) / E#(b, e)

P(a | b, e) = 0.997 / 1
P(a | b, ¬e) = 0.98 / 1
P(a | ¬b, e) = 0.3 / 1
P(a | ¬b, ¬e) = (0.0069 + 0.2 + 0.2 + 0.2 + 0.0069 + 0.2 + 0.2) / 7 ≈ 0.145

updated CPD:

B E | P(A)
t t | 0.997
t f | 0.98
f t | 0.3
f f | 0.145

re-estimate probabilities for P(J | A) and P(M | A) in the same way
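
To close the loop, here is a hedged sketch of the M-step (again my own illustration, not the lecture's code): the expected counts from the E-step take the place of the hard counts used in ordinary MLE. Alternating this with the E-step above until the parameters stop changing gives the full EM procedure of the "Imputing missing data with EM" slide.

```python
# Sketch: M-step -- re-estimate P(A | B, E) from expected counts.
from collections import defaultdict

# Each instance is ((b, e, j, m), p) where p = P(A = t | observed values)
# as computed in the E-step (values from the slide).
weighted_data = [
    (('f', 'f', 'f', 'f'), 0.0069),
    (('f', 'f', 't', 'f'), 0.2),
    (('t', 'f', 't', 't'), 0.98),
    (('f', 'f', 'f', 't'), 0.2),
    (('f', 't', 't', 'f'), 0.3),
    (('f', 'f', 'f', 't'), 0.2),
    (('t', 't', 't', 't'), 0.997),
    (('f', 'f', 'f', 'f'), 0.0069),
    (('f', 'f', 't', 'f'), 0.2),
    (('f', 'f', 'f', 't'), 0.2),
]

expected_a = defaultdict(float)   # E#(a, b, e): expected count of A = t per (B, E)
count_be = defaultdict(float)     # E#(b, e): number of instances per (B, E)

for (b, e, j, m), p in weighted_data:
    expected_a[(b, e)] += p
    count_be[(b, e)] += 1.0

p_a_new = {key: expected_a[key] / count_be[key] for key in count_be}
print({k: round(v, 3) for k, v in p_a_new.items()})
# ('f','f') -> 0.145, ('t','f') -> 0.98, ('f','t') -> 0.3, ('t','t') -> 0.997
```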
