  1. Learning Bayesian networks: given structure and completely observed data. Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani.

  2. Learning problem
  - Target: the true distribution $P^*$, which may correspond to a model $\mathcal{M}^* = \langle \mathcal{K}^*, \boldsymbol{\theta}^* \rangle$.
  - Hypothesis space: a specified family of probabilistic graphical models.
  - Data: a set of instances sampled from $P^*$.
  - Learning goal: select a model $\mathcal{M}$ that is the best approximation to $\mathcal{M}^*$ according to a performance metric.

  3. Learning tasks on graphical models
  - Parameter learning / structure learning
  - Completely observable / partially observable data
  - Directed model / undirected model

  4. Parameter learning in directed models: complete data
  - We assume that the structure of the model is known, and consider learning the parameters of a BN with a given structure.
  - Goal: estimate the CPDs from a dataset $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$ of $N$ independent, identically distributed (i.i.d.) training samples.
  - Each training sample $\boldsymbol{x}^{(n)} = (x_1^{(n)}, \ldots, x_L^{(n)})$ is a vector in which every element $x_i^{(n)}$ is known (no missing values, no hidden variables).

  5. Density estimation review
  - We use density estimation to solve this learning problem.
  - Density estimation: estimating the probability density function $p(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$ drawn from it.
  - Parametric methods: assume that $p(\boldsymbol{x})$ has a specific functional form with a number of adjustable parameters. Two estimators: MLE and the Bayesian estimate.
    - MLE: determine $\boldsymbol{\theta}^*$ given $\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$; MLE can suffer from overfitting.
    - Bayesian estimate: a probability distribution $p(\boldsymbol{\theta})$ over a spectrum of hypotheses; needs a prior distribution on the parameters.

  6. Density estimation: graphical model
  - i.i.d. assumption (plate notation): the parameter node $\boldsymbol{\theta}$ is shared by all sample nodes $X^{(1)}, X^{(2)}, \ldots, X^{(N)}$, $n = 1, \ldots, N$.
  - In the Bayesian view, $\boldsymbol{\theta}$ itself is a random variable governed by hyperparameters $\boldsymbol{\alpha}$.

  7. Maximum Likelihood Estimation (MLE)
  - The likelihood is the conditional probability of the observations $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$.
  - Assuming i.i.d. (independent, identically distributed) samples:
    $p(\mathcal{D}|\boldsymbol{\theta}) = p(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}|\boldsymbol{\theta}) = \prod_{n=1}^{N} p(\boldsymbol{x}^{(n)}|\boldsymbol{\theta})$  (the likelihood of $\boldsymbol{\theta}$ w.r.t. the samples)
  - Maximum likelihood estimation:
    $\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} p(\mathcal{D}|\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{n=1}^{N} p(\boldsymbol{x}^{(n)}|\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \ln p(\boldsymbol{x}^{(n)}|\boldsymbol{\theta})$
  - MLE has a closed-form solution for many parametric distributions.
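
  To make the argmax concrete, here is a minimal numerical sketch (not from the slides; the data and the use of scipy are illustrative assumptions): maximizing the Bernoulli log-likelihood numerically recovers the same value as the closed form derived on the next slide.

  ```python
  import numpy as np
  from scipy.optimize import minimize_scalar

  # Hypothetical i.i.d. coin flips (1 = heads, 0 = tails): m = 6 heads, N = 8.
  D = np.array([1, 0, 1, 1, 0, 1, 1, 1])

  def neg_log_likelihood(theta):
      # -ln p(D|theta) = -sum_n [x ln(theta) + (1 - x) ln(1 - theta)]
      return -np.sum(D * np.log(theta) + (1 - D) * np.log(1 - theta))

  # Numerical argmax of the likelihood over theta in (0, 1).
  res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
  print(res.x)  # ~0.75, matching the closed form m/N on the next slide
  ```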

  8. MLE: Bernoulli distribution
  - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0):
    $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$, so $p(x=1|\theta) = \theta$
    $p(\mathcal{D}|\theta) = \prod_{n=1}^{N} p(x^{(n)}|\theta) = \prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}$
    $\ln p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \left\{x^{(n)} \ln\theta + (1-x^{(n)}) \ln(1-\theta)\right\}$
    $\frac{\partial \ln p(\mathcal{D}|\theta)}{\partial \theta} = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \frac{m}{N} = \frac{\sum_{n=1}^{N} x^{(n)}}{N}$
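
  A one-line check of the closed form $\hat{\theta}_{ML} = m/N$ (the flips below are made-up data for illustration):

  ```python
  import numpy as np

  D = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical flips: m = 6, N = 8
  theta_ml = D.mean()                      # sum_n x^(n) / N = m / N
  print(theta_ml)                          # 0.75
  ```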

  9. MLE: Multinomial distribution
  - Multinomial distribution (on a variable with $K$ states):
    $P(\boldsymbol{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$, with $P(x_k = 1) = \theta_k$
  - Parameter space: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ where $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$.
  - Variable: 1-of-K coding, $\boldsymbol{x} = (x_1, \ldots, x_K)$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$.
  - Figure: for $K = 3$, the constraint $\theta_1 + \theta_2 + \theta_3 = 1$ with $\theta_k \in [0,1]$ is a simplex showing the set of valid parameters.

  10. MLE: Multinomial distribution
  - Given $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(N)}\}$:
    $p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{n=1}^{N} p(\boldsymbol{x}^{(n)}|\boldsymbol{\theta}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(n)}} = \prod_{k=1}^{K} \theta_k^{m_k}$
    where $m_k = \sum_{n=1}^{N} x_k^{(n)}$ and $\sum_{k=1}^{K} m_k = N$.
  - Maximizing subject to the sum-to-one constraint (Lagrangian):
    $\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda\left(1 - \sum_{k=1}^{K} \theta_k\right) \;\Rightarrow\; \hat{\theta}_k = \frac{m_k}{N} = \frac{\sum_{n=1}^{N} x_k^{(n)}}{N}$
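
  A short sketch of the multinomial closed form with made-up 1-of-K data ($K = 3$); the counts $m_k$ are the sufficient statistics:

  ```python
  import numpy as np

  # Hypothetical 1-of-K coded samples, one row per sample (K = 3 states).
  D = np.array([[1, 0, 0],
                [0, 1, 0],
                [1, 0, 0],
                [0, 0, 1],
                [1, 0, 0]])
  m = D.sum(axis=0)       # sufficient statistics m_k = sum_n x_k^(n)
  theta_ml = m / len(D)   # closed form: theta_k = m_k / N
  print(theta_ml)         # [0.6 0.2 0.2]
  ```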

  11. MLE: Gaussian with unknown $\mu$
    $\ln p(x^{(n)}|\mu) = -\frac{1}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\left(x^{(n)} - \mu\right)^2$
    $\frac{\partial}{\partial\mu} \ln p(\mathcal{D}|\mu) = 0 \;\Rightarrow\; \frac{\partial}{\partial\mu} \sum_{n=1}^{N} \ln p(x^{(n)}|\mu) = 0$
    $\Rightarrow\; \sum_{n=1}^{N} \frac{1}{\sigma^2}\left(x^{(n)} - \mu\right) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}$
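
  The Gaussian closed form is just the sample mean; a quick check with synthetic data (the true mean 5.0 and $\sigma = 2$ are assumptions of this example):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  D = rng.normal(loc=5.0, scale=2.0, size=1000)  # synthetic samples, known sigma
  mu_ml = D.mean()                               # closed form: (1/N) sum_n x^(n)
  print(mu_ml)                                   # close to the true mean 5.0
  ```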

  12. Bayesian approach
  - Treats the parameters $\boldsymbol{\theta}$ as random variables with a prior distribution.
  - Utilizes the available prior information about the unknown parameters.
  - As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$.
  - The samples $\mathcal{D}$ convert the prior density $p(\boldsymbol{\theta})$ into a posterior density $p(\boldsymbol{\theta}|\mathcal{D})$.
  - It keeps track of beliefs about the values of $\boldsymbol{\theta}$ and uses these beliefs for reaching conclusions.

  13. Maximum A Posteriori (MAP) estimation
  - MAP estimation:
    $\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|\mathcal{D})$
  - Since $p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$:
    $\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$
  - Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$
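
  A minimal sketch of MAP estimation for a Bernoulli parameter under the Gaussian prior mentioned on the slide (the data and hyperparameters are illustrative assumptions, not from the slides); the prior pulls the estimate away from the MLE:

  ```python
  import numpy as np
  from scipy.optimize import minimize_scalar
  from scipy.stats import norm

  D = np.array([1, 1, 1])      # hypothetical flips; MLE alone would give theta = 1
  theta0, sigma = 0.5, 0.1     # Gaussian prior p(theta) = N(theta_0, sigma^2)

  def neg_log_posterior(theta):
      log_lik = np.sum(D * np.log(theta) + (1 - D) * np.log(1 - theta))
      log_prior = norm.logpdf(theta, loc=theta0, scale=sigma)
      return -(log_lik + log_prior)   # argmax of p(D|theta) p(theta)

  res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
  print(res.x)  # between the prior mean 0.5 and the MLE 1.0
  ```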

  14. Bayesian approach: predictive distribution
  - Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$, a prior distribution $p(\boldsymbol{\theta})$ on the parameters, and the form of the distribution $p(\boldsymbol{x}|\boldsymbol{\theta})$:
  - We find $p(\boldsymbol{\theta}|\mathcal{D})$ and use it to specify $p(\boldsymbol{x}|\mathcal{D})$ on new data as an estimate of $p(\boldsymbol{x})$:
    $p(\boldsymbol{x}|\mathcal{D}) = \int p(\boldsymbol{x}, \boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int p(\boldsymbol{x}|\boldsymbol{\theta}, \mathcal{D}) \, p(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int p(\boldsymbol{x}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta}$  (the predictive distribution)
  - Analytical solutions exist only for very special forms of the involved functions.
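
  When the predictive integral has no closed form, it can be approximated by Monte Carlo: sample $\theta$ from the posterior and average $p(x|\theta)$. The sketch below assumes the Beta posterior of slides 18-20 so the answer can be checked against the exact value derived there:

  ```python
  import numpy as np
  from scipy.stats import beta

  posterior = beta(5, 2)  # assumed p(theta|D); slides 18-20 derive this case
  thetas = posterior.rvs(size=200_000, random_state=0)   # theta ~ p(theta|D)
  # For Bernoulli, p(x=1|theta) = theta, so the MC estimate of the
  # predictive integral is just the average of the posterior samples.
  print(thetas.mean())    # ~5/7, the exact predictive from slide 20
  ```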

  15. Conjugate priors
  - We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
  - A conjugate prior is chosen such that the posterior distribution, which is proportional to $p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$, has the same functional form as the prior:
    $\forall \boldsymbol{\alpha}, \mathcal{D} \;\; \exists \boldsymbol{\alpha}' : \;\; p(\boldsymbol{\theta}|\boldsymbol{\alpha}') \propto p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta}|\boldsymbol{\alpha})$  (the same functional form on both sides)

  16. Prior for the Bernoulli likelihood
  - Beta distribution over $\theta \in [0,1]$:
    $\mathrm{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)} \, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \;\propto\; \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$
  - Mean: $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$; most probable $\theta$ (mode): $\frac{\alpha_1 - 1}{\alpha_0 - 1 + \alpha_1 - 1}$
  - The Beta distribution is the conjugate prior of the Bernoulli: $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
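
  The Beta prior is available in scipy.stats; in the slide's notation, scipy's $(a, b)$ arguments play the roles of $(\alpha_1, \alpha_0)$, since its density is proportional to $x^{a-1}(1-x)^{b-1}$:

  ```python
  from scipy.stats import beta

  a1, a0 = 2, 2             # hyperparameters alpha_1, alpha_0
  prior = beta(a1, a0)      # scipy's (a, b) correspond to (alpha_1, alpha_0)
  print(prior.mean())       # E[theta] = alpha_1 / (alpha_0 + alpha_1) = 0.5
  print(prior.pdf(0.5))     # density at theta = 0.5 (the mode here): 1.5
  ```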

  17. Beta distribution (figure: plots of $\mathrm{Beta}(\theta|\alpha_1, \alpha_0)$ for several hyperparameter settings)

  18. Bernoulli likelihood: posterior
  - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0), where $m = \sum_{n=1}^{N} x^{(n)}$:
    $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta) \, p(\theta) = \left[\prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}\right] \mathrm{Beta}(\theta|\alpha_1, \alpha_0) \;\propto\; \theta^{m + \alpha_1 - 1} (1-\theta)^{N - m + \alpha_0 - 1}$
    $\Rightarrow\; p(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\alpha_1', \alpha_0')$ where $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + N - m$
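
  The conjugate update is just two additions; a sketch with made-up flips:

  ```python
  import numpy as np

  D = np.array([1, 0, 1, 1])              # hypothetical flips
  a1, a0 = 2, 2                           # prior hyperparameters
  m, N = D.sum(), len(D)                  # m = 3 heads, N = 4
  a1_post, a0_post = a1 + m, a0 + N - m   # alpha_1' = alpha_1 + m, alpha_0' = alpha_0 + N - m
  print(a1_post, a0_post)                 # posterior is Beta(theta | 5, 3)
  ```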

  19. Example
  - Bernoulli likelihood: $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$, $p(x=1|\theta) = \theta$.
  - Prior: Beta with $\alpha_0 = \alpha_1 = 2$.
  - Data: $\mathcal{D} = \{1, 1, 1\}$, i.e. $N = 3$ and $m = 3$ heads.
  - Posterior: Beta with $\alpha_1' = 5$, $\alpha_0' = 2$:
    $\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|\mathcal{D}) = \frac{\alpha_1' - 1}{\alpha_1' - 1 + \alpha_0' - 1} = \frac{4}{5}$

  20. Bernoulli: predictive distribution
  - Training samples: $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$
    $p(\theta) = \mathrm{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$
    $p(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1 + m - 1} (1-\theta)^{\alpha_0 + N - m - 1}$
    $p(x|\mathcal{D}) = \int p(x|\theta) \, p(\theta|\mathcal{D}) \, d\theta = E_{p(\theta|\mathcal{D})}[p(x|\theta)]$
    $\Rightarrow\; p(x=1|\mathcal{D}) = E_{p(\theta|\mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$
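
  Plugging the slide-19 numbers into the predictive formula, and checking against the posterior mean from scipy:

  ```python
  from scipy.stats import beta

  a1, a0, m, N = 2, 2, 3, 3                 # slide-19 setting: D = {1, 1, 1}
  p_heads = (a1 + m) / (a0 + a1 + N)        # (alpha_1 + m) / (alpha_0 + alpha_1 + N)
  print(p_heads)                            # 5/7 ~ 0.714
  print(beta(a1 + m, a0 + N - m).mean())    # E[theta] under Beta(5, 2): same value
  ```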

  21. Dirichlet distribution
  - Input space: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)^T$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$:
    $p(\boldsymbol{\theta}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$, where $\hat{\alpha} = \sum_{k=1}^{K} \alpha_k$
  - Mean: $E[\theta_k] = \frac{\alpha_k}{\hat{\alpha}}$; mode: $\theta_k = \frac{\alpha_k - 1}{\hat{\alpha} - K}$
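
  scipy.stats.dirichlet exposes the mean directly; the mode formula from the slide is computed by hand below (valid when all $\alpha_k > 1$):

  ```python
  import numpy as np
  from scipy.stats import dirichlet

  alpha = np.array([2.0, 2.0, 2.0])
  print(dirichlet.mean(alpha))                     # E[theta_k] = alpha_k / alpha_hat
  print((alpha - 1) / (alpha.sum() - len(alpha)))  # mode: (alpha_k - 1) / (alpha_hat - K)
  print(dirichlet.pdf([0.2, 0.3, 0.5], alpha))     # density of a point on the simplex
  ```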

  22. Dirichlet distribution: examples
  - Figure: densities for $\boldsymbol{\alpha} = [0.1, 0.1, 0.1]$, $\boldsymbol{\alpha} = [1, 1, 1]$, and $\boldsymbol{\alpha} = [10, 10, 10]$.
  - The Dirichlet parameters determine both the prior beliefs and their strength: larger values of $\alpha$ correspond to more confidence in the prior belief (i.e., more imaginary samples).
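
  The "strength" interpretation can be checked numerically: scaling $\boldsymbol{\alpha}$ up keeps the mean fixed but shrinks the variance, i.e. the prior concentrates more tightly around its belief. A minimal sketch using the slide's three settings:

  ```python
  import numpy as np
  from scipy.stats import dirichlet

  # Same prior mean [1/3, 1/3, 1/3]; increasing alpha shrinks the variance.
  for alpha in ([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]):
      a = np.asarray(alpha)
      print(alpha, dirichlet.mean(a), dirichlet.var(a))
  ```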

  23. Dirichlet distribution: example
  - Figure: densities for $\boldsymbol{\alpha} = [2, 2, 2]$ and $\boldsymbol{\alpha} = [20, 2, 2]$.

  24. Multinomial distribution: prior
  - The Dirichlet distribution is the conjugate prior of the multinomial:
    $p(\boldsymbol{\theta}|\mathcal{D}, \boldsymbol{\alpha}) \propto p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \theta_k^{m_k + \alpha_k - 1}$
  - $\boldsymbol{m} = (m_1, \ldots, m_K)^T$ are the sufficient statistics of the data.
  - Prior $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$ $\Rightarrow$ posterior $\boldsymbol{\theta}|\mathcal{D} \sim \mathrm{Dir}(\alpha_1 + m_1, \ldots, \alpha_K + m_K)$, i.e. $p(\boldsymbol{\theta}|\mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\theta}|\boldsymbol{\alpha} + \boldsymbol{m})$
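
  The Dirichlet-multinomial update mirrors the Beta-Bernoulli case: add the counts to the hyperparameters. A sketch with made-up counts:

  ```python
  import numpy as np
  from scipy.stats import dirichlet

  alpha = np.array([1.0, 1.0, 1.0])    # uniform prior over the simplex
  m = np.array([6, 3, 1])              # hypothetical sufficient statistics, N = 10
  alpha_post = alpha + m               # posterior: Dir(alpha + m)
  print(dirichlet.mean(alpha_post))    # E[theta_k|D] = (alpha_k + m_k) / (alpha_hat + N)
  ```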
