  1. Probabilistic Graphical Models: Parameter learning in Bayesian networks. Siamak Ravanbakhsh, Fall 2019.

  2. Learning objectives: the likelihood function and MLE; the role of the sufficient statistics; MLE for parameter learning in directed models (and why it is easy); conjugate priors and Bayesian parameter learning.

  3. Likelihood function, through an example: a thumbtack with unknown probability of heads (≡ 1) and tails (≡ 0), modeled by the Bernoulli distribution p(x; θ) = θ^x (1 − θ)^(1−x).

  4. Likelihood function, through an example (continued): given the IID observations D = {1, 0, 0, 1, 1}, the likelihood of θ is L(θ; D) = ∏_{x ∈ D} p(x; θ) = θ^3 (1 − θ)^2. The likelihood function is not a pdf over θ (it does not integrate to 1).

  5. Likelihood function, through an example (continued): the log-likelihood is log L(θ; D) = 3 log θ + 2 log(1 − θ). The maximum-likelihood estimate (MLE) maximizes the log-likelihood (the M-projection of the empirical distribution P_D): ∂/∂θ [3 log θ + 2 log(1 − θ)] = 3/θ − 2/(1 − θ) = (3 − 5θ)/(θ(1 − θ)) = 0 ⇒ θ̂ = 3/5.
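
A minimal sketch in Python (not part of the slides) checking the thumbtack MLE numerically: the closed-form estimate is the fraction of heads, and a grid search over the log-likelihood recovers the same value.

```python
import numpy as np

# Thumbtack observations from the slides: D = {1, 0, 0, 1, 1} (heads = 1, tails = 0).
D = np.array([1, 0, 0, 1, 1])

# Closed-form Bernoulli MLE: the fraction of heads, N(x=1) / |D|.
theta_mle = D.mean()
print(theta_mle)  # 0.6 = 3/5

# Numerical check: the log-likelihood 3 log(theta) + 2 log(1 - theta) peaks at the MLE.
thetas = np.linspace(0.01, 0.99, 981)
log_lik = D.sum() * np.log(thetas) + (len(D) - D.sum()) * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])  # ~0.6
```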

  6. Sufficient statistics, through an example: for the IID observations D = {1, 0, 0, 1, 1} (heads ≡ 1, tails ≡ 0), the likelihood L(θ; D) = ∏_{x ∈ D} p(x; θ) = θ^3 (1 − θ)^2 depends on the data only through the number of heads and tails. Given a distribution p(x; θ), a sufficient statistic is a function ϕ = [ϕ_1, …, ϕ_K] such that (1/|D|) ∑_{x ∈ D} ϕ(x) = (1/|D'|) ∑_{x ∈ D'} ϕ(x) implies L(θ; D) = L(θ; D') for all D, D', θ. The sufficient statistics of the dataset are all that matters about the data.
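
A small Python sketch (illustrative only, helper name made up) of this point: two datasets with the same counts of heads and tails, i.e. the same sufficient statistics, have identical likelihood functions.

```python
import numpy as np

def bernoulli_likelihood(theta, data):
    # The likelihood depends on the data only through the counts N(1), N(0).
    data = np.asarray(data)
    n1 = data.sum()            # sufficient statistic: number of heads
    n0 = len(data) - n1        # sufficient statistic: number of tails
    return theta ** n1 * (1 - theta) ** n0

# Two different datasets with the same sufficient statistics (3 heads, 2 tails):
D1 = [1, 0, 0, 1, 1]
D2 = [0, 1, 1, 1, 0]
for theta in [0.2, 0.5, 0.6]:
    print(bernoulli_likelihood(theta, D1) == bernoulli_likelihood(theta, D2))  # True
```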

  7. Revisiting the exponential family: given a distribution p(x; θ) with likelihood L(θ; D) = ∏_{x ∈ D} p(x; θ), a sufficient statistic ϕ = [ϕ_1, …, ϕ_K] satisfies (1/|D|) ∑_{x ∈ D} ϕ(x) = (1/|D'|) ∑_{x ∈ D'} ϕ(x) ⇒ L(θ; D) = L(θ; D') for all D, D', θ. The (linear) exponential family p(x) ∝ exp(⟨θ, ϕ(x)⟩) is the max-entropy distribution subject to E_p[ϕ(x)] = μ.

  8. Revisiting the exponential family (continued): in addition, if ϕ_1, …, ϕ_K are linearly independent, then θ ↔ μ, i.e. the natural parameters and the mean parameters are in one-to-one correspondence.
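
As a concrete instance of the θ ↔ μ correspondence (a sketch, not from the slides): for the Bernoulli distribution with ϕ(x) = x, the natural parameter is the log-odds and the mean parameter μ = E[ϕ(x)] is the head probability; they are related by the logistic sigmoid and its inverse.

```python
import numpy as np

# Bernoulli as a linear exponential family: p(x) ∝ exp(eta * x), with phi(x) = x.
def natural_to_mean(eta):
    return 1.0 / (1.0 + np.exp(-eta))      # mu = sigmoid(eta)

def mean_to_natural(mu):
    return np.log(mu / (1.0 - mu))         # eta = logit(mu)

mu = 0.6                                   # head probability (mean parameter)
eta = mean_to_natural(mu)
print(eta, natural_to_mean(eta))           # eta ≈ 0.405, maps back to 0.6
```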

  9. MLE for Bayesian networks, an example: a simple network X → Y with joint p(x, y; θ) = p(x; θ_X) p(y | x; θ_{Y|X}).

  10. MLE for Bayesian networks, an example (continued): the likelihood factorizes, L(θ; D) = ∏_{(x,y) ∈ D} p(x; θ_X) p(y | x; θ_{Y|X}) = (∏_{x ∈ D} p(x; θ_X)) (∏_{(x,y) ∈ D} p(y | x; θ_{Y|X})), i.e. the likelihood of x times the conditional likelihood of y.

  11. MLE for Bayesian networks, an example (continued): for discrete variables, L(θ; D) = (∏_{ℓ ∈ Val(X)} θ_{X,ℓ}^{N(x=ℓ)}) (∏_{(ℓ,ℓ') ∈ Val(X)×Val(Y)} θ_{Y|X,ℓ,ℓ'}^{N(x=ℓ, y=ℓ')}), where N(x=ℓ) is the number of times x = ℓ appears in the dataset, N(x=ℓ, y=ℓ') the number of times (x=ℓ, y=ℓ') appears, θ_{X,ℓ} = p(X = ℓ), and θ_{Y|X,ℓ,ℓ'} = p(Y = ℓ' | X = ℓ).

  12. MLE for Bayesian networks, an example (continued): maximizing the local likelihood terms individually gives the MLE θ̂_{X,ℓ} = N(x=ℓ) / |D| and θ̂_{Y|X,ℓ,ℓ'} = N(x=ℓ, y=ℓ') / N(x=ℓ).
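
A sketch of the counting MLE for this two-node network, using a small made-up dataset of (x, y) pairs (the data are illustrative only).

```python
from collections import Counter

# Toy (x, y) observations for the network X -> Y.
data = [(0, 1), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]
n = len(data)

count_x = Counter(x for x, _ in data)
count_xy = Counter(data)

# MLE by counting local statistics:
#   theta_X[l]             = N(x=l) / |D|
#   theta_Y_given_X[l, l'] = N(x=l, y=l') / N(x=l)
theta_X = {l: count_x[l] / n for l in count_x}
theta_Y_given_X = {(l, lp): count_xy[(l, lp)] / count_x[l] for (l, lp) in count_xy}

print(theta_X)           # e.g. {0: 0.5, 1: 0.5}
print(theta_Y_given_X)   # e.g. {(0, 1): 2/3, (0, 0): 1/3, (1, 1): 2/3, (1, 0): 1/3}
```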

  13. MLE for Bayesian networks, the general case: a Bayes net factorizes as p(x; θ) = ∏_i p(x_i | pa_{X_i}; θ_{X_i | Pa_{X_i}}), so the likelihood is L(θ; D) = ∏_{x ∈ D} ∏_i p(x_i | pa_{X_i}; θ_{X_i | Pa_{X_i}}) = ∏_i ∏_{(x_i, pa_{X_i}) ∈ D} p(x_i | pa_{X_i}; θ_{X_i | Pa_{X_i}}), a product of local likelihood terms.

  14. MLE for Bayesian networks, the general case (continued): maximizing the conditional likelihood for each node separately is similar to solving individual prediction problems.

  15. MLE for Bayesian networks, the general case (continued): Example, how would we learn a naive Bayes model? (See the sketch below.)
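
For the naive Bayes example, a minimal sketch (toy binary data, all names assumed): the class variable is the sole parent of every feature, so the MLE decomposes into a class prior plus one conditional table per (feature, class), each obtained by counting.

```python
import numpy as np
from collections import defaultdict

# Toy data: 5 instances, 2 binary features, binary class label.
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1, 1])

# MLE for the class prior p(C = c).
prior = {c: float(np.mean(y == c)) for c in np.unique(y)}

# MLE for each conditional p(X_j = v | C = c): one local counting problem per pair.
cond = defaultdict(dict)
for c in np.unique(y):
    Xc = X[y == c]
    for j in range(X.shape[1]):
        for v in (0, 1):
            cond[(j, c)][v] = float(np.mean(Xc[:, j] == v))

print(prior)
print(dict(cond))
```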

  16. Bayesian parameter estimation: the max-likelihood estimate is the same, θ̂ = 1/3, in both of the following cases (heads ≡ 1, tails ≡ 0). Example: case 1, N(x=1) = 1, N(x=0) = 2; case 2, N(x=1) = 100, N(x=0) = 200.

  17. Bayesian parameter estimation (continued): we need to model our uncertainty about θ. The Bayesian approach: assume a prior p(θ) and estimate the posterior.

  18. Bayesian parameter estimation (continued): the posterior is p(θ | D) = p(θ) p(D | θ) / p(D) ∝ p(θ) p(D | θ), where p(θ) is the prior, p(D | θ) = ∏_{x ∈ D} p(x | θ) the likelihood, and p(D) the marginal likelihood.
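
A sketch of this update for the two cases above, using the conjugate Beta prior mentioned in the learning objectives (SciPy usage assumed; Beta(1, 1) is the uniform prior used on the next slides).

```python
from scipy.stats import beta

# Conjugate update: Beta(a, b) prior + N(1) heads, N(0) tails -> Beta(a + N(1), b + N(0)).
a, b = 1.0, 1.0   # uniform prior

for n1, n0 in [(1, 2), (100, 200)]:        # the two cases from the slide
    posterior = beta(a + n1, b + n0)
    # Similar posterior means, but far smaller uncertainty with 300 observations.
    print(posterior.mean(), posterior.std())
```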

  19. Bayesian parameter estimation (continued): assuming a uniform prior, p(θ) = 1 for 0 ≤ θ ≤ 1 and 0 otherwise (heads ≡ 1, tails ≡ 0), the posterior is p(θ | D) ∝ p(θ) p(D | θ) ∝ p(D | θ).

  20. Bayesian parameter estimation (continued): the posterior predictive predicts heads/tails using the posterior rather than a single MLE value: p(x | D) = ∫_0^1 p(θ | D) p(x | θ) dθ ∝ ∫_0^1 θ^{N(1)} (1 − θ)^{N(0)} θ^x (1 − θ)^{1−x} dθ.

  21. Bayesian parameter estimation (continued): carrying out the integration above (and normalizing) gives p(x = 1 | D) = (N(1) + 1) / (N(0) + N(1) + 2), the Laplace correction.
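
A small numerical check (Python, not from the slides) that the closed form matches the posterior predictive integral under the uniform prior.

```python
from scipy.integrate import quad

n1, n0 = 3, 2   # counts from D = {1, 0, 0, 1, 1}

# Closed form from the slide: Laplace correction.
laplace = (n1 + 1) / (n0 + n1 + 2)

# Numerical check: integrate theta^N(1) (1 - theta)^N(0) * theta over [0, 1], then normalize.
num, _ = quad(lambda t: t ** n1 * (1 - t) ** n0 * t, 0, 1)
den, _ = quad(lambda t: t ** n1 * (1 - t) ** n0, 0, 1)
print(laplace, num / den)   # both 4/7 ≈ 0.571
```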
