Probabilistic Graphical Models
Parameter learning in Bayesian networks
Siamak Ravanbakhsh, Fall 2019
Learning objectives

- the likelihood function and MLE
- the role of sufficient statistics
- MLE for parameter learning in directed models: why is it easy?
- conjugate priors and Bayesian parameter learning
Likelihood function: through an example

- a thumbtack with unknown probability of heads ($\equiv 1$) and tails ($\equiv 0$)
- Bernoulli distribution: $p(x; \theta) = \theta^x (1-\theta)^{1-x}$
- IID observations $D = \{1, 0, 0, 1, 1\}$
- likelihood of $\theta$: $L(\theta; D) = \prod_{x \in D} p(x; \theta) = \theta^3 (1-\theta)^2$
- the likelihood function is not a pdf (it does not integrate to 1)

Maximum-likelihood estimate (MLE):

- log-likelihood: $\log L(\theta; D) = 3 \log \theta + 2 \log(1-\theta)$
- maximizing the log-likelihood (the M-projection of $P_D$):
  $$\frac{\partial}{\partial \theta}\big(3 \log \theta + 2 \log(1-\theta)\big) = \frac{3}{\theta} - \frac{2}{1-\theta} = \frac{3 - 5\theta}{\theta(1-\theta)} = 0 \;\Rightarrow\; \hat{\theta} = \frac{3}{5}$$
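As a quick sanity check on the closed-form result, here is a minimal Python sketch (the grid search and use of numpy are my additions, not part of the slides) that evaluates the Bernoulli log-likelihood on the slide's dataset and confirms that the empirical frequency $3/5$ maximizes it.

```python
import numpy as np

# IID thumbtack observations from the slide: 1 = heads, 0 = tails.
D = np.array([1, 0, 0, 1, 1])

def log_likelihood(theta, data):
    """Bernoulli log-likelihood: sum_x [x log(theta) + (1 - x) log(1 - theta)]."""
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# Closed-form MLE: the empirical frequency of heads (here 3/5).
theta_mle = D.mean()

# Sanity check: the MLE attains the largest log-likelihood on a grid of candidates.
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmax([log_likelihood(t, D) for t in grid])]
assert np.isclose(best, theta_mle)
print(theta_mle)  # 0.6
```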
Sufficient statistics: through an example

- IID observations $D = \{1, 0, 0, 1, 1\}$ (heads $\equiv 1$, tails $\equiv 0$)
- likelihood of $\theta$: $L(\theta; D) = \prod_{x \in D} p(x; \theta) = \theta^3 (1-\theta)^2$
- all we needed to know about the data: the number of heads and tails

Given a distribution $p(x; \theta)$, its sufficient statistics is a function $\phi = [\phi_1, \ldots, \phi_K]$ such that
$$\frac{1}{|D|}\, \mathbb{E}_{D}[\phi(x)] = \frac{1}{|D'|}\, \mathbb{E}_{D'}[\phi(x)] \;\Rightarrow\; L(\theta; D) = L(\theta; D') \qquad \forall D, D', \theta$$

- the sufficient statistics of the dataset is all that matters about the data
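To make the sufficiency claim concrete, this small sketch (an illustration I added, assuming only numpy) shows that two different datasets with the same counts of heads and tails induce exactly the same likelihood function.

```python
import numpy as np

def likelihood(theta, data):
    """Bernoulli likelihood L(theta; D) = prod_x theta^x (1 - theta)^(1 - x)."""
    data = np.asarray(data)
    return float(np.prod(theta ** data * (1 - theta) ** (1 - data)))

# Two different datasets with the same sufficient statistics (3 heads, 2 tails) ...
D1 = [1, 0, 0, 1, 1]
D2 = [1, 1, 1, 0, 0]

# ... induce exactly the same likelihood function.
for theta in [0.2, 0.5, 0.8]:
    assert np.isclose(likelihood(theta, D1), likelihood(theta, D2))
```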
Revisiting the exponential family

Given a distribution $p(x; \theta)$ with $L(\theta; D) = \prod_{x \in D} p(x; \theta)$, its sufficient statistics $\phi = [\phi_1, \ldots, \phi_K]$ satisfy
$$\frac{1}{|D|}\, \mathbb{E}_{D}[\phi(x)] = \frac{1}{|D'|}\, \mathbb{E}_{D'}[\phi(x)] \;\Rightarrow\; L(\theta; D) = L(\theta; D') \qquad \forall D, D', \theta$$

- the (linear) exponential family: $p(x) \propto \exp(\langle \theta, \phi(x) \rangle)$
- the max-entropy distribution subject to $\mathbb{E}_p[\phi(x)] = \mu$
- if $\phi_1, \ldots, \phi_K$ are linearly independent, then $\theta \leftrightarrow \mu$
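As a concrete instance, the Bernoulli distribution is a linear exponential family with $\phi(x) = x$. The sketch below (my own illustration, assuming numpy) shows the one-to-one map between the natural parameter $\theta$ and the mean parameter $\mu = \mathbb{E}[\phi(x)]$.

```python
import numpy as np

# The Bernoulli distribution as a (linear) exponential family with phi(x) = x:
#   p(x) ∝ exp(theta * x),  natural parameter theta = log(p / (1 - p)),
#   mean parameter mu = E[phi(x)] = p(x = 1).

def natural_to_mean(theta):
    """mu = E[phi(x)] under p(x) ∝ exp(theta * x) for x in {0, 1} (logistic sigmoid)."""
    return np.exp(theta) / (1.0 + np.exp(theta))

def mean_to_natural(mu):
    """Inverse map (log-odds); well defined because phi(x) = x is not constant."""
    return np.log(mu / (1.0 - mu))

# The natural and mean parameterizations are in one-to-one correspondence.
for mu in [0.1, 0.5, 0.9]:
    assert np.isclose(natural_to_mean(mean_to_natural(mu)), mu)
```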
MLE for Bayesian networks: an example

A simple network $X \rightarrow Y$: $\;p(x, y; \theta) = p(x; \theta_X)\, p(y \mid x; \theta_{Y \mid X})$

Likelihood:
$$L(D; \theta) = \prod_{(x,y) \in D} p(x; \theta_X)\, p(y \mid x; \theta_{Y \mid X}) = \Big(\prod_{(x,y) \in D} p(x; \theta_X)\Big)\Big(\prod_{(x,y) \in D} p(y \mid x; \theta_{Y \mid X})\Big)$$
the likelihood of $x$ times the conditional likelihood of $y$.

For discrete variables, let $N(x{=}\ell)$ be the number of times $x = \ell$ appears in the dataset and $N(x{=}\ell, y{=}\ell')$ the number of times $(x = \ell, y = \ell')$ appears:
$$L(D; \theta) = \Big(\prod_{\ell \in Val(X)} \theta_{X,\ell}^{\,N(x=\ell)}\Big)\Big(\prod_{\ell,\ell' \in Val(X) \times Val(Y)} \theta_{Y \mid X,\ell,\ell'}^{\,N(x=\ell,\, y=\ell')}\Big)$$
where $\theta_{X,\ell} = p(X{=}\ell)$ and $\theta_{Y \mid X,\ell,\ell'} = p(Y{=}\ell' \mid X{=}\ell)$.

MLE: maximize the local likelihood terms individually:
$$\hat{\theta}_{X,\ell} = \frac{N(x{=}\ell)}{|D|}, \qquad \hat{\theta}_{Y \mid X,\ell,\ell'} = \frac{N(x{=}\ell,\, y{=}\ell')}{N(x{=}\ell)}$$
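The closed-form MLE above is just counting. Here is a minimal Python sketch (the toy dataset is invented for illustration) that recovers $\hat{\theta}_X$ and $\hat{\theta}_{Y \mid X}$ from the counts.

```python
from collections import Counter

# Toy dataset of (x, y) samples for the two-node network X -> Y (values are illustrative).
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]
n = len(data)

# Sufficient statistics: marginal and joint counts.
x_counts = Counter(x for x, _ in data)   # N(x = l)
joint_counts = Counter(data)             # N(x = l, y = l')

# MLE parameters read off from the counts.
theta_X = {x: c / n for x, c in x_counts.items()}                                   # p(X = l)
theta_Y_given_X = {(x, y): c / x_counts[x] for (x, y), c in joint_counts.items()}   # p(Y = l' | X = l)

print(theta_X)          # {0: 0.5, 1: 0.5}
print(theta_Y_given_X)  # {(0, 0): 2/3, (0, 1): 1/3, (1, 1): 2/3, (1, 0): 1/3}
```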
MLE for Bayesian networks: general case

Bayes-net: $p(x; \theta) = \prod_i p(x_i \mid x_{Pa_i}; \theta_{X_i \mid Pa_i})$

Likelihood:
$$L(D; \theta) = \prod_{x \in D} \prod_i p(x_i \mid x_{Pa_i}; \theta_{X_i \mid Pa_i}) = \prod_i \prod_{(x_i,\, x_{Pa_i}) \in D} p(x_i \mid x_{Pa_i}; \theta_{X_i \mid Pa_i})$$
the inner products are the local likelihood terms.

- maximize the conditional likelihood for each node
- similar to solving individual prediction problems

Example: how would you learn a naive Bayes model? (see the sketch below)
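For the naive Bayes question: the class is the only parent of every feature, so the MLE again decomposes into per-node counting problems. A minimal sketch follows (the spam/ham dataset and variable names are invented for illustration; standard library only).

```python
from collections import Counter, defaultdict

# Naive Bayes is a Bayes-net where the class C is the only parent of every feature X_i,
# so the MLE reduces to counting: p(c) = N(c)/|D| and p(x_i = v | c) = N(x_i = v, c)/N(c).
# Toy dataset: each row is (class_label, (feature_1, feature_2)); values are illustrative.
data = [("spam", (1, 1)), ("spam", (1, 0)), ("ham", (0, 0)), ("ham", (0, 1)), ("ham", (0, 0))]
n = len(data)

class_counts = Counter(c for c, _ in data)
feature_counts = defaultdict(Counter)  # feature_counts[(i, c)][v] = N(x_i = v, c)
for c, features in data:
    for i, v in enumerate(features):
        feature_counts[(i, c)][v] += 1

prior = {c: cnt / n for c, cnt in class_counts.items()}          # p(c)
cpds = {(i, c): {v: cnt / class_counts[c] for v, cnt in vc.items()}
        for (i, c), vc in feature_counts.items()}                # p(x_i = v | c)

print(prior)  # {'spam': 0.4, 'ham': 0.6}
print(cpds)   # one conditional table per (feature, class) pair
```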
Bayesian parameter estimation

Example: the max-likelihood estimate $\hat{\theta} = \frac{1}{3}$ is the same for
- case 1. $N(x{=}1) = 1,\; N(x{=}0) = 2$
- case 2. $N(x{=}1) = 100,\; N(x{=}0) = 200$

We need to model our uncertainty. The Bayesian approach:
- assume a prior $p(\theta)$
- estimate the posterior
$$p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{p(D)} \propto p(\theta)\, p(D \mid \theta)$$
with prior $p(\theta)$, likelihood $p(D \mid \theta) = \prod_{x \in D} p(x \mid \theta)$, marginal likelihood $p(D)$, and posterior $p(\theta \mid D)$.
Bayesian parameter estimation (continued)

Assuming a uniform prior
$$p(\theta) = \begin{cases} 1 & 0 \leq \theta \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
the posterior is $p(\theta \mid D) \propto p(\theta)\, p(D \mid \theta) \propto p(D \mid \theta)$.

Posterior predictive: predict heads/tails using the posterior rather than a single MLE value:
$$p(x \mid D) = \int_0^1 p(\theta \mid D)\, p(x \mid \theta)\, d\theta \;\propto\; \int_0^1 \theta^{N(1)} (1-\theta)^{N(0)}\, \theta^x (1-\theta)^{1-x}\, d\theta$$

Doing the integration above (and normalizing) gives the Laplace correction:
$$p(x{=}1 \mid D) = \frac{N(1)+1}{N(0)+N(1)+2}$$
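To check the Laplace correction numerically, the sketch below (my addition; numpy and scipy are assumed available) integrates the posterior predictive for the slide's dataset and compares it with the closed form $(N(1)+1)/(N(0)+N(1)+2)$.

```python
import numpy as np
from scipy import integrate

# Counts of heads and tails from the slide's dataset D = {1, 0, 0, 1, 1}.
N1, N0 = 3, 2

# Under a uniform prior, p(x = 1 | D) is the posterior mean of theta:
#   p(x = 1 | D) = ∫ theta^(N1+1) (1 - theta)^N0 dtheta / ∫ theta^N1 (1 - theta)^N0 dtheta.
num, _ = integrate.quad(lambda t: t**N1 * (1 - t)**N0 * t, 0, 1)
den, _ = integrate.quad(lambda t: t**N1 * (1 - t)**N0, 0, 1)
p_heads = num / den

# Matches the closed-form Laplace correction (N1 + 1) / (N0 + N1 + 2).
assert np.isclose(p_heads, (N1 + 1) / (N0 + N1 + 2))
print(p_heads)  # 4/7 ≈ 0.571
```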