Learning Bayesian networks: Given structure and completely observed data
Probabilistic Graphical Models
Sharif University of Technology
Spring 2017
Soleymani
Learning problem

- Target: the true distribution $P^*$ that may correspond to $\mathcal{M}^* = \langle G^*, \theta^* \rangle$
- Hypothesis space: a specified family of probabilistic graphical models
- Data: a set of instances sampled from $P^*$
- Learning goal: select a model $\mathcal{M}$ that constructs the best approximation to $P^*$ according to a performance metric
Learning tasks on graphical models

- Parameter learning / structure learning
- Completely observable / partially observable data
- Directed model / undirected model
Parameter learning in directed models: complete data

- We assume that the structure of the model is known
  - Consider learning parameters for a BN with a given structure
- Goal: estimate CPDs from a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ of $N$ independent, identically distributed (i.i.d.) training samples
- Each training sample $x^{(i)} = (x_1^{(i)}, \ldots, x_n^{(i)})$ is a vector in which every element $x_j^{(i)}$ is known (no missing values, no hidden variables)
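As a concrete illustration of what this section derives formally, the following minimal Python sketch estimates one CPD of a small Bayesian network from complete data by counting and normalizing. The network (A -> C <- B), the variable names, and the data matrix are all hypothetical, chosen only for illustration.

```python
import numpy as np
from collections import defaultdict

# Hypothetical binary network A -> C <- B; each data row is one complete sample (a, b, c).
parents = {'A': [], 'B': [], 'C': ['A', 'B']}
cols = {'A': 0, 'B': 1, 'C': 2}               # column of each variable in the data matrix
data = np.array([[0, 1, 1],
                 [1, 1, 1],
                 [0, 0, 0],
                 [1, 0, 1],
                 [1, 1, 1]])

def estimate_cpd(var, data, parents, cols, n_states=2):
    """MLE of P(var | parents(var)): count each (parent config, value) pair, then normalize."""
    counts = defaultdict(lambda: np.zeros(n_states))
    for row in data:
        pa_config = tuple(row[cols[p]] for p in parents[var])
        counts[pa_config][row[cols[var]]] += 1
    return {cfg: c / c.sum() for cfg, c in counts.items()}

print(estimate_cpd('C', data, parents, cols))  # P(C | A, B), one entry per observed parent config
```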
Density estimation review

- We use density estimation to solve this learning problem
- Density estimation: estimating the probability density function $p(x)$, given a set of data points $\{x^{(i)}\}_{i=1}^{N}$ drawn from it
- Parametric methods: assume that $p(x)$ has a specific functional form with a number of adjustable parameters
  - MLE and Bayesian estimation
  - MLE: need to determine $\theta^*$ given $\{x^{(1)}, \ldots, x^{(N)}\}$
    - MLE overfitting problem
  - Bayesian estimation: probability distribution $p(\theta)$ over the spectrum of hypotheses
    - Needs a prior distribution on the parameters
Density estimation: Graphical model

- i.i.d. assumption
- [Plate-notation figures: the observed samples $x^{(1)}, \ldots, x^{(N)}$ share a single parameter node $\theta$; in the Bayesian view, $\theta$ is itself a random variable governed by hyperparameters $\alpha$]
Maximum Likelihood Estimation (MLE)

- The likelihood is the conditional probability of the observations $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ given the value of the parameters $\theta$
- Assuming i.i.d. (independent, identically distributed) samples:

  $p(\mathcal{D}|\theta) = p(x^{(1)}, \ldots, x^{(N)}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta)$   (the likelihood of $\theta$ w.r.t. the samples)

- Maximum likelihood estimation:

  $\theta_{ML} = \arg\max_{\theta} p(\mathcal{D}|\theta) = \arg\max_{\theta} \prod_{i=1}^{N} p(x^{(i)}|\theta) = \arg\max_{\theta} \sum_{i=1}^{N} \ln p(x^{(i)}|\theta)$

- MLE has a closed-form solution for many parametric distributions
MLE: Bernoulli distribution

- Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0):

  $P(x|\theta) = \theta^{x}(1-\theta)^{1-x}$, so $P(x=1|\theta) = \theta$

  $P(\mathcal{D}|\theta) = \prod_{i=1}^{N} P(x^{(i)}|\theta) = \prod_{i=1}^{N} \theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$

  $\ln P(\mathcal{D}|\theta) = \sum_{i=1}^{N} \left\{ x^{(i)}\ln\theta + (1-x^{(i)})\ln(1-\theta) \right\}$

- Setting $\frac{\partial}{\partial\theta}\ln P(\mathcal{D}|\theta) = 0$ gives

  $\theta_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N} = \frac{m}{N}$
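A minimal numerical check of this result, using a hypothetical sequence of coin flips: the MLE is just the empirical frequency of heads.

```python
import numpy as np

# Bernoulli MLE: theta_ML = m / N, the fraction of heads. The flips below are hypothetical.
x = np.array([1, 0, 1, 1, 0, 1])
theta_ml = x.mean()    # same as x.sum() / len(x)
print(theta_ml)        # 0.666...
```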
MLE: Multinomial distribution

- Multinomial distribution (on a variable with $K$ states):

  $P(x|\theta) = \prod_{k=1}^{K}\theta_k^{x_k}$, so $P(x_k = 1|\theta) = \theta_k$

- Parameter space: $\theta = (\theta_1, \ldots, \theta_K)$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K}\theta_k = 1$
- Variable: 1-of-K coding $x = (x_1, \ldots, x_K)$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$
- [Figure: the constraint $\theta_1 + \theta_2 + \theta_3 = 1$ with $\theta_k \in [0,1]$ defines a simplex showing the set of valid parameters]
MLE: Multinomial distribution

- Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$:

  $P(\mathcal{D}|\theta) = \prod_{i=1}^{N}\prod_{k=1}^{K}\theta_k^{x_k^{(i)}} = \prod_{k=1}^{K}\theta_k^{N_k}$, where $N_k = \sum_{i=1}^{N} x_k^{(i)}$

- Maximize the log-likelihood subject to $\sum_{k}\theta_k = 1$ using a Lagrange multiplier:

  $L(\theta, \lambda) = \ln P(\mathcal{D}|\theta) + \lambda\left(1 - \sum_{k=1}^{K}\theta_k\right)$

- Solving gives $\theta_k^{ML} = \frac{N_k}{N}$
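The same result checked numerically, with a hypothetical set of categorical observations encoded as state indices rather than 1-of-K vectors:

```python
import numpy as np

# Multinomial MLE: theta_k^ML is the relative frequency N_k / N of state k.
K = 3
x = np.array([0, 2, 1, 0, 0, 2])        # hypothetical observations, states 0..K-1
counts = np.bincount(x, minlength=K)    # N_k = number of samples in state k
theta_ml = counts / counts.sum()        # theta_k^ML = N_k / N
print(theta_ml)                         # [0.5, 0.1667, 0.3333]
```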
MLE: Gaussian with unknown μ

- Log-likelihood of a single sample:

  $\ln p(x^{(i)}|\theta) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(x^{(i)} - \mu\right)^2$

- Setting the derivative of the log-likelihood to zero:

  $\frac{\partial}{\partial\mu}\sum_{i=1}^{N}\ln p(x^{(i)}|\theta) = \frac{1}{\sigma^2}\sum_{i=1}^{N}\left(x^{(i)} - \mu\right) = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$
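A quick numerical illustration on synthetic data (drawn from a hypothetical N(2, 1)): the MLE of the mean is the sample mean.

```python
import numpy as np

# Gaussian-mean MLE: mu_ML is the sample mean of the data.
x = np.random.default_rng(0).normal(loc=2.0, scale=1.0, size=500)  # synthetic draws from N(2, 1)
mu_ml = x.mean()      # mu_ML = (1/N) * sum_i x^(i)
print(mu_ml)          # close to the true mean 2.0
```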
Bayesian approach

- Parameters $\theta$ are treated as random variables with an a priori distribution
  - Utilizes the available prior information about the unknown parameters
- As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector $\theta$
- The samples $\mathcal{D}$ convert the prior density $p(\theta)$ into a posterior density $p(\theta|\mathcal{D})$
  - Keeps track of beliefs about the values of $\theta$ and uses these beliefs for reaching conclusions
Maximum A Posteriori (MAP) estimation

- MAP estimation:

  $\theta_{MAP} = \arg\max_{\theta} p(\theta|\mathcal{D})$

- Since $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta)$:

  $\theta_{MAP} = \arg\max_{\theta} p(\mathcal{D}|\theta)\,p(\theta)$

- Example of a prior distribution: $p(\mu) = \mathcal{N}(\mu_0, \sigma^2)$
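A minimal sketch of MAP estimation for the mean of a Gaussian with known variance, using the Gaussian prior mentioned on the slide. The closed-form expression below is the standard conjugate-Gaussian result (not derived here); the prior variance s0_2 and all numbers are hypothetical.

```python
import numpy as np

# MAP estimate of mu for x ~ N(mu, sigma2) with sigma2 known and prior mu ~ N(mu0, s0_2).
rng = np.random.default_rng(1)
sigma2, mu0, s0_2 = 1.0, 0.0, 0.5
x = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=20)   # synthetic data
N = len(x)

# The posterior over mu is Gaussian, so its mode (the MAP estimate) equals its mean:
mu_map = (s0_2 * x.sum() + sigma2 * mu0) / (N * s0_2 + sigma2)
print(x.mean(), mu_map)   # the MAP estimate is pulled from the ML estimate (sample mean) toward mu0
```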
Bayesian approach: Predictive distribution

- Given a set of samples $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$, a prior distribution on the parameters $p(\theta)$, and the form of the distribution $p(x|\theta)$
- We find $p(\theta|\mathcal{D})$ and use it to specify $p(x|\mathcal{D})$ on new data as an estimate of $p(x)$:

  $p(x|\mathcal{D}) = \int p(x, \theta|\mathcal{D})\,d\theta = \int p(x|\mathcal{D}, \theta)\,p(\theta|\mathcal{D})\,d\theta = \int p(x|\theta)\,p(\theta|\mathcal{D})\,d\theta$   (predictive distribution)

- Analytical solutions exist only for very special forms of the involved functions
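A numerical sketch of this predictive integral for the Gaussian case with known variance and a conjugate Gaussian prior on the mean. The posterior formulas used below are the standard conjugate-Gaussian result, which these slides do not derive, and all values are illustrative.

```python
import numpy as np

# Monte Carlo approximation of p(x|D) = ∫ p(x|mu) p(mu|D) d mu.
rng = np.random.default_rng(1)
sigma2, mu0, s0_2 = 1.0, 0.0, 0.5
x_train = rng.normal(1.5, np.sqrt(sigma2), size=20)
N = len(x_train)

# Gaussian posterior p(mu|D): precisions add, the mean is a precision-weighted average.
var_post = 1.0 / (N / sigma2 + 1.0 / s0_2)
mu_post = var_post * (x_train.sum() / sigma2 + mu0 / s0_2)

def gauss_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Sample mu from the posterior and average the likelihood p(x|mu) at a new point x.
mu_samples = rng.normal(mu_post, np.sqrt(var_post), size=100_000)
x_new = 1.0
p_x = gauss_pdf(x_new, mu_samples, sigma2).mean()
print(p_x, gauss_pdf(x_new, mu_post, sigma2 + var_post))   # the two values agree closely
```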
Conjugate Priors

- We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties
- Choose a prior such that the posterior distribution, which is proportional to $p(\mathcal{D}|\theta)\,p(\theta)$, has the same functional form as the prior:

  $\forall \mathcal{D}: \quad p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta)$, with $p(\theta|\mathcal{D})$ and $p(\theta)$ having the same functional form
Prior for Bernoulli Likelihood

- Beta distribution over $\theta \in [0,1]$:

  $\mathrm{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)}\,\theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1} \;\propto\; \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$

  Mean: $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$    Most probable $\theta$ (mode): $\frac{\alpha_1 - 1}{\alpha_1 - 1 + \alpha_0 - 1}$

- The Beta distribution is the conjugate prior of the Bernoulli: $P(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
Beta distribution

- [Figure slide: plots of the Beta density]
Bernoulli likelihood: posterior

- Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0), where $m = \sum_{i=1}^{N} x^{(i)}$:

  $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta) = \left[\prod_{i=1}^{N}\theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}\right]\mathrm{Beta}(\theta|\alpha_1, \alpha_0) \;\propto\; \theta^{m+\alpha_1-1}(1-\theta)^{N-m+\alpha_0-1}$

- Hence the posterior is again a Beta distribution:

  $p(\theta|\mathcal{D}) \propto \mathrm{Beta}(\theta|\alpha_1', \alpha_0')$, with $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + N - m$
Example

- Bernoulli: $P(x|\theta) = \theta^{x}(1-\theta)^{1-x}$, so $P(x=1|\theta) = \theta$
- Prior: Beta with $\alpha_0 = \alpha_1 = 2$
- Data: $\mathcal{D} = \{1, 1, 1\}$, i.e., $m = 3$ heads out of $N = 3$ samples
- Posterior: Beta with $\alpha_1' = 5$, $\alpha_0' = 2$
- $\theta_{MAP} = \arg\max_{\theta} p(\theta|\mathcal{D}) = \frac{\alpha_1' - 1}{\alpha_1' - 1 + \alpha_0' - 1} = \frac{4}{5}$
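The slide's example reproduced numerically, updating the Beta pseudo-counts with the observed data:

```python
import numpy as np

# Prior Beta(alpha1=2, alpha0=2), data D = {1, 1, 1}, as in the example above.
alpha1, alpha0 = 2, 2
D = np.array([1, 1, 1])
m, N = D.sum(), len(D)

alpha1_post, alpha0_post = alpha1 + m, alpha0 + (N - m)               # posterior Beta(5, 2)
theta_map = (alpha1_post - 1) / (alpha1_post - 1 + alpha0_post - 1)   # mode of Beta(5, 2)
print(alpha1_post, alpha0_post, theta_map)                            # 5 2 0.8  (i.e., 4/5)
```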
Bernoulli: Predictive distribution

- Training samples: $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$

  $p(\theta) = \mathrm{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$

  $p(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\alpha_1+m, \alpha_0+N-m) \propto \theta^{\alpha_1+m-1}(1-\theta)^{\alpha_0+N-m-1}$

- Predictive distribution:

  $p(x|\mathcal{D}) = \int p(x|\theta)\,p(\theta|\mathcal{D})\,d\theta = E_{p(\theta|\mathcal{D})}\left[p(x|\theta)\right]$

  $p(x=1|\mathcal{D}) = E_{p(\theta|\mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$
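The closed-form posterior predictive evaluated on the previous hypothetical example (prior Beta(2, 2), three observed heads):

```python
import numpy as np

alpha1, alpha0 = 2, 2
D = np.array([1, 1, 1])
m, N = D.sum(), len(D)

p_heads = (alpha1 + m) / (alpha0 + alpha1 + N)   # p(x=1|D) = (alpha1 + m) / (alpha0 + alpha1 + N)
print(p_heads)                                   # 5/7 ≈ 0.714, more conservative than the MLE m/N = 1
```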
Dirichlet distribution

- Input space: $\theta = (\theta_1, \ldots, \theta_K)$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K}\theta_k = 1$

  $P(\theta|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K}\theta_k^{\alpha_k - 1} = \frac{\Gamma(\alpha)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\theta_k^{\alpha_k - 1}$, where $\alpha = \sum_{k=1}^{K}\alpha_k$

- Mean: $E[\theta_k] = \frac{\alpha_k}{\alpha}$    Mode: $\theta_k = \frac{\alpha_k - 1}{\alpha - K}$
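A small sketch of working with the Dirichlet distribution; the parameter vector is hypothetical, and the mean and mode follow the formulas on this slide.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 20.0])            # hypothetical Dirichlet parameters
alpha_sum, K = alpha.sum(), len(alpha)

mean = alpha / alpha_sum                       # E[theta_k] = alpha_k / alpha
mode = (alpha - 1) / (alpha_sum - K)           # valid when all alpha_k > 1
samples = np.random.default_rng(0).dirichlet(alpha, size=5)   # draws from the probability simplex
print(mean, mode, samples.sum(axis=1))         # every sample sums to 1
```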
Dirichlet distribution: Examples

- [Figure: Dirichlet densities for $\boldsymbol{\alpha} = [0.1, 0.1, 0.1]$, $\boldsymbol{\alpha} = [1, 1, 1]$, and $\boldsymbol{\alpha} = [10, 10, 10]$]
- Dirichlet parameters determine both the prior beliefs and their strength: larger values of $\alpha$ correspond to more confidence in the prior belief (i.e., more imaginary samples)
Dirichlet distribution: Example

- [Figure: Dirichlet densities for $\boldsymbol{\alpha} = [2, 2, 2]$ and $\boldsymbol{\alpha} = [20, 2, 2]$]
Multinomial distribution: Prior

- The Dirichlet distribution is the conjugate prior of the multinomial:

  $P(\theta|\mathcal{D}, \boldsymbol{\alpha}) \propto P(\mathcal{D}|\theta)\,P(\theta|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K}\theta_k^{N_k + \alpha_k - 1}$

  where $\boldsymbol{N} = (N_1, \ldots, N_K)$ are the sufficient statistics of the data

- Prior: $\theta \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$    Posterior: $\theta|\mathcal{D} \sim \mathrm{Dir}(\alpha_1 + N_1, \ldots, \alpha_K + N_K)$
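The Dirichlet-multinomial update amounts to adding the observed counts to the prior pseudo-counts; the prior and data in this sketch are hypothetical.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])               # Dir(1,1,1): a uniform prior over the simplex
x = np.array([0, 2, 1, 0, 0, 2])                # hypothetical categorical observations, states 0..2
counts = np.bincount(x, minlength=len(alpha))   # sufficient statistics N_k
alpha_post = alpha + counts                     # posterior: Dir(alpha_1+N_1, ..., alpha_K+N_K)
print(alpha_post, alpha_post / alpha_post.sum())   # [4. 2. 3.] and the posterior mean of theta
```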