Probabilistic Graphical Models
Learning: Parameter Estimation
Max Likelihood for Log-Linear Models
Daphne Koller
Log-Likelihood for Markov Nets
[Figure: pairwise Markov network over A, B, C]
• Partition function couples the parameters
  – No decomposition of the likelihood
  – No closed-form solution
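A minimal sketch of the coupling, assuming the figure shows a pairwise network with potentials \phi_1(A,B) and \phi_2(B,C) (the exact structure is not recoverable from the extracted text): the log-likelihood of M instances is

\ell(\theta : D) = \sum_m \big[ \ln \phi_1(a[m], b[m]) + \ln \phi_2(b[m], c[m]) \big] - M \ln Z(\theta),
\qquad Z(\theta) = \sum_{a,b,c} \phi_1(a,b)\,\phi_2(b,c).

Because Z(\theta) sums over products of both potentials, the two terms cannot be optimized independently, unlike the decomposable likelihood of a Bayesian network.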
Example: Log-Likelihood Function
[Figure: surface plot of the log-likelihood function over the parameters]
Log-Likelihood for Log-Linear Model
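The formula itself did not survive extraction; the standard log-linear form it presumably shows, assuming the convention P(x; \theta) \propto \exp(\sum_i \theta_i f_i(x)), is

P(x; \theta) = \frac{1}{Z(\theta)} \exp\Big( \sum_i \theta_i f_i(x) \Big),
\qquad Z(\theta) = \sum_x \exp\Big( \sum_i \theta_i f_i(x) \Big),

\ell(\theta : D) = \sum_i \theta_i \sum_m f_i(x[m]) - M \ln Z(\theta).

The first term is linear in \theta; all the difficulty sits in the shared log-partition term.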
The Log-Partition Function
Theorem: \frac{\partial}{\partial \theta_i} \ln Z(\theta) = E_\theta[f_i]
Proof:
\frac{\partial}{\partial \theta_i} \ln Z(\theta)
= \frac{1}{Z(\theta)} \frac{\partial}{\partial \theta_i} \sum_x \exp\Big( \sum_j \theta_j f_j(x) \Big)
= \sum_x f_i(x) \frac{\exp\big( \sum_j \theta_j f_j(x) \big)}{Z(\theta)}
= \sum_x f_i(x) \, P(x; \theta)
= E_\theta[f_i]
The Log-Partition Function
Theorem: \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]
• The Hessian of \ln Z(\theta) is a covariance matrix, hence positive semidefinite, so \ln Z(\theta) is convex in \theta
• Log-likelihood function is therefore concave
  – No local optima
  – Easy to optimize
Maximum Likelihood Estimation
Theorem: \hat{\theta} is the MLE if and only if the expected feature counts under the model match the empirical feature counts in the data:
E_{\hat{\theta}}[f_i] = E_D[f_i] = \frac{1}{M} \sum_m f_i(x[m]) \quad \text{for all } i
Computation: Gradient Ascent
• Use gradient ascent
  – Typically L-BFGS, a quasi-Newton method
• For the gradient, need expected feature counts
  – In the data
  – Relative to the current model
• Requires inference at each gradient step
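In the notation above (the slide's own formula is not in the extracted text), the gradient is the gap between empirical and model expected feature counts:

\frac{1}{M} \frac{\partial}{\partial \theta_i} \ell(\theta : D) = E_D[f_i] - E_\theta[f_i].

Computing E_\theta[f_i] for all features takes one run of inference (e.g., one clique-tree calibration) per gradient step.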
Example: Ising Model
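The example's details are not in the extracted text; below is a minimal sketch of MLE gradient ascent for an Ising-style model, assuming binary variables x_i in {-1, +1}, pairwise features x_i x_j with weights w_ij, singleton weights u_i, and brute-force enumeration standing in for inference (feasible only for tiny models). All names here are illustrative, not from the slides.

import itertools
import numpy as np

def ising_log_score(x, w, u):
    """Unnormalized log-probability of one configuration x in {-1,+1}^n."""
    n = len(x)
    pair = sum(w[i, j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    single = sum(u[i] * x[i] for i in range(n))
    return pair + single

def model_pair_expectations(w, u, n):
    """E_theta[x_i * x_j] for all i < j, by exhaustive enumeration (exact inference)."""
    configs = list(itertools.product([-1, +1], repeat=n))
    log_scores = np.array([ising_log_score(x, w, u) for x in configs])
    probs = np.exp(log_scores - log_scores.max())
    probs /= probs.sum()              # dividing by the partition function Z(theta)
    expect = np.zeros((n, n))
    for p, x in zip(probs, configs):
        for i in range(n):
            for j in range(i + 1, n):
                expect[i, j] += p * x[i] * x[j]
    return expect

def gradient_step(data, w, u, lr=0.1):
    """One gradient-ascent step on the pairwise weights:
    gradient = E_D[x_i x_j] - E_theta[x_i x_j]."""
    n = data.shape[1]
    empirical = np.zeros((n, n))
    for x in data:
        for i in range(n):
            for j in range(i + 1, n):
                empirical[i, j] += x[i] * x[j]
    empirical /= len(data)
    return w + lr * (empirical - model_pair_expectations(w, u, n))

# Usage: 3 variables, 20 samples of +/-1 values, 50 gradient steps.
rng = np.random.default_rng(0)
data = rng.choice([-1, +1], size=(20, 3))
w, u = np.zeros((3, 3)), np.zeros(3)
for _ in range(50):
    w = gradient_step(data, w, u)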
Summary
• Partition function couples parameters in the likelihood
• No closed-form solution, but convex optimization
  – Solved using gradient ascent (usually L-BFGS)
• Gradient computation requires inference at each gradient step to compute expected feature counts
• Features are always within clusters in the cluster graph or clique tree, due to family preservation
  – One calibration suffices for all feature expectations
Probabilistic Graphical Models
Learning: Parameter Estimation
Max Likelihood for CRFs
Daphne Koller
Estimation for CRFs
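The objective itself is not in the extracted text; the standard conditional log-likelihood it presumably refers to has a separate partition function Z_x(\theta) for every input x:

P(y \mid x; \theta) = \frac{1}{Z_x(\theta)} \exp\Big( \sum_i \theta_i f_i(x, y) \Big),
\qquad Z_x(\theta) = \sum_y \exp\Big( \sum_i \theta_i f_i(x, y) \Big),

\ell_{Y|X}(\theta : D) = \sum_m \Big[ \sum_i \theta_i f_i(x[m], y[m]) - \ln Z_{x[m]}(\theta) \Big].

This is still concave in \theta, for the same reason as the MRF log-likelihood.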
Example
[Figure: grid of superpixel label variables Y_i, Y_j over an image]
f_1(Y_s, X_s) = 1\{Y_s = g\} \cdot G_s, where G_s is the average intensity of the green channel for pixels in superpixel s
f_2(Y_s, Y_t) = 1\{Y_s = Y_t\}
Computation
MRF: requires inference at each gradient step
CRF: requires inference for each x[m] at each gradient step
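In the notation above (a standard derivation, not taken from the slide text), the gradients make the difference explicit:

MRF: \quad \frac{1}{M} \frac{\partial \ell}{\partial \theta_i} = E_D[f_i] - E_\theta[f_i]
(one inference run per gradient step gives E_\theta[f_i] for all features)

CRF: \quad \frac{\partial \ell_{Y|X}}{\partial \theta_i} = \sum_m \Big( f_i(x[m], y[m]) - E_\theta[f_i \mid x[m]] \Big)
(each instance has its own Z_{x[m]}(\theta), so each gradient step needs one inference run per instance)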
However…
• For inference of P(Y | x), we need to compute a distribution only over Y
• If we learn an MRF, we need to compute P(Y, X), which may be much more complex
[Figure: the superpixel example, with f_1(Y_s, X_s) = 1\{Y_s = g\} \cdot G_s and f_2(Y_s, Y_t) = 1\{Y_s = Y_t\}]
Summary
• CRF learning is very similar to MRF learning
  – Likelihood function is concave
  – Optimized using gradient ascent (usually L-BFGS)
• Gradient computation requires inference: one run per gradient step per data instance
  – cf. once per gradient step for MRFs
• But the conditional model is often much simpler, so the inference costs for a CRF and an MRF are not the same
Probabilistic Graphical Models
Learning: Parameter Estimation
MAP Estimation for MRFs, CRFs
Daphne Koller
Gaussian Parameter Prior
[Figure: Gaussian density over a parameter value, plotted from -10 to 10]
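The density itself is not in the extracted text; assuming the standard choice of a zero-mean Gaussian with variance \sigma^2 on each parameter independently:

P(\theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{\theta_i^2}{2\sigma^2} \Big),
\qquad -\ln P(\theta) = \sum_i \frac{\theta_i^2}{2\sigma^2} + \text{const},

so the negative log-prior is a quadratic (L2) penalty on the parameters.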
Laplacian Parameter Prior
[Figure: Laplacian density over a parameter value, plotted from -10 to 10]
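Similarly, assuming a zero-mean Laplacian with scale \beta on each parameter independently (the formula is not in the extracted text):

P(\theta_i) = \frac{1}{2\beta} \exp\Big( -\frac{|\theta_i|}{\beta} \Big),
\qquad -\ln P(\theta) = \sum_i \frac{|\theta_i|}{\beta} + \text{const},

so the negative log-prior is an absolute-value (L1) penalty, which pushes parameters exactly to zero.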
MAP Estimation & Regularization
[Figure: -log P(\theta) penalty curves for the L2 (Gaussian) and L1 (Laplacian) priors]
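Putting the pieces together (a standard formulation; the exact expression is not in the extracted text), MAP estimation maximizes the penalized log-likelihood:

\hat{\theta}_{MAP} = \arg\max_\theta \; \ell(\theta : D) + \ln P(\theta)
= \arg\max_\theta \; \ell(\theta : D) - \lambda \|\theta\|_2^2
\;\text{ or }\;
\arg\max_\theta \; \ell(\theta : D) - \lambda \|\theta\|_1,

with \lambda = 1/(2\sigma^2) for the Gaussian prior and \lambda = 1/\beta for the Laplacian prior. The objective remains concave, so gradient-based optimization still applies.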
Summary
• In undirected models, parameter coupling prevents efficient Bayesian estimation
• However, we can still use parameter priors to avoid overfitting of the MLE
• Typical priors are L1 and L2
  – Drive parameters toward zero
• L1 provably induces sparse solutions
  – Performs feature selection / structure learning