
Max Likelihood for Log-Linear Models (Daphne Koller) - PowerPoint PPT Presentation



  1. Probabilistic Graphical Models, Learning: Parameter Estimation. Max Likelihood for Log-Linear Models. Daphne Koller

  2. Log-Likelihood for Markov Nets (example network: A - B - C) • Partition function couples the parameters – No decomposition of the likelihood – No closed-form solution
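
The equations on this slide are images in the original deck; a hedged reconstruction for the A - B - C pairwise example, using the standard parameterization from the course:

  P(A, B, C) = \frac{1}{Z} \phi_1(A, B) \, \phi_2(B, C), \qquad Z = \sum_{A,B,C} \phi_1(A, B) \, \phi_2(B, C)

  \ell(\theta : D) = \sum_m \big[ \ln \phi_1(a[m], b[m]) + \ln \phi_2(b[m], c[m]) \big] - M \ln Z(\theta)

Because Z sums over products of all the factors, the \ln Z(\theta) term ties \phi_1 and \phi_2 together, which is why the likelihood does not decompose and has no closed-form maximum.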

  3. Example: Log-Likelihood Function [figure: log-likelihood surface for the A - B - C example; axis values omitted]

  4. Log-Likelihood for Log-Linear Model
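
The formulas on this slide are image-only; a standard reconstruction of the log-linear log-likelihood in the notation used throughout this deck (features f_i, parameters \theta_i, M data instances x[1], ..., x[M]):

  P(X; \theta) = \frac{1}{Z(\theta)} \exp\Big( \sum_i \theta_i f_i(X) \Big), \qquad Z(\theta) = \sum_X \exp\Big( \sum_i \theta_i f_i(X) \Big)

  \ell(\theta : D) = \sum_i \theta_i \sum_m f_i(x[m]) - M \ln Z(\theta)

The first term is linear in \theta; all of the difficulty sits in the \ln Z(\theta) term.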

  5. The Log-Partition Function – Theorem and proof (given on the slide; see the reconstruction below)
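
The theorem and proof on this slide are image-only; the standard statement, which the later slides rely on, is (a hedged reconstruction):

  Theorem:  \frac{\partial}{\partial \theta_i} \ln Z(\theta) = E_\theta[ f_i ]

  Proof sketch:  \frac{\partial}{\partial \theta_i} \ln Z(\theta)
    = \frac{1}{Z(\theta)} \sum_X f_i(X) \exp\Big( \sum_j \theta_j f_j(X) \Big)
    = \sum_X f_i(X) \, P(X; \theta)
    = E_\theta[ f_i ]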

  6. The Log-Partition Function – Theorem (second-order statement; see below) • Log-likelihood function – No local optima – Easy to optimize
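
The theorem referenced here is also image-only; the usual second-order statement behind "no local optima" (a hedged reconstruction):

  \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[ f_i ; f_j ]

Since a covariance matrix is positive semidefinite, \ln Z(\theta) is convex in \theta, so the log-likelihood \ell(\theta : D) = \sum_i \theta_i \sum_m f_i(x[m]) - M \ln Z(\theta) is concave: it has no local optima, although it may have multiple global optima when features are linearly dependent.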

  7. Maximum Likelihood Estimation – Theorem: the estimate is the MLE if and only if the moment-matching condition below holds
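
A hedged reconstruction of the (image-only) condition: \hat{\theta} is the MLE if and only if

  E_D[ f_i(X) ] = E_{\hat{\theta}}[ f_i ]   for all i,

i.e. the empirical expectation of every feature in the data matches its expected value under the learned model (moment matching).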

  8. Computation: Gradient Ascent • Use gradient ascent – typically L-BFGS, a quasi-Newton method • For the gradient, need expected feature counts – in the data – relative to the current model • Requires inference at each gradient step (a runnable sketch follows)
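
A minimal runnable sketch of this procedure, not taken from the slides: a tiny log-linear model over three binary variables with two agreement features, exact expected counts by enumeration (standing in for the inference step), and scipy's L-BFGS optimizer. The feature choices, variable names, and synthetic counts are illustrative assumptions.

import numpy as np
from itertools import product
from scipy.optimize import minimize

# Features on the chain A - B - C: f0 = 1{A == B}, f1 = 1{B == C}
def features(a, b, c):
    return np.array([float(a == b), float(b == c)])

STATES = list(product([0, 1], repeat=3))
F = np.array([features(*x) for x in STATES])   # one row of feature values per joint state

def neg_log_likelihood_and_grad(theta, data_counts):
    M = data_counts.sum()
    scores = F @ theta                          # sum_i theta_i f_i(x) for every state
    logZ = np.logaddexp.reduce(scores)          # log partition function (exact, by enumeration)
    probs = np.exp(scores - logZ)               # P(x; theta)
    emp_counts = data_counts @ F                # sum_m f_i(x[m]): expected counts in the data
    model_counts = M * (probs @ F)              # M * E_theta[f_i]: expected counts under the model
    nll = -(theta @ emp_counts - M * logZ)
    grad = -(emp_counts - model_counts)         # gradient of the NEGATIVE log-likelihood
    return nll, grad

# Synthetic data: counts of each joint assignment (hypothetical numbers)
rng = np.random.default_rng(0)
data_counts = rng.integers(0, 20, size=len(STATES)).astype(float)

res = minimize(neg_log_likelihood_and_grad, x0=np.zeros(2), args=(data_counts,),
               jac=True, method="L-BFGS-B")
print("MLE parameters:", res.x)

At the optimum the gradient vanishes, so the empirical and model feature expectations match, which is exactly the condition on slide 7.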

  9. Example: Ising Model
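
The slide body is a figure; for reference, the standard Ising model written as a log-linear model (an assumed reconstruction of what the slide illustrates), with variables x_i \in \{-1, +1\}:

  P(x_1, \ldots, x_n; \theta) = \frac{1}{Z(\theta)} \exp\Big( \sum_{i < j} w_{i,j} \, x_i x_j + \sum_i u_i \, x_i \Big)

The features are the pairwise products x_i x_j (parameters w_{i,j}) and the singletons x_i (parameters u_i), so the gradient requires the pairwise expectations E_\theta[ X_i X_j ] – exactly the inference step discussed on the previous slide.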

  10. Summary • Partition function couples the parameters in the likelihood • No closed-form solution, but a convex optimization problem – Solved using gradient ascent (usually L-BFGS) • Gradient computation requires inference at each gradient step to compute expected feature counts • Features always lie within clusters of the cluster graph or clique tree, due to family preservation – One calibration suffices for all feature expectations

  11. Probabilistic Graphical Models, Learning: Parameter Estimation. Max Likelihood for CRFs. Daphne Koller

  12. Estimation for CRFs
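
The derivation on this slide is image-only; a standard reconstruction of the conditional log-likelihood and its gradient, matching the notation above:

  P(Y \mid x; \theta) = \frac{1}{Z_x(\theta)} \exp\Big( \sum_i \theta_i f_i(x, Y) \Big), \qquad Z_x(\theta) = \sum_Y \exp\Big( \sum_i \theta_i f_i(x, Y) \Big)

  \ell_{Y \mid X}(\theta : D) = \sum_m \Big[ \sum_i \theta_i f_i(x[m], y[m]) - \ln Z_{x[m]}(\theta) \Big]

  \frac{\partial}{\partial \theta_i} \ell_{Y \mid X}(\theta : D) = \sum_m \Big[ f_i(x[m], y[m]) - E_\theta[ f_i(x[m], Y) \mid x[m] ] \Big]

Note that the partition function Z_{x[m]}(\theta) now depends on the observed x[m], which is what drives the per-instance inference cost discussed on the next slides.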

  13. Example (superpixel labeling; variables Y_i, Y_j in the figure) • f_1(Y_s, X_s) = 1{Y_s = g} · G_s, where G_s is the average intensity of the green channel for pixels in superpixel s • f_2(Y_s, Y_t) = 1{Y_s = Y_t}

  14. Computation • MRF: requires inference at each gradient step • CRF: requires inference for each x[m] at each gradient step
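
Side by side, the two gradients (a hedged reconstruction consistent with the formulas above):

  MRF:  \frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : D) = E_D[ f_i ] - E_\theta[ f_i ]    (one expectation over P(X; \theta) per gradient step)

  CRF:  \frac{\partial}{\partial \theta_i} \ell_{Y \mid X}(\theta : D) = \sum_m \big[ f_i(x[m], y[m]) - E_\theta[ f_i(x[m], Y) \mid x[m] ] \big]    (one conditional expectation per instance x[m] per gradient step)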

  15. However… • For inference of P(Y | x), we need to compute a distribution only over Y • If we learn an MRF, we need to compute P(Y, X), which may be much more complex • (The figure repeats the features from slide 13: f_1(Y_s, X_s) = 1{Y_s = g} · G_s, with G_s the average green-channel intensity over superpixel s, and f_2(Y_s, Y_t) = 1{Y_s = Y_t})

  16. Summary • CRF learning is very similar to MRF learning – Likelihood function is concave – Optimized using gradient ascent (usually L-BFGS) • Gradient computation requires inference: one run per gradient step per data instance – cf. once per gradient step for MRFs • But the conditional model is often much simpler, so the inference costs for CRFs and MRFs are not the same

  17. Probabilistic Graphical Models, Learning: Parameter Estimation. MAP Estimation for MRFs, CRFs. Daphne Koller

  18. Gaussian Parameter Prior [figure: plot of the prior density; axis values omitted]

  19. Laplacian Parameter Prior [figure: plot of the prior density; axis values omitted]

  20. MAP Estimation & Regularization -log P( ) L 2 L 1 Daphne Koller

  21. Summary • In undirected models, parameter coupling prevents efficient Bayesian estimation • However, we can still use parameter priors to avoid overfitting of the MLE • Typical priors are L1, L2 – Drive parameters toward zero • L1 provably induces sparse solutions – Performs feature selection / structure learning
