MLE/MAP With Latent Variables
CMSC 691, UMBC


  1. Examples: MLE/MAP With Latent Variables (CMSC 691, UMBC)

  2. Outline: Constrained Optimization; Distributions of distributions; Example 1: A Model of Rolling a Die; Example 2: A Model of Conditional Die Rolls.

  3. Lagrange multipliers. Assume an original optimization problem: maximize f(θ) subject to a constraint g(θ) = 0.

  4. Lagrange multipliers. Assume an original optimization problem: maximize f(θ) subject to g(θ) = 0. We convert it to a new optimization problem: optimize the Lagrangian Λ(θ, λ) = f(θ) − λ g(θ) over θ and the multiplier λ.

  5.–7. Lagrange multipliers: an equivalent problem?
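A standard way to see the equivalence, sketched in LaTeX under the generic setup of slides 3–4 (the slide equations themselves are not in the surviving text):

    \Lambda(\theta, \lambda) = f(\theta) - \lambda\, g(\theta)

    % Stationarity in \lambda recovers the original constraint:
    \frac{\partial \Lambda}{\partial \lambda} = -g(\theta) = 0
    \quad\Longleftrightarrow\quad g(\theta) = 0

    % Stationarity in \theta gives the gradient-alignment condition:
    \nabla_{\theta} \Lambda = \nabla f(\theta) - \lambda\, \nabla g(\theta) = 0

So a stationary point of Λ automatically satisfies the constraint and is a constrained stationary point of f.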

  8. Outline: Constrained Optimization; Distributions of distributions; Example 1: A Model of Rolling a Die; Example 2: A Model of Conditional Die Rolls.

  9. Recap: Common Distributions (Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma). Categorical: a single draw; a finite R.V. taking one of K values 1, 2, …, K. Y ~ Cat(π), π ∈ ℝ^K, with p(Y = 1) = π_1, p(Y = 2) = π_2, …, p(Y = K) = π_K. Generally, p(Y = x) = ∏_k π_k^(1[x = k]), where 1[s] = 1 if s is true and 0 if s is false. Multinomial: the sum of N iid categorical draws; a vector of size K recording how often each value k was drawn. Y ~ Multinomial(N, π), π ∈ ℝ^K.

  10. Recap: Common Distributions (continued). The same recap, with one new question: what if we want to make π itself a random variable?
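The categorical pmf above is easy to check numerically. A minimal Python/NumPy sketch (not from the slides; the parameter values are illustrative):

    import numpy as np

    pi = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])  # categorical parameter; sums to 1

    def cat_pmf(x, pi):
        # p(Y = x) = prod_k pi_k^{1[x = k]} -- the indicator picks out pi_x
        indicator = (np.arange(1, len(pi) + 1) == x).astype(float)
        return float(np.prod(pi ** indicator))

    assert np.isclose(cat_pmf(3, pi), pi[2])  # the product form reduces to pi_x

    # Multinomial(N, pi): counts of each value over N iid categorical draws
    rng = np.random.default_rng(0)
    counts = rng.multinomial(10, pi)  # length-K count vector; counts.sum() == 10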

  11. Distribution of (multinomial) distributions. If θ is a K-dimensional multinomial parameter…

  12. Distribution of (multinomial) distributions. If θ is a K-dimensional multinomial parameter, θ ∈ Δ^(K−1): θ_k ≥ 0 and Σ_k θ_k = 1, then we want some density F that describes θ.

  13. Distribution of (multinomial) distributions. If θ is a K-dimensional multinomial parameter, θ ∈ Δ^(K−1): θ_k ≥ 0 and Σ_k θ_k = 1, then we want some density F that describes θ: θ ~ F(α), with ∫_θ F(θ; α) dθ = 1.

  14. Two Primary Options. Dirichlet Distribution: Dir(θ; α) = [Γ(Σ_k α_k) / ∏_k Γ(α_k)] ∏_k θ_k^(α_k − 1), with α ∈ ℝ_+^K. https://en.wikipedia.org/wiki/Logit-normal_distribution https://en.wikipedia.org/wiki/Dirichlet_distribution

  15. Two Primary Options. Dirichlet Distribution: Dir(θ; α) = [Γ(Σ_k α_k) / ∏_k Γ(α_k)] ∏_k θ_k^(α_k − 1), α ∈ ℝ_+^K. A Beta distribution is the special case when K = 2. https://en.wikipedia.org/wiki/Dirichlet_distribution

  16. Two Primary Options. Dirichlet Distribution: Dir(θ; α) = [Γ(Σ_k α_k) / ∏_k Γ(α_k)] ∏_k θ_k^(α_k − 1), α ∈ ℝ_+^K; a Beta distribution is the special case when K = 2. Logistic Normal: θ ~ LN(μ, Σ) ⟷ logit(θ) ~ N(μ, Σ), with LN(θ; μ, Σ) ∝ (∏_k θ_k)^(−1) exp(−(1/2)(z − μ)^T Σ^(−1) (z − μ)), where z_k = log(θ_k / θ_K). https://en.wikipedia.org/wiki/Logit-normal_distribution https://en.wikipedia.org/wiki/Dirichlet_distribution
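Both options are straightforward to sample with NumPy. The sketch below (not from the slides; parameter values are illustrative) draws one θ from each and checks that it lands on the simplex:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 6

    # Dirichlet: draw theta directly on the simplex
    alpha = np.ones(K)  # alpha in R_+^K; all-ones is uniform over the simplex
    theta_dir = rng.dirichlet(alpha)

    # Logistic normal: draw z ~ N(mu, Sigma) in R^{K-1}, then map to the simplex
    mu, Sigma = np.zeros(K - 1), np.eye(K - 1)
    z = rng.multivariate_normal(mu, Sigma)
    exp_z = np.exp(np.append(z, 0.0))  # last coordinate is the reference (z_K = 0)
    theta_ln = exp_z / exp_z.sum()     # inverts z_k = log(theta_k / theta_K)

    for theta in (theta_dir, theta_ln):
        assert np.all(theta >= 0) and np.isclose(theta.sum(), 1.0)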

  17. Outline: Constrained Optimization; Distributions of distributions; Example 1: A Model of Rolling a Die; Example 2: A Model of Conditional Die Rolls.

  18. Generative Story for Rolling a Die. N different (independent) rolls: p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i). Generative story: for roll i = 1 to N, draw x_i ~ Cat(θ), and calculate p(x_i) according to the provided distribution. The "for each" loop becomes a product. Example observations: x_1 = 1, x_2 = 5, x_3 = 4, ⋯

  19. Generative Story for Rolling a Die. N different (independent) rolls: p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i). Generative story: for roll i = 1 to N, draw x_i ~ Cat(θ), and calculate p(x_i) according to the provided distribution. Example observations: x_1 = 1, x_2 = 5, x_3 = 4, ⋯ Here θ is a probability distribution over the 6 sides of the die: 0 ≤ θ_k ≤ 1 for all k, and Σ_{k=1}^{6} θ_k = 1.
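The generative story translates nearly line for line into code. A minimal sketch, assuming an arbitrary illustrative θ:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])  # distribution over the 6 sides
    N = 9

    # Generative story: for roll i = 1 to N, draw x_i ~ Cat(theta)
    x = rng.choice(np.arange(1, 7), size=N, p=theta)

    # Independence: the joint probability is the product of per-roll probabilities
    joint = np.prod(theta[x - 1])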

  20. Learning Parameters for the Die Model. p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i). Maximize the (log-)likelihood to learn the probability parameters. Q: Why is maximizing log-likelihood a reasonable thing to do?

  21. Learning Parameters for the Die Model. p(x_1, x_2, …, x_N) = ∏_i p(x_i); maximize the (log-)likelihood to learn the probability parameters. Q: Why is maximizing log-likelihood a reasonable thing to do? A: To develop a good model for what we observe.

  22. Learning Parameters for the Die Model. p(x_1, x_2, …, x_N) = ∏_i p(x_i); maximize the (log-)likelihood to learn the probability parameters. Q: (For discrete observations) what loss function do we minimize in order to maximize log-likelihood?

  23. Learning Parameters for the Die Model. p(x_1, x_2, …, x_N) = ∏_i p(x_i); maximize the (log-)likelihood to learn the probability parameters. Q: (For discrete observations) what loss function do we minimize in order to maximize log-likelihood? A: Cross-entropy.
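To make the cross-entropy answer concrete: the cross-entropy between the empirical distribution of the rolls and the model equals the average negative log-likelihood, so minimizing one maximizes the other. A small sketch (the rolls are assumed, chosen to be consistent with the counts on slide 25):

    import numpy as np

    x = np.array([1, 5, 4, 4, 4, 1, 2, 3, 6])       # 9 observed rolls (values 1..6)
    theta = np.full(6, 1 / 6)                       # a candidate model
    emp = np.bincount(x, minlength=7)[1:] / len(x)  # empirical distribution over sides

    cross_entropy = -np.sum(emp * np.log(theta))
    avg_neg_log_lik = -np.mean(np.log(theta[x - 1]))
    assert np.isclose(cross_entropy, avg_neg_log_lik)  # identical by construction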

  24. Learning Parameters for the Die Model: Maximum Likelihood (Intuition). p(x_1, x_2, …, x_N) = ∏_i p(x_i); maximize the (log-)likelihood to learn the probability parameters. If you observe these 9 rolls… what are "reasonable" estimates for p(w)? p(1) = ?, p(2) = ?, p(3) = ?, p(4) = ?, p(5) = ?, p(6) = ?

  25. Learning Parameters for the Die Model: Maximum Likelihood (Intuition). p(x_1, x_2, …, x_N) = ∏_i p(x_i); maximize the (log-)likelihood to learn the probability parameters. If you observe these 9 rolls… the maximum likelihood estimates are: p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9.
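The intuitive estimates are just counts divided by N; a quick numerical check (the nine rolls are assumed, chosen to match the counts above):

    import numpy as np

    x = np.array([1, 5, 4, 4, 4, 1, 2, 3, 6])  # 9 rolls consistent with the slide's counts
    counts = np.bincount(x, minlength=7)[1:]   # occurrences of each side 1..6
    theta_mle = counts / len(x)                # MLE: relative frequencies
    # -> [2/9, 1/9, 1/9, 3/9, 1/9, 1/9]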

  26. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i). Example observations: x_1 = 1, x_2 = 5, x_3 = 4, ⋯ Q: What's the generative story?

  27. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(x_1, x_2, …, x_N) = ∏_i p(x_i). Generative story: for roll i = 1 to N, x_i ~ Cat(θ). Example observations: x_1 = 1, x_2 = 5, x_3 = 4, ⋯ Q: What's the objective?

  28. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(x_1, x_2, …, x_N) = ∏_i p(x_i). Generative story: for roll i = 1 to N, x_i ~ Cat(θ). Maximize the log-likelihood: ℒ(θ) = Σ_i log p_θ(x_i) = Σ_i log θ_{x_i}.
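As a sketch (not from the slides), the objective ℒ(θ) = Σ_i log θ_{x_i} is a one-liner, and the count-based estimates from slide 25 do score higher than, say, a uniform die:

    import numpy as np

    def log_likelihood(theta, x):
        # L(theta) = sum_i log theta_{x_i}, for rolls x_i in {1, ..., 6}
        return np.sum(np.log(theta[x - 1]))

    x = np.array([1, 5, 4, 4, 4, 1, 2, 3, 6])
    theta_mle = np.bincount(x, minlength=7)[1:] / len(x)
    assert log_likelihood(theta_mle, x) > log_likelihood(np.full(6, 1 / 6), x)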

  29. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(x_1, x_2, …, x_N) = ∏_i p(x_i). Maximize the log-likelihood: ℒ(θ) = Σ_i log θ_{x_i}. Q: What's an easy way to maximize this, as written exactly (even without calculus)?

  30. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(x_1, x_2, …, x_N) = ∏_i p(x_i). Maximize the log-likelihood: ℒ(θ) = Σ_i log θ_{x_i}. Q: What's an easy way to maximize this, as written exactly (even without calculus)? A: Just keep increasing each θ_k: we know θ must be a distribution, but the objective as written doesn't specify that constraint.

  31. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(x_1, x_2, …, x_N) = ∏_i p(x_i). Maximize the log-likelihood, now with distribution constraints: ℒ(θ) = Σ_i log θ_{x_i} s.t. Σ_{k=1}^{6} θ_k = 1. (We can include the inequality constraints 0 ≤ θ_k, but that complicates the problem and, right now, is not needed.) Solve using Lagrange multipliers.
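Carrying the Lagrange-multiplier solution through (a standard derivation, not shown in the surviving slide text), in LaTeX, with c_k denoting the number of rolls equal to k:

    \Lambda(\theta, \lambda)
      = \sum_{k=1}^{6} c_k \log \theta_k
        - \lambda \Big( \sum_{k=1}^{6} \theta_k - 1 \Big)

    \frac{\partial \Lambda}{\partial \theta_k}
      = \frac{c_k}{\theta_k} - \lambda = 0
      \;\Rightarrow\; \theta_k = \frac{c_k}{\lambda};
    \qquad
    \sum_k \theta_k = 1 \;\Rightarrow\; \lambda = \sum_k c_k = N,
    \qquad
    \hat\theta_k = \frac{c_k}{N}

This recovers exactly the relative-frequency estimates from slide 25.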
