Examples: MLE/MAP With Latent Variables CMSC 691 UMBC
Outline
• Constrained Optimization
• Distributions of distributions
• Example 1: A Model of Rolling a Die
• Example 2: A Model of Conditional Die Rolls
Lagrange multipliers
Assume an original (constrained) optimization problem. We convert it to a new optimization problem in which the constraint is folded into the objective via a Lagrange multiplier.
Lagrange multipliers: an equivalent problem?
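A generic sketch of the conversion (the specific objective from the original slides is not reproduced here; f, g, and c below are placeholder names for an assumed equality-constrained problem):

```latex
\begin{aligned}
&\text{original problem:} && \max_{\theta} \; f(\theta) \quad \text{s.t. } g(\theta) = c \\
&\text{new problem:} && \max_{\theta,\,\lambda} \; \mathcal{L}(\theta, \lambda)
   = f(\theta) - \lambda\bigl(g(\theta) - c\bigr) \\
&\text{stationarity:} && \nabla_{\theta}\mathcal{L} = \nabla_{\theta} f(\theta) - \lambda \nabla_{\theta} g(\theta) = 0,
   \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = c - g(\theta) = 0
\end{aligned}
```

Setting the derivative with respect to λ to zero recovers the original constraint, so stationary points of the new, unconstrained problem satisfy the conditions of the original constrained one.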
Outline
• Constrained Optimization
• Distributions of distributions
• Example 1: A Model of Rolling a Die
• Example 2: A Model of Conditional Die Rolls
Recap: Common Distributions
(Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma)
Categorical: a single draw
• Finite R.V. taking one of K values: 1, 2, …, K
• X ∼ Cat(θ), θ ∈ ℝ^K
• p(X = 1) = θ_1, p(X = 2) = θ_2, …, p(X = K) = θ_K
• Generally, p(X = k) = ∏_j θ_j^(1[X = j]), where 1[s] = 1 if s is true and 0 if s is false
Multinomial: sum of N iid Categorical draws
• Vector of size K representing how often value k was drawn
• X ∼ Multinomial(N, θ), θ ∈ ℝ^K
What if we want to make θ a random variable?
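A small sketch (hypothetical code, not from the slides) showing that the indicator-product form of the categorical pmf agrees with simply indexing into θ:

```python
import numpy as np

theta = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])  # assumed example parameters (sum to 1)

def cat_pmf_indicator(x, theta):
    # p(X = x) = prod_j theta_j ** 1[x == j], with outcomes labeled 1..K
    indicators = (np.arange(1, len(theta) + 1) == x).astype(float)
    return float(np.prod(theta ** indicators))

def cat_pmf_index(x, theta):
    # p(X = x) by directly indexing theta
    return float(theta[x - 1])

assert np.isclose(cat_pmf_indicator(3, theta), cat_pmf_index(3, theta))  # both equal 0.3
```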
Distribution of (multinomial) distributions
If θ is a K-dimensional multinomial parameter:
θ ∈ Δ^(K−1), θ_k ≥ 0, Σ_k θ_k = 1
we want some density F(β) that describes θ:
θ ∼ F(β), with ∫ F(θ; β) dθ = 1
Two Primary Options
Dirichlet Distribution:
Dir(θ; β) = [Γ(Σ_k β_k) / ∏_k Γ(β_k)] ∏_k θ_k^(β_k − 1),  β ∈ ℝ_+^K
A Beta distribution is the special case when K = 2.
https://en.wikipedia.org/wiki/Dirichlet_distribution
Logistic Normal:
θ ∼ G(μ, Σ) ⟺ logit(θ) ∼ N(μ, Σ)
G(θ; μ, Σ) ∝ (∏_k θ_k)^(−1) exp(−½ (z − μ)^⊤ Σ^(−1) (z − μ)), where z_k = log(θ_k / θ_K)
https://en.wikipedia.org/wiki/Logit-normal_distribution
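A minimal sketch (assumed example values, using numpy's Dirichlet sampler) of drawing a multinomial parameter θ from a Dirichlet and checking that it lies on the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.full(6, 2.0)            # assumed concentration parameters, beta_k > 0

theta = rng.dirichlet(beta)       # one draw: a distribution over the 6 die faces
assert np.all(theta >= 0) and np.isclose(theta.sum(), 1.0)

# With K = 2, a Dirichlet draw reduces to a Beta draw:
p = rng.beta(2.0, 2.0)            # analogous to rng.dirichlet([2.0, 2.0])[0]
```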
Outline
• Constrained Optimization
• Distributions of distributions
• Example 1: A Model of Rolling a Die
• Example 2: A Model of Conditional Die Rolls
Generative Story for Rolling a Die
N different (independent) rolls:
p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)
(the "for each" loop becomes a product)
Generative Story:
for roll i = 1 to N:
  x_i ∼ Cat(θ)
Calculate p(x_i) according to the provided distribution.
Example rolls: x_1 = 1, x_2 = 5, x_3 = 4, …
θ is a probability distribution over the 6 sides of the die:
0 ≤ θ_k ≤ 1 for all k, Σ_{k=1}^{6} θ_k = 1
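A minimal sketch of the generative story (θ is an assumed example distribution; np.random.choice plays the role of Cat(θ)):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.1, 0.1, 0.3, 0.2, 0.1])  # assumed distribution over the 6 sides
N = 9

# "for roll i = 1 to N: x_i ~ Cat(theta)"
rolls = rng.choice(np.arange(1, 7), size=N, p=theta)

# independence turns the "for each" loop into a product
joint = np.prod(theta[rolls - 1])
log_joint = np.sum(np.log(theta[rolls - 1]))
```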
Learning Parameters for the Die Model
p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)
Maximize the (log-)likelihood to learn the probability parameters.
Q: Why is maximizing log-likelihood a reasonable thing to do?
A: To develop a good model for what we observe.
Q: (for discrete observations) What loss function do we minimize to maximize log-likelihood?
A: Cross-entropy.
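A sketch (hypothetical rolls and parameters) showing that minimizing the average negative log-likelihood of discrete observations is the same as minimizing the cross-entropy between the empirical distribution of the data and θ:

```python
import numpy as np

rolls = np.array([1, 5, 4, 4, 2, 4, 6, 3, 1])     # assumed example rolls
theta = np.array([0.2, 0.1, 0.1, 0.3, 0.2, 0.1])  # candidate model parameters

# average negative log-likelihood of the observations
nll = -np.mean(np.log(theta[rolls - 1]))

# cross-entropy H(p_data, theta), with p_data the empirical distribution of the rolls
counts = np.bincount(rolls, minlength=7)[1:]      # counts for faces 1..6
p_data = counts / counts.sum()
cross_entropy = -np.sum(p_data * np.log(theta))

assert np.isclose(nll, cross_entropy)
```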
Learning Parameters for the Die Model: Maximum Likelihood (Intuition)
p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)
Maximize the (log-)likelihood to learn the probability parameters.
If you observe these 9 rolls, what are "reasonable" estimates for p(w)?
Maximum likelihood estimates:
p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9
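A sketch of the count-based estimate (the rolls below are assumed, chosen only to be consistent with the estimates on the slide):

```python
import numpy as np

rolls = np.array([1, 5, 4, 4, 2, 4, 6, 3, 1])  # assumed 9 example rolls

counts = np.bincount(rolls, minlength=7)[1:]   # counts for faces 1..6
theta_mle = counts / counts.sum()              # [2/9, 1/9, 1/9, 3/9, 1/9, 1/9]
```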
Learning Parameters for the Die Model: Maximum Likelihood (Math)
N different (independent) rolls:
p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_i p(x_i)
Example rolls: x_1 = 1, x_2 = 5, x_3 = 4, …
Q: What's the generative story?
A: for roll i = 1 to N: x_i ∼ Cat(θ)
Q: What's the objective?
A: Maximize the log-likelihood
ℓ(θ) = Σ_i log p_θ(x_i) = Σ_i log θ_{x_i}
Learning Parameters for the Die Model: Maximum Likelihood (Math)
Maximize the log-likelihood: ℓ(θ) = Σ_i log θ_{x_i}
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing each θ_k. (We know θ must be a distribution, but that constraint isn't specified in the objective as written.)
Learning Parameters for the Die Model: Maximum Likelihood (Math)
Maximize the log-likelihood with the distribution constraint:
ℓ(θ) = Σ_i log θ_{x_i}  s.t.  Σ_{k=1}^{6} θ_k = 1
(We could also include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.)
Solve using Lagrange multipliers.
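Carrying out the Lagrange-multiplier solution (a standard derivation sketch, with n_k denoting how many of the N rolls came up k) recovers the count-based estimates from the intuition slide:

```latex
\begin{aligned}
\mathcal{L}(\theta, \lambda)
  &= \sum_{i=1}^{N} \log \theta_{x_i} + \lambda\Bigl(1 - \sum_{k=1}^{6} \theta_k\Bigr)
   = \sum_{k=1}^{6} n_k \log \theta_k + \lambda\Bigl(1 - \sum_{k=1}^{6} \theta_k\Bigr) \\
\frac{\partial \mathcal{L}}{\partial \theta_k}
  &= \frac{n_k}{\theta_k} - \lambda = 0
   \;\Rightarrow\; \theta_k = \frac{n_k}{\lambda}, \qquad
   \sum_{k} \theta_k = 1 \;\Rightarrow\; \lambda = \sum_{k} n_k = N, \qquad
   \hat{\theta}_k^{\text{MLE}} = \frac{n_k}{N}
\end{aligned}
```

For the 9 example rolls this gives, e.g., 2/9 for face 1 and 3/9 for face 4, matching the earlier intuition.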