Maximum Likelihood Estimation CS 446
Maximum likelihood: abstract formulation

We’ve had one main “meta-algorithm” this semester:
◮ (Regularized) ERM principle: pick the model that minimizes an average loss over training data.

We’ve also discussed another: the “maximum likelihood estimation (MLE)” principle:
◮ Pick a set of probability models for your data: P := { p_θ : θ ∈ Θ }.
◮ p_θ will denote both densities and masses; the literature is similarly inconsistent.
◮ Given samples (z_i)_{i=1}^n, pick the model that maximizes the likelihood
\[
  \max_{\theta \in \Theta} L(\theta)
  = \max_{\theta \in \Theta} \ln \prod_{i=1}^n p_\theta(z_i)
  = \max_{\theta \in \Theta} \sum_{i=1}^n \ln p_\theta(z_i),
\]
where the ln(·) is for mathematical convenience, and z_i can be a labeled pair (x_i, y_i) or just x_i.
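As a concrete illustration of this abstract recipe, here is a minimal numerical-MLE sketch (my own example, not from the lecture): it fits the scale parameter of an exponential distribution by minimizing the negative log-likelihood with a generic optimizer. The distribution, data, and variable names are all illustrative assumptions.

```python
# A minimal numerical-MLE sketch (assumed example): pick a parametric family
# p_theta, then maximize sum_i ln p_theta(z_i) by minimizing the negative
# log-likelihood with a generic optimizer.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import expon

rng = np.random.default_rng(0)
z = rng.exponential(scale=2.0, size=500)          # samples z_1, ..., z_n

def neg_log_likelihood(scale):
    # -L(theta) = -sum_i ln p_theta(z_i)
    return -np.sum(expon.logpdf(z, scale=scale))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-3, 10.0), method="bounded")
print(result.x, z.mean())   # for this family, the MLE of the scale is the sample mean
```

For this particular family the optimizer lands on the sample mean, which foreshadows the closed-form examples on the next slides.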
Connections between ERM and MLE

◮ We can often derive and justify many basic methods with either (e.g., least squares, logistic regression, k-means, ...).
◮ MLE ideas were used to derive VAEs, which we’ll cover next week!
◮ Each perspective suggests some different details and interpretations.
◮ Both approaches rely upon seemingly arbitrary assumptions and choices.
◮ The success of MLE often seems to hinge upon an astute choice of model.
◮ Applied scientists often like MLE and its ilk due to interpretability and “usability”: they can easily encode domain knowledge. We’ll return to this.
Example 1: coin flips

◮ We flip a coin of bias θ ∈ [0, 1].
◮ Write down x_i = 0 for tails, x_i = 1 for heads; then
\[
  p_\theta(x_i) = x_i \theta + (1 - x_i)(1 - \theta),
  \qquad \text{or alternatively} \qquad
  p_\theta(x_i) = \theta^{x_i} (1 - \theta)^{1 - x_i}.
\]
The second form will be more convenient.
◮ Writing H := ∑_i x_i and T := ∑_i (1 − x_i) = n − H for convenience,
\[
  L(\theta) = \sum_{i=1}^n \Bigl( x_i \ln \theta + (1 - x_i) \ln(1 - \theta) \Bigr)
            = H \ln \theta + T \ln(1 - \theta).
\]
Differentiating and setting to 0,
\[
  0 = \frac{H}{\theta} - \frac{T}{1 - \theta},
\]
which gives θ = H / (H + T) = H / n.
◮ In this way, we’ve justified a natural algorithm (a quick numerical check follows below).
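Here is a short sketch checking the closed-form answer θ = H/n against a brute-force search over the log-likelihood; the simulated coin, its bias of 0.3, and the grid resolution are illustrative assumptions, not part of the lecture.

```python
# Sanity check of the coin-flip MLE (assumed illustrative example):
# compare theta = H / n with the maximizer of L(theta) on a fine grid.
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)      # x_i in {0, 1}
H, n = x.sum(), x.size

thetas = np.linspace(1e-4, 1 - 1e-4, 10_000)
log_lik = H * np.log(thetas) + (n - H) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])        # grid maximizer of L(theta)
print(H / n)                             # the closed-form MLE; the two agree
```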
Example 2: mean of a Gaussian

◮ Suppose x_i ∼ N(µ, σ²), so θ = (µ, σ²), and
\[
  \ln p_\theta(x_i)
  = \ln \frac{\exp\bigl( -(x_i - \mu)^2 / (2\sigma^2) \bigr)}{\sqrt{2\pi\sigma^2}}
  = -\frac{(x_i - \mu)^2}{2\sigma^2} - \frac{\ln(2\pi\sigma^2)}{2}.
\]
◮ Therefore
\[
  L(\theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 + \text{stuff without } \mu;
\]
applying ∇_µ and setting to zero gives µ = (1/n) ∑_i x_i.
◮ A similar derivation gives σ² = (1/n) ∑_i (x_i − µ)².
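The Gaussian formulas can be sanity-checked in a few lines; the sampled data and the true parameter values below are illustrative assumptions.

```python
# Sketch of the Gaussian MLE formulas (assumed example): the closed-form
# estimates are the sample mean and the (1/n, i.e. biased) sample variance.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=2000)

mu_hat = x.mean()                         # (1/n) sum_i x_i
sigma2_hat = np.mean((x - mu_hat) ** 2)   # (1/n) sum_i (x_i - mu_hat)^2

print(mu_hat, sigma2_hat)                 # roughly 1.5 and 4.0
```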
Discussion: Bayesian vs. frequentist perspectives

Question: (1/n) ∑_{i=1}^n x_i estimates a Gaussian µ parameter; but isn’t it useful more generally?

Bayesian perspective: we choose a model and believe it well-approximates reality; learning its parameters determines underlying phenomena.
◮ Bayesian methods can handle model misspecification; LDA is an example which works well despite seemingly impractical assumptions.

Frequentist perspective: we ask certain questions, and reason about the accuracy of our answers.
◮ For many distributions, (1/n) ∑_{i=1}^n x_i is a valid estimate of the mean, moreover with confidence intervals of size O(1/√n). This approach isn’t free of assumptions: IID is still there...
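A small simulation sketch of the frequentist claim (my own illustration, not from the slides): the sample mean’s fluctuations shrink at roughly the 1/√n rate even for non-Gaussian data. The exponential distribution, sample sizes, and repetition count are arbitrary choices.

```python
# Sketch (assumed illustration): the sample mean concentrates at rate ~1/sqrt(n)
# for many distributions, not only under the Gaussian model.
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000):
    # exponential data: not Gaussian, yet the sample mean still concentrates
    means = [rng.exponential(scale=1.0, size=n).mean() for _ in range(2000)]
    print(n, np.std(means))   # roughly 1/sqrt(n): about 0.1, then about 0.01
```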
Discussion: Bayesian vs. frequentist perspectives (part 2)

◮ This discussion also appears in the form “generative vs. discriminative ML”.
◮ As before: both philosophies can justify/derive the same algorithm; they differ on some details (e.g., choosing k in k-means).

IMO: it’s nice having more tools
◮ (as mentioned before: the VAE is derived from the MLE perspective).
Example 3: Least squares (recap)

If we assume Y | X ∼ N(wᵀX, σ²), then
\[
  L(w) = \sum_{i=1}^n \ln p_w(x_i, y_i)
       = \sum_{i=1}^n \Bigl( \ln p_w(y_i \mid x_i) + \ln p(x_i) \Bigr)
       = \sum_{i=1}^n \frac{-(w^\top x_i - y_i)^2}{2\sigma^2} + \text{terms without } w.
\]
Therefore
\[
  \operatorname*{arg\,max}_{w \in \mathbb{R}^d} L(w)
  = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^\top x_i - y_i)^2.
\]
We can derive/justify the algorithm either way, but some refinements now differ with each perspective (e.g., regularization).
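A short sketch connecting the two views (an assumed example, not the lecture’s code): solving the least-squares problem directly recovers the maximizer of the Gaussian likelihood. The synthetic data and the true weight vector are made up for illustration.

```python
# Sketch (assumed example): under Y | X ~ N(w^T X, sigma^2), maximizing the
# likelihood in w is exactly ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# arg max_w L(w) = arg min_w sum_i (w^T x_i - y_i)^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)   # close to w_true
```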
Example 4: Naive Bayes

◮ Let’s try a simple prediction setup, with (Bayes) optimal classifier
\[
  \operatorname*{arg\,max}_{y \in \mathcal{Y}} \; p(Y = y \mid X = x).
\]
(We haven’t discussed this concept a lot, but it’s widespread in ML.)
◮ One way to proceed is to learn p(Y | X) exactly; that’s a pain.
◮ Let’s assume the coordinates of X = (X_1, ..., X_d) are independent given Y:
\[
  p(Y = y \mid X = x)
  = \frac{p(Y = y, X = x)}{p(X = x)}
  = \frac{p(X = x \mid Y = y)\, p(Y = y)}{p(X = x)}
  = \frac{p(Y = y) \prod_{j=1}^d p(X_j = x_j \mid Y = y)}{p(X = x)},
\]
and
\[
  \operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y \mid X = x)
  = \operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y) \prod_{j=1}^d p(X_j = x_j \mid Y = y).
\]
Example 4: Naive Bayes (part 2)

\[
  \operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y \mid X = x)
  = \operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y) \prod_{j=1}^d p(X_j = x_j \mid Y = y).
\]

Examples where this helps:
◮ Suppose X ∈ {0, 1}^d has an arbitrary distribution; specifying it takes 2^d − 1 numbers, while the factored form above needs only d numbers. Put differently, instead of having to learn a probability model over 2^d possibilities, we now have to learn d + 1 models each with 2 possibilities (binary labels and binary coordinates); see the sketch after this slide.
◮ HW5 will use the standard “Iris dataset”. This data is continuous, so Naive Bayes would approximate univariate distributions per coordinate.
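A minimal Bernoulli naive Bayes sketch for the binary case above (my own illustration; the helper names fit_naive_bayes and predict, the Laplace smoothing, and the toy data are all assumptions not specified in the slides): per class it estimates p(Y = y) and the d per-coordinate probabilities p(X_j = 1 | Y = y), then predicts the arg max in log space.

```python
# Bernoulli naive Bayes sketch (assumed example): estimate class priors and
# per-coordinate conditional probabilities, then predict
# arg max_y p(Y = y) * prod_j p(X_j = x_j | Y = y), computed in log space.
import numpy as np

def fit_naive_bayes(X, y, smoothing=1.0):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    # Laplace-smoothed estimates of p(X_j = 1 | Y = c), one row per class
    probs = np.array([(X[y == c].sum(axis=0) + smoothing) /
                      ((y == c).sum() + 2 * smoothing) for c in classes])
    return classes, priors, probs

def predict(X, classes, priors, probs):
    # log p(y) + sum_j [ x_j ln p_j + (1 - x_j) ln(1 - p_j) ] for each class
    log_joint = (np.log(priors)[None, :]
                 + X @ np.log(probs).T
                 + (1 - X) @ np.log(1 - probs).T)
    return classes[np.argmax(log_joint, axis=1)]

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.5, size=(500, 10))
y = (X[:, 0] | X[:, 1]).astype(int)        # toy labels for illustration
classes, priors, probs = fit_naive_bayes(X, y)
print(np.mean(predict(X, classes, priors, probs) == y))   # training accuracy
```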
Mixtures of Gaussians
k-means has spherical clusters?

Recall that k-means baked in spherical clusters.

[Figure: 2-d scatter plot of clustered data.]

How about we model each cluster with a Gaussian?
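A hedged preview of this idea, assuming scikit-learn is available (the data-generating matrices, offsets, and settings below are all illustrative): a Gaussian mixture lets each cluster carry its own covariance, so elongated, non-spherical clusters are handled more gracefully than by k-means.

```python
# Sketch (assumed example): compare k-means with a full-covariance Gaussian
# mixture on elongated synthetic clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two elongated (non-spherical) clusters, stretched along the x-axis
stretch = np.array([[3.0, 0.0], [0.0, 0.3]])
a = rng.normal(size=(300, 2)) @ stretch
b = rng.normal(size=(300, 2)) @ stretch + np.array([0.0, 2.0])
X = np.vstack([a, b])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=2, covariance_type="full",
                             random_state=0).fit_predict(X)
# the mixture's per-cluster covariances typically recover the elongated
# clusters better than k-means' implicit spherical assumption
```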