FINAL EXAM REVIEW

Will cover:
• All content from the course (Units 1-5)
• Most points concentrated on Units 3-5 (mixture models, HMMs, MCMC)

Logistics
• Take-home exam, maximum 2 hour time limit
• Exam released late afternoon Fri 5/1
• Exam due at noon (11:59am ET) on Fri 5/8
• Can use: any notes, any textbook, any Python code (run locally)
• Cannot use: the internet to search for answers, or other people
• We will provide most needed formulas or give a textbook reference
Takeaway Messages
1) When uncertain about a variable, don't condition on it, integrate it away!
2) Model performance is only as good as your fitting algorithm, initialization, and hyperparameter selection.
3) MCMC is a powerful way to estimate posterior distributions (and resulting expectations) even when the model is not analytically tractable.
Takeaway 1
When uncertain about a parameter, it is better to INTEGRATE IT AWAY than to CONDITION ON it.

OK: use a point estimate
p(x_* | \hat{w})

BETTER: integrate away w via the sum rule
p(x_* | X) = \int_w p(x_*, w | X) dw
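To make Takeaway 1 concrete, here is a minimal Python sketch for a Beta-Bernoulli model, where the sum-rule integral happens to have a closed form; the Beta(a, b) prior values and the counts below are illustrative, not from the slides.

# Contrast the plug-in estimate p(x*=1 | w_hat) with the integrated
# predictive p(x*=1 | X) under a Beta-Bernoulli model.
a, b = 1.0, 1.0          # Beta prior pseudo-counts (illustrative)
n_heads, n_tails = 5, 3  # observed data X (illustrative)

# Point estimate: MAP of w under the Beta(a + n_heads, b + n_tails) posterior
w_map = (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)
p_plugin = w_map                      # p(x*=1 | w_hat) = w_hat

# Integrate w away: p(x*=1 | X) = \int_w p(x*=1 | w) p(w | X) dw
p_predictive = (n_heads + a) / (n_heads + n_tails + a + b)

print(p_plugin, p_predictive)         # 0.625 vs 0.6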
Takeaway 2
• Initialization: remember CP3 (GMMs), as well as CP5 (coming!)
• Algorithm: remember the difference between L-BFGS and EM in CP3
• Hyperparameter: remember the poor performance in CP2

Figure note: the difference between the purple and blue curves is 0.01 on the per-pixel log scale. Normalized over 400 pixels (20x20) per image, this means the purple model says the average validation set image is exp(0.01 * 400) ≈ 54.6 times more likely than under the blue model.
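A one-line check of the arithmetic in the figure note above:

import math
per_pixel_gap = 0.01
n_pixels = 20 * 20
print(math.exp(per_pixel_gap * n_pixels))   # exp(4) ≈ 54.6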
Takeaway 3
• Can use MCMC to do the posterior predictive:

p(x_* | X) = \int_w p(x_*, w | X) dw
           = \int_w p(x_* | w) p(w | X) dw
           \approx \frac{1}{S} \sum_{s=1}^{S} p(x_* | w^s),  with w^s drawn iid from p(w | X)
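A minimal sketch of the Monte Carlo approximation above. It assumes you already have S posterior samples w_samples[s] ~ p(w | X) (e.g. from an MCMC chain after burn-in) and a function likelihood(x_star, w) returning p(x* | w); both names are illustrative.

import numpy as np

def posterior_predictive(x_star, w_samples, likelihood):
    # (1/S) * sum_s p(x* | w^s)
    vals = np.array([likelihood(x_star, w_s) for w_s in w_samples])
    return vals.mean()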
You are capable of so many things now!
Given a proposed probabilistic model, you can do:
• ML estimation of parameters
• MAP estimation of parameters
• EM to estimate parameters
• MCMC estimation of the posterior
• Heldout likelihood computation
• Hyperparameter selection via CV
• Hyperparameter selection via evidence
Unit 1

Optimization Skills
• Finding extrema by zeros of first derivative
• Handling constraints via Lagrange multipliers

Probabilistic Analysis Skills
• Discrete and continuous r.v.
• Sum rule and product rule
• Bayes rule (derived from above)
• Expectations
• Independence

Distributions
• Bernoulli distribution
• Beta distribution
• Gamma function
• Dirichlet distribution

Data analysis
• Beta-Bernoulli for binary data
  • ML estimation of "proba. heads"
  • MAP estimation of "proba. heads"
  • Estimating the posterior
  • Predicting new data
• Dirichlet-Categorical for discrete data
  • ML estimation of unigram probas
  • MAP estimation of unigram probas
  • Estimating the posterior
  • Predicting new data
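A compact recap of the Dirichlet-Categorical estimators listed above (standard results; here N_v is the count of word v, N = \sum_v N_v, and the prior over the unigram probabilities \mu is Dirichlet(\alpha_1, ..., \alpha_V)):

\hat{\mu}_v^{ML} = \frac{N_v}{N}, \qquad
\hat{\mu}_v^{MAP} = \frac{N_v + \alpha_v - 1}{N + \sum_u \alpha_u - V}

p(\mu | X) = Dir(\mu | \alpha_1 + N_1, ..., \alpha_V + N_V), \qquad
p(x_* = v | X) = \frac{N_v + \alpha_v}{N + \sum_u \alpha_u}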
Example Unit 1 Question
1) True or False: Bayes Rule can be proved using the Sum Rule and Product Rule.
2) You're modeling the wins/losses of your favorite sports team with a Beta-Bernoulli model.
   a) You assume each game's binary outcome (win=1 / loss=0) is iid.
   b) You observe in preseason play: 5 wins and 3 losses.
   c) Suggest a prior to use for the win probability.
   d) Identify 2 or more assumptions about this model that may not be valid in the real world (with concrete reasons).
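For review, a minimal Python sketch of part 2, assuming a uniform Beta(1, 1) prior on the win probability; this prior is an illustrative choice, not the required answer.

from scipy import stats

a, b = 1.0, 1.0        # Beta prior on the win probability (illustrative)
wins, losses = 5, 3    # preseason observations

posterior = stats.beta(a + wins, b + losses)   # p(rho | data) = Beta(6, 4)
print(posterior.mean())                        # posterior mean = 0.6
print(posterior.interval(0.95))                # 95% credible interval for rho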
Example Unit 1 Answer
Unit 2

Optimization Skills
• Convexity and second derivatives
• Finding extrema by zeros of first derivative
• First and second order gradient descent

Probabilistic Analysis Skills
• Joints, conditionals, marginals
• Covariance matrices (pos. definite, symmetric)
• Gaussian conjugacy rules

Linear Algebra Skills
• Determinants
• Positive definite
• Invertibility

Distributions
• Univariate Gaussian distribution
• Multivariate Gaussian distribution

Data analysis
• Gaussian-Gaussian for regression
  • ML estimation of weights
  • MAP estimation of weights
  • Estimating the posterior over weights
  • Predicting new data
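One standard form of the Gaussian-Gaussian regression results above, assuming a prior p(w) = N(0, \alpha^{-1} I), noise variance \sigma^2, and design matrix \Phi with rows \phi(x_n)^T (the symbols \alpha and \Phi are notation choices, not from the slide):

p(w | \Phi, t) = N(w | m_N, S_N), \quad S_N^{-1} = \alpha I + \sigma^{-2} \Phi^T \Phi, \quad m_N = \sigma^{-2} S_N \Phi^T t

p(t_* | x_*, \Phi, t) = N(t_* | m_N^T \phi(x_*), \; \sigma^2 + \phi(x_*)^T S_N \phi(x_*))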
Example Unit 2 Question
You are doing regression with the following model:
• Normal prior on the weights
• Normal likelihood: p(t_n | x_n) = NormPDF(w * x_n, \sigma^2)

a. Consider the following two estimators for t_*. What's the difference?
   \hat{t}_* = w_{MAP} x_*
   \tilde{t}_* = E_{t \sim p(t | x_*, X)}[t]
b. Suggest at least 2 ways to pick a value for the hyperparameter \sigma.
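A minimal 1-D sketch of the model in this question, assuming a N(0, \alpha^{-1}) prior on the scalar weight w; the values of alpha, sigma2, and the toy data are illustrative.

import numpy as np

alpha, sigma2 = 1.0, 0.25              # prior precision, noise variance
x = np.array([0.5, 1.0, 1.5, 2.0])     # toy inputs
t = np.array([0.4, 1.1, 1.4, 2.1])     # toy targets

# Posterior over w is Gaussian: precision = alpha + (sum_n x_n^2) / sigma2
post_prec = alpha + x @ x / sigma2
post_mean = (x @ t / sigma2) / post_prec   # posterior mean = mode = w_MAP here

x_star = 3.0
t_hat = post_mean * x_star                 # plug-in prediction w_MAP * x_*
pred_var = sigma2 + x_star**2 / post_prec  # predictive variance under p(t | x_*, X)
print(t_hat, np.sqrt(pred_var), np.sqrt(sigma2))

The two point predictions coincide in this Gaussian case because the posterior mean equals its mode, but the predictive variance is larger than the plug-in \sigma^2.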
Example Unit 2 Answer
Unit 3: K-Means and Mixture Models

Distributions
• Mixtures of Gaussians (GMMs)
• Mixtures in general
  • Can use any likelihood (not just Gauss)
  • Complete likelihood: p(x, z | \theta)
  • Incomplete likelihood: p(x | \theta)
  • Expectations of complete likelihood

Optimization Skills
• K-means objective and algorithm
• Coordinate ascent / descent algorithms
• Optimization objectives with hidden vars
• Expectation-Maximization algorithm
  • Lower bound objective
  • What E-step does
  • What M-step does

Numerical Methods
• logsumexp
  • How to derive it
  • Why it is important

Data analysis
• K-means or GMM for a dataset
• How to pick K hyperparameter
• Why multiple inits matter
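A minimal sketch of logsumexp and why it matters: subtracting the max before exponentiating avoids underflow when all log values are very negative (e.g. per-cluster log joint probabilities in a GMM E-step). The function name is an illustrative choice.

import numpy as np

def logsumexp(log_vals):
    m = np.max(log_vals)
    return m + np.log(np.sum(np.exp(log_vals - m)))

log_vals = np.array([-1000.0, -1001.0, -1002.0])
print(logsumexp(log_vals))                # ≈ -999.59, finite
print(np.log(np.sum(np.exp(log_vals))))   # naive version underflows to -inf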
Example Unit 3 Question
Consider two possible models for clustering 1-dim. data:
• K-Means
• Gaussian mixtures

Name ways that the GMM is more flexible as a model:
• How is the GMM's treatment of assignments more flexible?
• How is the GMM's parameterization of a "cluster" more flexible?

Under what limit does the GMM likelihood reduce to the K-means objective?
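For the limiting case in the last part, one standard way to see it (assuming equal mixture weights and a shared spherical covariance \sigma^2 I; a sketch of the argument, not the only acceptable phrasing): as \sigma^2 \to 0 the E-step responsibilities harden,

\gamma_{nk} = \frac{\pi_k N(x_n | \mu_k, \sigma^2 I)}{\sum_j \pi_j N(x_n | \mu_j, \sigma^2 I)} \to \begin{cases} 1 & k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}

and the expected complete log-likelihood is, up to additive constants, -\frac{1}{2\sigma^2} \sum_n \sum_k \gamma_{nk} \|x_n - \mu_k\|^2, so maximizing it is equivalent to minimizing the K-means cost \sum_n \min_k \|x_n - \mu_k\|^2.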
Example Unit 3 Answer
Unit 4: Markov models and HMMs

Probabilistic Analysis Skills
• Markov conditional independence
• Stationary distributions
• Deriving independence properties
  • Like HW4 problem 1

Algorithm Skills
• Forward algorithm
• Backward algorithm
• Viterbi algorithm
(all examples of dynamic programming)

Linear Algebra Skills
• Eigenvectors/values for stationary distributions

Optimization Skills
• EM for HMMs
  • E-step
  • M-step

Distributions
• Discrete Markov models
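For reference, the forward recursion above in one common notation (\pi is the initial state distribution, A the transition matrix with A_{jk} = p(z_t = k | z_{t-1} = j), and p(x_t | z_t = k) the emission density; the notation is a choice, not from the slide):

\alpha_1(k) = \pi_k \, p(x_1 | z_1 = k), \qquad
\alpha_t(k) = p(x_t | z_t = k) \sum_j \alpha_{t-1}(j) A_{jk}

where \alpha_t(k) = p(x_{1:t}, z_t = k), so the incomplete likelihood is p(x_{1:T}) = \sum_k \alpha_T(k).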
Example Unit 4 Question
• Describe how the Viterbi algorithm is an instance of dynamic programming.
Identify all the key parts:
• What is the fundamental problem being solved?
• How is the final solution built from solutions to smaller problems?
• How can all the solutions be described as a big "table" that should be filled in?
• What is the "base case" update (the simplest subproblem)?
• What is the recursive update?
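A minimal log-space Viterbi sketch matching the parts of this question; the input names are illustrative: log_pi[k] is the initial log probability, log_A[j, k] the transition log probability, and log_lik[t, k] = log p(x_t | z_t = k).

import numpy as np

def viterbi(log_pi, log_A, log_lik):
    T, K = log_lik.shape
    # The "table" of subproblem solutions: best[t, k] is the best log prob
    # of any state path ending in state k at time t; back stores backpointers.
    best = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)

    best[0] = log_pi + log_lik[0]                  # base case: t = 0
    for t in range(1, T):                          # recursive update
        scores = best[t - 1][:, None] + log_A      # scores[j, k]
        back[t] = np.argmax(scores, axis=0)
        best[t] = np.max(scores, axis=0) + log_lik[t]

    # Final solution: trace the backpointers from the best last state.
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(best[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path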
Example Unit 4 Answer