CMSC 691 Probabilistic and Statistical Models of Learning

Probabilities, Common Distributions, and Maximum Likelihood Estimation

Outline: Basics of Learning; Probability; Maximum Likelihood Estimation


  1. Probabilistic Independence
Q: Are X and Y independent?
Independence: when events can occur and not impact the probability of other events
Formally: p(x, y) = p(x) * p(y)
Generalizable to > 2 random variables
Joint table p(x, y):
             Y=0    Y=1
X="cat"      .04    .32
X="dog"      .20    .04
X="bird"     .10    .10
X="human"    .10    .10

  2. Probabilistic Independence
Q: Are X and Y independent?
Independence: when events can occur and not impact the probability of other events
Formally: p(x, y) = p(x) * p(y)
Generalizable to > 2 random variables
Joint table p(x, y):
             Y=0    Y=1
X="cat"      .04    .32
X="dog"      .20    .04
X="bird"     .10    .10
X="human"    .10    .10
A: No (find the marginal probabilities p(x) and p(y); their product does not match the joint)
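A minimal Python sketch of this check (my own illustration, not from the slides): compute both marginals from the joint table above and test whether p(x, y) = p(x) p(y) holds in every cell.

```python
import numpy as np

# Joint table p(x, y) from the slide: rows are X = cat, dog, bird, human; columns are Y = 0, 1.
joint = np.array([
    [0.04, 0.32],   # cat
    [0.20, 0.04],   # dog
    [0.10, 0.10],   # bird
    [0.10, 0.10],   # human
])

p_x = joint.sum(axis=1)         # marginal p(x): sum out Y
p_y = joint.sum(axis=0)         # marginal p(y): sum out X
product = np.outer(p_x, p_y)    # p(x) * p(y) for every cell

print(p_x, p_y)
print(np.allclose(joint, product))   # False: X and Y are not independent, as the slide concludes
```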

  3. Probability Prerequisites
– Basic probability axioms and definitions
– Joint probability
– Marginal probability
– Probabilistic Independence
– Definition of conditional probability
– Bayes rule
– Probability chain rule
– Common distributions
– Expected Value (of a function) of a Random Variable

  4. Conditional Probability
p(X | Y) = p(X, Y) / p(Y)
Conditional probabilities are probabilities.

  5. Conditional Probability
p(X | Y) = p(X, Y) / p(Y)
p(Y) = marginal probability of Y

  6. Conditional Probability
p(X | Y) = p(X, Y) / p(Y)
p(Y) = ∫ p(X, Y) dX

  7. Revisiting Marginal Probability: The Discrete Case
[Diagram: the event y decomposed into the events (x₁ & y), (x₂ & y), (x₃ & y), (x₄ & y)]
p(y) = Σ_x p(x, y) = Σ_x p(x) p(y | x)
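A small numeric sketch of the two equivalent forms above, using made-up numbers (four values of x, two values of y); the marginal computed from the joint matches the one computed from p(x) and p(y | x).

```python
import numpy as np

p_x = np.array([0.1, 0.2, 0.3, 0.4])            # hypothetical p(x)
p_y_given_x = np.array([[0.5, 0.5],             # hypothetical p(y | x); each row sums to 1
                        [0.9, 0.1],
                        [0.2, 0.8],
                        [0.6, 0.4]])

p_xy = p_x[:, None] * p_y_given_x               # joint: p(x, y) = p(x) p(y | x)

p_y_from_joint = p_xy.sum(axis=0)               # p(y) = sum_x p(x, y)
p_y_from_chain = p_x @ p_y_given_x              # p(y) = sum_x p(x) p(y | x)

print(np.allclose(p_y_from_joint, p_y_from_chain))   # True
```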

  8. Probability Prerequisites (recurring agenda slide; topics as listed on slide 3)

  9. Deriving Bayes Rule Start with conditional p(X | Y)

  10. Deriving Bayes Rule
p(X | Y) = p(X, Y) / p(Y)
Solve for p(X, Y).

  11. Deriving Bayes Rule
p(X | Y) = p(X, Y) / p(Y)   (solve for p(X, Y))
p(X, Y) = p(X | Y) p(Y)
Since p(X, Y) = p(Y, X):
p(X | Y) = p(Y | X) * p(X) / p(Y)

  12. Bayes Rule
p(X | Y) = p(Y | X) * p(X) / p(Y)
– p(X | Y): posterior probability
– p(Y | X): likelihood
– p(X): prior probability
– p(Y): marginal likelihood (probability)

  13. Probability Prerequisites (recurring agenda slide; topics as listed on slide 3)

  14. Probability Chain Rule
p(x_1, x_2, …, x_T) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_T | x_1, …, x_{T−1})
                    = ∏_{j=1}^{T} p(x_j | x_1, …, x_{j−1})
(an extension of Bayes rule)
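A hedged sketch (my own, not from the slides) that checks the chain-rule factorization numerically on a random joint distribution over three binary variables.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))          # arbitrary joint p(x1, x2, x3) over three binary variables
joint /= joint.sum()

p_x1 = joint.sum(axis=(1, 2))                               # p(x1)
p_x2_given_x1 = joint.sum(axis=2) / p_x1[:, None]           # p(x2 | x1)
p_x3_given_x12 = joint / joint.sum(axis=2, keepdims=True)   # p(x3 | x1, x2)

# Chain rule: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2)
reconstructed = p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_x3_given_x12
print(np.allclose(reconstructed, joint))   # True
```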

  15. Probability Prerequisites (recurring agenda slide; topics as listed on slide 3)

  16. Distribution Notation
If X is a R.V. and G is a distribution:
• X ∼ G means X is distributed according to ("sampled from") G

  17. Distribution Notation
If X is a R.V. and G is a distribution:
• X ∼ G means X is distributed according to ("sampled from") G
• G often has parameters θ = (θ_1, θ_2, …, θ_N) that govern its "shape"
• Formally written as X ∼ G(θ)

  18. Distribution Notation
If X is a R.V. and G is a distribution:
• X ∼ G means X is distributed according to ("sampled from") G
• G often has parameters θ = (θ_1, θ_2, …, θ_N) that govern its "shape"
• Formally written as X ∼ G(θ)
i.i.d.: If X_1, X_2, …, X_N are all independently sampled from G(θ), they are independently and identically distributed
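An illustrative sketch of this notation (my own): N i.i.d. draws X_1, …, X_N from G(θ), taking G = Bernoulli and θ = 0.3. Because the draws share one distribution, their empirical mean approaches θ.

```python
import numpy as np

rng = np.random.default_rng(42)

theta = 0.3                # parameter of G = Bernoulli(theta)
N = 10_000

# X_1, ..., X_N sampled independently from the same distribution G(theta): i.i.d.
samples = rng.binomial(n=1, p=theta, size=N)

print(samples.mean())      # close to theta
```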

  19. Common Distributions
(Covered here: Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma)
Bernoulli: a single draw
• Binary R.V.: 0 (failure) or 1 (success)
• X ∼ Bernoulli(θ)
• p(X = 1) = θ, p(X = 0) = 1 − θ
• Generally, p(X = k) = θ^k (1 − θ)^(1−k)

  20. Common Distributions
Bernoulli: a single draw
• Binary R.V.: 0 (failure) or 1 (success)
• X ∼ Bernoulli(θ)
• p(X = 1) = θ, p(X = 0) = 1 − θ
• Generally, p(X = k) = θ^k (1 − θ)^(1−k)
Binomial: sum of N i.i.d. Bernoulli draws
• Values X can take: 0, 1, …, N
• Represents the number of successes
• X ∼ Binomial(N, θ)
• p(X = k) = (N choose k) θ^k (1 − θ)^(N−k)
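A small sketch of the two PMFs above (the helper names are my own); the final line checks that the Binomial PMF sums to 1 over its support 0, …, N.

```python
from math import comb

def bernoulli_pmf(k, theta):
    # p(X = k) = theta^k * (1 - theta)^(1 - k), for k in {0, 1}
    return theta ** k * (1 - theta) ** (1 - k)

def binomial_pmf(k, N, theta):
    # p(X = k) = C(N, k) * theta^k * (1 - theta)^(N - k), for k in {0, ..., N}
    return comb(N, k) * theta ** k * (1 - theta) ** (N - k)

print(bernoulli_pmf(1, 0.3))                              # 0.3
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))   # 1.0
```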

  21. Common Distributions
Categorical: a single draw
• Finite R.V. taking one of K values: 1, 2, …, K
• X ∼ Cat(θ), θ ∈ ℝ^K
• p(X = 1) = θ_1, p(X = 2) = θ_2, …, p(X = K) = θ_K
• Generally, p(X = k) = ∏_j θ_j^(1[k = j]), where the indicator 1[c] = 1 if c is true and 0 if c is false
Multinomial: sum of N i.i.d. Categorical draws
• Vector of size K representing how often value k was drawn
• X ∼ Multinomial(N, θ), θ ∈ ℝ^K
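A hedged sampling sketch (my own): one Categorical draw, and the count vector from N Categorical draws, which is exactly a Multinomial sample.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.3])               # K = 3 category probabilities

x = rng.choice(len(theta), p=theta)             # Categorical: a single category index
counts = rng.multinomial(n=100, pvals=theta)    # Multinomial: counts over K categories from N draws

print(x, counts, counts.sum())                  # counts always sum to N = 100
```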

  22. Common Distributions
Poisson
• Discrete R.V. taking any integer ≥ 0
• X ∼ Poisson(λ), λ > 0 is the "rate"
• PMF: p(X = k) = λ^k exp(−λ) / k!
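A minimal PMF sketch (the helper name is mine); the second print checks that the probabilities sum to about 1 over the support 0, 1, 2, ….

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # p(X = k) = lam^k * exp(-lam) / k!
    return lam ** k * exp(-lam) / factorial(k)

print(poisson_pmf(3, 2.0))                            # ~0.180
print(sum(poisson_pmf(k, 2.0) for k in range(50)))    # ~1.0
```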

  23. Common Distributions
Normal
• Real R.V. taking any real number
• X ∼ Normal(μ, σ), μ is the mean, σ is the standard deviation
• p(X = x) = exp(−(x − μ)² / (2σ²)) / (σ √(2π))
[Figure: normal densities p(X = x) for several (μ, σ); source: https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1920px-Normal_Distribution_PDF.svg.png]
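A direct transcription of the density above into Python (the helper name is mine).

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    # p(X = x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.3989, the peak of the standard normal
```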

  24. Common Distributions
Multivariate Normal
• Real vector R.V. X ∈ ℝ^K
• X ∼ Normal(μ, Σ), μ ∈ ℝ^K is the mean, Σ ∈ ℝ^(K×K) is the covariance
• p(X = x) ∝ exp(−½ (x − μ)^T Σ^(−1) (x − μ))
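A hedged sketch of this density including the normalizing constant that the proportionality above omits (the helper name is mine).

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # p(X = x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)) / sqrt((2 pi)^K det(Sigma))
    K = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)             # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2 * np.pi) ** K * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))
```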

  25. Common Distributions
Gamma
• Real R.V. taking any positive real number
• X ∼ Gamma(k, θ), k > 0 is the "shape" (how skewed it is), θ > 0 is the "scale" (how spread out the distribution is)
• p(X = x) = x^(k−1) exp(−x/θ) / (θ^k Γ(k))
[Figure: gamma densities for several (k, θ); source: https://en.wikipedia.org/wiki/Gamma_distribution#/media/File:Gamma_distribution_pdf.svg]
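A transcription of the density above (the helper name is mine), using math.gamma for Γ(k).

```python
from math import exp, gamma

def gamma_pdf(x, k, theta):
    # p(X = x) = x^(k-1) * exp(-x / theta) / (theta^k * Gamma(k)), for x > 0
    return x ** (k - 1) * exp(-x / theta) / (theta ** k * gamma(k))

print(gamma_pdf(2.0, k=2.0, theta=1.0))   # 2 * exp(-2) ~ 0.271
```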

  26. Probability Prerequisites (recurring agenda slide; topics as listed on slide 3)

  27. Expected Value of a Random Variable
random variable X ∼ p(⋅)

  28. Expected Value of a Random Variable
random variable X ∼ p(⋅)
E[X] = Σ_x x p(x)
expected value (the distribution p is implicit)

  29. Expected Value: Example
Uniform distribution over the number of cats I have: 1, 2, 3, 4, 5, 6, each with probability 1/6
E[X] = Σ_x x p(x) = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 3.5

  30. Expected Value: Example
Uniform distribution over the number of cats I have: 1, 2, 3, 4, 5, 6, each with probability 1/6
E[X] = Σ_x x p(x) = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 3.5
Q: What common distribution is this?

  31. Expected Value: Example
Uniform distribution over the number of cats I have: 1, 2, 3, 4, 5, 6, each with probability 1/6
E[X] = Σ_x x p(x) = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 3.5
Q: What common distribution is this?
A: Categorical

  32. Expected Value: Example 2
Non-uniform distribution over the number of cats a normal cat person has: 1, 2, 3, 4, 5, 6 with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10
E[X] = Σ_x x p(x) = (1/2)(1) + (1/10)(2) + (1/10)(3) + (1/10)(4) + (1/10)(5) + (1/10)(6) = 2.5
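A quick check of both cat examples (my own sketch): the uniform distribution gives 3.5 and the non-uniform one gives 2.5.

```python
import numpy as np

values = np.arange(1, 7)                                  # numbers of cats: 1..6

uniform = np.full(6, 1 / 6)                               # slide 29
nonuniform = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])     # slide 32

# E[X] = sum_x x * p(x)
print(values @ uniform)      # 3.5
print(values @ nonuniform)   # 2.5
```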

  33. Expected Value of a Function of a Random Variable
X ∼ p(⋅)
E[X] = Σ_x x p(x)
E[f(X)] = ???

  34. Expected Value of a Function of a Random Variable
X ∼ p(⋅)
E[X] = Σ_x x p(x)
E[f(X)] = Σ_x f(x) p(x)

  35. Expected Value of Function: Example
Non-uniform distribution over the number of cats I start with: 1, 2, 3, 4, 5, 6
What if each cat magically becomes two? f(k) = 2^k
E[f(X)] = Σ_x f(x) p(x)

  36. Expected Value of Function: Example
Non-uniform distribution over the number of cats I start with: 1, 2, 3, 4, 5, 6 with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10
What if each cat magically becomes two? f(k) = 2^k
E[f(X)] = Σ_x f(x) p(x) = Σ_x 2^x p(x)
        = (1/2)(2^1) + (1/10)(2^2) + (1/10)(2^3) + (1/10)(2^4) + (1/10)(2^5) + (1/10)(2^6) = 13.4
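The same check for the "each cat becomes two" example (my own sketch). Note, as an aside, that E[f(X)] = 13.4 is not f(E[X]) = 2^2.5 ≈ 5.66: the expectation of a function is generally not the function of the expectation.

```python
import numpy as np

values = np.arange(1, 7)
p = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

f = lambda k: 2.0 ** k          # f(k) = 2^k: each cat magically becomes two

# E[f(X)] = sum_x f(x) * p(x)
print(f(values) @ p)            # 13.4
```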

  37. Probability Prerequisites (recurring agenda slide; topics as listed on slide 3)

  38. Example Problem: ITILA Ex. 2.3
➢ Jo has a test for a nasty disease. We denote Jo's state of health by the variable a (a = 1: Jo has the disease; a = 0 otherwise) and the test result by b.
➢ The result of the test is either 'positive' (b = 1) or 'negative' (b = 0).
➢ The test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained.
➢ The final piece of background information is that 1% of people of Jo's age and background have the disease.
Q: If Jo's test is positive, what is the probability Jo has the disease?

  39. Example Problem: ITILA Ex. 2.3
Q: If Jo's test is positive, what is the probability Jo has the disease? That is, find p(a = 1 | b = 1).
(Problem statement as on slide 38.)

  40. Example Problem: ITILA Ex. 2.3
Q: If Jo's test is positive, what is the probability Jo has the disease?
p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1)
Marginal of a: p(a = 1) = 0.01
(Problem statement as on slide 38.)

  41. Example Problem: ITILA Ex. 2.3
Q: If Jo's test is positive, what is the probability Jo has the disease?
p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1)
Conditionals p(b | a): p(b = 1 | a = 1) = 0.95, p(b = 0 | a = 0) = 0.95
Marginal of a: p(a = 1) = 0.01
(Problem statement as on slide 38.)

  42. Example Problem: ITILA Ex. 2.3
Q: If Jo's test is positive, what is the probability Jo has the disease?
p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1)
                 = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
                 ≈ 0.16
Conditionals p(b | a): p(b = 1 | a = 1) = 0.95, p(b = 0 | a = 0) = 0.95
Marginal of a: p(a = 1) = 0.01
(Problem statement as on slide 38.)
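The full calculation from this slide as a short script (my own transcription of the numbers above).

```python
p_a1 = 0.01               # prior: 1% of people like Jo have the disease
p_b1_given_a1 = 0.95      # positive test given disease
p_b0_given_a0 = 0.95      # negative test given no disease

# Marginal likelihood p(b = 1), summing over both values of a.
p_b1 = p_b1_given_a1 * p_a1 + (1 - p_b0_given_a0) * (1 - p_a1)

# Bayes rule: p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1)
posterior = p_b1_given_a1 * p_a1 / p_b1
print(posterior)          # ~0.161
```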

  43. Probability Topics (High-Level) Basics of Probability: Prereqs Philosophy of Probability, and Terminology Useful Quantities and Inequalities

  44. A Bit of Philosophy and Terminology
What is a probability?
Core terminology
– Support/domain
– Partition function
Some principles
– Generative story
– Forward probability
– Inverse probability

  45. Kinds of Statistics
Descriptive: e.g., "The average grade on this assignment is 83."
Confirmatory
Predictive

  46. Interpretations of Probability
Past performance: 58% of the past 100 flips were heads
Hypothetical performance: if I flipped the coin in many parallel universes…
Subjective strength of belief: would pay up to 58 cents for the chance to win $1
Output of some computable formula? p(heads) vs. q(heads)

  47. Camps of Probability
Frequentists
– Past performance: 58% of the past 100 flips were heads
– Hypothetical performance: if I flipped the coin in many parallel universes…
Bayesians
– Subjective strength of belief: would pay up to 58 cents for the chance to win $1
Output of some computable formula? p(heads) vs. q(heads)
(my grouping, not too far off though)

  48. Camps of Probability
Frequentists
– Past performance: 58% of the past 100 flips were heads
– Hypothetical performance: if I flipped the coin in many parallel universes…
Bayesians
– Subjective strength of belief: would pay up to 58 cents for the chance to win $1
ML People
– Output of some computable formula? p(heads) vs. q(heads)
(my grouping, not too far off though)

  49. Camps of Probability
Frequentists: past/hypothetical performance; Bayesians: subjective strength of belief; ML People: output of some computable formula (my grouping, not too far off though)
"You cannot do inference without making assumptions." – ITILA, 2.2, pg 26

  50. General ML Consideration: Inductive Bias What do we know before we see the data, and how does that influence our modeling decisions? Courtesy Hamed Pirsiavash

  51. General ML Consideration: Inductive Bias
What do we know before we see the data, and how does that influence our modeling decisions?
[Figure: four images labeled A, B, C, D]
Partition these into two groups…
Courtesy Hamed Pirsiavash

  52. General ML Consideration: Inductive Bias
What do we know before we see the data, and how does that influence our modeling decisions?
[Figure: four images labeled A, B, C, D]
Partition these into two groups
Who selected red vs. blue?
Courtesy Hamed Pirsiavash

  53. General ML Consideration: Inductive Bias
What do we know before we see the data, and how does that influence our modeling decisions?
[Figure: four images labeled A, B, C, D]
Partition these into two groups
Who selected red vs. blue?
Who selected [one grouping] vs. [the other]? (the candidate groupings are shown only as images)
Courtesy Hamed Pirsiavash

  54. General ML Consideration: Inductive Bias
What do we know before we see the data, and how does that influence our modeling decisions?
[Figure: four images labeled A, B, C, D]
Partition these into two groups
Who selected red vs. blue?
Who selected [one grouping] vs. [the other]? (the candidate groupings are shown only as images)
Tip: Remember how your own biases/interpretation are influencing your approach
Courtesy Hamed Pirsiavash

  55. Some Terminology
Support
– The valid values a R.V. can take on
– The values over which a pmf/pdf is defined

  56. Some Terminology
Support
– The valid values a R.V. can take on
– The values over which a pmf/pdf is defined
Partition function/normalization function
– The function (or constant) that ensures a p{m,d}f sums to 1

  57. Some Terminology
Q: What is the support for a Poisson R.V.?
Support
– The valid values a R.V. can take on
– The values over which a pmf/pdf is defined
Partition function/normalization function
– The function (or constant) that ensures a p{m,d}f sums to 1

  58. Some Terminology
Q: What is the support for a Poisson R.V.?
Poisson
• X ∼ Poisson(λ), λ > 0 is the "rate"
• PMF: p(X = k) = λ^k exp(−λ) / k!
Support
– The valid values a R.V. can take on
– The values over which a pmf/pdf is defined
Partition function/normalization function
– The function (or constant) that ensures a p{m,d}f sums to 1

  59. Some Terminology
Q: What is the partition function/constant?
Poisson
• X ∼ Poisson(λ), λ > 0 is the "rate"
• PMF: p(X = k) = λ^k exp(−λ) / k!
Support
– The valid values a R.V. can take on
– The values over which a pmf/pdf is defined
Partition function/normalization function
– The function (or constant) that ensures a p{m,d}f sums to 1
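The slide leaves the answer implicit. For the Poisson PMF λ^k exp(−λ) / k!, the exp(−λ) factor plays the role of the normalizer, since Σ_k λ^k / k! = exp(λ). A quick numeric check (my own sketch, not from the slides):

```python
from math import exp, factorial

lam = 2.0

unnormalized = sum(lam ** k / factorial(k) for k in range(100))
print(unnormalized, exp(lam))   # both ~7.389: dividing by exp(lam), i.e. multiplying by exp(-lam), normalizes

print(sum(lam ** k * exp(-lam) / factorial(k) for k in range(100)))   # ~1.0
```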

  60. Some More Terminology
(Generative) Probabilistic Modeling
Generative Story
Forward probability (ITILA)
Inverse probability (ITILA)

  61. What is (Generative) Probabilistic Modeling? So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)

  62. What is (Generative) Probabilistic Modeling? So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x) What if we want to model both x and y together? p(x, y)

  63. What is (Generative) Probabilistic Modeling?
So far, we've (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)
What if we want to model both x and y together? p(x, y)
Q/678 Recap: Where have we used p(x, y)?

  64. What is (Generative) Probabilistic Modeling?
So far, we've (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)
What if we want to model both x and y together? p(x, y)
Q/678 Recap: Where have we used p(x, y)?
A: Linear Discriminant Analysis

  65. What is (Generative) Probabilistic Modeling?
So far, we've (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)
What if we want to model both x and y together? p(x, y)
Q: Where have we used p(x, y)?
A: Linear Discriminant Analysis
Or what if we only have data but no labels? p(x)
