
Probability, Decision Theory, and Loss Functions. CMSC 678, UMBC (PowerPoint presentation)

Probability, Decision Theory, and Loss Functions. CMSC 678, UMBC. Some slides adapted from Hamed Pirsiavash. Logistics recap: Piazza (ask & answer questions): https://piazza.com/umbc/spring2019/cmsc678. Course site:


  1. Probabilistic Independence. Independence: when events can occur and not impact the probability of other events. Formally: p(x, y) = p(x) * p(y). Generalizable to > 2 random variables. Q: Are the results of flipping the same coin twice in succession independent? A: Yes (assuming no weird effects).

  2. Probabilistic Independence. Independence: when events can occur and not impact the probability of other events. Formally: p(x, y) = p(x) * p(y). Generalizable to > 2 random variables. Q: Are A and B independent? [Venn diagram: regions A and B inside "everything"]

  3. Probabilistic Independence. Independence: when events can occur and not impact the probability of other events. Formally: p(x, y) = p(x) * p(y). Generalizable to > 2 random variables. Q: Are A and B independent? [Venn diagram: regions A and B inside "everything"] A: No (work it out from p(A, B) and the axioms).

  4. Probabilistic Independence. Independence: when events can occur and not impact the probability of other events. Formally: p(x, y) = p(x) * p(y). Generalizable to > 2 random variables. Q: Are X and Y independent?
     p(x, y)      Y=0    Y=1
     X="cat"      .04    .32
     X="dog"      .20    .04
     X="bird"     .10    .10
     X="human"    .10    .10

  5. Probabilistic Independence. Independence: when events can occur and not impact the probability of other events. Formally: p(x, y) = p(x) * p(y). Generalizable to > 2 random variables. Q: Are X and Y independent?
     p(x, y)      Y=0    Y=1
     X="cat"      .04    .32
     X="dog"      .20    .04
     X="bird"     .10    .10
     X="human"    .10    .10
     A: No (find the marginal probabilities p(x) and p(y)).
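The answer on this slide can be checked mechanically: compute both marginals from the joint table and test whether p(x, y) = p(x) * p(y) holds in every cell. A minimal Python sketch (the variable names are my own, not from the course materials):

```python
# Joint distribution p(x, y) from the slide's table.
joint = {
    ("cat", 0): 0.04, ("cat", 1): 0.32,
    ("dog", 0): 0.20, ("dog", 1): 0.04,
    ("bird", 0): 0.10, ("bird", 1): 0.10,
    ("human", 0): 0.10, ("human", 1): 0.10,
}

# Marginals: p(x) = sum over y of p(x, y); p(y) = sum over x of p(x, y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# X and Y are independent iff p(x, y) = p(x) * p(y) in every cell.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
                  for (x, y) in joint)
# independent is False: e.g. p(cat, Y=1) = .32, but p(cat) * p(Y=1) = .36 * .56
```

One failing cell is enough to rule out independence, which is why the "cat" row alone settles the question.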

  6. Probability Prerequisites: basic probability axioms and definitions; joint probability; probabilistic independence; marginal probability; definition of conditional probability; Bayes rule; probability chain rule; common distributions; expected value (of a function) of a random variable.

  7. Marginal(ized) Probability: The Discrete Case. Consider the mutually exclusive ways that different values of x could occur with y: (x1 & y), (x2 & y), (x3 & y), (x4 & y). Q: How do we write this in terms of joint probabilities?

  8. Marginal(ized) Probability: The Discrete Case. Consider the mutually exclusive ways that different values of x could occur with y: (x1 & y), (x2 & y), (x3 & y), (x4 & y). p(y) = Σ_x p(x, y)

  9. Probability Prerequisites: basic probability axioms and definitions; joint probability; probabilistic independence; marginal probability; definition of conditional probability; Bayes rule; probability chain rule; common distributions; expected value (of a function) of a random variable.

  10. Conditional Probability. p(X | Y) = p(X, Y) / p(Y). Conditional probabilities are probabilities.

  11. Conditional Probability. p(X | Y) = p(X, Y) / p(Y), where p(Y) = the marginal probability of Y.

  12. Conditional Probability. p(X | Y) = p(X, Y) / p(Y), where p(Y) = ∫ p(X, Y) dX.
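To make "conditional probabilities are probabilities" concrete, here is a small Python sketch (my own illustration, reusing the cat/dog joint table from the independence slides) that conditions on Y = 1 and checks the result sums to 1:

```python
# Joint table p(x, y) from the earlier independence slides.
joint = {
    ("cat", 0): 0.04, ("cat", 1): 0.32,
    ("dog", 0): 0.20, ("dog", 1): 0.04,
    ("bird", 0): 0.10, ("bird", 1): 0.10,
    ("human", 0): 0.10, ("human", 1): 0.10,
}
xs = ("cat", "dog", "bird", "human")

# p(Y = 1): marginalize the joint over x (the discrete analogue of the integral).
p_y1 = sum(joint[(x, 1)] for x in xs)

# p(X = x | Y = 1) = p(x, Y = 1) / p(Y = 1)
cond = {x: joint[(x, 1)] / p_y1 for x in xs}

# Conditional probabilities are probabilities: nonnegative and summing to 1 over x.
total = sum(cond.values())
```

Dividing every cell of the Y = 1 column by p(Y = 1) is exactly what renormalizes the slice of the joint into a proper distribution.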

  13. Conditional Probabilities: Changing the Right. [number line from 0 to 1, with p(A) marked] What happens as we add conjuncts to the right?

  14. Conditional Probabilities: Changing the Right. [number line from 0 to 1, with p(A) and p(A | B) marked] What happens as we add conjuncts to the right?

  15. Conditional Probabilities: Changing the Right. [number line from 0 to 1, with p(A) and p(A | B) marked] What happens as we add conjuncts to the right?

  16. Conditional Probabilities: Changing the Right. [number line from 0 to 1, with p(A) and p(A | B) marked] What happens as we add conjuncts to the right?

  17. Conditional Probabilities: Bias vs. Variance. Lower bias: more specific to what we care about. Higher variance: for fixed observations, estimates become less reliable.

  18. Revisiting Marginal Probability: The Discrete Case. Consider the mutually exclusive ways (x1 & y), (x2 & y), (x3 & y), (x4 & y). p(y) = Σ_x p(x, y) = Σ_x p(x) p(y | x)
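The identity p(y) = Σ_x p(x) p(y | x) is how marginals are usually computed in practice: from a prior and a conditional rather than from a full joint table. A tiny sketch with hypothetical numbers of my own (pick one of two coins, then flip it; nothing here is from the slides):

```python
# Hypothetical mixture: choose a coin x at random, then flip it to get y.
p_x = {"fair": 0.5, "biased": 0.5}              # p(x): which coin we picked
p_heads_given_x = {"fair": 0.5, "biased": 0.9}  # p(y = heads | x)

# Marginal probability of heads: sum over the mutually exclusive coins.
p_heads = sum(p_x[x] * p_heads_given_x[x] for x in p_x)  # 0.5*0.5 + 0.5*0.9 = 0.7
```

Each term p(x) p(y | x) is the probability of one mutually exclusive route to y, mirroring the (x_i & y) regions in the slide's picture.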

  19. Probability Prerequisites: basic probability axioms and definitions; joint probability; probabilistic independence; marginal probability; definition of conditional probability; Bayes rule; probability chain rule; common distributions; expected value (of a function) of a random variable.

  20. Deriving Bayes Rule. Start with the conditional p(X | Y).

  21. Deriving Bayes Rule. p(X | Y) = p(X, Y) / p(Y). Solve for p(X, Y).

  22. Deriving Bayes Rule. p(X | Y) = p(X, Y) / p(Y). Solve for p(X, Y): p(X, Y) = p(X | Y) p(Y). Since p(x, y) = p(y, x), also p(X, Y) = p(Y | X) p(X), so p(X | Y) = p(Y | X) * p(X) / p(Y).

  23. Bayes Rule. p(X | Y) = p(Y | X) * p(X) / p(Y). Posterior probability: p(X | Y). Likelihood: p(Y | X). Prior probability: p(X). Marginal likelihood (probability): p(Y).
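A short sketch of Bayes rule with each named piece spelled out, using hypothetical two-coin numbers of my own (prior over which coin was picked, likelihood of heads given the coin):

```python
p_x = {"fair": 0.5, "biased": 0.5}              # prior p(x)
p_heads_given_x = {"fair": 0.5, "biased": 0.9}  # likelihood p(y = heads | x)

# Marginal likelihood: p(y = heads) = sum over x of p(y = heads | x) * p(x).
p_heads = sum(p_heads_given_x[x] * p_x[x] for x in p_x)

# Posterior: p(x | y = heads) = p(y = heads | x) * p(x) / p(y = heads).
posterior = {x: p_heads_given_x[x] * p_x[x] / p_heads for x in p_x}
# Seeing heads shifts belief toward the biased coin: 0.45 / 0.7, about 0.64.
```

Note that the marginal likelihood in the denominator is just the sum of the numerators over x, which is why the posterior automatically sums to 1.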

  24. Probability Prerequisites: basic probability axioms and definitions; joint probability; probabilistic independence; marginal probability; definition of conditional probability; Bayes rule; probability chain rule; common distributions; expected value (of a function) of a random variable.

  25. Probability Chain Rule. p(x_1, x_2, …, x_T) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_T | x_1, …, x_{T-1}) = ∏_{i=1}^{T} p(x_i | x_1, …, x_{i-1}). An extension of Bayes rule.
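The chain rule is how sequence models score a joint probability left to right, one conditional at a time. A sketch with a made-up conditional of my own (it looks only at the previous symbol, purely for brevity; nothing here is from the slides):

```python
# Toy conditional p(x_i | x_1, ..., x_{i-1}): slightly prefers repeating
# the most recent symbol. Alphabet is {"a", "b"}.
def p_next(history, x):
    if not history:
        return 0.5                       # p(x_1): uniform over the alphabet
    return 0.7 if x == history[-1] else 0.3

# Chain rule: p(x_1, ..., x_T) = product over i of p(x_i | x_1, ..., x_{i-1}).
def joint_probability(seq):
    p = 1.0
    for i, x in enumerate(seq):
        p *= p_next(seq[:i], x)
    return p

p_aab = joint_probability(["a", "a", "b"])  # 0.5 * 0.7 * 0.3 = 0.105
```

The decomposition is exact for any ordering of the variables; models differ only in how much of the history their conditionals actually use.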

  26. Probability Prerequisites: basic probability axioms and definitions; joint probability; probabilistic independence; marginal probability; definition of conditional probability; Bayes rule; probability chain rule; common distributions; expected value (of a function) of a random variable.

  27. Distribution Notation. If X is a R.V. and G is a distribution: X ∼ G means X is distributed according to ("sampled from") G.

  28. Distribution Notation. If X is a R.V. and G is a distribution: X ∼ G means X is distributed according to ("sampled from") G. G often has parameters θ = (θ_1, θ_2, …, θ_N) that govern its "shape". Formally written as X ∼ G(θ).

  29. Distribution Notation. If X is a R.V. and G is a distribution: X ∼ G means X is distributed according to ("sampled from") G. G often has parameters θ = (θ_1, θ_2, …, θ_N) that govern its "shape". Formally written as X ∼ G(θ). i.i.d.: if X_1, X_2, …, X_N are all independently sampled from G(θ), they are independently and identically distributed.

  30. Common Distributions (Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, (Gamma)). Bernoulli: a single draw. Binary R.V.: 0 (failure) or 1 (success). X ∼ Bernoulli(θ). p(X = 1) = θ, p(X = 0) = 1 - θ. Generally, p(X = k) = θ^k (1 - θ)^(1-k).

  31. Common Distributions (Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, (Gamma)). Bernoulli: a single draw. Binary R.V.: 0 (failure) or 1 (success). X ∼ Bernoulli(θ). p(X = 1) = θ, p(X = 0) = 1 - θ. Generally, p(X = k) = θ^k (1 - θ)^(1-k). Binomial: sum of N iid Bernoulli draws. Values X can take: 0, 1, …, N. Represents the number of successes. X ∼ Binomial(N, θ). p(X = k) = (N choose k) θ^k (1 - θ)^(N-k).
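The Bernoulli and Binomial pmfs above are easy to sanity-check in code. A minimal sketch (the function names are my own):

```python
from math import comb

# Binomial pmf: p(X = k) = (N choose k) * theta^k * (1 - theta)^(N - k).
def binomial_pmf(k, n, theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# N = 1 recovers the Bernoulli pmf: p(X = 1) = theta.
bern = binomial_pmf(1, 1, 0.3)                            # 0.3

# A pmf must sum to 1 over its support 0, 1, ..., N.
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
```

Writing the Bernoulli as the N = 1 special case is exactly the "generally, p(X = k) = θ^k (1 - θ)^(1-k)" line on the slide.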

  32. Common Distributions (Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, (Gamma)). Categorical: a single draw. Finite R.V. taking one of K values: 1, 2, …, K. X ∼ Cat(θ), θ ∈ ℝ^K. p(X = 1) = θ_1, p(X = 2) = θ_2, …, p(X = K) = θ_K. Generally, p(X = k) = ∏_j θ_j^(1[k = j]), where the indicator 1[c] = 1 if c is true, 0 if c is false. Multinomial: sum of N iid Categorical draws. A vector of size K representing how often value k was drawn. X ∼ Multinomial(N, θ), θ ∈ ℝ^K.

  33. Common Distributions (Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, (Gamma)). Poisson: discrete R.V. taking any integer ≥ 0. X ∼ Poisson(λ), where λ ∈ ℝ is the "rate". p(X = k) = λ^k exp(-λ) / k!

  34. Common Distributions (Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, (Gamma)). Normal: real R.V. taking any real number. X ∼ Normal(μ, σ), where μ is the mean and σ is the standard deviation. p(X = x) = 1/(√(2π) σ) exp(-(x - μ)² / (2σ²)). [plot of p(X = x) for several (μ, σ) settings: https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1920px-Normal_Distribution_PDF.svg.png]
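The Poisson pmf and Normal pdf from these slides transcribe directly into Python; a sketch (names are my own):

```python
from math import exp, factorial, pi, sqrt

# Poisson pmf: p(X = k) = lambda^k * exp(-lambda) / k!
def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

# Normal pdf: p(X = x) = 1/(sqrt(2 pi) sigma) * exp(-(x - mu)^2 / (2 sigma^2))
def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

# The normal density peaks at its mean; for the standard normal the peak
# value is 1 / sqrt(2 pi), roughly 0.399, matching the linked plot.
peak = normal_pdf(0.0, 0.0, 1.0)
```

One caveat worth remembering: the normal is a density, so normal_pdf returns a value that can exceed 1 for small σ, and probabilities come from integrating it over an interval.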

  35. Probability Prerequisites: basic probability axioms and definitions; joint probability; probabilistic independence; marginal probability; definition of conditional probability; Bayes rule; probability chain rule; common distributions; expected value (of a function) of a random variable.

  36. Expected Value of a Random Variable. Random variable: X ∼ p(·)

  37. Expected Value of a Random Variable. Random variable: X ∼ p(·). Expected value: E[X] = Σ_x x p(x) (the distribution p is implicit).

  38. Expected Value: Example. Uniform distribution of the number of cats I have: 1, 2, 3, 4, 5, 6. E[X] = Σ_x x p(x) = 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

  39. Expected Value: Example. Uniform distribution of the number of cats I have: 1, 2, 3, 4, 5, 6. E[X] = Σ_x x p(x) = 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5. Q: What common distribution is this?

  40. Expected Value: Example. Uniform distribution of the number of cats I have: 1, 2, 3, 4, 5, 6. E[X] = Σ_x x p(x) = 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5. Q: What common distribution is this? A: Categorical.

  41. Expected Value: Example 2. Non-uniform distribution of the number of cats a normal cat person has: 1, 2, 3, 4, 5, 6. E[X] = Σ_x x p(x) = 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
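Both cat examples follow from the same one-line sum E[X] = Σ_x x p(x). A quick Python sketch reproducing the slides' 3.5 and 2.5 (helper name is my own):

```python
# E[X] = sum over x of x * p(x), for a pmf given as {value: probability}.
def expected_value(pmf):
    return sum(x * p for x, p in pmf.items())

# Uniform distribution over 1..6 (the "die" example).
uniform = {x: 1 / 6 for x in range(1, 7)}
# Non-uniform distribution: half the mass on 1, the rest spread evenly.
skewed = {1: 1 / 2, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 10}

e_uniform = expected_value(uniform)  # 3.5
e_skewed = expected_value(skewed)    # 2.5
```

Shifting mass toward 1 pulls the expectation down from 3.5 to 2.5, even though the support is unchanged.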

  42. Expected Value of a Function of a Random Variable. X ∼ p(·). E[X] = Σ_x x p(x). E[f(X)] = ???

  43. Expected Value of a Function of a Random Variable. X ∼ p(·). E[X] = Σ_x x p(x). E[f(X)] = Σ_x f(x) p(x)

  44. Expected Value of Function: Example. Non-uniform distribution of the number of cats I start with: 1, 2, 3, 4, 5, 6. What if each cat magically becomes two? f(k) = 2^k. E[f(X)] = Σ_x f(x) p(x)

  45. Expected Value of Function: Example. Non-uniform distribution of the number of cats I start with: 1, 2, 3, 4, 5, 6. What if each cat magically becomes two? f(k) = 2^k. E[f(X)] = Σ_x f(x) p(x) = Σ_x 2^x p(x) = 1/2 * 2^1 + 1/10 * 2^2 + 1/10 * 2^3 + 1/10 * 2^4 + 1/10 * 2^5 + 1/10 * 2^6 = 13.4
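The same computation with f applied inside the sum reproduces the slide's 13.4; a sketch (helper name is my own):

```python
# E[f(X)] = sum over x of f(x) * p(x).
def expected_value_of(f, pmf):
    return sum(f(x) * p for x, p in pmf.items())

# Non-uniform cat distribution from the slide, with f(k) = 2^k.
skewed = {1: 1 / 2, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 10}
e_f = expected_value_of(lambda k: 2 ** k, skewed)  # 13.4
```

Note that E[f(X)] = 13.4 is not f(E[X]) = 2^2.5 (about 5.66): expectation does not commute with a nonlinear f.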

  46. Probability Prerequisites: basic probability axioms and definitions; joint probability; probabilistic independence; marginal probability; definition of conditional probability; Bayes rule; probability chain rule; common distributions; expected value (of a function) of a random variable.

  47. Outline. Review + Extension: Probability. Decision Theory. Loss Functions.

  48. Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x ("state of the world"). Output: a decision ỹ.

  49. Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x ("state of the world"). Output: a decision ỹ. Requirement 1: a decision (hypothesis) function h(x) to produce ỹ.

  50. Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x ("state of the world"). Output: a decision ỹ. Requirement 1: a decision (hypothesis) function h(x) to produce ỹ. Requirement 2: a function ℓ(y, ỹ) telling us how wrong we are.

  51. Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x ("state of the world"). Output: a decision ỹ. Requirement 1: a decision (hypothesis) function h(x) to produce ỹ. Requirement 2: a loss function ℓ(y, ỹ) telling us how wrong we are. Goal: minimize our expected loss across any possible input.

  52. Requirement 1: Decision Function. [diagram: instances 1-4 and extra knowledge feed into Machine Learning, which produces a Predictor h(x); its score goes to an Evaluator along with the gold/correct labels] h(x) is our predictor (classifier, regression model, clustering model, etc.).

  53. Requirement 2: Loss Function. ℓ ("ell", a fancy l character): ℓ(y, ỹ) ≥ 0, where ỹ is the predicted label/result and y is the "correct" label/result. Optimize ℓ? minimize or maximize? Loss: a function that tells you how much to penalize a prediction ỹ that differs from the correct answer y.

  54. Requirement 2: Loss Function. ℓ(y, ỹ) ≥ 0, where ỹ is the predicted label/result and y is the "correct" label/result. Negative ℓ (that is, -ℓ) is called a utility or reward function. Loss: a function that tells you how much to penalize a prediction ỹ that differs from the correct answer y.

  55. Decision Theory. Minimize expected loss across any possible input: argmin_ỹ E[ℓ(y, ỹ)]

  56. Risk Minimization. Minimize expected loss across any possible input: argmin_ỹ E[ℓ(y, ỹ)] = argmin_h E[ℓ(y, h(x))]. This is a particular, unspecified input pair (x, y)… but we want any possible pair.

  57. Decision Theory. Minimize expected loss across any possible input: argmin_ỹ E[ℓ(y, ỹ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x, y) ∼ P}[ℓ(y, h(x))]. Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y.

  58. Risk Minimization. Minimize expected loss across any possible input: argmin_ỹ E[ℓ(y, ỹ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x, y) ∼ P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y)

  59. Risk Minimization. Minimize expected loss across any possible input: argmin_ỹ E[ℓ(y, ỹ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x, y) ∼ P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y). We don't know this distribution*! (*we could try to approximate it analytically)

  60. Empirical Risk Minimization. Minimize expected loss across our observed inputs: argmin_ỹ E[ℓ(y, ỹ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x, y) ∼ P}[ℓ(y, h(x))] ≈ argmin_h (1/N) Σ_{i=1}^{N} ℓ(y_i, h(x_i))
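A minimal sketch of the empirical-risk approximation: average a loss over the observed pairs and pick the hypothesis with the smaller average. The data, the two hypotheses, and the squared loss are all hypothetical choices of mine, not the course's:

```python
# Empirical risk: (1/N) * sum over i of loss(y_i, h(x_i)).
def empirical_risk(h, data, loss):
    return sum(loss(y, h(x)) for x, y in data) / len(data)

squared_loss = lambda y, y_hat: (y - y_hat) ** 2
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # observed (x, y) pairs

h_double = lambda x: 2 * x   # hypothesis 1: predict y = 2x
h_triple = lambda x: 3 * x   # hypothesis 2: predict y = 3x

risk_double = empirical_risk(h_double, data, squared_loss)  # 0.02
risk_triple = empirical_risk(h_triple, data, squared_loss)  # about 4.35
# ERM picks the hypothesis with the lower average loss on the sample.
best = min((h_double, h_triple),
           key=lambda h: empirical_risk(h, data, squared_loss))
```

The sample average stands in for the integral over the unknown P(x, y) on the previous slides, which is exactly the ≈ in the slide's final line.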
