Probabilistic Independence
Independence: events can occur without affecting the probability of other events.
Formally: p(x, y) = p(x) · p(y). Generalizable to more than two random variables.
Q: Are the results of flipping the same coin twice in succession independent?
A: Yes (assuming no weird effects)
Probabilistic Independence
Independence: events can occur without affecting the probability of other events.
Formally: p(x, y) = p(x) · p(y). Generalizable to more than two random variables.
[figure: Venn diagram of events A and B inside the sample space ("everything")]
Q: Are A and B independent?
A: No (work it out from p(A, B) and the axioms)
Probabilistic Independence
Independence: events can occur without affecting the probability of other events.
Formally: p(x, y) = p(x) · p(y). Generalizable to more than two random variables.
Q: Are X and Y independent?

p(x, y)     | Y = 0 | Y = 1
X = "cat"   | .04   | .32
X = "dog"   | .2    | .04
X = "bird"  | .1    | .1
X = "human" | .1    | .1

A: No (find the marginal probabilities p(x) and p(y) and compare with the joint)
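The check suggested on this slide can be carried out directly. A minimal Python sketch, using the joint table above: compute both marginals and test p(x, y) = p(x) · p(y) for every cell.

```python
# Joint table from the slide: p(x, y) over animal type X and binary Y.
joint = {
    ("cat", 0): 0.04, ("cat", 1): 0.32,
    ("dog", 0): 0.20, ("dog", 1): 0.04,
    ("bird", 0): 0.10, ("bird", 1): 0.10,
    ("human", 0): 0.10, ("human", 1): 0.10,
}

# Marginals: p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# X and Y are independent iff p(x, y) == p(x) * p(y) in every cell.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
                  for (x, y) in joint)
print(independent)  # False: p(cat, 0) = .04 but p(cat) * p(Y=0) = .36 * .44
```

A single failing cell is enough to rule out independence; here the "cat" row already fails.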
Probability Prerequisites
• Basic probability axioms and definitions
• Joint probability
• Probabilistic Independence
• Marginal probability
• Definition of conditional probability
• Bayes rule
• Probability chain rule
• Common distributions
• Expected Value (of a function) of a Random Variable
Marginal(ized) Probability: The Discrete Case
Consider the mutually exclusive ways that different values of x could occur with y: (x₁ & y), (x₂ & y), (x₃ & y), (x₄ & y), …
Q: How do we write this in terms of joint probabilities?
A: p(y) = Σ_x p(x, y)
Conditional Probability
p(X | Y) = p(X, Y) / p(Y)
where p(Y) is the marginal probability of Y: p(Y) = ∫ p(X, Y) dX (a sum in the discrete case)
Conditional probabilities are probabilities.
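The discrete version of this definition is easy to compute from a joint table. A minimal sketch, with a made-up weather joint distribution (the events and numbers are illustrative, not from the slides):

```python
# Hypothetical joint distribution p(x, y) over precipitation and sky state.
joint = {("rain", "cloudy"): 0.30, ("rain", "clear"): 0.05,
         ("dry", "cloudy"): 0.25, ("dry", "clear"): 0.40}

def conditional(x, y):
    """p(x | y) = p(x, y) / p(y), with p(y) = sum_x p(x, y)."""
    p_y = sum(p for (xi, yi), p in joint.items() if yi == y)
    return joint[(x, y)] / p_y

print(conditional("rain", "cloudy"))  # 0.3 / 0.55, about 0.545
```

Note that conditional probabilities are probabilities: for fixed y, p(· | y) sums to 1 over x.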
Conditional Probabilities: Changing the Right
What happens as we add conjuncts to the right of the conditioning bar?
[figure: number line from 0 to 1 with p(A) marked; p(A | B) may land above or below p(A)]
Conditional Probabilities: Bias vs. Variance
Lower bias: conditioning makes the estimate more specific to what we care about.
Higher variance: for a fixed number of observations, the estimates become less reliable.
Revisiting Marginal Probability: The Discrete Case
Consider the mutually exclusive ways that different values of x could occur with y: (x₁ & y), (x₂ & y), (x₃ & y), (x₄ & y), …
p(y) = Σ_x p(x, y) = Σ_x p(x) p(y | x)
Deriving Bayes Rule
Start with the definition of conditional probability:
p(X | Y) = p(X, Y) / p(Y)
Solve for p(X, Y):
p(X, Y) = p(X | Y) p(Y)
Since p(X, Y) = p(Y, X), we also have p(X, Y) = p(Y | X) p(X), so:
p(X | Y) = p(Y | X) p(X) / p(Y)
Bayes Rule
p(X | Y) = p(Y | X) p(X) / p(Y)
posterior probability: p(X | Y); likelihood: p(Y | X); prior probability: p(X); marginal likelihood (probability): p(Y)
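Bayes rule is easy to apply numerically. A minimal sketch with hypothetical numbers (the sensitivity, false-positive rate, and prevalence below are chosen for illustration, not taken from the slides): a diagnostic test for a rare condition.

```python
# Hypothetical numbers: 95% sensitive test, 10% false-positive rate,
# 1% prevalence of the condition.
prior = 0.01            # p(condition)
likelihood = 0.95       # p(positive | condition)
false_positive = 0.10   # p(positive | no condition)

# Marginal likelihood: sum over both hypotheses (marginalization slide).
evidence = likelihood * prior + false_positive * (1 - prior)

# Bayes rule: posterior = likelihood * prior / marginal likelihood.
posterior = likelihood * prior / evidence
print(round(posterior, 4))  # 0.0876: a positive test still leaves p low
```

The low posterior despite a positive test is the effect of the small prior: most positives come from the large healthy population.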
Probability Chain Rule
p(x₁, x₂, …, x_T) = p(x₁) p(x₂ | x₁) p(x₃ | x₁, x₂) ⋯ p(x_T | x₁, …, x_{T−1}) = ∏_{t=1}^{T} p(x_t | x₁, …, x_{t−1})
(an extension of Bayes rule)
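The factorization is exact for any joint distribution. A minimal sketch verifying it numerically on a made-up joint over three binary variables (the probability values are arbitrary and only need to sum to 1):

```python
import itertools

# Hypothetical joint distribution p(x1, x2, x3) over three binary variables.
probs = [0.10, 0.05, 0.20, 0.05, 0.15, 0.10, 0.25, 0.10]
joint = dict(zip(itertools.product([0, 1], repeat=3), probs))

def marginal(prefix):
    """p(x1..xk = prefix), summing out the remaining variables."""
    return sum(p for xs, p in joint.items() if xs[:len(prefix)] == prefix)

# Chain rule: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2),
# where each conditional is a ratio of marginals.
x = (1, 0, 1)
chain = (marginal(x[:1])
         * marginal(x[:2]) / marginal(x[:1])
         * marginal(x[:3]) / marginal(x[:2]))
print(abs(chain - joint[x]) < 1e-12)  # True: the factorization is exact
```

The product telescopes back to the full joint, which is why any ordering of the variables works.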
Distribution Notation
If X is a R.V. and G is a distribution:
• X ∼ G means X is distributed according to ("sampled from") G
• G often has parameters θ = (θ₁, θ₂, …, θ_N) that govern its "shape"
• Formally written as X ∼ G(θ)
i.i.d.: If X₁, X₂, …, X_N are all independently sampled from G(θ), they are independently and identically distributed.
Common Distributions: Bernoulli and Binomial
Bernoulli: a single draw
• Binary R.V.: 0 (failure) or 1 (success)
• X ∼ Bernoulli(θ)
• p(X = 1) = θ, p(X = 0) = 1 − θ
• Generally, p(X = k) = θ^k (1 − θ)^{1−k}
Binomial: sum of N i.i.d. Bernoulli draws
• Values X can take: 0, 1, …, N; represents the number of successes
• X ∼ Binomial(N, θ)
• p(X = k) = (N choose k) θ^k (1 − θ)^{N−k}
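A minimal sketch of the Binomial pmf above (the helper name `binomial_pmf` is ours, not from the slides), checking two properties: Binomial(1, θ) reduces to Bernoulli(θ), and the pmf sums to 1 over k = 0..N.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """p(X = k) = (n choose k) * theta^k * (1 - theta)^(n - k)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

theta = 0.3
# A Binomial(1, theta) draw is just a Bernoulli(theta) draw:
print(binomial_pmf(1, 1, theta))  # 0.3 == p(X = 1) under Bernoulli(0.3)

# The pmf is a proper distribution: it sums to 1 over k = 0..N.
total = sum(binomial_pmf(k, 10, theta) for k in range(11))
print(round(total, 10))  # 1.0
```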
Common Distributions: Categorical and Multinomial
Categorical: a single draw
• Finite R.V. taking one of K values: 1, 2, …, K
• X ∼ Cat(θ), θ ∈ ℝ^K
• p(X = 1) = θ₁, p(X = 2) = θ₂, …, p(X = K) = θ_K
• Generally, p(X = k) = ∏_j θ_j^{1[j=k]}, where 1[c] = 1 if c is true, 0 if c is false
Multinomial: sum of N i.i.d. Categorical draws
• Vector of size K representing how often value k was drawn
• X ∼ Multinomial(N, θ), θ ∈ ℝ^K
Common Distributions: Poisson
• Discrete R.V. taking any integer ≥ 0
• X ∼ Poisson(λ), λ ∈ ℝ_{>0} is the "rate"
• p(X = k) = λ^k exp(−λ) / k!
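A minimal sketch of the Poisson pmf above (the helper name `poisson_pmf` is ours). Unlike the Binomial, the support is infinite, but the mass concentrates near the rate λ, so a partial sum already accounts for essentially all the probability.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """p(X = k) = lam^k * exp(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

lam = 4.0
# The full pmf sums to 1 over k >= 0; the tail beyond k = 100 is negligible.
partial = sum(poisson_pmf(k, lam) for k in range(100))
print(round(partial, 10))  # 1.0
```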
Common Distributions: Normal
• Real R.V. taking any real number
• X ∼ Normal(μ, σ), where μ is the mean and σ is the standard deviation
• p(X = x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
[figure: normal density curves p(X = x) for several (μ, σ) settings; image: https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1920px-Normal_Distribution_PDF.svg.png]
Expected Value of a Random Variable
Random variable X ∼ p(⋅)
Expected value: 𝔼[X] = Σ_x x p(x)  (the distribution p is implicit)
Expected Value: Example
Uniform distribution over the number of cats I have: 1, 2, 3, 4, 5, 6
𝔼[X] = Σ_x x p(x) = 1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5
Q: What common distribution is this?
A: Categorical
Expected Value: Example 2
Non-uniform distribution over the number of cats a normal cat person has: 1, 2, 3, 4, 5, 6 with p = (1/2, 1/10, 1/10, 1/10, 1/10, 1/10)
𝔼[X] = Σ_x x p(x) = 1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5
Expected Value of a Function of a Random Variable
X ∼ p(⋅); 𝔼[X] = Σ_x x p(x)
𝔼[g(X)] = Σ_x g(x) p(x)
Expected Value of a Function: Example
Non-uniform distribution over the number of cats I start with: 1, 2, 3, 4, 5, 6 with p = (1/2, 1/10, 1/10, 1/10, 1/10, 1/10)
What if each cat magically becomes two? g(k) = 2^k
𝔼[g(X)] = Σ_x g(x) p(x) = Σ_x 2^x p(x) = 1/2 · 2¹ + 1/10 · 2² + 1/10 · 2³ + 1/10 · 2⁴ + 1/10 · 2⁵ + 1/10 · 2⁶ = 13.4
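Both expectations from the cat examples can be reproduced in a few lines; a minimal sketch using the non-uniform distribution p = (1/2, 1/10, 1/10, 1/10, 1/10, 1/10) over counts 1..6:

```python
# Non-uniform distribution over cat counts, as on the slides.
values = [1, 2, 3, 4, 5, 6]
p = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

# E[X] = sum_x x p(x); E[g(X)] = sum_x g(x) p(x) with g(k) = 2^k.
e_x = sum(x * px for x, px in zip(values, p))
e_gx = sum(2**x * px for x, px in zip(values, p))
print(round(e_x, 10), round(e_gx, 10))  # 2.5 13.4
```

Note 𝔼[g(X)] = 13.4 ≠ g(𝔼[X]) = 2^2.5 ≈ 5.66: in general the expectation of a function is not the function of the expectation.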
Outline
• Review + Extension: Probability
• Decision Theory
• Loss Functions
Decision Theory
"Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36
Input: x ("state of the world"); Output: a decision ŷ
Requirement 1: a decision (hypothesis) function h(x) to produce ŷ
Requirement 2: a loss function ℓ(y, ŷ) telling us how wrong we are
Goal: minimize our expected loss across any possible input
Requirement 1: Decision Function
[figure: pipeline — instances 1–4 plus extra knowledge feed a machine-learning predictor h(x); an evaluator compares its predictions against gold/correct labels to produce a score]
h(x) is our predictor (classifier, regression model, clustering model, etc.)
Requirement 2: Loss Function
ℓ(y, ŷ) ≥ 0, where y is the "correct" label/result and ŷ is the predicted label/result ("ell" is the fancy-l character ℓ)
Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y
Do we optimize ℓ by minimizing or maximizing? We minimize.
Negative ℓ (−ℓ) is called a utility or reward function.
Decision Theory / Risk Minimization
Minimize expected loss across any possible input:
argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))]
The first two expectations concern a particular, unspecified input pair (x, y), but we want any possible pair.
Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y.
Risk Minimization
Minimize expected loss across any possible input:
argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y)
We don't know this distribution*!
*we could try to approximate it analytically
Empirical Risk Minimization
Minimize expected loss across our observed inputs:
argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] ≈ argmin_h (1/N) Σ_{i=1}^{N} ℓ(y_i, h(x_i))
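The approximation above replaces the unknown expectation with an average over observed pairs. A minimal sketch under assumed choices (a tiny made-up dataset, squared loss, and two hypothetical candidate predictors): empirical risk minimization picks whichever h has the lower average loss on the sample.

```python
# Tiny hypothetical dataset of (x, y) pairs.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def sq_loss(y, y_hat):
    """Squared loss, one common choice of the loss function ell."""
    return (y - y_hat) ** 2

def empirical_risk(h, data):
    """(1/N) * sum_i ell(y_i, h(x_i)): the average loss on the sample."""
    return sum(sq_loss(y, h(x)) for x, y in data) / len(data)

h1 = lambda x: 2 * x    # candidate hypothesis: y ~ 2x
h2 = lambda x: x + 1    # candidate hypothesis: y ~ x + 1
r1, r2 = empirical_risk(h1, data), empirical_risk(h2, data)
print(r1 < r2)  # True: ERM prefers h1 on this sample
```

In practice the argmin runs over a parametrized family of hypotheses rather than two fixed functions, but the criterion is the same average.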