Probabilistic Independence
Q: Are X and Y independent?
Independence: when events can occur and not impact the probability of other events. Formally: p(x,y) = p(x)*p(y). Generalizable to > 2 random variables.

            Y=0   Y=1
X=“cat”     .04   .32
X=“dog”     .2    .04
X=“bird”    .1    .1
X=“human”   .1    .1

A: No (find the marginal probabilities p(x) and p(y) and check whether p(x,y) = p(x)*p(y) in every cell).
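A minimal numpy sketch of this check, using the joint table above (the probabilities are the slide's; everything else is illustrative):

```python
import numpy as np

# Joint distribution p(x, y) from the slide's table: rows are X, columns are Y=0, Y=1.
joint = np.array([[0.04, 0.32],   # X = "cat"
                  [0.20, 0.04],   # X = "dog"
                  [0.10, 0.10],   # X = "bird"
                  [0.10, 0.10]])  # X = "human"

p_x = joint.sum(axis=1)           # marginal p(x)
p_y = joint.sum(axis=0)           # marginal p(y)
product = np.outer(p_x, p_y)      # p(x) * p(y) for every (x, y) pair

# X and Y are independent only if p(x, y) = p(x) * p(y) everywhere.
print(np.allclose(joint, product))  # False -> not independent
```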
Probability Prerequisites
• Basic probability axioms and definitions
• Joint probability
• Marginal probability
• Probabilistic Independence
• Definition of conditional probability
• Bayes rule
• Probability chain rule
• Common distributions
• Expected Value (of a function) of a Random Variable
Conditional Probability
p(X | Y) = p(X, Y) / p(Y)
Conditional probabilities are probabilities.
p(Y) is the marginal probability of Y: p(Y) = ∫ p(X, Y) dX
Revisiting Marginal Probability: The Discrete Case
[Figure: the event y decomposed into the disjoint events x_1 & y, x_2 & y, x_3 & y, x_4 & y]
p(y) = Σ_x p(x, y) = Σ_x p(x) p(y | x)
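The same identity in code, reusing the joint table from the independence slide (a sketch, not part of the original slides):

```python
import numpy as np

# p(y) obtained by summing the joint over x, or by combining p(x) with p(y | x).
joint = np.array([[0.04, 0.32],
                  [0.20, 0.04],
                  [0.10, 0.10],
                  [0.10, 0.10]])

p_y_direct = joint.sum(axis=0)                        # p(y) = sum_x p(x, y)
p_x = joint.sum(axis=1)
p_y_given_x = joint / p_x[:, None]                    # p(y | x) = p(x, y) / p(x)
p_y_chain = (p_x[:, None] * p_y_given_x).sum(axis=0)  # sum_x p(x) p(y | x)

print(p_y_direct, p_y_chain)                          # both give the same marginal
```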
Deriving Bayes Rule
Start with the conditional: p(X | Y) = p(X, Y) / p(Y)
Solve for the joint: p(X, Y) = p(X | Y) p(Y)
Since p(x, y) = p(y, x), the joint also factors as p(Y | X) p(X), which gives
p(X | Y) = p(Y | X) * p(X) / p(Y)
Bayes Rule
p(X | Y) = p(Y | X) * p(X) / p(Y)
p(X | Y): posterior probability; p(Y | X): likelihood; p(X): prior probability; p(Y): marginal likelihood (probability)
Probability Chain Rule
p(x_1, x_2, …, x_T) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_T | x_1, …, x_{T−1}) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t−1})
An extension of Bayes rule.
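A small sketch of the chain rule on a toy joint over three binary variables (the distribution here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy joint distribution over three binary variables x1, x2, x3,
# normalized to sum to 1 -- values are arbitrary.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def chain_rule_prob(x1, x2, x3):
    """p(x1) * p(x2 | x1) * p(x3 | x1, x2), each factor computed from the joint."""
    p_x1 = joint.sum(axis=(1, 2))[x1]
    p_x2_given_x1 = joint.sum(axis=2)[x1, x2] / joint.sum(axis=(1, 2))[x1]
    p_x3_given_x1_x2 = joint[x1, x2, x3] / joint.sum(axis=2)[x1, x2]
    return p_x1 * p_x2_given_x1 * p_x3_given_x1_x2

# The product of conditionals recovers the joint entry exactly.
print(np.isclose(chain_rule_prob(0, 1, 1), joint[0, 1, 1]))  # True
```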
Distribution Notation
If X is a R.V. and G is a distribution:
• X ∼ G means X is distributed according to (“sampled from”) G
• G often has parameters θ = (θ_1, θ_2, …, θ_N) that govern its “shape”
• Formally written as X ∼ G(θ)
i.i.d.: If X_1, X_2, …, X_N are all independently sampled from G(θ), they are independently and identically distributed.
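A quick illustration of i.i.d. sampling; the choice of a Bernoulli distribution with θ = 0.3 is an assumption made only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# N i.i.d. draws: every sample comes from the same distribution G(theta)
# and does not depend on the other samples.
theta = 0.3
samples = rng.binomial(n=1, p=theta, size=1000)  # 1000 i.i.d. Bernoulli(0.3) draws

# By the law of large numbers, the sample mean approaches theta.
print(samples.mean())  # roughly 0.3
```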
Common Distributions: Bernoulli/Binomial
Bernoulli: A single draw
• Binary R.V.: 0 (failure) or 1 (success)
• X ∼ Bernoulli(θ)
• p(X = 1) = θ, p(X = 0) = 1 − θ
• Generally, p(X = k) = θ^k (1 − θ)^(1−k)
Binomial: Sum of N i.i.d. Bernoulli draws
• Values X can take: 0, 1, …, N
• Represents the number of successes
• X ∼ Binomial(N, θ)
• p(X = k) = (N choose k) θ^k (1 − θ)^(N−k)
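A hedged sketch evaluating these PMFs with scipy; θ = 0.3 and N = 10 are illustrative values, not from the slides:

```python
from scipy import stats

theta, N = 0.3, 10

# Bernoulli: p(X = 1) = theta, p(X = 0) = 1 - theta
print(stats.bernoulli.pmf(1, theta))  # 0.3
print(stats.bernoulli.pmf(0, theta))  # 0.7

# Binomial: probability of k successes in N independent Bernoulli(theta) trials
print(stats.binom.pmf(3, N, theta))   # (10 choose 3) * 0.3^3 * 0.7^7
```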
Common Distributions: Categorical/Multinomial
Categorical: A single draw
• Finite R.V. taking one of K values: 1, 2, …, K
• X ∼ Cat(θ), θ ∈ ℝ^K
• p(X = 1) = θ_1, p(X = 2) = θ_2, …, p(X = K) = θ_K
• Generally, p(X = k) = ∏_j θ_j^(1[k = j]), where 1[c] = 1 if c is true and 0 if c is false
Multinomial: Sum of N i.i.d. Categorical draws
• Vector of size K representing how often value k was drawn
• X ∼ Multinomial(N, θ), θ ∈ ℝ^K
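A sampling sketch for the categorical/multinomial pair; the probability vector θ = (0.5, 0.3, 0.2) is an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3, 0.2])   # K = 3 category probabilities

# Categorical: one draw returning a single category index.
one_draw = rng.choice(len(theta), p=theta)

# Multinomial: counts of each category over N i.i.d. categorical draws.
counts = rng.multinomial(n=100, pvals=theta)

print(one_draw, counts, counts.sum())  # counts sums to N = 100
```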
Common Distributions: Poisson
• Discrete R.V. taking any integer ≥ 0
• X ∼ Poisson(λ), where the “rate” λ > 0
• PMF: p(X = k) = λ^k exp(−λ) / k!
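Checking the Poisson PMF against scipy for one illustrative (λ, k) pair:

```python
import math
from scipy import stats

lam, k = 4.0, 2   # illustrative rate and count

# The Poisson PMF from the slide, computed directly...
by_hand = lam**k * math.exp(-lam) / math.factorial(k)

# ...matches scipy's implementation.
print(by_hand, stats.poisson.pmf(k, lam))  # both ~0.1465
```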
Common Distributions: Normal
• Real R.V. taking any real number
• X ∼ Normal(μ, σ), μ is the mean, σ is the standard deviation
• PDF: p(X = x) = 1 / (√(2π) σ) · exp(−(x − μ)² / (2σ²))
[Figure: Normal PDFs for several (μ, σ); https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1920px-Normal_Distribution_PDF.svg.png]
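The same density computed directly and with scipy; μ, σ, and x are arbitrary illustrative values:

```python
import math
from scipy import stats

mu, sigma, x = 0.0, 2.0, 1.0

# The Normal density from the slide, computed directly...
by_hand = math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

# ...matches scipy (note scipy's `scale` is the standard deviation).
print(by_hand, stats.norm.pdf(x, loc=mu, scale=sigma))
```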
Common Distributions: Multivariate Normal
• Real vector R.V. X ∈ ℝ^K
• X ∼ Normal(μ, Σ), μ ∈ ℝ^K is the mean, Σ ∈ ℝ^(K×K) is the covariance
• PDF: p(X = x) ∝ exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Common Distributions: Gamma
• Real R.V. taking any positive real number
• X ∼ Gamma(k, θ), k > 0 is the “shape” (how skewed it is), θ > 0 is the “scale” (how spread out the distribution is)
• PDF: p(X = x) = x^(k−1) exp(−x/θ) / (θ^k Γ(k))
[Figure: Gamma PDFs for several (k, θ); https://en.wikipedia.org/wiki/Gamma_distribution#/media/File:Gamma_distribution_pdf.svg]
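The same sanity check for the Gamma density; the shape, scale, and evaluation point are assumed for illustration:

```python
import math
from scipy import stats

k, theta, x = 2.0, 3.0, 4.0

# The Gamma density from the slide, computed directly...
by_hand = x**(k - 1) * math.exp(-x / theta) / (theta**k * math.gamma(k))

# ...matches scipy's gamma (whose shape parameter is `a` and whose `scale` is theta).
print(by_hand, stats.gamma.pdf(x, a=k, scale=theta))
```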
Expected Value of a Random Variable
Random variable X ∼ p(⋅)
Expected value: 𝔼[X] = Σ_x x p(x) (the distribution p is implicit)
Expected Value: Example
Uniform distribution over the number of cats I have: 1, 2, 3, 4, 5, 6, each with probability 1/6.
𝔼[X] = Σ_x x p(x) = 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
Q: What common distribution is this? A: Categorical
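The same expectation in a couple of lines of numpy:

```python
import numpy as np

# The uniform "number of cats" example: values 1..6, each with probability 1/6.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

print(values @ probs)  # E[X] = 3.5
```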
Expected Value: Example 2
Non-uniform distribution over the number of cats a normal cat person has: 1, 2, 3, 4, 5, 6 with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10.
𝔼[X] = Σ_x x p(x) = 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
Expected Value of a Function of a Random Variable
X ∼ p(⋅)
𝔼[X] = Σ_x x p(x)
𝔼[f(X)] = Σ_x f(x) p(x)
Expected Value of Function: Example
Non-uniform distribution over the number of cats I start with: 1, 2, 3, 4, 5, 6 with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10.
What if each cat magically becomes two? f(k) = 2^k
𝔼[f(X)] = Σ_x f(x) p(x) = Σ_x 2^x p(x) = 1/2 * 2^1 + 1/10 * 2^2 + 1/10 * 2^3 + 1/10 * 2^4 + 1/10 * 2^5 + 1/10 * 2^6 = 13.4
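The second example in code, showing that f is applied inside the sum (note f(𝔼[X]) = 2^2.5 ≈ 5.7, not 13.4):

```python
import numpy as np

# Non-uniform "cats" distribution: values 1..6 with probabilities (1/2, 1/10, ..., 1/10).
values = np.arange(1, 7)
probs = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

print(values @ probs)            # E[X] = 2.5
print((2.0 ** values) @ probs)   # E[2^X] = 13.4  (apply f, then take the weighted average)
```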
Example Problem: ITILA Ex. 2.3
➢ Jo has a test for a nasty disease. We denote Jo’s state of health by the variable a (a=1: Jo has the disease; a=0 o/w) and the test result by b.
➢ The result of the test is either ‘positive’ (b = 1) or ‘negative’ (b = 0).
➢ The test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained.
➢ The final piece of background information is that 1% of people of Jo’s age and background have the disease.
Q: If Jo’s test is positive, what is the probability Jo has the disease? That is, find p(a = 1 | b = 1).
Conditionals p(b | a): p(b = 1 | a = 1) = 0.95, p(b = 0 | a = 0) = 0.95
Marginal of a: p(a = 1) = 0.01
By Bayes rule:
p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1) = (.95 * .01) / (.95 * .01 + .05 * .99) ≈ 0.16
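The same computation as a short script; the numbers are exactly those given in the problem statement:

```python
# ITILA Ex. 2.3 worked in code: the same Bayes-rule computation as the slide.
p_a1 = 0.01                 # prior: p(a=1), 1% of people have the disease
p_b1_given_a1 = 0.95        # sensitivity: p(b=1 | a=1)
p_b0_given_a0 = 0.95        # specificity: p(b=0 | a=0)

# Marginal p(b=1) by summing over both health states.
p_b1 = p_b1_given_a1 * p_a1 + (1 - p_b0_given_a0) * (1 - p_a1)

# Posterior via Bayes rule.
p_a1_given_b1 = p_b1_given_a1 * p_a1 / p_b1
print(p_a1_given_b1)        # ~0.16: a positive test still leaves Jo probably healthy
```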
Probability Topics (High-Level)
• Basics of Probability: Prereqs
• Philosophy of Probability, and Terminology
• Useful Quantities and Inequalities
A Bit of Philosophy and Terminology
What is a probability?
Core terminology
– Support/domain
– Partition function
Some principles
– Generative story
– Forward probability
– Inverse probability
Kinds of Statistics
• Descriptive: “The average grade on this assignment is 83.”
• Confirmatory
• Predictive
Interpretations of Probability (Camps of Probability)
• Frequentists
– Past performance: 58% of the past 100 flips were heads
– Hypothetical performance: If I flipped the coin in many parallel universes…
• Bayesians
– Subjective strength of belief: Would pay up to 58 cents for a chance to win $1
• ML People
– Output of some computable formula? p(heads) vs q(heads)
(my grouping, not too far off though)
“You cannot do inference without making assumptions.” – ITILA, 2.2, pg 26
General ML Consideration: Inductive Bias
What do we know before we see the data, and how does that influence our modeling decisions?
[Figure: four example items labeled A, B, C, D]
Partition these into two groups…
Who selected red vs. blue? Who selected one grouping vs. the other? (the candidate groupings are shown as images)
Tip: Remember how your own biases/interpretation are influencing your approach
Courtesy Hamed Pirsiavash
Some Terminology
Support
– The valid values a R.V. can take on
– The values over which a pmf/pdf is defined
Partition function/normalization function
– The function (or constant) that ensures a p{m,d}f sums to 1
Q: What is the support for a Poisson R.V.? Q: What is the partition function/constant?
(Recall the Poisson PMF: X ∼ Poisson(λ), λ is the “rate”; p(X = k) = λ^k exp(−λ) / k!)
Some More Terminology
• (Generative) Probabilistic Modeling
• Generative Story
• Forward probability (ITILA)
• Inverse probability (ITILA)
What is (Generative) Probabilistic Modeling?
So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x).
What if we want to model both x and y together? p(x, y)
Q (678 recap): Where have we used p(x, y)? A: Linear Discriminant Analysis
Or what if we only have data but no labels? p(x)