15-780 – Graduate Artificial Intelligence: Probabilistic modeling
J. Zico Kolter (this lecture) and Nihar Shah
Carnegie Mellon University, Spring 2020
Outline
• Probability in AI
• Background on probability
• Common distributions
• Maximum likelihood estimation
• Probabilistic graphical models
Probability in AI

Basic idea: the real world is probabilistic (at least at the level we can observe it), and our reasoning about it needs to be too.

The shift from “logical” to “probabilistic” AI systems (circa the 80s and 90s) represented a revolution in AI.

Probabilistic approaches are now intertwined with virtually all areas of AI, though research in, e.g., “pure” probabilistic graphical models has declined a bit in recent years in favor of neural-network-based generative models.
Example: topic modeling

Can we learn about the content of text documents just by reading through them and seeing what sorts of words “co-occur”?

[Figure from (Blei et al., 2011), demonstrating words and topics recovered from reading 17,000 Science articles: recovered topics include “Genetics” (human, genome, dna, genetic, genes, ...), “Evolution” (evolution, evolutionary, species, organisms, life, ...), “Disease” (disease, host, bacteria, diseases, resistance, ...), and “Computers” (computer, models, information, data, computers, ...), along with the probability assigned to each topic.]
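To make this concrete, below is a minimal sketch of topic modeling using scikit-learn’s LDA implementation; the tiny corpus and all hyperparameters are illustrative assumptions, not the setup behind the figure.

```python
# A toy LDA run: learn "topics" as distributions over words from word
# co-occurrence alone. Corpus and settings are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "human genome dna genetic genes sequence",
    "evolution evolutionary species organisms life origin",
    "disease host bacteria diseases resistance bacterial",
    "computer models information data computers system",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # document-word count matrix
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(X)

# Show the top few words in each recovered topic
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```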
Example: biological networks

Can we automatically determine how the presence or absence of some proteins in a cell affects others?

Figure from (Sachs et al., 2005) shows an automatically inferred probabilistic protein network, which captured most of the known interactions using data-driven methods (far less manual effort than previous methods).
Outline
• Probability in AI
• Background on probability
• Common distributions
• Maximum likelihood estimation
• Probabilistic graphical models
Basics of probability

A probability space is a tuple $(\Omega, \mathcal{F}, P)$ where
• The sample space $\Omega$ is a set of outcomes
• The event space $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$
• The probability measure $P : \mathcal{F} \rightarrow [0,1]$ is a countably additive positive measure with $P(\Omega) = 1$

A random variable is a measurable function $X : \Omega \rightarrow \mathbb{R}$ (or $\mathbb{R}^n$ for a random vector), such that for all Borel sets $B$,
$$P(X \in B) = P(\{\omega \in \Omega : X(\omega) \in B\})$$
Random variables

A random variable (informally) is a variable whose value is not initially known.

Instead, these variables can take on different values (including a possibly infinite number), and must take on exactly one of these values, each with an associated probability, which all together sum to one.

“Weather” takes values sunny, rainy, cloudy, snowy:
$p(\text{Weather} = \text{sunny}) = 0.3$
$p(\text{Weather} = \text{rainy}) = 0.2$
…

Slightly different notation applies for continuous random variables, which we will discuss shortly.
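As a quick illustration (not from the slides), a discrete random variable can be represented in code as a map from values to probabilities:

```python
# A minimal sketch: the "Weather" random variable as a dict from values
# to probabilities. The slide gives 0.3 and 0.2 for sunny/rainy; the
# cloudy/snowy values are made up here so the total is one.
weather = {"sunny": 0.3, "rainy": 0.2, "cloudy": 0.4, "snowy": 0.1}
assert abs(sum(weather.values()) - 1.0) < 1e-9

print(weather["sunny"])   # p(Weather = sunny) = 0.3
```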
Notation for random variables

In this lecture, we use upper case letters $X_i$ to denote random variables.

For a random variable $X_i$ taking values $1, 2, 3$,
$$p(X_i) = \begin{bmatrix} 0.1 \\ 0.5 \\ 0.4 \end{bmatrix}$$
represents the set of probabilities for each value that $X_i$ can take on (this is a function mapping values of $X_i$ to numbers that sum to one).

Conversely, we will use lower case $x_i$ to denote a specific value of $X_i$ (i.e., for the above example $x_i \in \{1,2,3\}$), and $p(X_i = x_i)$, or just $p(x_i)$, refers to a number (the corresponding entry of $p(X_i)$).
Examples of probability notation

Given two random variables: $X_1$ with values in $\{1,2,3\}$ and $X_2$ with values in $\{1,2\}$:

$p(X_1, X_2)$ refers to the joint distribution, i.e., a set of 6 values, one for each setting of the variables (i.e., a function mapping $(1,1), (1,2), (2,1), \ldots$ to the corresponding probabilities).

$p(x_1, x_2)$ is a number: the probability that $X_1 = x_1$ and $X_2 = x_2$.

$p(X_1, x_2)$ is a set of 3 values, the probabilities for all values of $X_1$ for the given value $X_2 = x_2$, i.e., it is a function mapping $\{1,2,3\}$ to numbers (note: not a probability distribution, it will not sum to one).

We generally call all of these terms factors (functions mapping values to numbers, even if they do not sum to one).
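A hedged sketch of this notation in code, with made-up numbers: the joint $p(X_1, X_2)$ stored as a $3 \times 2$ array, where indexing and slicing recover the factors described above.

```python
import numpy as np

# Joint p(X1, X2) for X1 in {1,2,3}, X2 in {1,2}; entries are illustrative.
p_x1x2 = np.array([[0.10, 0.15],
                   [0.20, 0.25],
                   [0.05, 0.25]])
assert np.isclose(p_x1x2.sum(), 1.0)   # a joint distribution sums to one

print(p_x1x2[0, 1])   # the number p(x1=1, x2=2)
print(p_x1x2[:, 1])   # the factor p(X1, x2=2): 3 numbers, need not sum to one
```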
Operations on probabilities/factors

We can perform operations on probabilities/factors by performing the operation on every corresponding value in the probabilities/factors.

For example, given three random variables $X_1, X_2, X_3$:
$$p(X_1, X_2) \;\mathrm{op}\; p(X_2, X_3)$$
denotes a factor over $X_1, X_2, X_3$ (i.e., a function over all possible combinations of values these three random variables can take), where the value for $(x_1, x_2, x_3)$ is given by $p(x_1, x_2) \;\mathrm{op}\; p(x_2, x_3)$.
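Below is a sketch of such a factor operation in code, using numpy broadcasting to implement the product op (the input distributions are random placeholders):

```python
import numpy as np

# Factors p(X1,X2) of shape (3,2) and p(X2,X3) of shape (2,2); their
# product is a factor over (X1,X2,X3) of shape (3,2,2), with
# f(x1,x2,x3) = p(x1,x2) * p(x2,x3).
rng = np.random.default_rng(0)
p12 = rng.dirichlet(np.ones(6)).reshape(3, 2)   # random joint over (X1,X2)
p23 = rng.dirichlet(np.ones(4)).reshape(2, 2)   # random joint over (X2,X3)

f123 = p12[:, :, None] * p23[None, :, :]   # broadcast over the shared X2 axis
assert f123.shape == (3, 2, 2)
```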
Conditional probability

The conditional probability $p(X_1 \mid X_2)$ (the conditional probability of $X_1$ given $X_2$) is defined as
$$p(X_1 \mid X_2) = \frac{p(X_1, X_2)}{p(X_2)}$$

This can also be written $p(X_1, X_2) = p(X_1 \mid X_2)\, p(X_2)$.

More generally, this leads to the chain rule:
$$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_1, \ldots, X_{i-1})$$
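As a sketch with illustrative numbers, conditioning a tabular joint distribution is an elementwise divide by the marginal:

```python
import numpy as np

# p(X1 | X2) = p(X1, X2) / p(X2), computed on a 3x2 table of made-up numbers.
p_x1x2 = np.array([[0.10, 0.15],
                   [0.20, 0.25],
                   [0.05, 0.25]])
p_x2 = p_x1x2.sum(axis=0)       # marginal p(X2), shape (2,)
p_x1_given_x2 = p_x1x2 / p_x2   # broadcast divide: column j is p(X1 | x2=j+1)

assert np.allclose(p_x1_given_x2.sum(axis=0), 1.0)   # each column sums to one
```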
Marginalization

For random variables $X_1, X_2$ with joint distribution $p(X_1, X_2)$:
$$p(X_1) = \sum_{x_2} p(X_1, x_2) = \sum_{x_2} p(X_1 \mid x_2)\, p(x_2)$$

This generalizes to joint distributions over multiple random variables:
$$p(X_1, \ldots, X_i) = \sum_{x_{i+1}, \ldots, x_n} p(X_1, \ldots, X_i, x_{i+1}, \ldots, x_n)$$

For $p$ to be a probability distribution, the marginalization over all variables must be one:
$$\sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) = 1$$
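In table form, marginalization is a sum over the axis of the variable being removed; a minimal sketch with the same illustrative table:

```python
import numpy as np

# Marginalize a joint p(X1, X2) down to p(X1) and p(X2) by summing axes.
p_x1x2 = np.array([[0.10, 0.15],
                   [0.20, 0.25],
                   [0.05, 0.25]])
p_x1 = p_x1x2.sum(axis=1)   # p(X1) = sum over x2 of p(X1, x2)
p_x2 = p_x1x2.sum(axis=0)   # p(X2) = sum over x1 of p(X1, x2)

assert np.isclose(p_x1x2.sum(), 1.0)   # marginalizing everything gives one
```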
Bayes’ rule

A straightforward manipulation of probabilities:
$$p(X_1 \mid X_2) = \frac{p(X_1, X_2)}{p(X_2)} = \frac{p(X_2 \mid X_1)\, p(X_1)}{p(X_2)} = \frac{p(X_2 \mid X_1)\, p(X_1)}{\sum_{x_1} p(X_2 \mid x_1)\, p(x_1)}$$

Poll: I want to know if I have come down with a rare strain of flu (occurring in only 1/10,000 people). There is an “accurate” test for the flu: if I have the flu, it will tell me I have it 99% of the time, and if I do not have it, it will tell me I do not have it 99% of the time. I go to the doctor and test positive. What is the probability I have this flu?
• ≈ 99%
• ≈ 10%
• ≈ 1%
• ≈ 0.1%
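For reference after the poll, Bayes’ rule answers this directly; a quick worked computation with the numbers stated on the slide:

```python
# p(flu | positive) = p(positive | flu) p(flu) / p(positive),
# with prior 1/10,000 and 99% accuracy in both directions.
p_flu = 1 / 10_000
p_pos_given_flu = 0.99
p_pos_given_no_flu = 0.01

p_pos = p_pos_given_flu * p_flu + p_pos_given_no_flu * (1 - p_flu)
print(p_pos_given_flu * p_flu / p_pos)   # about 0.0098, i.e. roughly 1%
```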
Independence

We say that random variables $X_1$ and $X_2$ are (marginally) independent if their joint distribution is the product of their marginals:
$$p(X_1, X_2) = p(X_1)\, p(X_2)$$

Equivalently, this can also be stated as the condition that
$$p(X_1 \mid X_2) = \frac{p(X_1, X_2)}{p(X_2)} = \frac{p(X_1)\, p(X_2)}{p(X_2)} = p(X_1)$$
and similarly $p(X_2 \mid X_1) = p(X_2)$.
Conditional independence

We say that random variables $X_1$ and $X_2$ are conditionally independent given $X_3$ if
$$p(X_1, X_2 \mid X_3) = p(X_1 \mid X_3)\, p(X_2 \mid X_3)$$

Again, this can be equivalently written:
$$p(X_1 \mid X_2, X_3) = \frac{p(X_1, X_2 \mid X_3)}{p(X_2 \mid X_3)} = \frac{p(X_1 \mid X_3)\, p(X_2 \mid X_3)}{p(X_2 \mid X_3)} = p(X_1 \mid X_3)$$
and similarly $p(X_2 \mid X_1, X_3) = p(X_2 \mid X_3)$.

Important: marginal independence does not imply conditional independence, or vice versa.
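One classic counterexample for the note above is the XOR construction: two independent fair bits are marginally independent but not conditionally independent given their XOR. A Monte Carlo sketch (the construction is standard; the code is illustrative):

```python
import numpy as np

# X1, X2: independent fair coin flips; X3 = X1 xor X2.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=100_000)
x2 = rng.integers(0, 2, size=100_000)
x3 = x1 ^ x2

# Marginal independence: p(x1=1, x2=1) ~ p(x1=1) p(x2=1) ~ 0.25
print(np.mean((x1 == 1) & (x2 == 1)), np.mean(x1 == 1) * np.mean(x2 == 1))

# Given x3 = 0 we have x1 = x2, so p(x1=1, x2=1 | x3=0) ~ 0.5 while
# p(x1=1 | x3=0) p(x2=1 | x3=0) ~ 0.25: not conditionally independent.
g = x3 == 0
print(np.mean((x1[g] == 1) & (x2[g] == 1)),
      np.mean(x1[g] == 1) * np.mean(x2[g] == 1))
```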
Expectation

The expectation of a random variable is denoted
$$\mathbf{E}[X] = \sum_{x} x \cdot p(x)$$
where we use upper case $X$ to emphasize that this is a function of the entire random variable (but unlike $p(X)$, it is a number).

Note that this only makes sense when the values that the random variable takes on are numerical (i.e., we can’t ask for the expectation of the random variable “Weather”).

This also generalizes to conditional expectation:
$$\mathbf{E}[X_1 \mid x_2] = \sum_{x_1} x_1 \cdot p(x_1 \mid x_2)$$
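A minimal sketch of this computation for the three-valued random variable used earlier (values $1, 2, 3$ with probabilities $0.1, 0.5, 0.4$):

```python
import numpy as np

# E[X] = sum over x of x * p(x) for a discrete random variable.
values = np.array([1, 2, 3])
probs = np.array([0.1, 0.5, 0.4])

expectation = np.sum(values * probs)
print(expectation)   # 0.1*1 + 0.5*2 + 0.4*3 = 2.3
```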