Probability
Machine Learning and Pattern Recognition
Chris Williams
School of Informatics, University of Edinburgh
August 2014
(All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)
Outline
◮ What is probability?
◮ Random Variables (discrete and continuous)
◮ Expectation
◮ Joint Distributions
◮ Marginal Probability
◮ Conditional Probability
◮ Chain Rule
◮ Bayes’ Rule
◮ Independence
◮ Conditional Independence
◮ Some Probability Distributions (for reference)
◮ Reading: Murphy secs 2.1-2.4
What is probability?
◮ Quantification of uncertainty
◮ Frequentist interpretation: long-run frequencies of events
  ◮ Example: the probability of a particular coin landing heads up is 0.43
◮ Bayesian interpretation: quantify our degrees of belief about something
  ◮ Example: the probability of it raining tomorrow is 0.3
  ◮ Not possible to repeat “tomorrow” many times
◮ Basic rules of probability are the same, no matter which interpretation is adopted
Random Variables
◮ A random variable (RV) X denotes a quantity that is subject to variations due to chance
◮ May denote the result of an experiment (e.g. flipping a coin) or the measurement of a real-world fluctuating quantity (e.g. temperature)
◮ Use capital letters to denote random variables and lower case letters to denote values that they take, e.g. $p(X = x)$
◮ An RV may be discrete or continuous
◮ A discrete variable takes on values from a finite or countably infinite set
◮ Probability mass function $p(X = x)$ for discrete random variables
◮ Examples:
  ◮ Colour of a car: blue, green, red
  ◮ Number of children in a family: 0, 1, 2, 3, 4, 5, 6, > 6
  ◮ Toss two coins, let $X = (\text{number of heads})^2$. X can take on the values 0, 1 and 4.
◮ Example: $p(\text{Colour} = \text{red}) = 0.3$
◮ $\sum_x p(x) = 1$
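As a rough illustration of the two-coin example, the sketch below enumerates the outcomes and builds the probability mass function of X. It assumes both coins are fair, which the slide does not state; the names `outcomes` and `pmf` are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption (not in the slides): both coins are fair, so each of the
# four outcomes HH, HT, TH, TT has probability 1/4.
outcomes = list(product("HT", repeat=2))

pmf = Counter()
for outcome in outcomes:
    x = outcome.count("H") ** 2          # X = (number of heads)^2
    pmf[x] += Fraction(1, len(outcomes))

for value in sorted(pmf):
    print(f"p(X = {value}) = {pmf[value]}")

# A valid pmf sums to one over all values of X.
print("total:", sum(pmf.values()))
```

Under the fairness assumption X takes the values 0, 1 and 4 with probabilities 1/4, 1/2 and 1/4.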
Continuous RVs
◮ Continuous RVs take on values that vary continuously within one or more real intervals
◮ Probability density function (pdf) $p(x)$ for a continuous random variable X
$$p(a \le X \le b) = \int_a^b p(x)\,dx$$
therefore
$$p(x \le X \le x + \delta x) \simeq p(x)\,\delta x$$
◮ $\int p(x)\,dx = 1$ (but values of $p(x)$ can be greater than 1)
◮ Examples (coming soon): Gaussian, Gamma, Exponential, Beta
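To make the remark about densities exceeding 1 concrete, here is a small sketch that integrates a density numerically. The choice of a uniform density on [0, 0.5] and the midpoint-rule helper `integrate` are assumptions made for illustration, not part of the slides.

```python
def integrate(pdf, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of pdf over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(pdf(lo + (i + 0.5) * width) for i in range(steps)) * width


def uniform_pdf(x, a=0.0, b=0.5):
    """Uniform density on [a, b]; equal to 2 everywhere on [0, 0.5]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


total = integrate(uniform_pdf, 0.0, 0.5)   # should be close to 1
prob = integrate(uniform_pdf, 0.1, 0.2)    # p(0.1 <= X <= 0.2), close to 0.2

print("density at 0.25:", uniform_pdf(0.25))
print("total probability:", round(total, 6))
print("p(0.1 <= X <= 0.2):", round(prob, 6))
```

The density value is 2, yet every probability obtained by integrating it stays between 0 and 1.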
Expectation
◮ Consider a function $f(x)$ mapping from x onto numerical values
$$E[f(x)] = \sum_x f(x)\,p(x) \quad \text{(discrete)}$$
$$E[f(x)] = \int f(x)\,p(x)\,dx \quad \text{(continuous)}$$
◮ With $f(x) = x$ we obtain the mean, $\mu_x$
◮ With $f(x) = (x - \mu_x)^2$ we obtain the variance
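The discrete form of the expectation can be sketched in a few lines. The fair six-sided die used as the distribution is an assumption chosen for illustration, and `expectation` is a hypothetical helper name.

```python
from fractions import Fraction


def expectation(f, pmf):
    """E[f(x)] = sum over x of f(x) p(x) for a discrete pmf given as a dict."""
    return sum(f(x) * p for x, p in pmf.items())


# Assumed example distribution: a fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}

mean = expectation(lambda x: x, die)                     # f(x) = x
variance = expectation(lambda x: (x - mean) ** 2, die)   # f(x) = (x - mean)^2

print("mean:", mean)          # 7/2
print("variance:", variance)  # 35/12
```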
Joint distributions
◮ Properties of several random variables are important for modelling complex problems
◮ $p(X_1 = x_1, X_2 = x_2, \dots, X_D = x_D)$
◮ “,” is read as “and”
◮ Examples about Grade and Intelligence (from Koller and Friedman, 2009)

             Intelligence = low   Intelligence = high
Grade = A          0.07                 0.18
Grade = B          0.28                 0.09
Grade = C          0.35                 0.03
Marginal Probability
◮ The sum rule
$$p(x) = \sum_y p(x, y)$$
◮ $p(\text{Grade} = \text{A})$ ??
◮ Replace sum by an integral for continuous RVs
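One way to apply the sum rule to the Grade and Intelligence table is sketched below. The dictionary layout and the helper name `marginal` are illustrative choices; the numbers are those of the joint table in the slides.

```python
# Joint distribution p(Grade, Intelligence) from the slides.
joint = {
    ("A", "low"): 0.07, ("A", "high"): 0.18,
    ("B", "low"): 0.28, ("B", "high"): 0.09,
    ("C", "low"): 0.35, ("C", "high"): 0.03,
}


def marginal(joint, axis):
    """Sum rule: sum the joint over the variable that is not kept.

    axis=0 keeps Grade, axis=1 keeps Intelligence.
    """
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


p_grade = marginal(joint, 0)
p_intelligence = marginal(joint, 1)

print({g: round(p, 2) for g, p in p_grade.items()})
print({i: round(p, 2) for i, p in p_intelligence.items()})
```

With these numbers the marginal for Grade = A comes out as 0.07 + 0.18 = 0.25.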
Conditional Probability
◮ Let X and Y be two disjoint groups of variables, such that $p(Y = y) > 0$. Then the conditional probability distribution (CPD) of X given $Y = y$ is given by
$$p(X = x \mid Y = y) = p(x \mid y) = \frac{p(x, y)}{p(y)}$$
◮ Product rule
$$p(X, Y) = p(X)\,p(Y \mid X) = p(Y)\,p(X \mid Y)$$
◮ Example: In the grades example, what is $p(\text{Intelligence} = \text{high} \mid \text{Grade} = \text{A})$?
◮ $\sum_x p(X = x \mid Y = y) = 1$ for all y
◮ Can we say anything about $\sum_y p(X = x \mid Y = y)$?
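The grades question and the two sums above can be explored numerically. This is a sketch using the joint table from the slides; `conditional` is a hypothetical helper name.

```python
# Joint distribution p(Grade, Intelligence) from the slides.
joint = {
    ("A", "low"): 0.07, ("A", "high"): 0.18,
    ("B", "low"): 0.28, ("B", "high"): 0.09,
    ("C", "low"): 0.35, ("C", "high"): 0.03,
}


def conditional(joint, grade):
    """p(Intelligence | Grade = grade) = p(grade, intelligence) / p(grade)."""
    p_grade = sum(p for (g, _), p in joint.items() if g == grade)
    return {i: p / p_grade for (g, i), p in joint.items() if g == grade}


given_a = conditional(joint, "A")
print("p(Intelligence = high | Grade = A) =", round(given_a["high"], 2))

# Summing over the conditioned variable gives 1 ...
sum_over_x = sum(given_a.values())
# ... but summing over the conditioning variable generally does not.
sum_over_y = sum(conditional(joint, g)["high"] for g in "ABC")
print(round(sum_over_x, 4), round(sum_over_y, 4))
```

For this table the first sum is 1 while the second is not, which suggests that no general constraint applies when summing over the conditioning variable.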
Chain Rule
The chain rule is derived by repeated application of the product rule
$$\begin{aligned}
p(X_1, \dots, X_D) &= p(X_1, \dots, X_{D-1})\, p(X_D \mid X_1, \dots, X_{D-1}) \\
&= p(X_1, \dots, X_{D-2})\, p(X_{D-1} \mid X_1, \dots, X_{D-2})\, p(X_D \mid X_1, \dots, X_{D-1}) \\
&= \dots \\
&= p(X_1) \prod_{i=2}^{D} p(X_i \mid X_1, \dots, X_{i-1})
\end{aligned}$$
◮ Exercise: give six decompositions of $p(x, y, z)$ using the chain rule
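Each ordering of the variables gives one chain-rule factorisation, so three variables give 3! = 6 of them. The sketch below simply prints them as strings; `chain_rule_factorisation` is a hypothetical helper written for this illustration.

```python
from itertools import permutations


def chain_rule_factorisation(order):
    """Write p(v1) p(v2 | v1) p(v3 | v1, v2) ... for a given variable ordering."""
    factors = []
    for i, var in enumerate(order):
        given = order[:i]
        if given:
            factors.append(f"p({var} | {', '.join(given)})")
        else:
            factors.append(f"p({var})")
    return " ".join(factors)


decompositions = [chain_rule_factorisation(o) for o in permutations("xyz")]
for d in decompositions:
    print("p(x, y, z) =", d)
```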
Bayes’ Rule
◮ From the product rule,
$$p(X \mid Y) = \frac{p(Y \mid X)\,p(X)}{p(Y)}$$
◮ From the sum rule the denominator is
$$p(Y) = \sum_X p(Y \mid X)\,p(X)$$
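Probabilistic Inference using Bayes’ Rule
◮ Tuberculosis (TB) and a skin test (Test)
◮ $p(\text{TB} = \text{yes}) = 0.001$ (for subjects who get tested)
◮ $p(\text{Test} = \text{yes} \mid \text{TB} = \text{yes}) = 0.95$
◮ $p(\text{Test} = \text{no} \mid \text{TB} = \text{no}) = 0.95$, so $p(\text{Test} = \text{yes} \mid \text{TB} = \text{no}) = 0.05$
◮ Person gets a positive test result. What is $p(\text{TB} = \text{yes} \mid \text{Test} = \text{yes})$?
$$\begin{aligned}
p(\text{TB} = \text{yes} \mid \text{Test} = \text{yes}) &= \frac{p(\text{Test} = \text{yes} \mid \text{TB} = \text{yes})\, p(\text{TB} = \text{yes})}{p(\text{Test} = \text{yes})} \\
&= \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.05 \times 0.999} \\
&\simeq 0.0187
\end{aligned}$$
NB: These are fictitious numbers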
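The calculation above can be reproduced with a short function. This is a minimal sketch; the function name `posterior` and the parameter names `sensitivity` and `specificity` are labels introduced here for p(Test = yes | TB = yes) and p(Test = no | TB = no), not terms used in the slides.

```python
def posterior(prior, sensitivity, specificity):
    """p(TB = yes | Test = yes) via Bayes' rule.

    prior        -- p(TB = yes)
    sensitivity  -- p(Test = yes | TB = yes)
    specificity  -- p(Test = no | TB = no)
    """
    false_positive_rate = 1.0 - specificity            # p(Test = yes | TB = no)
    # Sum rule for the denominator p(Test = yes).
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


p_tb = posterior(prior=0.001, sensitivity=0.95, specificity=0.95)
print(round(p_tb, 4))
```

Even with a fairly accurate test, the small prior keeps the posterior below 2% for these fictitious numbers.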
Independence
◮ Let X and Y be two disjoint groups of variables. Then X is said to be independent of Y if and only if
$$p(X \mid Y) = p(X)$$
for all possible values x and y of X and Y; otherwise X is said to be dependent on Y
◮ Using the definition of conditional probability, we get an equivalent expression for the independence condition
$$p(X, Y) = p(X)\,p(Y)$$
◮ X independent of Y ⇔ Y independent of X
◮ Independence of a set of variables: $X_1, \dots, X_D$ are independent iff
$$p(X_1, \dots, X_D) = \prod_{i=1}^{D} p(X_i)$$
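The product condition can be checked directly on a small joint table. The sketch below applies it to the Grade and Intelligence table from the slides and to a second table built as a product of marginals; the helper names `marginals` and `is_independent` and the tolerance are assumptions for illustration.

```python
import math
from itertools import product


def marginals(joint):
    """Return (p(x), p(y)) from a joint given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def is_independent(joint, tol=1e-9):
    """True if p(x, y) = p(x) p(y) for every pair of values, up to tol."""
    px, py = marginals(joint)
    return all(
        math.isclose(joint.get((x, y), 0.0), px[x] * py[y], abs_tol=tol)
        for x, y in product(px, py)
    )


grades = {
    ("A", "low"): 0.07, ("A", "high"): 0.18,
    ("B", "low"): 0.28, ("B", "high"): 0.09,
    ("C", "low"): 0.35, ("C", "high"): 0.03,
}
px, py = marginals(grades)
# A joint with the same marginals that factorises by construction.
factored = {(x, y): px[x] * py[y] for x, y in product(px, py)}

print(is_independent(grades))    # Grade and Intelligence are dependent here
print(is_independent(factored))
```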
Conditional Independence
◮ Let X, Y and Z be three disjoint groups of variables. X is said to be conditionally independent of Y given Z iff
$$p(x \mid y, z) = p(x \mid z)$$
for all possible values of x, y and z.
◮ Equivalently $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$ [show this]
◮ Notation: $I(X, Y \mid Z)$
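One possible route for the "show this" step is sketched below, assuming $p(y, z) > 0$ so that all conditionals are defined. It applies the product rule with everything conditioned on z.

```latex
\begin{align*}
p(x, y \mid z) &= p(x \mid y, z)\, p(y \mid z)
  && \text{(product rule, conditioned on } z\text{)} \\
               &= p(x \mid z)\, p(y \mid z)
  && \text{(using } p(x \mid y, z) = p(x \mid z)\text{)}
\end{align*}
% Conversely, if p(x, y | z) = p(x | z) p(y | z), dividing both sides
% by p(y | z) > 0 gives
\begin{align*}
p(x \mid y, z) = \frac{p(x, y \mid z)}{p(y \mid z)} = p(x \mid z).
\end{align*}
```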
Bernoulli Distribution
◮ X is a random variable that either takes the value 0 or the value 1.
◮ Let $p(X = 1 \mid p) = p$ and so $p(X = 0 \mid p) = 1 - p$.
◮ Then X has a Bernoulli distribution.
Categorical Distribution
◮ X is a random variable that takes one of the values $1, 2, \dots, D$.
◮ Let $p(X = i \mid p) = p_i$, with $\sum_{i=1}^{D} p_i = 1$.
◮ Then X has a categorical (aka multinoulli) distribution (see Murphy 2012, p. 35)
Binomial Distribution
◮ The binomial distribution is obtained from the total number of 1’s in n independent Bernoulli trials.
◮ X is a random variable that takes one of the values $0, 1, 2, \dots, n$.
◮ Let
$$p(X = r \mid p) = \binom{n}{r}\, p^r (1 - p)^{n - r}$$
◮ Then X is binomially distributed.
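The binomial formula translates directly into code. This is a small sketch using `math.comb` from the standard library; the values n = 4 and p = 0.5 are chosen for illustration only.

```python
from math import comb


def binomial_pmf(r, n, p):
    """p(X = r | p) = C(n, r) p^r (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 4, 0.5   # illustrative values, not taken from the slides
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
mean = sum(r * q for r, q in enumerate(pmf))

print(pmf)    # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(mean)   # equals n * p = 2.0
```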
Multinomial Distribution
◮ The multinomial distribution is obtained from the total count for each outcome in n independent multivariate trials with D possible outcomes.
◮ X is a random vector of length D taking values x with $x_i \in \mathbb{Z}^+$ (non-negative integers) and $\sum_{i=1}^{D} x_i = n$.
◮ Let
$$p(X = x \mid p) = \frac{n!}{x_1! \dots x_D!}\, p_1^{x_1} \dots p_D^{x_D}$$
◮ Then X is multinomially distributed.
Poisson Distribution
◮ The Poisson distribution is obtained from the binomial distribution in the limit $n \to \infty$ with $np = \lambda$ held fixed.
◮ X is a random variable taking non-negative integer values $0, 1, 2, \dots$.
◮ Let
$$p(X = x \mid \lambda) = \frac{\lambda^x \exp(-\lambda)}{x!}$$
◮ Then X is Poisson distributed.
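The binomial-to-Poisson limit can be checked numerically by setting p = λ/n, so that np = λ stays fixed while n grows. The sketch below compares the two pmfs; λ = 3 and the helper `max_gap` are assumptions chosen for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)


def max_gap(n, lam, upto=10):
    """Largest absolute difference between Binomial(n, lam/n) and Poisson(lam)
    over x = 0, ..., upto."""
    p = lam / n   # keeps n * p = lam fixed
    return max(
        abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(upto + 1)
    )


lam = 3.0   # illustrative rate
gaps = [max_gap(n, lam) for n in (10, 100, 1000)]
for n, gap in zip((10, 100, 1000), gaps):
    print(f"n = {n:5d}  max |binomial - poisson| = {gap:.5f}")
```

The gap shrinks as n increases, consistent with the limiting statement.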
Uniform Distribution
◮ X is a random variable taking values $x \in [a, b]$.
◮ Let $p(X = x) = 1 / (b - a)$
◮ Then X is uniformly distributed.
Note: Cannot have a uniform distribution on an unbounded region.
Gaussian Distribution
◮ X is a random variable taking values $x \in \mathbb{R}$ (real values).
◮ Let
$$p(X = x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
◮ Then X is Gaussian distributed with mean $\mu$ and variance $\sigma^2$.
Gamma Distribution
◮ The Gamma distribution has a rate parameter $\beta > 0$ (or a scale parameter $1/\beta$) and a shape parameter $\alpha > 0$.
◮ X is a random variable taking values $x \in \mathbb{R}^+$ (non-negative real values).
◮ Let
$$p(X = x \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)}\, x^{\alpha - 1} \beta^{\alpha} \exp(-\beta x)$$
◮ Then X is Gamma distributed.
◮ Note the Gamma function.
Exponential Distribution
◮ The exponential distribution is a Gamma distribution with $\alpha = 1$ (and rate $\beta = \lambda$).
◮ The exponential distribution is often used for arrival times.
◮ X is a random variable taking values $x \in \mathbb{R}^+$.
◮ Let $p(X = x \mid \lambda) = \lambda \exp(-\lambda x)$
◮ Then X is exponentially distributed.
Laplace Distribution
◮ The Laplace distribution is obtained from the difference between two independent identically exponentially distributed variables.
◮ X is a random variable taking values $x \in \mathbb{R}$.
◮ Let $p(X = x \mid \lambda) = (\lambda / 2) \exp(-\lambda |x|)$
◮ Then X is Laplace distributed.
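The construction as a difference of two exponentials can be sanity-checked by simulation. Under the density above, E[X] = 0 and E|X| = 1/λ, so the sample averages should land near those values. The rate λ = 2, the sample size and the seed are arbitrary choices for this sketch.

```python
import random

random.seed(0)     # fixed seed so the run is repeatable
lam = 2.0          # illustrative rate parameter
n = 200_000

# Difference of two independent Exponential(lam) draws.
samples = [random.expovariate(lam) - random.expovariate(lam) for _ in range(n)]

mean = sum(samples) / n
mean_abs = sum(abs(s) for s in samples) / n

print("sample mean (expect about 0):", round(mean, 3))
print("sample E|X| (expect about 1/lam = 0.5):", round(mean_abs, 3))
```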
Beta Distribution
◮ X is a random variable taking values $x \in [0, 1]$.
◮ Let
$$p(X = x \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, x^{a - 1} (1 - x)^{b - 1}$$
◮ Then X is $\beta(a, b)$ distributed.
[Figures: Beta densities for a = b = 0.5 and for a = 2, b = 3]