Machine Learning: Foundations. Lecturer: Yishay Mansour. Lecture 2 – Bayesian Inference. Scribes: Kfir Bar, Yaniv Bar, Marcelo Bacher. Based on notes by Shahar Yifrah, Keren Yizhak, Hadas Zur (2012).
Bayesian Inference
Bayesian inference is a method of statistical inference that uses a prior probability over hypotheses to determine how likely a hypothesis is to be true given the observed evidence. We will see three methods:
ML - Maximum Likelihood rule
MAP - Maximum A Posteriori rule
Bayes - posterior (Bayes) rule
Bayes Rule
In general: Pr[A|B] = Pr[B|A] Pr[A] / Pr[B].
In Bayesian inference:
D (data) - the known information
h - a hypothesis/classification regarding the data distribution
We use Bayes rule to compute how likely our hypothesis is given the data:
Pr[h|D] = Pr[D|h] Pr[h] / Pr[D]
Example 1: Cancer Detection
A hospital is examining a new cancer detection kit. The known information (prior) is as follows:
A patient with cancer has a 98% chance of a positive result
A healthy patient has a 97% chance of a negative result
The probability of cancer in the general population is 1%
How reliable is this kit?
Example 1: Cancer Detection
Let us calculate Pr[cancer|+]. According to Bayes rule:
Pr[cancer|+] = Pr[+|cancer] Pr[cancer] / Pr[+]
             = (0.98)(0.01) / ((0.98)(0.01) + (0.03)(0.99))
             = 0.0098 / 0.0395 ≈ 0.248
Example 1: Cancer Detection
Surprisingly, although the test seems very accurate, with detection probabilities of 97-98%, it is almost useless: about 3 out of 4 patients flagged as sick by the test are actually healthy. For a low error rate we could simply tell everyone they do not have cancer, which is correct in 99% of the cases. The low posterior probability comes from the low prior probability of cancer in the general population (1%).
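The calculation above can be checked with a few lines of Python. This is an illustrative sketch, not part of the original notes; the variable names are ours and the numbers are the priors stated in the example.

```python
# Bayes rule for the cancer-kit example: Pr[cancer|+] = Pr[+|cancer] Pr[cancer] / Pr[+]
p_cancer = 0.01             # Pr[cancer] in the general population
p_pos_given_cancer = 0.98   # Pr[+ | cancer]
p_neg_given_healthy = 0.97  # Pr[- | healthy]

p_pos_given_healthy = 1 - p_neg_given_healthy            # Pr[+ | healthy] = 0.03
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))          # law of total probability

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_cancer_given_pos)   # ~0.248: only about 1 in 4 positives is truly sick
```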
Example 2: Normal Distribution
A random variable Z is distributed normally with mean \mu and variance \sigma^2, i.e., Z ~ N(\mu, \sigma^2). Recall that its density is
f(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)
Example 2: Normal Distribution
We have m i.i.d. samples D = {z_1, ..., z_m} of the random variable Z. The likelihood of the sample is
Pr[D | \mu, \sigma] = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(z_i-\mu)^2}{2\sigma^2}\right) = c \cdot \exp\left(-\sum_{i=1}^{m} \frac{(z_i-\mu)^2}{2\sigma^2}\right),
where c = (\sqrt{2\pi}\,\sigma)^{-m} is a normalization factor (it does not depend on \mu).
Example 2: Normal Distribution
Maximum Likelihood (ML): We aim to choose the hypothesis that best explains the sample, independently of any prior over the hypothesis space, i.e., the parameters that maximize the likelihood of the sample:
h_{ML} = \arg\max_h Pr[D | h]; in our case (\mu_{ML}, \sigma_{ML}) = \arg\max_{\mu, \sigma} Pr[D | \mu, \sigma].
Example 2: Normal Distribution
Maximum Likelihood (ML): We take a log to simplify the computation:
\ln Pr[D | \mu, \sigma] = -m \ln(\sqrt{2\pi}\,\sigma) - \sum_{i=1}^{m} \frac{(z_i-\mu)^2}{2\sigma^2}.
Now we find the maximum with respect to \mu:
\frac{\partial}{\partial \mu} \ln Pr[D | \mu, \sigma] = \sum_{i=1}^{m} \frac{z_i-\mu}{\sigma^2} = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{m}\sum_{i=1}^{m} z_i.
It is easy to see that the second derivative (-m/\sigma^2) is negative, thus this is a maximum.
Example 2: Normal Distribution
Maximum Likelihood (ML): Note that this value of \mu is independent of the value of \sigma; it is simply the average of the observations. Now we compute the maximum with respect to \sigma, given that \mu = \mu_{ML}:
\frac{\partial}{\partial \sigma} \ln Pr[D | \mu_{ML}, \sigma] = -\frac{m}{\sigma} + \sum_{i=1}^{m} \frac{(z_i-\mu_{ML})^2}{\sigma^3} = 0 \;\Rightarrow\; \sigma_{ML}^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i-\mu_{ML})^2.
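As an illustration (not part of the original notes), the following sketch draws a synthetic normal sample and checks numerically that the sample mean and the 1/m-normalized standard deviation maximize the log-likelihood derived above. The data, the random seed, and the function name are assumptions made for this example.

```python
import numpy as np

# Synthetic data; the "true" parameters are only for illustration.
rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=3.0, size=1000)

mu_ml = z.mean()                               # (1/m) * sum(z_i)
sigma_ml = np.sqrt(((z - mu_ml) ** 2).mean())  # ML variance uses 1/m, not 1/(m-1)

def log_likelihood(mu, sigma):
    """ln Pr[D | mu, sigma] for the sample z."""
    m = len(z)
    return (-m * np.log(np.sqrt(2 * np.pi) * sigma)
            - np.sum((z - mu) ** 2) / (2 * sigma ** 2))

# Perturbing either estimate can only lower the log-likelihood:
assert log_likelihood(mu_ml, sigma_ml) >= log_likelihood(mu_ml + 0.1, sigma_ml)
assert log_likelihood(mu_ml, sigma_ml) >= log_likelihood(mu_ml, sigma_ml - 0.1)
```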
Example 2: Normal Distribution
Maximum A Posteriori (MAP): MAP adds a prior over the hypotheses. In this example the prior distributions of \mu and \sigma are N(0,1), and they are now taken into account:
h_{MAP} = \arg\max_h Pr[h | D] = \arg\max_h \frac{Pr[D | h]\,Pr[h]}{Pr[D]}.
Since Pr[D] is the same for all hypotheses h, we can omit it and maximize
Pr[D | \mu, \sigma] \cdot Pr[\mu] \cdot Pr[\sigma].
Example 2: Normal Distribution
Maximum A Posteriori (MAP): How will the result we got in the ML approach change for MAP? We added the knowledge that \sigma and \mu are small and concentrated around zero, since the prior is \mu, \sigma ~ N(0,1). Therefore the resulting hypothesis for \sigma and \mu should be closer to 0 than the one we got in ML.
Example 2: Normal Distribution
Maximum A Posteriori (MAP): Now we maximize the log-posterior over \mu and \sigma simultaneously:
\ln Pr[D | \mu, \sigma] - \frac{\mu^2}{2} - \frac{\sigma^2}{2} + const.
For instance, setting the derivative with respect to \mu to zero gives
\sum_{i=1}^{m} \frac{z_i-\mu}{\sigma^2} - \mu = 0 \;\Rightarrow\; \mu_{MAP} = \frac{\sum_{i=1}^{m} z_i}{m + \sigma^2}.
It can easily be seen that \mu and \sigma will be closer to zero than in the ML approach, since the prior terms shrink the estimates towards 0 (the denominator above is larger than m).
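A minimal numerical sketch of this shrinkage effect, assuming Python, a synthetic sample, and \sigma held fixed at 1 (a simplification we make here; the lecture maximizes over \mu and \sigma jointly):

```python
import numpy as np

# Hypothetical small sample; sigma is held fixed to keep the algebra simple.
rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=1.0, size=20)
sigma = 1.0
m = len(z)

mu_ml = z.mean()                      # ML estimate: sum(z_i) / m
mu_map = z.sum() / (m + sigma ** 2)   # MAP estimate under the N(0,1) prior on mu

def log_posterior(mu):
    # ln Pr[D | mu, sigma] + ln Pr[mu], with mu-independent constants dropped
    return -np.sum((z - mu) ** 2) / (2 * sigma ** 2) - mu ** 2 / 2

print(mu_ml, mu_map)                  # |mu_map| < |mu_ml|: shrunk towards 0
assert log_posterior(mu_map) >= log_posterior(mu_ml)
```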
Example 2: Normal Distribution
Posterior (Bayes): Assume \mu ~ N(\eta, 1), Z ~ N(\mu, 1) (i.e., \sigma = 1), and we see only one sample z of Z. What is the new posterior distribution of \mu? Since Pr[Z = z] is a normalizing factor, we can drop it in the calculation:
Pr[\mu | z] \propto Pr[z | \mu] \cdot Pr[\mu] \propto \exp\left(-\frac{(z-\mu)^2}{2}\right) \exp\left(-\frac{(\mu-\eta)^2}{2}\right).
Example 2: Normal Distribution
Posterior (Bayes): Completing the square in the exponent,
\frac{(z-\mu)^2}{2} + \frac{(\mu-\eta)^2}{2} = \left(\mu - \frac{z+\eta}{2}\right)^2 + \frac{(z-\eta)^2}{4},
where the second term does not depend on \mu and only contributes to the normalization factor. Hence
Pr[\mu | z] \propto \exp\left(-\left(\mu - \frac{z+\eta}{2}\right)^2\right).
Example 2: Normal Distribution
Posterior (Bayes): The new posterior distribution is therefore
\mu | z ~ N\left(\frac{z+\eta}{2}, \frac{1}{2}\right);
after taking into account the sample z, the mean of \mu moves towards z and the variance is reduced (from 1 to 1/2).
Example 2: Normal Distribution
Posterior (Bayes): In general, for a prior \mu ~ N(\eta, S^2) and samples Z_i ~ N(\mu, \sigma^2), given m samples z_1, ..., z_m the posterior of \mu is normal with
mean = \frac{\eta/S^2 + \sum_{i=1}^{m} z_i/\sigma^2}{1/S^2 + m/\sigma^2}, \quad variance = \frac{1}{1/S^2 + m/\sigma^2}.
Example 2: Normal Distribution
Posterior (Bayes): And if we assume S = \sigma, we get
mean = \frac{\eta + \sum_{i=1}^{m} z_i}{m+1}, \quad variance = \frac{\sigma^2}{m+1},
which is like starting with one additional sample of value \eta, i.e., the prior mean acts as an extra observation.
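The following sketch (ours, with hypothetical sample values and a function name we chose) implements the general update above and checks the "extra sample" interpretation for S = \sigma:

```python
import numpy as np

def posterior_normal_mean(z, eta, S, sigma):
    """Posterior of mu for a prior mu ~ N(eta, S^2) and samples z_i ~ N(mu, sigma^2).
    Returns (posterior mean, posterior variance)."""
    m = len(z)
    post_var = 1.0 / (1.0 / S ** 2 + m / sigma ** 2)
    post_mean = post_var * (eta / S ** 2 + np.sum(z) / sigma ** 2)
    return post_mean, post_var

z = np.array([1.2, 0.7, 1.9])                 # hypothetical observations
mean, var = posterior_normal_mean(z, eta=0.0, S=1.0, sigma=1.0)

# With S = sigma, the update behaves like one extra sample of value eta:
m = len(z)
assert np.isclose(mean, (0.0 + z.sum()) / (m + 1))
assert np.isclose(var, 1.0 ** 2 / (m + 1))
```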
Learning A Concept Family (1/2)
We are given a concept family H. Our information consists of examples S = {(x_i, b_i)}_{i=1}^{m}, where b_i = f(x_i) and f \in H is an unknown target function that classifies all samples.
Assumptions:
(1) The functions in H are deterministic, i.e., Pr[h(x) = 1] \in \{0, 1\}.
(2) The process that generates the inputs x_i is independent of the target function f.
For each h \in H we calculate Pr[S | h], where S = {(x_i, b_i) : 1 \le i \le m} and b_i = f(x_i).
Case 1: \exists i such that b_i \ne h(x_i): then Pr[x_i, b_i | h] = 0, so Pr[S | h] = 0.
Case 2: \forall i, b_i = h(x_i): then Pr[x_i, b_i | h] = Pr[x_i] \cdot Pr[b_i | h, x_i] = Pr[x_i], so
Pr[S | h] = \prod_{i=1}^{m} Pr[x_i].
Learning A Concept Family (2/2)
Definition: A consistent function h \in H classifies all the samples in S correctly, i.e., h(x_i) = b_i for every (x_i, b_i) \in S.
Let H' \subseteq H be the set of all functions that are consistent with S.
There are three methods to choose a predictor based on H':
ML - choose any consistent function; each one has the same likelihood Pr[S | h].
MAP - choose the consistent function with the highest prior probability.
Bayes - a prior-weighted combination of all consistent functions into one predictor:
B(y) = \frac{\sum_{h \in H'} Pr[h] \cdot h(y)}{Pr[H']}.
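To make the three rules concrete, here is a toy Python sketch. The hypothesis class (thresholds on {0, ..., 5}), the prior, and the sample are all invented for illustration and are not from the lecture.

```python
# H = threshold functions h_t(x) = 1 iff x >= t, for t in {0, ..., 5}.
def make_threshold(t):
    return lambda x: 1 if x >= t else 0

H = {t: make_threshold(t) for t in range(6)}          # hypothesis class
prior = {t: 2.0 ** (-t - 1) for t in range(6)}        # arbitrary prior over H

S = [(1, 0), (4, 1)]                                  # labelled sample (x, b)
consistent = [t for t in H if all(H[t](x) == b for x, b in S)]   # H' = {2, 3, 4}

ml_choice = consistent[0]                             # ML: any consistent hypothesis
map_choice = max(consistent, key=lambda t: prior[t])  # MAP: highest-prior consistent one

def bayes_predict(y):
    # Bayes: prior-weighted vote of all consistent hypotheses, normalized by Pr[H']
    total = sum(prior[t] for t in consistent)
    return sum(prior[t] * H[t](y) for t in consistent) / total

print(consistent, ml_choice, map_choice, bayes_predict(3))
```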
Example (Biased Coins)
We toss a coin n times and the coin comes up heads k times. We want to estimate the probability p that the coin will come up heads in the next toss. The probability that k out of n coin tosses come up heads is
Pr[(k, n) | p] = \binom{n}{k} p^k (1-p)^{n-k}.
With the Maximum Likelihood approach, one would choose the p that maximizes Pr[(k, n) | p], which is p = k/n. Yet this result seems unreasonable when n is small. (For example, if you toss the coin only once and get a tail, should you believe that it is impossible to get a head on the next toss?)
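A two-line illustration (ours, not from the lecture) of the ML estimate and its failure mode for a single toss:

```python
def p_ml(k, n):
    # ML estimate of the heads probability from k heads in n tosses
    return k / n

print(p_ml(7, 10))   # 0.7
print(p_ml(0, 1))    # 0.0: a single tail makes heads look impossible
```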
Laplace Rule (1/3)
Let us suppose a uniform prior distribution on p; that is, the prior density of p over all possible coins is uniform on [0, 1].
We calculate the probability of seeing k heads out of n tosses:
Pr[(k, n)] = \int_0^1 Pr[(k, n) | p]\,Pr[p]\,dp = \binom{n}{k} \int_0^1 x^k (1-x)^{n-k}\,dx.
Integration by parts gives
\int_0^1 x^k (1-x)^{n-k}\,dx = \left[\frac{x^{k+1}}{k+1}(1-x)^{n-k}\right]_0^1 + \frac{n-k}{k+1}\int_0^1 x^{k+1}(1-x)^{n-k-1}\,dx = \frac{n-k}{k+1}\int_0^1 x^{k+1}(1-x)^{n-k-1}\,dx,
since the boundary term vanishes. Because \binom{n}{k}\frac{n-k}{k+1} = \binom{n}{k+1}, we get
Pr[(k, n)] = \binom{n}{k+1}\int_0^1 x^{k+1}(1-x)^{n-(k+1)}\,dx = \int_0^1 Pr[(k+1, n) | p]\,Pr[p]\,dp = Pr[(k+1, n)].
Laplace Rule (2/3)
Comparing both ends of the above sequence of equalities, we realize that all these probabilities are equal, and since they sum to 1 over k = 0, ..., n, for any k
Pr[(k, n)] = \frac{1}{n+1}.
Intuitively, this means that for a random choice of the bias p, any possible number of heads in a sequence of n coin tosses is equally likely.
We want to calculate the posterior expectation E[p | s(k, n)], where s(k, n) is a specific sequence with k heads out of n tosses. We have
Pr[s(k, n) | p] = p^k (1-p)^{n-k},
Pr[s(k, n)] = \int_0^1 p^k (1-p)^{n-k}\,dp = \frac{1}{(n+1)\binom{n}{k}}.
Laplace Rule (3/3)
Hence,
E[p | s(k, n)] = \int_0^1 p \cdot \frac{Pr[s(k, n) | p]\,Pr[p]}{Pr[s(k, n)]}\,dp = \frac{\int_0^1 p^{k+1}(1-p)^{n-k}\,dp}{Pr[s(k, n)]} = \frac{1/\big((n+2)\binom{n+1}{k+1}\big)}{1/\big((n+1)\binom{n}{k}\big)} = \frac{k+1}{n+2}.
Intuitively, the Laplace correction \hat{p} = \frac{k+1}{n+2} is like adding two samples to the ML estimator, one with value 0 and one with value 1.
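A short sketch (ours) comparing the ML estimate k/n with the Laplace-corrected estimate (k+1)/(n+2) on a few hypothetical counts; the function names are our own.

```python
def p_ml(k, n):
    return k / n

def p_laplace(k, n):
    # posterior mean E[p | k heads out of n] under a uniform prior on p
    return (k + 1) / (n + 2)

for k, n in [(0, 1), (1, 1), (7, 10), (500, 1000)]:
    print(k, n, p_ml(k, n), p_laplace(k, n))
# For n = 1 the Laplace estimate stays away from the extremes (1/3 or 2/3),
# and as n grows it converges to the ML estimate k/n.
```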