  1. Machine Learning: Foundations
  Lecturer: Yishay Mansour
  Lecture 2 – Bayesian Inference
  Kfir Bar, Yaniv Bar, Marcelo Bacher
  Based on notes by Shahar Yifrah, Keren Yizhak, Hadas Zur (2012)

  2. Bayesian Inference
  - Bayesian inference is a method of statistical inference that uses a prior probability over hypotheses to determine how likely a hypothesis is to be true, given the observed evidence.
  - Three methods:
    - ML - Maximum Likelihood rule
    - MAP - Maximum A Posteriori rule
    - Bayes posterior rule

  3. Bayes Rule
  - In general: Pr[A|B] = Pr[B|A] Pr[A] / Pr[B]
  - In Bayesian inference:
    - data - the known information (the observed sample)
    - h - a hypothesis/classification regarding the data distribution
  - We use Bayes rule to compute the likelihood that our hypothesis is true:
    Pr[h|data] = Pr[data|h] Pr[h] / Pr[data]
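
To make the rule concrete, here is a minimal Python sketch (not from the original slides) that turns a prior and per-hypothesis likelihoods into a posterior over a finite hypothesis set; the two coin hypotheses and their numbers are illustrative assumptions.

def posterior(priors, likelihoods):
    """priors[h] = Pr[h], likelihoods[h] = Pr[data | h]; returns Pr[h | data]."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}  # Pr[data | h] * Pr[h]
    evidence = sum(joint.values())                           # Pr[data]
    return {h: joint[h] / evidence for h in joint}

# Example: two hypotheses about a coin, after observing a single head.
priors = {"fair": 0.5, "biased_0.9": 0.5}
likelihoods = {"fair": 0.5, "biased_0.9": 0.9}   # Pr[head | h]
print(posterior(priors, likelihoods))            # fair: ~0.36, biased_0.9: ~0.64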

  4. Example 1: Cancer Detection
  - A hospital is examining a new cancer detection kit. The known information (prior) is as follows:
    - A patient with cancer has a 98% chance of a positive result
    - A healthy patient has a 97% chance of a negative result
    - The probability of cancer in the general population is 1%
  - How reliable is this kit?

  5. Example 1: Cancer Detection
  - Let's calculate Pr[cancer|+]. According to Bayes rule we get:
    Pr[cancer|+] = Pr[+|cancer] Pr[cancer] / Pr[+]
                 = (0.98 · 0.01) / (0.98 · 0.01 + 0.03 · 0.99)
                 = 0.0098 / 0.0395 ≈ 0.248
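
The same numbers can be checked in a couple of lines of Python (not part of the original slides):

p_cancer = 0.01              # prior probability of cancer in the population
p_pos_given_cancer = 0.98    # sensitivity
p_pos_given_healthy = 0.03   # 1 - 0.97 specificity

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
print(p_pos_given_cancer * p_cancer / p_pos)   # ~0.248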

  6. Example 1: Cancer Detection
  - Surprisingly, the test is almost useless, even though it seems very accurate, with detection probabilities of 97-98%.
  - About 3 out of 4 patients whom the test labels as sick are actually healthy. If all we want is a low error rate, we can simply tell everyone they do not have cancer and be right in 99% of the cases.
  - The low posterior probability Pr[cancer|+] comes from the low probability of cancer in the general population (1%).

  7. Example 2: Normal Distribution
  - A random variable Z is distributed normally with mean μ and variance σ².
  - Recall:
    Pr[Z = z] = (1 / (σ√(2π))) · exp(-(z - μ)² / (2σ²))

  8. Example 2: Normal Distribution
  - We have m i.i.d. samples z₁, …, z_m of the random variable Z. The likelihood of the sample D = (z₁, …, z_m) is
    Pr[D | μ, σ] = ∏_{i=1}^m (1 / (σ√(2π))) exp(-(z_i - μ)² / (2σ²)) = c · σ^(-m) · exp(-Σ_{i=1}^m (z_i - μ)² / (2σ²)),
    where c = (2π)^(-m/2) is a normalization factor.

  9. Example 2: Normal Distribution
  - Maximum Likelihood (ML): We aim to choose the hypothesis which best explains the sample, independent of the prior over the hypothesis space (i.e., the parameters that maximize the likelihood of the sample):
    h_ML = argmax_h Pr[D | h]
  - In our case:
    (μ_ML, σ_ML) = argmax_{μ,σ} Pr[z₁, …, z_m | μ, σ]

  10. Example 2: Normal Distribution
  - Maximum Likelihood (ML): We take a log to simplify the computation:
    ln Pr[D | μ, σ] = -m ln σ - Σ_{i=1}^m (z_i - μ)² / (2σ²) + const
  - Now we find the maximum with respect to μ:
    ∂/∂μ ln Pr[D | μ, σ] = Σ_{i=1}^m (z_i - μ) / σ² = 0  ⟹  μ_ML = (1/m) Σ_{i=1}^m z_i
  - It is easy to see that the second derivative is negative, so this is indeed a maximum.

  11. Example 2: Normal Distribution
  - Maximum Likelihood (ML):
  - Note that this value of μ is independent of the value of σ, and it is simply the average of the observations.
  - Now we compute the maximum with respect to σ, given that μ is μ_ML = (1/m) Σ_{i=1}^m z_i:
    ∂/∂σ ln Pr[D | μ, σ] = -m/σ + Σ_{i=1}^m (z_i - μ)² / σ³ = 0  ⟹  σ²_ML = (1/m) Σ_{i=1}^m (z_i - μ_ML)²
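
As a quick sanity check (not from the slides), the following Python snippet draws a synthetic sample and compares the closed-form ML estimates with the parameters that generated it; the sample size and the "true" values are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=10_000)   # assumed true mu = 2.0, sigma = 1.5

mu_ml = z.mean()                                  # (1/m) * sum(z_i)
sigma_ml = np.sqrt(((z - mu_ml) ** 2).mean())     # note the 1/m factor, not 1/(m-1)

print(mu_ml, sigma_ml)   # both close to the true values 2.0 and 1.5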

  12. Example 2: Normal Distribution
  - Maximum A Posteriori (MAP): MAP adds a prior over the hypotheses. In this example, the prior distributions of μ and σ are N(0,1) and are now taken into account:
    h_MAP = argmax_h Pr[h | D] = argmax_h Pr[D | h] Pr[h] / Pr[D]
  - Since Pr[D] is constant for all h, we can omit it and have the following:
    h_MAP = argmax_{μ,σ} Pr[D | μ, σ] · Pr[μ] · Pr[σ]

  13. Example 2: Normal Distribution
  - Maximum A Posteriori (MAP):
  - How will the result we got in the ML approach change for MAP? We added the knowledge that σ and μ are small and around zero, since the prior is σ, μ ~ N(0,1).
  - Therefore, the result (the hypothesis regarding σ and μ) should be closer to 0 than the one we got in ML.

  14. Example 2: Normal Distribution
  - Maximum A Posteriori (MAP): Now we should maximize the likelihood and the prior terms simultaneously:
    ln Pr[D | μ, σ] + ln Pr[μ] + ln Pr[σ] = -m ln σ - Σ_{i=1}^m (z_i - μ)² / (2σ²) - μ²/2 - σ²/2 + const
  - It can easily be seen that μ and σ will be closer to zero than in the ML approach, since the additional terms -μ²/2 and -σ²/2 penalize values far from zero.
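
The shrinkage can be seen numerically. The sketch below is an illustration rather than lecture material: it grid-searches the log-likelihood and the log-posterior with N(0,1) priors on μ and σ for a small synthetic sample; the sample size, grid, and true parameters are assumptions.

import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(loc=1.0, scale=1.0, size=5)        # deliberately small sample

def log_likelihood(mu, sigma):
    # sum_i (z_i - mu)^2 written in a broadcast-friendly form
    sq = np.sum(z ** 2) - 2 * mu * np.sum(z) + len(z) * mu ** 2
    return -len(z) * np.log(sigma) - sq / (2 * sigma ** 2)

def log_posterior(mu, sigma):
    # add the log N(0,1) priors on mu and sigma (up to additive constants)
    return log_likelihood(mu, sigma) - mu ** 2 / 2 - sigma ** 2 / 2

mus = np.linspace(-3, 3, 601)
sigmas = np.linspace(0.05, 3, 600)
M, S = np.meshgrid(mus, sigmas)

def argmax_grid(objective):
    vals = objective(M, S)
    i, j = np.unravel_index(np.argmax(vals), vals.shape)
    return M[i, j], S[i, j]

print("ML :", argmax_grid(log_likelihood))   # roughly the sample mean and (biased) std
print("MAP:", argmax_grid(log_posterior))    # both estimates pulled toward zero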

  15. Example 2: Normal Distribution
  - Posterior (Bayes): Assume μ ~ N(η, 1) and Z ~ N(μ, 1), with σ = 1. We see only one sample z of Z. What is the new posterior distribution of μ?
    Pr[μ | Z = z] = Pr[Z = z | μ] Pr[μ] / Pr[Z = z]
  - Pr[Z = z] is a normalizing factor, so we can drop it for the calculations:
    Pr[μ | Z = z] ∝ exp(-(z - μ)²/2) · exp(-(μ - η)²/2)

  16. Example 2: Normal Distribution
  - Posterior (Bayes): Completing the square in the exponent,
    -(z - μ)²/2 - (μ - η)²/2 = -(μ - (z + η)/2)² - (z - η)²/4,
    and the term -(z - η)²/4 goes into the normalization factor, since it does not depend on μ.

  17. Example 2: Normal Distribution
  - Posterior (Bayes): The new posterior distribution has mean (z + η)/2 and variance 1/2:
    μ | Z = z ~ N((z + η)/2, 1/2)
  - After taking into account the sample z, μ moves towards z and the variance is reduced.

  18. Example 2: Normal Distribution
  - Posterior (Bayes): In general, for a prior μ ~ N(η, S²) and samples Z_i ~ N(μ, σ²), given m samples z₁, …, z_m we have:
    μ | z₁, …, z_m ~ N( (σ²η + S² Σ_{i=1}^m z_i) / (σ² + mS²), σ²S² / (σ² + mS²) )

  19. Example 2: Normal Distribution
  - Posterior (Bayes): And if we assume S = σ, we get:
    μ | z₁, …, z_m ~ N( (η + Σ_{i=1}^m z_i) / (m + 1), σ² / (m + 1) )
  - which is like starting with an additional sample of value η (the prior mean), i.e., averaging over m + 1 samples.
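
A short numerical check of this conjugate update (again, an illustration rather than lecture material; the prior and noise parameters are assumed):

import numpy as np

eta, S = 0.0, 1.0        # prior: mu ~ N(eta, S^2)
sigma = 1.0              # known observation noise, Z_i ~ N(mu, sigma^2)

rng = np.random.default_rng(2)
z = rng.normal(2.0, sigma, size=20)   # assumed true mu = 2.0
m = len(z)

post_mean = (sigma**2 * eta + S**2 * z.sum()) / (sigma**2 + m * S**2)
post_var = (sigma**2 * S**2) / (sigma**2 + m * S**2)
print(post_mean, post_var)

# With S == sigma this reduces to the "extra sample of value eta" form:
print((eta + z.sum()) / (m + 1), sigma**2 / (m + 1))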

  20. Learning A Concept Family (1/2)
  - We are given a concept family H.
  - Our information consists of examples S = {⟨x_i, b_i⟩ : 1 ≤ i ≤ m}, where f ∈ H is an unknown target function that classifies all samples, i.e., b_i = f(x_i).
  - Assumptions:
    (1) The functions in H are deterministic (Pr[h(x) = 1] ∈ {0, 1}).
    (2) The process that generates the inputs x_i is independent of the target function f.
  - For each h ∈ H we will calculate Pr[S | h]:
    Case 1: there exists i with h(x_i) ≠ b_i. Then Pr[⟨x_i, b_i⟩ | h] = 0, and hence Pr[S | h] = 0.
    Case 2: for all i, h(x_i) = b_i. Then Pr[⟨x_i, b_i⟩ | h] = Pr[x_i] · Pr[b_i | h, x_i] = Pr[x_i], and hence
      Pr[S | h] = ∏_{i=1}^m Pr[x_i] = Pr[S].

  21. Learning A Concept Family (2/2)
  - Definition: A consistent function h ∈ H classifies all the samples in S correctly (∀⟨x_i, b_i⟩ ∈ S: h(x_i) = b_i).
  - Let H' ⊆ H be all the functions that are consistent with S.
  - There are three methods to choose from H':
    - ML - choose any consistent function; each one has the same likelihood Pr[S | h].
    - MAP - choose the consistent function with the highest prior probability.
    - Bayes - a weighted combination of all consistent functions into one predictor:
      B(y) = Σ_{h ∈ H'} h(y) · Pr[h] / Pr[H']
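
To illustrate the three rules, here is a small Python sketch (not from the slides) using a toy hypothesis class of threshold functions; the class, the prior, and the labeled sample are all made-up assumptions.

import random

# Hypothesis class: h_t(x) = 1 iff x >= t, for thresholds t in {0, ..., 5}
H = {t: (lambda x, t=t: int(x >= t)) for t in range(6)}
prior = {t: (6 - t) / 21 for t in H}    # arbitrary prior, sums to 1

# Labeled sample produced by some unknown target in H
S = [(1, 0), (2, 0), (4, 1), (5, 1)]

# H': the hypotheses consistent with every example
H_prime = [t for t, h in H.items() if all(h(x) == b for x, b in S)]   # -> [3, 4]

h_ml = random.choice(H_prime)                   # ML: any consistent hypothesis
h_map = max(H_prime, key=lambda t: prior[t])    # MAP: highest prior among H'

def bayes_predict(y):
    # weighted vote of all consistent hypotheses, normalized by Pr[H']
    w = sum(prior[t] for t in H_prime)
    return sum(H[t](y) * prior[t] for t in H_prime) / w

print(H_prime, h_ml, h_map, bayes_predict(3))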

  22. Example (Biased Coins)
  - We toss a coin n times and the coin comes up heads k times.
  - We want to estimate the probability p that the coin will come up heads in the next toss.
  - The probability that k out of n coin tosses come up heads is:
    Pr[(k, n) | p] = C(n, k) · p^k · (1 - p)^(n-k),   where C(n, k) is the binomial coefficient.
  - With the Maximum Likelihood approach, one would choose the p that maximizes Pr[(k, n) | p], which is p = k/n.
  - Yet this result seems unreasonable when n is small. (For example, if you toss the coin only once and get a tail, should you believe that it is impossible to get a head on the next toss?)
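
A tiny grid search (illustrative only, with arbitrary n and k) confirms that p = k/n maximizes this likelihood:

from math import comb

n, k = 10, 3
def likelihood(p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_ml = max((i / 1000 for i in range(1001)), key=likelihood)
print(p_ml)   # 0.3 == k/n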

  23. Laplace Rule (1/3)
  - Let us suppose a uniform prior distribution on p. That is, the prior distribution over all possible coins is uniform: Pr[p] dp = dp for p ∈ [0, 1].
  - We will calculate the probability of seeing k heads out of n tosses:
    Pr[(k, n)] = ∫₀¹ Pr[(k, n) | p] Pr[p] dp = C(n, k) ∫₀¹ x^k (1 - x)^(n-k) dx
  - Integration by parts (the boundary term [x^(k+1)/(k+1) · (1 - x)^(n-k)]₀¹ vanishes) gives:
    C(n, k) ∫₀¹ x^k (1 - x)^(n-k) dx = C(n, k) · (n - k)/(k + 1) · ∫₀¹ x^(k+1) (1 - x)^(n-k-1) dx
      = C(n, k+1) ∫₀¹ x^(k+1) (1 - x)^(n-k-1) dx = ∫₀¹ Pr[(k+1, n) | p] Pr[p] dp = Pr[(k+1, n)]

  24. Laplace Rule (2/3)
  - Comparing both ends of the above sequence of equalities, we realize that all these probabilities are equal, and therefore, for any k,
    Pr[(k, n)] = 1 / (n + 1)
  - Intuitively, this means that for a random choice of the bias p, any possible number of heads in a sequence of n coin tosses is equally likely.
  - We want to calculate the posterior expectation E[p | s(k, n)], where s(k, n) is a specific sequence with k heads out of n tosses.
  - We have:
    Pr[s(k, n) | p] = p^k (1 - p)^(n-k)
    Pr[s(k, n)] = ∫₀¹ p^k (1 - p)^(n-k) dp = 1 / ((n + 1) · C(n, k))

  25. Laplace Rule (3/3)
  - Hence,
    E[p | s(k, n)] = ∫₀¹ p · Pr[s(k, n) | p] Pr[p] / Pr[s(k, n)] dp
                   = (n + 1) · C(n, k) · ∫₀¹ p^(k+1) (1 - p)^(n-k) dp
                   = (n + 1) · C(n, k) / ((n + 2) · C(n+1, k+1))
                   = (k + 1) / (n + 2)
  - Intuitively, the Laplace correction is like adding two samples to the ML estimator, one with value 0 and one with value 1.
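
The posterior expectation can also be checked numerically (an illustration, not from the notes; n and k are arbitrary), by integrating over a uniform prior on p:

import numpy as np

n, k = 5, 1
p = np.linspace(0, 1, 100_001)
likelihood = p**k * (1 - p)**(n - k)               # Pr[a specific sequence | p]
posterior = likelihood / np.trapz(likelihood, p)   # uniform prior, then normalize

print(np.trapz(p * posterior, p))   # ~0.2857 == (k+1)/(n+2) = 2/7
print(k / n)                        # ML estimate 0.2, for comparison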
