Foundations of Machine Learning
CentraleSupélec — Fall 2017
5. Bayesian decision theory
Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Practical matters...
● I do not grade homework that is sent as .docx.
● (Partial) solutions to Lab 2 are at the end of the slides of Chap. 4.
Learning objectives
After this lecture, you should be able to
● Apply Bayes' rule to simple inference and decision problems;
● Explain the connection between the Bayes decision rule, empirical risk minimization, maximum a posteriori and maximum likelihood;
● Apply the Naive Bayes algorithm.
Let's start by tossing coins...
Probability and inference
● Result of tossing a coin: x in {heads, tails}
– x = f(z), where z denotes unobserved variables: e.g. a complex physical function of the composition of the coin, the force that is applied to it, the initial conditions, etc.
– Replace f(z) (maybe deterministic, but unknown) with the random variable X in {0, 1}, drawn from a probability distribution P(X = x).
● We need to model P: here, a Bernoulli distribution.
● We do not know P, but we have a sample.
● Goal: approximate P (from which X is drawn): p0 = # heads / # tosses.
● Prediction of next toss: heads if p0 > 0.5, tails otherwise.
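As a concrete illustration, here is a minimal Python sketch (not part of the original slides; the simulated coin bias and the sample size are arbitrary assumptions) that estimates p0 from a sample of tosses and predicts the next toss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate tosses from an "unknown" coin (1 = heads, 0 = tails).
# The bias 0.6 is an arbitrary assumption, used only to generate a sample.
sample = rng.binomial(n=1, p=0.6, size=100)

# Approximate P by the empirical frequency p0 = # heads / # tosses.
p0 = sample.mean()

# Prediction of the next toss: heads if p0 > 0.5, tails otherwise.
prediction = "heads" if p0 > 0.5 else "tails"
print(f"p0 = {p0:.2f} -> predict {prediction}")
```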
Classification
● Cat vs. dog
– Cat = 1 (positive)
– Dog = 0 (negative)
– x1 = human contact
– x2 = good eater
● Prediction: [scatter plot of cats and dogs in the (human contact, good eater) feature plane]
Bayes rule
Reverend Thomas Bayes, 170?–1761 … possibly
Bayes rule
P(y | x) = P(x | y) P(y) / P(x)
Example: rare disease testing
– The test is correct 99% of the time.
– Disease prevalence = 1 out of 10,000.
What is the probability that a patient who tested positive actually has the disease? 99%? 90%? 10%? 1%?
By Bayes' rule:
P(disease | +) = P(+ | disease) P(disease) / [ P(+ | disease) P(disease) + P(+ | healthy) P(healthy) ]
= (0.99 × 0.0001) / [ 0.99 × 0.0001 + (1 − 0.99) × (1 − 0.0001) ]
≈ 0.0098, i.e. about 1%.
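The same computation in a short Python sketch (the variable names are mine; the numbers are those of the slide):

```python
# Rare-disease example: posterior probability of disease given a positive test.
p_disease = 1 / 10_000           # prevalence, P(disease)
p_pos_given_disease = 0.99       # P(+ | disease): test correct 99% of the time
p_pos_given_healthy = 1 - 0.99   # P(+ | healthy): false-positive rate

# Evidence: total probability of a positive test.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(disease | +) = P(+ | disease) P(disease) / P(+).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | +) = {p_disease_given_pos:.4f}")   # ~0.0098, about 1%
```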
Bayes rule
posterior = likelihood × prior / evidence:
P(y | x) = P(x | y) P(y) / P(x)
Bayes' decision rule: choose the class with the largest posterior, i.e. choose y = 1 if P(y=1 | x) > P(y=0 | x), and y = 0 otherwise.
Maximum A Posteriori criterion
● MAP decision rule:
– pick the hypothesis that is most probable,
– i.e. maximize the posterior:
Λ_MAP(x) = P(y=1 | x) / P(y=0 | x)
● Decision rule: if Λ_MAP(x) > 1, then choose y = 1; else choose y = 0.
Likelihood ratio test (LRT)
● The evidence p(x) cancels in the ratio of posteriors: it does not affect the decision rule.
● Likelihood ratio test: test whether the likelihood ratio
Λ(x) = P(x | y=1) / P(x | y=0)
is larger than the threshold P(y=0) / P(y=1).
● Decision rule: if Λ(x) > P(y=0) / P(y=1), then choose y = 1; else choose y = 0.
Example: LRT decision rule
Assuming the likelihoods below and equal priors, derive a decision rule based on the LRT. [figure: the two class-conditional densities p(x | y=0) and p(x | y=1)]
● Likelihood ratio: Λ(x) = p(x | y=1) / p(x | y=0).
● Simplifying the equation and taking the log reduces the LRT to a threshold on x.
● Equal priors mean we are testing whether log Λ(x) > 0. Hence: if x < 7, assign y = 1; else assign y = 0. [figure: the two densities cross at x = 7, with C=1 to the left and C=0 to the right]
● Now assume P(y=1) = 2 P(y=0). The test becomes log Λ(x) > log(1/2), i.e. x < 7 − log(1/2) ≈ 7.69: since y = 1 is a priori more likely, the decision boundary moves toward the C=0 region.
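The likelihood plot of the slide is not reproduced here. For a runnable sketch, the Python code below assumes Gaussian class-conditionals N(4, σ² = 6) for y = 1 and N(10, σ² = 6) for y = 0 — a choice of mine that gives log Λ(x) = 7 − x and therefore reproduces the boundaries 7 and ≈ 7.69 derived above:

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities (not from the slides; chosen so that
# log LR(x) = 7 - x, matching the decision boundaries 7 and ~7.69 above).
sigma = np.sqrt(6.0)
p_x_given_1 = norm(loc=4.0, scale=sigma)    # p(x | y=1)
p_x_given_0 = norm(loc=10.0, scale=sigma)   # p(x | y=0)

def lrt_predict(x, p_y1=0.5, p_y0=0.5):
    """Choose y=1 iff log LR(x) > log(P(y=0) / P(y=1))."""
    log_lr = p_x_given_1.logpdf(x) - p_x_given_0.logpdf(x)
    return int(log_lr > np.log(p_y0 / p_y1))

print(lrt_predict(6.5))                      # 1: left of the boundary x = 7
print(lrt_predict(7.5))                      # 0: right of the boundary x = 7
print(lrt_predict(7.5, p_y1=2/3, p_y0=1/3))  # 1: the boundary moves to ~7.69
```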
Maximum likelihood criterion
● Consider equal priors: P(y=1) = P(y=0), so the threshold P(y=0) / P(y=1) = 1.
● The Bayes decision rule then amounts to maximizing the likelihood P(x | y=c); it is hence called the Maximum Likelihood criterion, with Λ_ML(x) = P(x | y=1) / P(x | y=0).
– Decision rule: if Λ_ML(x) > 1, then choose y = 1; else choose y = 0.
Bayes rule for K > 2
● Bayes rule: P(y = c_k | x) = P(x | y = c_k) P(y = c_k) / P(x)
● with priors P(y = c_k) ≥ 0, Σ_k P(y = c_k) = 1, and evidence P(x) = Σ_l P(x | y = c_l) P(y = c_l).
● Decision rule: assign x to the class with the largest posterior, k = argmax_l P(y = c_l | x).
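A minimal sketch of the K-class rule, under illustrative assumptions (the three Gaussian likelihoods and the priors below are mine, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Illustrative 3-class problem: likelihoods p(x | y=c_k) and priors P(y=c_k).
classes = ["c1", "c2", "c3"]
priors = np.array([0.5, 0.3, 0.2])
likelihoods = [norm(loc=0.0, scale=1.0),
               norm(loc=2.0, scale=1.0),
               norm(loc=4.0, scale=2.0)]

def map_predict(x):
    """Assign x to the class maximizing the posterior P(y=c_k | x).

    The evidence P(x) is the same for every class, so it is enough to
    maximize the product P(x | y=c_k) P(y=c_k)."""
    scores = [lik.pdf(x) * prior for lik, prior in zip(likelihoods, priors)]
    return classes[int(np.argmax(scores))]

print(map_predict(0.5))   # "c1"
print(map_predict(3.0))   # "c2"
```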
Risk minimization
Losses and risks
● So far we have assumed that all errors are equally costly. But misclassifying a cancer sufferer as a healthy patient is much more problematic than the other way around.
● Action α_k: assigning class c_k.
● Loss: quantify the cost λ_kl of taking action α_k when the true class is c_l.
● Expected risk: R(α_k | x) = Σ_l λ_kl P(y = c_l | x).
● Decision (Bayes classifier): take the action with the smallest expected risk, argmin_k R(α_k | x).
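A minimal sketch of this risk-based decision, with an illustrative 2-class cancer-screening loss matrix (the cost values are assumptions of mine, not from the slides):

```python
import numpy as np

# Posterior P(y = c_l | x) for classes [healthy, cancer] at some input x.
posterior = np.array([0.9, 0.1])

# Loss matrix: loss[k, l] = cost lambda_kl of taking action alpha_k
# (predicting class c_k) when the true class is c_l. Illustrative values:
# missing a cancer costs 100, a false alarm costs 1.
loss = np.array([[0.0, 100.0],   # action alpha_0: predict healthy
                 [1.0,   0.0]])  # action alpha_1: predict cancer

# Expected risks R(alpha_k | x) = sum_l lambda_kl P(c_l | x).
risks = loss @ posterior
print(risks)                     # 10.0 (predict healthy) vs 0.9 (predict cancer)
print(int(np.argmin(risks)))     # 1: predict cancer although P(cancer | x) = 0.1
```

With these asymmetric costs the classifier predicts cancer even though healthy is the more probable class, which is exactly the point of introducing losses.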
Discriminant functions
● Classification = find K discriminant functions f_k such that x is assigned class c_k if k = argmax_l f_l(x).
● Bayes classifier: f_k(x) = −R(α_k | x).
● This defines K decision regions. [figure: decision regions for sports car, luxury sedan and family car in the (x1 = price, x2 = engine power) plane]
Bayes risk minimization
● Bayes risk: the overall expected risk of a decision rule α(·), R = ∫ R(α(x) | x) p(x) dx.
● Bayes decision rule: use the discriminant functions that minimize the Bayes risk.
● This is also an LRT. For 2 classes, the Bayes decision rule is equivalent to:
choose y = 1 if P(x | y=1) / P(x | y=0) > [ (λ_10 − λ_00) P(y=0) ] / [ (λ_01 − λ_11) P(y=1) ], and y = 0 otherwise.
(Indeed, y = 1 is chosen when R(α_1 | x) < R(α_0 | x), i.e. λ_10 P(y=0 | x) + λ_11 P(y=1 | x) < λ_00 P(y=0 | x) + λ_01 P(y=1 | x); rearranging and applying Bayes' rule gives the threshold above.)
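The equivalence can also be checked numerically; the sketch below (illustrative loss matrix, priors and likelihood values of my choosing) compares the risk-minimizing decision with the LRT decision for a few random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative loss matrix lambda[k, l] and priors.
lam = np.array([[0.0, 5.0],
                [1.0, 0.0]])
p_y0, p_y1 = 0.7, 0.3

for _ in range(5):
    # Arbitrary likelihood values p(x | y=0) and p(x | y=1) at some input x.
    px0, px1 = rng.uniform(0.01, 1.0, size=2)
    # Decision 1: minimize the expected risk computed from the posteriors.
    posterior = np.array([px0 * p_y0, px1 * p_y1])
    posterior /= posterior.sum()
    risk_decision = int(np.argmin(lam @ posterior))
    # Decision 2: LRT with threshold
    # (lambda_10 - lambda_00) P(y=0) / ((lambda_01 - lambda_11) P(y=1)).
    threshold = (lam[1, 0] - lam[0, 0]) * p_y0 / ((lam[0, 1] - lam[1, 1]) * p_y1)
    lrt_decision = int(px1 / px0 > threshold)
    print(risk_decision == lrt_decision)   # True on every draw
```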
0/1 Loss
● All misclassifications are equally costly: λ_kl = 0 if k = l, and 1 otherwise.
● Then R(α_k | x) = Σ_{l ≠ k} P(y = c_l | x) = 1 − P(y = c_k | x).
● Minimizing the risk:
– choose the most probable class (MAP);
– this is equivalent to the Bayes decision rule.
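A small numerical check of this statement (the posterior values below are an arbitrary example): with the 0/1 loss, the expected risks are 1 minus the posteriors, so minimizing the risk and maximizing the posterior pick the same class.

```python
import numpy as np

K = 4
zero_one_loss = np.ones((K, K)) - np.eye(K)       # lambda_kl = 0 if k = l, else 1

posterior = np.array([0.1, 0.5, 0.15, 0.25])      # illustrative P(y = c_k | x)
risks = zero_one_loss @ posterior                 # R(alpha_k | x) = 1 - P(c_k | x)

print(np.allclose(risks, 1 - posterior))          # True
print(np.argmin(risks) == np.argmax(posterior))   # True: both pick the MAP class
```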
Maximum likelihood criterion
● Consider equal priors, P(y=1) = P(y=0), and the 0/1 loss function.
● In the LRT threshold [ (λ_10 − λ_00) P(y=0) ] / [ (λ_01 − λ_11) P(y=1) ]:
– the loss factor (λ_10 − λ_00) / (λ_01 − λ_11) = 1 (0/1 loss),
– the prior factor P(y=0) / P(y=1) = 1 (equal priors).
● The Bayes decision rule therefore reduces to the maximum likelihood criterion: choose y = 1 if Λ_ML(x) = P(x | y=1) / P(x | y=0) > 1.