Machine Learning: Foundations. Lecturer: Yishay Mansour. Lecture 2 – Bayesian Inference. Scribes: Kfir Bar, Yaniv Bar, Marcelo Bacher. Based on notes by Shahar Yifrah, Keren Yizhak, Hadas Zur (2012).
Bayesian Inference
Bayesian inference is a method of statistical inference that uses a prior probability over hypotheses to determine how likely a hypothesis is to be true given the observed evidence. We will see three methods:
ML - Maximum Likelihood rule
MAP - Maximum A Posteriori rule
Bayes - posterior (Bayes) rule
Bayes Rule
In general: Pr[A|B] = Pr[B|A] Pr[A] / Pr[B].
In Bayesian inference:
D (data) - the known information
h - a hypothesis/classification regarding the data distribution
We use Bayes rule to compute how likely our hypothesis is given the data:
Pr[h|D] = Pr[D|h] Pr[h] / Pr[D]
Example 1: Cancer Detection
A hospital is examining a new cancer detection kit. The known information (prior) is as follows:
A patient with cancer has a 98% chance of a positive result
A healthy patient has a 97% chance of a negative result
The probability of cancer in the general population is 1%
How reliable is this kit?
Example 1: Cancer Detection
Let us calculate Pr[cancer|+]. According to Bayes rule:
Pr[cancer|+] = Pr[+|cancer] Pr[cancer] / Pr[+]
             = (0.98)(0.01) / ((0.98)(0.01) + (0.03)(0.99))
             = 0.0098 / 0.0395 ≈ 0.248
Example 1: Cancer Detection
Surprisingly, although the test seems very accurate, with detection probabilities of 97-98%, it is almost useless: about 3 out of 4 patients flagged as sick by the test are actually healthy. For a low error rate we could simply tell everyone they do not have cancer, which is correct in 99% of the cases. The low posterior probability comes from the low prior probability of cancer in the general population (1%).
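The calculation above can be checked with a few lines of Python. This is an illustrative sketch, not part of the original notes; the variable names are ours and the numbers are the priors stated in the example.

```python
# Bayes rule for the cancer-kit example: Pr[cancer|+] = Pr[+|cancer] Pr[cancer] / Pr[+]
p_cancer = 0.01             # Pr[cancer] in the general population
p_pos_given_cancer = 0.98   # Pr[+ | cancer]
p_neg_given_healthy = 0.97  # Pr[- | healthy]

p_pos_given_healthy = 1 - p_neg_given_healthy            # Pr[+ | healthy] = 0.03
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))          # law of total probability

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_cancer_given_pos)   # ~0.248: only about 1 in 4 positives is truly sick
```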
Example 2: Normal Distribution
A random variable Z is distributed normally with mean \mu and variance \sigma^2, i.e., Z ~ N(\mu, \sigma^2). Recall that its density is
f(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)
Example 2: Normal Distribution
We have m i.i.d. samples D = {z_1, ..., z_m} of the random variable Z. The likelihood of the sample is
Pr[D | \mu, \sigma] = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(z_i-\mu)^2}{2\sigma^2}\right) = c \cdot \exp\left(-\sum_{i=1}^{m} \frac{(z_i-\mu)^2}{2\sigma^2}\right),
where c = (\sqrt{2\pi}\,\sigma)^{-m} is a normalization factor (it does not depend on \mu).
Example 2: Normal Distribution
Maximum Likelihood (ML): We aim to choose the hypothesis that best explains the sample, independently of any prior over the hypothesis space, i.e., the parameters that maximize the likelihood of the sample:
h_{ML} = \arg\max_h Pr[D | h]; in our case (\mu_{ML}, \sigma_{ML}) = \arg\max_{\mu, \sigma} Pr[D | \mu, \sigma].
Example 2: Normal Distribution
Maximum Likelihood (ML): We take a log to simplify the computation:
\ln Pr[D | \mu, \sigma] = -m \ln(\sqrt{2\pi}\,\sigma) - \sum_{i=1}^{m} \frac{(z_i-\mu)^2}{2\sigma^2}.
Now we find the maximum with respect to \mu:
\frac{\partial}{\partial \mu} \ln Pr[D | \mu, \sigma] = \sum_{i=1}^{m} \frac{z_i-\mu}{\sigma^2} = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{m}\sum_{i=1}^{m} z_i.
It is easy to see that the second derivative (-m/\sigma^2) is negative, thus this is a maximum.
Example 2: Normal Distribution
Maximum Likelihood (ML): Note that this value of \mu is independent of the value of \sigma; it is simply the average of the observations. Now we compute the maximum with respect to \sigma, given that \mu = \mu_{ML}:
\frac{\partial}{\partial \sigma} \ln Pr[D | \mu_{ML}, \sigma] = -\frac{m}{\sigma} + \sum_{i=1}^{m} \frac{(z_i-\mu_{ML})^2}{\sigma^3} = 0 \;\Rightarrow\; \sigma_{ML}^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i-\mu_{ML})^2.
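As an illustration (not part of the original notes), the following sketch draws a synthetic normal sample and checks numerically that the sample mean and the 1/m-normalized standard deviation maximize the log-likelihood derived above. The data, the random seed, and the function name are assumptions made for this example.

```python
import numpy as np

# Synthetic data; the "true" parameters are only for illustration.
rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=3.0, size=1000)

mu_ml = z.mean()                               # (1/m) * sum(z_i)
sigma_ml = np.sqrt(((z - mu_ml) ** 2).mean())  # ML variance uses 1/m, not 1/(m-1)

def log_likelihood(mu, sigma):
    """ln Pr[D | mu, sigma] for the sample z."""
    m = len(z)
    return (-m * np.log(np.sqrt(2 * np.pi) * sigma)
            - np.sum((z - mu) ** 2) / (2 * sigma ** 2))

# Perturbing either estimate can only lower the log-likelihood:
assert log_likelihood(mu_ml, sigma_ml) >= log_likelihood(mu_ml + 0.1, sigma_ml)
assert log_likelihood(mu_ml, sigma_ml) >= log_likelihood(mu_ml, sigma_ml - 0.1)
```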
Example 2: Normal Distribution
Maximum A Posteriori (MAP): MAP adds a prior over the hypotheses. In this example the prior distributions of \mu and \sigma are N(0,1), and they are now taken into account:
h_{MAP} = \arg\max_h Pr[h | D] = \arg\max_h \frac{Pr[D | h]\,Pr[h]}{Pr[D]}.
Since Pr[D] is the same for all hypotheses h, we can omit it and maximize
Pr[D | \mu, \sigma] \cdot Pr[\mu] \cdot Pr[\sigma].
Example 2: Normal Distribution
Maximum A Posteriori (MAP): How will the result we got in the ML approach change for MAP? We added the knowledge that \sigma and \mu are small and concentrated around zero, since the prior is \mu, \sigma ~ N(0,1). Therefore the resulting hypothesis for \sigma and \mu should be closer to 0 than the one we got in ML.
Example 2: Normal Distribution
Maximum A Posteriori (MAP): Now we maximize the log-posterior over \mu and \sigma simultaneously:
\ln Pr[D | \mu, \sigma] - \frac{\mu^2}{2} - \frac{\sigma^2}{2} + const.
For instance, setting the derivative with respect to \mu to zero gives
\sum_{i=1}^{m} \frac{z_i-\mu}{\sigma^2} - \mu = 0 \;\Rightarrow\; \mu_{MAP} = \frac{\sum_{i=1}^{m} z_i}{m + \sigma^2}.
It can easily be seen that \mu and \sigma will be closer to zero than in the ML approach, since the prior terms shrink the estimates towards 0 (the denominator above is larger than m).
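A minimal numerical sketch of this shrinkage effect, assuming Python, a synthetic sample, and \sigma held fixed at 1 (a simplification we make here; the lecture maximizes over \mu and \sigma jointly):

```python
import numpy as np

# Hypothetical small sample; sigma is held fixed to keep the algebra simple.
rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=1.0, size=20)
sigma = 1.0
m = len(z)

mu_ml = z.mean()                      # ML estimate: sum(z_i) / m
mu_map = z.sum() / (m + sigma ** 2)   # MAP estimate under the N(0,1) prior on mu

def log_posterior(mu):
    # ln Pr[D | mu, sigma] + ln Pr[mu], with mu-independent constants dropped
    return -np.sum((z - mu) ** 2) / (2 * sigma ** 2) - mu ** 2 / 2

print(mu_ml, mu_map)                  # |mu_map| < |mu_ml|: shrunk towards 0
assert log_posterior(mu_map) >= log_posterior(mu_ml)
```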
Example 2: Normal Distribution
Posterior (Bayes): Assume \mu ~ N(\eta, 1), Z ~ N(\mu, 1) (i.e., \sigma = 1), and we see only one sample z of Z. What is the new posterior distribution of \mu? Since Pr[Z = z] is a normalizing factor, we can drop it in the calculation:
Pr[\mu | z] \propto Pr[z | \mu] \cdot Pr[\mu] \propto \exp\left(-\frac{(z-\mu)^2}{2}\right) \exp\left(-\frac{(\mu-\eta)^2}{2}\right).
Example 2: Normal Distribution
Posterior (Bayes): Completing the square in the exponent,
\frac{(z-\mu)^2}{2} + \frac{(\mu-\eta)^2}{2} = \left(\mu - \frac{z+\eta}{2}\right)^2 + \frac{(z-\eta)^2}{4},
where the second term does not depend on \mu and only contributes to the normalization factor. Hence
Pr[\mu | z] \propto \exp\left(-\left(\mu - \frac{z+\eta}{2}\right)^2\right).
Example 2: Normal Distribution
Posterior (Bayes): The new posterior distribution is therefore
\mu | z ~ N\left(\frac{z+\eta}{2}, \frac{1}{2}\right);
after taking into account the sample z, the mean of \mu moves towards z and the variance is reduced (from 1 to 1/2).
Example 2: Normal Distribution
Posterior (Bayes): In general, for a prior \mu ~ N(\eta, S^2) and samples Z_i ~ N(\mu, \sigma^2), given m samples z_1, ..., z_m the posterior of \mu is normal with
mean = \frac{\eta/S^2 + \sum_{i=1}^{m} z_i/\sigma^2}{1/S^2 + m/\sigma^2}, \quad variance = \frac{1}{1/S^2 + m/\sigma^2}.
Example 2: Normal Distribution
Posterior (Bayes): And if we assume S = \sigma, we get
mean = \frac{\eta + \sum_{i=1}^{m} z_i}{m+1}, \quad variance = \frac{\sigma^2}{m+1},
which is like starting with one additional sample of value \eta, i.e., the prior mean acts as an extra observation.
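The following sketch (ours, with hypothetical sample values and a function name we chose) implements the general update above and checks the "extra sample" interpretation for S = \sigma:

```python
import numpy as np

def posterior_normal_mean(z, eta, S, sigma):
    """Posterior of mu for a prior mu ~ N(eta, S^2) and samples z_i ~ N(mu, sigma^2).
    Returns (posterior mean, posterior variance)."""
    m = len(z)
    post_var = 1.0 / (1.0 / S ** 2 + m / sigma ** 2)
    post_mean = post_var * (eta / S ** 2 + np.sum(z) / sigma ** 2)
    return post_mean, post_var

z = np.array([1.2, 0.7, 1.9])                 # hypothetical observations
mean, var = posterior_normal_mean(z, eta=0.0, S=1.0, sigma=1.0)

# With S = sigma, the update behaves like one extra sample of value eta:
m = len(z)
assert np.isclose(mean, (0.0 + z.sum()) / (m + 1))
assert np.isclose(var, 1.0 ** 2 / (m + 1))
```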
Learning A Concept Family (1/2)
We are given a concept family H. Our information consists of examples S = {(x_i, b_i)}_{i=1}^{m}, where b_i = f(x_i) and f \in H is an unknown target function that classifies all samples.
Assumptions:
(1) The functions in H are deterministic, i.e., Pr[h(x) = 1] \in \{0, 1\}.
(2) The process that generates the inputs x_i is independent of the target function f.
For each h \in H we calculate Pr[S | h], where S = {(x_i, b_i) : 1 \le i \le m} and b_i = f(x_i).
Case 1: \exists i such that b_i \ne h(x_i): then Pr[x_i, b_i | h] = 0, so Pr[S | h] = 0.
Case 2: \forall i, b_i = h(x_i): then Pr[x_i, b_i | h] = Pr[x_i] \cdot Pr[b_i | h, x_i] = Pr[x_i], so
Pr[S | h] = \prod_{i=1}^{m} Pr[x_i].
Learning A Concept Family (2/2)
Definition: A consistent function h \in H classifies all the samples in S correctly, i.e., h(x_i) = b_i for every (x_i, b_i) \in S.
Let H' \subseteq H be the set of all functions that are consistent with S.
There are three methods to choose a predictor based on H':
ML - choose any consistent function; each one has the same likelihood Pr[S | h].
MAP - choose the consistent function with the highest prior probability.
Bayes - a prior-weighted combination of all consistent functions into one predictor:
B(y) = \frac{\sum_{h \in H'} Pr[h] \cdot h(y)}{Pr[H']}.
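To make the three rules concrete, here is a toy Python sketch. The hypothesis class (thresholds on {0, ..., 5}), the prior, and the sample are all invented for illustration and are not from the lecture.

```python
# H = threshold functions h_t(x) = 1 iff x >= t, for t in {0, ..., 5}.
def make_threshold(t):
    return lambda x: 1 if x >= t else 0

H = {t: make_threshold(t) for t in range(6)}          # hypothesis class
prior = {t: 2.0 ** (-t - 1) for t in range(6)}        # arbitrary prior over H

S = [(1, 0), (4, 1)]                                  # labelled sample (x, b)
consistent = [t for t in H if all(H[t](x) == b for x, b in S)]   # H' = {2, 3, 4}

ml_choice = consistent[0]                             # ML: any consistent hypothesis
map_choice = max(consistent, key=lambda t: prior[t])  # MAP: highest-prior consistent one

def bayes_predict(y):
    # Bayes: prior-weighted vote of all consistent hypotheses, normalized by Pr[H']
    total = sum(prior[t] for t in consistent)
    return sum(prior[t] * H[t](y) for t in consistent) / total

print(consistent, ml_choice, map_choice, bayes_predict(3))
```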
Example (Biased Coins)
We toss a coin n times and the coin comes up heads k times. We want to estimate the probability p that the coin will come up heads in the next toss. The probability that k out of n coin tosses come up heads is
Pr[(k, n) | p] = \binom{n}{k} p^k (1-p)^{n-k}.
With the Maximum Likelihood approach, one would choose the p that maximizes Pr[(k, n) | p], which is p = k/n. Yet this result seems unreasonable when n is small. (For example, if you toss the coin only once and get a tail, should you believe that it is impossible to get a head on the next toss?)
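A two-line illustration (ours, not from the lecture) of the ML estimate and its failure mode for a single toss:

```python
def p_ml(k, n):
    # ML estimate of the heads probability from k heads in n tosses
    return k / n

print(p_ml(7, 10))   # 0.7
print(p_ml(0, 1))    # 0.0: a single tail makes heads look impossible
```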
Laplace Rule (1/3)
Let us suppose a uniform prior distribution on p; that is, the prior density of p over all possible coins is uniform on [0, 1].
We calculate the probability of seeing k heads out of n tosses:
Pr[(k, n)] = \int_0^1 Pr[(k, n) | p]\,Pr[p]\,dp = \binom{n}{k} \int_0^1 x^k (1-x)^{n-k}\,dx.
Integration by parts gives
\int_0^1 x^k (1-x)^{n-k}\,dx = \left[\frac{x^{k+1}}{k+1}(1-x)^{n-k}\right]_0^1 + \frac{n-k}{k+1}\int_0^1 x^{k+1}(1-x)^{n-k-1}\,dx = \frac{n-k}{k+1}\int_0^1 x^{k+1}(1-x)^{n-k-1}\,dx,
since the boundary term vanishes. Because \binom{n}{k}\frac{n-k}{k+1} = \binom{n}{k+1}, we get
Pr[(k, n)] = \binom{n}{k+1}\int_0^1 x^{k+1}(1-x)^{n-(k+1)}\,dx = \int_0^1 Pr[(k+1, n) | p]\,Pr[p]\,dp = Pr[(k+1, n)].
Laplace Rule (2/3)
Comparing both ends of the above sequence of equalities, we realize that all these probabilities are equal, and since they sum to 1 over k = 0, ..., n, for any k
Pr[(k, n)] = \frac{1}{n+1}.
Intuitively, this means that for a random choice of the bias p, any possible number of heads in a sequence of n coin tosses is equally likely.
We want to calculate the posterior expectation E[p | s(k, n)], where s(k, n) is a specific sequence with k heads out of n tosses. We have
Pr[s(k, n) | p] = p^k (1-p)^{n-k},
Pr[s(k, n)] = \int_0^1 p^k (1-p)^{n-k}\,dp = \frac{1}{(n+1)\binom{n}{k}}.
Laplace Rule (3/3)
Hence,
E[p | s(k, n)] = \int_0^1 p \cdot \frac{Pr[s(k, n) | p]\,Pr[p]}{Pr[s(k, n)]}\,dp = \frac{\int_0^1 p^{k+1}(1-p)^{n-k}\,dp}{Pr[s(k, n)]} = \frac{1/\big((n+2)\binom{n+1}{k+1}\big)}{1/\big((n+1)\binom{n}{k}\big)} = \frac{k+1}{n+2}.
Intuitively, the Laplace correction \hat{p} = \frac{k+1}{n+2} is like adding two samples to the ML estimator, one with value 0 and one with value 1.
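A short sketch (ours) comparing the ML estimate k/n with the Laplace-corrected estimate (k+1)/(n+2) on a few hypothetical counts; the function names are our own.

```python
def p_ml(k, n):
    return k / n

def p_laplace(k, n):
    # posterior mean E[p | k heads out of n] under a uniform prior on p
    return (k + 1) / (n + 2)

for k, n in [(0, 1), (1, 1), (7, 10), (500, 1000)]:
    print(k, n, p_ml(k, n), p_laplace(k, n))
# For n = 1 the Laplace estimate stays away from the extremes (1/3 or 2/3),
# and as n grows it converges to the ML estimate k/n.
```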