Bayesian machine learning: a tutorial

Rémi Bardenet
CNRS & CRIStAL, Univ. Lille, France
Outline

The what: Typical statistical problems · Statistical decision theory · Posterior expected utility and Bayes rules
The why: The philosophical why · The practical why
The how: Conjugacy · Monte Carlo methods · Metropolis-Hastings · Variational approximations
In depth with Gaussian processes in ML: From linear regression to GPs · Modeling and learning · More applications
References and open issues
Typical jobs for statisticians

Estimation
◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
◮ You want an estimate θ̂(x_1, ..., x_n) of θ⋆ ∈ R^d.

Confidence regions
◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
◮ You want a region A(x_1, ..., x_n) ⊂ R^d and a statement that θ⋆ ∈ A(x_1, ..., x_n) with some certainty.
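A minimal numerical sketch of these two tasks (not from the slides; a toy i.i.d. Gaussian model with known variance and illustrative values are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: x_1, ..., x_n i.i.d. N(theta_star, sigma^2), with sigma known.
theta_star, sigma, n = 1.3, 2.0, 100
x = rng.normal(theta_star, sigma, size=n)

# Point estimate: the sample mean.
theta_hat = x.mean()

# 95% confidence interval, using the known standard deviation of the sample mean.
half_width = 1.96 * sigma / np.sqrt(n)
A = (theta_hat - half_width, theta_hat + half_width)

print(f"estimate = {theta_hat:.3f}, 95% CI = ({A[0]:.3f}, {A[1]:.3f})")
```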
Statistical decision theory¹

Figure: Abraham Wald (1902–1950)

¹ A. Wald. Statistical decision functions. Wiley, 1950.
Statistical decision theory

◮ Let Θ be the "states of the world", typically the space of parameters of interest.
◮ Decisions are functions d(x_1, ..., x_n) ∈ D.
◮ Let L(d, θ) denote the loss of making decision d when the state of the world is θ.
◮ Wald defines the risk of a decision as
    R(d, θ) = ∫ L(d(x_{1:n}), θ) p(x_{1:n} | θ) dx_{1:n}.
◮ Wald says d_1 is a better decision than d_2 if
    ∀ θ ∈ Θ,  R(d_1, θ) ≤ R(d_2, θ),   (1)
  with strict inequality for at least one θ.
◮ d is called admissible if there is no better decision than d.
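To make the risk concrete, here is a small Monte Carlo sketch (not from the slides; a Gaussian location model, squared-error loss, and two illustrative decision rules are assumed):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setting: x_1, ..., x_n i.i.d. N(theta, 1), squared-error loss, and two
# decision rules, the sample mean and the sample median.
def risk(decision, theta, n=20, n_mc=10_000):
    """Monte Carlo approximation of R(d, theta) = E_{x_{1:n} ~ p(.|theta)} L(d(x_{1:n}), theta)."""
    x = rng.normal(theta, 1.0, size=(n_mc, n))
    return np.mean((decision(x) - theta) ** 2)

sample_mean = lambda x: x.mean(axis=1)
sample_median = lambda x: np.median(x, axis=1)

for theta in [-2.0, 0.0, 3.0]:
    print(f"theta = {theta:+.1f}: risk(mean) = {risk(sample_mean, theta):.4f}, "
          f"risk(median) = {risk(sample_median, theta):.4f}")
```

Here the sample mean has risk about 1/n = 0.05 at every θ, while the median's is roughly π/(2n) ≈ 0.08, so the mean is the better decision in the sense of (1).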
Illustration with a simple estimation problem

◮ You have data x_1, ..., x_n that you assume drawn from
    p(x_1, ..., x_n | θ⋆) = ∏_{i=1}^n N(x_i | θ⋆, σ²),
  and you know σ².
◮ You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖².
◮ You restrict your decision space to unbiased estimators.
◮ The sample mean θ̃ := n⁻¹ ∑_{i=1}^n x_i is unbiased, and has minimum variance among unbiased estimators.
◮ Since R(θ̃, θ) = Var θ̃, θ̃ is the best decision you can make in Wald's framework.
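A quick check of this claim (illustrative values assumed): the Monte Carlo risk of the sample mean matches its variance σ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)

# For i.i.d. N(theta_star, sigma^2) data, the risk of the sample mean under
# squared-error loss equals its variance sigma^2 / n.
theta_star, sigma, n, n_mc = 0.5, 2.0, 50, 100_000
x = rng.normal(theta_star, sigma, size=(n_mc, n))
theta_tilde = x.mean(axis=1)

print("Monte Carlo risk:", np.mean((theta_tilde - theta_star) ** 2))
print("sigma^2 / n     :", sigma**2 / n)
```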
Wald's view of frequentist estimation

Estimation
◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
◮ You want an estimate θ̂(x_1, ..., x_n) of θ⋆ ∈ R^d.

A Waldian answer
◮ Our decisions are estimates d(x_1, ..., x_n) = θ̂(x_1, ..., x_n).
◮ We pick a loss, say L(d, θ) = L(θ̂, θ) = ‖θ̂ − θ‖².
◮ If you have a minimum-variance unbiased estimator, then it is the best decision among unbiased estimators.
◮ In general, the loss can be more complex and unbiased estimators may be unknown or irrelevant.
◮ In these cases, you may settle for a minimax estimator
    θ̂(x_1, ..., x_n) = arg min_d sup_θ R(d, θ),
  as sketched numerically below.
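A sketch of the minimax idea (an illustrative example not taken from the slides): estimating a Bernoulli proportion under squared-error loss, where a shrinkage estimator is known to have constant risk in θ and is in fact minimax.

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate a proportion theta from n coin flips under squared-error loss. The sample
# mean has risk theta*(1-theta)/n, while the shrinkage estimator below has constant
# risk in theta, and is known to be minimax for this problem.
n = 30

def sample_mean(x):
    return x.mean(axis=1)

def minimax_estimator(x):
    s = x.sum(axis=1)
    return (s + np.sqrt(n) / 2) / (n + np.sqrt(n))

def worst_case_risk(decision, thetas, n_mc=50_000):
    risks = []
    for theta in thetas:
        x = rng.binomial(1, theta, size=(n_mc, n))
        risks.append(np.mean((decision(x) - theta) ** 2))
    return max(risks)

thetas = np.linspace(0.01, 0.99, 25)
print("sup-risk, sample mean        :", worst_case_risk(sample_mean, thetas))
print("sup-risk, shrinkage estimator:", worst_case_risk(minimax_estimator, thetas))
```

The shrinkage estimator has the smaller worst-case risk, even though the sample mean beats it for θ near 0 or 1.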
Wald's is only one view of frequentist statistics...

◮ On estimation, some would argue in favour of the maximum likelihood estimator².

Figure: Ronald Fisher (1890–1962)

² S. M. Stigler. "The epic story of maximum likelihood". In: Statistical Science (2007), pp. 598–620.
... but bear with me, since it is predominant in machine learning

For instance, supervised learning is usually formalized as finding
    g⋆ = arg min_g E L(y, g(x)),   (2)
which you approximate by
    ĝ = arg min_g ∑_{i=1}^n L(y_i, g(x_i)) + penalty(g),
while trying to control the excess risk
    E L(y, ĝ(x)) − E L(y, g⋆(x)).
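A minimal sketch of this penalized empirical risk minimization (assumed specifics, for illustration only: linear predictors g(x) = ⟨w, x⟩, squared loss, and a ridge penalty, i.e. ridge regression):

```python
import numpy as np

rng = np.random.default_rng(4)

# Penalized ERM with linear g, squared loss, and penalty(g) = lam * ||w||^2.
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

# Closed-form minimizer of sum_i (y_i - <w, x_i>)^2 + lam * ||w||^2.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("true w :", np.round(w_true, 2))
print("w_hat  :", np.round(w_hat, 2))
```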
Wald's view of frequentist confidence regions

Confidence regions
◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
◮ You want a region A(x_1, ..., x_n) ⊂ R^d and a statement that θ⋆ ∈ A(x_1, ..., x_n) with some certainty.

A Waldian answer
◮ Our decisions are subsets of R^d: d(x_{1:n}) = A(x_{1:n}).
◮ A common loss is L(d, θ) = L(A, θ) = 1_{θ ∉ A} + γ |A|.
◮ So you want to find A(x_{1:n}) that minimizes the risk
    R(A, θ⋆) = ∫ [1_{θ⋆ ∉ A(x_{1:n})} + γ |A(x_{1:n})|] p(x_{1:n} | θ⋆) dx_{1:n}.
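A numerical sketch of this trade-off (not from the slides; a Gaussian model, intervals centred at the sample mean, and illustrative values of γ and the width c are assumed):

```python
import numpy as np

rng = np.random.default_rng(5)

# For i.i.d. N(theta_star, sigma^2) data, consider intervals
#   A(x_{1:n}) = [mean(x) - c*sigma/sqrt(n), mean(x) + c*sigma/sqrt(n)]
# and estimate R(A, theta_star) = P(theta_star not in A) + gamma * E|A| by Monte Carlo.
theta_star, sigma, n, gamma, n_mc = 0.0, 1.0, 25, 0.5, 100_000
x = rng.normal(theta_star, sigma, size=(n_mc, n))
centre, half = x.mean(axis=1), sigma / np.sqrt(n)

for c in [1.0, 1.96, 3.0]:
    miss = np.mean(np.abs(centre - theta_star) > c * half)  # frequency of theta_star not in A
    length = 2 * c * half                                    # |A|, here deterministic
    print(f"c = {c:4.2f}: estimated risk = {miss + gamma * length:.3f}")
```

Wider intervals miss θ⋆ less often but pay more through the γ|A| term, so the risk balances coverage against size.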