Machine Learning - MT 2017
7. Bayesian Approach to Machine Learning
Christoph Haase
University of Oxford
October 23, 2017
Frequentist vs Bayesian Approaches

Different views on probability:
◮ Frequentists: the probability of an event represents its long-run frequency over a large number of repetitions of an experiment
◮ Bayesians: the probability of an event represents a degree of belief about the event

Different views on statistics:
◮ Frequentists: parameters are fixed; data are a repeatable random sample, and the underlying parameters remain constant at every repetition
◮ Bayesians: data are fixed; parameters are unknown and described probabilistically, and repetition adds knowledge about the parameters
Bayes’ Theorem

Recall the basic laws of probability:

p(B | A) · p(A) = p(A ∩ B) = p(A | B) · p(B)

Bayes’ theorem:

p(A | B) = p(B | A) · p(A) / p(B)

Viewing A as a proposition and B as evidence:
◮ p(A) is the prior, representing the initial belief about A
◮ p(A | B) is the posterior, representing the belief about A after learning about B
◮ For fixed B, the posterior is proportional to the prior times the likelihood:

p(A | B) ∝ p(B | A) · p(A)
Priors Matter

Suppose we have a test for a disease:
◮ the test is 95% effective, i.e., p(T | D) = 0.95
◮ the rate of false positives is 1%, i.e., p(T | D̄) = 0.01
◮ the disease occurs in 0.5% of the population, i.e., p(D) = 0.005

Suppose the test is positive; what is p(D | T)?

p(D | T) = p(T | D) · p(D) / p(T)
         = p(T | D) · p(D) / (p(T | D) · p(D) + p(T | D̄) · p(D̄))
         = 0.95 · 0.005 / (0.95 · 0.005 + 0.01 · 0.995)
         ≈ 0.32
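The calculation above can be reproduced in a few lines of Python; the numbers are the ones from the slide:

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
p_t_given_d = 0.95      # p(T | D): sensitivity
p_t_given_not_d = 0.01  # p(T | not D): false-positive rate
p_d = 0.005             # p(D): prevalence (the prior)

# Total probability of a positive test: p(T) = p(T|D)p(D) + p(T|not D)p(not D)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' theorem: p(D|T) = p(T|D) p(D) / p(T)
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 3))  # ≈ 0.323
```

Despite the test being 95% effective, the low prior p(D) = 0.005 keeps the posterior below one third.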
Bayesian Machine Learning

In the discriminative framework, we model the output y as a probability distribution given the input x and the parameters w, say p(y | w, x).

In Bayesian machine learning, we assume a prior on the parameters w, say p(w). This prior represents a "belief" about the model; the uncertainty in our knowledge is expressed mathematically as a probability distribution.

When observations D = ⟨(xᵢ, yᵢ) : i = 1, …, N⟩ are made, the belief about the parameters w is updated using Bayes’ rule. As before, the posterior distribution on w given the data D is

p(w | D) ∝ p(y | w, X) · p(w)
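As an illustration (not from the slides), this update can be carried out numerically on a grid for a one-parameter model; here a Bernoulli coin with unknown heads-probability θ stands in for the parameters w:

```python
import numpy as np

# Sketch of the Bayesian update p(theta | data) ∝ p(data | theta) * p(theta)
# on a grid, for a Bernoulli coin-toss model with unknown heads-probability.
theta = np.linspace(0.0, 1.0, 1001)   # grid over the parameter
prior = np.ones_like(theta)           # uniform prior
prior /= prior.sum()

posterior = prior.copy()
for outcome in [1, 0, 1]:             # observed tosses: H, T, H
    likelihood = theta if outcome == 1 else (1.0 - theta)
    posterior = posterior * likelihood
    posterior /= posterior.sum()      # renormalise after each observation

# Posterior mean after 2 heads, 1 tail; analytically Beta(3, 2) has mean 3/5
print(round(float((theta * posterior).sum()), 3))  # ≈ 0.6
```

Each observation sharpens the posterior, which is the sense in which repetition adds knowledge about the parameters.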
Coin Toss Example

Let us consider the Bernoulli model for a coin toss: for θ ∈ [0, 1],

p(H | θ) = θ

Suppose that after three independent coin tosses, you get T, T, T.

What is the maximum likelihood estimate for θ?

What is the posterior distribution over θ, assuming a uniform prior on θ?
What is the posterior distribution over θ, assuming a Beta(2, 2) prior on θ?
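One way to check the answers is the standard conjugate Beta–Bernoulli update (a standard fact, not stated on the slide): a Beta(a, b) prior combined with n_heads heads and n_tails tails gives a Beta(a + n_heads, b + n_tails) posterior, and the uniform prior is Beta(1, 1). A minimal sketch:

```python
# Conjugate Beta-Bernoulli update for the coin-toss example.
def beta_posterior(a, b, n_heads, n_tails):
    # Posterior density is proportional to theta^(a-1+n_heads) *
    # (1-theta)^(b-1+n_tails), i.e. Beta(a + n_heads, b + n_tails).
    return a + n_heads, b + n_tails

# Three tosses, all tails: the maximum likelihood estimate is 0/3 = 0.
n_heads, n_tails = 0, 3

print(beta_posterior(1, 1, n_heads, n_tails))  # uniform prior   -> (1, 4)
print(beta_posterior(2, 2, n_heads, n_tails))  # Beta(2,2) prior -> (2, 5)
```

Note how the Beta(2, 2) prior pulls the posterior away from the degenerate MLE of θ = 0.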
Least Squares and MLE (Gaussian Noise)

Least squares objective function:

L(w) = ∑ᵢ₌₁ᴺ (yᵢ − w · xᵢ)²

Likelihood under Gaussian noise:

p(y | X, w) = 1 / (2πσ²)^(N/2) · exp( − ∑ᵢ₌₁ᴺ (yᵢ − w · xᵢ)² / (2σ²) )

For estimating w, the negative log-likelihood under Gaussian noise has the same form as the least squares objective.

Alternatively, we can model the data (only the yᵢ's) as being generated from a distribution defined by exponentiating the negative of the objective function.
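A quick numerical sanity check (with made-up data and an assumed noise variance σ²) that the negative log-likelihood and the least-squares objective differ only by an affine transformation, and hence share the same minimiser:

```python
import numpy as np

# Synthetic linear data; the true weights and noise level are assumptions
# for this sketch, not values from the slides.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
sigma2 = 0.25  # assumed noise variance

def sse(w):
    # Least squares objective: sum of squared residuals
    r = y - X @ w
    return float(r @ r)

def neg_log_lik(w):
    # Gaussian negative log-likelihood = constant + SSE(w) / (2 sigma^2)
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + sse(w) / (2 * sigma2)

w1, w2 = np.zeros(3), np.ones(3)
# Differences in NLL are exactly the scaled differences in SSE:
lhs = neg_log_lik(w1) - neg_log_lik(w2)
rhs = (sse(w1) - sse(w2)) / (2 * sigma2)
print(abs(lhs - rhs) < 1e-9)  # True
```

Since the constant and the positive scale factor do not depend on w, minimising the NLL and minimising the SSE yield the same estimate.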
What Data Model Produces the Ridge Objective?

We have the ridge regression objective; let D = ⟨(xᵢ, yᵢ) : i = 1, …, N⟩ denote the data:

L_ridge(w; D) = (y − Xw)ᵀ(y − Xw) + λ wᵀw

Let’s rewrite this objective slightly, scaling by 1/(2σ²) and setting λ = σ²/τ². To avoid ambiguity, we’ll denote this by L̃_ridge:

L̃_ridge(w; D) = (1/(2σ²)) (y − Xw)ᵀ(y − Xw) + (1/(2τ²)) wᵀw

Let Σ = σ² I_N and Λ = τ² I_D, where I_m denotes the m × m identity matrix:

L̃_ridge(w; D) = ½ (y − Xw)ᵀ Σ⁻¹ (y − Xw) + ½ wᵀ Λ⁻¹ w

Taking the negation of L̃_ridge(w; D) and exponentiating gives us a non-negative function of w and D which, after normalisation, gives a density function:

f(w; D) = exp( −½ (y − Xw)ᵀ Σ⁻¹ (y − Xw) ) · exp( −½ wᵀ Λ⁻¹ w )
Bayesian Linear Regression (and connections to Ridge)

Let’s start with the form of the density function we had on the previous slide and factor it:

f(w; D) = exp( −½ (y − Xw)ᵀ Σ⁻¹ (y − Xw) ) · exp( −½ wᵀ Λ⁻¹ w )

We’ll treat σ as fixed and not as a parameter. Up to a constant factor (which doesn’t matter when optimising with respect to w), we can rewrite this as

p(w | X, y) ∝ N(y | Xw, Σ) · N(w | 0, Λ)
 (posterior ∝ likelihood × prior)

where N(· | µ, Σ) denotes the density of the multivariate normal distribution with mean µ and covariance matrix Σ.

◮ What the ridge objective is actually finding is the maximum a posteriori (MAP) estimate, which is a mode of the posterior distribution
◮ The linear model is as described before, with Gaussian noise
◮ The prior distribution on w is assumed to be a spherical Gaussian
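This correspondence can be checked numerically: with λ = σ²/τ², the ridge closed-form solution coincides with the mode (here also the mean) of the Gaussian posterior. A sketch with synthetic data and assumed values for σ² and τ²:

```python
import numpy as np

# Synthetic regression data (an assumption of this sketch).
rng = np.random.default_rng(1)
N, D = 40, 4
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.3 * rng.normal(size=N)

sigma2, tau2 = 0.09, 1.0  # assumed noise and prior variances
lam = sigma2 / tau2       # the corresponding ridge penalty

# Ridge closed form: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# MAP estimate of the Gaussian posterior N(y | Xw, Sigma) N(w | 0, Lambda):
# mode = (X^T Sigma^{-1} X + Lambda^{-1})^{-1} X^T Sigma^{-1} y
Sigma_inv = np.eye(N) / sigma2
Lambda_inv = np.eye(D) / tau2
w_map = np.linalg.solve(X.T @ Sigma_inv @ X + Lambda_inv, X.T @ Sigma_inv @ y)

print(np.allclose(w_ridge, w_map))  # True
```

Multiplying the MAP normal equations through by σ² recovers exactly the ridge normal equations, which is why the two solutions agree.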