Case Study: Bayesian Linear Regression and Sparse Bayesian Models
Piyush Rai
Dept. of CSE, IIT Kanpur
(Mini-course: lecture 2)
Nov 05, 2015
Recap
Maximum Likelihood Estimation (MLE)

We wish to estimate parameters θ from observed data {x_1, ..., x_N}

MLE does this by finding θ that maximizes the (log-)likelihood p(X | θ):

\hat{\theta} = \arg\max_{\theta} \log p(X \mid \theta) = \arg\max_{\theta} \log \prod_{n=1}^{N} p(x_n \mid \theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n \mid \theta)

MLE now reduces to solving an optimization problem w.r.t. θ
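(Not from the slides.) As a small numerical sketch of the MLE idea, consider a univariate Gaussian likelihood, for which the arg max is available in closed form; the data values and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical observations x_1, ..., x_N (made up for illustration)
x = np.array([2.1, 1.9, 2.5, 2.3, 1.8])

# For a Gaussian likelihood N(x_n | mu, sigma^2), maximizing
# sum_n log p(x_n | theta) over theta = (mu, sigma^2) gives, in closed form:
mu_mle = x.mean()                        # maximizer over mu
sigma2_mle = ((x - mu_mle) ** 2).mean()  # maximizer over sigma^2

print(mu_mle, sigma2_mle)
```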
Maximum-a-Posteriori (MAP) Estimation

Incorporates prior knowledge p(θ) about the parameters

MAP estimation finds θ that maximizes the posterior p(θ | X) ∝ p(X | θ) p(θ):

\hat{\theta} = \arg\max_{\theta} \log \prod_{n=1}^{N} p(x_n \mid \theta)\, p(\theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n \mid \theta) + \log p(\theta)

MAP now reduces to solving an optimization problem w.r.t. θ

The objective function is very similar to MLE's, except for the extra log p(θ) term

In some sense, MAP is just a "regularized" MLE
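(Not from the slides.) To see the "regularized MLE" view concretely, here is a sketch for the same hypothetical Gaussian-mean example with a zero-mean Gaussian prior on μ; the variance values are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical data; known noise variance and prior variance (assumed values)
x = np.array([2.1, 1.9, 2.5, 2.3, 1.8])
sigma2 = 1.0   # likelihood: x_n ~ N(mu, sigma2)
tau2 = 0.5     # prior:      mu  ~ N(0, tau2)

N = x.size
mu_mle = x.mean()
# Maximizing sum_n log p(x_n | mu) + log p(mu) in closed form:
mu_map = tau2 * x.sum() / (N * tau2 + sigma2)

# The extra log p(mu) term shrinks the estimate toward the prior mean 0
print(mu_mle, mu_map)
```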
Bayesian Learning

Both MLE and MAP only give a point estimate (a single best answer) for θ

How can we capture/quantify the uncertainty in θ? We need to infer the full posterior distribution:

p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta} \propto \text{Likelihood} \times \text{Prior}

This requires doing "fully Bayesian" inference

Inference is sometimes easy and sometimes a (very) hard problem

Conjugate priors often make life easy when doing inference
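(Not from the slides.) Continuing the same hypothetical Gaussian-mean example: because a Gaussian prior is conjugate to the Gaussian likelihood, the full posterior p(μ | X) is again Gaussian and available in closed form, which illustrates how conjugacy makes inference easy.

```python
import numpy as np

x = np.array([2.1, 1.9, 2.5, 2.3, 1.8])
sigma2, tau2 = 1.0, 0.5   # assumed likelihood and prior variances
N = x.size

# Conjugate Gaussian prior mu ~ N(0, tau2)  =>  Gaussian posterior
post_var = 1.0 / (1.0 / tau2 + N / sigma2)   # posterior variance
post_mean = post_var * (x.sum() / sigma2)    # posterior mean (= MAP estimate here)

# Unlike MLE/MAP we get a full distribution, i.e., a measure of uncertainty in mu
print(post_mean, post_var)
```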
Warm-up: Least Squares Regression

Training data: {x_n, y_n}_{n=1}^N. The response is a noisy function of the input: y_n = f(x_n, w) + ε_n

Assume a data representation φ(x_n) = [φ_1(x_n), ..., φ_M(x_n)] ∈ R^M

Denote y = [y_1, ..., y_N]^⊤ ∈ R^N and Φ = [φ(x_1), ..., φ(x_N)]^⊤ ∈ R^{N×M}

Assume a linear (in the parameters) function: f(x_n, w) = w^⊤ φ(x_n)

Sum-of-squared-errors function:

E(w) = \frac{1}{2} \sum_{n=1}^{N} (f(x_n, w) - y_n)^2

Classical solution: \hat{w} = \arg\min_w E(w) = (\Phi^\top \Phi)^{-1} \Phi^\top y

Classification: replace the least-squares loss by some other loss (e.g., logistic)
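(Not from the slides.) A minimal sketch of the classical least-squares solution on made-up data, using a polynomial feature map φ(x) = [1, x, x²]; the data and the feature choice are assumptions for illustration.

```python
import numpy as np

# Hypothetical 1-D inputs/responses and a polynomial feature map
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.8, 3.2, 5.1, 7.9])
Phi = np.column_stack([np.ones_like(x), x, x**2])   # N x M design matrix

# Classical solution w_hat = (Phi^T Phi)^{-1} Phi^T y
# (solving the linear system is preferred to forming the inverse explicitly)
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

y_pred = Phi @ w_hat   # fitted responses
print(w_hat)
```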
Regularization

Want functions that are "simple" (and hence "generalize" to future data)

How: penalize "complex" functions. Use a regularized loss function:

\tilde{E}(w) = E(w) + \lambda\, \Omega(w)

Ω(w): a measure of how complex w is (we want it to be small)

The regularization parameter λ trades off data fit vs. model simplicity

For Ω(w) = ||w||^2, the solution is \hat{w} = \arg\min_w \tilde{E}(w) = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y
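(Not from the slides.) The same sketch with the L2 regularizer Ω(w) = ||w||², i.e., ridge regression; the value of λ is an assumed choice for illustration.

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.8, 3.2, 5.1, 7.9])
Phi = np.column_stack([np.ones_like(x), x, x**2])
lam = 0.1   # regularization parameter lambda (assumed value)

# Regularized solution w_hat = (Phi^T Phi + lambda * I)^{-1} Phi^T y
M = Phi.shape[1]
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Larger lam shrinks the weights toward zero (a "simpler" function)
print(w_ridge)
```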
A Probabilistic Framework for Regression

Recall: y_n = f(x_n, w) + ε_n

Assume a zero-mean Gaussian error: p(ε | σ^2) = N(ε | 0, σ^2)

This leads to a Gaussian likelihood model p(y_n | x_n, w) = N(y_n | f(x_n, w), σ^2):

p(y_n \mid x_n, w) = \left(\frac{1}{2\pi\sigma^2}\right)^{1/2} \exp\left(-\frac{1}{2\sigma^2} (f(x_n, w) - y_n)^2\right)

Joint probability of the data (likelihood):

L(w) = \prod_{n=1}^{N} p(y_n \mid x_n, w) = \left(\frac{1}{2\pi\sigma^2}\right)^{N/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (f(x_n, w) - y_n)^2\right)
A Probabilistic Framework for Regression

Let's look at the negative log-likelihood:

-\log L(w) = \frac{N}{2} \log \sigma^2 + \frac{N}{2} \log 2\pi + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (f(x_n, w) - y_n)^2

Minimizing w.r.t. w leads to the same answer as the unregularized case:

\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y

We also get an estimate of the error variance:

\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (f(x_n, \hat{w}) - y_n)^2
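(Not from the slides.) A sketch of the two MLE quantities on the same made-up data: the weight estimate coincides with unregularized least squares, and the noise-variance estimate comes from the residuals.

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.8, 3.2, 5.1, 7.9])
Phi = np.column_stack([np.ones_like(x), x, x**2])
N = y.size

w_mle = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # same as least squares
sigma2_mle = np.mean((Phi @ w_mle - y) ** 2)      # MLE of the error variance

print(w_mle, sigma2_mle)
```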
Specifying a Prior and Computing the Posterior

Let's assume a Gaussian prior on the weight vector w = [w_1, ..., w_M]:

p(w \mid \alpha) = \prod_{m=1}^{M} p(w_m \mid \alpha) = \left(\frac{\alpha}{2\pi}\right)^{M/2} \exp\left(-\frac{\alpha}{2} \sum_{m=1}^{M} w_m^2\right)
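(Not from the slides; the slide above only states the prior.) A minimal sketch of the posterior over w that this prior leads to, using the standard closed-form result for a Gaussian likelihood (noise variance σ²) combined with the isotropic Gaussian prior p(w | α); the data and hyperparameter values are assumptions for illustration.

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.8, 3.2, 5.1, 7.9])
Phi = np.column_stack([np.ones_like(x), x, x**2])
alpha, sigma2 = 1.0, 0.25   # assumed prior precision and noise variance
M = Phi.shape[1]

# Standard Gaussian-Gaussian result:
#   p(w | y) = N(w | mu_N, Sigma_N)
#   Sigma_N  = (alpha * I + Phi^T Phi / sigma^2)^{-1}
#   mu_N     = Sigma_N Phi^T y / sigma^2
Sigma_N = np.linalg.inv(alpha * np.eye(M) + Phi.T @ Phi / sigma2)
mu_N = Sigma_N @ Phi.T @ y / sigma2

print(mu_N)      # posterior mean of the weights
print(Sigma_N)   # posterior covariance (uncertainty in w)
```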