Bayesian Interpretations of Regularization
Charlie Frogner, 9.520 Class 17, April 6, 2011


  1. Bayesian Interpretations of Regularization. Charlie Frogner. 9.520 Class 17, April 6, 2011.

  2. The Plan. Regularized least squares maps {(x_i, y_i)}_{i=1}^n to a function that minimizes the regularized loss:
     f_S = arg min_{f ∈ H} (1/2) Σ_{i=1}^n (y_i − f(x_i))^2 + (λ/2) ||f||_H^2
     Can we interpret RLS from a probabilistic point of view?
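A minimal numerical sketch of this map, assuming the representer-theorem form f_S(x) = Σ_j c_j K(x, x_j) (covered elsewhere in the course); under the (1/2, λ/2) scaling above, the coefficient vector is c = (K + λI)^{-1} Y. The Gaussian kernel, the data, and λ are made up for illustration.

```python
# Sketch of regularized least squares under the representer-theorem form
# f_S(x) = sum_j c_j K(x, x_j), where c = (K + lambda I)^{-1} y for the
# objective (1/2)||y - Kc||^2 + (lambda/2) c^T K c. Kernel, data, and
# lambda are made up for illustration.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise squared distances between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1
K = gaussian_kernel(x, x)
c = np.linalg.solve(K + lam * np.eye(len(x)), y)   # RLS coefficients

x_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = gaussian_kernel(x_test, x) @ c            # f_S evaluated at new points
print(f_test)
```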

  3. Some notation.
     Training set: S = {(x_1, y_1), ..., (x_n, y_n)}.
     Inputs: X = {x_1, ..., x_n}.
     Labels: Y = {y_1, ..., y_n}.
     Parameters: θ ∈ R^p.
     p(Y | X, θ) is the joint distribution over labels Y, given inputs X and the parameters.

  4. Where do probabilities show up?
     (1/2) Σ_{i=1}^n V(y_i, f(x_i)) + (λ/2) ||f||_H^2   becomes   p(Y | f, X) · p(f)
     Likelihood, a.k.a. noise model: p(Y | f, X).
       Gaussian: y_i ∼ N(f*(x_i), σ^2)
       Poisson: y_i ∼ Pois(f*(x_i))
     Prior: p(f).
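For concreteness, a small sketch of drawing labels under the two noise models named above; the "true" function f* and the inputs are made up for illustration.

```python
# Draw labels y_i given inputs x_i under the two noise models on this slide.
# f_star is a made-up positive function so the Poisson rate is valid.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 10)
f_star = np.exp(x)                                 # hypothetical f*(x_i)

sigma = 0.5
y_gauss = rng.normal(loc=f_star, scale=sigma)      # y_i ~ N(f*(x_i), sigma^2)
y_pois = rng.poisson(lam=f_star)                   # y_i ~ Pois(f*(x_i))
print(y_gauss)
print(y_pois)
```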

  6. Estimation.
     The estimation problem: given data {(x_i, y_i)}_{i=1}^n and a model p(Y | f, X), p(f), find a good f to explain the data.

  7. The Plan.
     Maximum likelihood estimation for ERM.
     MAP estimation for linear RLS.
     MAP estimation for kernel RLS.
     Transductive model.
     Infinite dimensions get more complicated.

  8. Maximum likelihood estimation.
     Given data {(x_i, y_i)}_{i=1}^n and a model p(Y | f, X), p(f):
     a good f is one that maximizes p(Y | f, X).

  9. Maximum likelihood and least squares.
     For least squares, the noise model is
       y_i | f, x_i ∼ N(f(x_i), σ^2),   a.k.a.   Y | f, X ∼ N(f(X), σ^2 I).
     So
       p(Y | f, X) = (2πσ^2)^{-n/2} exp( -(1/(2σ^2)) Σ_{i=1}^n (y_i − f(x_i))^2 ).

  11. Maximum likelihood and least squares.
      Maximum likelihood: maximize
        p(Y | f, X) = (2πσ^2)^{-n/2} exp( -(1/(2σ^2)) Σ_{i=1}^n (y_i − f(x_i))^2 ).
      Empirical risk minimization: minimize
        Σ_{i=1}^n (y_i − f(x_i))^2.

  12. ... the empirical risk minimizer solves
        min_f Σ_{i=1}^n (y_i − f(x_i))^2 ...

  13. ... which picks out the same f as maximizing
        exp( -(1/(2σ^2)) Σ_{i=1}^n (y_i − f(x_i))^2 ),
      since the exponential is monotone.
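A quick numerical check of the equivalence on slides 11–13: for a made-up one-parameter linear model f(x) = θx, the θ that minimizes the empirical risk is exactly the θ that maximizes the Gaussian likelihood, because the log-likelihood is a decreasing affine function of the squared error.

```python
# Check that maximizing the Gaussian likelihood and minimizing the empirical
# squared risk pick out the same parameter. One-parameter linear model swept
# over a grid; the data and sigma are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = 2.0 * x + 0.3 * rng.standard_normal(50)
sigma2 = 0.3 ** 2

thetas = np.linspace(0, 4, 401)
sq_err = np.array([np.sum((y - t * x) ** 2) for t in thetas])
log_lik = -0.5 * len(y) * np.log(2 * np.pi * sigma2) - sq_err / (2 * sigma2)

# The log-likelihood is a decreasing affine function of the squared error,
# so the same grid point wins under both criteria.
assert np.argmin(sq_err) == np.argmax(log_lik)
print(thetas[np.argmin(sq_err)])
```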

  14. What about regularization?
      RLS:
        arg min_f (1/2) Σ_{i=1}^n (y_i − f(x_i))^2 + (λ/2) ||f||_H^2
      Is there a model of Y and f that yields RLS? Yes:
        exp( -(1/(2σ_ε^2)) Σ_{i=1}^n (y_i − f(x_i))^2 − (λ/2) ||f||_H^2 )  ∝  p(Y | f, X) · p(f)

  16. What about regularization?
      RLS:
        arg min_f (1/2) Σ_{i=1}^n (y_i − f(x_i))^2 + (λ/2) ||f||_H^2
      Is there a model of Y and f that yields RLS? Yes:
        exp( -(1/(2σ_ε^2)) Σ_{i=1}^n (y_i − f(x_i))^2 ) · exp( -(λ/2) ||f||_H^2 )  ∝  p(Y | f, X) · p(f)
      where the first factor plays the role of the likelihood p(Y | f, X) and the second the prior p(f).
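To see how this factorization lines up with the RLS objective: taking −log of the product above gives (1/(2σ_ε^2)) Σ_i (y_i − f(x_i))^2 + (λ/2) ||f||_H^2 plus a constant, which has the same minimizer as the RLS objective with regularization parameter σ_ε^2 λ. A sketch of this check on a made-up one-parameter linear model:

```python
# The negative log of the product on this slide is
#   (1/(2*sigma_eps^2)) * sum_i (y_i - f(x_i))^2 + (lam/2) * ||f||^2 + const,
# which shares its minimizer with the RLS objective whose regularization
# parameter is sigma_eps^2 * lam. Checked on a one-parameter linear model;
# all numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(40)
y = 1.5 * x + 0.4 * rng.standard_normal(40)
sigma_eps2, lam = 0.4 ** 2, 2.0

thetas = np.linspace(-1, 4, 2001)
neg_log_post = np.array([np.sum((y - t * x) ** 2) / (2 * sigma_eps2) + lam / 2 * t ** 2
                         for t in thetas])
rls_obj = np.array([0.5 * np.sum((y - t * x) ** 2) + (sigma_eps2 * lam) / 2 * t ** 2
                    for t in thetas])

# rls_obj is sigma_eps2 times neg_log_post, so the minimizers coincide.
assert np.argmin(neg_log_post) == np.argmin(rls_obj)
print(thetas[np.argmin(neg_log_post)])
```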

  18. Posterior function estimates.
      Given data {(x_i, y_i)}_{i=1}^n and a model p(Y | f, X), p(f), find a good f to explain the data.
      If we can get the posterior p(f | Y, X):
      Bayes least squares estimate: f̂_BLS = E_{f | X, Y}[f], i.e. the mean of the posterior.
      MAP estimate: f̂_MAP(Y | X) = arg max_f p(f | X, Y), i.e. a mode of the posterior.

  21. A posterior on functions?
      How to find p(f | Y, X)? Bayes' rule:
        p(f | X, Y) = p(Y | X, f) · p(f) / p(Y | X) = p(Y | X, f) · p(f) / ∫ p(Y | X, f) dp(f)
      When is this well-defined?

  23. A posterior on functions?
      Functions vs. parameters: H ≅ R^p.
      Represent functions in H by their coordinates w.r.t. a basis: f ∈ H ↔ θ ∈ R^p.
      Assume (for the moment): p < ∞.

  25. A posterior on functions?
      Mercer's theorem:
        K(x_i, x_j) = Σ_k ν_k ψ_k(x_i) ψ_k(x_j),
      where ν_k ψ_k(·) = ∫ K(·, y) ψ_k(y) dy for all k.
      The functions {√ν_k ψ_k(·)} form an orthonormal basis for H_K.
      Let φ(·) = [√ν_1 ψ_1(·), ..., √ν_p ψ_p(·)]. Then:
        H_K = { φ(·) θ | θ ∈ R^p }.
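A finite-sample illustration (not the theorem itself, which concerns the integral operator): eigendecomposing the kernel matrix on a set of points gives an empirical analogue of the expansion above, with features φ_k = √ν_k ψ_k reproducing K as an inner product. The Gaussian kernel and the points are made up.

```python
# Empirical analogue of Mercer's theorem: K = Psi diag(nu) Psi^T, and the
# features Phi = Psi * sqrt(nu) satisfy K[i, j] = <Phi[i], Phi[j]>.
# Gaussian kernel and points are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(8, 1))
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)

nu, Psi = np.linalg.eigh(K)                  # eigenvalues nu_k, orthonormal columns psi_k
Phi = Psi * np.sqrt(np.clip(nu, 0, None))    # column k is sqrt(nu_k) * psi_k

assert np.allclose(Phi @ Phi.T, K)           # K(x_i, x_j) = sum_k nu_k psi_k(x_i) psi_k(x_j)
```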

  26. Prior on infinite-dimensional space.
      Problem: there is no such thing as θ ∼ N(0, I) when θ ∈ R^∞!

  27. Posterior for linear RLS.
      Linear function: f(x) = ⟨x, θ⟩.
      Noise model: Y | X, θ ∼ N(Xθ, σ_ε^2 I).
      Add a prior: θ ∼ N(0, I).

  28. Posterior for linear RLS.
      Model: Y | X, θ ∼ N(Xθ, σ_ε^2 I), θ ∼ N(0, I).
      Joint over Y and θ:
        (Y, θ) ∼ N( (0, 0), [[ XX^T + σ_ε^2 I, X ], [ X^T, I ]] )
      Condition on Y.

  29. Posterior for linear RLS.
      Posterior: θ | X, Y ∼ N(μ_{θ|X,Y}, Σ_{θ|X,Y}), where
        μ_{θ|X,Y} = X^T (XX^T + σ_ε^2 I)^{-1} Y
        Σ_{θ|X,Y} = I − X^T (XX^T + σ_ε^2 I)^{-1} X
      This is Gaussian, so
        θ̂_MAP(Y | X) = θ̂_BLS(Y | X) = X^T (XX^T + σ_ε^2 I)^{-1} Y.
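A sketch checking these expressions against the equivalent "primal" forms obtained by completing the square in θ, namely μ = (X^T X + σ_ε^2 I)^{-1} X^T Y and Σ = σ_ε^2 (X^T X + σ_ε^2 I)^{-1}; the two agree by the Woodbury/push-through identities. X, Y, and σ_ε below are made up.

```python
# Verify the posterior mean and covariance on this slide against the
# equivalent primal forms:
#   mu    = (X^T X + sigma_eps^2 I)^{-1} X^T Y
#   Sigma = sigma_eps^2 (X^T X + sigma_eps^2 I)^{-1}
# Agreement follows from the Woodbury identity. Data are made up.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))
sigma_eps2 = 0.25
Y = X @ rng.standard_normal(p) + np.sqrt(sigma_eps2) * rng.standard_normal(n)

A_dual = np.linalg.inv(X @ X.T + sigma_eps2 * np.eye(n))     # (XX^T + s^2 I)^{-1}
mu_dual = X.T @ A_dual @ Y
Sigma_dual = np.eye(p) - X.T @ A_dual @ X

A_primal = np.linalg.inv(X.T @ X + sigma_eps2 * np.eye(p))   # (X^TX + s^2 I)^{-1}
mu_primal = A_primal @ X.T @ Y
Sigma_primal = sigma_eps2 * A_primal

assert np.allclose(mu_dual, mu_primal)
assert np.allclose(Sigma_dual, Sigma_primal)
```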

  31. Linear RLS as a MAP estimator.
      Model: Y | X, θ ∼ N(Xθ, σ_ε^2 I), θ ∼ N(0, I).
        θ̂_MAP(Y | X) = X^T (XX^T + σ_ε^2 I)^{-1} Y
      Recall the linear RLS solution:
        θ̂_RLS(Y | X) = arg min_θ (1/2) Σ_{i=1}^n (y_i − ⟨x_i, θ⟩)^2 + (λ/2) ||θ||_2^2
                     = X^T (XX^T + λ I)^{-1} Y
      So what's λ?
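The identification the slide is pointing to is λ = σ_ε^2: with the unit Gaussian prior on θ, the MAP estimator is exactly the linear RLS (ridge) solution with that regularization parameter. A sketch checking this on made-up data, with the ridge solution computed in its primal form (X^T X + λ I)^{-1} X^T Y:

```python
# The MAP estimator on this slide coincides with the linear RLS (ridge)
# minimizer when lambda = sigma_eps^2. Data are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.standard_normal((n, p))
sigma_eps2 = 0.5
Y = X @ rng.standard_normal(p) + np.sqrt(sigma_eps2) * rng.standard_normal(n)

theta_map = X.T @ np.linalg.solve(X @ X.T + sigma_eps2 * np.eye(n), Y)

lam = sigma_eps2   # the identification suggested by the slide
theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

assert np.allclose(theta_map, theta_rls)
print(theta_map)
```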
