On the Properties of Variational Approximations in Statistical Learning

Pierre Alquier, UCD Dublin - Statistics Seminar - 29/10/15


  1. Title: On the Properties of Variational Approximations in Statistical Learning. Pierre Alquier, UCD Dublin - Statistics Seminar - 29/10/15.

  2. Learning vs. estimation. In many applications one would like to learn from a sample without being able to write the likelihood.

  3. Typical machine learning problem. Main ingredients:
     - observations (object, label): $(X_1, Y_1), (X_2, Y_2), \ldots$
       → either given once and for all (batch learning), one at a time (online learning), upon request... In this talk, $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d.
     - a restricted set of predictors $(f_\theta, \theta \in \Theta)$
       → $f_\theta(X)$ is meant to predict $Y$.
     - a criterion of success, $R(\theta)$
       → for example $R(\theta) = \mathbb{P}(f_\theta(X) \neq Y)$ (classification error). In this talk $R(\theta) = \mathbb{E}[\ell(Y, f_\theta(X))]$. We want to minimize $R(\theta)$, but note that it is unknown in practice.
     - an empirical proxy $r(\theta)$ for this criterion of success
       → here $r(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f_\theta(X_i))$; a minimal sketch of this computation follows the list.
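A minimal sketch of the empirical proxy $r(\theta)$, assuming linear classifiers $f_\theta(x) = \mathrm{sign}(\langle \theta, x \rangle)$ and the 0-1 loss; the data-generating rule and all names here are illustrative choices, not from the talk:

```python
import numpy as np

def empirical_risk(theta, X, Y, loss):
    """r(theta) = (1/n) * sum_i loss(Y_i, f_theta(X_i)),
    with f_theta(x) = sign(<theta, x>) as the (assumed) linear predictor."""
    preds = np.sign(X @ theta)                    # f_theta(X_i) for all i at once
    return float(np.mean([loss(y, p) for y, p in zip(Y, preds)]))

# 0-1 loss: recovers the classification error mentioned on the slide.
zero_one = lambda y, pred: float(y != pred)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                     # synthetic (X_i) in R^2
Y = np.sign(X @ np.array([1.0, -0.5]))            # labels from a hidden linear rule
print(empirical_risk(np.array([1.0, 0.0]), X, Y, zero_one))
```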

  4. Empirical risk minimization (ERM): $\hat{\theta}_n = \arg\min_{\theta \in \Theta} r(\theta)$.

     Theorem (Vapnik and Chervonenkis, in the 70's). Classification setting. Let $d_\Theta$ denote the VC-dimension of $\Theta$. Then
     $$\mathbb{P}\left[ R(\hat{\theta}_n) \le \inf_{\theta \in \Theta} R(\theta) + 4\sqrt{\frac{d_\Theta \log(n+1) + \log 2}{n}} + \sqrt{\frac{\log(2/\varepsilon)}{2n}} \right] \ge 1 - \varepsilon.$$
     Reference: Vapnik, V. (1998). Statistical Learning Theory, Springer.
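Since $r(\theta)$ is computable from the sample, $\hat{\theta}_n$ can be approximated by a plain search once $\Theta$ is discretised. A sketch under that discretisation (an illustrative device, not part of the theorem), reusing the linear classifiers from the previous block:

```python
import numpy as np

def erm(candidates, X, Y):
    """theta_hat_n = argmin of the empirical 0-1 risk r(theta) over a finite
    candidate set standing in for Theta."""
    risks = [np.mean(np.sign(X @ theta) != Y) for theta in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

# Candidate directions in R^2: a crude grid over half the circle.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
candidates = [np.array([np.cos(a), np.sin(a)]) for a in angles]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                  # n = 500, as on the next slides
Y = np.sign(X @ np.array([1.0, -0.5]))         # labels from a hidden linear rule
theta_hat, r_hat = erm(candidates, X, Y)
print(theta_hat, r_hat)                        # best direction and its r(theta)
```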

  5. ERM with linear classifiers. Here $d_\Theta = 3$ and $n = 500$. With probability at least 90%, $R(\hat{\theta}_n) \le \inf_{\theta \in \Theta} R(\theta) + 0.842$. With $n = 5000$ we would have $R(\hat{\theta}_n) \le \inf_{\theta \in \Theta} R(\theta) + 0.301$. These constants can be checked numerically, as in the sketch below.
     [Figure: linear classifiers in $\mathbb{R}^p$, for which $d_\Theta = p + 1$. Source: http://mlpy.sourceforge.net/]
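The two constants follow from plugging $d_\Theta = 3$ and $\varepsilon = 0.1$ into the excess-risk term of the theorem above; a short check (the function name is mine):

```python
import math

def vc_excess_risk(d, n, eps):
    """Excess-risk term of the VC bound:
    4*sqrt((d*log(n+1) + log 2)/n) + sqrt(log(2/eps)/(2n))."""
    return (4 * math.sqrt((d * math.log(n + 1) + math.log(2)) / n)
            + math.sqrt(math.log(2 / eps) / (2 * n)))

print(vc_excess_risk(3, 500, 0.10))    # ~0.842, matching the slide
print(vc_excess_risk(3, 5000, 0.10))   # ~0.31, close to the slide's 0.301
```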

  6. The PAC-Bayesian approach: origins. Idea: combine these tools with a prior $\pi$ on $\Theta$.
     Shawe-Taylor, J. & Williamson, R. C. (1997). A PAC Analysis of a Bayesian Estimator. COLT'97.
     McAllester, D. A. (1998). Some PAC-Bayesian Theorems. COLT'98.
     "A PAC performance guarantee theorem applies to a broad class of experimental settings. A Bayesian correctness theorem applies to only experimental settings consistent with the prior used in the algorithm. However, in this restricted class of settings the Bayesian learning algorithm can be optimal and will generally outperform PAC learning algorithms. (...) The PAC-Bayesian theorems and algorithms (...) attempt to get the best of both PAC and Bayesian approaches by combining the ability to be tuned with an informal prior with PAC guarantees that hold in all i.i.d. experimental settings."

  7. The PAC-Bayesian approach. EWA / pseudo-posterior / Gibbs estimator / ...:
     $$\hat{\rho}_\lambda(\mathrm{d}\theta) \propto \exp[-\lambda \, r(\theta)] \, \pi(\mathrm{d}\theta).$$
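The pseudo-posterior $\hat{\rho}_\lambda$ can be computed exactly when $\Theta$ is a finite grid, which makes the definition concrete. A sketch with made-up risks and a uniform prior (real applications use MCMC or the variational approximations that are the subject of the talk):

```python
import numpy as np

def gibbs_posterior(lam, risks, prior):
    """rho_hat_lambda(theta) proportional to exp(-lambda * r(theta)) * pi(theta),
    normalised exactly over a finite grid of parameter values."""
    log_w = -lam * np.asarray(risks, dtype=float) + np.log(np.asarray(prior, dtype=float))
    log_w -= log_w.max()               # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()                 # normalised pseudo-posterior weights

# Toy example: 5 candidate predictors, uniform prior, made-up empirical risks.
risks = [0.40, 0.25, 0.10, 0.30, 0.45]
prior = [0.2] * 5
print(gibbs_posterior(lam=100.0, risks=risks, prior=prior))
# Larger lambda concentrates the pseudo-posterior on the empirical minimiser.
```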
