  1. Bayesian Nonparametrics. Peter Orbanz, Columbia University

  2. Parameters and Patterns
     Parameters: P(X | θ) = Probability[ data | pattern ]
     [Figure: regression data, input x plotted against output y.]
     Inference idea: data = underlying pattern + independent randomness.
     Bayesian statistics tries to compute the posterior probability P[ pattern | data ].

  3. Nonparametric Models
     Parametric model
     ◮ Number of parameters fixed (or constantly bounded) w.r.t. sample size
     Nonparametric model
     ◮ Number of parameters grows with sample size
     ◮ ∞-dimensional parameter space
     Example: density estimation, parametric vs. nonparametric (see the sketch below).
     [Figure: a parametric density p(x) with mean µ vs. a nonparametric density estimate.]
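To make the contrast concrete, here is a minimal sketch (the data, bandwidth, and mixture components are illustrative assumptions, not from the slides) comparing a parametric fit, whose parameter count stays fixed, with a kernel density estimate, which places one kernel per observation so its effective number of parameters grows with the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data from a two-component mixture that a single Gaussian cannot capture.
data = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 0.5, 100)])

# Parametric: one Gaussian, 2 parameters regardless of sample size.
mu, sigma = data.mean(), data.std()
def p_parametric(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Nonparametric: kernel density estimate, one kernel per observation,
# so the number of "parameters" grows with the sample size.
def p_kde(x, bandwidth=0.3):
    diffs = (x[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(-5, 5, 200)
print(p_parametric(grid)[:3])
print(p_kde(grid)[:3])
```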

  4. Nonparametric Bayesian Model
     Definition: A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.
     Interpretation: The parameter space T is the set of possible patterns, for example:
       Problem              T
       Density estimation   Probability distributions
       Regression           Smooth functions
       Clustering           Partitions
     The solution to the Bayesian problem is the posterior distribution on patterns. [Sch95]

  5. (Nonparametric) Bayesian Statistics
     Task
     ◮ Define a prior distribution Q(Θ ∈ ·) and an observation model P[X ∈ · | Θ].
     ◮ Compute the posterior distribution Q[Θ ∈ · | X_1 = x_1, ..., X_n = x_n].
     Parametric case: Bayes' theorem
       Q(d\theta \mid x_1, \ldots, x_n) = \frac{\prod_{j=1}^{n} p(x_j \mid \theta)}{p(x_1, \ldots, x_n)} \, Q(d\theta)
     Condition: Q[· | X = x] ≪ Q for all x.
     Nonparametric case
     ◮ Bayes' theorem is (often) not applicable.
     ◮ The parameter space is not locally compact.
     ◮ Hence: no density representations.
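To make the parametric case concrete, here is a minimal sketch of Bayes' theorem in a conjugate Bernoulli model, where the density representation exists and the posterior is available in closed form. The data and the Beta(1, 1) prior are illustrative choices, not taken from the slides.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed Bernoulli data (toy example)
a0, b0 = 1.0, 1.0                         # Beta(1, 1) prior Q on the parameter theta

# Bayes' theorem with densities: posterior ∝ prod_j p(x_j | theta) * prior.
# For this conjugate pair the posterior is again a Beta distribution.
k, n = x.sum(), len(x)
a_post, b_post = a0 + k, b0 + (n - k)
print(f"posterior: Beta({a_post}, {b_post}),",
      "posterior mean of theta:", a_post / (a_post + b_post))
```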

  6. Exchangeability
     Can we justify our assumptions? Recall: data = pattern + noise.
     In Bayes' theorem:
       Q(d\theta \mid x_1, \ldots, x_n) = \frac{\prod_{j=1}^{n} p(x_j \mid \theta)}{p(x_1, \ldots, x_n)} \, Q(d\theta)
     de Finetti's theorem
       P(X_1 = x_1, X_2 = x_2, \ldots) = \int_{M(X)} \prod_{j=1}^{\infty} \theta(X_j = x_j) \, Q(d\theta)
       ⟺  X_1, X_2, ... exchangeable
     where
     ◮ M(X) is the set of probability measures on X
     ◮ θ are values of a random probability measure Θ with distribution Q
     [Sch95, Kal05]
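A small simulation sketch of the representation above, under the simplifying assumption that the random measure Θ is just a random success probability drawn from a Beta mixing distribution (an illustrative choice): drawing Θ first and then sampling i.i.d. from it produces an exchangeable sequence, so permuting the coordinates leaves the distribution unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

def exchangeable_sequence(n, a=2.0, b=2.0):
    theta = rng.beta(a, b)                  # Theta ~ Q (here a Beta mixing distribution)
    return rng.binomial(1, theta, size=n)   # X_1, ..., X_n | Theta  i.i.d.

# Exchangeability: permuting the sequence does not change its distribution,
# e.g. P(1,1,0) = P(1,0,1) = P(0,1,1) under the mixture.
samples = np.array([exchangeable_sequence(3) for _ in range(100_000)])
for pattern in [(1, 1, 0), (1, 0, 1), (0, 1, 1)]:
    print(pattern, np.mean((samples == pattern).all(axis=1)))
```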

  7. Examples

  8. Gaussian Processes
     Nonparametric regression
     Patterns = continuous functions, say on [a, b]:
       θ : [a, b] → R,   T = C([a, b], R)
     Hyperparameter: a kernel function, which controls the smoothness of Θ.
     [Figure: sample paths and posterior of a Gaussian process Θ(s) on [a, b].]
     Inference
     ◮ On data (sample size n): n × n kernel matrix
     ◮ The posterior is again a Gaussian process
     ◮ Posterior computation reduces to matrix computation (see the sketch below)
     [RW06]
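A minimal sketch of Gaussian process regression with a squared-exponential kernel, showing how the posterior reduces to linear algebra with the n × n kernel matrix. The lengthscale, noise level, and toy data are illustrative assumptions, not values from the slides.

```python
import numpy as np

def kernel(a, b, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel; the lengthscale controls smoothness of Theta.
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 20)                       # training inputs on [a, b]
y = np.sin(x) + 0.1 * rng.normal(size=20)        # noisy observations
s = np.linspace(-5, 5, 100)                      # test inputs

K = kernel(x, x) + 0.01 * np.eye(len(x))         # n x n kernel matrix + noise variance
Ks = kernel(s, x)
alpha = np.linalg.solve(K, y)

post_mean = Ks @ alpha                                        # posterior mean of Theta(s)
post_cov = kernel(s, s) - Ks @ np.linalg.solve(K, Ks.T)       # posterior covariance
print(post_mean[:3], np.sqrt(np.diag(post_cov))[:3])
```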

  9. Random Discrete Measures
     Random discrete probability measure
       \Theta = \sum_{i=1}^{\infty} C_i \, \delta_{\Phi_i}
     Application: mixture models
       \int p(x \mid \phi) \, d\Theta(\phi) = \sum_{i=1}^{\infty} C_i \, p(x \mid \Phi_i)
     Example: Dirichlet process (stick-breaking construction)
     ◮ Sample Φ_1, Φ_2, ... ~ iid G
     ◮ Sample V_1, V_2, ... ~ iid Beta(1, α) and set C_i := V_i \prod_{j=1}^{i-1} (1 - V_j)
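A minimal sketch of the stick-breaking construction above, truncated at a finite number of atoms; the base measure G (a standard normal), the concentration α, and the component likelihood are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_process(alpha=2.0, n_atoms=1000):
    phi = rng.normal(size=n_atoms)              # atoms Phi_i ~ G, i.i.d. (G = N(0,1) here)
    v = rng.beta(1.0, alpha, size=n_atoms)      # V_i ~ Beta(1, alpha)
    # C_i = V_i * prod_{j<i} (1 - V_j)  (stick-breaking weights)
    c = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return phi, c

atoms, weights = dirichlet_process()
print("truncated weights sum to ~1:", weights.sum())

# Drawing from the induced mixture: pick an atom with probability C_i,
# then sample x ~ p(. | Phi_i), e.g. a Gaussian centred at that atom.
i = rng.choice(len(atoms), p=weights / weights.sum())
x = rng.normal(atoms[i], 0.5)
print("draw from the mixture:", x)
```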

  10. More Examples
     Application                    Pattern                  Bayesian nonparametric model
     Classification & regression    Function                 Gaussian process
     Clustering                     Partition                Chinese restaurant process
     Density estimation             Density                  Dirichlet process mixture
     Hierarchical clustering        Hierarchical partition   Dirichlet/Pitman-Yor diffusion tree, Kingman's coalescent, nested CRP
     Latent variable modelling      Features                 Beta process / Indian buffet process
     Survival analysis              Hazard                   Beta process, neutral-to-the-right process
     Power-law behaviour                                     Pitman-Yor process, stable-beta process
     Dictionary learning            Dictionary               Beta process / Indian buffet process
     Dimensionality reduction       Manifold                 Gaussian process latent variable model
     Deep learning                  Features                 Cascading/nested Indian buffet process
     Topic models                   Atomic distribution      Hierarchical Dirichlet process
     Time series                                             Infinite HMM
     Sequence prediction            Conditional probs        Sequence memoizer
     Reinforcement learning         Conditional probs        Infinite POMDP
     Spatial modelling              Functions                Gaussian process, dependent Dirichlet process
     Relational modelling                                    Infinite relational model, infinite hidden relational model, Mondrian process
     ...                            ...                      ...
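As a concrete instance of one row of this table (clustering, with a partition as the pattern), here is a minimal sketch of sampling a random partition from the Chinese restaurant process; the concentration parameter α and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_partition(n, alpha=1.0):
    # Customer k+1 joins an existing table with probability proportional to its
    # size, or opens a new table with probability proportional to alpha.
    tables = []        # current table sizes
    assignment = []    # table index of each customer
    for _ in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):
            tables.append(1)       # open a new table
        else:
            tables[choice] += 1
        assignment.append(choice)
    return assignment

print(crp_partition(20))
```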

  11. Research Problems

  12. Inference
     MCMC
     ◮ The models are generative → MCMC is a natural choice
     ◮ Gibbs samplers are easy to derive and can sample through hierarchies
     ◮ However: for most available samplers, inference is probably too slow or wrong
     Gaussian process inference
     ◮ On data: positive definite matrices (Mercer's theorem)
     ◮ Inference based on numerical linear algebra
     ◮ Naive methods scale cubically with sample size (see the timing sketch below)
     Approximations
     ◮ For latent variable methods: variational approximations
     ◮ For Gaussian processes: inducing point methods
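A rough illustration of the cubic scaling noted above: the dominant cost of naive GP inference is factorizing the n × n kernel matrix, so doubling n multiplies the runtime by roughly eight. The kernel, jitter term, and sample sizes are illustrative assumptions.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for n in (500, 1000, 2000):
    x = rng.uniform(-5, 5, n)
    # Squared-exponential kernel matrix plus a small jitter for numerical stability.
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-3 * np.eye(n)
    t0 = time.perf_counter()
    np.linalg.cholesky(K)          # O(n^3) factorization dominates naive GP inference
    print(n, round(time.perf_counter() - t0, 4), "seconds")
```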

  13. Asymptotics
     Consistency
     A Bayesian model is consistent at P_0 if the posterior converges to δ_{P_0} with growing sample size.
     (If P_0 lies outside the model, the model is misspecified.)
     Convergence rate
     ◮ Find the smallest balls B_{ε_n}(θ_0), where P_0 = P_{θ_0}, for which
         Q(B_{\varepsilon_n}(\theta_0) \mid X_1, \ldots, X_n) \to 1  as n → ∞
     ◮ Rate = the sequence ε_1, ε_2, ...
     ◮ The optimal rate is ε_n ∝ n^{-1/2}
     Example result: bandwidth adaptation with GPs
     ◮ True parameter θ_0 ∈ C^α[0, 1]^d, smoothness α unknown
     ◮ With a gamma prior on the GP bandwidth, the convergence rate is n^{-α/(2α+d)}
     [Gho10, KvdV06, Sch65, GvdV07, vdVvZ08a, vdVvZ08b]

  14. Ergodic Theory
     de Finetti's theorem as an ergodic decomposition
       P is S_∞-invariant  ⟺  P(A) = \int_{M(X)} \theta^{\infty}(A) \, Q(d\theta) for a unique Q ∈ M(M(X))
       P is G-invariant    ⟺  P(A) = \int_{E} e(A) \, \nu(de) for a unique ν ∈ M(E)
     where θ^∞ = \prod_{j=1}^{\infty} θ is the i.i.d. product measure, G is a (nice) group acting on X,
     and E is its set of ergodic measures.
     [Figure: P represented as a mixture of ergodic measures e_1, e_2, e_3 with weights ν_1, ν_2, ν_3.]
     Relevance to statistics
     ◮ de Finetti: random infinite sequences
     ◮ What if the data is matrix-valued, network-valued, ...?
     ◮ Examples: partitions (Kingman), graphs (Aldous, Hoover), Markov chains (Diaconis & Freedman)

  15. Summary
     Motivation, in hindsight
     Bayesian (nonparametric) modeling:
     ◮ Identify the pattern/explanatory object (function, discrete measure, ...)
     ◮ Usually, applied probability knows a random version of this object
     ◮ Use that process as the prior and develop inference
     Technical tools
     ◮ Stochastic processes
     ◮ Exchangeability / ergodic theory
     ◮ Graphical, hierarchical and dependent models
     ◮ Inference: MCMC sampling, optimization methods, numerical linear algebra
     Open challenges
     ◮ Novel models and useful applications
     ◮ Better inference and flexible software packages
     ◮ Mathematical statistics for Bayesian nonparametric models

  16. References
     [Gho10]    S. Ghosal. Dirichlet process, related priors and posterior asymptotics. In N. L. Hjort et al., editors, Bayesian Nonparametrics, pages 36–83. Cambridge University Press, 2010.
     [GvdV07]   S. Ghosal and A. W. van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. Annals of Statistics, 35(2):697–723, 2007.
     [Kal05]    O. Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer, 2005.
     [KvdV06]   B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2):837–877, 2006.
     [RW06]     C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
     [Sch65]    L. Schwartz. On Bayes procedures. Z. Wahrsch. Verw. Gebiete, 4:10–26, 1965.
     [Sch95]    M. J. Schervish. Theory of Statistics. Springer, 1995.
     [vdVvZ08a] A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36(3):1435–1463, 2008.
     [vdVvZ08b] A. W. van der Vaart and J. H. van Zanten. Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, volume 3 of Inst. Math. Stat. Collect., pages 200–222. Institute of Mathematical Statistics, Beachwood, OH, 2008.
