Exchangeability
Peter Orbanz, Columbia University
Parameters and Patterns

Parameters
P(X | θ) = Probability[ data | pattern ]

[Figure: regression example — noisy observations of output y plotted against input x.]

Inference idea
data = underlying pattern + independent noise
Terminology

Parametric model
◮ Number of parameters fixed (or bounded by a constant) w.r.t. sample size

Nonparametric model
◮ Number of parameters grows with sample size
◮ ∞-dimensional parameter space

Example: density estimation (see the sketch below)
[Figure: a parametric fit (single Gaussian with mean µ) vs. a nonparametric density estimate p(x).]
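A minimal sketch of the parametric/nonparametric contrast on a density estimation problem, assuming numpy and scipy are available; the bimodal data and the use of a Gaussian kernel density estimate are illustrative choices, not part of the slide.

```python
# Sketch: parametric vs. nonparametric density estimation.
# The two-component data and the KDE bandwidth defaults are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Data from a two-component mixture (the "true pattern" is bimodal).
x = np.concatenate([rng.normal(-2.0, 0.6, 150), rng.normal(2.0, 0.8, 150)])

# Parametric: a single Gaussian, 2 parameters regardless of sample size.
mu, sigma = x.mean(), x.std(ddof=1)
parametric = stats.norm(mu, sigma)

# Nonparametric: kernel density estimate, effectively one component per observation.
kde = stats.gaussian_kde(x)

grid = np.linspace(-6, 6, 7)
print("grid            :", np.round(grid, 2))
print("single Gaussian :", np.round(parametric.pdf(grid), 3))  # unimodal, misses the structure
print("KDE             :", np.round(kde(grid), 3))             # recovers both modes
```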
Nonparametric Bayesian Model

Definition
A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

Interpretation
Parameter space T = set of possible patterns. Recall previous tutorials:

Model              | T                | Application
Gaussian process   | Smooth functions | Regression problems
DP mixtures        | Smooth densities | Density estimation
CRP, 2-param. CRP  | Partitions       | Clustering

Solution to a Bayesian problem = posterior distribution on patterns
[Sch95]
de Finetti's Theorem

Infinite exchangeability
For all π ∈ S_∞ (the infinite symmetric group):
P(X_1, X_2, \dots) = P(X_{\pi(1)}, X_{\pi(2)}, \dots), or equivalently π(P) = P

Theorem (de Finetti)
P exchangeable  ⇔  P(X_1, X_2, \dots) = \int_{M(X)} \prod_{n=1}^{\infty} Q(X_n) \, d\nu(Q)

◮ Q is a random measure
◮ ν is uniquely determined by P
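A minimal sketch of the representation for binary sequences: draw a random i.i.d. law Q (here Bernoulli(p) with p drawn from a mixing measure ν), then sample conditionally i.i.d. from it. Taking ν = Beta(2, 2) is an illustrative assumption; the theorem itself does not specify ν.

```python
# Sketch: sampling an exchangeable binary sequence via de Finetti's representation.
# The mixing measure nu is taken to be Beta(2, 2) purely for illustration.
import numpy as np

rng = np.random.default_rng(1)

def sample_exchangeable_sequence(n):
    p = rng.beta(2.0, 2.0)             # Q ~ nu : pick a random i.i.d. law (here Bernoulli(p))
    return rng.binomial(1, p, size=n)  # X_1, ..., X_n conditionally i.i.d. Q

# Exchangeability check: all orderings of the same pattern have (approximately) equal frequency.
seqs = np.array([sample_exchangeable_sequence(3) for _ in range(100_000)])
for pattern in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    freq = np.mean(np.all(seqs == pattern, axis=1))
    print(pattern, round(float(freq), 4))   # the three frequencies agree up to Monte Carlo error
```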
Finite Exchangeability

Finite sequence X_1, \dots, X_n
Exchangeability of a finite sequence does NOT imply a de Finetti representation.

Example: two exchangeable random bits
           X_1 = 0   X_1 = 1
X_2 = 0       0        1/2
X_2 = 1      1/2        0

Suppose a de Finetti representation holds for some mixing measure ν on [0, 1]; then
P(X_1 = X_2 = 1) = \int_{[0,1]} p^2 \, d\nu(p) = 0  ⇒  ν{p = 0} = 1
P(X_1 = X_2 = 0) = \int_{[0,1]} (1-p)^2 \, d\nu(p) = 0  ⇒  ν{p = 1} = 1
The two conclusions contradict each other, so no such ν exists.

Intuition
Finite exchangeability does not eliminate sequential patterns.
[DF80]
Support of Priors

[Figure: the model {P_θ} as a subset of M(X); a well-specified truth P_0 = P_{θ_0} lies inside the model, while a P_0 outside the model means the model is misspecified.]
[Gho10, KvdV06]
Support of Nonparametric Priors

Large support
◮ The support of a nonparametric prior is larger (∞-dimensional) than that of a parametric prior (finite-dimensional).
◮ However: no uniform prior (or even a "neutral" improper prior) exists on M(X).

Interpretation of nonparametric prior assumptions
Concentrating a nonparametric prior on a subset of M(X) typically expresses a structural prior assumption.
◮ GP regression with unknown bandwidth:
  ◮ Any continuous function is possible
  ◮ The prior can express, e.g., "very smooth functions are more probable"
◮ Clustering: the expected number of clusters is...
  ◮ ...small → CRP prior
  ◮ ...power law → two-parameter CRP
Parameterized Models

[Diagram: a probability model as a random variable X: Ω → X with law P(X) = X[P]; a parameterized model as the composition Ω → X^∞ → M(X) ⊃ P → T, i.e. Θ = T ∘ F ∘ X^∞.]

◮ P = { P[X | θ] | θ ∈ T }
◮ F ≡ law of large numbers (maps the sequence to its limiting empirical measure)
◮ T : P[· | Θ = θ] ↦ θ is a bijection
◮ Θ := T ∘ F ∘ X^∞
[Sch95]
Justification: By Exchangeability

Again: de Finetti
P(X_1, X_2, \dots) = \int_{M(X)} \prod_{n=1}^{\infty} Q(X_n) \, d\nu(Q) = \int_{T} \prod_{n=1}^{\infty} Q(X_n \,|\, \Theta = \theta) \, d\nu_T(\theta)

◮ Θ is a random measure (since Θ(ω) ∈ M(X))

Convergence results
The de Finetti theorem comes with convergence results attached:
◮ Empirical measure: F_n → Θ weakly as n → ∞
◮ The posterior Λ_n(Θ | X_1, \dots, X_n) = Λ_n(·, ω) exists in M(T)
◮ Posterior convergence: Λ_n(·, ω) → δ_{Θ(ω)} as n → ∞
[Kal01]
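A minimal sketch of these convergence statements in the simplest conjugate setting (Beta–Bernoulli), which is not on the slide but illustrates the claim: the empirical mean converges and the posterior Λ_n piles up on the true parameter. The Beta(1, 1) prior and true p0 = 0.3 are illustrative assumptions.

```python
# Sketch: posterior concentration, Lambda_n -> delta at the directing parameter.
# Prior Beta(1, 1) and true parameter p0 = 0.3 are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p0 = 0.3
x = rng.binomial(1, p0, size=10_000)

for n in (10, 100, 1_000, 10_000):
    heads = x[:n].sum()
    posterior = stats.beta(1 + heads, 1 + n - heads)   # Lambda_n(Theta | X_1, ..., X_n)
    # Posterior mass of a small neighbourhood of p0 tends to 1.
    mass = posterior.cdf(p0 + 0.05) - posterior.cdf(p0 - 0.05)
    print(f"n={n:>6}  empirical mean={heads/n:.3f}  posterior mass near p0={mass:.3f}")
```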
Special Types of Exchangeable Data
Modifications

Pólya urns
P(X_{n+1} \,|\, X_1 = x_1, \dots, X_n = x_n) = \frac{1}{\alpha + n} \sum_{j=1}^{n} \delta_{x_j}(X_{n+1}) + \frac{\alpha}{\alpha + n} G_0(X_{n+1})

Exchangeable, with:
◮ ν is DP(α, G_0)
◮ \prod_{n=1}^{\infty} Q(X_n | \theta) = \prod_{n=1}^{\infty} \theta(X_n) = \prod_{n=1}^{\infty} \sum_{j=1}^{\infty} c_j \delta_{t_j}(X_n)

Exchangeable increment processes (H. Bühlmann)
Stationary, exchangeable increment process = mixture of Lévy processes:
P((X_t)_{t \in \mathbb{R}_+}) = \int L_{\alpha,\gamma,\mu}((X_t)_{t \in \mathbb{R}_+}) \, d\nu(\alpha, \gamma, \mu)
L_{\alpha,\gamma,\mu} = Lévy process with jump measure µ
[Bü60, Kal01]
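A minimal sketch of sequential sampling from the Pólya urn predictive rule above; taking the base measure G_0 to be a standard normal is an illustrative assumption.

```python
# Sketch: sequential draws from a Polya urn with concentration alpha
# and base measure G_0 = N(0, 1) (the Gaussian base measure is an illustrative choice).
import numpy as np

rng = np.random.default_rng(3)

def polya_urn(n, alpha=1.0):
    draws = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            draws.append(rng.normal())            # new value from G_0, prob. alpha / (alpha + i)
        else:
            draws.append(draws[rng.integers(i)])  # repeat a past value, prob. i / (alpha + i)
    return np.array(draws)

x = polya_urn(1000, alpha=2.0)
print("distinct values after 1000 draws:", len(np.unique(x)))  # grows roughly like alpha * log n
```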
Modification 2: Random Partitions

Random partition of ℕ: Π = {B_1, B_2, \dots}, e.g. {{1, 3, 5, \dots}, {2, 4}, {10}, \dots}

Paint-box distribution
◮ Weights s_1, s_2, \dots ≥ 0 with \sum_j s_j ≤ 1
◮ U_1, U_2, \dots ∼ Uniform[0, 1] i.i.d.
[Figure: [0, 1] divided into intervals of lengths s_1, s_2, … and a remainder of length 1 − \sum_j s_j; the points U_1, U_2, U_3 fall into these intervals.]

Sampling Π ∼ β[· | s]:
◮ i, j ∈ ℕ are in the same block ⇔ U_i, U_j fall in the same interval s_k
◮ {i} is a singleton block ⇔ U_i falls in the remainder interval of length 1 − \sum_j s_j

Theorem (Kingman)
Π exchangeable  ⇔  P(Π ∈ ·) = \int β[Π ∈ · | s] \, Q(ds)
[Kin78]
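A minimal sketch of the paint-box construction for one fixed weight sequence s; the particular weights below are an illustrative assumption.

```python
# Sketch: sampling a random partition of {0, ..., n-1} from a paint-box with weights s.
# The weights below are an illustrative choice; they sum to 0.7, leaving 0.3 for singletons.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)

def paintbox_partition(n, s):
    edges = np.cumsum(s)             # interval boundaries: [0, s1), [s1, s1+s2), ...
    u = rng.random(n)                # U_1, ..., U_n ~ Uniform[0, 1]
    blocks = defaultdict(list)
    singletons = []
    for i, ui in enumerate(u):
        k = np.searchsorted(edges, ui, side="right")
        if k < len(s):
            blocks[k].append(i)      # U_i landed in interval k: same block as others there
        else:
            singletons.append([i])   # U_i landed in the leftover interval: singleton block
    return list(blocks.values()) + singletons

s = [0.4, 0.2, 0.1]
print(paintbox_partition(12, s))
```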
Rotation Invariance

Rotatable sequence
P_n(X_1, \dots, X_n) = P_n(R_n(X_1, \dots, X_n)) for all R_n ∈ O(n)

Infinite case
X_1, X_2, \dots rotatable  :⇔  X_1, \dots, X_n rotatable for all n

Theorem (Freedman)
An infinite sequence is rotatable iff
P(X_1, X_2, \dots) = \int_{\mathbb{R}_+} \prod_{n=1}^{\infty} N_\sigma(X_n) \, d\nu_{\mathbb{R}_+}(\sigma)
where N_\sigma denotes the centered (0, σ)-Gaussian.
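A minimal sketch of Freedman's representation: draw a variance from a mixing measure ν (an Exponential here, purely for illustration), then sample centered i.i.d. Gaussians; applying a fixed rotation to a finite prefix leaves its law unchanged.

```python
# Sketch: a rotatable sequence as a scale mixture of centered Gaussians.
# The Exponential mixing measure over the variance is an illustrative choice.
import numpy as np

rng = np.random.default_rng(5)

def sample_rotatable(n):
    sigma2 = rng.exponential(1.0)               # sigma^2 ~ nu
    return rng.normal(0.0, np.sqrt(sigma2), n)  # X_1, ..., X_n i.i.d. N(0, sigma^2) given sigma

# Invariance check: the first coordinate of (X_1, X_2) has the same distribution
# before and after a fixed rotation R in O(2).
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xs = np.array([sample_rotatable(2) for _ in range(100_000)])
print(np.mean(xs[:, 0] > 1.0), np.mean((xs @ R.T)[:, 0] > 1.0))  # approximately equal
```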
Two Interpretations

As a special case of de Finetti
◮ Rotatable ⇒ exchangeable
◮ General de Finetti: parameter space T = M(X)
◮ Rotation invariance: T shrinks to { N_σ | σ ∈ ℝ_+ }

As invariance under a different symmetry
◮ Exchangeability = invariance of P(X_1, X_2, \dots) under a group action
◮ Freedman: a different group (O(n) rather than S_∞)
◮ In these cases: symmetry ⇒ decomposition theorem
Non-Exchangeable Data
Exchangeability: Random Graphs

Random graph with independent edges
Given: a symmetric function θ: [0, 1]^2 → [0, 1]
◮ U_1, U_2, \dots ∼ Uniform[0, 1] i.i.d.
◮ Edge (i, j) present: (i, j) ∼ Bernoulli(θ(U_i, U_j))
Call this distribution Γ(G ∈ · | θ).
[Figure: the graphon θ on [0, 1]^2, with Pr{edge i, j} = θ(U_i, U_j), and a sampled graph on nodes 1–9.]

Theorem (Aldous; Hoover)
A random (dense) graph G is exchangeable iff
P(G ∈ ·) = \int_T Γ(G ∈ · | θ) \, Q(dθ)
[Ald81, Hoo79]
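A minimal sketch of sampling a graph from Γ(G ∈ · | θ) for one specific θ; the two-block graphon used here is an illustrative assumption, not part of the theorem.

```python
# Sketch: sampling an exchangeable random graph from a graphon theta.
# The two-block theta used here is an illustrative choice.
import numpy as np

rng = np.random.default_rng(6)

def theta(u, v):
    # Dense connections within a block, sparse between blocks (a 2-block graphon).
    same_block = (u < 0.5) == (v < 0.5)
    return np.where(same_block, 0.8, 0.1)

def sample_graph(n):
    u = rng.random(n)                                # U_1, ..., U_n ~ Uniform[0, 1]
    probs = theta(u[:, None], u[None, :])            # edge probabilities theta(U_i, U_j)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return upper | upper.T                           # symmetric adjacency matrix, no self-loops

A = sample_graph(10)
print(A.astype(int))
```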
de Finetti: Geometry

Finite case
P = \sum_{e_i \in E} \nu_i e_i
◮ E = {e_1, e_2, e_3}
◮ (ν_1, ν_2, ν_3) are the barycentric coordinates of P
[Figure: P as a point in the simplex with vertices e_1, e_2, e_3.]

Infinite/continuous case
P(·) = \int_E e(·) \, d\nu(e) = \int_T k(\theta, ·) \, d\nu_T(\theta)
◮ k: T → E ⊂ M(X) is a probability kernel (= conditional probability)
◮ k is a random measure with values k(θ, ·) ∈ E
◮ de Finetti: k(θ, ·) = \prod_{n \in \mathbb{N}} Q(· | θ) and T = M(X)
Decomposition by Symmetry

Theorem (Varadarajan)
◮ G a nice group acting on a space Y
◮ Call a measure µ ergodic if µ(A) ∈ {0, 1} for all G-invariant sets A
◮ E := { ergodic probability measures }
Then there is a Markov kernel k: T → E such that:
P ∈ M(Y) is G-invariant  ⇔  P(A) = \int_T k(\theta, A) \, d\nu(\theta)

de Finetti as a special case
◮ G = S_∞ and Y = X^∞
◮ G-invariant sets = exchangeable events
◮ E = factorial (i.i.d. product) distributions ("Hewitt–Savage 0-1 law")
[Var63]
Symmetry and Sufficiency
Sufficient Statistics

Problem
Apparently no direct connection with standard models.

Sufficient statistic
Functions S_n of the data are sufficient if:
◮ Intuitively: S_n(X_1, \dots, X_n) contains all information the sample provides about the parameter
◮ Formally: P_n(X_1, \dots, X_n \,|\, \Theta, S_n) = P_n(X_1, \dots, X_n \,|\, S_n) for all n

Sufficiency and symmetry (see the sketch below)
◮ P exchangeable ⇔ S_n(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} is sufficient
◮ P rotatable ⇔ S_n(x_1, \dots, x_n) = \sum_{i=1}^{n} x_i^2 = \|(x_1, \dots, x_n)\|^2 is sufficient
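A minimal sketch of the two sufficiency statements in their simplest instances (an i.i.d. Bernoulli likelihood and an i.i.d. centered Gaussian likelihood, both illustrative choices): the likelihood depends on the sample only through the empirical measure in the first case, and only through the squared norm in the second.

```python
# Sketch: likelihoods depend on the data only through the sufficient statistic.
import numpy as np

# Exchangeable case: permuting the sample leaves the Bernoulli likelihood unchanged.
x = np.array([1, 0, 1, 1, 0])
x_perm = np.array([0, 1, 1, 0, 1])   # same empirical measure, different order

def bernoulli_loglik(x, p):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

print(bernoulli_loglik(x, 0.3), bernoulli_loglik(x_perm, 0.3))   # identical

# Rotatable case: rotating the sample leaves the centered Gaussian likelihood unchanged.
y = np.array([0.5, -1.2, 2.0])
angle = 0.9
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])   # orthogonal, preserves ||y||^2
y_rot = R @ y

def gaussian_loglik(y, sigma2):
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - np.sum(y**2) / (2 * sigma2)

print(gaussian_loglik(y, 1.5), gaussian_loglik(y_rot, 1.5))      # identical
```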
Decomposition by Sufficiency

Theorem (Diaconis and Freedman; Lauritzen; several others)
Given a sufficient statistic S_n for each n, let k_n(·, s_n) = conditional probability of X_1, \dots, X_n given S_n = s_n.
1. k_n converges to a limit function:
   k_n(·, S_n(X_1(ω), \dots, X_n(ω))) → k_∞(·, ω) as n → ∞
2. P(X_1, X_2, \dots) has the decomposition
   P(·) = \int k_∞(·, ω) \, d\nu(ω)
3. The model P ⊂ M(X) is a convex set with extreme points k_∞(·, ω)
4. The measure ν is uniquely determined by P
(The theorem statement omits technical conditions.)
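A sketch of what the kernels k_n look like in one standard special case not spelled out on the slide: the exchangeable setting with S_n the empirical measure, which recovers de Finetti's theorem.

```latex
% Sketch (assumed special case: exchangeable sequence, S_n = empirical measure).
% Given S_n = s_n the observed values are fixed up to order, so k_n is uniform
% over their orderings:
k_n(A, s_n) \;=\; \frac{1}{n!} \sum_{\pi \in S_n}
    \mathbf{1}\!\left\{ (x_{\pi(1)}, \dots, x_{\pi(n)}) \in A \right\},
% and as n -> infinity this sampling-without-replacement law approaches
% i.i.d. sampling from the limiting empirical measure Theta(omega):
k_\infty(\,\cdot\,, \omega) \;=\; \prod_{n \in \mathbb{N}} \Theta(\omega)(\,\cdot\,),
% so the decomposition in step 2 is exactly the de Finetti mixture over
% factorial (i.i.d. product) distributions.
```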