Exchangeability
Peter Orbanz, Columbia University
Parameters and Patterns

Parameters
P(X | θ) = Probability[ data | pattern ]

[Figure: regression example — noisy observations of output y plotted against input x.]

Inference idea
data = underlying pattern + independent noise
Terminology

Parametric model
◮ Number of parameters fixed (or bounded by a constant) w.r.t. sample size

Nonparametric model
◮ Number of parameters grows with sample size
◮ ∞-dimensional parameter space

Example: density estimation (see the sketch below)
[Figure: a parametric fit (single Gaussian with mean µ) vs. a nonparametric density estimate p(x).]
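A minimal sketch of the parametric/nonparametric contrast on a density estimation problem, assuming numpy and scipy are available; the bimodal data and the use of a Gaussian kernel density estimate are illustrative choices, not part of the slide.

```python
# Sketch: parametric vs. nonparametric density estimation.
# The two-component data and the KDE bandwidth defaults are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Data from a two-component mixture (the "true pattern" is bimodal).
x = np.concatenate([rng.normal(-2.0, 0.6, 150), rng.normal(2.0, 0.8, 150)])

# Parametric: a single Gaussian, 2 parameters regardless of sample size.
mu, sigma = x.mean(), x.std(ddof=1)
parametric = stats.norm(mu, sigma)

# Nonparametric: kernel density estimate, effectively one component per observation.
kde = stats.gaussian_kde(x)

grid = np.linspace(-6, 6, 7)
print("grid            :", np.round(grid, 2))
print("single Gaussian :", np.round(parametric.pdf(grid), 3))  # unimodal, misses the structure
print("KDE             :", np.round(kde(grid), 3))             # recovers both modes
```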
Nonparametric Bayesian Model

Definition
A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

Interpretation
Parameter space T = set of possible patterns. Recall previous tutorials:

Model              | T                | Application
Gaussian process   | Smooth functions | Regression problems
DP mixtures        | Smooth densities | Density estimation
CRP, 2-param. CRP  | Partitions       | Clustering

Solution to a Bayesian problem = posterior distribution on patterns
[Sch95]
de Finetti's Theorem

Infinite exchangeability
For all π ∈ S_∞ (the infinite symmetric group):
P(X_1, X_2, \dots) = P(X_{\pi(1)}, X_{\pi(2)}, \dots), or equivalently π(P) = P

Theorem (de Finetti)
P exchangeable  ⇔  P(X_1, X_2, \dots) = \int_{M(X)} \prod_{n=1}^{\infty} Q(X_n) \, d\nu(Q)

◮ Q is a random measure
◮ ν is uniquely determined by P
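A minimal sketch of the representation for binary sequences: draw a random i.i.d. law Q (here Bernoulli(p) with p drawn from a mixing measure ν), then sample conditionally i.i.d. from it. Taking ν = Beta(2, 2) is an illustrative assumption; the theorem itself does not specify ν.

```python
# Sketch: sampling an exchangeable binary sequence via de Finetti's representation.
# The mixing measure nu is taken to be Beta(2, 2) purely for illustration.
import numpy as np

rng = np.random.default_rng(1)

def sample_exchangeable_sequence(n):
    p = rng.beta(2.0, 2.0)             # Q ~ nu : pick a random i.i.d. law (here Bernoulli(p))
    return rng.binomial(1, p, size=n)  # X_1, ..., X_n conditionally i.i.d. Q

# Exchangeability check: all orderings of the same pattern have (approximately) equal frequency.
seqs = np.array([sample_exchangeable_sequence(3) for _ in range(100_000)])
for pattern in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    freq = np.mean(np.all(seqs == pattern, axis=1))
    print(pattern, round(float(freq), 4))   # the three frequencies agree up to Monte Carlo error
```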
Finite Exchangeability

Finite sequence X_1, \dots, X_n
Exchangeability of a finite sequence does NOT imply a de Finetti representation.

Example: two exchangeable random bits
           X_1 = 0   X_1 = 1
X_2 = 0       0        1/2
X_2 = 1      1/2        0

Suppose a de Finetti representation holds for some mixing measure ν on [0, 1]; then
P(X_1 = X_2 = 1) = \int_{[0,1]} p^2 \, d\nu(p) = 0  ⇒  ν{p = 0} = 1
P(X_1 = X_2 = 0) = \int_{[0,1]} (1-p)^2 \, d\nu(p) = 0  ⇒  ν{p = 1} = 1
The two conclusions contradict each other, so no such ν exists.

Intuition
Finite exchangeability does not eliminate sequential patterns.
[DF80]
Support of Priors

[Figure: the model {P_θ} as a subset of M(X); a well-specified truth P_0 = P_{θ_0} lies inside the model, while a P_0 outside the model means the model is misspecified.]
[Gho10, KvdV06]
Support of Nonparametric Priors

Large support
◮ The support of a nonparametric prior is larger (∞-dimensional) than that of a parametric prior (finite-dimensional).
◮ However: no uniform prior (or even a "neutral" improper prior) exists on M(X).

Interpretation of nonparametric prior assumptions
Concentrating a nonparametric prior on a subset of M(X) typically expresses a structural prior assumption.
◮ GP regression with unknown bandwidth:
  ◮ Any continuous function is possible
  ◮ The prior can express, e.g., "very smooth functions are more probable"
◮ Clustering: the expected number of clusters is...
  ◮ ...small → CRP prior
  ◮ ...power law → two-parameter CRP
Parameterized Models

[Diagram: a probability model as a random variable X: Ω → X with law P(X) = X[P]; a parameterized model as the composition Ω → X^∞ → M(X) ⊃ P → T, i.e. Θ = T ∘ F ∘ X^∞.]

◮ P = { P[X | θ] | θ ∈ T }
◮ F ≡ law of large numbers (maps the sequence to its limiting empirical measure)
◮ T : P[· | Θ = θ] ↦ θ is a bijection
◮ Θ := T ∘ F ∘ X^∞
[Sch95]
Justification: By Exchangeability

Again: de Finetti
P(X_1, X_2, \dots) = \int_{M(X)} \prod_{n=1}^{\infty} Q(X_n) \, d\nu(Q) = \int_{T} \prod_{n=1}^{\infty} Q(X_n \,|\, \Theta = \theta) \, d\nu_T(\theta)

◮ Θ is a random measure (since Θ(ω) ∈ M(X))

Convergence results
The de Finetti theorem comes with convergence results attached:
◮ Empirical measure: F_n → Θ weakly as n → ∞
◮ The posterior Λ_n(Θ | X_1, \dots, X_n) = Λ_n(·, ω) exists in M(T)
◮ Posterior convergence: Λ_n(·, ω) → δ_{Θ(ω)} as n → ∞
[Kal01]
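A minimal sketch of these convergence statements in the simplest conjugate setting (Beta–Bernoulli), which is not on the slide but illustrates the claim: the empirical mean converges and the posterior Λ_n piles up on the true parameter. The Beta(1, 1) prior and true p0 = 0.3 are illustrative assumptions.

```python
# Sketch: posterior concentration, Lambda_n -> delta at the directing parameter.
# Prior Beta(1, 1) and true parameter p0 = 0.3 are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p0 = 0.3
x = rng.binomial(1, p0, size=10_000)

for n in (10, 100, 1_000, 10_000):
    heads = x[:n].sum()
    posterior = stats.beta(1 + heads, 1 + n - heads)   # Lambda_n(Theta | X_1, ..., X_n)
    # Posterior mass of a small neighbourhood of p0 tends to 1.
    mass = posterior.cdf(p0 + 0.05) - posterior.cdf(p0 - 0.05)
    print(f"n={n:>6}  empirical mean={heads/n:.3f}  posterior mass near p0={mass:.3f}")
```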
Special Types of Exchangeable Data
Modifications

Pólya urns
P(X_{n+1} \,|\, X_1 = x_1, \dots, X_n = x_n) = \frac{1}{\alpha + n} \sum_{j=1}^{n} \delta_{x_j}(X_{n+1}) + \frac{\alpha}{\alpha + n} G_0(X_{n+1})

Exchangeable, with:
◮ ν is DP(α, G_0)
◮ \prod_{n=1}^{\infty} Q(X_n | \theta) = \prod_{n=1}^{\infty} \theta(X_n) = \prod_{n=1}^{\infty} \sum_{j=1}^{\infty} c_j \delta_{t_j}(X_n)

Exchangeable increment processes (H. Bühlmann)
Stationary, exchangeable increment process = mixture of Lévy processes:
P((X_t)_{t \in \mathbb{R}_+}) = \int L_{\alpha,\gamma,\mu}((X_t)_{t \in \mathbb{R}_+}) \, d\nu(\alpha, \gamma, \mu)
L_{\alpha,\gamma,\mu} = Lévy process with jump measure µ
[Bü60, Kal01]
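A minimal sketch of sequential sampling from the Pólya urn predictive rule above; taking the base measure G_0 to be a standard normal is an illustrative assumption.

```python
# Sketch: sequential draws from a Polya urn with concentration alpha
# and base measure G_0 = N(0, 1) (the Gaussian base measure is an illustrative choice).
import numpy as np

rng = np.random.default_rng(3)

def polya_urn(n, alpha=1.0):
    draws = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            draws.append(rng.normal())            # new value from G_0, prob. alpha / (alpha + i)
        else:
            draws.append(draws[rng.integers(i)])  # repeat a past value, prob. i / (alpha + i)
    return np.array(draws)

x = polya_urn(1000, alpha=2.0)
print("distinct values after 1000 draws:", len(np.unique(x)))  # grows roughly like alpha * log n
```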
Modification 2: Random Partitions

Random partition of ℕ: Π = {B_1, B_2, \dots}, e.g. {{1, 3, 5, \dots}, {2, 4}, {10}, \dots}

Paint-box distribution
◮ Weights s_1, s_2, \dots ≥ 0 with \sum_j s_j ≤ 1
◮ U_1, U_2, \dots ∼ Uniform[0, 1] i.i.d.
[Figure: [0, 1] divided into intervals of lengths s_1, s_2, … and a remainder of length 1 − \sum_j s_j; the points U_1, U_2, U_3 fall into these intervals.]

Sampling Π ∼ β[· | s]:
◮ i, j ∈ ℕ are in the same block ⇔ U_i, U_j fall in the same interval s_k
◮ {i} is a singleton block ⇔ U_i falls in the remainder interval of length 1 − \sum_j s_j

Theorem (Kingman)
Π exchangeable  ⇔  P(Π ∈ ·) = \int β[Π ∈ · | s] \, Q(ds)
[Kin78]
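A minimal sketch of the paint-box construction for one fixed weight sequence s; the particular weights below are an illustrative assumption.

```python
# Sketch: sampling a random partition of {0, ..., n-1} from a paint-box with weights s.
# The weights below are an illustrative choice; they sum to 0.7, leaving 0.3 for singletons.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)

def paintbox_partition(n, s):
    edges = np.cumsum(s)             # interval boundaries: [0, s1), [s1, s1+s2), ...
    u = rng.random(n)                # U_1, ..., U_n ~ Uniform[0, 1]
    blocks = defaultdict(list)
    singletons = []
    for i, ui in enumerate(u):
        k = np.searchsorted(edges, ui, side="right")
        if k < len(s):
            blocks[k].append(i)      # U_i landed in interval k: same block as others there
        else:
            singletons.append([i])   # U_i landed in the leftover interval: singleton block
    return list(blocks.values()) + singletons

s = [0.4, 0.2, 0.1]
print(paintbox_partition(12, s))
```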
Rotation Invariance

Rotatable sequence
P_n(X_1, \dots, X_n) = P_n(R_n(X_1, \dots, X_n)) for all R_n ∈ O(n)

Infinite case
X_1, X_2, \dots rotatable  :⇔  X_1, \dots, X_n rotatable for all n

Theorem (Freedman)
An infinite sequence is rotatable iff
P(X_1, X_2, \dots) = \int_{\mathbb{R}_+} \prod_{n=1}^{\infty} N_\sigma(X_n) \, d\nu_{\mathbb{R}_+}(\sigma)
where N_\sigma denotes the centered (0, σ)-Gaussian.
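A minimal sketch of Freedman's representation: draw a variance from a mixing measure ν (an Exponential here, purely for illustration), then sample centered i.i.d. Gaussians; applying a fixed rotation to a finite prefix leaves its law unchanged.

```python
# Sketch: a rotatable sequence as a scale mixture of centered Gaussians.
# The Exponential mixing measure over the variance is an illustrative choice.
import numpy as np

rng = np.random.default_rng(5)

def sample_rotatable(n):
    sigma2 = rng.exponential(1.0)               # sigma^2 ~ nu
    return rng.normal(0.0, np.sqrt(sigma2), n)  # X_1, ..., X_n i.i.d. N(0, sigma^2) given sigma

# Invariance check: the first coordinate of (X_1, X_2) has the same distribution
# before and after a fixed rotation R in O(2).
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
xs = np.array([sample_rotatable(2) for _ in range(100_000)])
print(np.mean(xs[:, 0] > 1.0), np.mean((xs @ R.T)[:, 0] > 1.0))  # approximately equal
```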
Two Interpretations

As a special case of de Finetti
◮ Rotatable ⇒ exchangeable
◮ General de Finetti: parameter space T = M(X)
◮ Rotation invariance: T shrinks to { N_σ | σ ∈ ℝ_+ }

As invariance under a different symmetry
◮ Exchangeability = invariance of P(X_1, X_2, \dots) under a group action
◮ Freedman: a different group (O(n) rather than S_∞)
◮ In these cases: symmetry ⇒ decomposition theorem
Non-Exchangeable Data
Exchangeability: Random Graphs

Random graph with independent edges
Given: a symmetric function θ: [0, 1]^2 → [0, 1]
◮ U_1, U_2, \dots ∼ Uniform[0, 1] i.i.d.
◮ Edge (i, j) present: (i, j) ∼ Bernoulli(θ(U_i, U_j))
Call this distribution Γ(G ∈ · | θ).
[Figure: the graphon θ on [0, 1]^2, with Pr{edge i, j} = θ(U_i, U_j), and a sampled graph on nodes 1–9.]

Theorem (Aldous; Hoover)
A random (dense) graph G is exchangeable iff
P(G ∈ ·) = \int_T Γ(G ∈ · | θ) \, Q(dθ)
[Ald81, Hoo79]
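A minimal sketch of sampling a graph from Γ(G ∈ · | θ) for one specific θ; the two-block graphon used here is an illustrative assumption, not part of the theorem.

```python
# Sketch: sampling an exchangeable random graph from a graphon theta.
# The two-block theta used here is an illustrative choice.
import numpy as np

rng = np.random.default_rng(6)

def theta(u, v):
    # Dense connections within a block, sparse between blocks (a 2-block graphon).
    same_block = (u < 0.5) == (v < 0.5)
    return np.where(same_block, 0.8, 0.1)

def sample_graph(n):
    u = rng.random(n)                                # U_1, ..., U_n ~ Uniform[0, 1]
    probs = theta(u[:, None], u[None, :])            # edge probabilities theta(U_i, U_j)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return upper | upper.T                           # symmetric adjacency matrix, no self-loops

A = sample_graph(10)
print(A.astype(int))
```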
de Finetti: Geometry

Finite case
P = \sum_{e_i \in E} \nu_i e_i
◮ E = {e_1, e_2, e_3}
◮ (ν_1, ν_2, ν_3) are the barycentric coordinates of P
[Figure: P as a point in the simplex with vertices e_1, e_2, e_3.]

Infinite/continuous case
P(·) = \int_E e(·) \, d\nu(e) = \int_T k(\theta, ·) \, d\nu_T(\theta)
◮ k: T → E ⊂ M(X) is a probability kernel (= conditional probability)
◮ k is a random measure with values k(θ, ·) ∈ E
◮ de Finetti: k(θ, ·) = \prod_{n \in \mathbb{N}} Q(· | θ) and T = M(X)
Decomposition by Symmetry

Theorem (Varadarajan)
◮ G a nice group acting on a space Y
◮ Call a measure µ ergodic if µ(A) ∈ {0, 1} for all G-invariant sets A
◮ E := { ergodic probability measures }
Then there is a Markov kernel k: T → E such that:
P ∈ M(Y) is G-invariant  ⇔  P(A) = \int_T k(\theta, A) \, d\nu(\theta)

de Finetti as a special case
◮ G = S_∞ and Y = X^∞
◮ G-invariant sets = exchangeable events
◮ E = factorial (i.i.d. product) distributions ("Hewitt–Savage 0-1 law")
[Var63]
Symmetry and Sufficiency
Sufficient Statistics

Problem
Apparently no direct connection with standard models.

Sufficient statistic
Functions S_n of the data are sufficient if:
◮ Intuitively: S_n(X_1, \dots, X_n) contains all information the sample provides about the parameter
◮ Formally: P_n(X_1, \dots, X_n \,|\, \Theta, S_n) = P_n(X_1, \dots, X_n \,|\, S_n) for all n

Sufficiency and symmetry (see the sketch below)
◮ P exchangeable ⇔ S_n(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} is sufficient
◮ P rotatable ⇔ S_n(x_1, \dots, x_n) = \sum_{i=1}^{n} x_i^2 = \|(x_1, \dots, x_n)\|^2 is sufficient
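A minimal sketch of the two sufficiency statements in their simplest instances (an i.i.d. Bernoulli likelihood and an i.i.d. centered Gaussian likelihood, both illustrative choices): the likelihood depends on the sample only through the empirical measure in the first case, and only through the squared norm in the second.

```python
# Sketch: likelihoods depend on the data only through the sufficient statistic.
import numpy as np

# Exchangeable case: permuting the sample leaves the Bernoulli likelihood unchanged.
x = np.array([1, 0, 1, 1, 0])
x_perm = np.array([0, 1, 1, 0, 1])   # same empirical measure, different order

def bernoulli_loglik(x, p):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

print(bernoulli_loglik(x, 0.3), bernoulli_loglik(x_perm, 0.3))   # identical

# Rotatable case: rotating the sample leaves the centered Gaussian likelihood unchanged.
y = np.array([0.5, -1.2, 2.0])
angle = 0.9
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])   # orthogonal, preserves ||y||^2
y_rot = R @ y

def gaussian_loglik(y, sigma2):
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - np.sum(y**2) / (2 * sigma2)

print(gaussian_loglik(y, 1.5), gaussian_loglik(y_rot, 1.5))      # identical
```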
Decomposition by Sufficiency

Theorem (Diaconis and Freedman; Lauritzen; several others)
Given a sufficient statistic S_n for each n, let k_n(·, s_n) = conditional probability of X_1, \dots, X_n given S_n = s_n.
1. k_n converges to a limit function:
   k_n(·, S_n(X_1(ω), \dots, X_n(ω))) → k_∞(·, ω) as n → ∞
2. P(X_1, X_2, \dots) has the decomposition
   P(·) = \int k_∞(·, ω) \, d\nu(ω)
3. The model P ⊂ M(X) is a convex set with extreme points k_∞(·, ω)
4. The measure ν is uniquely determined by P
(The theorem statement omits technical conditions.)
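A sketch of what the kernels k_n look like in one standard special case not spelled out on the slide: the exchangeable setting with S_n the empirical measure, which recovers de Finetti's theorem.

```latex
% Sketch (assumed special case: exchangeable sequence, S_n = empirical measure).
% Given S_n = s_n the observed values are fixed up to order, so k_n is uniform
% over their orderings:
k_n(A, s_n) \;=\; \frac{1}{n!} \sum_{\pi \in S_n}
    \mathbf{1}\!\left\{ (x_{\pi(1)}, \dots, x_{\pi(n)}) \in A \right\},
% and as n -> infinity this sampling-without-replacement law approaches
% i.i.d. sampling from the limiting empirical measure Theta(omega):
k_\infty(\,\cdot\,, \omega) \;=\; \prod_{n \in \mathbb{N}} \Theta(\omega)(\,\cdot\,),
% so the decomposition in step 2 is exactly the de Finetti mixture over
% factorial (i.i.d. product) distributions.
```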