A survey on mixing coefficients: computation and estimation.
Vitaly Kuznetsov, Courant Institute of Mathematical Sciences, New York University.
October 29, 2013.
Introduction

Binary classification: receive a sample X_1, …, X_m with labels in {0, 1}, and choose a hypothesis h that has good expected performance on unseen data. X_1, …, X_m are typically assumed i.i.d.
Introduction (continued)

Much of learning theory operates under the assumption that the data comes from an i.i.d. source. In certain scenarios this assumption is not appropriate, e.g. time series analysis. To extend learning theory to these scenarios we need to find a suitable relaxation of the i.i.d. requirement. One common approach found in the literature is to impose various "mixing conditions". Under these mixing conditions the strength of dependence between random variables is measured using "mixing coefficients".
Outline

- Mixing conditions and coefficients: definitions and basic properties.
- Computational aspects.
- Estimating mixing coefficients.
- Discussion.
How can we measure dependence between random variables?

Common measures of dependence are the so-called "mixing" coefficients, originally introduced to prove laws of large numbers for sequences of dependent variables.
α-mixing coefficient between two σ-algebras

Given a probability space (Ω, F, P) and two sub-σ-algebras σ_1 and σ_2, define the α-mixing coefficient

α(σ_1, σ_2) = sup |P(A)P(B) − P(A ∩ B)|,

where the supremum is taken over all A ∈ σ_1 and B ∈ σ_2.
ϕ-mixing coefficient

Define the ϕ-mixing coefficient

ϕ(σ_1 | σ_2) = sup |P(A) − P(A | B)|,

where the supremum is taken over all A ∈ σ_1 and B ∈ σ_2 with P(B) > 0. Note that the ϕ coefficient is not symmetric.
β-mixing coefficient

Define the β-mixing coefficient between two σ-algebras σ_1 and σ_2:

β(σ_1, σ_2) = E[ sup_{A ∈ σ_1} |P(A) − P(A | σ_2)| ].

We can rewrite the β-mixing coefficient as follows:

β(σ_1, σ_2) = (1/2) sup Σ_{i=1}^{I} Σ_{j=1}^{J} |P(A_i)P(B_j) − P(A_i ∩ B_j)|,

where the supremum is taken over all finite partitions {A_1, …, A_I} and {B_1, …, B_J} of Ω such that A_i ∈ σ_1 and B_j ∈ σ_2.
Alternative definitions of the β-mixing coefficient

This leads to yet another characterization of the β-mixing coefficient:

β(σ_1, σ_2) = ‖P_{σ_1} ⊗ P_{σ_2} − P_{σ_1 ⊗ σ_2}‖,

where ‖·‖ denotes the total variation distance, i.e. ‖P − Q‖ = sup_A |P(A) − Q(A)|. Assuming the distributions P and Q have densities f and g respectively,

‖P − Q‖ = (1/2) ∫ |f − g|.
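In the discrete case the total variation distance reduces to half the ℓ1 distance between the probability vectors, which makes it trivial to compute. A minimal sketch (the function name is mine, not from the survey):

```python
import numpy as np

# Total variation distance between two discrete distributions p and q,
# using the identity ||P - Q|| = (1/2) * sum_i |p_i - q_i|,
# the discrete analogue of (1/2) * integral |f - g|.
def tv_distance(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

print(tv_distance([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # 0.1
```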
Relations between mixing coefficients

We have the following:

2 α(σ_1, σ_2) ≤ β(σ_1, σ_2) ≤ ϕ(σ_1, σ_2).

The second inequality is immediate from the definitions. Proof of the first inequality: for any A ∈ σ_1 and B ∈ σ_2, the four terms below are equal in absolute value, and applying the partition characterization of β to the partitions {A, A^c} and {B, B^c} gives

4 |P(A)P(B) − P(A ∩ B)| = |P(A)P(B) − P(A ∩ B)| + |P(A)P(B^c) − P(A ∩ B^c)| + |P(A^c)P(B) − P(A^c ∩ B)| + |P(A^c)P(B^c) − P(A^c ∩ B^c)| ≤ 2 β(σ_1, σ_2).

Taking the supremum over A and B yields 2 α(σ_1, σ_2) ≤ β(σ_1, σ_2).
From two variables to stochastic processes (i)

Let {X_t}_{t=−∞}^{∞} be a doubly infinite sequence of random variables. Notation:

- X_i^j = (X_i, X_{i+1}, …, X_j),
- P_i^j is the joint probability distribution of X_i^j,
- σ_i^j is the σ-algebra generated by X_i^j.
From two variables to stochastic processes (ii)

Define the following mixing coefficients:

α(a) = sup_t α(σ_{−∞}^t, σ_{t+a}^{∞}),
β(a) = sup_t β(σ_{−∞}^t, σ_{t+a}^{∞}),
ϕ(a) = sup_t ϕ(σ_{−∞}^t, σ_{t+a}^{∞}).

We say that a sequence of random variables X_{−∞}^{∞} is α-, β- or ϕ-mixing if the corresponding mixing coefficient → 0 as a → ∞. These coefficients measure dependence between the future and the past separated by a time units.
Stationary stochastic processes

A stochastic process X_{−∞}^{∞} is (strictly) stationary if for any t ∈ Z and k, n ∈ N the distribution of X_t^{t+n} is the same as the distribution of X_{t+k}^{t+k+n}. For stationary processes the mixing coefficients simplify to

α(a) = α(σ_{−∞}^0, σ_a^{∞}),
β(a) = β(σ_{−∞}^0, σ_a^{∞}),
ϕ(a) = ϕ(σ_{−∞}^0, σ_a^{∞}).
Connections to machine learning

Theorem (M. Mohri, A. Rostamizadeh, 2009): Let H = {X → Y} be a set of hypotheses and L an M-bounded loss function. Let S be a sample of size m = 2μa from a stationary β-mixing process on X × Y. Then for any δ > 4(μ − 1)β(a), with probability at least 1 − δ the following holds for all h ∈ H:

E[L(h(X), Y)] ≤ (1/m) Σ_{i=1}^{m} L(h(X_i), Y_i) + R̂_{S_μ}(L ∘ H) + 3M √( log(4/δ′) / (2μ) ),

where R̂_{S_μ} denotes the empirical Rademacher complexity and δ′ = δ − 4(μ − 1)β(a). There are other results of a similar nature by R. Meir, M. Mohri and A. Rostamizadeh, I. Steinwart et al., to name a few.
Can we compute mixing coefficients?

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint and marginal probability distributions. Then computing the α-mixing coefficient is NP-hard (equivalent to the "partition problem"). Ahsen and Vidyasagar also give efficiently computable upper and lower bounds.
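For intuition, α can still be computed by brute force for tiny alphabets, since events in σ(X) and σ(Y) correspond to subsets of the value sets; the exponential cost in the alphabet sizes is consistent with the hardness result. A sketch (function name is mine, not from the paper):

```python
import itertools
import numpy as np

# Brute-force alpha(sigma(X), sigma(Y)) for a discrete joint distribution
# theta, where theta[i, j] = P(X = i, Y = j).  Events in sigma(X) are
# subsets of X-values, so we enumerate all 2^n * 2^k event pairs (A, B).
def alpha_bruteforce(theta):
    theta = np.asarray(theta, dtype=float)
    n, k = theta.shape
    mu, nu = theta.sum(axis=1), theta.sum(axis=0)   # marginals of X and Y
    best = 0.0
    for A in itertools.product([False, True], repeat=n):
        for B in itertools.product([False, True], repeat=k):
            a, b = np.array(A), np.array(B)
            gap = abs(mu[a].sum() * nu[b].sum() - theta[np.ix_(a, b)].sum())
            best = max(best, gap)
    return best

# X = Y, uniform on {0, 1}: the optimal event pair gives alpha = 1/4.
print(alpha_bruteforce(np.eye(2) / 2))  # 0.25
```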
Can we compute mixing coefficients? (continued)

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint distribution θ_ij and marginal probability distributions μ_i and ν_j. Then

β(σ(X), σ(Y)) = (1/2) Σ_{i,j} |γ_ij|,
ϕ(σ(X), σ(Y)) = max_j (1/ν_j) Σ_i max(γ_ij, 0),

where γ_ij = θ_ij − μ_i ν_j. Thus β(σ(X), σ(Y)) and ϕ(σ(X), σ(Y)) are both computable in polynomial time.
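These closed forms are straightforward to implement. A sketch, assuming the joint distribution is given as a matrix theta with theta[i, j] = P(X = i, Y = j) (names are mine):

```python
import numpy as np

# Polynomial-time beta and phi for discrete (X, Y), following the formulas
# above: gamma_ij = theta_ij - mu_i * nu_j.
def beta_phi(theta):
    theta = np.asarray(theta, dtype=float)
    mu = theta.sum(axis=1)              # marginal of X
    nu = theta.sum(axis=0)              # marginal of Y
    gamma = theta - np.outer(mu, nu)
    beta = 0.5 * np.abs(gamma).sum()
    phi = np.max(np.maximum(gamma, 0.0).sum(axis=0) / nu)
    return beta, phi

# Perfectly dependent X = Y, uniform on {0, 1}: beta = phi = 1/2.
# For this distribution alpha = 1/4, so 2*alpha <= beta <= phi holds
# with equality on the left.
print(beta_phi(np.eye(2) / 2))  # (0.5, 0.5)
```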
Estimation of mixing coefficients: naive approach (i)

Question: given i.i.d. samples (X_1, Y_1), …, (X_m, Y_m) from a joint distribution of real-valued (X, Y), can we estimate any of the mixing coefficients? Define the following estimators of the joint and marginal distributions:

μ̂(x) = (1/m) Σ_{i=1}^{m} I(X_i ≤ x),
ν̂(y) = (1/m) Σ_{i=1}^{m} I(Y_i ≤ y),
θ̂(x, y) = (1/m) Σ_{i=1}^{m} I(X_i ≤ x, Y_i ≤ y).

Let β̂ and ϕ̂ be the estimators of β and ϕ based on these empirical c.d.f.'s.
Estimation of mixing coefficients: naive approach (ii)

Theorem (M. Ahsen, M. Vidyasagar, 2013):

ϕ̂ ≥ β̂ = (m − 1)/m → 1 as m → ∞.

Justification: under the empirical probability distributions each sample has mass 1/m. The marginals are also uniform, and hence the product distribution assigns mass 1/m² to each of the m² points in the grid (x_i, y_j). The conclusion now follows from the above formula for discrete β. In other words, the naive plug-in estimator is inconsistent: it converges to 1 regardless of the true dependence.
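This failure is easy to reproduce numerically: plugging the empirical joint and marginals into the discrete β formula always returns (m − 1)/m. A sketch, assuming the m sample points are distinct (so the empirical joint is an identity pattern on an m × m grid):

```python
import numpy as np

# With m distinct pairs (x_i, y_i), the empirical joint puts mass 1/m on
# the diagonal of an m x m grid, while the empirical product measure
# spreads mass 1/m^2 over the whole grid.  The discrete beta formula then
# gives (m - 1)/m no matter what the true dependence is.
def empirical_beta(m):
    theta = np.eye(m) / m
    gamma = theta - np.outer(theta.sum(axis=1), theta.sum(axis=0))
    return 0.5 * np.abs(gamma).sum()

for m in (2, 10, 1000):
    print(empirical_beta(m), (m - 1) / m)  # the two values agree
```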
Estimation of mixing coefficients: histograms (i)

A histogram estimator f̂ of a density f based on a sample X_1, …, X_m is

f̂(x) = Σ_{j=1}^{J} ( p̂_j / (m w_j) ) I_{B_j}(x),

where
- the B_j are bins partitioning the region with observations,
- p̂_j = Σ_{i=1}^{m} I_{B_j}(X_i) counts the number of samples in bin B_j,
- w_j is the width of the j-th bin.
Estimation of mixing coefficients: histograms (ii)

Given m samples, choose J_m intervals on R so that each bin contains ⌊m/J_m⌋ or ⌊m/J_m⌋ + 1 samples from both X and Y.

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose (X, Y) ∼ θ, X ∼ μ and Y ∼ ν, with θ absolutely continuous with respect to μ ⊗ ν. Then β̂ converges to β provided that J_m/m → 0. If in addition the density f ∈ L^∞, then α̂ and ϕ̂ also converge to α and ϕ respectively. The measure-theoretic arguments used in the proof establish consistency of the estimators but do not yield error rates.
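A one-dimensional sketch of this equal-frequency binning: taking the bin edges to be empirical quantiles makes each bin hold roughly m/J_m samples (function and variable names are mine):

```python
import numpy as np

# Equal-frequency histogram density estimate: J bins whose edges are the
# empirical quantiles of the sample, with heights p_hat_j / (m * w_j) as
# in the histogram estimator above.
def histogram_density(samples, J):
    samples = np.asarray(samples, dtype=float)
    m = len(samples)
    edges = np.quantile(samples, np.linspace(0.0, 1.0, J + 1))
    counts, edges = np.histogram(samples, bins=edges)
    heights = counts / (m * np.diff(edges))
    return edges, heights

rng = np.random.default_rng(0)
edges, heights = histogram_density(rng.normal(size=1000), J=20)
print((heights * np.diff(edges)).sum())  # integrates to 1: a valid density
```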
Estimation of mixing coefficients: stochastic processes (i)

Two-step approximation:

|β̂_d(a) − β(a)| ≤ |β̂_d(a) − β_d(a)| + |β_d(a) − β(a)|,

where β_d(a) = sup_t β(σ_{t−d}^t, σ_{t+a}^{t+a+d}) and β̂_d(a) is an estimator based on

β̂_d(a) = (1/2) ∫ |f̂_d ⊗ f̂_d − f̂_{2d}|,

with f̂_d, f̂_{2d} being d- and 2d-dimensional histogram estimators.
Estimation of mixing coefficients: stochastic processes (ii)

Theorem (D. McDonald, C. Shalizi, M. Schervish, 2011): Let X_1^m be a sample from a stationary β-mixing process. For m = 2 μ_m b_m and d ≤ μ_m we have

P( |β̂_d(a) − β_d(a)| ≥ ε ) ≤ 2 exp( −μ_m ε_1² / 2 ) + 2 exp( −μ_m ε_2² / 2 ) + 4(μ_m − 1) β(b_m),

where ε_1 = ε/2 − E[|f̂_d − f_d|] and ε_2 = ε − E[|f̂_{2d} − f_{2d}|]. The proof is based on a blocking technique.
Estimation of mixing coefficients: stochastic processes (iii)

For the second term, |β_d(a) − β(a)|, a measure-theoretic argument can be used to show that it → 0 as d → ∞. Under the assumption that the densities f_d and f_{2d} are in the Sobolev space H², McDonald, Shalizi and Schervish argue that f̂_d and f̂_{2d} are consistent. Choosing d_m = O(exp(W(log m))) and w_m = O(m^{−k_m}), where

k_m = (W(log m) + 2 log m) / ( log m (2 exp(W(log m)) + 1) )

and W is the Lambert W function (the inverse of w ↦ w e^w), they show that the histogram-based estimator of β is consistent.