A Monte Carlo approach to a divergence minimization problem (work in progress)
IGAIA IV, June 12-17, 2016, Liblice
Michel Broniatowski, Université Pierre et Marie Curie, Paris, France
June 13, 2016
Contents
From large deviations to Monte Carlo based minimization
Divergences
Large deviations for the bootstrapped empirical measure
A minimization problem
Minimum of the Kullback divergence
Minimum of the likelihood divergence
Building weights
Exponential families and their variance functions, minimizing Cressie-Read divergences
Rare events and Gibbs conditional principle
Looking for the minimizers
An inferential principle for minimization

A sequence of random elements $X_n$ with values in a measurable space $(T, \mathcal{T})$ satisfies a Large Deviation Principle (LDP) with rate $\Phi$ whenever, for all measurable sets $\Omega \subset T$,
\[
-\Phi(\operatorname{int}(\Omega)) \le \liminf_{n\to\infty} \varepsilon_n \log \Pr(X_n \in \Omega) \le \limsup_{n\to\infty} \varepsilon_n \log \Pr(X_n \in \Omega) \le -\Phi(\operatorname{cl}(\Omega))
\]
for some positive sequence $\varepsilon_n$, where $\operatorname{int}(\Omega)$ (resp. $\operatorname{cl}(\Omega)$) denotes the interior (resp. the closure) of $\Omega$ in $T$, and $\Phi(\Omega) := \inf\{\Phi(t);\, t \in \Omega\}$. The $\sigma$-field $\mathcal{T}$ is the Borel one defined by a given basis on $T$.

For subsets $\Omega$ in $\mathcal{T}$ such that
\[
\Phi(\operatorname{int}(\Omega)) = \Phi(\operatorname{cl}(\Omega)) \tag{1}
\]
it follows by inclusion that
\[
-\lim_{n\to\infty} \varepsilon_n \log \Pr(X_n \in \Omega) = \Phi(\operatorname{int}(\Omega)) = \Phi(\operatorname{cl}(\Omega)) = \inf_{t\in\Omega}\Phi(t) = \Phi(\Omega). \tag{2}
\]
Assume that we are given such a family of random elements $X_1, X_2, \ldots$ together with a set $\Omega \subset T$ which satisfies (1), and suppose that we are interested in estimating $\Phi(\Omega)$. Whenever we are able to simulate a family of replicates $X_{n,1}, \ldots, X_{n,K}$ such that $\Pr(X_n \in \Omega)$ can be approximated by the frequency of those $X_{n,i}$'s falling in $\Omega$, say
\[
f_{n,K}(\Omega) := \frac{1}{K}\,\operatorname{card}\{i : X_{n,i} \in \Omega\}, \tag{3}
\]
a natural estimator of $\Phi(\Omega)$ is
\[
\Phi_{n,K}(\Omega) := -\varepsilon_n \log f_{n,K}(\Omega).
\]
We have thus replaced the variational problem $\Phi(\Omega) := \inf\{\Phi(\omega);\, \omega \in \Omega\}$ by a much simpler one, namely the Monte Carlo approximation defined by (3). There is no need to identify the set of points $\omega$ in $\Omega$ which minimize $\Phi$.
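As a rough illustration of this scheme (not part of the slides), the following Python sketch turns the frequency (3) into the estimator $\Phi_{n,K}(\Omega)$; the names `simulate_Xn` and `in_Omega` are hypothetical placeholders for the problem-specific simulator and the indicator of $\Omega$, and $\varepsilon_n = 1/n$ is assumed.

```python
import numpy as np

def ldp_estimate(simulate_Xn, in_Omega, n, K, eps_n=None, rng=None):
    """Monte Carlo proxy of Phi(Omega) = -eps_n * log Pr(X_n in Omega).

    simulate_Xn(rng) -> one replicate X_{n,i}   (problem specific, hypothetical)
    in_Omega(x)      -> True if the replicate lies in Omega (hypothetical)
    """
    rng = np.random.default_rng() if rng is None else rng
    eps_n = 1.0 / n if eps_n is None else eps_n
    hits = sum(in_Omega(simulate_Xn(rng)) for _ in range(K))
    if hits == 0:
        return np.inf                 # Omega never reached; increase K or use importance sampling
    f_nK = hits / K                   # frequency estimate (3)
    return -eps_n * np.log(f_nK)      # estimator Phi_{n,K}(Omega)
```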
This program can be carried out whenever we can identify the sequence of random elements $X_n$ for which, given the criterion $\Phi$ and the set $\Omega$, the limit statement (2) holds. Here the $X_n$'s are empirical measures of some kind, and $\Phi(\Omega)$ takes the form $\phi(\Omega, P)$, the infimum of a divergence between some reference probability measure $P$ and a class of probability measures $\Omega$.

Standpoint: $\phi(\Omega, P)$ is an LDP rate for specific $X_n$'s to be built.

Applications: choice of models, estimation of the minimizers (dichotomy, etc.)
Divergences

Let $(\mathcal{X}, \mathcal{B})$ be a measurable Polish space and let $P$ be a given reference probability measure (p.m.) on $(\mathcal{X}, \mathcal{B})$. Denote by $M_1$ the set of all p.m.'s on $(\mathcal{X}, \mathcal{B})$. Let $\varphi$ be a proper closed convex function from $]-\infty, +\infty[$ to $[0, +\infty]$ with $\varphi(1) = 0$ and such that its domain $\operatorname{dom}\varphi := \{x \in \mathbb{R} : \varphi(x) < \infty\}$ is a finite or infinite interval.

For any measure $Q$ in $M_1$, the $\phi$-divergence between $Q$ and $P$ is defined by
\[
\phi(Q, P) := \int_{\mathcal{X}} \varphi\!\left(\frac{dQ}{dP}(x)\right) dP(x)
\]
if $Q \ll P$. When $Q$ is not absolutely continuous w.r.t. $P$, set $\phi(Q, P) = +\infty$.

The $\phi$-divergences between p.m.'s were introduced in Csiszár (1963) as "$f$-divergences", with a slightly different definition. For any p.m. $P$, the mapping $Q \in M_1 \mapsto \phi(Q, P)$ is convex and takes nonnegative values. When $Q = P$ then $\phi(Q, P) = 0$. Furthermore, if the function $x \mapsto \varphi(x)$ is strictly convex on a neighborhood of $x = 1$, then $\phi(Q, P) = 0$ if and only if $Q = P$.
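For finitely supported measures the definition reduces to a weighted sum. The sketch below (an illustration, not from the slides) computes $\phi(Q,P)$ for probability vectors on a common finite support, with the convention $\phi(Q,P) = +\infty$ when $Q$ is not absolutely continuous w.r.t. $P$.

```python
import numpy as np

def phi_divergence(q, p, phi):
    """phi(Q,P) = sum_i p_i * phi(q_i / p_i) for discrete Q, P on the same support."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    if np.any((p == 0) & (q > 0)):        # Q not absolutely continuous w.r.t. P
        return np.inf
    mask = p > 0
    return float(np.sum(p[mask] * phi(q[mask] / p[mask])))

def phi1(x):
    """Kullback-Leibler generator x*log(x) - x + 1, with phi1(0) = 1 by continuity."""
    x = np.asarray(x, float)
    return np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)) - x + 1.0, 1.0)

p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.4, 0.3, 0.2, 0.1])
print(phi_divergence(q, p, phi1))         # KL(Q, P)
```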
Cressie-Read divergences

When defined on $M_1$, the divergences associated with
$\varphi_1(x) = x\log x - x + 1$ (Kullback-Leibler),
$\varphi_0(x) = -\log x + x - 1$ ($KL_m$, likelihood),
$\varphi_2(x) = \frac{1}{2}(x-1)^2$ (Pearson chi-square),
$\varphi_{-1}(x) = \frac{1}{2}(x-1)^2/x$ (modified chi-square, Neyman),
$\varphi_{1/2}(x) = 2(\sqrt{x}-1)^2$ (Hellinger)
all belong to the class of Cressie and Read power divergences
\[
x \in\, ]0, +\infty[ \;\mapsto\; \varphi_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma-1)}. \tag{4}
\]
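A small sketch of the power-divergence generators (4), with the $\gamma \to 0$ and $\gamma \to 1$ limits handled explicitly; it plugs into the `phi_divergence` helper above (again an illustration, not from the slides).

```python
import numpy as np

def cressie_read(gamma):
    """Return the Cressie-Read generator phi_gamma of (4), valid for x > 0."""
    if gamma == 1.0:                              # KL limit: x log x - x + 1
        return lambda x: x * np.log(x) - x + 1.0
    if gamma == 0.0:                              # likelihood limit: -log x + x - 1
        return lambda x: -np.log(x) + x - 1.0
    return lambda x: (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

# e.g. gamma = 2 gives the chi-square generator (x-1)^2 / 2
p = np.array([0.3, 0.3, 0.4])
q = np.array([0.2, 0.5, 0.3])
for g in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(g, phi_divergence(q, p, cressie_read(g)))
```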
Extensions

The power divergence functionals $Q \in M_1 \mapsto \phi_\gamma(Q, P)$ can be defined on the whole vector space $M$ of finite signed measures via an extension of the convex functions $\varphi_\gamma$: for all $\gamma \in \mathbb{R}$ such that $x \mapsto \varphi_\gamma(x)$ is either not defined on $]-\infty, 0[$, or defined but not convex on all of $\mathbb{R}$, we extend its definition as follows:
\[
x \in\, ]-\infty, +\infty[ \;\mapsto\;
\begin{cases}
\varphi_\gamma(x) & \text{if } x \in [0, +\infty[, \\
+\infty & \text{if } x \in\, ]-\infty, 0[.
\end{cases} \tag{5}
\]
Note that for the $\chi^2$-divergence, for instance, $\varphi_2(x) := \frac{1}{2}(x-1)^2$ is defined and convex on all of $\mathbb{R}$.
The conjugate (or Legendre transform) of $\varphi$ will be denoted $\varphi^*$:
\[
t \in \mathbb{R} \;\mapsto\; \varphi^*(t) := \sup_{x \in \mathbb{R}} \{tx - \varphi(x)\}.
\]
Property: $\varphi$ is essentially smooth iff $\varphi^*$ is strictly convex; then
\[
\varphi^*(t) = t\,{\varphi'}^{-1}(t) - \varphi\!\left({\varphi'}^{-1}(t)\right)
\quad\text{and}\quad
{\varphi^*}'(t) = {\varphi'}^{-1}(t).
\]
In the present setting this holds.
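As a concrete instance of these duality formulas (a worked example added here for illustration), take the Kullback-Leibler generator $\varphi_1$:
\[
\varphi_1(x) = x\log x - x + 1, \qquad \varphi_1'(x) = \log x, \qquad {\varphi_1'}^{-1}(t) = e^t,
\]
so that
\[
\varphi_1^*(t) = t\,e^t - \varphi_1(e^t) = t e^t - \left(t e^t - e^t + 1\right) = e^t - 1,
\qquad
{\varphi_1^*}'(t) = e^t = {\varphi_1'}^{-1}(t).
\]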
The bootstrapped empirical measure

Let $Y, Y_1, Y_2, \ldots$ denote a sequence of positive i.i.d. random variables. We assume that $Y$ satisfies the so-called Cramér condition:
\[
\mathcal{N} := \left\{ t \in \mathbb{R} : \Lambda_Y(t) := \log E\,e^{tY} < \infty \right\}
\]
contains a neighborhood of $0$ (in particular, it has non-void interior).

Consider the weights $W_i^n$, $1 \le i \le n$,
\[
W_i^n := \frac{Y_i}{\sum_{j=1}^n Y_j},
\]
which define a vector of exchangeable variables $(W_1^n, \ldots, W_n^n)$ for all $n \ge 1$.
The data $x_1^n, \ldots, x_n^n$: we assume that
\[
\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \delta_{x_i^n} = P \quad\text{a.s.}
\]
and we define the bootstrapped empirical measure of $(x_1^n, \ldots, x_n^n)$ by
\[
P_n^W := \sum_{i=1}^n W_i^n\, \delta_{x_i^n}.
\]
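A minimal sketch of the construction (illustrative only; the exponential law for $Y$ is just one admissible positive choice): draw positive i.i.d. weights, normalize them, and attach them to the Dirac masses at the data points.

```python
import numpy as np

def bootstrap_weights(n, draw_Y, rng):
    """Exchangeable weights W_i^n = Y_i / sum_j Y_j from positive i.i.d. Y's."""
    y = draw_Y(rng, n)
    return y / y.sum()

def weighted_empirical_measure(x, draw_Y, rng):
    """Return (support points, weights) representing P_n^W = sum_i W_i^n delta_{x_i^n}."""
    w = bootstrap_weights(len(x), draw_Y, rng)
    return np.asarray(x), w

rng = np.random.default_rng(0)
x = rng.normal(size=50)                                   # the data x_1^n, ..., x_n^n
draw_exp = lambda rng, n: rng.exponential(1.0, size=n)    # one admissible positive Y
support, weights = weighted_empirical_measure(x, draw_exp, rng)
print(weights.sum())                                      # == 1: P_n^W is a probability measure
```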
A Sanov-type result for the weighted bootstrap empirical measure

Define the Legendre transform of $\Lambda_Y$, say $\Lambda^*$, defined on $\operatorname{Im}\Lambda_Y'$ by $\Lambda^*(x) := \sup_t \{tx - \Lambda_Y(t)\}$.

Theorem. Under the above hypotheses and notation, the sequence $P_n^W$ obeys an LDP on the space of all finite signed measures on $\mathcal{X}$ equipped with the weak convergence topology, with rate function
\[
\phi(Q, P) :=
\begin{cases}
\displaystyle \inf_{m > 0} \int \Lambda^*\!\left(m\,\frac{dQ}{dP}(x)\right) dP(x) & \text{if } Q \ll P, \\
+\infty & \text{otherwise.}
\end{cases} \tag{6}
\]
This theorem is a variation on Corollary 3.3 in Trashorras and Wintenberger (2014).
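For finitely supported $P$ and $Q$ the rate function (6) can be evaluated numerically: compute $\Lambda^*$ by a one-dimensional maximization and then minimize over $m$. The sketch below is purely illustrative; the cumulant generating function used, $\Lambda_Y(t) = -a\log(1-t)$ for a Gamma-distributed $Y$ with shape $a$, is an arbitrary admissible choice and not one singled out by the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def Lambda_gamma(t, a=2.0):
    """log E exp(tY) for Y ~ Gamma(a, 1); finite for t < 1."""
    return -a * np.log(1.0 - t) if t < 1.0 else np.inf

def Lambda_star(x, Lambda, t_lo=-50.0, t_hi=0.999):
    """Legendre transform Lambda*(x) = sup_t (t*x - Lambda(t)), by 1-d maximization."""
    res = minimize_scalar(lambda t: -(t * x - Lambda(t)), bounds=(t_lo, t_hi), method="bounded")
    return -res.fun

def rate_functional(q, p, Lambda):
    """inf_{m>0} sum_i p_i * Lambda*(m * q_i / p_i) for discrete Q << P, cf. (6)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    r = q / p
    inner = lambda m: float(np.sum(p * np.array([Lambda_star(m * ri, Lambda) for ri in r])))
    res = minimize_scalar(lambda u: inner(np.exp(u)), bounds=(-5.0, 5.0), method="bounded")
    return res.fun

p = np.array([0.3, 0.3, 0.4])
q = np.array([0.2, 0.5, 0.3])
print(rate_functional(q, p, Lambda_gamma))
```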
Estimation of the minimum of the Kullback divergence

Set $Y_1, \ldots, Y_n$ i.i.d. standard exponential. Then $\Lambda^*(x) = \varphi_1(x) := x\log x - x + 1$ and
\[
\inf_{m>0} \int \Lambda^*\!\left(m\,\frac{dQ}{dP}(x)\right) dP(x) = \int \Lambda^*\!\left(\frac{dQ}{dP}(x)\right) dP(x) = KL(Q, P).
\]
Repeat the sampling of $(Y_1, \ldots, Y_n)$, i.i.d. $\mathcal{E}(1)$, $K$ times. Hence, for sets $\Omega$ such that $KL(\operatorname{int}\Omega, P) = KL(\operatorname{cl}\Omega, P)$, for large $K$
\[
-\frac{1}{n}\log\left( \frac{1}{K}\,\operatorname{card}\left\{ j : P_n^{W,j} \in \Omega,\; 1 \le j \le K \right\} \right)
\]
is a proxy of
\[
-\frac{1}{n}\log \Pr\left( P_n^W \in \Omega \right)
\]
and therefore an estimator of $KL(\Omega, P)$.
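A minimal sketch of this procedure. The set $\Omega$ used below is a hypothetical example (all p.m.'s whose mean exceeds a threshold $a$), not one from the slides; in practice $\Omega$ is the model of interest, and membership of $P_n^W$ in $\Omega$ must be decidable from the data and the weights.

```python
import numpy as np

def estimate_divergence_min(x, in_Omega, draw_Y, K=50_000, rng=None):
    """Monte Carlo estimator -(1/n) log( #{j : P_n^{W,j} in Omega} / K )."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, float)
    n = len(x)
    hits = 0
    for _ in range(K):
        y = draw_Y(rng, n)
        w = y / y.sum()                      # exchangeable weights W^n
        hits += in_Omega(x, w)               # is P_n^W = sum_i w_i delta_{x_i} in Omega?
    return np.inf if hits == 0 else -np.log(hits / K) / n

# Hypothetical example set: Omega = { Q : mean of Q >= a }, exponential weights for KL
rng = np.random.default_rng(1)
x = rng.normal(size=100)                     # data, empirical mean near 0
a = 0.2
in_Omega = lambda x, w: float(np.dot(w, x) >= a)
draw_exp = lambda rng, n: rng.exponential(1.0, size=n)
print(estimate_divergence_min(x, in_Omega, draw_exp, rng=rng))
```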
When $Y$ is $\mathcal{E}(1)$ then, by Pyke's theorem, $(W_1^n, \ldots, W_n^n)$ coincides with the vector of spacings of the order statistics of i.i.d. r.v.'s uniformly distributed on $(0,1)$, i.e., the simplest bootstrap version of $P_n$ based on exchangeable weights. It also holds with these weights that
\[
\lim_{n\to\infty} \left( \frac{1}{n}\log \Pr\left( P_n^W \in \Omega \,\middle|\, x_1^n, \ldots, x_n^n \right) - \frac{1}{n}\log \Pr\left( P_n \in \Omega \right) \right) = 0.
\]
This weighted bootstrap is the only LDP-efficient one.
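A quick numerical check of this identification (illustration only): $n$ normalized standard exponentials and the $n$ spacings of $(0,1)$ generated by $n-1$ ordered uniforms should be indistinguishable in distribution; here the first coordinate of both constructions is compared with a two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
n, reps = 20, 5000

# normalized exponential weights W_1^n, ..., W_n^n
Y = rng.exponential(1.0, size=(reps, n))
W = Y / Y.sum(axis=1, keepdims=True)

# n spacings of (0,1) generated by n-1 ordered uniforms
U = np.sort(rng.uniform(size=(reps, n - 1)), axis=1)
edges = np.concatenate([np.zeros((reps, 1)), U, np.ones((reps, 1))], axis=1)
S = np.diff(edges, axis=1)

# compare the distribution of the first coordinate of both constructions
print(ks_2samp(W[:, 0], S[:, 0]))   # large p-value expected under Pyke's theorem
```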
Estimation of the minimum of the likelihood divergence

\[
KL_m(Q, P) := \int \varphi_0\!\left(\frac{dQ}{dP}\right) dP = -\int \log\!\left(\frac{dQ}{dP}\right) dP,
\qquad \varphi_0(x) := -\log x + x - 1.
\]
Set $Y_1, \ldots, Y_n$ i.i.d. Poisson(1); then $\Lambda^*(x) = \varphi_0(x) := -\log x + x - 1$ and
\[
\inf_{m>0} \int \Lambda^*\!\left(m\,\frac{dQ}{dP}(x)\right) dP(x) = \int \Lambda^*\!\left(\frac{dQ}{dP}(x)\right) dP(x) = KL_m(Q, P).
\]
Repeat the sampling of $(Y_1, \ldots, Y_n)$, i.i.d. Poisson(1), $K$ times. For large $K$,
\[
-\frac{1}{n}\log\left( \frac{1}{K}\,\operatorname{card}\left\{ j : P_n^{W,j} \in \Omega,\; 1 \le j \le K \right\} \right)
\]
is an estimator of $KL_m(\Omega, P)$, since it is a proxy of
\[
-\frac{1}{n}\log \Pr\left( P_n^W \in \Omega \right).
\]
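The procedure is the same as in the Kullback case, with Poisson(1) replacing exponential weights. A standalone sketch follows; the example set $\Omega$ is again a hypothetical mean constraint, and the (exponentially unlikely) event that all Poisson draws vanish is skipped.

```python
import numpy as np

def estimate_KLm_min(x, in_Omega, K=50_000, rng=None):
    """Same frequency estimator as before, with Poisson(1) weights."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, float)
    n = len(x)
    hits = 0
    for _ in range(K):
        y = rng.poisson(1.0, size=n).astype(float)
        if y.sum() == 0.0:                   # all weights zero: skip this replicate
            continue
        w = y / y.sum()
        hits += in_Omega(x, w)
    return np.inf if hits == 0 else -np.log(hits / K) / n

rng = np.random.default_rng(3)
x = rng.normal(size=100)
in_Omega = lambda x, w: float(np.dot(w, x) >= 0.2)   # hypothetical example set
print(estimate_KLm_min(x, in_Omega, rng=rng))
```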