Three Approaches towards Optimal Property Estimation and Testing
Jiantao Jiao (Stanford EE)
Joint work with: Yanjun Han, Dmitri Pavlichin, Kartik Venkat, Tsachy Weissman
Frontiers in Distribution Testing Workshop, FOCS 2017, Oct. 14th, 2017
Statistical properties

Disclaimer: Throughout this talk, n refers to the number of samples and S refers to the alphabet size of a distribution.

1. Shannon entropy: H(P) ≜ ∑_{i=1}^S −p_i ln p_i.
2. Power sum: F_α(P) ≜ ∑_{i=1}^S p_i^α, α > 0.
3. KL divergence, χ² divergence, L1 distance, Hellinger distance: F(P, Q) ≜ ∑_{i=1}^S f(p_i, q_i) for f(x, y) = x ln(x/y), (x − y)²/y, |x − y|, (√x − √y)², respectively.
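As a concrete illustration (added here, not from the talk), these functionals are easy to evaluate when the distributions are fully known; a minimal numpy sketch:

```python
import numpy as np

def shannon_entropy(p):
    """H(P) = sum_i -p_i ln p_i, with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def power_sum(p, alpha):
    """F_alpha(P) = sum_i p_i^alpha, alpha > 0."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p[p > 0] ** alpha))

def kl_divergence(p, q):
    """D(P || Q) = sum_i p_i ln(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def l1_distance(p, q):
    """L1(P, Q) = sum_i |p_i - q_i|."""
    return float(np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))))

# Example over an alphabet of size S = 4, with U the uniform distribution.
P = np.array([0.4, 0.3, 0.2, 0.1])
U = np.full(4, 0.25)
print(shannon_entropy(P), power_sum(P, 0.5), kl_divergence(P, U), l1_distance(P, U))
```

The hard question in the talk is, of course, estimating these quantities when P (and possibly Q) is only observed through samples.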
Tolerant testing/learning/estimation

We focus on the question: how many samples are needed to achieve accuracy ε when estimating these properties from empirical data?

Example: L1(P, U_S), where U_S = (1/S, 1/S, ..., 1/S), and we observe n i.i.d. samples from P.
- (VV'11, VV'11): there exists an approach whose error is ≍ √(S/(n ln n)) when S/ln S ≲ n ≲ S; no consistent estimator exists when n ≲ S/ln S.
- The MLE plug-in L1(P̂_n, U_S) achieves error ≍ √(S/n) when n ≳ S.

Effective sample size enlargement: minimax rate-optimal with n samples ⟺ MLE with n ln n samples.

Similar results also hold for Shannon entropy (VV'11, VV'11, VV'13, WY'16, JVHW'15), power sum functionals (JVHW'15), Rényi entropy estimation (AOST'14), χ², Hellinger, and KL divergence estimation (HJW'16, BZLV'16), L_r norm estimation under the Gaussian white noise model (HJMW'17), L1 distance estimation (JHW'16), etc., with the exception of support size (WY'16).
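A quick illustrative simulation (my own addition, not from the talk) shows where the plug-in rate comes from: when the truth is itself uniform, the statistic L1(P̂_n, U_S) fluctuates at the √(S/n) scale, which is exactly the plug-in error scale quoted above.

```python
import numpy as np

def plugin_l1_to_uniform(n, S, rng):
    """Empirical L1 distance between the MLE P_hat_n and the uniform distribution U_S,
    when the true distribution is itself uniform (so the true L1 distance is 0)."""
    samples = rng.integers(0, S, size=n)
    counts = np.bincount(samples, minlength=S)
    p_hat = counts / n
    return np.sum(np.abs(p_hat - 1.0 / S))

rng = np.random.default_rng(0)
S = 1000
for n in [1000, 4000, 16000, 64000]:
    err = np.mean([plugin_l1_to_uniform(n, S, rng) for _ in range(20)])
    print(n, err, np.sqrt(S / n))  # the plug-in error tracks the sqrt(S/n) scale
```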
Effective sample size enlargement

R_minimax(F, P, n) = inf_{F̂(X_1,...,X_n)} sup_{P∈P} E|F̂ − F(P)|,   R_plug-in(F, P, n) = sup_{P∈P} E|F(P̂_n) − F(P)|,
where M_S denotes the probability simplex over S symbols.

H(P) = ∑_{i=1}^S p_i log(1/p_i),  P = M_S:  R_minimax ≍ S/(n log n) + log(S)/√n;  R_plug-in ≍ S/n + log(S)/√n
F_α(P) = ∑_{i=1}^S p_i^α (0 < α ≤ 1/2),  P = M_S:  R_minimax ≍ S/(n log n)^α;  R_plug-in ≍ S/n^α
F_α(P) (1/2 < α < 1),  P = M_S:  R_minimax ≍ S/(n log n)^α + S^{1−α}/√n;  R_plug-in ≍ S/n^α + S^{1−α}/√n
F_α(P) (1 < α < 3/2),  P = M_S:  R_minimax ≍ (n log n)^{−(α−1)};  R_plug-in ≍ n^{−(α−1)}
F_α(P) (α ≥ 3/2),  P = M_S:  R_minimax ≍ 1/√n;  R_plug-in ≍ 1/√n
∑_{i=1}^S 1(p_i ≠ 0) (support size),  P = {P : min_i p_i ≥ 1/S}:  R_minimax ≍ S e^{−Θ(max{√(n log(n)/S), n/S})};  R_plug-in ≍ S e^{−Θ(n/S)}
∑_{i=1}^S |p_i − q_i| (Q fixed),  P = M_S:  R_minimax ≍ ∑_{i=1}^S √(q_i/(n ln n)) ∧ q_i;  R_plug-in ≍ ∑_{i=1}^S √(q_i/n) ∧ q_i
Effective sample size enlargement

Divergence functionals: here P, Q ∈ M_S, and we have m samples from P and n samples from Q. For the Kullback-Leibler and χ² divergence estimators we only consider (P, Q) ∈ {(P, Q) | P, Q ∈ M_S, p_i/q_i ≤ u(S)}, where u(S) is some function of S.

∑_{i=1}^S |p_i − q_i|:  R_minimax ≍ √(S/(min{m,n} log(min{m,n})));  R_plug-in ≍ √(S/min{m,n})
(1/2) ∑_{i=1}^S (√p_i − √q_i)²:  R_minimax ≍ √(S/(min{m,n} log(min{m,n})));  R_plug-in ≍ √(S/min{m,n})
D(P‖Q) = ∑_{i=1}^S p_i log(p_i/q_i):  R_minimax ≍ S/(m log m) + S u(S)/(n log n) + log(u(S))/√m + √u(S)/√n;  R_plug-in ≍ S/m + S u(S)/n + log(u(S))/√m + √u(S)/√n
χ²(P‖Q) = ∑_{i=1}^S p_i²/q_i − 1:  R_minimax ≍ S u(S)²/(n log n) + u(S)/√m + u(S)^{3/2}/√n;  R_plug-in ≍ S u(S)²/n + u(S)/√m + u(S)^{3/2}/√n
Goal of this talk

Understand the mechanism behind the logarithmic sample size enlargement.
- For what functionals do we have this phenomenon?
- What concrete algorithms achieve this phenomenon?
- If there exist multiple approaches, what are their relative advantages and disadvantages?
First approach: Approximation methodology

Question: Is the enlargement phenomenon caused by the fact that the functionals are permutation invariant (symmetric)?

Answer: Nope. :)

Literature on the approximation methodology: VV'11 (linear estimator), WY'16, WY'16, JVHW'15, AOST'14, HJW'16, BZLV'16, HJMW'16, JHW'16.
Example: L1 distance estimation

Given Q = (q_1, q_2, ..., q_S), we estimate L1(P, Q) from i.i.d. samples from P.

Theorem (J., Han, Weissman'16)
Suppose ln S ≍ ln n and S ≥ 2. Then
  inf_{L̂} sup_{P∈M_S} E_P |L̂ − L1(P, Q)| ≍ ∑_{i=1}^S √(q_i/(n ln n)) ∧ q_i.   (1)
For the MLE, we have
  sup_{P∈M_S} E_P |L1(P̂_n, Q) − L1(P, Q)| ≍ ∑_{i=1}^S √(q_i/n) ∧ q_i.   (2)
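A quick sanity check (worked out here rather than on the slide): specializing to the uniform reference Q = U_S recovers the rates quoted earlier,
  ∑_{i=1}^S √(q_i/(n ln n)) ∧ q_i = S · (√(1/(S n ln n)) ∧ 1/S) = √(S/(n ln n)) ∧ 1,
and likewise ∑_{i=1}^S √(q_i/n) ∧ q_i = √(S/n) ∧ 1. Consistent estimation is therefore possible once n ≫ S/ln S for the optimal estimator, but only once n ≫ S for the MLE plug-in.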
Confidence sets in binomial model: coverage probability ≍ 1 − n^{−A}

[Figure: the parameter space Θ = [0, 1] with n·p̂ ∼ B(n, p); the confidence interval U(p̂) has length ∼ ln n/n when p̂ < ln n/n and length ∼ √(p̂ ln n/n) when p̂ > ln n/n.]

Theorem (J., Han, Weissman'16)
Partition [0, 1] into finitely many intervals I_i = [x_i, x_{i+1}], with x_0 = 0, x_1 ≍ ln n/n, and √x_{i+1} − √x_i ≍ √(ln n/n). Then:
1. if p ∈ I_i, then p̂ ∈ 2I_i with probability 1 − n^{−A};
2. if p̂ ∈ I_i, then p ∈ 2I_i with probability 1 − n^{−A};
3. those intervals are of the shortest length.
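A minimal sketch (my own illustration, not code from the talk) of how such a partition can be generated: start at x_1 ≍ ln n/n and grow the endpoints so that consecutive square roots differ by ≍ √(ln n/n).

```python
import numpy as np

def binomial_partition(n, c1=1.0, c2=1.0):
    """Partition [0, 1] into intervals I_i = [x_i, x_{i+1}] with x_0 = 0, x_1 = c1*ln(n)/n,
    and sqrt(x_{i+1}) - sqrt(x_i) = c2*sqrt(ln(n)/n), mirroring the theorem's construction.
    The constants c1, c2 are tuning parameters not pinned down on the slide."""
    step = np.sqrt(np.log(n) / n)
    xs = [0.0, c1 * np.log(n) / n]
    while xs[-1] < 1.0:
        xs.append(min((np.sqrt(xs[-1]) + c2 * step) ** 2, 1.0))
    return np.array(xs)  # endpoints; there are len(xs) - 1 intervals

xs = binomial_partition(n=10000)
print(len(xs) - 1, "intervals, first endpoints:", xs[:4])
```

Note that the number of intervals is only of order √(n/ln n), which is what keeps the construction manageable.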
Algorithmic description of the Approximation methodology

First conduct sample splitting to get p̂_i, p̂'_i, i.i.d. with distribution (2/n)·B(n/2, p_i). Suppose q_i ∈ I_j. For each i do the following (see the sketch after this list):

1. if p̂_i ∈ I_j, compute the best polynomial approximation of |· − q_i| on 2I_j,
     P_K(·; q_i) = argmin_{P ∈ Poly_K} max_{z ∈ 2I_j} | |z − q_i| − P(z) |,   (3)
   and then estimate |p_i − q_i| by the unbiased estimator of P_K(p_i; q_i) using p̂'_i;
2. if p̂_i ∉ I_j, estimate |p_i − q_i| by |p̂'_i − q_i|;
3. sum everything up.
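Below is a rough, self-contained Python sketch of this recipe (my own illustration: the function names, the "2I_j" proxy, the degree K, and the use of Chebyshev interpolation as a stand-in for the exact best polynomial approximation in (3) are all assumptions, not the talk's precise choices).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def falling_factorial_ratio(x, m, k):
    """Unbiased estimator of p^k when x ~ Binomial(m, p):
    x(x-1)...(x-k+1) / (m(m-1)...(m-k+1)); equals 1 for k = 0."""
    num, den = 1.0, 1.0
    for j in range(k):
        num *= (x - j)
        den *= (m - j)
    return num / den

def approx_estimator_l1(counts1, counts2, m, q, endpoints, K):
    """Approximation-methodology estimate of L1(P, Q) from two independent halves of the data.
    counts1, counts2: per-symbol counts in each half (each entry ~ Binomial(m, p_i));
    q: known reference distribution; endpoints: partition of [0, 1]; K: polynomial degree."""
    p1, p2 = np.asarray(counts1) / m, np.asarray(counts2) / m
    est = 0.0
    for i, qi in enumerate(q):
        j = np.searchsorted(endpoints, qi, side="right") - 1        # interval I_j containing q_i
        lo, hi = endpoints[j], endpoints[min(j + 1, len(endpoints) - 1)]
        lo2, hi2 = lo / 2.0, min(2.0 * hi, 1.0)                     # crude proxy for "2 I_j"
        if lo <= p1[i] <= hi:
            # Near-minimax polynomial approximation of |z - q_i| on the enlarged interval
            # (Chebyshev interpolation stands in for the exact best approximation in (3)).
            poly = C.Chebyshev.interpolate(lambda z: np.abs(z - qi), K, domain=[lo2, hi2])
            coefs = poly.convert(kind=np.polynomial.Polynomial).coef  # monomial coefficients
            # Plug in unbiased estimators of p_i^k built from the second half of the data.
            est += sum(a * falling_factorial_ratio(counts2[i], m, k) for k, a in enumerate(coefs))
        else:
            est += abs(p2[i] - qi)
    return est
```

In practice K would be of order log n (and kept modest for numerical stability of the basis conversion), and the interval constants would be tuned; the theorem's guarantees refer to the exact minimax polynomial rather than the Chebyshev stand-in.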
Why does it work?

1. Suppose p̂_i ∈ I_j. No matter what estimator we use, one can essentially assume that p_i ∈ 2I_j;
2. The bias of the MLE is approximately (Strukov and Timan'77)
     sup_{p_i ∈ 2I_j} | |p_i − q_i| − E|p̂_i − q_i| | ≍ √(q_i/n) ∧ q_i;   (4)
3. The bias of the Approximation methodology is approximately (Ditzian and Totik'87)
     sup_{p_i ∈ 2I_j} | |p_i − q_i| − P_K(p_i; q_i) | ≍ √(q_i/(n ln n)) ∧ q_i   (5)
   (the two bounds are compared numerically in the sketch below);
4. Permutation invariance does not play a role, since we are doing symbol-by-symbol bias correction;
5. The bias dominates in high dimensions (a measure concentration phenomenon).
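The gap between (4) and (5) can be eyeballed numerically; the following is an illustrative sketch with constants of my own choosing (Chebyshev interpolation again serves as a near-minimax proxy for P_K, and scipy is assumed to be available).

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from scipy.stats import binom

# Single-interval comparison: worst-case bias of the plug-in |p_hat - q| versus worst-case
# error of a degree-K polynomial approximation of |p - q| on the same interval.
n, q = 10000, 0.01
lo, hi = 0.005, 0.02              # stand-in for an interval 2*I_j containing q
K = int(np.log(n))                # degree of order log n
grid = np.linspace(lo, hi, 200)

def plugin_bias(p):
    """| |p - q| - E|p_hat - q| | with n * p_hat ~ Binomial(n, p)."""
    k = np.arange(n + 1)
    return abs(abs(p - q) - np.sum(binom.pmf(k, n, p) * np.abs(k / n - q)))

mle_bias = max(plugin_bias(p) for p in grid)

poly = C.Chebyshev.interpolate(lambda z: np.abs(z - q), K, domain=[lo, hi])
approx_err = np.max(np.abs(np.abs(grid - q) - poly(grid)))

print("plug-in bias      ~", mle_bias, "  vs sqrt(q/n)        =", np.sqrt(q / n))
print("poly approx error ~", approx_err, "  vs sqrt(q/(n ln n)) =", np.sqrt(q / (n * np.log(n))))
```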
Properties of the Approximation methodology

1. Applies to essentially any functional
2. Applies to a wide range of statistical models (binomial, Poisson, Gaussian, etc.)
3. Near-linear complexity
4. Requires an explicit polynomial approximation for each different functional
5. Needs parameter tuning in practice
Second approach: Local moment matching methodology

Motivation: Does there exist a single plug-in estimator that can replace the Approximation methodology?

Answer: No. For any plug-in rule P̂, there exists a fixed Q such that L1(P̂, Q) requires n ≫ S samples to consistently estimate L1(P, Q), while the optimal method only requires n ≫ S/ln S.

Weakened goal: What if we only consider permutation invariant functionals?

Literature on the local moment matching methodology: VV'11 (linear programming), HJW'17.