Mean estimation: median-of-means tournaments

Gábor Lugosi
ICREA, Pompeu Fabra University, BGSE

Based on joint work with Luc Devroye (McGill, Montreal), Matthieu Lerasle (CNRS, Nice), Roberto Imbuzeiro Oliveira (IMPA, Rio), and Shahar Mendelson (Technion and ANU).
estimating the mean

Given X_1, ..., X_n, a real i.i.d. sequence, estimate µ = E X_1.

"Obvious" choice: the empirical mean
$$ \overline{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i . $$

By the central limit theorem, if X has a finite variance σ²,
$$ \lim_{n \to \infty} P\left\{ \sqrt{n}\, |\overline{\mu}_n - \mu| > \sigma \sqrt{2 \log(2/\delta)} \right\} \le \delta . $$
We would like non-asymptotic inequalities of a similar form.

If the distribution is sub-Gaussian, that is, E exp(λ(X − µ)) ≤ exp(σ²λ²/2), then with probability at least 1 − δ,
$$ |\overline{\mu}_n - \mu| \le \sigma \sqrt{\frac{2 \log(2/\delta)}{n}} . $$
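For completeness, the last display follows from the standard Chernoff argument (a routine computation not spelled out on the slide): for any t > 0,
$$ P\{ \overline{\mu}_n - \mu > t \} \le e^{-\lambda n t}\, E\, e^{\lambda \sum_{i=1}^{n} (X_i - \mu)} \le e^{-\lambda n t + n \sigma^2 \lambda^2 / 2} = e^{-n t^2 / (2\sigma^2)} \quad \text{at } \lambda = t/\sigma^2 . $$
Setting the right-hand side to δ/2 gives t = σ√(2 log(2/δ)/n), and a union bound over the two tails yields the two-sided inequality.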
empirical mean — heavy tails

The empirical mean is computationally attractive. It requires no a priori knowledge and automatically scales with σ.

If the distribution is not sub-Gaussian, we still have Chebyshev's inequality: with probability at least 1 − δ,
$$ |\overline{\mu}_n - \mu| \le \sigma \sqrt{\frac{1}{n \delta}} . $$
This is an exponentially weaker bound in δ. It especially hurts when many means are estimated simultaneously.

This is the best one can say: Catoni (2012) shows that for each δ there exists a distribution with variance σ² such that
$$ P\left\{ |\overline{\mu}_n - \mu| \ge \sigma \sqrt{\frac{c}{n \delta}} \right\} \ge \delta . $$
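To see the gap concretely (numbers supplied here for illustration, not from the slides): at confidence level δ = 0.01, Chebyshev gives a radius of σ√(1/(nδ)) = 10σ/√n, while the sub-Gaussian bound gives σ√(2 log(2/δ)/n) ≈ 3.3σ/√n; at δ = 10⁻⁶ the comparison is 1000σ/√n versus roughly 5.4σ/√n. The 1/√δ versus √log(1/δ) dependence is what "exponentially weaker" refers to.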
median of means

A simple estimator is median-of-means. Goes back to Nemirovsky and Yudin (1983), Jerrum, Valiant, and Vazirani (1986), Alon, Matias, and Szegedy (2002).
$$ \widehat{\mu}_{\mathrm{MM}} \stackrel{\mathrm{def}}{=} \mathrm{median}\left( \frac{1}{m} \sum_{t=1}^{m} X_t \, , \; \ldots \, , \; \frac{1}{m} \sum_{t=(k-1)m+1}^{km} X_t \right) $$

Lemma. Let δ ∈ (0, 1), k = 8 log δ⁻¹, and m = n / (8 log δ⁻¹). Then with probability at least 1 − δ,
$$ |\widehat{\mu}_{\mathrm{MM}} - \mu| \le \sigma \sqrt{\frac{32 \log(1/\delta)}{n}} . $$
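A minimal Python sketch of the estimator (our illustration, not code from the talk); the lemma's choice of parameters corresponds to k = ⌈8 log(1/δ)⌉ blocks:

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means estimate of the mean of a 1-d sample.

    Splits the sample into k blocks, averages each block, and
    returns the median of the k block means.
    """
    x = np.asarray(x, dtype=float)
    m = len(x) // k                          # block size; remainder discarded
    block_means = x[:m * k].reshape(k, m).mean(axis=1)
    return np.median(block_means)

# Usage sketch: for confidence level delta, take
#   k = int(np.ceil(8 * np.log(1 / delta)))
# and call median_of_means(samples, k).
```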
proof

By Chebyshev, each block mean is within distance σ√(4/m) of µ with probability 3/4. The probability that the median is not within distance σ√(4/m) of µ is at most P{Bin(k, 1/4) ≥ k/2}, which is exponentially small in k.
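To make both steps quantitative (a routine calculation filling in "exponentially small"): each block mean has variance σ²/m, so Chebyshev at distance σ√(4/m) = 2σ/√m gives failure probability at most 1/4 per block, and by Hoeffding's inequality
$$ P\{ \mathrm{Bin}(k, 1/4) \ge k/2 \} \le e^{-2k(1/2 - 1/4)^2} = e^{-k/8} = \delta \quad \text{for } k = 8 \log(1/\delta) . $$
Substituting m = n/(8 log(1/δ)) into the radius gives 2σ/√m = σ√(32 log(1/δ)/n), which is the bound in the lemma.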
median of means

• Sub-Gaussian deviations.
• Scales automatically with σ.
• Parameters depend on the required confidence level δ.
• See Lerasle and Oliveira (2012), Hsu and Sabato (2013), Minsker (2014) for generalizations.
• Also works when the variance is infinite. If E |X − E X|^{1+α} = M for some α ∈ (0, 1], then, with probability at least 1 − δ,
$$ |\widehat{\mu}_{\mathrm{MM}} - \mu| \le \left( \frac{8 (12 M)^{1/\alpha} \ln(1/\delta)}{n} \right)^{\alpha/(1+\alpha)} . $$
why sub-Gaussian?

Sub-Gaussian bounds are the best one can hope for when the variance is finite. In fact, for any M > 0, α ∈ (0, 1], δ > 2e^{−n/4}, and mean estimator µ̂_n, there exists a distribution with E |X − E X|^{1+α} = M such that, with probability at least δ,
$$ |\widehat{\mu}_n - \mu| \ge \left( \frac{M^{1/\alpha} \ln(1/\delta)}{n} \right)^{\alpha/(1+\alpha)} . $$

Proof: The distributions given by P₊({0}) = 1 − p, P₊({c}) = p and P₋({0}) = 1 − p, P₋({−c}) = p are indistinguishable if all n samples are equal to 0, an event of probability (1 − p)ⁿ under both; on that event any estimator returns the same value and so must err by at least pc for one of the two means ±pc.
why sub-Gaussian?

This shows optimality of the median-of-means estimator for all α. It also shows that finite variance is necessary even for the rate n^{−1/2}. One cannot hope to get anything better than sub-Gaussian tails: Catoni proved that the sample mean is optimal for the class of Gaussian distributions.
multiple-δ estimators

Do there exist estimators that are sub-Gaussian simultaneously for all confidence levels?

An estimator µ̂_n is multiple-δ sub-Gaussian for a class of distributions P and δ_min if for all δ ∈ [δ_min, 1) and all distributions in P, with probability at least 1 − δ,
$$ |\widehat{\mu}_n - \mu| \le L \sigma \sqrt{\frac{\log(2/\delta)}{n}} . $$

The picture is more complex than before.
known variance

Given 0 < σ₁ ≤ σ₂ < ∞, define the class
$$ \mathcal{P}_{[\sigma_1^2, \sigma_2^2]} = \left\{ P : \sigma_1^2 \le \sigma_P^2 \le \sigma_2^2 \right\} . $$
Let R = σ₂/σ₁.

• If R is bounded, then there exists a multiple-δ sub-Gaussian estimator with δ_min = 4e^{1−n/2};
• If R is unbounded, then there is no multiple-δ sub-Gaussian estimator for any L and δ_min → 0.

A sharp distinction. The exponentially small value of δ_min is best possible.
construction of multiple-δ estimator

Reminiscent of Lepski's method of adaptive estimation.

For k = 1, ..., K = log₂(1/δ_min), use the median-of-means estimator to construct confidence intervals I_k such that
$$ P\{ \mu \notin I_k \} \le 2^{-k} . $$
(This is where knowledge of σ₂ and boundedness of R is used.) Define
$$ \widehat{k} = \min\left\{ k : \bigcap_{j=k}^{K} I_j \ne \emptyset \right\} . $$
Finally, let µ̂_n be the midpoint of ∩_{j=k̂}^{K} I_j, as sketched below.
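The following schematic Python version is only meant to make the intersection step concrete; `mom_confidence_interval` is a hypothetical helper, passed in by the caller, that would build I_k from median-of-means using the known σ₂ and the boundedness of R:

```python
import numpy as np

def multi_delta_estimate(x, sigma2, delta_min, mom_confidence_interval):
    """Lepski-style combination of median-of-means confidence intervals.

    `mom_confidence_interval(x, sigma2, level)` is assumed to return an
    interval (lo, hi) containing mu with probability at least 1 - level.
    """
    K = int(np.ceil(np.log2(1.0 / delta_min)))
    intervals = [mom_confidence_interval(x, sigma2, 2.0 ** (-k))
                 for k in range(1, K + 1)]
    # khat = smallest k such that the intersection of I_k, ..., I_K is
    # non-empty; the loop must succeed since I_K alone is non-empty.
    for khat in range(K):
        lo = max(iv[0] for iv in intervals[khat:])
        hi = min(iv[1] for iv in intervals[khat:])
        if lo <= hi:
            return 0.5 * (lo + hi)   # midpoint of the intersection
```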
proof

For any k = 1, ..., K,
$$ P\left\{ |\widehat{\mu}_n - \mu| > |I_k| \right\} \le P\left\{ \exists j \ge k : \mu \notin I_j \right\} , $$
because if µ ∈ ∩_{j=k}^{K} I_j, then ∩_{j=k}^{K} I_j is non-empty (so k̂ ≤ k) and therefore µ̂_n ∈ ∩_{j=k}^{K} I_j ⊆ I_k, which also contains µ. But
$$ P\left\{ \exists j \ge k : \mu \notin I_j \right\} \le \sum_{j=k}^{K} P\{ \mu \notin I_j \} \le 2^{1-k} . $$
higher moments

For η ≥ 1 and α ∈ (2, 3], define
$$ \mathcal{P}_{\alpha, \eta} = \left\{ P : E |X - \mu|^{\alpha} \le (\eta \sigma)^{\alpha} \right\} . $$
Then for some C = C(α, η) there exists a multiple-δ estimator with a constant L and δ_min = e^{−n/C} for all sufficiently large n.
k-regular distributions

This follows from a more general result. Define
$$ p_-(j) = P\left\{ \sum_{i=1}^{j} X_i \le j \mu \right\} \quad \text{and} \quad p_+(j) = P\left\{ \sum_{i=1}^{j} X_i \ge j \mu \right\} . $$
A distribution is k-regular if for all j ≥ k, min(p₊(j), p₋(j)) ≥ 1/3. For this class there exists a multiple-δ estimator with a constant L and δ_min = e^{−n/k} for all n.
multivariate distributions

Let X be a random vector taking values in R^d with mean µ = E X and covariance matrix Σ = E (X − µ)(X − µ)ᵀ. Given an i.i.d. sample X_1, ..., X_n, we want an estimator of µ that has sub-Gaussian performance.

What is sub-Gaussian here? If X has a multivariate Gaussian distribution, the sample mean $\overline{\mu}_n = (1/n) \sum_{i=1}^{n} X_i$ satisfies, with probability at least 1 − δ,
$$ \| \overline{\mu}_n - \mu \| \le \sqrt{\frac{\mathrm{Tr}(\Sigma)}{n}} + \sqrt{\frac{2 \lambda_{\max} \log(1/\delta)}{n}} , $$
where λ_max is the largest eigenvalue of Σ. Can one construct mean estimators with similar performance for a large class of distributions?
coordinate-wise median of means

Coordinate-wise median of means yields the bound
$$ \| \widehat{\mu}_{\mathrm{MM}} - \mu \| \le K \sqrt{\frac{\mathrm{Tr}(\Sigma) \log(d/\delta)}{n}} . $$
We can do better.
multivariate median of means

Hsu and Sabato (2013) and Minsker (2015) extended the median-of-means estimate. Minsker proposes an analogous estimate that uses the multivariate (geometric) median
$$ \mathrm{Med}(x_1, \ldots, x_N) = \mathop{\mathrm{argmin}}_{y \in \mathbb{R}^d} \sum_{i=1}^{N} \| y - x_i \| . $$
For this estimate, with probability at least 1 − δ,
$$ \| \widehat{\mu}_{\mathrm{MM}} - \mu \| \le K \sqrt{\frac{\mathrm{Tr}(\Sigma) \log(1/\delta)}{n}} . $$
No further assumption or knowledge of the distribution is required. Computationally feasible. Dimension free. Almost sub-Gaussian, but not quite: log(1/δ) multiplies Tr(Σ) rather than only λ_max.
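Minsker's estimator requires the geometric median of the block means, which is commonly computed by Weiszfeld's iteratively reweighted averaging; the sketch below is our illustration, not code from the talk:

```python
import numpy as np

def geometric_median(points, n_iter=100, tol=1e-7):
    """Geometric median via Weiszfeld's algorithm.

    points: (N, d) array. Approximates the minimizer of sum_i ||y - x_i||.
    """
    y = points.mean(axis=0)                      # start from the centroid
    for _ in range(n_iter):
        dist = np.linalg.norm(points - y, axis=1)
        dist = np.maximum(dist, 1e-12)           # avoid division by zero
        w = 1.0 / dist
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def multivariate_mom(X, k):
    """Minsker-style estimator: geometric median of k block means."""
    n, d = X.shape
    m = n // k
    block_means = X[:m * k].reshape(k, m, d).mean(axis=1)
    return geometric_median(block_means)
```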
median-of-means tournament

We propose a new estimator with a purely sub-Gaussian performance, without further conditions.

The mean µ is the minimizer of f(x) = E ‖X − x‖². For any pair a, b ∈ R^d, we try to guess whether f(a) < f(b) and set up a "tournament". Partition the data points into k blocks of size m = n/k. We say that a defeats b if
$$ \frac{1}{m} \sum_{i \in B_j} \| X_i - a \|^2 < \frac{1}{m} \sum_{i \in B_j} \| X_i - b \|^2 $$
on more than k/2 of the blocks B_j.
median-of-means tournament

Within each block compute
$$ Y_j = \frac{1}{m} \sum_{i \in B_j} X_i . $$
Then, equivalently, a defeats b if ‖Y_j − a‖ < ‖Y_j − b‖ on more than k/2 of the blocks B_j.

Lemma. Let k = ⌈200 log(2/δ)⌉. With probability at least 1 − δ, µ defeats all b ∈ R^d such that ‖b − µ‖ ≥ r, where
$$ r = \max\left( 800 \sqrt{\frac{\mathrm{Tr}(\Sigma)}{n}} \, , \; 240 \sqrt{\frac{\lambda_{\max} \log(2/\delta)}{n}} \right) . $$
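A sketch of the comparison defining "a defeats b", in the same Python style as the earlier blocks (again our illustration; the choice of k follows the lemma):

```python
import numpy as np

def block_means(X, k):
    """Means of k equal-size blocks of the sample X (shape (n, d))."""
    n, d = X.shape
    m = n // k
    return X[:m * k].reshape(k, m, d).mean(axis=1)

def defeats(a, b, Y):
    """a defeats b if a is closer than b to the block mean Y_j
    on more than half of the k blocks."""
    da = np.linalg.norm(Y - a, axis=1)
    db = np.linalg.norm(Y - b, axis=1)
    return np.sum(da < db) > len(Y) / 2

# Sketch of use: with k = ceil(200 * log(2/delta)) blocks, the lemma says
# that with probability 1 - delta the true mean defeats every b with
# ||b - mu|| >= r; roughly, the estimator is then a point that is never
# defeated by any point at distance at least r from it.
```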