Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments Stas Minsker University of Southern California July 21, 2016 ICERM Workshop
Simple question: how to estimate the mean?
Assume that $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma_0^2)$.
Problem: construct $\mathrm{CI}_{\mathrm{norm}}(\alpha)$ for $\mu$ with coverage probability $\geq 1 - 2\alpha$.
Solution: compute $\hat\mu_n := \frac{1}{n}\sum_{j=1}^{n} X_j$, take
$$\mathrm{CI}_{\mathrm{norm}}(\alpha) = \left[\hat\mu_n - \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}},\ \hat\mu_n + \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}}\right].$$
Coverage is guaranteed since
$$\Pr\left(\left|\hat\mu_n - \mu\right| \geq \sigma_0\sqrt{\frac{2\log(1/\alpha)}{n}}\right) \leq 2\alpha.$$
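The Gaussian interval above can be sketched in a few lines of Python. This is a minimal illustration, assuming the variance proxy $\sigma_0$ is known; the function name `normal_ci` is hypothetical and not part of the talk.

```python
import math

def normal_ci(xs, sigma0, alpha):
    """Return (lower, upper) = sample mean -/+ sigma0 * sqrt(2 * log(1/alpha) / n).

    Assumes 0 < alpha < 1 and that sigma0 is a known bound on the standard deviation.
    """
    n = len(xs)
    mean = sum(xs) / n
    radius = sigma0 * math.sqrt(2.0 * math.log(1.0 / alpha) / n)
    return mean - radius, mean + radius

# Hypothetical usage with made-up data.
lower, upper = normal_ci([1.0, 2.0, 3.0, 4.0], sigma0=1.0, alpha=0.05)
```

The radius grows only like $\sqrt{\log(1/\alpha)}$, which is the benchmark the later constructions try to match without the normality assumption.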
Example: how to estimate the mean?
P. J. Huber (1964): "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?"
Going back to our question: what if $X_1, \dots, X_n$ are i.i.d. copies of $X \sim \Pi$ such that $\mathbb{E}X = \mu$, $\mathrm{Var}(X) \leq \sigma_0^2$?
Problem: construct CI for $\mu$ with coverage probability $\geq 1 - \alpha$ such that for any $\alpha$
$$\mathrm{length}(\mathrm{CI}(\alpha)) \leq (\text{absolute constant}) \cdot \mathrm{length}(\mathrm{CI}_{\mathrm{norm}}(\alpha)).$$
No additional assumptions on $\Pi$ are imposed.
Remark: the guarantee for the sample mean $\hat\mu_n = \frac{1}{n}\sum_{j=1}^{n} X_j$ is unsatisfactory:
$$\Pr\left(\left|\hat\mu_n - \mu\right| \geq \sigma_0\sqrt{\frac{1/\alpha}{n}}\right) \leq \alpha.$$
Does a solution exist?
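The bound in the remark is Chebyshev's inequality applied to the sample mean. A short derivation, offered as a sketch using only $\mathrm{Var}(X) \leq \sigma_0^2$ and independence:

```latex
\mathrm{Var}(\hat\mu_n) \le \frac{\sigma_0^2}{n}
\quad\Longrightarrow\quad
\Pr\bigl(|\hat\mu_n - \mu| \ge t\bigr) \le \frac{\sigma_0^2}{n t^2},
\qquad
t = \sigma_0\sqrt{\frac{1/\alpha}{n}}
\;\Longrightarrow\;
\frac{\sigma_0^2}{n t^2} = \alpha .
% Ratio of this radius to the Gaussian radius sigma_0 * sqrt(2 log(1/alpha) / n):
\frac{\sqrt{1/\alpha}}{\sqrt{2\log(1/\alpha)}} \longrightarrow \infty
\quad \text{as } \alpha \to 0 .
```

For instance, at $\alpha = 0.001$ the Chebyshev radius is roughly 8.5 times the Gaussian one, so no absolute constant can bound the ratio uniformly in $\alpha$.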
Answer (somewhat unexpected?): Yes!
Construction [A. Nemirovski, D. Yudin '83; N. Alon, Y. Matias, M. Szegedy '96; R. Oliveira, M. Lerasle '11]: split the sample into $k = \lfloor \log(1/\alpha) \rfloor + 1$ groups $G_1, \dots, G_k$ of size $\simeq n/k$ each:
$$\underbrace{X_1, \dots, X_{|G_1|}}_{G_1} \ \dots\dots\ \underbrace{X_{n-|G_k|+1}, \dots, X_n}_{G_k}$$
$$\hat\mu_1 := \frac{1}{|G_1|}\sum_{X_i \in G_1} X_i, \quad \dots, \quad \hat\mu_k := \frac{1}{|G_k|}\sum_{X_i \in G_k} X_i,$$
$$\hat\mu_* = \hat\mu_*(\alpha) := \mathrm{median}(\hat\mu_1, \dots, \hat\mu_k).$$
Claim:
$$\Pr\left(\left|\hat\mu_* - \mu\right| \geq 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right) \leq \alpha.$$
Then take
$$\mathrm{CI}(\alpha) = \left[\hat\mu_* - 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}},\ \hat\mu_* + 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right].$$
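The median-of-means construction is easy to prototype. The sketch below is one possible reading of the slide, not the author's code: the names `median_of_means` and `mom_ci` are hypothetical, groups are taken as consecutive blocks whose sizes differ by at most one, and capping $k$ at $n$ is an added assumption so that no group is empty.

```python
import math
import statistics

def median_of_means(xs, alpha):
    """Median of k = floor(log(1/alpha)) + 1 consecutive block means (0 < alpha < 1)."""
    n = len(xs)
    k = min(n, math.floor(math.log(1.0 / alpha)) + 1)  # cap at n: assumption, avoids empty blocks
    base, extra = divmod(n, k)  # the first `extra` blocks get one additional point
    means, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        block = xs[start:start + size]
        means.append(sum(block) / size)
        start += size
    return statistics.median(means)

def mom_ci(xs, sigma0, alpha):
    """Interval with radius 7.7 * sigma0 * sqrt(log(e/alpha) / n) around the estimator."""
    center = median_of_means(xs, alpha)
    radius = 7.7 * sigma0 * math.sqrt(math.log(math.e / alpha) / len(xs))
    return center - radius, center + radius

# Hypothetical data: with alpha = 0.1 there are 3 blocks, [1,2,3], [4,5], [6,100].
estimate = median_of_means([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0], alpha=0.1)
```

In this toy sample the single large value only affects one block mean, so the median of the block means ignores it, whereas the plain sample mean would be pulled toward it.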
Idea of the proof:
[Figure: the points $\hat\mu_1, \dots, \hat\mu_8$, their median $\hat\mu$, and $\mu$ marked on a line.]
$|\hat\mu - \mu| \geq s \implies$ at least half of the events $\{|\hat\mu_j - \mu| \geq s\}$ occur.
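One way to turn this idea into a bound is sketched below. It is a reconstruction under stated assumptions (that $k \leq n$, so every group has at least $n/(2k)$ points), not necessarily the argument used in the talk; here $\hat\mu_*$ denotes the median of the group means and $p$ is an auxiliary symbol.

```latex
% Step 1: Chebyshev for each group mean, using |G_j| >= n/(2k).
\Pr\bigl(|\hat\mu_j - \mu| \ge s\bigr) \le \frac{2k\,\sigma_0^2}{n s^2} =: p .
% Step 2: the groups are independent; at least \lceil k/2 \rceil events must occur.
\Pr\bigl(|\hat\mu_* - \mu| \ge s\bigr)
  \le \binom{k}{\lceil k/2 \rceil} p^{\lceil k/2 \rceil}
  \le 2^k p^{k/2} = (4p)^{k/2}.
% Step 3: choose s so that 4p = e^{-2}.
s = 2\sqrt{2}\,e\,\sigma_0\sqrt{\frac{k}{n}}
\;\Longrightarrow\;
(4p)^{k/2} = e^{-k} \le \alpha
\quad\text{since } k = \lfloor \log(1/\alpha) \rfloor + 1 \ge \log(1/\alpha).
```

Because $k \leq \log(1/\alpha) + 1 = \log(e/\alpha)$ and $2\sqrt{2}\,e \approx 7.69$, this choice of $s$ is at most $7.7\,\sigma_0\sqrt{\log(e/\alpha)/n}$, which agrees with the constant in the claim.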
Improve the constant?
O. Catoni's estimator (2012), "generalized truncation": let $\alpha > 0$, let $\psi$ satisfy
$$-\log(1 - x + x^2/2) \leq \psi(x) \leq \log(1 + x + x^2/2),$$
and, for a parameter $\theta > 0$, define $\hat\mu$ via
$$\sum_{j=1}^{n} \psi\bigl(\theta(X_j - \hat\mu)\bigr) = 0.$$
Truncation $\tau(x) = (|x| \wedge 1)\,\mathrm{sign}(x)$ satisfies a weaker inequality
$$-\log(1 - x + x^2) \leq \tau(x) \leq \log(1 + x + x^2).$$
[Figure: plot with both axes running from $-1$ to $1$.]
Intuition: for small $\theta > 0$,
$$\sum_{j=1}^{n} \psi\bigl(\theta(X_j - \hat\mu)\bigr) \simeq \sum_{j=1}^{n} \theta(X_j - \hat\mu) = 0 \implies \hat\mu \simeq \frac{1}{n}\sum_{j=1}^{n} X_j.$$
The following holds: set $\theta_* = \sqrt{\frac{2\log(1/\alpha)}{n}}\,\frac{1}{\sigma_0}$. Then
$$|\hat\mu - \mu| \leq \bigl(\sqrt{2} + o(1)\bigr)\,\sigma_0\sqrt{\frac{\log(1/\alpha)}{n}}$$
with probability $\geq 1 - 2\alpha$.
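The estimating equation can be solved numerically because the score is monotone in $\hat\mu$ whenever $\psi$ is increasing. The sketch below makes two choices that are assumptions rather than part of the slides: it takes $\psi$ equal to the upper bound for $x \geq 0$ and the lower bound for $x < 0$ (one admissible, increasing choice), and it finds the root by bisection. The names `psi` and `catoni_mean` are hypothetical.

```python
import math

def psi(x):
    """An increasing, odd function meeting the two-sided bound with equality."""
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def catoni_mean(xs, sigma0, alpha, iters=100):
    """Root of sum_j psi(theta * (x_j - m)) = 0 with theta = sqrt(2 log(1/alpha) / n) / sigma0."""
    n = len(xs)
    theta = math.sqrt(2.0 * math.log(1.0 / alpha) / n) / sigma0

    def score(m):
        return sum(psi(theta * (x - m)) for x in xs)

    # score is decreasing in m, with score(min) >= 0 >= score(max): bisect on that bracket.
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data: nine zeros and one gross outlier.
estimate = catoni_mean([0.0] * 9 + [1000.0], sigma0=1.0, alpha=0.05)
```

Since $\psi$ grows only logarithmically, the outlier contributes a bounded amount to the score, and the estimate stays far below the sample mean of 100 in this example.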
Extensions to higher dimensions
A natural question: is it possible to extend the presented techniques to the multivariate mean?
Motivation: PCA
[Figure: two plots related by $\Rightarrow$.]
Genes mirror geography within Europe, J. Novembre et al., Nature 2008.
[Figure: panels a, b, c; axes PC1 and PC2.]
A good explanation for non-experts: https://faculty.washington.edu/tathornt/SISG2015/lectures/assoc2015session05.pdf
Mathematical framework: $Y_1, \dots, Y_n \in \mathbb{R}^d$ i.i.d., $\mathbb{E}Y_j = 0$, $\mathbb{E}Y_j Y_j^T = \Sigma$.
Goal: construct $\hat\Sigma$, an estimator of $\Sigma$ such that $\bigl\|\hat\Sigma - \Sigma\bigr\|_{\mathrm{Op}}$ is small.
The sample covariance
$$\tilde\Sigma_n = \frac{1}{n}\sum_{j=1}^{n} Y_j Y_j^T$$
is very sensitive to outliers.
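A tiny numerical illustration of this sensitivity, with made-up data and a hypothetical helper `sample_second_moment` that computes $\frac{1}{n}\sum_j Y_j Y_j^T$ for vectors given as tuples:

```python
def sample_second_moment(ys):
    """Return (1/n) * sum_j y_j y_j^T as a list of rows, for zero-mean vectors y_j."""
    n, d = len(ys), len(ys[0])
    return [[sum(y[a] * y[b] for y in ys) / n for b in range(d)] for a in range(d)]

# Four well-behaved points, then the same sample with a single gross outlier added.
clean = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
dirty = clean + [(100.0, 100.0)]

cov_clean = sample_second_moment(clean)  # 0.5 on the diagonal, 0 off the diagonal
cov_dirty = sample_second_moment(dirty)  # every entry is now of order 2000
```

One point out of five is enough to change every entry by three orders of magnitude, which is the behavior a robust estimator $\hat\Sigma$ is meant to avoid.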
Naive approach: apply the "median trick" (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.
Better approach: replace the usual median by the geometric median,
$$x_* = \mathrm{med}(x_1, \dots, x_k) := \operatorname*{argmin}_{y \in \mathbb{R}^d} \sum_{j=1}^{k} \|y - x_j\|.$$
Still some issues:
1. does not work well for small sample sizes;
2. yields bounds in the wrong norm.
Alternatives: Tyler's M-estimator, Maronna's M-estimator; guarantees are limited to special classes of distributions.