Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments
Stas Minsker, University of Southern California
July 21, 2016, ICERM Workshop

Simple question: how to estimate the mean?
Assume that $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma_0^2)$.
Problem: construct $\mathrm{CI}_{\mathrm{norm}}(\alpha)$ for $\mu$ with coverage probability $\ge 1 - 2\alpha$.
Solution: compute $\hat\mu_n := \frac{1}{n}\sum_{j=1}^{n} X_j$, take
$$\mathrm{CI}_{\mathrm{norm}}(\alpha) = \left[\hat\mu_n - \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}},\ \hat\mu_n + \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}}\right].$$
Coverage is guaranteed since
$$\Pr\left(\left|\hat\mu_n - \mu\right| \ge \sigma_0\sqrt{\frac{2\log(1/\alpha)}{n}}\right) \le 2\alpha.$$
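The Gaussian interval above can be sketched in a few lines of Python. This is only an illustration: the function name `gaussian_ci` and the simulation parameters are assumptions, not part of the talk.

```python
import math
import random

def gaussian_ci(xs, sigma0, alpha):
    """CI_norm(alpha): sample mean +/- sigma0 * sqrt(2 * log(1/alpha) / n)."""
    n = len(xs)
    mean = sum(xs) / n
    half_width = sigma0 * math.sqrt(2.0 * math.log(1.0 / alpha) / n)
    return mean - half_width, mean + half_width

# Hypothetical check: empirical coverage on simulated N(mu, sigma0^2) data.
rng = random.Random(0)
mu, sigma0, alpha, n, trials = 1.0, 2.0, 0.05, 200, 2000
covered = 0
for _ in range(trials):
    xs = [rng.gauss(mu, sigma0) for _ in range(n)]
    lo, hi = gaussian_ci(xs, sigma0, alpha)
    covered += lo <= mu <= hi  # True counts as 1
coverage = covered / trials
print(coverage)
```

The empirical coverage should be at least $1 - 2\alpha$, since the tail bound behind the interval is not tight.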

Example: how to estimate the mean?
P. J. Huber (1964): "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?"
Going back to our question: what if $X_1, \ldots, X_n$ are i.i.d. copies of $X \sim \Pi$ such that $\mathbb{E}X = \mu$, $\mathrm{Var}(X) \le \sigma_0^2$?
Problem: construct a CI for $\mu$ with coverage probability $\ge 1 - \alpha$ such that for any $\alpha$
$$\mathrm{length}(\mathrm{CI}(\alpha)) \le (\text{Absolute constant}) \cdot \mathrm{length}(\mathrm{CI}_{\mathrm{norm}}(\alpha)).$$
No additional assumptions on $\Pi$ are imposed.
Remark: the guarantee for the sample mean $\hat\mu_n = \frac{1}{n}\sum_{j=1}^{n} X_j$ is unsatisfactory:
$$\Pr\left(\left|\hat\mu_n - \mu\right| \ge \sigma_0\sqrt{\frac{(1/\alpha)}{n}}\right) \le \alpha.$$
Does a solution exist?
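To see why the Chebyshev-type guarantee is unsatisfactory, one can compare its deviation radius $\sigma_0\sqrt{1/(\alpha n)}$ with the Gaussian radius $\sigma_0\sqrt{2\log(1/\alpha)/n}$. The sketch below uses hypothetical helper names, and the two bounds hold at slightly different levels ($\alpha$ versus $2\alpha$), so the ratio is only indicative.

```python
import math

def chebyshev_half_width(sigma0, n, alpha):
    """Deviation radius from Chebyshev's inequality: sigma0 * sqrt(1 / (alpha * n))."""
    return sigma0 * math.sqrt(1.0 / (alpha * n))

def gaussian_half_width(sigma0, n, alpha):
    """Deviation radius under normality: sigma0 * sqrt(2 * log(1/alpha) / n)."""
    return sigma0 * math.sqrt(2.0 * math.log(1.0 / alpha) / n)

def width_ratio(alpha, sigma0=1.0, n=100):
    """How many times wider the Chebyshev radius is than the Gaussian one."""
    return chebyshev_half_width(sigma0, n, alpha) / gaussian_half_width(sigma0, n, alpha)

for alpha in (1e-1, 1e-2, 1e-4):
    print(f"alpha={alpha:g}: Chebyshev/Gaussian ratio = {width_ratio(alpha):.2f}")
```

The ratio does not depend on $n$ or $\sigma_0$ and grows without bound as $\alpha \to 0$, so no absolute constant can relate the two lengths.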

Example: how to estimate the mean?
Answer (somewhat unexpected?): Yes!
Construction: [A. Nemirovski, D. Yudin '83; N. Alon, Y. Matias, M. Szegedy '96; R. Oliveira, M. Lerasle '11]
Split the sample into $k = \lfloor \log(1/\alpha) \rfloor + 1$ groups $G_1, \ldots, G_k$ of size $\simeq n/k$ each:
$$\underbrace{X_1, \ldots, X_{|G_1|}}_{G_1} \ \ldots\ \underbrace{X_{n-|G_k|+1}, \ldots, X_n}_{G_k}$$
$$\hat\mu_1 := \frac{1}{|G_1|}\sum_{X_i \in G_1} X_i, \quad \ldots, \quad \hat\mu_k := \frac{1}{|G_k|}\sum_{X_i \in G_k} X_i$$
$$\hat\mu^* = \hat\mu^*(\alpha) := \mathrm{median}(\hat\mu_1, \ldots, \hat\mu_k)$$
Claim:
$$\Pr\left(\left|\hat\mu^* - \mu\right| \ge 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right) \le \alpha.$$
Then take
$$\mathrm{CI}(\alpha) = \left[\hat\mu^* - 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}},\ \hat\mu^* + 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right].$$
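A minimal sketch of the median-of-means construction follows. The function names are hypothetical, and the sketch adds two assumptions of its own: $k$ is capped at $n$ so that no block is empty, and for even $k$ the median is taken as the average of the two middle block means (`statistics.median`). It assumes $0 < \alpha < 1$.

```python
import math
import statistics

def median_of_means(xs, alpha):
    """Median of block means with k = floor(log(1/alpha)) + 1 contiguous blocks."""
    n = len(xs)
    k = min(n, math.floor(math.log(1.0 / alpha)) + 1)  # cap at n: assumption
    # Block j holds xs[bounds[j]:bounds[j + 1]]; block sizes differ by at most one.
    bounds = [(j * n) // k for j in range(k + 1)]
    block_means = [
        sum(xs[bounds[j]:bounds[j + 1]]) / (bounds[j + 1] - bounds[j])
        for j in range(k)
    ]
    return statistics.median(block_means)

def mom_ci(xs, sigma0, alpha):
    """CI(alpha) centered at the median of means, with the constant 7.7 from the claim."""
    center = median_of_means(xs, alpha)
    half_width = 7.7 * sigma0 * math.sqrt(math.log(math.e / alpha) / len(xs))
    return center - half_width, center + half_width

# One gross outlier moves the sample mean a lot but not the median of means.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
print(sum(data) / len(data), median_of_means(data, 0.1))
```

With $\alpha = 0.1$ the sample is split into three blocks, and the outlier only contaminates one block mean.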

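Idea of the proof:
[Figure: the block means $\hat\mu_1, \ldots, \hat\mu_8$ on the real line, together with $\mu$ and their median $\hat\mu$]
$|\hat\mu - \mu| \ge s \implies$ at least half of the events $\{|\hat\mu_j - \mu| \ge s\}$ occur (here $\hat\mu$ denotes the median $\hat\mu^*$ of the block means).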

Improve the constant?
O. Catoni's estimator (2012), "Generalized truncation": let $\alpha > 0$, let $\psi$ satisfy
$$-\log(1 - x + x^2/2) \le \psi(x) \le \log(1 + x + x^2/2),$$
and define $\hat\mu$ via
$$\sum_{j=1}^{n} \psi\big(\theta(X_j - \hat\mu)\big) = 0.$$
Truncation $\tau(x) = (|x| \wedge 1)\,\mathrm{sign}(x)$ satisfies a weaker inequality
$$-\log(1 - x + x^2) \le \tau(x) \le \log(1 + x + x^2).$$
[Figure: plot with both axes ranging from $-1$ to $1$]

Improve the constant?
$$\sum_{j=1}^{n} \psi\big(\theta(X_j - \hat\mu)\big) = 0.$$
Intuition: for small $\theta > 0$,
$$\sum_{j=1}^{n} \psi\big(\theta(X_j - \hat\mu)\big) \simeq \sum_{j=1}^{n} \theta(X_j - \hat\mu) = 0 \implies \hat\mu \simeq \frac{1}{n}\sum_{j=1}^{n} X_j.$$

Improve the constant?
$$\sum_{j=1}^{n} \psi\big(\theta(X_j - \hat\mu)\big) = 0.$$
The following holds: set $\theta^* = \frac{1}{\sigma_0}\sqrt{\frac{2\log(1/\alpha)}{n}}$. Then
$$|\hat\mu - \mu| \le \big(\sqrt{2} + o(1)\big)\,\sigma_0\sqrt{\frac{\log(1/\alpha)}{n}}$$
with probability $\ge 1 - 2\alpha$.
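A minimal sketch of solving the estimating equation numerically. Two choices here are assumptions rather than part of the slides: $\psi$ is taken to be the widest function allowed by the two-sided bound (upper bound for $x \ge 0$, lower bound for $x < 0$), and the root is found by bisection, which works because the score is non-increasing in $\hat\mu$. The name `catoni_mean` is hypothetical.

```python
import math

def psi(x):
    """Widest admissible influence function: log(1 + x + x^2/2) for x >= 0, odd extension otherwise."""
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def catoni_mean(xs, sigma0, alpha, iters=200):
    """Root of sum_j psi(theta * (x_j - m)) = 0 with theta = sqrt(2 log(1/alpha) / n) / sigma0."""
    n = len(xs)
    theta = math.sqrt(2.0 * math.log(1.0 / alpha) / n) / sigma0

    def score(m):
        return sum(psi(theta * (x - m)) for x in xs)

    # psi is increasing, so score is non-increasing in m and
    # score(min(xs)) >= 0 >= score(max(xs)): the root lies in between.
    lo, hi = min(xs), max(xs)
    for _ in range(iters):  # fixed iteration count guarantees termination
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# A single gross outlier pulls the sample mean far more than the Catoni-type estimate.
data = [0.0] * 99 + [1000.0]
print(sum(data) / len(data), catoni_mean(data, sigma0=1.0, alpha=0.05))
```

The influence of each observation grows only logarithmically in its distance from $\hat\mu$, which is what limits the effect of the outlier.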

Extensions to higher dimensions
A natural question: is it possible to extend the presented techniques to the multivariate mean?

Extensions to higher dimensions
Motivation: PCA
[Figure: two 3-D plots joined by an arrow]

Extensions to higher dimensions
Motivation: PCA
"Genes mirror geography within Europe", J. Novembre et al., Nature 2008.
[Figure: panels a, b, c; axes PC1 and PC2]
A good explanation for non-experts: https://faculty.washington.edu/tathornt/SISG2015/lectures/assoc2015session05.pdf

Extensions to higher dimensions
Mathematical framework: $Y_1, \ldots, Y_n \in \mathbb{R}^d$, i.i.d., $\mathbb{E}Y_j = 0$, $\mathbb{E}Y_j Y_j^T = \Sigma$.
Goal: construct $\hat\Sigma$, an estimator of $\Sigma$ such that $\big\|\hat\Sigma - \Sigma\big\|_{\mathrm{Op}}$ is small.
Sample covariance
$$\tilde\Sigma_n = \frac{1}{n}\sum_{j=1}^{n} Y_j Y_j^T$$
is very sensitive to outliers.
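The sensitivity of the sample covariance can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: $d = 2$, $\Sigma = I$, Gaussian data, a single planted outlier, and the helper names. The operator norm of a symmetric $2 \times 2$ matrix is computed from the closed-form eigenvalues.

```python
import math
import random

def sample_covariance(ys):
    """(1/n) * sum_j y_j y_j^T for centered vectors y_j in R^d."""
    n, d = len(ys), len(ys[0])
    return [[sum(y[a] * y[b] for y in ys) / n for b in range(d)] for a in range(d)]

def op_norm_sym2(m):
    """Operator norm of a symmetric 2x2 matrix, i.e. its largest absolute eigenvalue."""
    half_trace = 0.5 * (m[0][0] + m[1][1])
    radius = math.hypot(0.5 * (m[0][0] - m[1][1]), m[0][1])
    return max(abs(half_trace + radius), abs(half_trace - radius))

def error_to_identity(data):
    """|| sample covariance - I ||_Op for 2-dimensional data."""
    s = sample_covariance(data)
    return op_norm_sym2([[s[0][0] - 1.0, s[0][1]], [s[1][0], s[1][1] - 1.0]])

rng = random.Random(1)
ys = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]  # true Sigma = I
clean_error = error_to_identity(ys)
corrupted_error = error_to_identity(ys[:-1] + [[100.0, 100.0]])  # one planted outlier
print(clean_error, corrupted_error)
```

Replacing one point out of 500 by $(100, 100)$ adds roughly $\frac{1}{500}\,(100,100)^T(100,100)$ to the estimate, a rank-one term of operator norm 40.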

Extensions to higher dimensions
Naive approach: apply the "median trick" (or Catoni's estimator) coordinatewise. Makes the bound dimension-dependent.
Better approach: replace the usual median by the geometric median.
$$x_* = \mathrm{med}(x_1, \ldots, x_k) := \operatorname*{argmin}_{y \in \mathbb{R}^d} \sum_{j=1}^{k} \|y - x_j\|.$$
Still some issues:
1. does not work well for small sample sizes;
2. yields bounds in the wrong norm.
Alternatives: Tyler's M-estimator, Maronna's M-estimator; guarantees are limited to special classes of distributions.
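The geometric median has no closed form, but it can be approximated iteratively. The sketch below uses Weiszfeld's algorithm, which is a standard choice but is not mentioned in the slides. The Euclidean norm, the starting point, the iteration count and the `eps` guard are all assumptions of this illustration.

```python
import math

def geometric_median(points, iters=200, eps=1e-12):
    """Weiszfeld iterations for argmin_y sum_j ||y - x_j|| with the Euclidean norm."""
    d = len(points[0])
    y = [sum(p[i] for p in points) / len(points) for i in range(d)]  # start at the mean
    for _ in range(iters):
        # Each point is weighted by the inverse of its current distance to y.
        weights = [1.0 / max(math.dist(y, p), eps) for p in points]  # eps avoids division by zero
        total = sum(weights)
        y = [sum(w * p[i] for w, p in zip(weights, points)) / total for i in range(d)]
    return y

# Three points at the origin and one far away: the mean moves, the geometric median does not.
pts = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [100.0, 100.0]]
print(geometric_median(pts))
```

In the setting of the slides, the points $x_1, \ldots, x_k$ would be the block estimates computed on the $k$ groups, in the same way as the block means in the one-dimensional median-of-means construction.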
