OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS
Philippe Rigollet (joint work with Quentin Berthet)
High dimensional data: a cloud of n points in R^p
Principal component = direction of largest variance
Principal component analysis (PCA)
• Tool for dimension reduction
• Based on the spectrum of the covariance matrix
• Main tool for exploratory data analysis
We study only the first principal component.
This talk: high-dimensional, finite-sample framework, p >> n.
Testing for sphericity under a rank-one alternative
H0: Σ = I_p (isotropic)  vs  H1: Σ = I_p + θ vv^T, |v|_2 = 1 (principal component v)
The model
• Observations: i.i.d. X_1, ..., X_n ~ N_p(0, Σ)
• Estimator: empirical covariance matrix Σ̂ = (1/n) Σ_{i=1}^n X_i X_i^T
If n >> p, it is a consistent estimator.
If n ≃ cp, it is inconsistent (Nadler, Paul, Onatski, ...): the empirical eigenvectors can be asymptotically orthogonal to the true ones.
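A minimal numpy sketch of this model: sampling i.i.d. points from N_p(0, I_p + θ vv^T) and forming the empirical covariance matrix. The function names and the parameter values (n, p, θ) are illustrative choices, not from the talk; the square-root trick uses (I + c vv^T)^2 = I + (2c + c^2) vv^T with c = √(1+θ) − 1.

```python
import numpy as np

def sample_spiked(n, p, theta, v, rng):
    """Draw n i.i.d. samples from N_p(0, I_p + theta * v v^T)."""
    # (I + c vv^T) is a square root of the covariance when c = sqrt(1+theta) - 1.
    g = rng.standard_normal((n, p))
    return g + (np.sqrt(1.0 + theta) - 1.0) * np.outer(g @ v, v)

def empirical_cov(X):
    """Empirical covariance (1/n) sum_i X_i X_i^T (mean known to be zero)."""
    n = X.shape[0]
    return X.T @ X / n

rng = np.random.default_rng(0)
p, n = 20, 5000
v = np.zeros(p); v[0] = 1.0          # a 1-sparse unit vector
X = sample_spiked(n, p, theta=1.0, v=v, rng=rng)
Sigma_hat = empirical_cov(X)
```

With n much larger than p, Σ̂[0,0] should be close to 1 + θ = 2 and the remaining diagonal entries close to 1.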
Empirical spectrum under the null H0: Σ = I_p
[Figure: histogram of the spectrum of Σ̂, which follows the Marcenko-Pastur distribution.]
Empirical spectrum under the alternative H1: Σ = I_p + θ vv^T, |v|_2 = 1
The BBP (Baik, Ben Arous, Péché) transition: as p/n → α > 0,
• if θ ≤ √α: indistinguishable from the null;
• if θ > √α: detection possible.
Detection thus requires θ > √(p/n): a very strong signal!
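The BBP transition is easy to see in simulation. The sketch below (dimensions and θ values are illustrative, not from the talk) compares the top empirical eigenvalue for a spike below and above √α against the Marcenko-Pastur bulk edge (1 + √α)^2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 400                 # alpha = p/n = 0.2, sqrt(alpha) ~ 0.447
alpha = p / n

def top_eig(theta):
    """Largest eigenvalue of the empirical covariance of n spiked samples."""
    v = np.zeros(p); v[0] = 1.0
    g = rng.standard_normal((n, p))
    x = g + (np.sqrt(1 + theta) - 1) * np.outer(g @ v, v)
    return np.linalg.eigvalsh(x.T @ x / n)[-1]

bulk_edge = (1 + np.sqrt(alpha)) ** 2   # Marcenko-Pastur right edge ~ 2.09
weak = top_eig(theta=0.3)    # theta < sqrt(alpha): spike hidden in the bulk
strong = top_eig(theta=2.0)  # theta > sqrt(alpha): spike escapes the bulk
```

In the weak case the top eigenvalue sticks to the bulk edge, so the spectrum alone cannot reveal the spike; in the strong case it separates cleanly.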
Testing for a sparse principal component
H0: Σ = I_p (isotropic)  vs  H1: Σ = I_p + θ vv^T, |v|_2 = 1, |v|_0 ≤ k (sparse principal direction)
Testing for a sparse principal component: what is the minimum detection level θ?
Goal: find a statistic φ: S_p^+ → R and thresholds τ0 ≤ τ1 such that
• P_{H0}(φ(Σ̂) < τ0) ≥ 1 − δ (φ small under H0)
• P_{H1}(φ(Σ̂) > τ1) ≥ 1 − δ (φ large under H1)
Take any τ0 ≤ τ ≤ τ1 and the test ψ(Σ̂) = 1{φ(Σ̂) > τ}. It satisfies
P_{H0}(ψ = 1) ∨ max_{|v|_2=1, |v|_0≤k} P_{H1}(ψ = 0) ≤ δ.
Sparse eigenvalue
k-sparse eigenvalue: λ_max^k(Σ̂) = max_{|x|_2=1, |x|_0≤k} x^T Σ̂ x = max_{|S|=k} λ_max(Σ̂_S)
Note that λ_max^k(I_p + θ vv^T) = 1 + θ and λ_max^k(I_p) = 1.
Smaller fluctuations than the largest eigenvalue λ_max(Σ̂).
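For small p, the k-sparse eigenvalue can be computed exactly by enumerating all size-k supports, which makes the definition concrete. A minimal sketch (function name and test matrices are illustrative):

```python
import itertools
import numpy as np

def k_sparse_eigmax(Sigma, k):
    """Exact k-sparse largest eigenvalue: max over size-k supports S of
    lambda_max(Sigma_S). Enumeration is only feasible for small p."""
    p = Sigma.shape[0]
    best = -np.inf
    for S in itertools.combinations(range(p), k):
        sub = Sigma[np.ix_(S, S)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best

# Sanity check against the population values quoted on the slide:
p, k, theta = 8, 2, 0.5
v = np.zeros(p); v[:2] = 1 / np.sqrt(2)       # 2-sparse unit vector
spiked = np.eye(p) + theta * np.outer(v, v)
```

On these matrices the definition gives λ_max^k(I_p) = 1 and λ_max^k(I_p + θ vv^T) = 1 + θ, matching the slide.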
Upper bounds, w.p. 1 − δ
Under the null hypothesis: λ_max^k(Σ̂) ≤ 1 + 8 √( (k log(9ep/k) + log(1/δ)) / n ) =: τ0
Under the alternative hypothesis: λ_max^k(Σ̂) ≥ 1 + θ − 2(1 + θ) √( log(1/δ) / n ) =: τ1
Can detect as soon as τ0 < τ1, which yields θ ≥ C √( k log(p/k) / n ).
Minimax lower bound
Fix ν > 0 (small). Then there exists a constant C_ν > 0 such that if
θ < θ̄ := √( k log(C_ν p/k² + 1) / n ) ∧ (1/√2),
then
inf_ψ { P_0(ψ = 1) ∨ max_{|v|_2=1, |v|_0≤k} P_v(ψ = 0) } ≥ 1/2 − ν.
See also Arias-Castro, Bubeck and Lugosi (2012).
Computational issues
To compute λ_max^k(Σ̂), one needs to compute the eigenvalues of (p choose k) submatrices.
It can be used to find cliques in graphs: an NP-complete problem.
Need an approximation...
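The binomial count makes the intractability concrete. A tiny illustration (the values of p and k are made up, chosen so that k ≈ √p):

```python
import math

# Number of k x k principal submatrices whose top eigenvalue would have to
# be computed for the exact k-sparse eigenvalue.
p, k = 500, 22                    # illustrative sizes, k ~ sqrt(p)
n_subsets = math.comb(p, k)
print(f"C({p},{k}) = {n_subsets:.3e}")   # astronomically large
```

Even for these moderate dimensions the count exceeds 10^30, far beyond any feasible enumeration, hence the need for a convex relaxation.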
Semidefinite relaxation 101
λ_max^k(A) = max x^T A x  subject to  |x|_2 = 1, |x|_0 ≤ k.
Write Z = xx^T: then Tr(AZ) = x^T A x, Tr(Z) = 1, rank(Z) = 1, Z ⪰ 0, and Cauchy-Schwarz gives |Z|_1 ≤ k.
Dropping the rank constraint yields the relaxation
SDP_k(A) = max Tr(AZ)  subject to  Tr(Z) = 1, |Z|_1 ≤ k, Z ⪰ 0,
a semidefinite program (SDP) introduced by d'Aspremont, El Ghaoui, Jordan and Lanckriet (2004).
Testing procedure: 1{SDP_k(Σ̂) > τ}, defined even if the solution of the SDP has rank > 1.
Performance of SDP
For the alternative: SDP_k is a relaxation of λ_max^k, so SDP_k(Σ̂) ≥ λ_max^k(Σ̂).
For the null: use the dual (Bach et al., 2010):
SDP_k(A) = min_{U ∈ S_p} { λ_max(A + U) + k|U|_∞ }.
For any symmetric U, this gives an upper bound on SDP_k(Σ̂).
Enough to look only at the minimum dual perturbation:
MDP_k(Σ̂) = min_{z ≥ 0} { λ_max(s_z(Σ̂)) + kz },
where s_z denotes entrywise soft-thresholding at level z.
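The minimum dual perturbation only involves eigenvalues of soft-thresholded matrices, so it can be sketched in pure numpy. Here the one-dimensional minimization over z is approximated by a grid search, which is my own simplification (it yields an upper bound on the true minimum, which preserves MDP_k ≥ SDP_k ≥ λ_max^k):

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding s_z(A)."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp(Sigma, k, n_grid=200):
    """Minimum dual perturbation MDP_k = min_{z>=0} lambda_max(s_z(Sigma)) + k z,
    approximated by a grid search over z in [0, max |Sigma_ij|]."""
    z_max = np.abs(Sigma).max()
    best = np.inf
    for z in np.linspace(0.0, z_max, n_grid):
        val = np.linalg.eigvalsh(soft_threshold(Sigma, z))[-1] + k * z
        best = min(best, val)
    return best

p, k, theta = 10, 2, 0.5
v = np.zeros(p); v[:2] = 1 / np.sqrt(2)       # 2-sparse unit vector
spiked = np.eye(p) + theta * np.outer(v, v)
```

By duality every grid value upper-bounds SDP_k, so mdp(spiked, k) is trapped between λ_max^k = 1 + θ and the value at z = 0, which is also 1 + θ here.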
Upper bounds, w.p. 1 − δ (∗DP ∈ {SDP, MDP})
Under the null hypothesis: ∗DP_k(Σ̂) ≤ 1 + 10 √( k² log(ep/δ) / n ) =: τ0
Under the alternative hypothesis: ∗DP_k(Σ̂) ≥ 1 + θ − 2(1 + θ) √( log(1/δ) / n ) =: τ1
Can detect as soon as τ0 < τ1, which yields θ ≥ C √( k² log(p/k) / n ).
[Figure: ratio of the 5% quantile under H1 over the 95% quantile under H0, versus signal strength θ, for SDP_k, MDP_k and λ_max(·). When this ratio is larger than one, both type I and type II errors are below 5%. From A. d'Aspremont, Soutenance HDR, ENS Cachan, Nov. 2012.]
Summary
• No detection: θ below the minimax level
• Detection with λ_max^k: θ ≥ C √( (k/n) log(p/k) )
• Detection with ∗DP_k: θ ≥ C √( (k²/n) log(p/k) )
Can we tighten the gap?
Numerical evidence
Fix the type I error at 1% and plot the type II error of MDP_k, for p ∈ {50, 100, 200, 500} and k = √p, against:
• (k/n) log(p/k): the minimax optimal scaling
• (k²/n) log(p/k): the proved scaling
[Figure: type II error curves of MDP_k plotted against each scaling.]
Random graphs
A random (Erdős–Rényi) graph on N vertices is obtained by drawing each edge independently at random with probability 1/2.
Asymptotically almost surely, the largest clique has size ≈ 2 log N (≈ 7.8 for N = 50).
Hidden clique
We can hide a clique (here of size 10) in this graph: choose vertices arbitrarily, draw a clique on them, and embed it in the original random graph.
Question: is there a hidden clique in this graph?
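The construction above is easy to sketch with an adjacency matrix. A minimal version, assuming edge probability 1/2 as on the slide (the function name and sizes are illustrative):

```python
import numpy as np

def planted_clique(N, K, rng):
    """Adjacency matrix of G(N, 1/2) with a clique planted on K random vertices."""
    A = rng.integers(0, 2, size=(N, N))
    A = np.triu(A, 1)
    A = A + A.T                              # symmetric, zero diagonal
    clique = rng.choice(N, size=K, replace=False)
    A[np.ix_(clique, clique)] = 1            # plant the clique
    np.fill_diagonal(A, 0)
    return A, clique

rng = np.random.default_rng(2)
A, clique = planted_clique(N=50, K=10, rng=rng)
```

The detection question is then: given only A (not the vertex list), decide whether such a block of ones was planted.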
Hidden clique problem
It is believed to be hard to find, or even to test the presence of, a planted clique in a random graph (Alon, Arora, Feige, Hazan, Krauthgamer, ...). Cryptosystems are based on this fact!
Conjecture: it is hard to find cliques of size between 2 log N and √N.
(Alon, Krivelevich, Sudakov 98; Feige and Krauthgamer 00; Dekel et al. 10; Feige and Ron 10; Ames and Vavasis 11)
Canonical example of average-case complexity.
Hidden clique problem
It seems related to our problem, but not trivially (the randomness structure is very fragile). Note that all our results extend to sub-Gaussian random variables.
Theorem. If we could prove that there exists C > 0 such that, under the null hypothesis,
SDP_k(Σ̂) ≤ 1 + C √( k^α log(ep/δ) / n )
for some α ∈ (1, 2), then this could be used to test the presence of a clique of size N^{1/(4−α)} polylog(N).
Remarks
Unlike usual hardness results, this one applies only to one (actually two) specific methods, not to all methods.
In progress: we can remove this limitation using bi-cliques (one needs to deal carefully with independence).
Conclusion
• Optimal rates for sparse detection
• Computationally efficient methods with a suboptimal rate
• First(?) link between sparse detection and average-case complexity
• Opens the door to new statistical lower bounds: complexity-theoretic lower bounds
• Evidence that heuristics cannot be optimal