Understanding parallel analysis methods for rank selection in PCA


  1. Understanding parallel analysis methods for rank selection in PCA. David Hong, Yue Sheng, Edgar Dobriban. Wharton Statistics, University of Pennsylvania. Random Matrices and Complex Data Analysis Workshop, 10 December 2019. This work was supported in part by NSF BIGDATA grant IIS-1837992 and NSF TRIPODS award 1934960.

  3. An illustrative example: principal components for genetics. 1000G genetics data: n = 2318 individuals, p = 115019 SNPs (Rounak Dey, Xihong Lin). PCs can reveal population (and sub-population) structure, but how many are meaningful? Parallel analysis for rank selection in PCA 1/22

  5. An illustrative example: principal components for genetics. Often, we look at the scree plot and the spectrum. Question: how can we make principled selections and reason about them? The spectrum looks like a spiked covariance model...
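The spiked-covariance picture the slide alludes to can be simulated with a short sketch. This is not the 1000G data; the sizes `n`, `p`, the rank `r`, and the spike strengths below are illustrative assumptions:

```python
# Minimal sketch: singular values of a low-rank "spiked" signal plus
# i.i.d. Gaussian noise -- the kind of spectrum a scree plot displays.
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 500, 200, 3               # samples, features, true rank (assumed)
spikes = np.array([8.0, 5.0, 3.0])  # signal strengths (assumed)

# low-rank signal with orthonormal factors, plus i.i.d. Gaussian noise
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((p, r)))[0]
X = np.sqrt(n) * U @ np.diag(spikes) @ V.T + rng.standard_normal((n, p))

svals = np.linalg.svd(X, compute_uv=False)
# The top r singular values separate from the noise bulk; a scree plot
# is just svals (or svals**2 / n) against their index.
print(svals[:5])
```

The top three singular values stand well above the Marchenko-Pastur-type bulk edge (roughly sqrt(n) + sqrt(p) here), which is the visual gap a scree plot reader looks for.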

  7. Rank selection for PCA. Rank selection is important – it affects every downstream step! Too many components: add noise to downstream analyses. Too few: lose signals that were in the data. Many excellent and practical methods: likelihood ratio test (Bartlett 1950); fixed threshold (Kaiser 1960); scree plot (Cattell 1966); 4/√3 singular value threshold (Gavish & Donoho 2014); bi-cross-validation (Owen & Wang 2016); ... Today's talk: parallel analysis (Horn 1965; Buja & Eyuboglu 1992). PA is a popular method with extensive empirical evidence, but limited theoretical understanding – an exciting area for work!

  10. Parallel analysis for rank selection. Parallel analysis is suggested in many reviews: Brown (2014): PA “is accurate in the vast majority of cases”; Hayton et al. (2004): PA is “one of the most accurate factor retention methods” used in social science and management; Costello and Osborne (2005): PA is “accurate and easy to use”; Friedman et al. (2009): defaults to PA for rank selection. Also gaining popularity in applied statistics (esp. the biological sciences): Leek and Storey (2007, 2008); Lin et al. (2016); Gerard and Stephens (2017); Zhou et al. (2017); ... But there remains limited theoretical understanding: PA is “at best a heuristic approach rather than a mathematically rigorous one” – Green et al. (2012).

  21. Parallel analysis for rank selection. Given: data matrix X ∈ R^{n×p} and percentile α ∈ [0, 1]. 1. Generate X_π by randomly permuting the entries in each column. 2. Repeat several times. 3. Select the k-th component if the k-th singular value of X exceeds the α-percentile of the k-th singular value of X_π. One component rises above the permuted version. Idea: recover the “null” by destroying correlations between features.
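The three steps on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the speakers' code; the defaults `alpha=0.95`, `n_perm=20`, and the toy data are assumptions made for the example:

```python
# A minimal sketch of parallel analysis as described on the slide.
import numpy as np

def parallel_analysis(X, alpha=0.95, n_perm=20, rng=None):
    """Select the k-th component while sigma_k(X) exceeds the
    alpha-percentile of sigma_k of column-permuted copies of X."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    svals = np.linalg.svd(X, compute_uv=False)
    perm_svals = np.empty((n_perm, min(n, p)))
    for t in range(n_perm):
        # step 1: permute the entries within each column independently
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        perm_svals[t] = np.linalg.svd(Xp, compute_uv=False)  # step 2: repeat
    thresholds = np.quantile(perm_svals, alpha, axis=0)
    # step 3: count leading singular values that rise above the permuted ones
    k = 0
    while k < len(svals) and svals[k] > thresholds[k]:
        k += 1
    return k

# toy data: one planted rank-1 component plus i.i.d. Gaussian noise
rng = np.random.default_rng(1)
u, v = rng.standard_normal(300), rng.standard_normal(50)
X = 2.0 * np.outer(u, v) / np.sqrt(50) + rng.standard_normal((300, 50))
print(parallel_analysis(X, rng=123))  # number of selected components
```

On this toy example the planted component's singular value sits well above the permutation thresholds, so at least one component is selected.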

  24. A quick sneak peek... For a larger version of the same problem, i.e., bigger n, p: permutation provides a good estimate of the noise spectrum. ...let's begin characterizing this a bit!

  26. Parallel analysis under factor models. Model: data is a linear combination of factors λ_jk with noise ε_ij: X_ij = Σ_{k=1}^r η_ik λ_jk + ε_ij, i.e., low-rank signal + noise: X = ηΛ^⊤ + E = S + E.
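The factor model on this slide is easy to simulate. A small sketch with assumed illustrative dimensions (`n`, `p`, `r` are choices for the example, not values from the talk):

```python
# Factor model: X_ij = sum_{k=1}^r eta_ik * lambda_jk + eps_ij,
# i.e., X = eta @ Lambda.T + E = S + E.
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 100, 2              # samples, features, number of factors
eta = rng.standard_normal((n, r))  # factor scores eta, n x r
Lam = rng.standard_normal((p, r))  # factor loadings Lambda, p x r
E = rng.standard_normal((n, p))    # noise

S = eta @ Lam.T                    # low-rank signal, rank <= r
X = S + E

print(np.linalg.matrix_rank(S))    # -> 2
```

The signal S has rank at most r regardless of n and p, which is what makes the spiked picture on the earlier slides appear.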

  30. Parallel analysis under factor models. Key idea: permutation “destroys” the signal S but not the noise E: ‖S_π‖ ≪ ‖S‖, while E_π =_d E (equal in distribution). Consequence: PA estimates the noise spectrum (i.e., the noise floor): σ_k(X_π) = σ_k(S_π + E_π) ≈ σ_k(E_π) =_d σ_k(E).
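The key idea can be checked numerically. A sketch with assumed sizes and an arbitrary rank-1 signal (the scale 50 is an illustrative choice): within-column permutation collapses the signal's operator norm but leaves the noise spectrum essentially unchanged, since E_π has the same distribution as E:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 200

def permute_columns(A, rng):
    """Independently permute the entries within each column of A."""
    return np.column_stack([rng.permutation(A[:, j]) for j in range(A.shape[1])])

# rank-1 signal S with operator norm 50, plus i.i.d. Gaussian noise E
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(p); v /= np.linalg.norm(v)
S = 50.0 * np.outer(u, v)
E = rng.standard_normal((n, p))

print(np.linalg.norm(S, 2), np.linalg.norm(permute_columns(S, rng), 2))  # ||S_pi|| << ||S||
print(np.linalg.norm(E, 2), np.linalg.norm(permute_columns(E, rng), 2))  # nearly equal
```

The permuted signal's top singular value drops by a large factor, while the permuted noise's top singular value matches that of E up to sampling fluctuation, mirroring σ_k(X_π) ≈ σ_k(E_π) =_d σ_k(E).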
