High Dimensional Data, Covariance Matrices and Application to Genetics

Samprit Banerjee, Ph.D.
Div. of Biostatistics, Department of Public Health
Weill Medical College of Cornell University

UW-M, 22-Apr-2010
Outline

◮ Motivation
  ◮ High Dimensional Data
  ◮ Examples
◮ Theoretical Underpinnings
  ◮ Random Matrices
  ◮ Shrinkage Estimation
  ◮ Decision Theory
  ◮ Bayesian Estimation
◮ QTL Mapping
  ◮ Background
  ◮ Statistical Challenges
  ◮ Bayesian Solution
  ◮ Bayesian Multiple Traits
Data Deluge

“The coming century is surely the century of data”
    David Donoho, 2000

“... industrial revolution of data.”
    The Economist, 2010

Sources of high dimensional data
◮ Genetics and Genomics
◮ Internet portals, e.g. Netflix
◮ Financial data
High Dimensional Data

In statistics,
◮ Observations: instances of a particular phenomenon
  ◮ Example of instances ↔ human beings
◮ Typically, n denotes the number of observations.
◮ A variable (or random variable) is the vector of values on which these observations are measured
  ◮ Examples: blood pressure, weight, height.
◮ Typically, p denotes the number of variables.
◮ Recent trend: explosive growth of p ↔ dimensionality.
◮ When p ≫ n, classical methods of statistics fail!
Example 1: Principal Component Analysis

Let X_{n×p} = [X_1 : X_2 : · · · : X_p] be i.i.d. variables.
Goal: reduce dimensionality by constructing a smaller number of “derived” variables.

    w_1 = arg max_{||w|| = 1} var(w′X)

Spectral decomposition: X′X = WLW′, where L = diag{ℓ_1, ..., ℓ_p} are the eigenvalues.
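A minimal numerical sketch of this construction, assuming numpy and simulated data (the sizes n, p and the random data are illustrative, not from the slides): the leading eigenvector of X′X gives w_1, and the first derived variable is X w_1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                        # illustrative sizes
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)               # center each variable

# Spectral decomposition X'X = W L W'
L, W = np.linalg.eigh(X.T @ X)       # eigenvalues returned in ascending order
order = np.argsort(L)[::-1]          # sort descending: l_1 > ... > l_p
L, W = L[order], W[:, order]

w1 = W[:, 0]                         # first principal direction, ||w1|| = 1
pc1 = X @ w1                         # first derived variable (scores)
print(L / n)                         # sample eigenvalues of S = X'X / n
```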
Population Structure within Europe

[Figure: PCA map of population structure within Europe, reproduced from J Novembre et al., Nature (2008), doi:10.1038/nature07331]
Example 2: Multivariate Regression

One of the most common uses of the covariance matrix in statistics is in multivariate regression:

    Y_{n×p} = X_{n×q} β_{q×p} + E_{n×p}

where e_i ∼ N_p(0, Σ), i = 1, · · · , n, and Σ is p × p.

◮ Historically p < n; with high dimensional data p ≫ n or q ≫ n
◮ Estimates can be obtained by maximizing the likelihood

    L(β, Σ | X, Y) ∝ ∏_{i=1}^{n} |Σ|^{-1/2} exp{ -½ (Y_i − X_i β)′ Σ^{-1} (Y_i − X_i β) }
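For concreteness, a minimal sketch of the resulting maximum likelihood estimates, β̂ = (X′X)^{-1} X′Y and Σ̂ = (Y − Xβ̂)′(Y − Xβ̂)/n, assuming numpy; the dimensions and the simulated design below are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, p = 200, 3, 4                              # illustrative sizes
X = rng.standard_normal((n, q))
beta_true = rng.standard_normal((q, p))
E = rng.standard_normal((n, p))
Y = X @ beta_true + E

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # (X'X)^{-1} X'Y
resid = Y - X @ beta_hat
Sigma_hat = resid.T @ resid / n                  # MLE of the p x p covariance Sigma
```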
Seemingly Unrelated Regression

Zellner (1962) introduced the Seemingly Unrelated Regression model:

    Y*_{np×1} = X*_{np×pq} β*_{pq×1} + e*_{np×1}

where Y* = vec(Y), X* = diag{X_1, · · · , X_p}, β* = vec(β), e* = vec(E), and vec(·) is the vectorizing operator.

◮ e* ∼ N(0, Σ ⊗ I_n)
◮ GLS estimates: β̂ = (X*′ Ω^{-1} X*)^{-1} (X*′ Ω^{-1} Y*)
◮ where Ω = Σ ⊗ I_n and ⊗ is the Kronecker product.
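A minimal sketch of this GLS estimator, assuming numpy/scipy and small simulated designs X_1, ..., X_p (the sizes, the chosen Σ, and the true coefficients are illustrative assumptions); it builds X* with block_diag, forms Ω^{-1} = Σ^{-1} ⊗ I_n via the Kronecker product, and solves the GLS normal equations.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
n, q, p = 50, 2, 3                                           # illustrative sizes
X_blocks = [rng.standard_normal((n, q)) for _ in range(p)]   # one design per response
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
E = rng.multivariate_normal(np.zeros(p), Sigma, size=n)      # rows e_i ~ N_p(0, Sigma)
Y = np.column_stack([X_blocks[j] @ np.ones(q) + E[:, j] for j in range(p)])

X_star = block_diag(*X_blocks)               # np x pq block-diagonal design
y_star = Y.T.reshape(-1)                     # vec(Y): responses stacked column by column
Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))         # (Sigma kron I_n)^{-1}
beta_hat = np.linalg.solve(X_star.T @ Omega_inv @ X_star,
                           X_star.T @ Omega_inv @ y_star)    # GLS estimate of beta*
```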
Random Matrix Theory

◮ The covariance matrix Σ_{p×p} is a random matrix
◮ The eigenvalues of Σ, {λ_1, · · · , λ_p}, are random
◮ Properties of interest: the joint distribution of the eigenvalues, the number of eigenvalues falling below a given value
◮ Beginning in the 1950s, physicists began to use random matrices to study the energy levels of a system in quantum mechanics.
◮ Wigner proposed a statistical description of an “ensemble” of energy levels: properties of the empirical distribution and of the distribution of spacings.
Covariance Matrices

In statistics: X_1, · · · , X_n ∼ N_p(0, Σ) and X_{n×p} = [X_1, · · · , X_n]′. The usual estimator is the

Sample Covariance Matrix
    S = X′X / n

Bayesian Estimation
    π(Σ | X) ∝ p(X | Σ) π(Σ)
    Σ̂ = E_{π(· | X)}(Σ)
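A minimal sketch contrasting the two estimators, assuming numpy; the conjugate inverse-Wishart prior IW(ν_0, Ψ_0) and its hyperparameter values are illustrative assumptions (the slides do not specify a prior), in which case the posterior is IW(ν_0 + n, Ψ_0 + X′X) and the posterior mean has the closed form used below.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 10                                    # illustrative sizes
Sigma_true = np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)

S = X.T @ X / n                                  # usual sample covariance (mean known to be zero)

# Assumed conjugate prior: Sigma ~ IW(nu0, Psi0) with illustrative hyperparameters.
nu0 = p + 2
Psi0 = np.eye(p)
# Posterior: Sigma | X ~ IW(nu0 + n, Psi0 + X'X); its mean (valid when nu0 + n > p + 1):
Sigma_bayes = (Psi0 + X.T @ X) / (nu0 + n - p - 1)
```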
Gaussian and Wishart Distributions

If X_1, X_2, · · · , X_n are n i.i.d. samples from a p-variate (p-dimensional) Gaussian distribution N_p(0, Σ) with density

    f(X) = |2πΣ|^{-1/2} exp{ -½ X′Σ^{-1}X }

then S = X′X follows a Wishart distribution (named after John Wishart, 1928):

    f(S) = c_{n,p} |Σ|^{-n/2} |S|^{(n−p−1)/2} etr{ -½ Σ^{-1} S }

where etr(·) = exp(tr(·)).
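A minimal sketch, assuming numpy and scipy, that forms S = X′X from simulated Gaussian rows and compares it with direct draws from scipy.stats.wishart; the Monte Carlo average of such draws should be close to nΣ. The chosen Σ and the number of replicates are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(4)
n, p = 30, 3                                  # illustrative sizes (need n > p - 1)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = X.T @ X                                   # one realization of a Wishart_p(n, Sigma) matrix

S_direct = wishart.rvs(df=n, scale=Sigma, random_state=rng)   # draw from the same law
avg = np.mean([wishart.rvs(df=n, scale=Sigma, random_state=rng)
               for _ in range(2000)], axis=0)
print(avg)                                    # Monte Carlo mean, close to n * Sigma
print(n * Sigma)
```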
Eigenstructure of the Sample Covariance Matrix

It is well known that the eigenvalues of the sample covariance matrix are more spread out compared to the true eigenvalues of the population covariance matrix.
Spread of Sample Eigenvalues

◮ Counting the number of times the sample eigenvalues are spread: ℓ_1 < λ_1 | ℓ_p > λ_p
◮ ℓ_1 > ℓ_2 > · · · > ℓ_p are the eigenvalues of the sample covariance matrix S
◮ λ_1 > λ_2 > · · · > λ_p are the eigenvalues of the population covariance matrix Σ

[Figure: heatmap of this count over sample sizes n = 2, ..., 100 and dimensions p = 2, ..., 100, with a colour scale running from 0.3 to 1.0]
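A minimal simulation sketch of this spread, assuming numpy, with Σ = I_p so that every population eigenvalue equals 1; the sizes n, p and the number of replicates are illustrative assumptions. On average the largest sample eigenvalue exceeds 1 and the smallest falls below 1.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, reps = 50, 20, 500                 # illustrative sizes
l_max, l_min = [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))      # rows ~ N_p(0, I_p), so lambda_1 = ... = lambda_p = 1
    S = X.T @ X / n
    eig = np.linalg.eigvalsh(S)          # ascending order
    l_max.append(eig[-1])                # largest sample eigenvalue l_1
    l_min.append(eig[0])                 # smallest sample eigenvalue l_p

print(np.mean(l_max), np.mean(l_min))    # typically well above 1 and well below 1
```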