High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
University College London Master Class: Lecture 1
Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu
Introduction

Classical asymptotic theory: sample size n → +∞ with the number of parameters p fixed
◮ law of large numbers, central limit theorem
◮ consistency of maximum likelihood estimation

Modern applications in science and engineering:
◮ large-scale problems: both p and n may be large (possibly p ≫ n)
◮ need for high-dimensional theory that provides non-asymptotic results for (n, p)

Curses and blessings of high dimensionality:
◮ exponential explosions in computational complexity
◮ statistical curses (sample complexity)
◮ concentration of measure

Key ideas: what embedded low-dimensional structures are present in data, and how can they be exploited algorithmically?
Vignette I: High-dimensional matrix estimation

Want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples X_i ∼ N(0, Σ), for i = 1, 2, ..., n.

Classical approach: estimate Σ via the sample covariance matrix
\hat{\Sigma}_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^T   (an average of p × p rank-one matrices)

Reasonable properties (p fixed, n increasing):
◮ unbiased: E[\hat{\Sigma}_n] = Σ
◮ consistent: \hat{\Sigma}_n → Σ almost surely as n → +∞
◮ asymptotic distributional properties available

An alternative experiment: fix some α > 0 and study behavior over sequences with p/n = α. Does \hat{\Sigma}_n converge to anything reasonable?
[Figure: empirical eigenvalue density of \hat{\Sigma}_n vs. the Marchenko-Pastur law, α = 0.5; density (rescaled) vs. eigenvalue.]
[Figure: empirical eigenvalue density of \hat{\Sigma}_n vs. the Marchenko-Pastur law, α = 0.2; density (rescaled) vs. eigenvalue.]
(Marchenko & Pastur, 1967)
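A minimal simulation sketch (not from the slides) of the experiment behind these figures: draw X_i ∼ N(0, I_p) with p/n = α fixed, compute the eigenvalues of the sample covariance, and overlay the Marchenko-Pastur density. The choices of n, p, and α below are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

alpha = 0.5              # aspect ratio p / n (illustrative choice)
n = 4000
p = int(alpha * n)

# Sample covariance of i.i.d. N(0, I_p) rows: eigenvalues do NOT concentrate at 1.
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n
eigs = np.linalg.eigvalsh(Sigma_hat)

# Marchenko-Pastur density for ratio alpha (identity population covariance).
lam_minus, lam_plus = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
x = np.linspace(lam_minus + 1e-6, lam_plus, 400)
mp_density = np.sqrt((lam_plus - x) * (x - lam_minus)) / (2 * np.pi * alpha * x)

plt.hist(eigs, bins=60, density=True, alpha=0.4, label="empirical eigenvalues")
plt.plot(x, mp_density, label="Marchenko-Pastur law")
plt.xlabel("eigenvalue")
plt.ylabel("density")
plt.legend()
plt.show()
```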
Low-dimensional structure: Gaussian graphical models

[Figure: graph on vertices {1, ..., 5} and the zero pattern of its inverse covariance Θ*.]

P(x_1, x_2, \ldots, x_p) \propto \exp\bigl( -\tfrac{1}{2} x^T \Theta^* x \bigr).
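As a small illustration (not from the slides), the sketch below builds a sparse precision matrix Θ* for a 5-node chain graph and checks that the implied covariance Σ = (Θ*)^{-1} is dense: the low-dimensional (sparse) structure lives in the inverse covariance, not in Σ itself. The chain graph and numerical values are assumptions chosen for illustration.

```python
import numpy as np

p = 5
# Sparse precision matrix Theta* for a chain graph 1 - 2 - 3 - 4 - 5:
# nonzero off-diagonals only between adjacent nodes (illustrative values).
Theta_star = np.eye(p)
for j in range(p - 1):
    Theta_star[j, j + 1] = Theta_star[j + 1, j] = 0.4

assert np.all(np.linalg.eigvalsh(Theta_star) > 0)  # valid (positive definite) model

Sigma = np.linalg.inv(Theta_star)   # covariance of X ~ N(0, Sigma)

# Zero pattern: Theta* is sparse (only chain edges), but Sigma is dense.
print("nonzeros in Theta*:", np.count_nonzero(np.abs(Theta_star) > 1e-10))
print("nonzeros in Sigma :", np.count_nonzero(np.abs(Sigma) > 1e-10))
```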
Maximum-likelihood with ℓ1-regularization

[Figure: graph on vertices {1, ..., 5} and the zero pattern of its inverse covariance.]

Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ* ∈ R^{p×p}.

Estimator (for inverse covariance):
\hat{\Theta} \in \arg\min_{\Theta} \Bigl\{ \Bigl\langle \frac{1}{n} \sum_{i=1}^n x_i x_i^T, \, \Theta \Bigr\rangle - \log\det(\Theta) + \lambda_n \sum_{j \neq k} |\Theta_{jk}| \Bigr\}

Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009
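A hedged sketch of how an estimator of this form can be computed in practice, using scikit-learn's GraphicalLasso (which solves an ℓ1-penalized Gaussian maximum-likelihood problem). The regularization level, sample sizes, and the data-generating Θ* below are illustrative assumptions, not values from the slides.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Illustrative sparse precision matrix (chain graph) and its implied covariance.
p = 10
Theta_star = np.eye(p)
for j in range(p - 1):
    Theta_star[j, j + 1] = Theta_star[j + 1, j] = 0.4
Sigma_star = np.linalg.inv(Theta_star)

# Draw n i.i.d. samples X_i ~ N(0, Sigma_star).
n = 500
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)

# l1-regularized Gaussian MLE for the inverse covariance; the parameter alpha
# plays the role of lambda_n and is an illustrative choice.
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_

# Compare estimated and true sparsity patterns.
est_support = np.abs(Theta_hat) > 1e-3
true_support = np.abs(Theta_star) > 1e-10
print("fraction of entries with correct zero/nonzero pattern:",
      np.mean(est_support == true_support))
```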
Gauss-Markov models with hidden variables

[Figure: hidden variable Z connected to observed variables X_1, X_2, X_3, X_4.]

Problems with hidden variables: conditioned on the hidden Z, the vector X = (X_1, X_2, X_3, X_4) is Gauss-Markov.

The inverse covariance of X satisfies a {sparse, low-rank} decomposition:
\begin{pmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{pmatrix} = I_{4 \times 4} - \mu \, \mathbf{1}\mathbf{1}^T.

(Chandrasekaran, Parrilo & Willsky, 2010)
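A quick numerical check of this decomposition (an illustrative sketch, not from the slides): if X_j = a·Z + ε_j with hidden Z ∼ N(0,1) and independent ε_j ∼ N(0,1), then Cov(X) = I + a²\mathbf{1}\mathbf{1}^T, and the Sherman-Morrison formula gives the marginal precision I − µ\mathbf{1}\mathbf{1}^T with µ = a²/(1 + p a²). The values of a and p below are assumptions.

```python
import numpy as np

p, a = 4, 0.7                      # illustrative choices
ones = np.ones((p, 1))

# Marginal covariance of X when X_j = a*Z + eps_j, with Z, eps_j standard normal.
Sigma = np.eye(p) + a**2 * (ones @ ones.T)

# Marginal inverse covariance: identity (sparse) minus a rank-one term.
mu = a**2 / (1 + p * a**2)
Theta_predicted = np.eye(p) - mu * (ones @ ones.T)

print(np.allclose(np.linalg.inv(Sigma), Theta_predicted))   # True
```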
Vignette II: High-dimensional sparse linear regression

[Figure: observation model y = Xθ* + w, with y ∈ R^n, X ∈ R^{n×p}, and θ* supported on a small subset S (zero on S^c).]

Set-up: noisy observations y = Xθ* + w with sparse θ*

Estimator: Lasso program
\hat{\theta} \in \arg\min_{\theta} \Bigl\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda_n \sum_{j=1}^p |\theta_j| \Bigr\}

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Buhlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhao & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van ...
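A minimal sketch of fitting the Lasso with scikit-learn. Note that scikit-learn's objective is (1/(2n))‖y − Xθ‖² + α‖θ‖₁, so its α corresponds to λ_n/2 in the display above; the problem dimensions, sparsity level, noise level, and regularization constant below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Illustrative sparse regression instance: p >> n, k-sparse theta*.
n, p, k = 100, 400, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:k] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Regularization of order sqrt(log(p) / n), a common theoretical scaling;
# the constant in front is an illustrative choice.
lam = 0.5 * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam).fit(X, y).coef_

support_hat = np.flatnonzero(np.abs(theta_hat) > 1e-3)
print("estimated support:", support_hat)          # ideally {0, ..., k-1}
print("l2 error:", np.linalg.norm(theta_hat - theta_star))
```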
Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)

[Figure: observation model y = Xβ*, with y ∈ R^n, X ∈ R^{n×p} a random matrix, and β* ∈ R^p.]

(a) Image: vectorize to β* ∈ R^p
(b) Compute n random projections y = Xβ*
Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)

In practice, signals are sparse in a transform domain: θ* := Ψβ* is a sparse signal, where Ψ is an orthonormal matrix.

[Figure: observation model y = XΨ^T θ*, with θ* sparse.]

Reconstruct θ* (and hence the image β* = Ψ^T θ*) by finding a sparse solution to the under-constrained linear system y = X̃θ, where X̃ = XΨ^T is another random matrix.
Noiseless ℓ1 recovery: Unrescaled sample size

[Figure: probability of exact recovery versus raw sample size n, for p = 128, 256, 512 (µ = 0).]
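A sketch of a single trial of such an experiment (with assumed dimensions): noiseless recovery by basis pursuit, min ‖θ‖₁ subject to Xθ = y, written as a linear program over θ = θ⁺ − θ⁻ and solved with scipy. Repeating this over many random draws and a grid of n would trace out curves like those in the figure.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Illustrative instance: k-sparse signal, n random Gaussian measurements.
n, p, k = 100, 256, 8
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
y = X @ theta_star                      # noiseless observations

# Basis pursuit: min ||theta||_1 s.t. X theta = y, with theta = t_plus - t_minus,
# t_plus, t_minus >= 0, so the objective is the sum of all 2p variables.
c = np.ones(2 * p)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
theta_hat = res.x[:p] - res.x[p:]

print("exact recovery:", np.allclose(theta_hat, theta_star, atol=1e-5))
```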
Application B: Graph structure estimation

Let G = (V, E) be an undirected graph on p = |V| vertices.

A pairwise graphical model factorizes over the edges of the graph:
P(x_1, \ldots, x_p; \theta) \propto \exp\Bigl\{ \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Bigr\}.

Given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure.
Pseudolikelihood and neighborhood regression

Markov properties encode neighborhood structure:
(X_s \mid X_{V \setminus s}) \stackrel{d}{=} (X_s \mid X_{N(s)})
(conditioning on the full graph equals conditioning on the Markov blanket N(s))

[Figure: node X_s with neighborhood N(s) = {t, u, v, w}.]

◮ basis of the pseudolikelihood method (Besag, 1974)
◮ basis of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abbeel et al., 2006; Meinshausen & Buhlmann, 2006)
Graph selection via neighborhood regression

[Figure: binary data matrix with column X_s to be predicted from the remaining columns X_{\s}.]

Predict X_s based on X_{\s} := {X_t, t ≠ s}.

1. For each node s ∈ V, compute the (regularized) maximum likelihood estimate
\hat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Bigl\{ \underbrace{-\frac{1}{n} \sum_{i=1}^n L(\theta; X_{i, \setminus s})}_{\text{local log-likelihood}} + \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Bigr\}

2. Estimate the local neighborhood \hat{N}(s) as the support of the regression vector \hat{\theta}[s] ∈ R^{p-1}.
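A hedged sketch of this two-step procedure for binary data, using ℓ1-regularized logistic regression as the local conditional likelihood. The data below are placeholder random bits rather than samples from a true Ising-type graphical model, and the sample size, regularization level, and the OR rule for combining neighborhoods are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder binary data (in practice: n samples from a pairwise graphical model).
n, p = 500, 10
X = rng.integers(0, 2, size=(n, p))

lam = 0.1 * np.sqrt(np.log(p) / n)       # illustrative regularization scaling
neighborhoods = {}
for s in range(p):
    X_rest = np.delete(X, s, axis=1)     # X_{\s}: all columns except s
    # l1-regularized logistic regression of X_s on X_{\s}; sklearn's C is the
    # inverse regularization strength, so C ~ 1 / (n * lambda_n) up to convention.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (n * lam))
    clf.fit(X_rest, X[:, s])
    support = np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-3)
    labels = [t for t in range(p) if t != s]          # map back to node labels
    neighborhoods[s] = {labels[j] for j in support}

# Combine estimated neighborhoods into an edge set (OR rule).
edges = set()
for s in range(p):
    for t in neighborhoods[s]:
        edges.add((min(s, t), max(s, t)))
print(sorted(edges))
```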
US Senate network (2004–2006 voting)