High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
University College London Master Class: Lecture 2
Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu.
High-level overview

Last lecture: least-squares loss and ℓ1-regularization.

The big picture: lots of other estimators have the same basic form:

$$\underbrace{\hat{\theta}_{\lambda_n}}_{\text{Estimate}} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion, ...).

Question: Is there a common set of underlying principles?
Last lecture: Sparse linear regression

[Figure: observation model y = X θ* + w, with y ∈ R^n, X ∈ R^{n×p}, and θ* supported on S and zero on S^c.]

Set-up: noisy observations y = X θ* + w with sparse θ*.

Estimator: Lasso program

$$\hat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i^T\theta)^2 + \lambda_n \sum_{j=1}^{p} |\theta_j| \Big\}$$
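A minimal sketch of this estimator on simulated data (not from the slides; the problem sizes and noise level are illustrative assumptions, and scikit-learn's objective matches the one above only up to a constant rescaling of λ_n):

```python
# Lasso sketch on simulated sparse-regression data (illustrative assumptions).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 500, 10                    # n samples, p features, s-sparse signal
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0                      # support S = first s coordinates
y = X @ theta_star + 0.5 * rng.standard_normal(n)

# Theory suggests lambda_n of order sigma * sqrt(log p / n); an illustrative choice:
lam = 0.5 * np.sqrt(np.log(p) / n)

# sklearn minimizes (1/(2n))||y - Xw||^2 + alpha*||w||_1, i.e. the slide's
# objective up to a constant rescaling of the regularization level.
fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
print("estimated support:", np.flatnonzero(np.abs(fit.coef_) > 1e-6))
```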
Block-structured extension

[Figure: multivariate observation model Y = X Θ* + W, with Y ∈ R^{n×r}, X ∈ R^{n×p}, and Θ* ∈ R^{p×r} having non-zero rows S and zero rows S^c.]

Signal Θ* is a p × r matrix, partitioned into non-zero rows S and zero rows S^c.

Various applications: multiple-view imaging, gene array prediction, graphical model fitting.

Row-wise ℓ1/ℓ2-norm:

$$|||\Theta|||_{1,2} = \sum_{j=1}^{p} \|\Theta_j\|_2$$

More complicated group structure (Obozinski et al., 2009):

$$|||\Theta|||_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \|\Theta_g\|_2$$
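A minimal sketch of an estimator with the row-wise ℓ1/ℓ2 penalty (illustrative, not the slides' code; the problem sizes are assumptions). scikit-learn's MultiTaskLasso minimizes (1/(2n))‖Y − XW‖_F² + α Σ_j ‖W_j‖₂, i.e. exactly the |||·|||_{1,2} regularizer above, up to rescaling of λ_n:

```python
# Row-wise l1/l2 (multi-task / group) regression sketch.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(1)
n, p, r, s = 100, 200, 5, 8               # illustrative problem sizes
X = rng.standard_normal((n, p))
Theta_star = np.zeros((p, r))
Theta_star[:s, :] = rng.standard_normal((s, r))   # s non-zero rows (the set S)
Y = X @ Theta_star + 0.1 * rng.standard_normal((n, r))

fit = MultiTaskLasso(alpha=0.1, fit_intercept=False).fit(X, Y)
Theta_hat = fit.coef_.T                   # sklearn stores coefficients as (r, p)
row_norms = np.linalg.norm(Theta_hat, axis=1)
print("recovered non-zero rows:", np.flatnonzero(row_norms > 1e-6))
```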
Example: Low-rank matrix approximation

[Figure: rank-r factorization Θ* = U D V^T, with U ∈ R^{p1×r}, D ∈ R^{r×r}, V^T ∈ R^{r×p2}.]

Set-up: matrix Θ* ∈ R^{p1×p2} with rank r ≪ min{p1, p2}.

Estimator:

$$\hat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n}\sum_{i=1}^{n} \big(y_i - \langle\langle X_i, \Theta \rangle\rangle\big)^2 + \lambda_n \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta) \Big\}$$

Some past work: Fazel, 2001; Srebro et al., 2004; Recht, Fazel & Parrilo, 2007; Bach, 2008; Candès & Recht, 2008; Keshavan et al., 2009; Rohde & Tsybakov, 2010; Recht, 2009; Negahban & W., 2010, ...
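A minimal sketch of this nuclear-norm estimator in the matrix-completion setting (i.e. each X_i is a single-entry indicator matrix); the dimensions, noise level, and regularization level are illustrative assumptions, not the slides' choices:

```python
# Nuclear-norm regularized matrix completion sketch with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
p1, p2, r, n = 30, 30, 2, 450
Theta_star = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))  # rank-r truth

# Observe n entries uniformly at random (without replacement).
obs = rng.choice(p1 * p2, size=n, replace=False)
mask = np.zeros(p1 * p2); mask[obs] = 1.0; mask = mask.reshape(p1, p2)
Y = mask * (Theta_star + 0.1 * rng.standard_normal((p1, p2)))

Theta = cp.Variable((p1, p2))
loss = cp.sum_squares(cp.multiply(mask, Theta) - Y) / n
lam = 0.1                                  # illustrative regularization level
cp.Problem(cp.Minimize(loss + lam * cp.normNuc(Theta))).solve()
print("relative error:",
      np.linalg.norm(Theta.value - Theta_star) / np.linalg.norm(Theta_star))
```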
Application: Collaborative filtering

[Figure: a p1 × p2 ratings matrix with a few observed ratings (1-5) and many unobserved entries marked ∗.]

Universe of p1 individuals and p2 films.

Observe n ≪ p1 p2 ratings. (e.g., Srebro, Alon & Jaakkola, 2004)
Security and robustness issues

[Figure: two Amazon product pages, a spiritual guide and a sex manual, linked by the recommendation engine.]

Breakdown of the Amazon recommendation system, 2002.
Matrix decomposition: Low-rank plus sparse

Matrix Y can be (approximately) decomposed into a sum:

$$Y \approx \underbrace{\Theta^*}_{\text{Low-rank component}} + \underbrace{\Gamma^*}_{\text{Sparse component}}$$

where the low-rank component factors as Θ* = U D V^T with U ∈ R^{p1×r}, D ∈ R^{r×r}, V^T ∈ R^{r×p2}.

Exact decomposition: initially studied by Chandrasekaran, Sanghavi, Parrilo & Willsky, 2009.
Subsequent work: Candès et al., 2010; Xu et al., 2010; Hsu et al., 2010; Agarwal et al., 2011.

Various applications:
◮ robust collaborative filtering
◮ robust PCA
◮ graphical model selection with hidden variables
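A minimal sketch of the convex program studied in this line of work (nuclear norm for the low-rank part, entrywise ℓ1 for the sparse part). The weight ρ = 1/√max(p1, p2) is the usual heuristic from this literature and, like the simulated data, is an assumption for illustration:

```python
# Low-rank plus sparse decomposition via convex relaxation (sketch).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
p1, p2, r = 40, 40, 2
L_star = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))          # low-rank
S_star = (rng.random((p1, p2)) < 0.05) * 5.0 * rng.standard_normal((p1, p2))  # sparse
Y = L_star + S_star

Theta = cp.Variable((p1, p2))     # low-rank estimate
Gamma = cp.Variable((p1, p2))     # sparse estimate
rho = 1.0 / np.sqrt(max(p1, p2))
prob = cp.Problem(cp.Minimize(cp.normNuc(Theta) + rho * cp.sum(cp.abs(Gamma))),
                  [Theta + Gamma == Y])
prob.solve()
print("recovered rank:", np.linalg.matrix_rank(Theta.value, tol=1e-3))
```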
Gauss-Markov models with hidden variables

[Figure: latent variable Z connected to observed variables X1, X2, X3, X4.]

Problems with hidden variables: conditioned on the hidden Z, the vector X = (X1, X2, X3, X4) is Gauss-Markov.

Inverse covariance of X satisfies a {sparse, low-rank} decomposition:

$$\begin{bmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{bmatrix} \;=\; I_{4\times 4} - \mu\, \mathbf{1}\mathbf{1}^T.$$

(Chandrasekaran, Parrilo & Willsky, 2010)
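A small numerical check of this decomposition. The joint model below (unit conditional precisions for the X_i and equal couplings to a single latent Z) is an assumption chosen to reproduce the displayed matrix: the marginal precision of X is the Schur complement K_XX − K_XZ K_ZZ^{-1} K_ZX, a sparse (here diagonal) matrix minus a rank-one correction.

```python
# Schur-complement check: sparse minus low-rank structure of the marginal precision.
import numpy as np

a, b = 0.4, 1.0                    # coupling to Z and precision of Z (assumptions)
mu = a**2 / b
K_XX = np.eye(4)                   # conditional precision of X given Z: sparse
K_XZ = -a * np.ones((4, 1))        # equal couplings X_i -- Z
K_ZZ = np.array([[b]])

marginal_precision = K_XX - K_XZ @ np.linalg.inv(K_ZZ) @ K_XZ.T
print(np.allclose(marginal_precision, np.eye(4) - mu * np.ones((4, 4))))  # True
```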
Example: Sparse principal components analysis

[Figure: Σ = Z Z^T + D, a low-rank component with sparse factors plus a diagonal matrix.]

Set-up: covariance matrix Σ = Z Z^T + D, where the leading eigenspace Z has sparse columns.

Estimator:

$$\hat{\Theta} \in \arg\min_{\Theta \in \Omega} \Big\{ -\langle\langle \hat{\Sigma}, \Theta \rangle\rangle + \lambda_n \sum_{(j,k)} |\Theta_{jk}| \Big\}$$

Some past work: Johnstone, 2001; Joliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008.
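A minimal sketch of the semidefinite relaxation behind this estimator. The constraint set Ω = {Θ ⪰ 0, tr(Θ) = 1} is the standard choice in this literature (d'Aspremont et al.; Amini & Wainwright) and is assumed here, as are the spiked-covariance simulation and the value of λ_n:

```python
# Sparse-PCA SDP relaxation sketch with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
p, n, s = 20, 200, 4
z = np.zeros(p); z[:s] = 1.0 / np.sqrt(s)          # sparse leading eigenvector
Sigma = 5.0 * np.outer(z, z) + np.eye(p)           # spiked covariance ZZ^T + D
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = X.T @ X / n

lam = np.sqrt(np.log(p) / n)                       # illustrative choice
Theta = cp.Variable((p, p), PSD=True)
obj = cp.Maximize(cp.trace(Sigma_hat @ Theta) - lam * cp.sum(cp.abs(Theta)))
cp.Problem(obj, [cp.trace(Theta) == 1]).solve()

# The leading eigenvector of the solution estimates the sparse principal direction.
evals, evecs = np.linalg.eigh(Theta.value)
print("support estimate:", np.flatnonzero(np.abs(evecs[:, -1]) > 0.1))
```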
Motivation and roadmap

Many results on different high-dimensional models, all based on estimators of the type

$$\underbrace{\hat{\theta}_{\lambda_n}}_{\text{Estimate}} \in \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

Question: Is there a common set of underlying principles?

Answer: Yes, two essential ingredients.
(I) Restricted strong convexity of the loss function
(II) Decomposability of the regularizer
(I) Role of curvature

1. Curvature controls the difficulty of estimation:

[Figure: (a) high curvature: a small change δL in the loss forces the error Δ = θ̂ − θ* to be small, so estimation is easy; (b) low curvature: the same δL is compatible with a large Δ, so estimation is harder.]

2. Curvature is captured by a lower bound on the Taylor-series error:

$$\mathcal{T}_{\mathcal{L}}(\Delta; \theta^*) := \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla\mathcal{L}(\theta^*), \Delta \rangle \;\ge\; \frac{\gamma}{2}\,\|\Delta\|^2$$

for all Δ around θ*.
High dimensions: no strong convexity!

[Figure: a loss surface over two dimensions with a direction of zero curvature.]

When p > n, the Hessian ∇²L(θ; Z_1^n) has a nullspace of dimension at least p − n.
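A tiny numerical illustration of this statement for the least-squares loss (the sizes are illustrative): with p > n the Hessian X^T X / n is rank-deficient, so the loss has flat directions and cannot be strongly convex.

```python
# Rank deficiency of the least-squares Hessian when p > n.
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 200
X = rng.standard_normal((n, p))
H = X.T @ X / n                       # Hessian of the least-squares loss
print("rank:", np.linalg.matrix_rank(H))          # at most n = 50
print("nullspace dimension >=", p - n)            # 150 flat directions
```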
Restricted strong convexity

Definition: The loss function L_n satisfies restricted strong convexity (RSC) with respect to the regularizer R if

$$\underbrace{\mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla\mathcal{L}_n(\theta^*), \Delta \rangle}_{\text{Taylor error } \mathcal{T}_{\mathcal{L}}(\Delta;\theta^*)} \;\ge\; \underbrace{\frac{\gamma_\ell}{2}\,\|\Delta\|^2}_{\text{Lower curvature}} \;-\; \underbrace{\frac{\tau_\ell}{2}\,\mathcal{R}^2(\Delta)}_{\text{Tolerance}}$$

for all Δ in a suitable neighborhood of θ*.

Ordinary strong convexity:
◮ special case with tolerance τ_ℓ = 0
◮ does not hold for most loss functions when p > n

RSC enforces a lower bound on curvature, but only when R²(Δ) ≪ ‖Δ‖².
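A minimal empirical illustration of this point for the least-squares loss, whose Taylor error is exactly T(Δ) = ‖XΔ‖²/n (the design, sparsity level, and number of trials are illustrative assumptions): the curvature is zero along nullspace directions of X, yet over sparse directions, where R²(Δ) = ‖Δ‖₁² is small relative to ‖Δ‖₂², the curvature ratio stays bounded away from zero. This is the phenomenon the RSC condition formalizes.

```python
# Restricted curvature of the least-squares loss: sparse vs. nullspace directions.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(6)
n, p, s = 100, 400, 5
X = rng.standard_normal((n, p))

def curvature(delta):
    # Taylor error per unit squared length: (||X delta||^2 / n) / ||delta||^2.
    return (np.linalg.norm(X @ delta) ** 2 / n) / np.linalg.norm(delta) ** 2

# Random s-sparse directions: curvature concentrates near 1 for a Gaussian design.
ratios = []
for _ in range(200):
    delta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    delta[support] = rng.standard_normal(s)
    ratios.append(curvature(delta))
print("min curvature over sparse directions:", min(ratios))

# A direction in the nullspace of X: curvature is exactly zero.
delta_null = null_space(X)[:, 0]
print("curvature along a nullspace direction:", curvature(delta_null))
```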