High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright
UC Berkeley, Departments of Statistics and EECS
University College London, Master Class: Lecture 3
Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu
Non-parametric regression

Goal: How to predict an output from covariates?
given covariates $(x_1, x_2, x_3, \ldots, x_p)$
output variable $y$
want to predict $y$ based on $(x_1, \ldots, x_p)$

Different models, ordered in terms of complexity/richness:
linear
non-linear but still parametric
semi-parametric
non-parametric

Challenge: How to control statistical and computational complexity for a large number of predictors $p$?
High dimensions and sample complexity

Possible models:
ordinary linear regression: $y = \underbrace{\sum_{j=1}^{p} \theta_j x_j}_{\langle \theta,\, x \rangle} + w$
general non-parametric model: $y = f(x_1, x_2, \ldots, x_p) + w$

Sample complexity: How many samples $n$ for reliable prediction?
linear models
◮ without any structure: sample size $n \asymp p/\epsilon^2$ necessary/sufficient (linear in $p$)
◮ with sparsity $s \ll p$: sample size $n \asymp (s \log p)/\epsilon^2$ necessary/sufficient (logarithmic in $p$)
non-parametric models: $p$-dimensional, smoothness $\alpha$
Curse of dimensionality: $n \asymp (1/\epsilon)^{2 + p/\alpha}$ (exponential in $p$)
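To make these scalings concrete, here is a small illustrative Python sketch (not part of the lecture) that plugs numbers into the three rates with all hidden constants set to 1; only the orders of magnitude are meaningful.

```python
# Illustrative only: back-of-the-envelope sample-size scalings from the slide,
# with all constants set to 1 (the true statements are order-of-magnitude results).
import math

def n_linear(p, eps):
    """Unstructured linear regression: n ~ p / eps^2."""
    return math.ceil(p / eps**2)

def n_sparse_linear(p, s, eps):
    """s-sparse linear regression: n ~ (s log p) / eps^2."""
    return math.ceil(s * math.log(p) / eps**2)

def n_nonparametric(p, alpha, eps):
    """alpha-smooth nonparametric regression in p dimensions: n ~ (1/eps)^(2 + p/alpha)."""
    return math.ceil((1.0 / eps) ** (2 + p / alpha))

if __name__ == "__main__":
    p, s, alpha, eps = 1000, 10, 2.0, 0.1
    print(n_linear(p, eps))                # ~1e5: linear in p
    print(n_sparse_linear(p, s, eps))      # ~7e3: logarithmic in p
    print(n_nonparametric(p, alpha, eps))  # astronomically large: curse of dimensionality
```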
Structure in non-parametric regression

Upshot: Essential to impose structural constraints for high-dimensional non-parametric models.

Reduced dimension models:
dimension-reducing function: $\varphi: \mathbb{R}^p \to \mathbb{R}^k$, where $k \ll p$
lower-dimensional function: $g: \mathbb{R}^k \to \mathbb{R}$
composite function: $f: \mathbb{R}^p \to \mathbb{R}$ with $f(x_1, x_2, \ldots, x_p) = g\big(\varphi(x_1, x_2, \ldots, x_p)\big)$

Example: Regression on a $k$-dimensional manifold
Form of model: $f(x_1, x_2, \ldots, x_p) = g\big(\varphi(x_1, x_2, \ldots, x_p)\big)$, where $\varphi$ is the co-ordinate mapping.
[Figure: data lying on a low-dimensional manifold embedded in a higher-dimensional space]

Example: Ridge functions
Form of model: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{k} g_j\big(\langle a_j, x \rangle\big)$
Dimension-reducing mapping: $\varphi(x_1, \ldots, x_p) = Ax$ for some $A \in \mathbb{R}^{k \times p}$.
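For intuition, the following minimal sketch (an illustration with arbitrary choices, not the lecture's example) simulates data from a ridge-function model with $k = 2$: the dimension reduction is the linear map $\varphi(x) = Ax$, followed by univariate links $g_j$.

```python
# A minimal sketch of data drawn from a ridge-function model
# f(x) = sum_j g_j(<a_j, x>); all specific choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 50, 2

A = rng.standard_normal((k, p))           # rows a_1, ..., a_k of the reducing map
g = [np.sin, np.tanh]                     # arbitrary univariate link functions g_1, ..., g_k

X = rng.standard_normal((n, p))
Z = X @ A.T                               # phi(x_i) = A x_i, an n-by-k matrix
f = sum(g[j](Z[:, j]) for j in range(k))  # f(x_i) = sum_j g_j(<a_j, x_i>)
y = f + 0.1 * rng.standard_normal(n)      # noisy responses y_i = f(x_i) + w_i
```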
Remainder of lecture
1 Sparse additive models
◮ formulation, applications
◮ families of estimators
◮ efficient implementation as an SOCP
2 Statistical rates
◮ kernel complexity
◮ subset selection plus univariate function estimation
3 Minimax lower bounds
◮ statistics as channel coding
◮ metric entropy and lower bounds
Sparse additive models

additive models: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{p} f_j(x_j)$ (Stone, 1985; Hastie & Tibshirani, 1990)

additivity with sparsity: $f(x_1, x_2, \ldots, x_p) = \sum_{j \in S} f_j(x_j)$ for an unknown subset $S$ of cardinality $|S| = s$

studied by previous authors:
◮ Lin & Zhang, 2006: COSSO relaxation
◮ Ravikumar et al., 2007: SpAM back-fitting procedure, consistency
◮ Bach et al., 2008: multiple kernel learning (MKL), consistency in the classical setting
◮ Meier et al., 2007: $L^2(\mathbb{P}_n)$ regularization
◮ Koltchinskii & Yuan, 2008, 2010
◮ Raskutti, Wainwright & Yu, 2009: minimax lower bounds
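As a concrete instance, here is a hypothetical simulation of such a model: the active set $S$ and the univariate components $f_j$ below are arbitrary choices made for illustration, not taken from the lecture.

```python
# Simulate a sparse additive model: only |S| = s of the p coordinates enter the
# regression function, each through its own smooth univariate component f_j.
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 500, 1000, 4

S = rng.choice(p, size=s, replace=False)    # unknown active set
f_j = [np.sin, np.cos, np.abs, np.tanh]     # arbitrary univariate components

X = rng.uniform(-1.0, 1.0, size=(n, p))
f_star = sum(f_j[idx](X[:, j]) for idx, j in enumerate(S))
y = f_star + 0.5 * rng.standard_normal(n)   # y_i = f*(x_i) + w_i
```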
Application: Copula methods and graphical models

transform $X_j \mapsto Z_j = f_j(X_j)$
model $(Z_1, \ldots, Z_p)$ as a jointly Gaussian Markov random field:
$\mathbb{P}(z_1, z_2, \ldots, z_p) \propto \exp\Big( \sum_{(s,t) \in E} \theta_{st}\, z_s z_t \Big)$
[Figure: undirected graph on nodes $X_s, X_{t_1}, \ldots, X_{t_5}$ illustrating the Markov structure]

exploit Markov properties: neighborhood-based selection for learning graphs (Besag, 1974; Meinshausen & Buhlmann, 2006)
combined with copula method: semi-parametric approach to graphical model learning (Liu, Lafferty & Wasserman, 2009)
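A rough sketch of this pipeline is given below, assuming a rank-based estimate of the marginal transforms $f_j$ (in the spirit of Liu, Lafferty & Wasserman's nonparanormal) and per-node lasso neighborhood selection (Meinshausen & Buhlmann); the regularization level `lam` and the thresholding rule are placeholders, not the papers' tuning choices.

```python
# Sketch: Gaussianize marginals by ranks, then recover the graph by regressing each
# transformed variable on the others with the lasso (nonzero coefficients give edges).
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.linear_model import Lasso

def gaussianize(X):
    """Marginal transform X_j -> Z_j = Phi^{-1}(rank/(n+1)), a crude estimate of f_j."""
    n = X.shape[0]
    return norm.ppf(rankdata(X, axis=0) / (n + 1))

def neighborhood_selection(Z, lam=0.1):
    """Lasso regression of each node on all others; union of supports gives the edge set."""
    p = Z.shape[1]
    edges = set()
    for j in range(p):
        others = np.delete(np.arange(p), j)
        coef = Lasso(alpha=lam).fit(Z[:, others], Z[:, j]).coef_
        for k, c in zip(others, coef):
            if abs(c) > 1e-8:
                edges.add(tuple(sorted((j, k))))
    return edges
```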
Sparse and smooth

Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^*$ with:
sparse representation: $f^* = \sum_{j \in S} f^*_j$
univariate functions are smooth: $f_j \in \mathcal{H}_j$

Disregarding computational cost:
$\min_{|S| \le s} \; \min_{\substack{f = \sum_{j \in S} f_j \\ f_j \in \mathcal{H}_j}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2}_{\|y - f\|_n^2}$

1-Hilbert-norm as convex surrogate:
$\|f\|_{1,\mathcal{H}} := \sum_{j=1}^{p} \|f_j\|_{\mathcal{H}_j}$

1-$L^2(\mathbb{P}_n)$-norm as convex surrogate:
$\|f\|_{1,n} := \sum_{j=1}^{p} \|f_j\|_{L^2(\mathbb{P}_n)}$, where $\|f_j\|^2_{L^2(\mathbb{P}_n)} := \frac{1}{n} \sum_{i=1}^{n} f_j^2(x_{ij})$.
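In code, the empirical surrogate norms reduce to a few lines; the sketch below assumes the univariate components have already been evaluated at the design points and stored columnwise as an $n \times p$ matrix (an assumption of this illustration, not a step in the lecture).

```python
# Empirical norms from the slide, for components stored as F[i, j] = f_j(x_ij).
import numpy as np

def empirical_l2(fj_vals):
    """||f_j||_{L2(Pn)} = sqrt( (1/n) * sum_i f_j(x_ij)^2 )."""
    return np.sqrt(np.mean(fj_vals ** 2))

def one_n_norm(F):
    """||f||_{1,n} = sum_j ||f_j||_{L2(Pn)}."""
    return sum(empirical_l2(F[:, j]) for j in range(F.shape[1]))
```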
A family of estimators

Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^* = \sum_{j \in S} f^*_j$.

Estimator:
$\widehat{f} \in \arg\min_{f = \sum_{j=1}^{p} f_j} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \rho_n \|f\|_{1,\mathcal{H}} + \mu_n \|f\|_{1,n} \Big\}$

Two kinds of regularization:
$\|f\|_{1,n} = \sum_{j=1}^{p} \|f_j\|_{L^2(\mathbb{P}_n)} = \sum_{j=1}^{p} \sqrt{\frac{1}{n} \sum_{i=1}^{n} f_j^2(x_{ij})}$, and
$\|f\|_{1,\mathcal{H}} = \sum_{j=1}^{p} \|f_j\|_{\mathcal{H}_j}$.
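A heavily simplified sketch of this estimator appears below. It assumes each $f_j$ is represented by a small polynomial basis (the lecture works with general reproducing kernel Hilbert spaces, where a kernel expansion would be used instead) and uses $\|\beta_j\|_2$ as a stand-in for the Hilbert norm of $f_j$; both are assumptions of this sketch, not the lecture's construction. The resulting convex program is a second-order cone program, here handed to cvxpy.

```python
# Simplified doubly regularized sparse additive estimator (polynomial-basis surrogate).
import cvxpy as cp
import numpy as np

def fit_sparse_additive(X, y, rho, mu, degree=3):
    n, p = X.shape
    # Per-coordinate design: columns x_j, x_j^2, ..., x_j^degree, centered.
    Phi = [np.column_stack([X[:, j] ** d for d in range(1, degree + 1)]) for j in range(p)]
    Phi = [B - B.mean(axis=0) for B in Phi]
    beta = [cp.Variable(degree) for _ in range(p)]
    resid = y - sum(Phi[j] @ beta[j] for j in range(p))
    hilbert_pen = sum(cp.norm(beta[j], 2) for j in range(p))        # surrogate for ||f||_{1,H}
    empirical_pen = sum(cp.norm(Phi[j] @ beta[j], 2)
                        for j in range(p)) / np.sqrt(n)             # ||f||_{1,n}
    obj = cp.Minimize(cp.sum_squares(resid) / n
                      + rho * hilbert_pen + mu * empirical_pen)
    cp.Problem(obj).solve()
    return [b.value for b in beta]
```

The block structure of the two penalties is what produces sparsity at the level of whole coordinates: a component $f_j$ is dropped exactly when its entire coefficient block is set to zero.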