Nonparametric Sparsity

John Lafferty (Computer Science Dept. and Machine Learning Dept., Carnegie Mellon University)
Larry Wasserman (Department of Statistics and Machine Learning Dept., Carnegie Mellon University)
Motivation

• “Modern” data are very high dimensional
• In order to be “learnable,” there must be lower-dimensional structure
• Developing practical algorithms with theoretical guarantees for beating the curse of (apparent) dimensionality is a main scientific challenge for our field
Motivation

• Sparsity is emerging as a key concept in statistics and machine learning
• Dramatic progress in recent years on understanding sparsity in parametric settings
• Nonparametric sparsity: wide open
Outline

• High dimensional learning: parametric and nonparametric
• Rodeo: greedy, sparse nonparametric regression
• Extensions of the Rodeo
Parametric Case: Variable Selection in Linear Models

$$Y = \sum_{j=1}^{d} \beta_j X_j + \epsilon = X^T \beta + \epsilon$$

where d might be larger than n. Predictive risk: $R = E(Y_{\text{new}} - X_{\text{new}}^T \beta)^2$. Want to choose a subset $(X_j : j \in S)$, $S \subset \{1, \ldots, d\}$, to make R small.

Bias-variance tradeoff:
• small S ⟹ Bias ↑, Variance ↓
• large S ⟹ Bias ↓, Variance ↑
Lasso/Basis Pursuit (Chen & Donoho, 1994; Tibshirani, 1996)

Constraint: $\sum_{j=1}^{d} |\beta_j| \le t$. [Figure: the ℓ1 constraint set intersected with level sets of the squared error.]

For orthogonal designs, the solution is given by soft thresholding:
$$\hat\beta_j = \mathrm{sign}(\beta_j)\,(|\beta_j| - \lambda)_+$$
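As a quick illustration of the soft-thresholding formula, here is a minimal Python sketch (the helper name and the simulated data are illustrative, not from the slides): with an orthonormal design, the lasso fit is the least-squares coefficient shrunk toward zero and small coefficients are set exactly to zero.

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft-thresholding operator: sign(b) * (|b| - lam)_+"""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Example with an orthonormal design: lasso = soft-thresholded least squares
rng = np.random.default_rng(0)
n, d = 100, 5
X, _ = np.linalg.qr(rng.normal(size=(n, d)))      # orthonormal columns
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])  # sparse truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_ls = X.T @ y                                 # least squares (since X'X = I)
beta_lasso = soft_threshold(beta_ls, lam=0.5)
print(beta_lasso)                                 # zeros where the truth is zero
```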
Convex Relaxations for Sparse Signal Recovery

Desired problem: $\min \|\beta\|_0$ such that $\|X\beta - y\|_2 \le \epsilon$. Requires intractable combinatorial optimization.

Convex optimization surrogate: $\min \|\beta\|_1$ such that $\|X\beta - y\|_2 \le \epsilon$.

Substantial progress recently on theoretical justification (Candès and Tao, Donoho, Tropp, Meinshausen and Bühlmann, Wainwright, Zhao and Yu, Fan and Peng, ...)
Nonparametric Regression

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $X_i = (X_{1i}, \ldots, X_{di})^T \in \mathbb{R}^d$, $Y_i \in \mathbb{R}$,
$$Y_i = m(X_{1i}, \ldots, X_{di}) + \epsilon_i, \quad E(\epsilon_i) = 0$$

Risk: $R(m, \hat m) = E \int (\hat m(x) - m(x))^2 \, dx$

Minimax theorem:
$$\inf_{\hat m} \sup_{m \in \mathcal{F}} R(m, \hat m) \asymp \left(\frac{1}{n}\right)^{4/(4+d)}$$
where $\mathcal{F}$ is a class of functions with 2 smooth derivatives. Note the curse of dimensionality.
The Curse of Dimensionality (Sobolev space of order 2)

[Figure: left panel, risk versus sample size for d = 20; right panel, sample size required for Risk = 0.01 versus dimension, growing to roughly 10^12 by d = 20.]
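A rough back-of-the-envelope calculation, ignoring constants in the minimax rate above, makes the scale in the figure plausible:

```latex
R \asymp n^{-4/(4+d)}
\quad\Longrightarrow\quad
n \asymp R^{-(4+d)/4}
= (0.01)^{-24/4} = 100^{6} = 10^{12}
\quad\text{for } d = 20,\ R = 0.01 .
```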
Nonparametric Sparsity

• In many applications, reasonable to expect the true function depends only on a small number of variables
• Assume $m(x) = m(x_R)$ where $x_R = (x_j)_{j \in R}$ are the relevant variables, with $|R| = r \ll d$
• Can hope to achieve the better minimax rate $n^{-4/(4+r)}$
• Challenge: variable selection in nonparametric regression
Rodeo: Regularization of derivative expectation operator

• A general strategy for nonparametric estimation: regularize derivatives of the estimator with respect to smoothing parameters
• A simple new algorithm for simultaneous bandwidth and variable selection in nonparametric regression
• Theoretical analysis: the algorithm correctly determines the relevant variables, with high probability, and achieves the (near) optimal minimax rate of convergence
• Examples showing performance consistent with theory
Key Idea in Rodeo: Change of Representation

$$F(h) = F(0) + \int_0^h F'(x)\, dx$$
Rodeo: The Main Idea

• Use a nonparametric estimator based on a kernel
• Start with large bandwidths in each dimension, for an estimate having small variance but high bias
  – Choosing a large bandwidth is like ignoring a variable
• Compute the derivatives of the estimate with respect to the bandwidths
• Threshold the derivatives to get a sparse estimate
• Intuition: if a variable is irrelevant, then changing the bandwidth in that dimension should only result in a small change in the estimator
Rodeo: The Main Idea

[Figure: in the $(h_1, h_2)$ bandwidth plane, the Rodeo path starts at large bandwidths and tracks the ideal path toward the optimal bandwidth.]
Using Local Linear Smoothing

The estimator can be written as
$$\hat m_h(x) = \sum_{i=1}^{n} G(X_i, x, h)\, Y_i$$

Our method is based on the statistic
$$Z_j = \frac{\partial \hat m_h(x)}{\partial h_j} = \sum_{i=1}^{n} G_j(X_i, x, h)\, Y_i$$

The estimated variance is
$$s_j^2 = \mathrm{Var}(Z_j \mid X_1, \ldots, X_n) = \sigma^2 \sum_{i=1}^{n} G_j^2(X_i, x, h)$$
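For concreteness, the effective weights $G(X_i, x, h)$ of a local linear smoother are one row of the local hat matrix. The sketch below assumes a Gaussian product kernel and, for brevity, approximates the bandwidth derivative $G_j$ by finite differences rather than the closed-form derivative; the function names and the known noise variance sigma2 are illustrative assumptions.

```python
import numpy as np

def local_linear_weights(X, x, h):
    """Effective kernel at query x: m_hat(x) = weights @ Y for local linear regression."""
    n, d = X.shape
    Xc = X - x                                  # centered covariates
    Xd = np.hstack([np.ones((n, 1)), Xc])       # design [1, (X_i - x)]
    w = np.exp(-0.5 * np.sum((Xc / h) ** 2, axis=1))
    A = Xd.T * w                                # Xd' W
    B = np.linalg.solve(A @ Xd, A)              # (Xd' W Xd)^{-1} Xd' W
    return B[0]                                 # intercept row = G(X_i, x, h)

def Z_and_s(X, Y, x, h, j, sigma2, eps=1e-5):
    """Finite-difference stand-in for Z_j = d m_hat(x) / d h_j and its std s_j."""
    h_plus, h_minus = h.copy(), h.copy()
    h_plus[j] += eps
    h_minus[j] -= eps
    Gj = (local_linear_weights(X, x, h_plus) - local_linear_weights(X, x, h_minus)) / (2 * eps)
    return Gj @ Y, np.sqrt(sigma2 * np.sum(Gj ** 2))
```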
Rodeo: Hard Thresholding Version

1. Select a parameter $0 < \beta < 1$ and an initial bandwidth $h_0$.
2. Initialize the bandwidths and activate all covariates:
   (a) $h_j = h_0$, $j = 1, 2, \ldots, d$.
   (b) $A = \{1, 2, \ldots, d\}$.
3. While $A$ is nonempty, do for each $j \in A$:
   (a) Compute the estimated derivative expectation: $Z_j$ and $s_j$.
   (b) Compute the threshold $\lambda_j = s_j \sqrt{2 \log n}$.
   (c) If $|Z_j| > \lambda_j$, set $h_j \leftarrow \beta h_j$; otherwise remove $j$ from $A$.
4. Output bandwidths $h^\star = (h_1^\star, \ldots, h_d^\star)$ and estimator $\tilde m(x) = \hat m_{h^\star}(x)$.
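A minimal self-contained sketch of this hard-thresholding loop at a single query point. To keep it short it uses a Nadaraya–Watson (local constant) smoother with a Gaussian product kernel in place of the local linear smoother from the previous slide, assumes the noise variance sigma2 is known, and adds a max_steps guard; the function name and these choices are illustrative, not from the slides.

```python
import numpy as np

def rodeo_hard_threshold(X, Y, x, sigma2, h0=1.0, beta=0.9, max_steps=100):
    """Hard-thresholding Rodeo at query point x (Nadaraya-Watson sketch)."""
    n, d = X.shape
    h = np.full(d, float(h0))
    active = set(range(d))
    lam_factor = np.sqrt(2.0 * np.log(n))

    for _ in range(max_steps):
        if not active:
            break
        U = (X - x) / h                          # scaled differences, shape (n, d)
        K = np.exp(-0.5 * np.sum(U ** 2, axis=1))
        W = K / K.sum()                          # effective weights G(X_i, x, h)

        for j in list(active):
            dlogK = (X[:, j] - x[j]) ** 2 / h[j] ** 3   # d/dh_j of log K_i
            Gj = W * (dlogK - W @ dlogK)         # d W_i / d h_j in closed form
            Zj = Gj @ Y                          # derivative of the fit w.r.t. h_j
            sj = np.sqrt(sigma2 * np.sum(Gj ** 2))
            if abs(Zj) > sj * lam_factor:
                h[j] *= beta                     # shrink: variable looks relevant
            else:
                active.discard(j)                # freeze: variable looks irrelevant

    U = (X - x) / h
    K = np.exp(-0.5 * np.sum(U ** 2, axis=1))
    return h, K @ Y / K.sum()                    # selected bandwidths, final estimate

# Example in the spirit of the next slide: only the first two coordinates matter
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 10))
Y = 2 * (X[:, 0] + 1) ** 3 + 2 * np.sin(10 * X[:, 1]) + 0.5 * rng.normal(size=500)
h, m_hat = rodeo_hard_threshold(X, Y, x=np.full(10, 0.5), sigma2=0.25)
```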
Example: $m(x) = 2(x_1 + 1)^3 + 2\sin(10 x_2)$, $d = 20$

[Figure: bandwidth versus Rodeo step, averaged over 50 runs (left) and for a typical run (right); the bandwidths for variables 1 and 2 shrink while the others remain near the initial value.]
Loss with r = 2, Increasing Dimension

[Figure: loss as the dimension increases from 5 to 30; left panel, leave-one-out cross-validation; right panel, Rodeo.]
Main Result: Near Optimal Rates

Theorem. Suppose that $d = O(\log n / \log\log n)$, $h_0 = 1/\log\log n$, and $|m_{jj}(x)| > 0$. Then the Rodeo outputs bandwidths $h^\star$ that satisfy
$$P\big(h_j^\star = h_0 \text{ for all } j > r\big) \to 1$$
and, for every $\epsilon > 0$,
$$P\big(n^{-1/(4+r)-\epsilon} \le h_j^\star \le n^{-1/(4+r)+\epsilon} \text{ for all } j \le r\big) \to 1.$$

Let $T_n$ be the stopping time of the algorithm. Then $P(t_L \le T_n \le t_U) \to 1$ where
$$t_L = \frac{1}{(r+4)\log(1/\beta)} \log\!\left(\frac{n A_{\min}}{\log\log n\,(\log\log n)^{d}}\right), \qquad
t_U = \frac{1}{(r+4)\log(1/\beta)} \log\!\left(\frac{n A_{\max}}{\log\log n\,(\log\log n)^{d}}\right).$$
Greedy Rodeo and LARS

• The Rodeo can be viewed as a nonparametric version of least angle regression (LARS) (Efron et al., 2004)
• In forward stagewise, variable selection is incremental; LARS adds the variable most correlated with the residuals of the current fit
• For the Rodeo, the derivative is essentially the correlation between the output and the derivative of the effective kernel
• Reducing the bandwidth is like adding more of that variable
LARS Regularization Paths

[Figure: standardized coefficients for each of the 10 variables plotted against |beta|/max|beta|.]
Greedy Rodeo Bandwidth Paths

[Figure: bandwidth for each variable plotted against the greedy Rodeo step.]

Rodeo order: 3 (body mass index), 9 (serum), 7 (serum), 4 (blood pressure), 1 (age), 2 (sex), 8 (serum), 5 (serum), 10 (serum), 6 (serum).
LARS order: 3, 9, 4, 7, 2, 10, 5, 8, 6, 1.
Extensions

• Sparse density estimation
• Local polynomial estimation
• Classification using the Rodeo with generalized linear models
• Other nonparametric estimators
• Data-adaptive basis pursuit
Combining Rodeo and Lasso: Data-Adaptive Basis Pursuit (with Han Liu)

[Figure: left panel, the true regression line and data; right panel, the fitted curve from the data-adaptive basis with J = 36.]
Data-Adaptive Basis Pursuit

• Recall the idea of the Rodeo:
$$\tilde m(x) = \hat m_1(x) - \int_0^1 \big\langle Z(x, h(s)),\, \dot h(s) \big\rangle\, ds$$
• Let $\Phi(X_i) = \mathrm{vec}\big(Z(X_i, h(s_k)) \cdot dh(s_k)\big)$ over a grid of bandwidths
• Run the Lasso: $\min_\beta \|Y - \Phi(X)\beta\|^2$ such that $\|\beta\|_1 \le t$
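A rough sketch of this pipeline, assuming scikit-learn is available: the columns of Φ are discretized increments of a Gaussian-kernel smoother over a bandwidth grid (a simplified stand-in for the $Z(X_i, h(s_k)) \cdot dh(s_k)$ terms above), and a lasso is fit on them. All names, the one-dimensional example, and the parameter choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def kernel_fits(X, Y, h):
    """Nadaraya-Watson fits at every data point for a single bandwidth h."""
    D2 = (X[:, None, :] - X[None, :, :]) ** 2      # (n, n, d) squared differences
    K = np.exp(-0.5 * D2.sum(axis=2) / h ** 2)
    return K @ Y / K.sum(axis=1)

def adaptive_basis(X, Y, bandwidths):
    """Columns = differences of fits at successive bandwidths (coarse to fine)."""
    fits = np.column_stack([kernel_fits(X, Y, h) for h in bandwidths])
    return np.diff(fits, axis=1)                   # discretized Z * dh

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 1))
Y = np.sin(8 * X[:, 0]) + 0.1 * rng.normal(size=200)

hs = np.geomspace(1.0, 0.02, num=40)               # bandwidth grid, large to small
Phi = adaptive_basis(X, Y, hs)
model = Lasso(alpha=0.01).fit(Phi, Y)              # sparse combination of basis columns
Y_hat = model.predict(Phi)
```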
Data-Adaptive Basis Pursuit

[Figure: six of the data-adaptive basis functions (bases 6, 27, 15, 9, 30, 18) plotted against x.]
Summary

• Sparsity is playing an increasingly important role in statistics and machine learning
• In order to be “learnable,” there must be lower-dimensional structure
• Nonparametric sparsity: many open problems
• Rodeo: conceptually simple and practical, with nice theoretical properties