


1. Nonparametric Sparsity
John Lafferty (Computer Science Dept. and Machine Learning Dept., Carnegie Mellon University)
Larry Wasserman (Department of Statistics and Machine Learning Dept., Carnegie Mellon University)

2. Motivation
• “Modern” data are very high dimensional
• In order to be “learnable,” there must be lower-dimensional structure
• Developing practical algorithms with theoretical guarantees for beating the curse of (apparent) dimensionality is a main scientific challenge for our field

3. Motivation
• Sparsity is emerging as a key concept in statistics and machine learning
• Dramatic progress in recent years on understanding sparsity in parametric settings
• Nonparametric sparsity: wide open

4. Outline
• High dimensional learning: parametric and nonparametric
• Rodeo: greedy, sparse nonparametric regression
• Extensions of the Rodeo

5. Parametric Case: Variable Selection in Linear Models

$$Y = \sum_{j=1}^{d} \beta_j X_j + \epsilon = X^T \beta + \epsilon$$

where $d$ might be larger than $n$. Predictive risk: $R = E\,(Y_{\text{new}} - X_{\text{new}}^T \beta)^2$.

Want to choose a subset $(X_j : j \in S)$, $S \subset \{1, \ldots, d\}$, to make $R$ small. Bias-variance tradeoff:
small $S$ ⇒ Bias ↑, Variance ↓
large $S$ ⇒ Bias ↓, Variance ↑
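As a concrete illustration of the risk being traded off (not from the slides; the helper name and the train/validation split are hypothetical), the predictive risk of a candidate subset $S$ can be estimated by fitting least squares on one part of the data and evaluating squared error on a held-out part:

```python
import numpy as np

def heldout_risk(X, Y, subset, n_train):
    """Estimate the predictive risk R = E(Y_new - X_new^T beta_S)^2 for a
    candidate subset S: least squares on a training split, squared error on
    a held-out split (an illustrative sketch, not part of the slides)."""
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs[:n_train], Y[:n_train], rcond=None)
    resid = Y[n_train:] - Xs[n_train:] @ beta
    return np.mean(resid ** 2)
```

Scanning this estimate over subsets of increasing size traces out the bias-variance tradeoff described above.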

6. Lasso/Basis Pursuit (Chen & Donoho, 1994; Tibshirani, 1996)

Constrain $\sum_{j=1}^{d} |\beta_j| \le t$.

(Figure: level sets of the squared error together with the $\ell_1$ constraint region.)

For orthogonal designs, the solution is given by soft thresholding:

$$\hat\beta_j = \mathrm{sign}(\beta_j)\,(|\beta_j| - \lambda)_+$$
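A minimal sketch of the soft-thresholding operator in Python/NumPy (the function name is ours, not from the slides):

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft thresholding: sign(b) * (|b| - lam)_+ , applied elementwise.
    This is the lasso solution for orthogonal designs."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```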

7. Convex Relaxations for Sparse Signal Recovery

Desired problem: $\min \|\beta\|_0$ such that $\|X\beta - y\|_2 \le \epsilon$. This requires intractable combinatorial optimization.

Convex optimization surrogate: $\min \|\beta\|_1$ such that $\|X\beta - y\|_2 \le \epsilon$.

Substantial progress recently on theoretical justification (Candès and Tao, Donoho, Tropp, Meinshausen and Bühlmann, Wainwright, Zhao and Yu, Fan and Peng, ...)

8. Nonparametric Regression

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $X_i = (X_{1i}, \ldots, X_{di})^T \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$,

$$Y_i = m(X_{1i}, \ldots, X_{di}) + \epsilon_i, \qquad E(\epsilon_i) = 0$$

Risk: $R(m, \hat m) = E \int (\hat m(x) - m(x))^2 \, dx$

Minimax theorem:

$$\inf_{\hat m} \sup_{m \in \mathcal{F}} R(m, \hat m) \asymp \left(\frac{1}{n}\right)^{4/(4+d)}$$

where $\mathcal{F}$ is the class of functions with 2 smooth derivatives. Note the curse of dimensionality.

9. The Curse of Dimensionality (Sobolev space of order 2)

(Figure: left panel plots risk against sample size for d = 20; right panel plots the sample size required to reach risk 0.01 against dimension, reaching roughly $10^{12}$ at d = 20.)
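A quick check of the figure using the minimax rate from the previous slide: setting the rate equal to the target risk and solving for the sample size gives, for $d = 20$,

$$n^{-4/(4+d)} = 0.01 \;\Longrightarrow\; n = 0.01^{-(4+d)/4} = 0.01^{-6} = 10^{12},$$

which matches the roughly $10^{12}$ observations indicated in the plot.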

10. Nonparametric Sparsity
• In many applications, reasonable to expect the true function depends only on a small number of variables
• Assume $m(x) = m(x_R)$ where $x_R = (x_j)_{j \in R}$ are the relevant variables, with $|R| = r \ll d$
• Can hope to achieve the better minimax rate $n^{-4/(4+r)}$
• Challenge: variable selection in nonparametric regression

11. Rodeo: Regularization of Derivative Expectation Operator
• A general strategy for nonparametric estimation: regularize derivatives of the estimator with respect to smoothing parameters
• A simple new algorithm for simultaneous bandwidth and variable selection in nonparametric regression
• Theoretical analysis: the algorithm correctly determines the relevant variables, with high probability, and achieves the (near) optimal minimax rate of convergence
• Examples showing performance consistent with theory

12. Key Idea in Rodeo: Change of Representation

$$F(h) = F(0) + \int_0^h F'(x)\, dx$$

(Taking $F$ to be the estimator viewed as a function of the bandwidth, this identity expresses the estimate at one bandwidth as the estimate at another plus an integral of bandwidth derivatives; those derivatives are the statistics $Z_j$ introduced on slide 15.)

13. Rodeo: The Main Idea
• Use a nonparametric estimator based on a kernel
• Start with large bandwidths in each dimension, for an estimate having small variance but high bias (choosing a large bandwidth is like ignoring a variable)
• Compute the derivatives of the estimate with respect to the bandwidths
• Threshold the derivatives to get a sparse estimate
• Intuition: if a variable is irrelevant, then changing the bandwidth in that dimension should result in only a small change in the estimator

14. Rodeo: The Main Idea

(Figure: bandwidth paths in the $(h_1, h_2)$ plane, showing the starting point, the Rodeo path, the ideal path, and the optimal bandwidth.)

15. Using Local Linear Smoothing

The estimator can be written as

$$\hat m_h(x) = \sum_{i=1}^n G(X_i, x, h)\, Y_i$$

Our method is based on the statistic

$$Z_j = \frac{\partial \hat m_h(x)}{\partial h_j} = \sum_{i=1}^n G_j(X_i, x, h)\, Y_i$$

The estimated variance is

$$s_j^2 = \mathrm{Var}(Z_j \mid X_1, \ldots, X_n) = \sigma^2 \sum_{i=1}^n G_j^2(X_i, x, h)$$
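As a concrete sketch of the effective weights $G(X_i, x, h)$, here is a local linear smoother with a product Gaussian kernel; the kernel choice and the function name are assumptions, since the slides only specify a kernel-based local linear estimator:

```python
import numpy as np

def local_linear_weights(X, x, h):
    """Effective weights G(., x, h) of the local linear smoother at x,
    so that m_hat(x) = weights @ Y.  Uses a product Gaussian kernel with
    per-coordinate bandwidths h (an assumed kernel choice)."""
    n, d = X.shape
    Z = X - x                                            # centered covariates
    K = np.exp(-0.5 * np.sum((Z / h) ** 2, axis=1))      # product Gaussian kernel
    D = np.hstack([np.ones((n, 1)), Z])                  # local linear design matrix
    WD = K[:, None] * D
    A = D.T @ WD                                         # (d+1) x (d+1) weighted Gram matrix
    # weights = e_1^T (D^T W D)^{-1} D^T W, the first row of the solve below
    return np.linalg.solve(A, WD.T)[0]
```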

16. Rodeo: Hard Thresholding Version

1. Select a parameter $0 < \beta < 1$ and an initial bandwidth $h_0$.
2. Initialize the bandwidths and activate all covariates:
   (a) $h_j = h_0$, $j = 1, 2, \ldots, d$.
   (b) $A = \{1, 2, \ldots, d\}$
3. While $A$ is nonempty, do for each $j \in A$:
   (a) Compute the estimated derivative expectation $Z_j$ and $s_j$.
   (b) Compute the threshold $\lambda_j = s_j \sqrt{2 \log n}$.
   (c) If $|Z_j| > \lambda_j$, set $h_j \leftarrow \beta h_j$; otherwise remove $j$ from $A$.
4. Output the bandwidths $h^\star = (h_1, \ldots, h_d)$ and the estimator $\tilde m(x) = \hat m_{h^\star}(x)$.
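A minimal Python sketch of this loop (not the authors' implementation). It reuses `local_linear_weights` from the previous sketch, approximates $Z_j$ by a finite difference in $h_j$ rather than the exact analytic derivative, and falls back to a crude noise estimate when $\sigma$ is not supplied:

```python
import numpy as np

def rodeo_hard(X, Y, x, h0, beta=0.9, sigma=None, eps=1e-4, max_passes=50):
    """Hard-thresholding Rodeo at a single query point x (a sketch under the
    stated simplifications, with a pass cap added as a practical safeguard)."""
    n, d = X.shape
    h = np.full(d, float(h0))
    active = set(range(d))
    if sigma is None:
        sigma = np.std(Y)                          # rough noise-level stand-in (assumption)
    for _ in range(max_passes):
        if not active:
            break
        h_now = h.copy()
        w = local_linear_weights(X, x, h_now)      # effective weights at current bandwidths
        for j in list(active):
            h_pert = h_now.copy()
            h_pert[j] *= 1.0 + eps
            Gj = (local_linear_weights(X, x, h_pert) - w) / (h_now[j] * eps)
            Zj = Gj @ Y                            # derivative of the fit w.r.t. h_j
            sj = sigma * np.sqrt(np.sum(Gj ** 2))  # estimated standard deviation of Z_j
            if abs(Zj) > sj * np.sqrt(2.0 * np.log(n)):
                h[j] *= beta                       # evidence of relevance: shrink h_j
            else:
                active.remove(j)                   # freeze h_j and drop j
    return h, local_linear_weights(X, x, h) @ Y
```

For local linear smoothing the derivative weights $G_j$ are available in closed form, so an actual implementation would use those rather than the finite-difference approximation above.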

17. Example: $m(x) = 2(x_1 + 1)^3 + 2\sin(10 x_2)$, $d = 20$

(Figure: bandwidths plotted against Rodeo step, averaged over 50 runs and for a typical run; only the bandwidths of the two relevant variables, $x_1$ and $x_2$, are driven down, while the other 18 stay near the initial value.)

18. Loss with r = 2, Increasing Dimension

(Figure: loss plotted against dimension, d = 5 to 30, with r = 2 relevant variables; left panel uses leave-one-out cross-validation, right panel uses the Rodeo.)

19. Main Result: Near Optimal Rates

Theorem. Suppose that $d = O(\log n / \log\log n)$, $h_0 = 1/\log\log n$, and $|m_{jj}(x)| > 0$. Then the Rodeo outputs bandwidths $h^\star$ that satisfy

$$P\left(h^\star_j = h_0 \text{ for all } j > r\right) \to 1$$

and for every $\epsilon > 0$,

$$P\left(n^{-1/(4+r) - \epsilon} \le h^\star_j \le n^{-1/(4+r) + \epsilon} \text{ for all } j \le r\right) \to 1.$$

Let $T_n$ be the stopping time of the algorithm. Then $P(t_L \le T_n \le t_U) \to 1$ where

$$t_L = \frac{1}{(r+4)\log(1/\beta)}\,\log\!\left(\frac{n A_{\min}}{\log\log n\,(\log\log n)^{d}}\right), \qquad
t_U = \frac{1}{(r+4)\log(1/\beta)}\,\log\!\left(\frac{n A_{\max}}{\log\log n\,(\log\log n)^{d}}\right)$$

20. Greedy Rodeo and LARS
• The Rodeo can be viewed as a nonparametric version of least angle regression (LARS) (Efron et al., 2004)
• In forward stagewise, variable selection is incremental; LARS adds the variable most correlated with the residuals of the current fit
• For the Rodeo, the derivative is essentially the correlation between the output and the derivative of the effective kernel
• Reducing the bandwidth is like adding more of that variable

21. LARS Regularization Paths

(Figure: standardized coefficients plotted against $|\beta|/\max|\beta|$, showing the LARS regularization paths for the ten covariates.)

22. Greedy Rodeo Bandwidth Paths

(Figure: bandwidths plotted against greedy Rodeo step, showing the bandwidth paths for the ten covariates.)

Rodeo order: 3 (body mass index), 9 (serum), 7 (serum), 4 (blood pressure), 1 (age), 2 (sex), 8 (serum), 5 (serum), 10 (serum), 6 (serum).
LARS order: 3, 9, 4, 7, 2, 10, 5, 8, 6, 1.

23. Extensions
• Sparse density estimation
• Local polynomial estimation
• Classification using the Rodeo with generalized linear models
• Other nonparametric estimators
• Data-adaptive basis pursuit

24. Combining Rodeo and Lasso: Data-Adaptive Basis Pursuit (with Han Liu)

(Figure: left panel shows the true regression line with the data; right panel shows the fitted curve from the data-adaptive basis with J = 36 basis functions.)

25. Data-Adaptive Basis Pursuit
• Recall the idea of the Rodeo:

$$\tilde m(x) = \hat m_1(x) - \int_0^1 \left\langle Z(x, h(s)),\, \dot h(s) \right\rangle ds$$

• Let $\Phi(X_i) = \mathrm{vec}\left( Z(X_i, h(s_k)) \cdot dh(s_k) \right)$ over a grid of bandwidths
• Run the Lasso:

$$\min_\beta \; \|Y - \Phi(X)\beta\|^2 \quad \text{such that} \quad \|\beta\|_1 \le t$$
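A rough sketch of how such a basis could be assembled and fed to a Lasso solver, for a one-dimensional x as in the figures. The Nadaraya-Watson smoother, the Gaussian kernel, the finite-difference derivative, and scikit-learn's penalized Lasso are all assumptions; the slides do not specify these details:

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_basis(X_train, Y_train, X_eval, bandwidths, eps=1e-4):
    """Hypothetical data-adaptive basis Phi: column k holds Z(x, h_k) * dh_k,
    the bandwidth derivative of a kernel smoother at each evaluation point,
    over a grid of bandwidths h_1, ..., h_K."""
    def smooth(x, h):
        # Nadaraya-Watson smoother with a Gaussian kernel (assumed estimator)
        K = np.exp(-0.5 * ((x[:, None] - X_train[None, :]) / h) ** 2)
        return (K @ Y_train) / K.sum(axis=1)

    dh = np.gradient(bandwidths)         # spacing of the bandwidth grid
    cols = []
    for h, d in zip(bandwidths, dh):
        Z = (smooth(X_eval, h * (1 + eps)) - smooth(X_eval, h)) / (h * eps)
        cols.append(Z * d)
    return np.column_stack(cols)

# usage sketch: fit the penalized Lasso on the adaptive basis (equivalent to
# the constrained form ||beta||_1 <= t for a matching alpha)
# Phi = adaptive_basis(x, y, x, np.geomspace(0.02, 1.0, 36))
# coef = Lasso(alpha=1e-3).fit(Phi, y).coef_
```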

26. Data-Adaptive Basis Pursuit

(Figure: six of the data-adaptive basis functions (bases 6, 27, 15, 9, 30, and 18), each plotted against x.)

27. Summary
• Sparsity is playing an increasingly important role in statistics and machine learning
• In order to be “learnable,” there must be lower-dimensional structure
• Nonparametric sparsity: many open problems
• Rodeo: conceptually simple and practical, with nice theoretical properties
