

  1. Variations on Nonparametric Additive Models: Computational and Statistical Aspects John Lafferty Department of Statistics & Department of Computer Science University of Chicago

  2. Collaborators Sivaraman Balakrishnan (CMU), Mathias Drton (Chicago), Rina Foygel (Chicago), Michael Horrell (Chicago), Han Liu (Princeton), Kriti Puniyani (CMU), Pradeep Ravikumar (Univ. Texas, Austin), Larry Wasserman (CMU)

  3. Perspective Even the simplest models can be interesting, challenging, and useful for large, high-dimensional data.

  4. Motivation Great progress has been made on understanding sparsity for high-dimensional linear models. Many problems have clear nonlinear structure. We are interested in purely functional methods for high-dimensional, nonparametric inference • no basis expansions

  5. Additive Models [Figure: estimated additive component functions for four covariates; panels labeled Age, Bmi, Map, Tc]

  6. Additive Models Fully nonparametric methods appear hopeless • Logarithmic scaling, p = log n (e.g., “Rodeo”, Lafferty and Wasserman (2008)) Additive models are a useful compromise • Exponential scaling, p = exp(n^c) (e.g., “SpAM”, Ravikumar et al. (2009))

  7. Themes of this talk • Variations on additive models enjoy most of the good statistical and computational properties of sparse linear models • Thresholded backfitting algorithms, via subdifferential calculus • RKHS formulations are problematic • A little nonparametricity goes a long way

  8. Outline • Sparse additive models • Nonparametric reduced rank regression • Functional sparse CCA • The nonparanormal • Conclusions

  9. Sparse Additive Models Ravikumar, Lafferty, Liu and Wasserman, JRSS B (2009)
  Additive model: Y_i = Σ_{j=1}^p m_j(X_ij) + ε_i,  i = 1, ..., n
  High dimensional: n ≪ p, with most m_j = 0.
  Optimization: minimize E( Y − Σ_j m_j(X_j) )^2 subject to Σ_{j=1}^p √E(m_j^2) ≤ L and E(m_j) = 0
  Related work by Bühlmann and van de Geer (2009), Koltchinskii and Yuan (2010), Raskutti, Wainwright and Yu (2011)
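To fix ideas, here is a minimal sketch (not from the paper) of the data-generating setup above: a few made-up component functions on the first coordinates and exactly zero components elsewhere. The function name simulate_spam and all defaults are hypothetical.

```python
import numpy as np

def simulate_spam(n=200, p=50, sigma=0.5, seed=0):
    """Draw (X, Y) with Y_i = sum_j m_j(X_ij) + eps_i and most m_j identically zero."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    # Hypothetical nonzero components on the first three coordinates,
    # each centered so that E(m_j(X_j)) = 0 under X_j ~ Uniform(-1, 1).
    m1 = lambda x: np.sin(np.pi * x)
    m2 = lambda x: x ** 2 - 1.0 / 3.0
    m3 = lambda x: np.tanh(2.0 * x)
    Y = m1(X[:, 0]) + m2(X[:, 1]) + m3(X[:, 2]) + sigma * rng.standard_normal(n)
    return X, Y
```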

  10. Sparse Additive Models
  C = { m ∈ R^4 : √( m_11^2 + m_12^2 ) + √( m_21^2 + m_22^2 ) ≤ L }
  [Figure: the projections π_12 C and π_13 C of the constraint set C]

  11. Stationary Conditions
  Lagrangian: L(m, λ) = (1/2) E( Y − Σ_{j=1}^p m_j(X_j) )^2 + λ Σ_{j=1}^p √E(m_j^2(X_j))
  Let R_j = Y − Σ_{k≠j} m_k(X_k) be the j-th residual. Stationary condition:
  m_j − E(R_j | X_j) + λ v_j = 0  a.e.,  where v_j ∈ ∂√E(m_j^2) satisfies
  v_j = m_j / √E(m_j^2) if E(m_j^2) ≠ 0, and √E(v_j^2) ≤ 1 otherwise.

  12. Stationary Conditions
  Rewriting, m_j + λ v_j = E(R_j | X_j) ≡ P_j, so
  ( 1 + λ / √E(m_j^2) ) m_j = P_j if √E(P_j^2) > λ, and m_j = 0 otherwise.
  This implies m_j = [ 1 − λ / √E(P_j^2) ]_+ P_j.

  13. SpAM Backfitting Algorithm
  Input: Data (X_i, Y_i), regularization parameter λ.
  Iterate until convergence: for each j = 1, ..., p:
  • Compute residual: R_j = Y − Σ_{k≠j} m̂_k(X_k)
  • Estimate the projection P_j = E(R_j | X_j) by smoothing: P̂_j = S_j R_j
  • Estimate the norm: ŝ_j = √(Ê[P̂_j^2])
  • Soft-threshold: m̂_j ← [ 1 − λ / ŝ_j ]_+ P̂_j
  Output: Estimator m̂(X_i) = Σ_j m̂_j(X_ij).
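A hedged Python sketch of this backfitting loop is given below. It uses a Nadaraya–Watson smoother in place of the generic smoother S_j; the names nw_smoother and spam_backfit, and all tuning defaults, are mine rather than from the paper.

```python
import numpy as np

def nw_smoother(x, r, bandwidth=0.1):
    """Nadaraya-Watson smoother standing in for the generic smoother S_j."""
    d = (x[:, None] - x[None, :]) / bandwidth
    W = np.exp(-0.5 * d ** 2)
    W /= W.sum(axis=1, keepdims=True)
    return W @ r

def spam_backfit(X, Y, lam, n_iters=20, bandwidth=0.1):
    """Sketch of SpAM backfitting: smooth partial residuals, then soft-threshold norms."""
    n, p = X.shape
    M = np.zeros((n, p))          # column j holds the fit of m_j at the sample points
    Yc = Y - Y.mean()             # center the response
    for _ in range(n_iters):
        for j in range(p):
            R_j = Yc - (M.sum(axis=1) - M[:, j])            # residual R_j
            P_j = nw_smoother(X[:, j], R_j, bandwidth)      # estimate E(R_j | X_j)
            P_j -= P_j.mean()                               # enforce E(m_j) = 0
            s_j = np.sqrt(np.mean(P_j ** 2))                # estimated norm
            M[:, j] = max(0.0, 1.0 - lam / s_j) * P_j if s_j > 0 else 0.0
    return M
```

The sketch only evaluates the fitted m̂_j at the training points; out-of-sample prediction would require smoothing onto new design points.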

  14. Example: Boston Housing Data Predict house value Y from 10 covariates. We added 20 irrelevant (random) covariates to test the method. Y = house value; n = 506, p = 30.
  Y = β_0 + m_1(crime) + m_2(tax) + ··· + m_30(X_30) + ε.
  Note that m_11 = ··· = m_30 = 0. We choose λ by minimizing the estimated risk. SpAM yields 6 nonzero functions. It correctly reports that m̂_11 = ··· = m̂_30 = 0.

  15. Example Fits [Figure: estimated component functions from the Boston housing fit]

  16. L_2 norms of fitted functions versus 1/λ [Figure: regularization paths of the component norms]

  17. RKHS Version Raskutti, Wainwright and Yu (2011)
  Sample optimization:
  min_m (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^p m_j(x_ij) )^2 + λ Σ_j ‖m_j‖_{H_j} + µ Σ_j ‖m_j‖_{L_2(P_n)}
  where ‖m_j‖_{L_2(P_n)} = √( (1/n) Σ_{i=1}^n m_j^2(x_ij) ).
  By the representer theorem, with m_j(·) = K_j α_j,
  min_α (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^p (K_j α_j)_i )^2 + λ Σ_j √( α_j^T K_j α_j ) + µ Σ_j √( (1/n) α_j^T K_j^2 α_j )
  A finite-dimensional SOCP, but no scalable algorithms are known.

  18. Open Problems • Under what conditions do the backfitting algorithms converge? • What guarantees can be given on the solution to the infinite dimensional optimization? • Is it possible to simultaneously adapt to unknown smoothness and sparsity?

  19. Multivariate Regression Y ∈ R^q and X ∈ R^p. Regression function M(X) = E(Y | X). Linear model M(X) = BX where B ∈ R^{q×p}. Reduced rank regression: r = rank(B) ≤ C.
  Recent work has studied properties and high-dimensional scaling of reduced rank regression with the nuclear norm
  ‖B‖_* := Σ_{j=1}^{min(p,q)} σ_j(B)
  as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011).
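As a quick illustration (my own example, not from the slides), the nuclear norm is just the sum of singular values, which NumPy exposes directly:

```python
import numpy as np

rng = np.random.default_rng(0)
q, p, r = 5, 8, 2
B = rng.standard_normal((q, r)) @ rng.standard_normal((r, p))   # a rank-2 q x p matrix

sigma = np.linalg.svd(B, compute_uv=False)
nuclear_norm = sigma.sum()                  # ||B||_* = sum of singular values
print(np.linalg.matrix_rank(B),             # 2
      nuclear_norm,
      np.linalg.norm(B, ord='nuc'))         # same value via NumPy's built-in
```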

  20. Nonparametric Reduced Rank Regression Foygel, Horrell, Drton and Lafferty (2012)
  Nonparametric multivariate regression: M(X) = ( m^1(X), ..., m^q(X) )^T
  Each component is an additive model: m^k(X) = Σ_{j=1}^p m_j^k(X_j)
  What is the nonparametric analogue of the ‖B‖_* penalty?

  21. Recall: Sparse Vectors and ℓ_1 Relaxation [Figure: sparse vectors ‖X‖_0 ≤ t and their convex hull ‖X‖_1 ≤ t]

  22. Low-Rank Matrices and Convex Relaxation [Figure: low rank matrices rank(X) ≤ t and their convex hull ‖X‖_* ≤ t]

  23. Nuclear Norm Regularization Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems. To project a matrix B onto the nuclear norm ball ‖X‖_* ≤ t:
  • Compute the SVD: B = U diag(σ) V^T
  • Soft-threshold the singular values: B ← U diag(Soft_λ(σ)) V^T
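A minimal NumPy sketch of this singular-value soft-thresholding step (the function name is mine):

```python
import numpy as np

def soft_threshold_singular_values(B, lam):
    """Shrink each singular value of B: sigma_i -> max(sigma_i - lam, 0)."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - lam, 0.0)) @ Vt
```

Strictly, a fixed threshold λ gives the proximal operator of λ‖·‖_*; projecting exactly onto the ball ‖X‖_* ≤ t amounts to choosing the threshold so the shrunken singular values sum to t.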

  24. Low Rank Functions What does it mean for a set of functions m^1(x), ..., m^q(x) to be low rank? Let x_1, ..., x_n be a collection of points. We require that the n × q matrix M(x_{1:n}) = [ m^k(x_i) ] is low rank. Stochastic setting: M = [ m^k(X_i) ]. A natural penalty is
  ‖M‖_* = Σ_{s=1}^q σ_s(M) = Σ_{s=1}^q √( λ_s(M^T M) )
  Population version: |||M|||_* := ‖ Cov(M(X))^{1/2} ‖_* = ‖ Σ(M)^{1/2} ‖_*
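A small sketch (my own, under the convention that Cov(M(X)) is estimated by (1/n) McᵀMc after centering) of computing the sample version of this penalty:

```python
import numpy as np

def functional_nuclear_penalty(M_vals):
    """Empirical |||M|||_* from an n x q matrix of component-function values.

    With Cov(M(X)) estimated by (1/n) Mc^T Mc after centering, the nuclear norm
    of its square root equals (1/sqrt(n)) times the sum of singular values of Mc.
    """
    n = M_vals.shape[0]
    Mc = M_vals - M_vals.mean(axis=0)
    return np.linalg.svd(Mc, compute_uv=False).sum() / np.sqrt(n)
```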

  25. Constrained Rank Additive Models (CRAM) Let Σ_j = Cov(M_j). Two natural penalties:
  ‖Σ_1^{1/2}‖_* + ‖Σ_2^{1/2}‖_* + ··· + ‖Σ_p^{1/2}‖_*    and    ‖( Σ_1^{1/2}  Σ_2^{1/2}  ···  Σ_p^{1/2} )‖_*
  Population risk functional (first penalty):
  (1/2) E‖ Y − Σ_j M_j(X_j) ‖_2^2 + λ Σ_j |||M_j|||_*

  26. Stationary Conditions The subdifferential is
  ∂|||F|||_* = [ E(FF^T) ]^{−1/2} F + H,  where |||H|||_sp ≤ 1, E(FH^T) = 0, E(FF^T) H = 0
  Let P(X) := E(Y | X) and consider the optimization
  (1/2) E‖ Y − M(X) ‖_2^2 + λ |||M|||_*
  Let E(PP^T) = U diag(τ) U^T be the SVD. Define M = U diag([ 1 − λ/√τ ]_+) U^T P.
  Then M is a stationary point of the optimization, satisfying E(Y | X) = M(X) + λ V(X) a.e., for some V ∈ ∂|||M|||_*.

  27. CRAM Backfitting Algorithm (Penalty 1)
  Input: Data (X_i, Y_i), regularization parameter λ.
  Iterate until convergence: for each j = 1, ..., p:
  • Compute residual: R_j = Y − Σ_{k≠j} M̂_k(X_k)
  • Estimate the projection P_j = E(R_j | X_j) by smoothing: P̂_j = S_j R_j
  • Compute the SVD: (1/n) P̂_j P̂_j^T = U diag(τ) U^T
  • Soft-threshold: M̂_j = U diag([ 1 − λ/√τ ]_+) U^T P̂_j
  Output: Estimator M̂(X_i) = Σ_j M̂_j(X_ij).
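Below is a hedged sketch of this update for the first penalty; with samples stored as rows, the q × q matrix (1/n) P̂_jᵀ P̂_j plays the role of E(P_j P_jᵀ). The helper names, the choice of smoother, and the small floor inside the square root are mine, not from the paper.

```python
import numpy as np

def nw_smooth(x, r, bandwidth=0.1):
    """Nadaraya-Watson smoother playing the role of the generic smoother S_j."""
    d = (x[:, None] - x[None, :]) / bandwidth
    W = np.exp(-0.5 * d ** 2)
    W /= W.sum(axis=1, keepdims=True)
    return W @ r

def cram_backfit(X, Y, lam, n_iters=20, bandwidth=0.1):
    """Sketch of CRAM backfitting (first penalty): smooth the multivariate partial
    residual columnwise, then soft-threshold eigenvalues of the fitted covariance."""
    n, p = X.shape
    q = Y.shape[1]
    M = np.zeros((p, n, q))                 # M[j] holds the fit of M_j at the sample points
    Yc = Y - Y.mean(axis=0)
    for _ in range(n_iters):
        for j in range(p):
            R_j = Yc - (M.sum(axis=0) - M[j])                   # partial residual, n x q
            P_j = np.column_stack([nw_smooth(X[:, j], R_j[:, k], bandwidth)
                                   for k in range(q)])          # smooth each response
            tau, U = np.linalg.eigh(P_j.T @ P_j / n)            # eigendecomposition of (1/n) P_j^T P_j
            shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
            M[j] = P_j @ U @ np.diag(shrink) @ U.T              # soft-threshold step
    return M
```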

  28. Example Data of Smith et al. (1962): chemical measurements on 33 individual urine specimens. q = 5 response variables: pigment creatinine, and the concentrations (in mg/ml) of phosphate, phosphorus, creatinine and choline. p = 3 covariates: weight of subject, volume and specific gravity of specimen. We use Penalty 2 with local linear smoothing, λ = 1, and bandwidth h = 0.3.
