  1. Nonparametric Variable Selection via Sufficient Dimension Reduction. Lexin Li. Workshop on Current Trends and Challenges in Model Selection and Related Areas, Vienna, Austria, July 24, 2008

  2. Outline
     • Introduction to model free variable selection
     • Introduction to sufficient dimension reduction (SDR)
     • Regularized SDR for variable selection
     • Simulation study and real data analysis
     • Concluding remarks and discussions
     Joint work with Dr. Bondell at NCSU

  3. Introduction to Model Free Variable Selection
     Model based variable selection:
     • most existing variable selection approaches are model based, i.e., we assume the underlying true model is known up to a finite dimensional parameter, or that the imposed working model is usefully close to the true model
     Potential limitations:
     • the true model is unknown, and model formulation can be complex
     • assessing the goodness of model fit can be difficult when intertwined with model building and selection

  4. Introduction to Model Free Variable Selection
     Model free variable selection: selection that does not require any traditional model
     A "disclaimer": model free variable selection is intended for the exploratory stage of the analysis, and can later be refined by model based variable selection approaches
     Sufficient dimension reduction: model free variable selection is achieved through the framework of SDR (Li, 1991, 2000; Cook, 1998)

  5. Introduction to SDR
     General framework of SDR:
     • study the conditional distribution of Y ∈ R^r given X ∈ R^p
     • find a p × d matrix η = (η_1, ..., η_d), d ≤ p, such that the conditional distribution of Y | X is the same as that of Y | η^T X, i.e., Y ⊥ X | η^T X
     • replace X with η^T X without loss of information on the regression Y | X
     Key concept: the central subspace S_{Y|X}
     • Y ⊥ X | η^T X ⇒ Span(η) is a dimension reduction subspace (DRS); S_{Y|X} = ∩ S_DRS, the intersection of all DRSs
     • S_{Y|X} is a parsimonious population parameter that captures all regression information of Y | X; it is the main object of interest in SDR

  6. Introduction to SDR
     Known regression models:
     • single / multi-index model: Y = f_1(η_1^T X) + ... + f_d(η_d^T X) + ε
     • heteroscedastic model: Y = f(η_1^T X) + g(η_2^T X) ε
     • logit model: log[Pr(Y = 1 | X) / {1 − Pr(Y = 1 | X)}] = f(η_1^T X)
     Existing SDR estimation methods:
     • sliced inverse regression (SIR), sliced average variance estimation (SAVE), principal Hessian directions (PHD), ...
     • inverse regression estimation (IRE), covariance inverse regression estimation (CIRE), ...

  7. A Simple Motivating Example
     Consider a response model: Y = exp(−0.5 η_1^T X) + 0.5 ε
     • all predictors X and the error ε are independent standard normal
     • S_{Y|X} = Span(η_1), where η_1 = (1, −1, 0, ..., 0)^T / √2
     • CIRE estimate (n = 100, p = 6): (0.659, −0.734, −0.128, 0.097, −0.015, 0.030)^T
     Observations:
     • the estimate is a linear combination of all the predictors
     • interpretation can be difficult; no variable selection is performed
     Goal: achieve variable selection by obtaining a sparse SDR estimate
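
A minimal simulation sketch of this example (not from the slides): it generates data from the model above and applies plain SIR as a simple stand-in for CIRE, since CIRE also needs the asymptotic covariance Γ. The slice count H and the seed are arbitrary choices; the point is that the estimated direction typically loads on all six predictors.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p, H = 100, 6, 5                      # sample size, predictors, number of slices (H is my choice)

# Model from the slide: Y = exp(-0.5 * eta1'X) + 0.5 * eps, eta1 = (1, -1, 0, ..., 0)' / sqrt(2)
eta1 = np.zeros(p)
eta1[0], eta1[1] = 1.0, -1.0
eta1 /= np.sqrt(2.0)
X = rng.standard_normal((n, p))
Y = np.exp(-0.5 * X @ eta1) + 0.5 * rng.standard_normal(n)

# Plain SIR (a stand-in for CIRE): Omega_SIR = covariance of the slice means of centered X
Xc = X - X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
Omega = np.zeros((p, p))
for idx in np.array_split(np.argsort(Y), H):          # slice observations by the order of Y
    m = Xc[idx].mean(axis=0)
    Omega += (len(idx) / n) * np.outer(m, m)

# Generalized eigenproblem Omega b = lambda Sigma b; the leading vector estimates eta1 up to sign/scale
vals, vecs = eigh(Omega, Sigma)                       # eigenvalues returned in ascending order
b = vecs[:, -1] / np.linalg.norm(vecs[:, -1])
print(np.round(b, 3))                                 # typically loads on all six predictors
```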

  8. Key Ideas
     Regularized SDR estimation:
     • observe that the majority of SDR estimators can be formulated as a generalized spectral decomposition problem
     • transform the spectral decomposition into an equivalent least squares formulation
     • add an L_1 penalty to the least squares criterion
     Focus of our work:
     • demonstrate that the resulting model free variable selection achieves the usual selection consistency, under the usual conditions, as in model based variable selection (e.g., the multiple linear regression model)

  9. Minimum Discrepancy Approach
     Generalized spectral decomposition formulation: Ω β_j = λ_j Σ β_j, j = 1, ..., p, where Ω is a p × p positive semi-definite symmetric matrix and Σ = cov(X). For instance,
     • Ω_SIR = cov[E{X − E(X) | Y}]
     • Ω_SAVE = Σ − cov(X | Y)
     • Ω_PHD = E[{Y − E(Y)}{X − E(X)}{X − E(X)}^T]
     In each case, Σ^{−1} Span(Ω) ⊆ S_{Y|X}
     Assumptions:
     • the SDR methods above impose assumptions on the marginal distribution of X, instead of on the conditional distribution of Y | X
     • hence they are model free
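
A hedged sketch of this spectral decomposition step, using the PHD kernel from the list above; the function name, the sample version of the kernel, and the ordering by absolute eigenvalue are my choices, not from the slides.

```python
import numpy as np
from scipy.linalg import eigh

def phd_directions(X, Y, d):
    """Generalized spectral decomposition Omega b_j = lambda_j Sigma b_j with the PHD kernel
    Omega = E[{Y - E(Y)}{X - E(X)}{X - E(X)}^T]; a sketch, the function name is illustrative."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    Sigma = np.cov(X, rowvar=False)
    Omega = (Xc * Yc[:, None]).T @ Xc / n             # sample version of the PHD kernel
    Omega = (Omega + Omega.T) / 2.0                   # symmetrize against round-off
    vals, vecs = eigh(Omega, Sigma)                   # generalized eigenproblem, ascending eigenvalues
    order = np.argsort(-np.abs(vals))                 # PHD eigenvalues can have either sign
    return vecs[:, order[:d]]                         # d leading directions in Sigma^{-1} Span(Omega)
```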

  10. Minimum Discrepancy Approach
     An equivalent least squares optimization formulation: consider
     min over η ∈ R^{p×d}, γ ∈ R^{d×h} of L(η, γ) = Σ_{j=1}^h (θ_j − η γ_j)^T Σ (θ_j − η γ_j), subject to η^T Σ η = I_d
     Let (η̃, γ̃) = arg min_{η,γ} L(η, γ). Then η̃ consists of the first d eigenvectors (β_1, ..., β_d) from the eigen decomposition Ω β_j = λ_j Σ β_j, j = 1, ..., p, where Ω = Σ (Σ_{j=1}^h θ_j θ_j^T) Σ.
     In matrix form: L(η, γ) = {vec(θ) − vec(ηγ)}^T V {vec(θ) − vec(ηγ)}, where V = I_h ⊗ Σ
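
As a quick sanity check of the notation, the following sketch (random inputs, hypothetical dimensions) verifies numerically that the sum-of-quadratics form of L(η, γ) coincides with the matrix form with V = I_h ⊗ Σ, where vec denotes column stacking.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, h = 6, 2, 4                                      # hypothetical dimensions
Sigma = np.cov(rng.standard_normal((200, p)), rowvar=False)
theta = rng.standard_normal((p, h))
eta = rng.standard_normal((p, d))
gamma = rng.standard_normal((d, h))

# Sum form: L = sum_j (theta_j - eta gamma_j)' Sigma (theta_j - eta gamma_j)
R = theta - eta @ gamma
L_sum = sum(R[:, j] @ Sigma @ R[:, j] for j in range(h))

# Matrix form: L = {vec(theta) - vec(eta gamma)}' (I_h kron Sigma) {vec(theta) - vec(eta gamma)}
vec = lambda M: M.ravel(order="F")                     # column-stacking vec operator
r = vec(theta) - vec(eta @ gamma)
L_mat = r @ np.kron(np.eye(h), Sigma) @ r

print(np.isclose(L_sum, L_mat))                        # True: the two forms agree
```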

  11. Minimum Discrepancy Approach
     A minimum discrepancy formulation:
     • start with the construction of a p × h matrix θ = (θ_1, ..., θ_h) such that Span(θ) ⊆ S_{Y|X}; given data, construct a √n-consistent estimator θ̂ of θ
     • construct a ph × ph positive definite matrix V, and a √n-consistent estimator V̂ of V
     • estimate (η, γ) by minimizing a quadratic discrepancy function:
       (η̂, γ̂) = arg min over η ∈ R^{p×d}, γ ∈ R^{d×h} of {vec(θ̂) − vec(ηγ)}^T V̂ {vec(θ̂) − vec(ηγ)}
     • Span(η̂) forms a consistent inverse regression estimator of S_{Y|X}
     • Cook and Ni (2005)
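
The slide states the discrepancy function but not how to minimize it. Below is a minimal alternating weighted least squares sketch, not necessarily the algorithm used by Cook and Ni (2005); the warm start and iteration count are arbitrary choices.

```python
import numpy as np

def min_discrepancy(theta_hat, V_hat, d, n_iter=100):
    """Minimize {vec(theta_hat) - vec(eta gamma)}' V_hat {vec(theta_hat) - vec(eta gamma)} over
    eta (p x d) and gamma (d x h) by alternating weighted least squares; a sketch only."""
    p, h = theta_hat.shape
    vec = lambda M: M.ravel(order="F")                 # column-stacking vec
    t = vec(theta_hat)
    eta = np.linalg.svd(theta_hat, full_matrices=False)[0][:, :d]   # warm start (left singular vectors)
    for _ in range(n_iter):
        # vec(eta gamma) = (I_h kron eta) vec(gamma): weighted LS for gamma given eta
        A = np.kron(np.eye(h), eta)
        g = np.linalg.solve(A.T @ V_hat @ A, A.T @ V_hat @ t)
        gamma = g.reshape(d, h, order="F")
        # vec(eta gamma) = (gamma' kron I_p) vec(eta): weighted LS for eta given gamma
        B = np.kron(gamma.T, np.eye(p))
        e = np.linalg.solve(B.T @ V_hat @ B, B.T @ V_hat @ t)
        eta = e.reshape(p, d, order="F")
    return eta, gamma
```

Each step has a closed-form weighted least squares solution, so the discrepancy is non-increasing across iterations.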

  12. Minimum Discrepancy Approach
     A whole class of estimators: each member is determined by the choice of the pair (θ, V) and its estimator (θ̂, V̂); for instance,
     • for sliced inverse regression (SIR): θ_s = f_s Σ^{−1} {E(X | J_s = 1) − E(X)}, V = diag(f_s^{−1}) ⊗ Σ
     • for covariance inverse regression estimation (CIRE): θ_s = Σ^{−1} cov(Y J_s, X), V = Γ^{−1}, where Γ is the asymptotic covariance of n^{1/2} {vec(θ̂) − vec(θ)}
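
A sketch of how the SIR pair (θ̂, V̂) might be formed from a sample using the formulas on this slide; the slicing scheme, slice count, and example model are my choices, and the resulting pair can be handed to a minimizer such as the one sketched after slide 11.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, H = 200, 6, 5                                    # H slices; the example model is illustrative
X = rng.standard_normal((n, p))
Y = np.exp(-0.5 * (X[:, 0] - X[:, 1]) / np.sqrt(2.0)) + 0.5 * rng.standard_normal(n)

# SIR choice from this slide: theta_s = f_s Sigma^{-1}{E(X | J_s = 1) - E(X)}, V = diag(f_s^{-1}) kron Sigma
Sigma = np.cov(X, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)
xbar = X.mean(axis=0)
slices = np.array_split(np.argsort(Y), H)              # J_s = 1 when Y falls in slice s
f = np.array([len(idx) / n for idx in slices])         # slice proportions f_s
theta_hat = np.column_stack(
    [f[s] * Sigma_inv @ (X[idx].mean(axis=0) - xbar) for s, idx in enumerate(slices)]
)                                                      # p x H
V_hat = np.kron(np.diag(1.0 / f), Sigma)               # (pH) x (pH) weight matrix
# (theta_hat, V_hat) can now be passed to a minimizer such as the one sketched after slide 11
```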

  13. Regularized Minimum Discrepancy Approach
     Proposed regularization solution:
     • let α = (α_1, ..., α_p)^T denote a p × 1 shrinkage vector; given (θ̂, η̂, γ̂), solve
       α̂ = arg min_α {vec(θ̂) − vec(diag(α) η̂ γ̂)}^T V̂ {vec(θ̂) − vec(diag(α) η̂ γ̂)}, subject to Σ_{j=1}^p |α_j| ≤ τ, τ ≥ 0
     • Span{diag(α̂) η̂} is called the shrinkage inverse regression estimator of S_{Y|X}
     • note that:
       – when τ ≥ p, α̂_j = 1 for all j
       – as τ decreases, some α̂_j's are shrunk to zero, which in turn shrinks the corresponding entire rows of η̂ to zero

  14. Regularized Minimum Discrepancy Approach
     Additional notes:
     • generalizes the shrinkage SIR estimator of Ni, Cook, and Tsai (2005)
     • closely related to the nonnegative garrote (Breiman, 1995)
     • Pr(α̂_j ≥ 0) → 1 for all j
     • use an information-type criterion to select the tuning parameter τ
     • achieves simultaneous dimension reduction and variable selection

  15. Regularized Minimum Discrepancy Approach
     Optimization: with D denoting the ph × p matrix obtained by stacking diag(η̂ γ̂_1), ..., diag(η̂ γ̂_h) vertically,
       α̂ = arg min_α n {vec(θ̂) − D α}^T V̂ {vec(θ̂) − D α}
     It becomes a "standard" lasso problem, with "response" U ∈ R^{ph} and "predictors" W ∈ R^{ph×p}:
       U = √n V̂^{1/2} vec(θ̂),   W = √n V̂^{1/2} D
     The optimization is easy.
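
A sketch of this lasso reformulation, assuming estimates θ̂, V̂, η̂, γ̂ are already available; it uses scikit-learn's penalized (Lagrangian) form of the L1 constraint, which is equivalent to the τ-constrained form for some value of the penalty, and the function name and arguments are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm
from sklearn.linear_model import Lasso

def shrinkage_alpha(theta_hat, V_hat, eta_hat, gamma_hat, n, lam):
    """Build the lasso 'response' U and 'predictors' W from this slide and solve for the
    shrinkage vector alpha; a sketch using sklearn's penalized form of the L1 constraint."""
    p, h = theta_hat.shape
    vec = lambda M: M.ravel(order="F")                 # column-stacking vec
    V_half = np.real(sqrtm(V_hat))                     # symmetric square root of V_hat
    U = np.sqrt(n) * V_half @ vec(theta_hat)           # "response", length p*h
    fitted = eta_hat @ gamma_hat                       # p x h, column j is eta_hat gamma_j
    D = np.vstack([np.diag(fitted[:, j]) for j in range(h)])   # stack of diag(eta_hat gamma_j), (p*h) x p
    W = np.sqrt(n) * V_half @ D                        # "predictors"
    alpha_hat = Lasso(alpha=lam, fit_intercept=False).fit(W, U).coef_
    return alpha_hat                                   # zero entries drop the corresponding predictors
```

In practice the penalty (equivalently τ) would be chosen by the information-type criterion mentioned on slide 14; the zero entries of α̂ identify the predictors to drop.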

  16. Variable Selection without a Model
     Goal: to seek the smallest subset of the predictors X_A, with the partition X = (X_A^T, X_{A^c}^T)^T, such that Y ⊥ X_{A^c} | X_A
     Here A denotes a subset of indices of {1, ..., p} corresponding to the relevant predictor set X_A, and A^c is the complement of A.
     Existence and uniqueness: given the existence of the central subspace S_{Y|X}, A exists and is unique.

  17. Variable Selection without a Model
     Relation between A and a basis of S_{Y|X} (Cook, 2004, Proposition 1): partition the basis as
       η_{p×d} = (η_A^T, η_{A^c}^T)^T, with η_A ∈ R^{(p − p_0) × d} and η_{A^c} ∈ R^{p_0 × d}
     The rows of a basis of the central subspace corresponding to X_{A^c}, i.e., the rows of η_{A^c}, are all zero vectors; conversely, all predictors whose corresponding rows of the S_{Y|X} basis equal zero belong to X_{A^c}.
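
A tiny sketch of how this proposition is used with the shrinkage estimator: read off the estimated index set A from the rows of diag(α̂) η̂ that are not numerically zero (the tolerance and function name are my choices).

```python
import numpy as np

def selected_predictors(alpha_hat, eta_hat, tol=1e-8):
    """Indices A of relevant predictors: rows of diag(alpha_hat) eta_hat that are (numerically)
    nonzero; zero rows correspond to X_{A^c} by the proposition above. A sketch."""
    B = np.diag(alpha_hat) @ eta_hat                   # shrunken basis of the central subspace
    row_norms = np.linalg.norm(B, axis=1)
    return np.flatnonzero(row_norms > tol)             # estimated index set A
```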
