  1. Sliced Inverse Regression with Interaction (SIRI) Detection for non-Gaussian BN Learning. Jun S. Liu, Department of Statistics, Harvard University. Joint work with Bo Jiang.

  2. General: Regression and Classification
     Responses and covariates:
       Ind 1:  x_11, x_12, …, x_1p  →  Y_1
       Ind 2:  x_21, x_22, …, x_2p  →  Y_2
       ⋮
       Ind N:  x_N1, x_N2, …, x_Np  →  Y_N

  3. [Figure slide; no text recoverable]

  4. Variable Selection with Interaction
     Let Y ∈ R be a univariate response variable and X ∈ R^p a vector of p continuous predictor variables:
       Y = X_1 × X_2 + ε,  ε ∼ N(0, σ²),  X ∼ MVN(0, I_p)
     Suppose p = 1000. How do we find X_1 and X_2?
     One-step forward selection: ∼500,000 candidate interaction terms.

  5. (Slide 4 continued.) Is there any marginal relationship between Y and X_1? The sketch below suggests not.
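To see why marginal screening fails here, a minimal numpy sketch (setup and variable names are illustrative, not from the slides): with Y = X_1·X_2 + ε and independent standard normal predictors, E(Y | X_1) = X_1·E(X_2) = 0, so the marginal correlation between Y and X_1 is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 2000, 1000, 0.5

# X ~ MVN(0, I_p);  Y = X_1 * X_2 + eps
X = rng.standard_normal((n, p))
y = X[:, 0] * X[:, 1] + sigma * rng.standard_normal(n)

# Marginal |correlation| of Y with each predictor: the truly relevant
# X_1 (column 0) is indistinguishable from the 998 noise predictors.
corrs = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
print(f"|corr(Y, X_1)| = {corrs[0]:.3f}")          # ~ 0
print(f"rank of X_1 by |corr|: {np.argsort(-corrs).tolist().index(0) + 1} of {p}")
```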

  6. [Y | X]? [X | Y]? Who is behind the bar?

  7. General: Regression and Classification
     (Same data layout as slide 2: Ind i has covariates x_i1, …, x_ip and response Y_i.)
       P(Y | X) = P(X | Y) P(Y) / P(X)
     How to model this?

  8. Naïve Bayes model [diagram: Y → X_1, X_2, X_3, …, X_m]
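Concretely, the naïve Bayes model computes P(Y = c | X) ∝ P(Y = c) ∏_j P(X_j | Y = c), treating the X_j as conditionally independent given Y. A minimal Gaussian sketch (function names are my own, for illustration only):

```python
import numpy as np

# Gaussian naive Bayes: P(Y=c | X) ∝ P(Y=c) * prod_j P(X_j | Y=c),
# treating predictors as conditionally independent given the class.
def nb_fit(X, y):
    classes = np.unique(y)
    prior = np.array([np.mean(y == c) for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])   # (C, p)
    sd = np.array([X[y == c].std(axis=0) for c in classes])    # (C, p)
    return classes, prior, mu, sd

def nb_predict(X, classes, prior, mu, sd):
    # Log-posterior up to a constant: log P(Y=c) + sum_j log N(x_j; mu_cj, sd_cj)
    logp = np.log(prior) + np.sum(
        -0.5 * ((X[:, None, :] - mu) / sd) ** 2 - np.log(sd), axis=2)
    return classes[np.argmax(logp, axis=1)]

# Usage: two Gaussian classes separated in the mean
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(1, 1, (100, 3))])
y = np.repeat([0, 1], 100)
print("training accuracy:", np.mean(nb_predict(X, *nb_fit(X, y)) == y))
```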

  9. (Augmented) Naïve Bayes Model
     • BEAM: Bayesian Epistasis Association Mapping (Zhang and Liu 2007): discrete univariate response and discrete predictors
     • (Augmented) Naïve Bayes Classifier with Variable Selection and Interaction Detection (Yuan Yuan et al.): discrete univariate response and continuous (but discretized) predictors
     • Bayesian Partition Model for eQTL study (Zhang et al. 2010): continuous multivariate responses and discrete predictors
     • Sliced Inverse Regression with Interaction Detection (SIRI): continuous univariate response and continuous predictors

  10. Tree-Augmented Naïve Bayes (TAN) [diagram: Y points to X_1, …, X_6, with an augmenting tree among the X_j] (Pearl 1988; Friedman 1997)

  11. Augmented Naïve Bayes [diagram: Y with predictor groups — Group 0: X_01, X_02; Group 1: X_11, X_12, X_13; Group 21: X_2.11, X_2.12, X_2.13; Group 22: X_2.21, X_2.22]

  12. How about continuous covariates?
      • We may discretize Y, and discretize each X
      • Or discretize Y, assuming joint Gaussian distributions on X?
      • Sound familiar?

  13. An observation:
        Y = X_1 × X_2 + ε,  ε ∼ N(0, σ²),  X ∼ MVN(0, I_p)
      [scatterplot of y versus x_1]

  14. Sliced Inverse Regression (SIR, Li 1991)
      SIR is a tool for dimension reduction in multivariate statistics.
      Let Y ∈ R be a univariate response variable and X ∈ R^p a vector of p continuous predictor variables:
        Y = f(β_1^T X, …, β_K^T X, ε)
      where f is an unknown function and ε is the error with finite variance.
      How to identify the unknown projection vectors β_1, …, β_K?

  15.–30. [Figure slides; no text recoverable from the transcript]

  31. SIR Algorithm
      Let Σ_xx be the covariance matrix of X. Standardize X to Z = Σ_xx^{−1/2} (X − E X).
      Divide the range of the y_i into S non-overlapping slices H_s, s ∈ {1, …, S}; n_s is the number of observations within slice s.
      Compute the mean of the z_i within each slice, z̄_s = n_s^{−1} Σ_{i ∈ H_s} z_i, and calculate the estimate of Cov{E(Z | Y)}:
        M̂ = n^{−1} Σ_{s=1}^S n_s z̄_s z̄_s^T
      Identify the K largest eigenvalues λ̂_k of M̂ and the corresponding eigenvectors η̂_k. Then β̂_k = Σ̂_xx^{−1/2} η̂_k (k = 1, …, K).
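The algorithm translates nearly line-for-line into numpy. A minimal sketch, assuming equal-count slices (the function name sir_directions and the test setup are my own, not from the talk):

```python
import numpy as np

def sir_directions(X, y, n_slices=10, K=2):
    """Sliced Inverse Regression (Li 1991): top-K eigenvalues and directions."""
    n, p = X.shape
    # Standardize: Z = Sigma_xx^{-1/2} (X - E X)
    Xc = X - X.mean(axis=0)
    Sigma = np.atleast_2d(np.cov(X, rowvar=False))
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ Sigma_inv_sqrt

    # Divide the range of y into S non-overlapping (equal-count) slices
    slices = np.array_split(np.argsort(y), n_slices)

    # M_hat = n^{-1} sum_s n_s zbar_s zbar_s^T, estimating Cov{E(Z | Y)}
    M = np.zeros((p, p))
    for idx in slices:
        zbar = Z[idx].mean(axis=0)
        M += len(idx) * np.outer(zbar, zbar)
    M /= n

    # K largest eigenvalues/eigenvectors; map back: beta_k = Sigma^{-1/2} eta_k
    lam, eta = np.linalg.eigh(M)
    top = np.argsort(lam)[::-1][:K]
    return lam[top], Sigma_inv_sqrt @ eta[:, top]

# Usage: Y depends on X only through the single index X_1 + X_2
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 10))
y = np.exp(X[:, 0] + X[:, 1]) + 0.1 * rng.standard_normal(1000)
lam, B = sir_directions(X, y, K=1)
b = B[:, 0] / np.linalg.norm(B[:, 0])
print(np.round(b, 2))   # ~ ±(0.71, 0.71, 0, ..., 0)
```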

  32. SIR with Variable Selection
      Only a subset of predictors is relevant: β_1, …, β_K are sparse.
      • Backward subset selection (Cook 2004; Li et al. 2005)
      • Shrinkage estimates of β_1, …, β_K using an L_1- or L_2-penalty: Regularized SIR (RSIR, Zhong et al. 2005); Sparse SIR (SSIR, Li 2007)
      • Correlation Pursuit (Zhong et al. 2012): a forward-selection and backward-elimination procedure motivated by the F-test in stepwise regression,
          F_{1, n−d−1} = (n − d − 1)(R̂²_{d+1} − R̂²_d) / (1 − R̂²_{d+1})

  33. Correlation Pursuit (COP)
      Let A be the current set of selected predictors and λ̂_k^A the k-th largest eigenvalue estimated by SIR based on the predictors in A. For the j-th predictor X_j (j ∉ A), define the statistic
        COP_k^{A+j} = n (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j})

  34.–36. (Slide 33 continued.)
      If X_j is irrelevant, the COP_k^{A+j} (k = 1, …, K) are asymptotically i.i.d. χ²(1), so COP_{1:K}^{A+j} = Σ_{k=1}^K COP_k^{A+j} is asymptotically χ²(K).
      The stepwise procedure is consistent if p = O(n^r), r < 1/2.
      The dimension K and the thresholds in forward selection (backward elimination) are chosen by cross-validation. (A sketch of the procedure follows below.)
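A sketch of the statistic on top of the eigenvalues returned by the hypothetical sir_directions above, with a bare-bones forward-selection loop (the seeding rule, slice count, and 0.999 cutoff are illustrative choices, not the authors' settings):

```python
import numpy as np
from scipy.stats import chi2

def cop_statistic(X, y, A, j, K=1, n_slices=10):
    """COP_{1:K}^{A+j} = sum_k n (lam_k^{A+j} - lam_k^A) / (1 - lam_k^{A+j})."""
    n = X.shape[0]
    lam_A, _ = sir_directions(X[:, A], y, n_slices, K)
    lam_Aj, _ = sir_directions(X[:, A + [j]], y, n_slices, K)
    return float(np.sum(n * (lam_Aj - lam_A) / (1 - lam_Aj)))

# Forward selection on a model SIR can see (a monotone single index):
rng = np.random.default_rng(2)
n, p, K = 2000, 50, 1
X = rng.standard_normal((n, p))
y = np.exp(X[:, 0] + X[:, 1]) + 0.5 * rng.standard_normal(n)

# Seed with the marginally strongest predictor; threshold from chi^2(K)
A = [int(np.argmax([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)]))]
thresh = chi2.ppf(0.999, df=K)
while True:
    stats = {j: cop_statistic(X, y, A, j, K) for j in range(p) if j not in A}
    j_best = max(stats, key=stats.get)
    if stats[j_best] < thresh:
        break
    A.append(j_best)
print("selected predictors:", sorted(A))   # expect [0, 1]
```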

  37. SIR via MLE
      Let A be the set of relevant predictors, C = ¬A its complement, and d = |A|. Model:
        X_A | Y ∈ H_s ∼ N(μ_s, Σ)
        X_C | X_A, Y ∈ H_s ∼ N(X_A β, Σ_0)

  38.–41. (Slide 37 continued.)
      μ_s = α + Γ γ_s, where γ_s ∈ R^K and Γ is a d × K orthogonal matrix; that is, μ_s belongs to a K-dimensional affine space α + V_K (K < d).
      The MLE of the span of the subspace V_K coincides with the SIR directions (Cook 2007; Szretter and Yohai 2009).

  42.–44. (Slide 37 continued.)
      Given the current A and a predictor X_j ∉ A, we want to test H_0: X_j is irrelevant vs. H_1: X_j is relevant. The coordinates other than X_j factor out of the likelihood ratio,
        P_{M_1}(X | Y) / P_{M_0}(X | Y) = P_{M_1}(X_j | X_A, Y) / P_{M_0}(X_j | X_A, Y)
      so the test statistic is
        LR_j = P̂_{M_1}(X_j | X_A, Y) / P̂_{M_0}(X_j | X_A, Y)

  45. LR Test vs. COP
      Given the current A, the likelihood-ratio (LR) test statistic of H_0: X_j is irrelevant vs. H_1: X_j is relevant is
        2 log LR_j = −n [ Σ_{k=1}^K log(1 − λ̂_k^{A+j}) − Σ_{k=1}^K log(1 − λ̂_k^A) ]
                   = n Σ_{k=1}^K log( 1 + (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j}) )
      Under H_0 (X_j irrelevant), COP_k^{A+j} = n (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j}) → χ²(1) in distribution while (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j}) →_p 0, so by log(1 + x) ≈ x the LR statistic is asymptotically equivalent to COP_{1:K}^{A+j}.
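A quick numeric check of this equivalence, reusing the hypothetical sir_directions and cop_statistic sketches from above: since log(1 + x) ≈ x for small x, the eigenvalue-based LR statistic should closely track the summed COP statistic.

```python
import numpy as np

def lr_statistic(X, y, A, j, K=1, n_slices=10):
    """2 log LR_j = -n [ sum_k log(1 - lam_k^{A+j}) - sum_k log(1 - lam_k^A) ]."""
    n = X.shape[0]
    lam_A, _ = sir_directions(X[:, A], y, n_slices, K)
    lam_Aj, _ = sir_directions(X[:, A + [j]], y, n_slices, K)
    return float(-n * (np.sum(np.log(1 - lam_Aj)) - np.sum(np.log(1 - lam_A))))

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 20))
y = np.exp(X[:, 0] + X[:, 1]) + 0.1 * rng.standard_normal(2000)

print(lr_statistic(X, y, A=[0], j=1))   # relevant X_2: large, and ...
print(cop_statistic(X, y, A=[0], j=1))  # ... nearly identical to 2 log LR
print(lr_statistic(X, y, A=[0], j=5))   # irrelevant X_6: ~ chi^2(1) scale
```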
