  1. Discovering Conditionally Salient Features with Statistical Guarantees. Jaime Roquero Gimenez, James Zou. Stanford University.

  2. Feature Selection
     Setting the problem: a dataset with $d$ features $X_1, \ldots, X_d$ and a response variable $Y$.
     Goal: find the set of important variables $H_1 \subset \{1, \ldots, d\}$.
     A variable $j \in H_0$ is null (i.e. irrelevant for predicting $Y$) if $X_j \perp\!\!\!\perp Y \mid X_{-j}$; otherwise, we say that $j \in H_1$ is non-null.
     Construct a procedure that outputs an estimate $\hat{S}$ of $H_1$, with False Discovery Rate control as the statistical guarantee:
     $$\mathrm{FDR} = \mathbb{E}\left[\frac{|\hat{S} \cap H_0|}{|\hat{S}| \vee 1}\right]$$
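
A minimal sketch in code (the helper and the toy numbers are ours, not the slides'): the false discovery proportion of a selection, whose expectation over repeated experiments is the FDR.

```python
# False discovery proportion (FDP) of a selection S_hat when the true
# null set H0 is known, e.g. in a simulation. FDR = E[FDP].
def fdp(S_hat: set, H0: set) -> float:
    """|S_hat ∩ H0| / (|S_hat| ∨ 1)."""
    return len(S_hat & H0) / max(len(S_hat), 1)

# Toy example with d = 9 features: features 1 and 3 are non-null.
H0 = {2, 4, 5, 6, 7, 8, 9}
print(fdp({1, 3, 4}, H0))  # one false discovery out of three -> 0.333...
```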

  3. Feature Selection in a Linear Model
     Fit a linear model to the data:
     $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \cdots + \beta_d X_d + \epsilon$$
     Which variables are important? Those whose coefficients are non-zero:
     $\beta_1, \beta_3 \neq 0 \Rightarrow 1, 3 \in H_1$
     $\beta_2 = \beta_4 = \cdots = \beta_d = 0 \Rightarrow 2, 4, \ldots, d \in H_0$
     In this model, non-null features are global non-nulls: we have $H_1 = \{1, 3\}$ regardless of the value of $X$.
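
A toy illustration of this slide (the data-generating setup and the lasso penalty are assumptions of ours; note that selecting the non-zero lasso coefficients by itself carries no FDR guarantee):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
Y = 2.0 * X[:, 0] + 1.5 * X[:, 2] + rng.standard_normal(n)  # true H1 = {1, 3} (1-based)

# Select the features whose fitted coefficients are non-zero.
beta_hat = Lasso(alpha=0.1).fit(X, Y).coef_
S_hat = {j + 1 for j in np.flatnonzero(beta_hat)}  # 1-based indices
print(S_hat)  # ideally {1, 3}
```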

  4. Global vs. Local Non-nulls
     What if a feature is non-null depending on the value of other features?
     $$Y = \begin{cases} X_2 + \epsilon & \text{if } X_1 > c \\ X_3 + \epsilon & \text{if } X_1 \le c \end{cases}$$
     so that, informally,
     $$H_1 = \begin{cases} \{1, 2\} & \text{if } X_1 > c \\ \{1, 3\} & \text{if } X_1 \le c \end{cases}$$
     From a global perspective, $H_1 = \{1, 2, 3\}$.
     Can we build a procedure that selects non-null features locally, while retaining statistical guarantees? Potentially yes, if we model interactions in parametric models of $Y \mid X$. But what if such models are not available?
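
A quick simulation of this switch model (the cutoff $c = 0$, the noise level, and the sample size are our choices): which feature carries signal depends on the side of the switch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 100_000, 0.0
X1, X2, X3 = rng.standard_normal((3, n))
Y = np.where(X1 > c, X2, X3) + 0.1 * rng.standard_normal(n)

# Locally, Y tracks X2 on one side of the switch and X3 on the other.
hi = X1 > c
print(np.corrcoef(Y[hi], X2[hi])[0, 1])    # close to 1
print(np.corrcoef(Y[~hi], X3[~hi])[0, 1])  # close to 1
```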

  5. Local Definition of a Null Variable
     A variable $j \in H_0(x)$ is a local null at $X = x$ if $X_j \perp\!\!\!\perp Y \mid X_{-j} = x_{-j}$.
     We define / construct:
     - the sets of local nulls $H_0(x)$ and local non-nulls $H_1(x)$ at points in feature space;
     - a procedure that returns a local estimate $\hat{S}(x)$ of the local non-nulls;
     - a generalization of the FDR to a local FDR (written out below).
     How do we retain FDR control in a local setting, without using a parametric model for $Y \mid X$?
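
By direct analogy with the global definition on slide 2, the local FDR can be written as below. The deck does not display this formula, so treat it as a plausible reconstruction rather than the paper's verbatim definition.

```latex
% Local FDR at a point x: same ratio as the global FDR, with the selection
% and the null set both evaluated locally (our reconstruction by analogy).
\[
  \mathrm{FDR}(x) = \mathbb{E}\left[
    \frac{|\hat{S}(x) \cap H_0(x)|}{|\hat{S}(x)| \vee 1}
  \right]
\]
```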

  6. Knockoff Procedure
     Most feature selection procedures construct a score $T_j$ for each feature:
     $$X_1, X_2, \ldots, X_d, Y \;\longrightarrow\; T_1, T_2, \ldots, T_d$$
     The scores are then ranked, and some cutoff yields $\hat{S}$.
     - A statistical model is needed to obtain guarantees on the FDR.
     - In high-dimensional settings, the statistical assumptions may fail.
     - For local feature selection, subsetting the data could limit power and break assumptions based on asymptotic behavior.
     These limitations make local feature selection a hard problem for the usual methods.
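
A generic instance of this score-then-cutoff pipeline (the choice of absolute lasso coefficients as scores and a top-$k$ cutoff is ours, purely for illustration; nothing in this sketch controls the FDR by itself):

```python
import numpy as np
from sklearn.linear_model import Lasso

def score_and_cut(X, Y, k: int, alpha: float = 0.1) -> set:
    """Scores T_j = |lasso coefficient of feature j|; keep the top k."""
    T = np.abs(Lasso(alpha=alpha).fit(X, Y).coef_)
    return {int(j) + 1 for j in np.argsort(T)[-k:]}  # 1-based indices
```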

  7. Knockoff Procedure
     The knockoff procedure generates a new, synthetic dataset $\tilde{X}$ and constructs scores as before:
     $$X_1, X_2, \ldots, X_d, \tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_d, Y \;\longrightarrow\; T_1, T_2, \ldots, T_d, \tilde{T}_1, \tilde{T}_2, \ldots, \tilde{T}_d$$
     Ranking the differences $W_j = T_j - \tilde{T}_j$ allows us to select features with FDR control.
     - FDR control does not require modeling $Y \mid X$.
     - The statistical guarantees depend only on the validity of the process that generates $\tilde{X}$.
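
The selection step in code (generating $\tilde{X}$ is the hard part and is omitted; the vector $W$ is assumed given). The threshold below is the standard knockoff+ rule of Barber and Candès, which the slides do not spell out, so take it as background rather than this paper's exact recipe.

```python
import numpy as np

def knockoff_select(W: np.ndarray, q: float = 0.1) -> set:
    """Select {j : W_j >= tau}, where tau is the smallest t > 0 with
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):  # candidate thresholds, ascending
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return {int(j) + 1 for j in np.flatnonzero(W >= t)}  # 1-based
    return set()  # no threshold achieves the target level
```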

  8. Localize the Knockoff Procedure
     Our work generalizes the knockoff procedure to tackle local feature selection:
     - Generalize the distributional properties of the knockoff variables $\tilde{X}$ to the local setting, without additional constraints.
     - Generalize the construction of the scores to capture local dependence (one possible instantiation is sketched below).
     By generating $\tilde{X}$ as in the usual knockoff procedure, using the whole dataset, the statistical guarantees hold for the localized procedure.
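
One plausible instantiation of locally dependent scores (entirely our illustration; the paper's actual construction may differ): weight samples by a kernel centered at the query point $x$ before fitting, while the knockoffs $\tilde{X}$ are generated once from the whole dataset, as the slide stipulates.

```python
import numpy as np
from sklearn.linear_model import Lasso

def local_W(X, X_tilde, Y, x, bandwidth: float = 1.0, alpha: float = 0.1):
    """W_j(x) = T_j(x) - T_tilde_j(x) from a kernel-weighted lasso fit on
    the augmented matrix [X, X_tilde] (hypothetical score construction)."""
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
    coef = Lasso(alpha=alpha).fit(np.hstack([X, X_tilde]), Y, sample_weight=w).coef_
    d = X.shape[1]
    return np.abs(coef[:d]) - np.abs(coef[d:])  # feed to knockoff_select above
```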

  9. Example: Switch Variable Model
     Three switch features $X_{s_0}, X_{s_1}, X_{s_2}$ and four different sets of local non-nulls $S_{00}, S_{01}, S_{10}, S_{11}$. $Y$ has a linear response in $X_{S_{ij}}$.

  10. Local FDR Control
      [Figure: two panels against the number of samples (5000 to 50000). Top: average power; bottom: average FDR. Each panel shows three curves: global (full space), local with medium radius (2 partitions), and local with small radius (4 partitions).]

  11. Thank you
