1. Selective inference: a conditional perspective
Xiaoying Tian Harris
Joint work with Jonathan Taylor
September 26, 2016

2. Model selection
◮ Observe data (y, X), X ∈ R^{n×p}, y ∈ R^n

3. Model selection
◮ Observe data (y, X), X ∈ R^{n×p}, y ∈ R^n
◮ model = lm(y ~ X1 + X2 + X3 + X4)

4. Model selection
◮ Observe data (y, X), X ∈ R^{n×p}, y ∈ R^n
◮ model = lm(y ~ X1 + X2 + X3 + X4)
  model = lm(y ~ X1 + X2 + X4)

5. Model selection
◮ Observe data (y, X), X ∈ R^{n×p}, y ∈ R^n
◮ model = lm(y ~ X1 + X2 + X3 + X4)
  model = lm(y ~ X1 + X2 + X4)
  model = lm(y ~ X1 + X3 + X4)

6. Model selection
◮ Observe data (y, X), X ∈ R^{n×p}, y ∈ R^n
◮ model = lm(y ~ X1 + X2 + X3 + X4)
  model = lm(y ~ X1 + X2 + X4)
  model = lm(y ~ X1 + X3 + X4)
◮ Inference after model selection:
  1. Use data to select a set of variables E
  2. Normal z-test to get p-values

7. Model selection
◮ Observe data (y, X), X ∈ R^{n×p}, y ∈ R^n
◮ model = lm(y ~ X1 + X2 + X3 + X4)
  model = lm(y ~ X1 + X2 + X4)
  model = lm(y ~ X1 + X3 + X4)
◮ Inference after model selection:
  1. Use data to select a set of variables E
  2. Normal z-test to get p-values
◮ Problem: inflated significance
  1. Normal z-tests need adjustment
  2. Selection is biased towards "significance"

8. Inflated significance
Setup:
◮ X ∈ R^{100×200} has i.i.d. normal entries
◮ y = Xβ + ε, ε ∼ N(0, I)
◮ β = (5, ..., 5, 0, ..., 0), with 10 nonzero entries
◮ LASSO, nonzero coefficient set E
◮ z-test, null p-values for i ∈ E, i ∉ {1, ..., 10}
[Figure: histogram of null p-values after selection; x-axis: p-values, y-axis: frequencies]
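A minimal R sketch of this simulation (not part of the slides), assuming the glmnet package and an arbitrary fixed lambda; the slides do not say how the LASSO penalty was chosen, and the naive z-tests below use the known noise level σ = 1.

    library(glmnet)
    set.seed(1)
    n <- 100; p <- 200; s <- 10
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(5, s), rep(0, p - s))
    y <- as.numeric(X %*% beta + rnorm(n))

    fit <- glmnet(X, y, lambda = 0.15)            # LASSO at one fixed, illustrative lambda
    E <- which(as.matrix(coef(fit))[-1] != 0)     # selected (active) set

    # Naive inference: refit least squares on the selected columns and read off
    # z-tests with sigma = 1, ignoring that E was chosen by looking at the same y
    XE <- X[, E]
    beta_hat <- drop(solve(crossprod(XE), crossprod(XE, y)))
    se <- sqrt(diag(solve(crossprod(XE))))
    pvals <- 2 * pnorm(-abs(beta_hat / se))

    # p-values for truly null selected variables pile up near 0: inflated significance
    hist(pvals[E > s], xlab = "p-values", main = "null p-values after selection")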

9. Post-selection inference
◮ PoSI approach:
  1. Reduces to simultaneous inference
  2. Protects against any selection procedure
  3. Conservative and computationally expensive

10. Post-selection inference
◮ PoSI approach:
  1. Reduces to simultaneous inference
  2. Protects against any selection procedure
  3. Conservative and computationally expensive
◮ Selective inference approach:
  1. Conditional approach
  2. Specific to particular selection procedures
  3. More powerful tests

11. Conditional approach: example
Consider the selection for "big effects":
◮ X_1, ..., X_n i.i.d. ∼ N(0, 1), X̄ = (1/n) Σ_{i=1}^n X_i
◮ Select for "big effects", X̄ > 1
◮ Observation: X̄_obs = 1.1, with n = 5
◮ Normal z-test vs. selective test for H_0: μ = 0
[Figure: original distribution of X̄ (left) vs. conditional distribution after selection (right)]

12. Conditional approach: example
Consider the selection for "big effects":
◮ X_1, ..., X_n i.i.d. ∼ N(0, 1), X̄ = (1/n) Σ_{i=1}^n X_i
◮ Select for "big effects", X̄ > 1
◮ Observation: X̄_obs = 1.1, with n = 5
◮ Normal z-test vs. selective test for H_0: μ = 0
[Figure: original distribution of X̄ (left) vs. conditional distribution after selection (right)]
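A small R sketch of the two tests in this toy example: under H_0: μ = 0, X̄ ∼ N(0, 1/n), and conditioning on the selection event {X̄ > 1} truncates that distribution at 1, so the selective p-value is a ratio of normal tail probabilities.

    n <- 5; xbar_obs <- 1.1; se <- 1 / sqrt(n)

    # Naive z-test p-value: P(Xbar > xbar_obs) under H0
    p_naive <- pnorm(xbar_obs, mean = 0, sd = se, lower.tail = FALSE)
    # Selective p-value: P(Xbar > xbar_obs | Xbar > 1) under H0
    p_selective <- p_naive / pnorm(1, mean = 0, sd = se, lower.tail = FALSE)

    round(c(naive = p_naive, selective = p_selective), 3)
    # naive ~ 0.007 looks highly significant; the selective p-value ~ 0.55 does not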

13. Moral of selective inference
Conditional approach:
◮ Selection, e.g. X̄ > 1.
◮ Conditional distribution after selection, e.g. N(μ, 1/n), truncated at 1.
◮ Target of inference may (or may not) depend on the outcome of the selection.
  1. Not dependent: e.g. H_0: μ = 0.
  2. Dependent: e.g. two-sample problem, inference for variables selected by LASSO.

14. Moral of selective inference
Conditional approach:
◮ Selection, e.g. X̄ > 1.
◮ Conditional distribution after selection, e.g. N(μ, 1/n), truncated at 1.
◮ Target of inference may (or may not) depend on the outcome of the selection.
  1. Not dependent: e.g. H_0: μ = 0.
  2. Dependent: e.g. two-sample problem, inference for variables selected by LASSO.
◮ Random hypothesis?

15. Random hypothesis
◮ Replication studies

16. Random hypothesis
◮ Replication studies
◮ Data splitting: observe data (X, y), with X fixed, entries of y are independent (given X)

17. Random hypothesis
◮ Replication studies
◮ Data splitting: observe data (X, y), with X fixed, entries of y are independent (given X)
Random hypothesis selected by the data

18. Random hypothesis
◮ Replication studies
◮ Data splitting: observe data (X, y), with X fixed, entries of y are independent (given X)
Random hypothesis selected by the data
◮ Data splitting as a conditional approach:
  L(y_2) = L(y_2 | H_0 selected by y_1).
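A toy R sketch (not from the slides) of data splitting viewed as conditioning, assuming the glmnet package; the 50/50 split and the fixed lambda are illustrative choices.

    library(glmnet)
    set.seed(2)
    n <- 100; p <- 200
    X <- matrix(rnorm(n * p), n, p)
    y <- as.numeric(X %*% c(rep(5, 10), rep(0, p - 10)) + rnorm(n))

    idx <- sample(n, n / 2)                          # the random split plays the role of omega
    fit1 <- glmnet(X[idx, ], y[idx], lambda = 0.3)   # select variables using y1 = y[idx] only
    E <- which(as.matrix(coef(fit1))[-1] != 0)

    # The entries of y are independent given X, so y2 is independent of y1 and
    # L(y2) = L(y2 | H0 selected by y1): classical inference on the held-out
    # half is valid for the selected model.
    summary(lm(y[-idx] ~ X[-idx, E]))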

19. Selective inference: a conditional approach
◮ Data splitting as a conditional approach:
  L(y_2) = L(y_2 | H_0 selected by y_1).
◮ Inference based on the conditional law:
  y* = y*(y, ω),  L(y | H_0 selected by y*),
  where ω is some randomization independent of y.

20. Selective inference: a conditional approach
◮ Data splitting as a conditional approach:
  L(y_2) = L(y_2 | H_0 selected by y_1).
◮ Inference based on the conditional law:
  y* = y*(y, ω),  L(y | H_0 selected by y*),
  where ω is some randomization independent of y.
◮ Examples of y*:
  1. y* = y_1, where ω is a random split
  2. y* = y, where ω is void
  3. y* = y + ω, where ω ∼ N(0, γ²), additive noise

21. Different y*
◮ Much more powerful tests.
◮ Randomization transfers the properties of unselective distributions to selective counterparts.
  y* = y:            Lee et al. (2013), Taylor et al. (2014)
  y* = y_1:          data splitting, Fithian et al. (2014)
  y* = y + ω:        T. & Taylor (2015)
  randomized LASSO:  T. & Taylor (2015)

22. Selective vs. unselective distributions
Example: X_1, ..., X_n i.i.d. ∼ N(0, 1), X̄ = (1/n) Σ_{i=1}^n X_i, n = 5.
Selection: X̄ > 1.
[Figure: original distribution of X̄ (left) vs. conditional distribution after selection (right)]

23. Selective vs. unselective distributions
Example: X_1, ..., X_n i.i.d. ∼ N(0, 1), X̄ = (1/n) Σ_{i=1}^n X_i, n = 5.
Selection: X̄ + ω > 1, where ω ∼ Laplace(0.15).
Explicit formulas for the densities of the selective distribution.
[Figure: original distribution of X̄ (left) vs. conditional distribution after selection (right)]
The selective distribution is much better behaved after randomization.
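A sketch in R of the "explicit formulas" point: with additive Laplace randomization, the selection probability ℓ(t) = P(t + ω > 1) is just the Laplace survival function, so the selective density under H_0 can be written and plotted in closed form. The scale 0.15 and its parametrization are read off the slide and treated as an assumption.

    n <- 5; b <- 0.15
    # P(omega > u) for omega ~ Laplace(0, scale)
    laplace_sf <- function(u, scale)
      ifelse(u < 0, 1 - 0.5 * exp(u / scale), 0.5 * exp(-u / scale))

    t <- seq(-1.5, 1.5, length.out = 401)
    ell <- laplace_sf(1 - t, b)                        # selective likelihood ell(t) = P(t + omega > 1)
    dens <- dnorm(t, mean = 0, sd = 1 / sqrt(n)) * ell # unnormalized selective density under H0
    dens <- dens / (sum(dens) * diff(t)[1])            # normalize numerically

    plot(t, dens, type = "l", xlab = "t", ylab = "density",
         main = "selective density after randomized selection")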

24. Selective vs. unselective distributions
◮ Suppose X_i i.i.d. ∼ F, X_i ∈ R^k.
◮ Linearizable statistics: T = (1/n) Σ_{i=1}^n ξ_i(X_i) + o_p(n^{-1/2}), with ξ_i measurable with respect to X_i.
◮ Central limit theorem: T ⇒ N(μ, Σ/n), where E[T] = μ ∈ R^p and Var(T) = Σ.

25. Selective vs. unselective distributions
◮ Suppose X_i i.i.d. ∼ F, X_i ∈ R^k.
◮ Linearizable statistics: T = (1/n) Σ_{i=1}^n ξ_i(X_i) + o_p(n^{-1/2}), with ξ_i measurable with respect to X_i.
◮ Central limit theorem: T ⇒ N(μ, Σ/n), where E[T] = μ ∈ R^p and Var(T) = Σ.
Would this still hold under the selective distribution?

26. Selective distributions
Randomized selection with T* = T*(T, ω), M̂: T* ↦ M.
◮ Original distribution of T (with density f): f(t)
◮ Selective distribution: f(t) ℓ(t), with
  ℓ(t) ∝ ∫ 1{M̂[T*(t, ω)] = M} g(ω) dω,
  where g is the density for ω.
◮ ℓ(t) is also called the selective likelihood.

27. Selective central limit theorem
Theorem (Selective CLT, T. and Taylor (2015)). If
  1. model selection is made with T* = T*(T, ω),
  2. the selective likelihood ℓ(t) satisfies some regularity conditions,
  3. T has a moment generating function in a neighbourhood of the origin,
then
  L(T | H_0 selected by T*) ⇒ L(N(μ, Σ) | H_0 selected by T*).
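A Monte Carlo sketch in R of what the theorem asserts, under illustrative choices that are not from the slides (exponential data, a mean statistic, additive Gaussian randomization, a cutoff of 0.2): the selective law of the non-Gaussian statistic should be close to the selective law of its normal limit under the same conditioning.

    set.seed(4)
    n <- 50; B <- 50000; cutoff <- 0.2; gamma <- 0.1
    Tbar  <- rowMeans(matrix(rexp(n * B), B, n)) - 1   # centered exponential means, Var = 1/n
    Tnorm <- rnorm(B, sd = 1 / sqrt(n))                # the unselective CLT limit
    omega <- rnorm(B, sd = gamma)                      # randomization used for selection

    sel_T <- Tbar[Tbar + omega > cutoff]               # selective law of Tbar
    sel_N <- Tnorm[Tnorm + omega > cutoff]             # same conditioning applied to the normal limit
    qqplot(sel_N, sel_T, xlab = "selective normal limit", ylab = "selective Tbar")
    abline(0, 1)                                       # points near the line: the two selective laws agree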

28. Power comparison
Unrandomized y* = y, randomized y* = y + ω, ω ∼ N(0, 0.1σ²).
HIVDB: http://hivdb.stanford.edu/
[Figure: confidence intervals for the parameter values of selected drug-resistance mutations (P62V, P65R, P67N, ..., P219R), randomized vs. unrandomized]

29. Tradeoff between power and model selection
◮ Setup: y = Xβ + ε, n = 100, p = 200, ε ∼ N(0, I), β = (7, ..., 7, 0, ..., 0) with 7 nonzero entries; X is equicorrelated with ρ = 0.3.
◮ Use randomized y* to fit the LASSO, active set E:
  1. Data splitting / data carving: y* = y_1, a random subset of y
  2. Additive randomization: y* = y + ω, ω ∼ N(0, γ²I)
Data carving picture credit: Fithian et al. (2014).
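A rough R sketch of the additive-randomization selection step in this setup, assuming the glmnet package, a fixed lambda, and an illustrative randomization scale γ; the experiments on the slide use the authors' selective-inference software, which this snippet does not reproduce.

    library(glmnet)
    set.seed(3)
    n <- 100; p <- 200; s <- 7; rho <- 0.3
    Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
    X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)    # equicorrelated design, rho = 0.3
    y <- as.numeric(X %*% c(rep(7, s), rep(0, p - s)) + rnorm(n))

    gamma <- 0.5                                       # randomization scale (not specified on the slide)
    y_star <- y + rnorm(n, sd = gamma)                 # y* = y + omega, omega ~ N(0, gamma^2 I)
    fit <- glmnet(X, y_star, lambda = 0.3)
    E <- which(as.matrix(coef(fit))[-1] != 0)          # active set selected with y*, not y
    # Selective inference is then based on L(y | E selected by y*), conditioning
    # on this selection event.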

30. Fithian, W., Sun, D. & Taylor, J. (2014), 'Optimal inference after model selection', arXiv:1410.2597 [math.ST]. URL: http://arxiv.org/abs/1410.2597
