
Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square
Yuxin Chen, Electrical Engineering, Princeton University
Coauthors: Pragya Sur and Emmanuel Candès (Stanford Statistics)


  1. Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square (Yuxin Chen, Electrical Engineering, Princeton University)

  2. Coauthors: Pragya Sur (Stanford Statistics) and Emmanuel Candès (Stanford Statistics & Math)

  3. In memory of Tom Cover (1938–2012), Tom @ Stanford EE. “We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer to find that the entire problem is illuminated not by the analysis but by the inspection of the answer.”

  4. Inference in regression problems. Example: logistic regression, y_i ∼ logistic-model(x_i^⊤ β), 1 ≤ i ≤ n

  5. Inference in regression problems. Example: logistic regression, y_i ∼ logistic-model(x_i^⊤ β), 1 ≤ i ≤ n. One wishes to determine which covariates are of importance, i.e. test β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

  6. Classical tests: β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics

  7. Classical tests: β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics:
  • Wald test: Wald statistic → χ²
  • Likelihood ratio test: log-likelihood ratio statistic → χ²
  • Score test: score → N(0, Fisher Info)
  • ...

  8. Example: logistic regression in R (n = 100, p = 30)

  > fit = glm(y ~ X, family = binomial)
  > summary(fit)

  Call:
  glm(formula = y ~ X, family = binomial)

  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -1.7727  -0.8718   0.3307   0.8637   2.3141

  Coefficients:
               Estimate Std. Error z value Pr(>|z|)
  (Intercept)  0.086602   0.247561   0.350  0.72647
  X1           0.268556   0.307134   0.874  0.38190
  X2           0.412231   0.291916   1.412  0.15790
  X3           0.667540   0.363664   1.836  0.06642 .
  X4          -0.293916   0.331553  -0.886  0.37536
  X5           0.207629   0.272031   0.763  0.44531
  X6           1.104661   0.345493   3.197  0.00139 **
  ...
  ---
  Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

  Can these inference calculations (e.g. p-values) be trusted?
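The same computation can be sketched outside R. The block below is a minimal illustration (not code from the talk; all helper names are our own): it fits the unconstrained and constrained logistic MLEs by generic numerical optimization and reads off the classical Wilks p-value for one coefficient.

```python
# Minimal sketch (assumed setup, not from the talk): classical LRT p-value
# for a single coefficient in logistic regression with y in {-1, +1}.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import chi2

def neg_loglik(beta, X, y):
    # -ell(beta) = sum_i log(1 + exp(-y_i x_i' beta)), computed stably
    return np.logaddexp(0.0, -y * (X @ beta)).sum()

def grad(beta, X, y):
    s = expit(-y * (X @ beta))          # sigmoid(-y_i x_i' beta)
    return -(X.T @ (y * s))

def lrt_pvalue(X, y, j):
    p = X.shape[1]
    full = minimize(neg_loglik, np.zeros(p), args=(X, y), jac=grad, method="BFGS")
    Xr = np.delete(X, j, axis=1)        # constrained model: beta_j = 0
    restr = minimize(neg_loglik, np.zeros(p - 1), args=(Xr, y), jac=grad, method="BFGS")
    llr = restr.fun - full.fun          # = ell(full MLE) - ell(constrained MLE) >= 0
    return chi2.sf(2.0 * llr, df=1)     # classical Wilks chi-square(1) calibration

rng = np.random.default_rng(0)
n, p = 100, 30                          # same shape as the R example
X = rng.standard_normal((n, p))
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)   # global null: beta = 0
print(lrt_pvalue(X, y, j=0))
```

Under the null, repeating this over many draws should give p-values that look uniform in this low-dimensional regime, which is exactly what the later slides put to the test when p/n is large.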

  9. This talk: likelihood ratio test (LRT). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Log-likelihood ratio (LLR) statistic: LLR_j := ℓ(β̂) − ℓ(β̂^(−j))
  • ℓ(·): log-likelihood
  • β̂ = arg max_β ℓ(β): unconstrained MLE

  10. This talk: likelihood ratio test (LRT). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). Log-likelihood ratio (LLR) statistic: LLR_j := ℓ(β̂) − ℓ(β̂^(−j))
  • ℓ(·): log-likelihood
  • β̂ = arg max_β ℓ(β): unconstrained MLE
  • β̂^(−j) = arg max_{β: β_j = 0} ℓ(β): constrained MLE

  11. Wilks’ phenomenon ’1938 (Samuel Wilks, Princeton). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). The LRT statistic asymptotically follows a chi-square distribution (under the null): 2 LLR_j → χ²_1 in distribution (p fixed, n → ∞)

  12. Wilks’ phenomenon ’1938 (Samuel Wilks, Princeton). β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p). The LRT statistic asymptotically follows a chi-square distribution (under the null): 2 LLR_j → χ²_1 in distribution (p fixed, n → ∞). Compute p-values from the χ²_1 tail to assess significance of coefficients

  13. Classical LRT in high dimensions. Linear regression: y = Xβ + η with i.i.d. Gaussian noise, n/p ∈ (1, ∞). [Histogram: p-values are approximately uniform.] For linear regression (with Gaussian noise) in high dimensions, 2 LLR_j ∼ χ²_1 (the classical test always works)
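The linear-model claim is easy to check numerically. The sketch below is ours (assuming the noise variance σ² is known): under the null, the LRT statistic (RSS_restricted − RSS_full)/σ² is an exact χ²_1 draw for any fixed design, even with p/n large, so its Monte Carlo mean should sit near 1.

```python
# Sketch: for linear regression with known Gaussian noise, the LRT statistic
# (RSS_restricted - RSS_full)/sigma^2 is exactly chi-square(1) under the null,
# even when p/n is large, so classical p-values stay uniform.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 60, 1.0
stats = []
for _ in range(400):
    X = rng.standard_normal((n, p))
    y = sigma * rng.standard_normal(n)      # global null: beta = 0
    rss_full = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    Xr = X[:, 1:]                           # restricted model: beta_1 = 0
    rss_restr = np.sum((y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]) ** 2)
    stats.append((rss_restr - rss_full) / sigma ** 2)
print(np.mean(stats))                       # chi-square(1) has mean 1
```

The exactness comes from RSS_restricted − RSS_full being the squared length of a rank-one projection of Gaussian noise; no asymptotics are involved, which is why the linear case is immune to the high-dimensional breakdown shown next for logistic regression.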

  14. Classical LRT in high dimensions. Logistic regression: y ∼ logistic-model(Xβ), p = 1200, n = 4000. [Histogram: classical p-values are highly nonuniform.]

  15. Classical LRT in high dimensions. Logistic regression: y ∼ logistic-model(Xβ), p = 1200, n = 4000. [Histogram: classical p-values are highly nonuniform.] Wilks’ theorem seems inadequate in accommodating logistic regression in high dimensions

  16. Bartlett correction? (n = 4000, p = 1200). [Histogram: classical Wilks p-values.]
  • Bartlett correction (finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1

  17. Bartlett correction? (n = 4000, p = 1200). [Histograms: classical Wilks vs. Bartlett-corrected p-values.]
  • Bartlett correction (finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1
  ◦ p-values are still non-uniform → this is NOT a finite-sample effect

  18. Bartlett correction? (n = 4000, p = 1200). [Histograms: classical Wilks vs. Bartlett-corrected p-values.]
  • Bartlett correction (finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1
  ◦ p-values are still non-uniform → this is NOT a finite-sample effect
  What happens in high dimensions?

  19. Our findings. [Histograms: classical Wilks, Bartlett-corrected, and rescaled χ² p-values.]
  • Bartlett correction (finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1
  ◦ p-values are still non-uniform → this is NOT a finite-sample effect
  • A glimpse of our theory: the LRT follows a rescaled χ² distribution

  20. Problem formulation (formal)
  • Gaussian design: X_i ∼ind. N(0, Σ)
  • Logistic model: y_i = +1 with prob. 1/(1 + exp(−X_i^⊤ β)); y_i = −1 with prob. 1/(1 + exp(X_i^⊤ β)), 1 ≤ i ≤ n
  • Proportional growth: p/n → constant
  • Global null: β = 0
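A minimal sketch of this data-generating process (the `simulate` helper and its names are our own; Σ is taken as the identity unless supplied):

```python
# Sketch of the slide's model: rows X_i ~ N(0, Sigma), and
# y_i = +1 with probability 1/(1 + exp(-X_i' beta)), else -1.
import numpy as np
from scipy.special import expit

def simulate(n, p, beta, Sigma=None, rng=None):
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(Sigma) if Sigma is not None else np.eye(p)
    X = rng.standard_normal((n, p)) @ L.T      # rows X_i ~ N(0, Sigma)
    prob_pos = expit(X @ beta)                 # P(y_i = +1 | X_i)
    y = np.where(rng.random(n) < prob_pos, 1.0, -1.0)
    return X, y

# Global null: beta = 0, so P(y_i = +1) = 1/2 regardless of X_i.
X, y = simulate(n=4000, p=1200, beta=np.zeros(1200), rng=np.random.default_rng(2))
print(y.mean())                                # near 0 under the null
```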

  21. When does MLE exist?
  (MLE) maximize_β ℓ(β) = −Σ_{i=1}^n log(1 + exp(−y_i X_i^⊤ β)), where each term is ≤ 0
  [Figure: two point clouds, one with y_i = +1 and one with y_i = −1.]
  The MLE is unbounded if ∃ a perfect separating hyperplane

  22. When does MLE exist?
  (MLE) maximize_β ℓ(β) = −Σ_{i=1}^n log(1 + exp(−y_i X_i^⊤ β)), where each term is ≤ 0
  If ∃ a hyperplane that perfectly separates {y_i}, i.e. ∃ β̄ s.t. y_i X_i^⊤ β̄ > 0 for all i, then the MLE is unbounded:
      lim_{a→∞} ℓ(a β̄) = 0
  so the supremum 0 is approached but never attained

  23. When does MLE exist? Separating capacity (Tom Cover, Ph.D. thesis ’1965). [Figure: points with y_i = −1 and y_i = +1 for n = 2, 4, 12.] As the number of samples n increases, it becomes more difficult to find a separating hyperplane

  24. When does MLE exist? Separating capacity (Tom Cover, Ph.D. thesis ’1965).
  Theorem 1 (Cover ’1965): Under i.i.d. Gaussian design, a separating hyperplane exists with high probability iff n/p < 2 (asymptotically)
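Separability can be tested exactly as a linear-program feasibility problem: a homogeneous separating hyperplane exists iff some β satisfies y_i X_i^⊤ β ≥ 1 for all i. The sketch below (our own reformulation, not from the talk) uses this check in a small Monte Carlo to illustrate Cover's n/p < 2 threshold.

```python
# Sketch: LP feasibility test for a separating hyperplane, plus a Monte Carlo
# illustrating Cover's threshold: for i.i.d. Gaussian designs, separation is
# likely when n/p < 2 and rare when n/p > 2.
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Feasible iff exists beta with y_i * x_i' beta >= 1 for all i,
    # in which case the logistic MLE is unbounded.
    A_ub = -(y[:, None] * X)                   # y_i x_i' beta >= 1  <=>  -y_i x_i' beta <= -1
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * X.shape[1], method="highs")
    return res.status == 0                     # 0 = feasible point found

def sep_fraction(n, p, trials=50, seed=3):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
        hits += separable(X, y)
    return hits / trials

p = 10
print(sep_fraction(n=15, p=p))   # n/p = 1.5 < 2: usually separable
print(sep_fraction(n=40, p=p))   # n/p = 4 > 2: rarely separable
```

Note the explicit `bounds=(None, None)`: scipy's `linprog` constrains variables to be nonnegative by default, which would break the test.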

  25. Main result: asymptotic distribution of LRT.
  Theorem 2 (Sur, Chen, Candès ’2017): Suppose n/p > 2. Under i.i.d. Gaussian design and the global null, 2 LLR_j → α(p/n) χ²_1 in distribution (a rescaled χ²)

  26. Main result: asymptotic distribution of LRT.
  Theorem 2 (Sur, Chen, Candès ’2017): Suppose n/p > 2. Under i.i.d. Gaussian design and the global null, 2 LLR_j → α(p/n) χ²_1 in distribution (a rescaled χ²)
  • α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns (τ, b):
      τ² = (n/p) E[(Ψ(τZ; b))²]
      p/n = E[Ψ′(τZ; b)]
  where Z ∼ N(0, 1), Ψ is some operator, and α(p/n) = τ²/b
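The slide leaves Ψ unspecified. The sketch below fills it in with the form used in the Sur, Chen and Candès paper, Ψ(z; b) = b ρ′(prox_{bρ}(z)) with ρ(t) = log(1 + e^t); treat that choice, and the whole numerical scheme, as our assumption rather than something stated on the slide. Expectations over Z are computed by Gauss-Hermite quadrature and the prox by an inner Newton solve.

```python
# Sketch (assumed form of Psi): solve the two-equation system for (tau, b)
# and return alpha(kappa) = tau^2 / b, where kappa = p/n.
import numpy as np
from scipy.optimize import fsolve

rho1 = lambda x: 1.0 / (1.0 + np.exp(-x))       # rho'(x), the logistic sigmoid
rho2 = lambda x: rho1(x) * (1.0 - rho1(x))      # rho''(x)

def prox(z, b, iters=60):
    # Newton's method for x + b*rho'(x) = z, i.e. x = prox_{b rho}(z).
    x = np.copy(z)
    for _ in range(iters):
        x -= (x + b * rho1(x) - z) / (1.0 + b * rho2(x))
    return x

# Gauss-Hermite rule rewritten for E[g(Z)] with Z ~ N(0, 1)
nodes, weights = np.polynomial.hermite.hermgauss(100)
Z, W = np.sqrt(2.0) * nodes, weights / np.sqrt(np.pi)

def equations(v, kappa):
    tau, b = v
    x = prox(tau * Z, b)
    psi = b * rho1(x)                           # Psi(tau Z; b)
    dpsi = b * rho2(x) / (1.0 + b * rho2(x))    # Psi'(tau Z; b)
    return [tau**2 - (1.0 / kappa) * np.sum(W * psi**2),   # tau^2 = (n/p) E[Psi^2]
            kappa - np.sum(W * dpsi)]                      # p/n = E[Psi']

def alpha(kappa):
    # Initial guess from the small-kappa limit, where alpha -> 1.
    tau, b = fsolve(equations, x0=[2.0 * np.sqrt(kappa), 4.0 * kappa], args=(kappa,))
    return tau**2 / b

print(alpha(0.1))
print(alpha(0.2))
```

A useful sanity check on this reconstruction: as κ → 0 the system forces b → 0 and one recovers α → ρ′(0)²/ρ″(0) = 1, matching the slide's claim that α(0) = 1.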

  27. Main result: asymptotic distribution of LRT.
  Theorem 2 (Sur, Chen, Candès ’2017): Suppose n/p > 2. Under i.i.d. Gaussian design and the global null, 2 LLR_j → α(p/n) χ²_1 in distribution (a rescaled χ²)
  • α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns
  ◦ α(·) depends only on the aspect ratio p/n
  ◦ It is not a finite-sample effect
  ◦ α(0) = 1: matches classical theory

  28. Our adjusted LRT theory in practice. [Left: rescaling constant α(κ) for the logistic model, increasing from 1.00 at κ = p/n = 0 to about 2.00 near κ = 0.40. Right: histogram of empirical p-values ≈ Unif(0, 1).] Empirically, LRT ≈ rescaled χ²_1 (as predicted)
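Once α(p/n) is available, applying the correction is a one-line rescaling of the LLR before the χ²_1 tail lookup. A minimal sketch (the α value below is illustrative, not computed):

```python
# Sketch: adjusted p-value = P(chi2_1 >= 2*LLR/alpha); with alpha > 1 this
# is larger than the classical Wilks p-value, which is anti-conservative
# in high dimensions.
from scipy.stats import chi2

def adjusted_pvalue(llr, alpha):
    return chi2.sf(2.0 * llr / alpha, df=1)

llr = 3.0                                  # illustrative LLR value
print(adjusted_pvalue(llr, alpha=1.0))     # classical Wilks p-value
print(adjusted_pvalue(llr, alpha=1.5))     # rescaled p-value (larger)
```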
