Greedy selection on the Lasso solution grid

1. Greedy selection on the Lasso solution grid
Piotr Pokarowski, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 1 Dec 2016

2. Penalized Loss Minimization Framework
Data $= \{(y_1, x_{1\cdot}^T), \ldots, (y_n, x_{n\cdot}^T)\} = \mathrm{Train} \oplus \mathrm{Valid} \oplus \mathrm{Test}$
Fitting: $\hat\beta(\lambda) = \arg\min_\beta \{\mathrm{loss}(\beta, \mathrm{Train}) + \mathrm{penalty}(\beta, \lambda)\}$
Selection: $\hat\lambda = \arg\min_\lambda \mathrm{err}(\hat\beta(\lambda), \mathrm{Valid})$
Assessment: $\widehat{\mathrm{err}} = \mathrm{err}(\hat\beta(\hat\lambda), \mathrm{Test})$
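A minimal sketch of this fit/select/assess split, using scikit-learn's Lasso as the penalized fitter and squared error as err (both are my choices for illustration; the framework is agnostic to the solver and error measure, and the scaling between $\lambda$ and scikit-learn's per-observation alpha is glossed over):

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_select_assess(X, y, lambdas, rng=np.random.default_rng(0)):
    # Split Data = Train ⊕ Valid ⊕ Test
    n = len(y)
    idx = rng.permutation(n)
    tr, va, te = np.split(idx, [n // 2, 3 * n // 4])

    # Fitting: beta_hat(lambda) on Train, for each lambda on the grid
    models = {lam: Lasso(alpha=lam).fit(X[tr], y[tr]) for lam in lambdas}

    # Selection: lambda_hat minimizes the validation error
    lam_hat = min(lambdas,
                  key=lambda l: np.mean((models[l].predict(X[va]) - y[va]) ** 2))

    # Assessment: error of beta_hat(lambda_hat) on Test
    err_hat = np.mean((models[lam_hat].predict(X[te]) - y[te]) ** 2)
    return lam_hat, err_hat
```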

3. Loss and Penalty
Loss is a relaxation of the prediction error and a tempered (partial, scaled, etc.) negative log-likelihood:
$\mathrm{loss}(\beta, \mathrm{Train}) = \sum_{i=1}^n L\big(y_i, f(x_{i\cdot}, \beta)\big)$
Penalty on a model $\beta = (\beta_1, \ldots, \beta_p)^T$:
$\mathrm{penalty}(\beta, \lambda) = \sum_{j=1}^p P_\lambda(|\beta_j|)$,
where the penalty functions considered here range between the $\ell_0$ and $\ell_2$ extremes: $\lambda\, 1(t > 0) \preceq P_\lambda(t) \preceq \lambda t^2$.

4. Loss Functions ⊃ linear, logistic models
For $i = 1, \ldots, n$ we have $x_{i\cdot} \in \mathbb{R}^p$, $X = [x_{1\cdot}, \ldots, x_{n\cdot}]^T = [x_{\cdot 1}, \ldots, x_{\cdot p}]$ and $y = (y_1, \ldots, y_n)^T$.
For simplicity of presentation $y^T 1_n = 0$ and the columns are standardized so that $x_{\cdot j}^T 1_n = 0$ and $x_{\cdot j}^T x_{\cdot j} = 1$ for $j = 1, \ldots, p$.
We consider a generalized linear model with a canonical link function, $g(\mathbb{E} y_i) = x_{i\cdot}^T \beta^*$. Let $\varepsilon_i = (y_i - \mathbb{E} y_i)/\mathrm{sd}(y_i)$.
We assume that $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T \in \mathbb{R}^n$ is a vector of iid zero-mean errors having a subgaussian distribution with a constant $\sigma$, that is, $\mathbb{E}\exp(u\varepsilon_i) \le \exp(\sigma^2 u^2/2)$ for $u \in \mathbb{R}$.
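For concreteness, the two canonical instances of the loss covered by this setup (standard GLM negative log-likelihoods with canonical links; stated here for reference, not spelled out on the slide):
$$
\ell(\beta) = \tfrac{1}{2}\,\|y - X\beta\|_2^2
\quad\text{(linear model, identity link)},
\qquad
\ell(\beta) = \sum_{i=1}^n \Big( \log\big(1 + e^{x_{i\cdot}^T\beta}\big) - y_i\, x_{i\cdot}^T\beta \Big)
\quad\text{(logistic model, } y_i \in \{0,1\}\text{)}.
$$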

5. Penalty Functions - Classics
A. Hoerl and R. Kennard, Technometrics 1970: Ridge Regression (RR) ≡ $\ell_2$-penalty, $P_\lambda(t) = \lambda t^2$
R. Nishii, Ann. Stat. 1984: Generalized Information Criterion (GIC) ≡ $\ell_0$-penalty, $P_\lambda(t) = \lambda\, 1(t > 0)$
R. Tibshirani, JRSS-B 1996: Lasso ≡ $\ell_1$-penalty, $P_\lambda(t) = \lambda t$
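With the $\ell_0$ penalty the criterion becomes $\ell(\beta) + \lambda\,|\mathrm{supp}\,\beta|$; when $\ell$ is the negative log-likelihood this reproduces the classical information criteria (a standard identification, added here for context, not from the slide):
$$
\lambda = 1 \;\Rightarrow\; \text{AIC}/2,
\qquad
\lambda = \tfrac{1}{2}\log n \;\Rightarrow\; \text{BIC}/2 .
$$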

6. Penalty Functions - New Propositions
H. Zou and T. Hastie, JRSS-B 2005 (1750 cit.): Elastic Net (EN)
$P_{\lambda_1,\lambda_2}(t) = \lambda_1 t + \tfrac{\lambda_2}{2} t^2$, equivalently $P_{\lambda,\alpha}(t) = \lambda\big(\alpha t + \tfrac{1-\alpha}{2} t^2\big)$
C.-H. Zhang, Ann. Stat. 2010 (270 cit.): Minimax Concave Penalty (MCP)
$P_{\lambda,\gamma}(t) = \lambda\, (t \wedge \gamma\lambda)\Big(1 - \dfrac{t \wedge \gamma\lambda}{2\gamma\lambda}\Big)$
GIC ⪯ MCP ⪯ Lasso ⪯ EN ⪯ RR
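A small sketch implementing the penalty functions exactly as defined above (plain NumPy; function and parameter names are mine):

```python
import numpy as np

def ridge(t, lam):                 # l2-penalty: lambda * t^2
    return lam * t**2

def gic_l0(t, lam):                # l0-penalty: lambda * 1(t > 0)
    return np.where(t > 0, lam, 0.0)

def lasso(t, lam):                 # l1-penalty: lambda * t
    return lam * t

def elastic_net(t, lam, alpha):    # lambda * (alpha*t + (1 - alpha)/2 * t^2)
    return lam * (alpha * t + 0.5 * (1 - alpha) * t**2)

def mcp(t, lam, gamma):            # lambda * (t ∧ gamma*lambda) * (1 - (t ∧ gamma*lambda)/(2*gamma*lambda))
    u = np.minimum(t, gamma * lam)
    return lam * u * (1 - u / (2 * gamma * lam))
```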

7. Elastic Net Penalty
[Figure: EN penalty $P_{\lambda,\alpha}(t)$ and the corresponding thresholding functions, plotted for $\alpha$ = 0.1, 0.5, 0.9.]

8. Minimax Concave Penalty
[Figure: MCP $P_{\lambda,\gamma}(t)$ and the corresponding thresholding functions, plotted for $\gamma$ = 25, 2.5, 1.1.]

9. Algorithm 1: GIC-thresholded Lasso (SS)
Input: $y$, $X$ and $\lambda$.
Screening (Lasso):
  $\hat\beta = \arg\min_\beta \{\ell(\beta) + \lambda|\beta|_1\}$;
  order the nonzero coefficients $|\hat\beta_{j_1}| \ge \ldots \ge |\hat\beta_{j_s}|$, where $s = |\mathrm{supp}\,\hat\beta|$;
  set $\mathcal{J} = \big\{\{j_1\}, \{j_1, j_2\}, \ldots, \mathrm{supp}\,\hat\beta\big\}$.
Selection (GIC):
  $\hat T = \arg\min_{J \in \mathcal{J}} \big\{\ell(\hat\beta^{ML}_J) + \lambda^2 |J|\big\}$.
Output: $\hat T$, $\hat\beta^{SS} = \hat\beta^{ML}_{\hat T}$.
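A minimal sketch of Algorithm 1 for the linear-model case, taking $\ell(\beta) = \frac12\|y - X\beta\|^2$ and scikit-learn's Lasso for the screening step (helper names are mine; the scaling between $\lambda$ and scikit-learn's per-observation alpha is glossed over, and a nonempty Lasso support is assumed):

```python
import numpy as np
from sklearn.linear_model import Lasso

def ls_loss(y, X, J):
    """OLS refit on columns J (the ML fit in the Gaussian case); returns 0.5 * RSS and the coefficients."""
    beta_J, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
    rss = np.sum((y - X[:, J] @ beta_J) ** 2)
    return 0.5 * rss, beta_J

def ss(y, X, lam):
    # Screening: Lasso at penalty level lam, nonzero coefficients ordered by magnitude
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    support = np.flatnonzero(beta_hat)
    order = support[np.argsort(-np.abs(beta_hat[support]))]
    # Nested family J = {{j1}, {j1, j2}, ..., supp(beta_hat)}
    family = [order[:k] for k in range(1, len(order) + 1)]
    # Selection: GIC = ell(beta_ML_J) + lam^2 * |J|
    gic = [ls_loss(y, X, J)[0] + lam**2 * len(J) for J in family]
    T_hat = family[int(np.argmin(gic))]
    return T_hat, ls_loss(y, X, T_hat)[1]
```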

10. Algorithm 2: Greedy Selection on the Lasso Solution Grid (SOSnet)
Input: $y$, $X$ and $(o,\ \lambda \le \lambda_1 < \ldots < \lambda_m)$.
Screening (Lasso): for $k = 1$ to $m$ do
  $\hat\beta^{(k)} = \arg\min_\beta \{\ell(\beta) + \lambda_k|\beta|_1\}$;
  order the nonzero coefficients $|\hat\beta^{(k)}_{j_1}| \ge \ldots \ge |\hat\beta^{(k)}_{j_{s_k}}|$, $s_k = |\mathrm{supp}\,\hat\beta^{(k)}|$;
  Ordering (squared Wald tests): for $l = 1$ to $o$ do
    set $J = \{j_1, j_2, \ldots, j_{s_{kl}}\}$, $s_{kl} = \lfloor s_k \cdot l/o \rfloor$;
    compute $\hat\beta^{ML}_J$;
    sort the predictors in $J$ according to squared Wald statistics: $w^2_{i_1} \ge w^2_{i_2} \ge \ldots \ge w^2_{i_{s_{kl}}}$;
    set $\mathcal{J}_{kl} = \big\{\{i_1\}, \{i_1, i_2\}, \ldots, \{i_1, i_2, \ldots, i_{s_{kl}}\}\big\}$;
  end for;
end for;
Selection (Generalized Information Criterion, GIC):
  $\mathcal{J} = \bigcup_{k=1}^m \bigcup_{l=1}^o \mathcal{J}_{kl}$,
  $\hat T = \arg\min_{J \in \mathcal{J}} \big\{\ell(\hat\beta^{ML}_J) + \lambda^2 |J|\big\}$.
Output: $\hat T$, $\hat\beta^{SOSnet} = \hat\beta^{ML}_{\hat T}$.
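A sketch of SOSnet for the linear-model case (my simplification: y centered and X standardized as on slide 4, scikit-learn's lasso_path for screening, the GIC penalty level taken as the smallest grid value since the talk only requires $\lambda \le \lambda_1$, and constant scaling factors glossed over):

```python
import numpy as np
from sklearn.linear_model import lasso_path

def sosnet(y, X, lambdas, o=5):
    lam = min(lambdas)                                  # GIC penalty level, lambda <= lambda_1
    family = []
    # Screening: Lasso solutions along the grid lambda_1 < ... < lambda_m
    _, coefs, _ = lasso_path(X, y, alphas=lambdas)
    for beta_k in coefs.T:                              # one coefficient vector per grid point
        support = np.flatnonzero(beta_k)
        order = support[np.argsort(-np.abs(beta_k[support]))]
        s_k = len(order)
        # Ordering: OLS refit on the first floor(s_k * l / o) screened predictors,
        # then sort them by squared Wald (t) statistics and build a nested family
        for l in range(1, o + 1):
            J = order[: (s_k * l) // o]
            if len(J) == 0:
                continue
            XJ = X[:, J]
            beta_J, *_ = np.linalg.lstsq(XJ, y, rcond=None)
            se2 = np.diag(np.linalg.pinv(XJ.T @ XJ)) + 1e-12   # variances up to sigma^2 (cancels in the ordering)
            w2 = beta_J**2 / se2
            ranked = J[np.argsort(-w2)]
            family += [tuple(ranked[:r]) for r in range(1, len(ranked) + 1)]
    # Selection: GIC over the union of all nested families
    def gic(J):
        b, *_ = np.linalg.lstsq(X[:, list(J)], y, rcond=None)
        return 0.5 * np.sum((y - X[:, list(J)] @ b) ** 2) + lam**2 * len(J)
    T_hat = min(set(family), key=gic)
    beta_hat, *_ = np.linalg.lstsq(X[:, list(T_hat)], y, rcond=None)
    return list(T_hat), beta_hat
```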

11. When does thresholding separate the true model?
[Figure: true coefficients $\beta$ and Lasso estimates $\hat\beta$ plotted against the coefficient indices 1-8.]

12. Lasso separation error (1)
The true model is $T = \mathrm{supp}(\beta^*) = \{j \in F : \beta^*_j \ne 0\}$, with $\beta^*_{\min} = \min_{j \in T} |\beta^*_j|$ and $t = |T|$.
The Bregman divergence: $D(\beta, \beta^*) = \ell(\beta) - \ell(\beta^*) - \dot\ell(\beta^*)^T(\beta - \beta^*)$.
The symmetrized Bregman divergence: $\Delta(\beta, \beta^*) = D(\beta, \beta^*) + D(\beta^*, \beta) = (\beta - \beta^*)^T\big(\dot\ell(\beta) - \dot\ell(\beta^*)\big)$.
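As a sanity check (the linear-model specialization, not spelled out on the slide): with $\ell(\beta) = \frac12\|y - X\beta\|^2$ we have $\dot\ell(\beta) = X^T(X\beta - y)$, hence
$$
D(\beta, \beta^*) = \tfrac{1}{2}\,\|X(\beta - \beta^*)\|^2,
\qquad
\Delta(\beta, \beta^*) = (\beta - \beta^*)^T X^T X (\beta - \beta^*) = \|X(\beta - \beta^*)\|^2 .
$$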

13. Lasso separation error (2)
For $a \in (0, 1)$ consider the cone
$\mathcal{C}_{T,a} = \Big\{\nu \in \mathbb{R}^p : |\nu_{\bar T}|_1 \le \dfrac{1+a}{1-a}\,|\nu_T|_1\Big\}$.  (1)
A general invertibility factor, defined in J. Huang and C.-H. Zhang, JMLR 2012:
$\zeta_a = \inf_{\nu \in \mathcal{C}_{T,a}} \dfrac{\Delta(\beta^* + \nu, \beta^*)}{|\nu_T|_1\,|\nu|_\infty}$.  (2)
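In the linear model the worked example above gives $\Delta(\beta^* + \nu, \beta^*) = \|X\nu\|^2$, so (a restatement for intuition, my gloss)
$$
\zeta_a = \inf_{\nu \in \mathcal{C}_{T,a}} \frac{\|X\nu\|^2}{|\nu_T|_1\,|\nu|_\infty},
$$
a cone-restricted quantity analogous to the compatibility and restricted-eigenvalue constants used in other Lasso analyses.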

14. Lasso separation error (3)
On the event $\mathcal{A}_a = \{|\dot\ell(\beta^*)|_\infty \le a\lambda\}$ we have the so-called oracle inequality
$|\hat\beta - \beta^*|_\infty \le (1+a)\,\lambda\,\zeta_a^{-1} < \beta^*_{\min}/2$.
It is easy to check that $\mathcal{A}_a \subseteq \{T \in \mathcal{J}\}$.
Hence, for $\lambda < (1+a)^{-1}\,\zeta_a\,\beta^*_{\min}/2$ we have
$$
P(T \notin \mathcal{J}) \le 2p\,\exp\!\Big(\frac{-a^2\lambda^2}{2\sigma^2}\Big).
$$
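The separation step behind $\mathcal{A}_a \subseteq \{T \in \mathcal{J}\}$, spelled out:
$$
|\hat\beta_j - \beta^*_j| < \beta^*_{\min}/2 \ \text{ for all } j
\;\Longrightarrow\;
|\hat\beta_j| > \beta^*_{\min}/2 \ (j \in T)
\quad\text{and}\quad
|\hat\beta_j| < \beta^*_{\min}/2 \ (j \notin T),
$$
so ordering the Lasso coefficients by absolute value places all of $T$ before its complement, and $T$ therefore appears in the nested family $\mathcal{J}$ of Algorithm 1.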

15. GIC error (1)
Let $W^* = \mathrm{diag}(\mathrm{sd}(y_1), \ldots, \mathrm{sd}(y_n))$ and $X^* = W^{*\,1/2} X$.
Let $X^*_J$ be the submatrix of $X^*$ with columns having indices in $J$, and let $H^*_J$ be the orthogonal projection onto the columns of $X^*_J$.
Scaled Kullback-Leibler distances between $T$ and its submodels, as defined in X.-T. Shen et al., JASA 2012:
$\delta_k = \min_{J \subset T,\, |T \setminus J| = k} \|(I - H^*_J)\, X^* \beta^*\|^2$,
$c_k = \min_i\ \min_{\beta_T : \|X^*_T \beta_T - X^* \beta^*\| \le \delta_k} \ddot\ell(x_{iT}^T \beta_T)\, /\, \ddot\ell(x_{i\cdot}^T \beta^*)$,
$\tilde\delta = \min_k c_k^2\, \delta_k / k$.
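For intuition (linear-model specialization, my gloss, not from the slide): with squared-error loss $\ddot\ell \equiv 1$, so $c_k = 1$ and
$$
\tilde\delta = \min_k \frac{\delta_k}{k}
= \min_k \frac{1}{k} \min_{J \subset T,\ |T \setminus J| = k} \|(I - H^*_J)\, X^* \beta^*\|^2 ,
$$
the smallest per-variable loss of signal incurred by dropping true predictors; the GIC bounds on the next slide require $\lambda^2$ to stay below this quantity (up to constants).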

16. GIC error (2)
If $t\sigma^2 < \lambda^2 < \tilde\delta/\big(2(1+a)^2\big)$, then
$$
P\big(T \in \mathcal{J},\ \hat T \subsetneq T\big) \le \exp\!\Big(\frac{-a^2\lambda^2}{2\sigma^2}\Big).
$$
If $\sigma^2 a^{-2} \min\big(t\, c_t^{-1},\ \log(3p)\big) < \lambda^2$, then
$$
P\big(T \in \mathcal{J},\ \hat T \supsetneq T\big) \le 3p\,\exp\!\Big(\frac{-a^2\lambda^2}{4\sigma^2}\Big).
$$
