Gaussian model selection with an unknown variance
Yannick Baraud
Laboratoire J.A. Dieudonné, Université de Nice Sophia Antipolis
baraud@unice.fr
Joint work with C. Giraud and S. Huet
The statistical framework

We observe
\[ Y \sim \mathcal{N}(\mu, \sigma^2 I_n) \]
where both parameters $\mu \in \mathbb{R}^n$ and $\sigma > 0$ are unknown.

Our aim: estimate $\mu$ from the observation of $Y$.
Example: variable selection

\[ Y \sim \mathcal{N}(\mu, \sigma^2 I_n) \quad \text{with} \quad \mu = \sum_{j=1}^{p} \theta_j X_j, \]
with $p$ possibly larger than $n$, but we expect that $|\{j,\ \theta_j \neq 0\}| \ll n$.

Our aim: estimate $\mu$ and $\{j,\ \theta_j \neq 0\}$.
The estimation strategy: model selection

We start with a collection $\{S_m,\ m \in \mathcal{M}\}$ of linear subspaces (models) of $\mathbb{R}^n$, and to each model we associate the projection estimator
\[ S_m \longrightarrow \hat{\mu}_m = \Pi_{S_m} Y. \]
Our aim: select $\hat{m} = \hat{m}(Y)$ among $\mathcal{M}$ in such a way that
\[ \mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big] \quad \text{is close to} \quad \inf_{m \in \mathcal{M}} \mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]. \]
Variable selection (continued)

\[ Y \sim \mathcal{N}(\mu, \sigma^2 I_n) \quad \text{with} \quad \mu = \sum_{j=1}^{p} \theta_j X_j. \]
For $m \subset \{1, \ldots, p\}$ such that $|m| \le D_{\max} < n$, we set
\[ S_m = \mathrm{Span}\{X_j,\ j \in m\}. \]
Ordered variable selection: take $\mathcal{M}_o = \{\{1, \ldots, D\},\ D \le D_{\max}\} \cup \{\emptyset\}$.
(Almost) complete variable selection: take $\mathcal{M}_c = \{m \subset \{1, \ldots, p\},\ |m| \le D_{\max}\}$.
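To make the two collections concrete, here is a minimal sketch (ours, not from the talk) that enumerates the supports in $\mathcal{M}_o$ and $\mathcal{M}_c$ for illustrative values of $p$ and $D_{\max}$; the variable names are our own.

```python
# Minimal sketch: enumerating the ordered and (almost) complete collections
# of candidate supports. p and D_max are illustrative values.
from itertools import combinations

p, D_max = 10, 3

# Ordered variable selection: the empty set and the nested supports {1, ..., D}
M_o = [tuple()] + [tuple(range(1, D + 1)) for D in range(1, D_max + 1)]

# (Almost) complete variable selection: every support of size at most D_max
M_c = [m for D in range(D_max + 1) for m in combinations(range(1, p + 1), D)]

print(len(M_o))   # D_max + 1 models
print(len(M_c))   # sum_{D <= D_max} binom(p, D) models
```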
Some selection criteria

\[ \hat{m} = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \big[ |Y - \hat{\mu}_m|^2 + \mathrm{pen}(m) \big] \]
- Mallows' $C_p$ (1973): $\mathrm{pen}(m) = 2 D_m \sigma^2$, where $D_m = \dim(S_m)$.
- Birgé & Massart (2001): $\mathrm{pen}(m) = \mathrm{pen}(m, \sigma^2)$.
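As an illustration of how such a criterion is minimized in practice, here is a hedged sketch of Mallows' $C_p$ over a list of supports; it assumes $\sigma^2$ is known, and the function name and interface are ours, not from the talk.

```python
# Minimal sketch of Mallows' C_p: minimize |Y - mu_hat_m|^2 + 2 D_m sigma^2
# over a list of supports (e.g. the M_o or M_c built above). Assumes sigma2 known.
import numpy as np

def cp_select(Y, X, models, sigma2):
    best_m, best_crit = None, np.inf
    for m in models:
        if len(m) == 0:
            rss = np.sum(Y ** 2)                      # residual of the null model
        else:
            Xm = X[:, [j - 1 for j in m]]             # variables are indexed from 1 in the slides
            coef, *_ = np.linalg.lstsq(Xm, Y, rcond=None)
            rss = np.sum((Y - Xm @ coef) ** 2)        # |Y - Pi_{S_m} Y|^2
        crit = rss + 2 * len(m) * sigma2              # Mallows' C_p penalty
        if crit < best_crit:
            best_m, best_crit = m, crit
    return best_m
```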
Advantages:
- Non-asymptotic theory.
- Variable selection: no assumption on the predictors $X_j$.
- Bayesian flavor: allows (to some extent) taking knowledge/intuition into account.

Drawbacks:
- The computation of $\hat{m}$ may not be feasible if $\mathcal{M}$ is too large.
For the problem of variable selection:

Tibshirani (1996), Lasso:
\[ \hat{\theta}_\lambda = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^p} \Big| Y - \sum_{j=1}^{p} \theta_j X_j \Big|^2 + \lambda |\theta|_1. \]
Candès & Tao (2007), Dantzig selector:
\[ \hat{\theta}_\lambda = \mathop{\mathrm{argmin}} \Big\{ |\theta|_1 :\ \max_{j=1,\ldots,p} \Big| \big\langle X_j,\ Y - \sum_{j'=1}^{p} \theta_{j'} X_{j'} \big\rangle \Big| \le \lambda \Big\}. \]
\[ \hat{\theta}_\lambda \ \longrightarrow\ \hat{m}_\lambda = \{j,\ \hat{\theta}_{\lambda,j} \neq 0\} \quad \text{and} \quad \hat{\mu}_{\hat{m}_\lambda} = \sum_{j \in \hat{m}_\lambda} \hat{\theta}_{\lambda,j} X_j. \]
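For concreteness, a small sketch of the Lasso route using scikit-learn (our choice of tool, not the speakers'). Note that sklearn's `Lasso` minimizes $\frac{1}{2n}|Y - X\theta|^2 + \alpha|\theta|_1$, so its `alpha` corresponds to $\lambda$ only up to a factor $2n$; the data and the value of `alpha` below are purely illustrative.

```python
# Minimal sketch: Lasso fit, estimated support m_hat_lambda and fitted mean.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 100
X = rng.standard_normal((n, p))
theta = np.zeros(p); theta[:3] = [3.0, 1.5, 2.0]
Y = X @ theta + rng.standard_normal(n)

lasso = Lasso(alpha=0.3, fit_intercept=False).fit(X, Y)
m_hat = np.flatnonzero(lasso.coef_)                 # m_hat_lambda = {j : theta_hat_{lambda,j} != 0}
mu_hat = X[:, m_hat] @ lasso.coef_[m_hat]           # mu_hat_{m_hat_lambda}
print(m_hat)
```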
Advantages:
- The computation is feasible even if $p$ is very large.
- Non-asymptotic theory.

Drawbacks:
- These procedures work only under suitable assumptions on the predictors $X_j$.
- There is no way to check these assumptions if $p$ is very large.
- Blind to knowledge/intuition.
For all these procedures, there remains the problem of estimating $\sigma^2$ or choosing $\lambda$:
- These parameters depend on the data distribution and must be estimated.
- In general, there is no natural estimator of $\sigma^2$ (e.g. complete variable selection with $p > n$).
- Cross-validation...
- The performance of the procedures crucially depends upon these parameters.
Other selection criteria

\[ \mathrm{Crit}(m) = |Y - \hat{\mu}_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right) \]
or
\[ \mathrm{Crit}'(m) = \log\big( |Y - \hat{\mu}_m|^2 \big) + \frac{\mathrm{pen}'(m)}{n}. \]
Both criteria are the same if one takes
\[ \mathrm{pen}'(m) = n \log\left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right) \approx \mathrm{pen}(m). \]
\[ \mathrm{Crit}(m) = |Y - \hat{\mu}_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right) \quad \text{or} \quad \mathrm{Crit}'(m) = \log\big( |Y - \hat{\mu}_m|^2 \big) + \frac{\mathrm{pen}'(m)}{n}. \]
- Akaike (1969), FPE: $\mathrm{pen}(m) = 2 D_m$
- Akaike (1973), AIC: $\mathrm{pen}'(m) = 2 D_m$
- Schwarz/Akaike (1978), BIC/SIC: $\mathrm{pen}'(m) = D_m \log(n)$
- Saito (1994), AMDL: $\mathrm{pen}'(m) = 3 D_m \log(n)$
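Written as functions of the residual sum of squares, these criteria take one line each; the sketch below (ours, not the speakers') just spells them out.

```python
# Minimal sketch: the two generic criteria and the classical penalties,
# as functions of rss = |Y - mu_hat_m|^2, the dimension D_m and the sample size n.
import numpy as np

def crit(rss, D_m, n, pen_m):
    return rss * (1.0 + pen_m / (n - D_m))          # Crit(m)

def crit_log(rss, D_m, n, pen_prime_m):
    return np.log(rss) + pen_prime_m / n            # Crit'(m)

pen_fpe  = lambda D_m, n: 2 * D_m                   # pen(m),  Akaike 1969
pen_aic  = lambda D_m, n: 2 * D_m                   # pen'(m), Akaike 1973
pen_bic  = lambda D_m, n: D_m * np.log(n)           # pen'(m), Schwarz 1978
pen_amdl = lambda D_m, n: 3 * D_m * np.log(n)       # pen'(m), Saito 1994
```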
Two questions

1. What can be said about these selection criteria from a non-asymptotic point of view?
2. Is it possible to propose other penalties that would take into account the complexity of the collection $\{S_m,\ m \in \mathcal{M}\}$?
What do we mean by complexity?

We shall say that the collection $\{S_m,\ m \in \mathcal{M}\}$ is $a$-complex (with $a \ge 0$) if
\[ |\{m \in \mathcal{M},\ D_m = D\}| \le e^{aD} \quad \forall D \ge 1. \]
For the collection $\{S_m,\ m \in \mathcal{M}_o\}$:
\[ |\{m \in \mathcal{M},\ D_m = D\}| \le 1 \ \Longrightarrow\ a = 0. \]
For the collection $\{S_m,\ m \in \mathcal{M}_c\}$:
\[ |\{m \in \mathcal{M},\ D_m = D\}| \le \binom{p}{D} \le p^D \ \Longrightarrow\ a = \log(p). \]
Penalty choice with regard to complexity

Let $\phi(x) = (x - 1 - \log(x))/2$ for $x \ge 1$, and consider an $a$-complex collection $\{S_m,\ m \in \mathcal{M}\}$. If for some $K, K' > 1$
\[ K \le \frac{\mathrm{pen}(m)}{\phi^{-1}(a)\, D_m} \le K', \quad \forall m \in \mathcal{M}^*, \]
and we select
\[ \hat{m} = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} |Y - \hat{\mu}_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right), \]
then
\[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big]}{\sigma^2} \le C(K)\, K'\, \phi^{-1}(a) \left[ \inf_{m \in \mathcal{M}} \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} \vee 1 \right]. \]
Case of ordered variable selection

$a = 0$, $\phi^{-1}(a) = 1$. If for all $m \in \mathcal{M}$ such that $D_m \neq 0$
\[ 1 < K \le \frac{\mathrm{pen}(m)}{D_m} \le K', \]
one has
\[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big]}{\sigma^2} \le C(K)\, K' \left[ \inf_{m \in \mathcal{M}} \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} \vee 1 \right]. \]
$\longrightarrow$ FPE and AIC (for $n$ large enough).
Case of complete variable selection with $p = n$

$a = \log(n)$, $\phi^{-1}(a) \approx 2 \log(n)$. If for all $m \in \mathcal{M}$ such that $D_m \neq 0$
\[ 1 < K \le \frac{\mathrm{pen}(m)}{2 D_m \log(n)} \le K', \]
then
\[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big]}{\sigma^2} \le C(K)\, K' \log(n) \left[ \inf_{m \in \mathcal{M}} \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} \vee 1 \right]. \]
$\longrightarrow$ AMDL (but not AIC, FPE, BIC).
New penalties

Definition. Let $X_D \sim \chi^2(D)$ and $X_N \sim \chi^2(N)$ be two independent $\chi^2$ random variables. Define
\[ H_{D,N}(x) = \frac{1}{\mathbb{E}(X_D)} \times \mathbb{E}\left[ \left( X_D - x \frac{X_N}{N} \right)_+ \right], \quad x \ge 0. \]
Definition. To each $S_m$ with $D_m < n - 1$, we associate a weight $L_m \ge 0$ and the penalty
\[ \mathrm{pen}(m) = 1.1\, \frac{N_m}{N_m - 1}\, H^{-1}_{D_m + 1,\, N_m - 1}\big( e^{-L_m} \big), \quad \text{where } N_m = n - D_m. \]
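One possible way to evaluate this penalty numerically is sketched below: $H_{D,N}$ is approximated by Monte Carlo and inverted by bisection. The talk does not specify a numerical scheme, so the sample size, bracket and tolerance are our own illustrative choices.

```python
# Minimal sketch: Monte Carlo approximation of H_{D,N} and of the penalty
# pen(m) = 1.1 * N_m/(N_m - 1) * H^{-1}_{D_m+1, N_m-1}(exp(-L_m)),  N_m = n - D_m.
import math
import numpy as np

rng = np.random.default_rng(0)

def H_inv(D, N, y, n_mc=200_000, x_max=1e4, tol=1e-3):
    """Invert x -> H_{D,N}(x), which decreases from 1 (at x = 0) to 0, by bisection."""
    XD = rng.chisquare(D, n_mc)                     # X_D ~ chi^2(D)
    XN = rng.chisquare(N, n_mc)                     # X_N ~ chi^2(N), independent
    h = lambda x: np.mean(np.clip(XD - x * XN / N, 0.0, None)) / D   # E(X_D) = D
    lo, hi = 0.0, x_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > y else (lo, mid)
    return 0.5 * (lo + hi)

def pen(D_m, n, L_m):
    N_m = n - D_m
    return 1.1 * N_m / (N_m - 1) * H_inv(D_m + 1, N_m - 1, math.exp(-L_m))

# e.g. the penalty of a support of size 3 when n = 20, with the weight
# L_m = log C(20, 3) + 2 log(4) used below for complete variable selection
print(pen(3, 20, math.log(math.comb(20, 3)) + 2 * math.log(4)))
```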
Theorem. Let $\{S_m,\ m \in \mathcal{M}\}$ be a collection of models and $\{L_m,\ m \in \mathcal{M}\}$ a family of weights. Assume that $N_m \ge 7$ and $D_m \vee L_m \le n/2$ for all $m \in \mathcal{M}$. Define
\[ \hat{m} = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} |Y - \hat{\mu}_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right). \]
The estimator $\hat{\mu}_{\hat{m}}$ satisfies
\[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big]}{\sigma^2} \le C \times \left[ \inf_{m \in \mathcal{M}} \left\{ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} + L_m \right\} + \sum_{m \in \mathcal{M}} (D_m + 1)\, e^{-L_m} \right]. \]
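A hedged sketch of the resulting selection rule, reusing the `pen` function and the model collections sketched above; the function name and interface are ours.

```python
# Minimal sketch: minimize Crit(m) = |Y - mu_hat_m|^2 (1 + pen(m)/(n - D_m))
# over supports "models" with weights "weights" (given in the same order).
import numpy as np

def select(Y, X, models, weights):
    n = len(Y)
    best_m, best_crit = None, np.inf
    for m, L_m in zip(models, weights):
        D_m = len(m)
        if D_m == 0:
            rss = np.sum(Y ** 2)
        else:
            Xm = X[:, [j - 1 for j in m]]
            coef, *_ = np.linalg.lstsq(Xm, Y, rcond=None)
            rss = np.sum((Y - Xm @ coef) ** 2)
        crit = rss * (1.0 + pen(D_m, n, L_m) / (n - D_m))
        if crit < best_crit:
            best_m, best_crit = m, crit
    return best_m
```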
Ordered variable selection

For $m \in \mathcal{M}_o$, $m = \{1, \ldots, D\}$, take $L_m = |m|$.
\[ \longrightarrow\ \sum_{m \in \mathcal{M}} (D_m + 1)\, e^{-L_m} \le 2.51. \]
If $|m| \le D_{\max} \le [n/2] \wedge p$,
\[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big]}{\sigma^2} \lesssim \inf_{m \in \mathcal{M}} \left[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} \vee 1 \right]. \]
Complete variable selection

For $m \in \mathcal{M}_c$, take
\[ L_m = \log \binom{p}{|m|} + 2 \log(|m| + 1) \]
\[ \longrightarrow\ \sum_{m \in \mathcal{M}} (D_m + 1)\, e^{-L_m} \le \log(p). \]
If $|m| \le D_{\max} \le [n/(2 \log(p))] \wedge p$,
\[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big]}{\sigma^2} \lesssim \log(p) \inf_{m \in \mathcal{M}} \left[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} \vee 1 \right]. \]
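The weights for the complete collection are easy to compute; the short check below (ours) evaluates $\sum_m (D_m + 1) e^{-L_m}$ for illustrative values of $p$ and $D_{\max}$.

```python
# Minimal sketch: weights L_m for complete variable selection and a numeric
# check of sum_m (D_m + 1) exp(-L_m); p and D_max are illustrative.
import math

def L_complete(d, p):
    return math.log(math.comb(p, d)) + 2 * math.log(d + 1)   # L_m = log C(p,|m|) + 2 log(|m|+1)

p, D_max = 20, 8
total = sum(math.comb(p, D) * (D + 1) * math.exp(-L_complete(D, p))
            for D in range(D_max + 1))                       # equals sum_D 1/(D + 1)
print(total, math.log(p))
```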
Complete variable selection: order of magnitude of the penalty

[Figure: the penalty as a function of D, for n = 32 and n = 512, comparing the proposed penalty (K = 1.1) with AMDL.]
Comparison with Lasso/Adaptive Lasso

The "Adaptive Lasso", proposed by Zou (2006):
\[ \hat{\theta}_\lambda = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^p} \Big| Y - \sum_{j=1}^{p} \theta_j X_j \Big|^2 + \lambda \sum_{j=1}^{p} \frac{|\theta_j|}{|\tilde{\theta}_j|^\gamma}. \]
$\longrightarrow$ $\lambda, \gamma$ obtained by cross-validation.
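The adaptive Lasso can be reduced to a plain Lasso by rescaling the columns of $X$; the sketch below uses scikit-learn with an ordinary least squares pilot estimate $\tilde{\theta}$ (which requires $p \le n$), and `alpha`, `gamma` are illustrative values rather than the cross-validated ones used in the talk.

```python
# Minimal sketch: adaptive Lasso via column rescaling.
# Solve argmin |Y - X theta|^2 + lambda sum_j |theta_j| / |theta_tilde_j|^gamma
# by setting theta_j = w_j beta_j with w_j = |theta_tilde_j|^gamma.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, Y, alpha=0.1, gamma=1.0):
    theta_tilde = LinearRegression(fit_intercept=False).fit(X, Y).coef_   # OLS pilot (p <= n)
    w = np.abs(theta_tilde) ** gamma
    beta = Lasso(alpha=alpha, fit_intercept=False).fit(X * w, Y).coef_
    return beta * w                                                       # back to the theta scale
```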
Simulation 1

Consider the predictors $X_1, \ldots, X_8 \in \mathbb{R}^{20}$ such that, for all $i = 1, \ldots, 20$, the rows
\[ X_i^T = (X_{1,i}, \ldots, X_{8,i}) \ \text{are i.i.d.}\ \mathcal{N}(0, \Gamma) \quad \text{with} \quad \Gamma_{j,k} = 0.5^{|j-k|}, \]
and
\[ \mu = 3 X_1 + 1.5 X_2 + 2 X_5. \]
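A short sketch (ours) of this design, e.g. to reproduce the experiment; the random seed and the noise level are illustrative.

```python
# Minimal sketch: design of Simulation 1.
import numpy as np

rng = np.random.default_rng(0)
n, q = 20, 8
Gamma = 0.5 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))  # Gamma_{j,k} = 0.5^{|j-k|}
X = rng.multivariate_normal(np.zeros(q), Gamma, size=n)               # rows X_i^T i.i.d. N(0, Gamma)
mu = 3 * X[:, 0] + 1.5 * X[:, 1] + 2 * X[:, 4]                        # mu = 3 X_1 + 1.5 X_2 + 2 X_5
sigma = 1.0                                                           # the talk reports sigma = 1 and 3
Y = mu + sigma * rng.standard_normal(n)
```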
σ = 1
                   r      E(|m̂|)   %{m̂ = m₀}   %{m̂ ⊇ m₀}
Our procedure    1.57      3.34       72%         97.8%
Lasso            2.09      5.21       10.8%       100%
A. Lasso         1.99      4.56       16.8%       99%

σ = 3
                   r      E(|m̂|)   %{m̂ = m₀}   %{m̂ ⊇ m₀}
Our procedure    3.08      2.01       10.3%       15.7%
Lasso            2.06      4.56       10.5%       100%
A. Lasso         2.44      3.81       13.2%       52%
Simulation 2

Let $X_1, X_2, X_3$ be three vectors of $\mathbb{R}^n$ defined by
\[ X_1 = (1, -1, 0, \ldots, 0)/\sqrt{2}, \]
\[ X_2 = (-1, 1.001, 0, \ldots, 0)/\sqrt{1 + 1.001^2}, \]
\[ X_3 = (1/\sqrt{2}, 1/\sqrt{2}, 1/n, \ldots, 1/n)/\sqrt{1 + (n-2)/n^2}, \]
and $X_j = e_j$ for all $j = 4, \ldots, n$. We take $p = n = 20$, $D_{\max} = 8$ and
\[ \mu = (n, n, 0, \ldots, 0) \in \mathrm{Span}\{X_1, X_2\}. \]
$\longrightarrow$ $\mu$ is almost orthogonal to $X_1, X_2$ and highly correlated with $X_3$.
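A sketch (ours) of this second design; the inner products printed at the end illustrate the point of the example: $\mu$ is (nearly) orthogonal to $X_1, X_2$ but strongly aligned with $X_3$.

```python
# Minimal sketch: design of Simulation 2 with p = n = 20.
import numpy as np

n = 20
X = np.eye(n)                                                   # X_j = e_j for j = 4, ..., n
X[:, 0] = np.r_[1, -1, np.zeros(n - 2)] / np.sqrt(2)
X[:, 1] = np.r_[-1, 1.001, np.zeros(n - 2)] / np.sqrt(1 + 1.001 ** 2)
X[:, 2] = np.r_[1 / np.sqrt(2), 1 / np.sqrt(2), np.full(n - 2, 1.0 / n)] \
          / np.sqrt(1 + (n - 2) / n ** 2)
mu = np.r_[n, n, np.zeros(n - 2)]                               # mu in Span{X_1, X_2}
print(X.T[:3] @ mu)                                             # ~0, ~0, large
```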
The result

                   r      E(|m̂|)   %{m̂ = m₀}   %{m̂ ⊇ m₀}
Our procedure    2.24      2.19       83.4%       96.2%
Lasso             285         6          0%         30%
A. Lasso          298         5          0%         25%
Mixed strategy

Let $m \in \mathcal{M}_c$. Take
\[ L_m = |m| \quad \text{if } m \in \mathcal{M}_o, \]
\[ L_m = \log \binom{p}{|m|} + \log\big( p(|m| + 1) \big) \quad \text{if } m \in \mathcal{M}_c \setminus \mathcal{M}_o. \]
\[ \longrightarrow\ \sum_{m \in \mathcal{M}} (D_m + 1)\, e^{-L_m} \le 3.51. \]
Then
\[ \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_{\hat{m}}|^2 \big]}{\sigma^2} \lesssim \left[ \inf_{m \in \mathcal{M}_o} \left( \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} \vee 1 \right) \right] \wedge \left[ \log(p) \inf_{m \in \mathcal{M}_c} \left( \frac{\mathbb{E}\big[ |\mu - \hat{\mu}_m|^2 \big]}{\sigma^2} \vee 1 \right) \right]. \]