Compressed sensing, sparsity and p-values

Sara van de Geer

April 16, 2015

(Leiden) Dantzig April 16, 2015 1 / 49
Basis Pursuit [Chen, Donoho and Saunders (1998)]

$X$: a given $n \times p$ (sensing) matrix; $f^0$: a given $n$-vector of measurements. We know $f^0 = X\beta^0$ and we want to recover $\beta^0 \in \mathbb{R}^p$. There are $n$ equations and $p$ unknowns. High-dimensional case: $p \gg n$.

Notation. The $\ell_1$-norm is
\[ \|\beta\|_1 := \sum_{j=1}^p |\beta_j|, \quad \beta \in \mathbb{R}^p. \]

Basis pursuit solution:
\[ \beta^* := \arg\min \big\{ \|\beta\|_1 : X\beta = f^0 \big\}. \]
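The slides contain no code; as a small numerical sketch (the function name, dimensions and support are my own choices), basis pursuit can be solved as a linear program by splitting $\beta$ into its positive and negative parts:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, f0):
    """Solve min ||beta||_1 subject to X beta = f0 as a linear program.

    Write beta = b_plus - b_minus with b_plus, b_minus >= 0; at the
    optimum ||beta||_1 = sum(b_plus) + sum(b_minus).
    """
    n, p = X.shape
    c = np.ones(2 * p)                        # objective: total l1 mass
    A_eq = np.hstack([X, -X])                 # X (b_plus - b_minus) = f0
    res = linprog(c, A_eq=A_eq, b_eq=f0, bounds=[(0, None)] * (2 * p))
    return res.x[:p] - res.x[p:]

# Exact-recovery demo: sparse beta0, Gaussian sensing matrix, p >> n.
rng = np.random.default_rng(0)
n, p = 40, 80
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[[3, 17, 55]] = [2.0, -1.5, 1.0]         # 3 active coordinates
beta_star = basis_pursuit(X, X @ beta0)
print(np.max(np.abs(beta_star - beta0)))      # essentially zero: exact recovery
```

With a Gaussian $X$ and only 3 nonzeros among $p = 80$ unknowns from $n = 40$ measurements, the null-space property holds with overwhelming probability and the LP recovers $\beta^0$ exactly.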
Let $S \subset \{1, \ldots, p\}$.

Notation.
\[ \beta_S := \{\beta_j \, 1\{j \in S\}\}_{j=1}^p, \qquad \beta_{-S} := \beta_{S^c} = \beta - \beta_S. \]
So $\beta_S$ keeps the coordinates in $S$ and sets the others to zero, while $\beta_{-S}$ keeps the coordinates outside $S$.

Definition. The matrix $X$ satisfies the null-space property at $S$ if for all $\beta \neq 0$ in $\mathrm{null}(X)$ it holds that
\[ \|\beta_{-S}\|_1 > \|\beta_S\|_1. \]
Basis pursuit solution:
\[ \beta^* := \arg\min \big\{ \|\beta\|_1 : X\beta = f^0 \big\}. \]
Let $S_0 := \{ j : \beta^0_j \neq 0 \}$ be the active set of $\beta^0$.

Loose definition. The vector $\beta^0$ is called sparse if $S_0$ is small.

Theorem. Suppose $X$ has the null-space property at $S_0$. Then we have exact recovery: $\beta^* = \beta^0$.
Proof. Suppose $\beta^* \neq \beta^0$. Since $X\beta^* = X\beta^0 = f^0$ we have $\beta^* - \beta^0 \in \mathrm{null}(X)$. By the null-space property (and since $\beta^0_{-S_0} = 0$, so that $(\beta^* - \beta^0)_{-S_0} = \beta^*_{-S_0}$),
\[ \|\beta^*_{-S_0}\|_1 > \|\beta^*_{S_0} - \beta^0\|_1. \]
Since $\beta^*$ minimizes $\|\cdot\|_1$ we have $\|\beta^*\|_1 \le \|\beta^0\|_1$. We can decompose the $\ell_1$-norm as $\|\beta^*\|_1 = \|\beta^*_{S_0}\|_1 + \|\beta^*_{-S_0}\|_1$. Hence
\[ \|\beta^*_{S_0}\|_1 + \|\beta^*_{-S_0}\|_1 \le \|\beta^0\|_1. \]
But then by the triangle inequality ($\|\beta^0\|_1 \le \|\beta^*_{S_0}\|_1 + \|\beta^*_{S_0} - \beta^0\|_1$)
\[ \|\beta^*_{-S_0}\|_1 \le \|\beta^*_{S_0} - \beta^0\|_1. \]
Thus we arrive at a contradiction. $\sqcup\!\sqcap$
Definition [vdG (2007)]. The compatibility constant for the set $S$ and the stretching constant $L > 0$ is
\[ \hat\phi^2(L, S) := \min \Big\{ \frac{|S|}{n} \|X\beta_S - X\beta_{-S}\|_2^2 : \|\beta_{-S}\|_1 \le L, \ \|\beta_S\|_1 = 1 \Big\}. \]
We have: $X$ satisfies the null-space property at $S$ $\Leftrightarrow$ $\hat\phi(1, S) > 0$.
[Figure: the compatibility constant $\hat\phi(1, S)$ for the case $S = \{1\}$.]
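For $p = 2$ and $S = \{1\}$ the minimization defining $\hat\phi^2(L, S)$ reduces to a one-dimensional clipped least-squares problem, which makes a quick numerical check possible. A sketch (function name and example data are mine):

```python
import numpy as np

def compat_sq_singleton(X, L):
    """hat-phi^2(L, S) for S = {1} when p = 2: minimize
    (|S|/n) * ||X beta_S - X beta_{-S}||_2^2 over ||beta_{-S}||_1 <= L,
    ||beta_S||_1 = 1.  By symmetry we may take beta_1 = 1; the problem is
    then least squares in beta_2, with the minimizer clipped to [-L, L]."""
    n = X.shape[0]
    x1, x2 = X[:, 0], X[:, 1]
    b = np.clip(x1 @ x2 / (x2 @ x2), -L, L)   # constrained minimizer
    return np.sum((x1 - b * x2) ** 2) / n

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 2))
print(compat_sq_singleton(X, 1.0))  # > 0, so the null-space property holds at {1}
```

Note that $\hat\phi^2(L, S)$ is non-increasing in the stretching constant $L$, since a larger $L$ enlarges the feasible set.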
Regularized formulation:
\[ \beta^\lambda := \arg\min_\beta \big\{ \|X\beta - f^0\|_2^2 / n + 2\lambda \|\beta\|_1 \big\}. \]

Lemma. We have
\[ \|X(\beta^\lambda - \beta^0)\|_2^2 / n \le \frac{\lambda^2 |S_0|}{\hat\phi^2(1, S_0)}. \]
Adding noise

Let $Y = f^0 + \epsilon$ with $\epsilon$ unobservable noise. Let $\beta^0$ be a solution of $f^0 = X\beta^0$.

Definition. The Lasso is
\[ \hat\beta := \hat\beta^\lambda := \arg\min_\beta \big\{ \|Y - X\beta\|_2^2 / n + 2\lambda \|\beta\|_1 \big\}. \]
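A minimal simulation of the noisy setting, using scikit-learn's Lasso (the dimensions, noise level and seed are my own choices, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, sigma0 = 100, 200, 0.5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = 1.0                                  # |S_0| = 5 active parameters
y = X @ beta0 + sigma0 * rng.standard_normal(n)  # Y = f0 + eps

# Tuning parameter on the theoretical scale lambda ~ sigma0 * sqrt(2 log(2p)/n).
lam = sigma0 * np.sqrt(2 * np.log(2 * p) / n)

# sklearn minimizes ||y - X b||_2^2/(2n) + alpha ||b||_1, which is the slides'
# ||Y - X b||_2^2/n + 2 lambda ||b||_1 up to an overall factor 2, so alpha = lambda.
fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
pred_err = np.sum((X @ (fit.coef_ - beta0)) ** 2) / n
print(pred_err)   # small, of order sigma0^2 * log(p) * |S_0| / n
```

Even though $p = 2n$, the prediction error is driven by the 5 active parameters, in line with the oracle-type bounds below.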
Theorem (prediction error of the Lasso). Let $\lambda_\epsilon \ge \|X^T\epsilon\|_\infty / n$. Take $\lambda > \lambda_\epsilon$. Then for
\[ \underline{\lambda} := \lambda - \lambda_\epsilon, \quad \bar\lambda := \lambda + \lambda_\epsilon, \quad L := \bar\lambda / \underline{\lambda}, \]
we have
\[ \|X(\hat\beta - \beta^0)\|_2^2 / n \le \frac{\bar\lambda^2 |S_0|}{\hat\phi^2(L, S_0)}. \]
Note 1. $\|\cdot\|_\infty$ is the dual norm of $\|\cdot\|_1$.

Note 2. Suppose $\epsilon \sim \mathcal{N}_n(0, \sigma_0^2 I)$ and $\mathrm{diag}(X^T X)/n = I$. Then
\[ \mathbb{P}\Big( \|X^T\epsilon\|_\infty / n \ge \sigma_0 \sqrt{\frac{2\log(2p/\alpha)}{n}} \Big) \le \alpha. \]

Note 3. Under compatibility conditions the Lasso thus has prediction error
\[ \|X(\hat\beta - \beta^0)\|_2^2 / n \sim \sigma_0^2 \, \frac{\log p \times |S_0|}{n} = \sigma_0^2 \times \frac{\log p \times \text{number of active parameters}}{\text{number of observations}} \]
= oracle inequality = adaptation.
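The tail bound in Note 2 is a union bound over the $p$ coordinates, and can be checked by Monte Carlo (all constants below are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma0, alpha = 50, 100, 1.0, 0.05
X = rng.standard_normal((n, p))
X = X / np.sqrt(np.mean(X ** 2, axis=0))        # enforce diag(X^T X)/n = I

# Threshold sigma0 * sqrt(2 log(2p/alpha) / n) from Note 2.
thresh = sigma0 * np.sqrt(2 * np.log(2 * p / alpha) / n)

reps = 2000
E = sigma0 * rng.standard_normal((reps, n))     # eps ~ N_n(0, sigma0^2 I)
sup_norms = np.max(np.abs(E @ X), axis=1) / n   # ||X^T eps||_inf / n per draw
exceed_freq = np.mean(sup_norms >= thresh)
print(exceed_freq)   # at most alpha; typically well below, the bound is conservative
```

The empirical exceedance frequency stays below $\alpha$; the slack reflects that the union bound ignores correlations between the coordinates of $X^T\epsilon$.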
What if $\beta^0$ is only approximately sparse?

Theorem (trade-off between approximation error and sparsity). Let $\lambda_\epsilon \ge \|X^T\epsilon\|_\infty / n$. Take $\lambda > \lambda_\epsilon$. Then for
\[ \underline{\lambda} := \lambda - \lambda_\epsilon, \quad \bar\lambda := \lambda + \lambda_\epsilon, \quad L := \bar\lambda / \underline{\lambda}, \]
we have for all $\beta$ and $S$
\[ \|X(\hat\beta - \beta^0)\|_2^2 / n \le \underbrace{\|X(\beta - \beta^0)\|_2^2 / n + 4\lambda \|\beta_{-S}\|_1}_{\text{approximation error}} + \underbrace{\frac{\bar\lambda^2 |S|}{\hat\phi^2(L, S)}}_{\text{``effective sparsity''}}. \]
Corollary. Let $S \subset \{1, \ldots, p\}$ be arbitrary. Let $f_S$ be the projection of $f^0$ on the space spanned by $\{X_j\}_{j \in S}$. Then
\[ \|X(\hat\beta - \beta^0)\|_2^2 / n \le \|f_S - f^0\|_2^2 / n + \frac{\bar\lambda^2 |S|}{\hat\phi^2(L, S)}. \]
So
\[ \|X(\hat\beta - \beta^0)\|_2^2 / n \le \min_S \Big\{ \|f_S - f^0\|_2^2 / n + \frac{\bar\lambda^2 |S|}{\hat\phi^2(L, S)} \Big\}. \]
What about the $\ell_1$-estimation error?

Theorem (including the $\ell_1$-error). Let $\lambda_\epsilon \ge \|X^T\epsilon\|_\infty / n$. Take $\lambda > \lambda_\epsilon$. Then for
\[ \underline{\lambda} := \lambda - \lambda_\epsilon, \quad \bar\lambda := \lambda + \lambda_\epsilon + \delta\lambda, \quad L := \frac{\bar\lambda}{(1 - \delta)\underline{\lambda}}, \]
we have for all $\beta$ and $S$
\[ 2\delta\lambda \|\hat\beta - \beta\|_1 + \|X(\hat\beta - \beta^0)\|_2^2 / n \le \|X(\beta - \beta^0)\|_2^2 / n + \frac{\bar\lambda^2 |S|}{\hat\phi^2(L, S)} + 4\lambda \|\beta_{-S}\|_1. \]
Corollary (weak sparsity). Let
\[ \rho_r^r := \sum_{j=1}^p |\beta^0_j|^r, \quad 0 < r < 1, \qquad S_* := \{ j : |\beta^0_j| > 3\lambda_\epsilon \}. \]
We have (with $\delta = 1/5$, $\lambda = 2\lambda_\epsilon$)
\[ \|\hat\beta - \beta^0\|_1 \le \frac{2^8 \lambda_\epsilon^{1-r} \rho_r^r}{\hat\phi^2(4, S_*)}. \]

Asymptopia. Suppose $1/\hat\phi^2(4, S_*) = O(1)$. Let $\lambda_\epsilon \asymp \sqrt{\log p / n}$. When $\rho_r^r = o\big((n/\log p)^{(1-r)/2}\big)$ we have $\|\hat\beta - \beta^0\|_1 = o_P(1)$.
Question. What is so special about the $\ell_1$-norm? Why does it lead to exact recovery and oracle inequalities?

Answer. Its decomposability:
\[ \|\beta\|_1 = \|\beta_S\|_1 + \|\beta_{-S}\|_1. \]
Definition. The sub-differential of $\beta \mapsto \|\beta\|_1$ is
\[ \partial \|\beta\|_1 = \{ z : \|z\|_\infty = 1, \ z^T\beta = \|\beta\|_1 \}. \]
[Figure: the map $\beta \mapsto |\beta|$, with sub-differential $[-1, +1]$ at $\beta = 0$; sub-differential calculus.]
We invoke decomposability actually as the triangle property:
\[ \max_{z \in \partial\|\beta^0\|_1} z^T\beta \ge \|\beta_{-S_0}\|_1 - \|\beta_{S_0}\|_1. \]
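For the $\ell_1$-norm the triangle property can be checked directly; a sketch of the standard verification, not spelled out on the slide:

```latex
% Take z with z_j = \mathrm{sign}(\beta^0_j) for j \in S_0 and
% z_j = \mathrm{sign}(\beta_j) for j \notin S_0.  Then \|z\|_\infty = 1 and
% z^T \beta^0 = \|\beta^0\|_1 (the off-support coordinates of \beta^0 vanish),
% so z \in \partial\|\beta^0\|_1.  Moreover
\begin{align*}
z^T \beta
  &= \sum_{j \in S_0} \mathrm{sign}(\beta^0_j)\,\beta_j
     + \sum_{j \notin S_0} |\beta_j| \\
  &\ge \|\beta_{-S_0}\|_1 - \|\beta_{S_0}\|_1 ,
\end{align*}
% which gives the triangle property
% \max_{z \in \partial\|\beta^0\|_1} z^T\beta \ge \|\beta_{-S_0}\|_1 - \|\beta_{S_0}\|_1.
```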
Other norms

Let $\Omega$ be a norm on $\mathbb{R}^p$.

Definition. The dual norm of $\Omega$ is
\[ \Omega_*(z) := \max_{\Omega(\beta) \le 1} z^T\beta, \quad z \in \mathbb{R}^p. \]

Definition. The sub-differential of $\beta \mapsto \Omega(\beta)$ is
\[ \partial\Omega(\beta) := \{ z : \Omega_*(z) = 1, \ z^T\beta = \Omega(\beta) \}. \]
Definition. We say that $\Omega$ is weakly decomposable at $\beta^0$ if there exist semi-norms $\Omega^+$ and $\Omega^-$ (depending on $\beta^0$) with $\Omega^-(\beta^0) = 0$ such that for all $\beta$
\[ \Omega(\beta) \ge \Omega^+(\beta) + \Omega^-(\beta). \]

Definition. We say that $\Omega$ satisfies the triangle property at $\beta^0$ if there exist semi-norms $\Omega^+$ and $\Omega^-$ (depending on $\beta^0$) such that for all $\beta$
\[ \max_{z^0 \in \partial\Omega(\beta^0)} (z^0)^T(\beta - \beta^0) \ge \Omega^-(\beta) - \Omega^+(\beta - \beta^0). \]