Low ℓ1 Norm and Guarantees on Sparsifiability
Shai Shalev-Shwartz & Nathan Srebro
Toyota Technological Institute at Chicago
ICML/COLT/UAI workshop, July 2008
Motivation

Problem I:  w_0 = argmin_w E[L(⟨w, x⟩, y)]  s.t.  ‖w‖_0 ≤ S

Problem II: w_1 = argmin_w E[L(⟨w, x⟩, y)]  s.t.  ‖w‖_1 ≤ B

Under strict assumptions on the data distribution (e.g., uncorrelated features), w_1 is also sparse.
But what if w_1 is not sparse?
Sparsification

Predictor w with ‖w‖_1 = B  →  sparsification procedure  →  predictor w̃ with ‖w̃‖_0 = S

Constraint: E[L(⟨w̃, x⟩, y)] ≤ E[L(⟨w, x⟩, y)] + ε
Goal: the minimal S that satisfies the constraint
Question: how does S depend on B and ε?
Main Result

Theorem: For any predictor w, λ-Lipschitz loss function L, distribution D over X × Y, and desired accuracy ε, there exists w̃ such that

    E[L(⟨w̃, x⟩, y)] ≤ E[L(⟨w, x⟩, y)] + ε   and   ‖w̃‖_0 = O((λ ‖w‖_1 / ε)²)

Tightness: there are a data distribution, a loss function, and a dense predictor w with loss l such that Ω(‖w‖_1² / ε²) features are needed to achieve loss l + ε.

Sparsifying by taking the largest weights, or by following the ℓ1 regularization path, might fail.
A low ℓ2 norm does not imply a sparse predictor.
Main Result (cont.)

[Diagram: Distribution D and loss L → convex optimization → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃; a forward selection procedure goes from (D, L) directly to the sparse predictor w̃.]
Randomized Sparsification Procedure

[Diagram: a histogram of |w_1|, ..., |w_n| with normalizer Z is turned into a histogram of counts |w̃_1|, ..., |w̃_n| with normalizer Z′.]

Sparsification procedure:
For j = 1, ..., S:
    sample an index i from the distribution P with P_i ∝ |w_i|
    add: |w̃_i| ← |w̃_i| + 1
Randomized Sparsification Procedure

Sparsification procedure:
For j = 1, ..., S:
    sample an index i from the distribution P with P_i ∝ |w_i|
    add: |w̃_i| ← |w̃_i| + 1

Guarantee
Assume: X = {x : ‖x‖_∞ ≤ 1}, Y is an arbitrary set, D is an arbitrary distribution over X × Y, and the loss L : R × Y → R is λ-Lipschitz w.r.t. its first argument.
If S ≥ Ω( λ² ‖w‖_1² log(1/δ) / ε² ), then with probability at least 1 − δ,

    E[L(⟨w̃, x⟩, y)] − E[L(⟨w, x⟩, y)] ≤ ε
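A minimal NumPy sketch of this procedure, assuming (as the slide leaves implicit) that the counts are rescaled by B/S and given the signs of the original weights, so that the sparse vector is an unbiased estimate of w:

import numpy as np

def randomized_sparsify(w, S, seed=None):
    """Sample S coordinates i.i.d. with probability proportional to |w_i|,
    count how often each coordinate is hit, then restore signs and rescale."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    B = np.abs(w).sum()                         # ell_1 norm of the dense predictor
    probs = np.abs(w) / B                       # P_i proportional to |w_i|
    counts = np.zeros_like(w)
    for i in rng.choice(len(w), size=S, p=probs):
        counts[i] += 1                          # |w~_i| <- |w~_i| + 1
    w_tilde = np.sign(w) * counts * (B / S)     # E[w~] = w and ||w~||_0 <= S
    return w_tilde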
Randomized Sparsification Procedure (cont.)

[Diagram as before: D and L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃.]

The randomized sparsification step:
• requires access to w
• does not require access to D
Tightness

Data distribution: spread the 'information' about the label among all features.

[Diagram: label Y with P(Y = ±1) = 1/2; features X_1, ..., X_n with P(X_i = y | y) = (1 + 1/B)/2.]
Tightness (cont.)

Data distribution: P(Y = ±1) = 1/2 and P(X_i = y | y) = (1 + 1/B)/2.

Dense predictor: w_i = B/n, so ‖w‖_1 = B and E[|⟨w, x⟩ − y|] ≤ B/√n.

Sparse predictor: any u with E[|⟨u, x⟩ − y|] ≤ ε must satisfy ‖u‖_0 = Ω(B²/ε²).

The proof uses a generalization of the Khintchine inequality: if x = (x_1, ..., x_n) are independent random variables with P[x_k = 1] ∈ (5%, 95%) and Q is a degree-d polynomial, then

    E[|Q(x)|] ≥ (0.2)^d · E[|Q(x)|²]^{1/2}
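A quick check of the dense-predictor bound stated above (my own arithmetic, using only the distribution on this slide; the x_i take values in {±1} and are conditionally independent given y):

% Why E[|<w,x> - y|] <= B / sqrt(n) when w_i = B/n:
\begin{align*}
\mathbb{E}[x_i \mid y] &= y\,\tfrac{1 + 1/B}{2} - y\,\tfrac{1 - 1/B}{2} = \tfrac{y}{B},
\qquad
\mathbb{E}[\langle w, x\rangle \mid y] = n \cdot \tfrac{B}{n} \cdot \tfrac{y}{B} = y, \\
\mathbb{E}\,|\langle w, x\rangle - y|
&\le \sqrt{\mathbb{E}\big[(\langle w, x\rangle - y)^2\big]}
 = \sqrt{\textstyle\sum_{i=1}^{n} \tfrac{B^2}{n^2}\,\mathrm{Var}(x_i)}
 \le \tfrac{B}{\sqrt{n}},
\end{align*}

where the last step uses Var(x_i) = 1 − 1/B² ≤ 1 and the first inequality is Jensen's.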
Low ℓ2 Norm Does Not Guarantee Sparsifiability

Same data distribution as before, with B = ε√n.

Dense predictor: w_i = B/n, so

    E[|⟨w, x⟩ − y|] ≤ B/√n = ε   and   ‖w‖_2 = B/√n = ε

Sparse predictor: any u with E[|⟨u, x⟩ − y|] ≤ 2ε must use almost all of the features:

    ‖u‖_0 = Ω(B²/ε²) = Ω(n)

ℓ1 captures sparsity, but ℓ2 doesn't!
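The norm calculation behind this slide, spelled out (my own arithmetic for w_i = B/n with B = ε√n):

\[
\|w\|_1 = n \cdot \tfrac{B}{n} = B = \epsilon\sqrt{n},
\qquad
\|w\|_2 = \sqrt{\,n \cdot \tfrac{B^2}{n^2}\,} = \tfrac{B}{\sqrt{n}} = \epsilon .
\]

So ‖w‖_2 stays at ε while ‖w‖_1 grows like √n; the ℓ1-based guarantee O(‖w‖_1²/ε²) is of order n here, matching the Ω(n) lower bound.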
Sparsifying by Zeroing Small Weights Fails

[Diagram: label Y with P(Y = ±1) = 1/2; hidden variables Z_1, ..., Z_s with P(Z_1 = y | y) = (1 + 2/3)/2, ..., P(Z_s = y | y) = (1 + 1/3)/2; observed features X_1, ..., X_{sn} with P(X_j = z_⌈j/s⌉ | z_⌈j/s⌉) = 7/8. An annotation marks the features that receive the larger weights.]

On this example, keeping only the largest weights fails, and keeping the initial weights on the ℓ1 regularization path also fails.
Intermediate Summary

We answer a fundamental question: how much sparsity does a low ℓ1 norm guarantee?

    ‖w̃‖_0 ≤ O(‖w‖_1² / ε²)

• This is tight.
• It is achievable by a simple randomized procedure.

Coming next: a direct approach also works!

[Diagram as before, now including the forward selection procedure that goes from (D, L) directly to the sparse predictor w̃.]
Greedy Forward Selection

Step 1: Define a slightly modified loss function

    L̃(v, y) = min_u [ (λ²/ε)(u − v)² + L(u, y) ]

Using infimal convolution theory, it can be shown that L̃ has a Lipschitz-continuous derivative and that

    ∀ v, y:  |L(v, y) − L̃(v, y)| ≤ ε/4

Step 2: Apply forward greedy selection to L̃ (a code sketch follows below)
• Initialize w_1 = 0
• Choose the feature corresponding to the largest element of the gradient
• Choose a step size η_t (a closed-form solution exists)
• Update w_{t+1} = (1 − η_t) w_t + η_t B e_{j_t}
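A minimal NumPy sketch of Step 2, assuming the loss enters through a gradient callable and using the generic 2/(t+2) step-size schedule instead of the closed-form line search mentioned on the slide; the sign of the chosen coordinate, which the slide glosses over, is made explicit:

import numpy as np

def greedy_forward_selection(X, y, B, grad_margin, T):
    """Forward greedy selection over the ell_1 ball of radius B.

    X: (m, d) data matrix with rows satisfying ||x||_inf <= 1
    y: (m,) labels in {-1, +1}
    grad_margin: callable giving dL~/dv at the margins v = y * (X @ w)
    T: number of greedy steps, so the output has at most T non-zero weights
    """
    m, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        margins = y * (X @ w)
        # gradient of the empirical smoothed risk with respect to w
        grad = X.T @ (y * grad_margin(margins)) / m
        j = int(np.argmax(np.abs(grad)))      # feature with the largest gradient magnitude
        vertex = np.zeros(d)
        vertex[j] = -np.sign(grad[j]) * B     # best vertex of the ell_1 ball
        eta = 2.0 / (t + 2)                   # assumed simple schedule, not the closed form
        w = (1 - eta) * w + eta * vertex      # w_{t+1} = (1 - eta_t) w_t + eta_t (+/- B e_{j_t})
    return w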
Greedy Forward Selection (cont.)

Example – hinge loss: L(v, y) = max{0, 1 − v}

    L̃(v, y) = 0                 if v > 1
             = (1/ε)(v − 1)²     if v ∈ [1 − ε/2, 1]
             = (1 − ε/4) − v     otherwise
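A small NumPy sketch of this smoothed hinge and its derivative; the derivative can be passed as grad_margin to the greedy sketch above. The ε/2 breakpoint is what the infimal-convolution definition on the previous slide yields for the hinge with λ = 1:

import numpy as np

def smoothed_hinge(v, eps):
    """Value of L~(v) = min_u (1/eps)(u - v)^2 + max{0, 1 - u}."""
    v = np.asarray(v, dtype=float)
    return np.where(v > 1, 0.0,
           np.where(v >= 1 - eps / 2, (v - 1) ** 2 / eps, (1 - eps / 4) - v))

def smoothed_hinge_grad(v, eps):
    """Derivative of the smoothed hinge with respect to the margin v."""
    v = np.asarray(v, dtype=float)
    return np.where(v > 1, 0.0,
           np.where(v >= 1 - eps / 2, 2 * (v - 1) / eps, -1.0))

# Hypothetical usage with the greedy sketch above:
# w_sparse = greedy_forward_selection(X, y, B=10.0,
#                                     grad_margin=lambda m: smoothed_hinge_grad(m, eps=0.1),
#                                     T=50)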
Guarantees

Theorem: Let X = {x : ‖x‖_∞ ≤ 1}, let Y be an arbitrary set, let D be an arbitrary distribution over X × Y, and let the loss L : R × Y → R be proper, convex, and λ-Lipschitz w.r.t. its first argument. Then forward greedy selection on L̃ finds w̃ with

    ‖w̃‖_0 = O(λ²B²/ε²)

such that, for any w with ‖w‖_1 ≤ B,

    E[L(⟨w̃, x⟩, y)] − E[L(⟨w, x⟩, y)] ≤ ε
Related Work

ℓ1 norm and sparsity: Donoho gives sufficient conditions under which the minimizer of the ℓ1 norm is also sparse. But what if these conditions are not met?

Compressed sensing: the ℓ1 norm recovers the sparse predictor, but only under severe assumptions on the design matrix (in our case, the training examples).

Converse question: does small ‖w̃‖_0 imply small ‖w‖_1? Servedio gives a partial answer for linear classification; Wainwright gives a partial answer for the Lasso.

Sparsification: a randomized sparsification procedure was previously proposed by Schapire et al., but their bound depends on the training set size. Lee, Bartlett, and Williamson addressed a similar question for the special case of the squared-error loss. Zhang presented a forward greedy procedure for twice-differentiable losses.
Summary

[Diagram: Distribution D and loss L → convex optimization → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃; the forward selection procedure goes directly to w̃; convex optimization can also yield a low-ℓ2 predictor, but taking the largest weights or following the regularization path does not work.]

‖w̃‖_0 ≤ O(‖w‖_1² / ε²), and this is tight.