Feature Selection & the Shapley-Folkman Theorem

Alexandre d’Aspremont, CNRS & D.I., École Normale Supérieure.

With Armin Askari, Laurent El Ghaoui (UC Berkeley) and Quentin Rebjock (EPFL).

CIRM, Luminy, March 2020.
Introduction

Feature selection.

- Reduce the number of variables while preserving classification performance.
- Often improves test performance, especially when samples are scarce.
- Helps interpretation.

Classical examples: LASSO, $\ell_1$-logistic regression, RFE-SVM, ...
Introduction: feature selection

RNA classification. Find genes which best discriminate cell type (lung cancer vs. control). 35238 genes, 2695 examples. [Lachmann et al., 2018]

[Figure: objective value ($\times 10^{11}$, from 3.430 to 3.455) versus number of features $k$, for $k$ from 0 to 35000.]

Best ten genes: MT-CO3, MT-ND4, MT-CYB, RP11-217O12.1, LYZ, EEF1A1, MT-CO1, HBA2, HBB, HBA1.
Introduction: feature selection

Applications. Mapping brain activity by fMRI. From the PARIETAL team at INRIA.
Introduction: feature selection

fMRI. Many voxels and very few samples lead to false discoveries.

Wired article on Bennett et al., "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction", Journal of Serendipitous and Unexpected Results, 2010.
Introduction: linear models

Linear models. Select features from large weights $w$.

- The LASSO solves
  $$\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1,$$
  with linear prediction given by $w^T x$.
- The linear SVM solves
  $$\min_w \textstyle\sum_i \max\{0,\, 1 - y_i w^T x_i\} + \lambda \|w\|_2^2,$$
  with linear classification rule $\mathrm{sign}(w^T x)$.

In practice:

- Relatively high complexity on very large-scale data sets.
- Recovery results require uncorrelated features (incoherence, RIP, etc.).
- Cheaper featurewise methods (ANOVA, TF-IDF, etc.) have relatively poor performance.
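As a concrete illustration, here is a minimal sketch of LASSO-based feature selection, assuming scikit-learn is available; the data X, y and the penalty level alpha are placeholders, not values from the talk.

```python
# Minimal LASSO feature-selection sketch; X, y and alpha are placeholders.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))       # 100 samples, 1000 features
y = X[:, :10] @ rng.standard_normal(10)    # only the first 10 features matter

lasso = Lasso(alpha=0.1).fit(X, y)         # alpha plays the role of lambda
selected = np.flatnonzero(lasso.coef_)     # keep features with nonzero weight
print(len(selected), "features selected")
```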
Outline

- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance
Multinomial Naive Bayes

Multinomial Naive Bayes. In the multinomial model,
$$\log \mathrm{Prob}(x \mid C_\pm) = x^\top \log \theta^\pm + \log \frac{\left(\sum_{j=1}^m x_j\right)!}{\prod_{j=1}^m x_j!}.$$

Training by maximum likelihood:
$$(\theta^+_*, \theta^-_*) = \mathop{\rm argmax}_{\substack{\theta^+, \theta^- \in [0,1]^m \\ \mathbf{1}^\top \theta^+ = \mathbf{1}^\top \theta^- = 1}} \; f_+^\top \log \theta^+ + f_-^\top \log \theta^-.$$

Linear classification rule: for a given test point $x \in \mathbf{R}^m$, set
$$\hat y(x) = \mathrm{sign}(v + w^\top x),$$
where
$$v \triangleq \log \mathrm{Prob}(C_+) - \log \mathrm{Prob}(C_-), \quad\text{and}\quad w \triangleq \log \theta^+_* - \log \theta^-_*.$$
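In code, the maximum likelihood problem above has the closed-form solution $\theta^\pm = f_\pm / \mathbf{1}^\top f_\pm$ (the normalized class count vectors). A minimal sketch, assuming f_plus and f_minus are positive feature-count vectors and omitting Laplace smoothing:

```python
# Minimal multinomial naive Bayes sketch; f_plus, f_minus are assumed to be
# positive feature-count vectors for each class (no smoothing).
import numpy as np

def train_mnb(f_plus, f_minus, prior_plus=0.5):
    # Maximum likelihood: theta is the normalized count vector of each class.
    theta_plus = f_plus / f_plus.sum()
    theta_minus = f_minus / f_minus.sum()
    w = np.log(theta_plus) - np.log(theta_minus)
    v = np.log(prior_plus) - np.log(1.0 - prior_plus)
    return v, w

def predict(v, w, x):
    # Linear classification rule sign(v + w^T x).
    return np.sign(v + w @ x)
```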
Sparse Naive Bayes

Naive feature selection. Make $w \triangleq \log \theta^+_* - \log \theta^-_*$ sparse. Solve
$$\begin{array}{rl}
(\theta^+_*, \theta^-_*) = \mathop{\rm argmax} & f_+^\top \log \theta^+ + f_-^\top \log \theta^- \\
\text{subject to} & \|\theta^+ - \theta^-\|_0 \le k \\
& \mathbf{1}^\top \theta^+ = \mathbf{1}^\top \theta^- = 1, \quad \theta^+, \theta^- \ge 0
\end{array} \qquad \text{(SMNB)}$$
where $k \ge 0$ is a target number of features. Features for which $\theta^+_i = \theta^-_i$ can be discarded.

Nonconvex problem.

- Convex relaxation?
- Approximation bounds?
Sparse Naive Bayes

Convex relaxation. The dual is very simple.

Sparse Multinomial Naive Bayes [Askari, A., El Ghaoui, 2019]
Let $\phi(k)$ be the optimal value of (SMNB). Then $\phi(k) \le \psi(k)$, where $\psi(k)$ is the optimal value of the following one-dimensional convex optimization problem:
$$\psi(k) := C + \min_{\alpha \in [0,1]} s_k(h(\alpha)), \qquad \text{(USMNB)}$$
where $C$ is a constant, $s_k(\cdot)$ is the sum of the top $k$ entries of its vector argument, and for $\alpha \in (0,1)$,
$$h(\alpha) := f_+ \circ \log f_+ + f_- \circ \log f_- - (f_+ + f_-) \circ \log(f_+ + f_-) - f_+ \log \alpha - f_- \log(1 - \alpha).$$

Solved by bisection, with complexity $O(n + k \log k)$, linear in $n$.

Approximation bounds?
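A minimal sketch of evaluating this relaxation, with the additive constant C omitted and scipy's bounded scalar minimizer standing in for the bisection scheme; f_plus and f_minus are assumed to be positive count vectors.

```python
# Sketch of the one-dimensional relaxation (USMNB), up to the constant C.
import numpy as np
from scipy.optimize import minimize_scalar

def h(alpha, f_plus, f_minus):
    # Entries of h(alpha) as defined above.
    f = f_plus + f_minus
    return (f_plus * np.log(f_plus) + f_minus * np.log(f_minus)
            - f * np.log(f)
            - f_plus * np.log(alpha) - f_minus * np.log(1.0 - alpha))

def s_k(v, k):
    # Sum of the top-k entries of v.
    return np.sort(v)[-k:].sum()

def usmnb(f_plus, f_minus, k):
    # Minimize s_k(h(alpha)) over alpha in (0, 1).
    res = minimize_scalar(lambda a: s_k(h(a, f_plus, f_minus), k),
                          bounds=(1e-9, 1.0 - 1e-9), method="bounded")
    return res.fun, res.x
```

The indices achieving the top-$k$ entries of $h(\alpha^*)$ are then natural candidates for the selected support, though the exact recovery rule is given in the paper.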
Outline

- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance
Shapley-Folkman Theorem

Minkowski sum. Given sets $X, Y \subset \mathbf{R}^d$, we have
$$X + Y = \{x + y : x \in X,\, y \in Y\}.$$
(CGAL User and Reference Manual)

Convex hull. Given subsets $V_i \subset \mathbf{R}^d$, we have
$$\mathrm{Co}\Big(\sum_i V_i\Big) = \sum_i \mathrm{Co}(V_i).$$
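A small numeric illustration of the Minkowski sum of two finite point sets in the plane, computed by brute force with numpy broadcasting; the point sets here are arbitrary.

```python
# Minkowski sum of two finite point sets in R^2: all pairwise sums x + y.
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

minkowski_sum = (X[:, None, :] + Y[None, :, :]).reshape(-1, 2)
print(np.unique(minkowski_sum, axis=0))
```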
Shapley-Folkman Theorem

The $\ell_{1/2}$ ball, Minkowski average of two and ten balls, and the convex hull.

[Figure: Minkowski averages of the $\ell_{1/2}$ ball converging to its convex hull.]

Minkowski sum of the first five digits (obtained by sampling).

[Figure: images of five digits and their Minkowski sum.]
Shapley-Folkman Theorem

Shapley-Folkman Theorem [Starr, 1969]
Suppose $V_i \subset \mathbf{R}^d$, $i = 1, \ldots, n$, and
$$x \in \mathrm{Co}\Big(\sum_{i=1}^n V_i\Big) = \sum_{i=1}^n \mathrm{Co}(V_i),$$
then
$$x \in \sum_{i \in [1,n] \setminus \mathcal{S}_x} V_i + \sum_{i \in \mathcal{S}_x} \mathrm{Co}(V_i),$$
where $|\mathcal{S}_x| \le d$.
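As a concrete one-dimensional example (worked out here, not on the slide), take $V_i = \{0, 1\}$ and $d = 1$:

```latex
% V_i = {0,1} in R, so Co(sum_i V_i) = [0, n] and Co(V_i) = [0, 1].
\[
x \;=\; \underbrace{\lfloor x \rfloor \cdot 1 + 0 + \cdots + 0}_{x_i \in V_i}
\;+\; \underbrace{\big(x - \lfloor x \rfloor\big)}_{\in\, \mathrm{Co}(V_i) = [0,1]},
\qquad x \in [0, n],
\]
% so at most |S_x| = d = 1 summand needs its convex hull.
```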
Shapley-Folkman Theorem

Proof sketch. Write $x \in \sum_{i=1}^n \mathrm{Co}(V_i)$, i.e.
$$\begin{pmatrix} x \\ \mathbf{1}_n \end{pmatrix} = \sum_{i=1}^n \sum_{j=1}^{d+1} \lambda_{ij} \begin{pmatrix} v_{ij} \\ e_i \end{pmatrix}, \quad \text{for } \lambda \ge 0.$$
Conic Carathéodory then yields a representation with at most $n + d$ nonzero coefficients. Now use a pigeonhole argument: the rows in $e_i$ force at least one nonzero $\lambda_{ij}$ for every index $i$, so at most $d$ indices can carry two or more; for the remaining indices, $x_i \in V_i$ rather than just $x_i \in \mathrm{Co}(V_i)$.

The number of nonzero $\lambda_{ij}$ controls the gap with the convex hull.
Shapley-Folkman: geometric consequences

Consequences.

- If the sets $V_i \subset \mathbf{R}^d$ are uniformly bounded with $\mathrm{rad}(V_i) \le R$, then
  $$d_H\left( \frac{\sum_{i=1}^n V_i}{n},\; \mathrm{Co}\left(\frac{\sum_{i=1}^n V_i}{n}\right) \right) \le R\, \frac{\min\{n, d\}}{n},$$
  where $\mathrm{rad}(V) = \inf_{x \in V} \sup_{y \in V} \|x - y\|$.
- In particular, when $d$ is fixed and $n \to \infty$,
  $$\frac{\sum_{i=1}^n V_i}{n} \longrightarrow \mathrm{Co}\left(\frac{\sum_{i=1}^n V_i}{n}\right)$$
  in the Hausdorff metric, at rate $O(1/n)$.
- This holds for many other nonconvexity measures [Fradelizi et al., 2017].
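A quick numeric check of the $O(1/n)$ rate (an illustration under assumed data, not from the slides), averaging $n$ copies of the nonconvex set $V = \{0, 1\}$, for which $d = 1$ and $\mathrm{rad}(V) = 1$:

```python
# The Minkowski average of n copies of V = {0,1} is the grid {0, 1/n, ..., 1};
# its Hausdorff distance to Co(V) = [0,1] is 1/(2n), within the bound R/n.
import numpy as np

V = np.array([0.0, 1.0])
for n in [1, 2, 5, 10, 100]:
    avg = V / n
    for _ in range(n - 1):
        avg = np.unique((avg[:, None] + V[None, :] / n).ravel())
    gaps = np.diff(np.concatenate(([0.0], avg, [1.0])))
    print(n, gaps.max() / 2)   # Hausdorff distance: half the largest gap
```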
Outline

- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance
Nonconvex Optimization

Separable nonconvex problem. Solve
$$\begin{array}{rl} \text{minimize} & \sum_{i=1}^n f_i(x_i) \\ \text{subject to} & Ax \le b, \end{array} \qquad \text{(P)}$$
in the variables $x_i \in \mathbf{R}^{d_i}$ with $d = \sum_{i=1}^n d_i$, where the $f_i$ are lower semicontinuous and $A \in \mathbf{R}^{m \times d}$.

Take the dual twice to form a convex relaxation,
$$\begin{array}{rl} \text{minimize} & \sum_{i=1}^n f_i^{**}(x_i) \\ \text{subject to} & Ax \le b, \end{array} \qquad \text{(CoP)}$$
in the variables $x_i \in \mathbf{R}^{d_i}$.
Nonconvex Optimization

Convex envelope. The biconjugate $f^{**}$ satisfies
$$\mathrm{epi}(f^{**}) = \mathrm{Co}(\mathrm{epi}(f)),$$
which means that $f^{**}(x)$ and $f(x)$ match at extreme points of $\mathrm{epi}(f^{**})$. Define the lack of convexity as
$$\rho(f) \triangleq \sup_{x \in \mathrm{dom}(f)} \{ f(x) - f^{**}(x) \}.$$

Example.

[Figure: $\mathrm{Card}(x)$ and $|x|$ on $[-1, 1]$.]

The $\ell_1$ norm is the convex envelope of $\mathrm{Card}(x)$ on $[-1, 1]$.
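For this example, the lack of convexity is easy to work out explicitly (a detail filled in here, not on the slide):

```latex
% On [-1,1] the convex envelope of Card is |x|, so
\[
\rho(\mathrm{Card})
= \sup_{|x| \le 1} \big\{ \mathrm{Card}(x) - |x| \big\}
= \sup_{0 < |x| \le 1} \big\{ 1 - |x| \big\}
= 1,
\]
% the supremum being approached as x tends to 0 with x nonzero.
```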
Nonconvex Optimization

Writing the epigraph of problem (P) as in [Lemaréchal and Renaud, 2001],
$$\mathcal{G}_r \triangleq \left\{ (r_0, r) \in \mathbf{R}^{1+m} : \sum_{i=1}^n f_i(x_i) \le r_0,\; Ax - b \le r,\; x \in \mathbf{R}^d \right\},$$
we can write the dual function of (P) as
$$\Psi(\lambda) \triangleq \inf \left\{ r_0 + \lambda^\top r : (r_0, r) \in \mathcal{G}_r^{**} \right\},$$
in the variable $\lambda \in \mathbf{R}^m$, where $\mathcal{G}_r^{**} = \mathrm{Co}(\mathcal{G}_r)$ is the closed convex hull of the epigraph $\mathcal{G}_r$.

Because the constraints are affine, (P) and (CoP) have the same dual [Lemaréchal and Renaud, 2001, Th. 2.11], given by
$$\sup_{\lambda \ge 0} \Psi(\lambda) \qquad \text{(D)}$$
in the variable $\lambda \in \mathbf{R}^m$. Roughly, if $\mathcal{G}_r^{**} = \mathcal{G}_r$, there is no duality gap in (P).
Nonconvex Optimization

Epigraph & duality gap. Define
$$\mathcal{F}_i = \left\{ (f_i^{**}(x_i), A_i x_i) : x_i \in \mathbf{R}^{d_i} \right\},$$
where $A_i \in \mathbf{R}^{m \times d_i}$ is the $i$-th block of $A$.

- The epigraph $\mathcal{G}_r^{**}$ can be written as a Minkowski sum of the $\mathcal{F}_i$:
  $$\mathcal{G}_r^{**} = \sum_{i=1}^n \mathcal{F}_i + (0, -b) + \mathbf{R}_+^{m+1}.$$
- Shapley-Folkman at $x \in \mathcal{G}_r^{**}$ shows $f_i^{**}(x_i) = f_i(x_i)$ for all but at most $m + 1$ terms in the objective.
- As $n \to \infty$ with $m/n \to 0$, $\mathcal{G}_r$ gets closer to its convex hull $\mathcal{G}_r^{**}$ and the duality gap becomes negligible.
Bound on duality gap

A priori bound on the duality gap of
$$\begin{array}{rl} \text{minimize} & \sum_{i=1}^n f_i(x_i) \\ \text{subject to} & Ax \le b, \end{array}$$
where $A \in \mathbf{R}^{m \times d}$.

Proposition [Aubin and Ekeland, 1976, Ekeland and Temam, 1999]
A priori bounds on the duality gap. Suppose the functions $f_i$ in (P) satisfy Assumption (. . . ). There is a point $x^\star \in \mathbf{R}^d$ at which the primal optimal value of (CoP) is attained, such that
$$\underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} \;\le\; \underbrace{\sum_{i=1}^n f_i(\hat x_i^\star)}_{\text{P}} \;\le\; \underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} + \underbrace{\sum_{i=1}^{m+1} \rho(f_{[i]})}_{\text{gap}},$$
where $\hat x^\star$ is an optimal point of (P) and $\rho(f_{[1]}) \ge \rho(f_{[2]}) \ge \ldots \ge \rho(f_{[n]})$.
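A direct reading of this bound (spelled out here, not on the slide): if $\rho(f_i) \le \rho_{\max}$ for all $i$, then

```latex
\[
0 \;\le\; \sum_{i=1}^n f_i(\hat x_i^\star) - \sum_{i=1}^n f_i^{**}(x_i^\star)
\;\le\; (m+1)\,\rho_{\max},
\]
```

so the gap per term is at most $(m+1)\rho_{\max}/n$, which vanishes when the number of constraints $m$ is fixed and $n$ grows, matching the $m/n \to 0$ regime above.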