High-dimensional classification by sparse logistic regression

Felix Abramovich, Tel Aviv University
(based on joint work with Vadim Grinshtein, The Open University of Israel, and Tomer Levy, Tel Aviv University)
Outline

1. Review of (binary) classification
2. High-dimensional (binary) classification by sparse logistic regression
   - model, feature selection by penalized maximum likelihood
   - theory: misclassification excess bounds, adaptive minimax classifiers
   - computational issues: logistic Lasso and Slope
3. Multiclass extensions
   - sparse multinomial logistic regression
   - theory
   - multinomial logistic group Lasso and Slope
Binary Classification

$(X, Y) \sim F$: $Y \mid X = x \sim B(1, p(x))$, $X \in \mathbb{R}^d \sim f(x)$

Classifier: $\eta: \mathbb{R}^d \to \{0, 1\}$

Misclassification error: $R(\eta) = P(Y \neq \eta(X))$

Bayes classifier: $\eta^*(x) = \arg\min_\eta R(\eta)$:
$$\eta^*(x) = I\{p(x) \geq 1/2\}, \qquad R(\eta^*) = E_X\left(\min(p(X), 1 - p(X))\right)$$

Data: $D = (X_1, Y_1), \ldots, (X_n, Y_n) \sim F$

(Conditional) misclassification error: $R(\hat\eta) = P(Y \neq \hat\eta(X) \mid D)$

Misclassification excess risk: $\mathcal{E}(\hat\eta, \eta^*) = E R(\hat\eta) - R(\eta^*)$
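To make these quantities concrete, here is a minimal Monte Carlo sketch. The logistic form of $p(x)$, the coefficients and the normal design are illustrative assumptions; it estimates the Bayes risk $E_X \min(p(X), 1 - p(X))$ and shows that an arbitrary competing linear classifier has nonnegative excess risk.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])  # hypothetical true coefficients of a logistic model

def p(X):
    """P(Y = 1 | X = x) = 1 / (1 + exp(-beta^t x)) under the assumed logistic model."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

# X ~ f(x): standard bivariate normal here (an arbitrary illustrative choice)
X = rng.standard_normal((200_000, 2))
pX = p(X)

# Bayes risk R(eta*) = E_X min(p(X), 1 - p(X))
bayes_risk = np.mean(np.minimum(pX, 1.0 - pX))

# Risk of another classifier, e.g. eta(x) = I(x_1 >= 0): given x, it errs
# with probability 1 - p(x) when it predicts 1, and p(x) when it predicts 0
eta = X[:, 0] >= 0
risk_eta = np.mean(np.where(eta, 1.0 - pX, pX))

print(f"R(eta*) = {bayes_risk:.4f}, R(eta) = {risk_eta:.4f}, "
      f"excess = {risk_eta - bayes_risk:.4f}")  # excess is always >= 0
```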
Vapnik-Chervonenkis (VC) dimension

Definition. Let $\mathcal{C}$ be a set of classifiers. $VC(\mathcal{C})$ is the maximal number of points in $\mathcal{X}$ that can be arbitrarily classified (shattered) by classifiers in $\mathcal{C}$.

Example: VC dimension of linear classifiers $\mathcal{C} = \{\eta(x) = I\{\beta^t x \geq 0\},\ \beta \in \mathbb{R}^d\}$
- $\mathcal{X} = \mathbb{R}^2$, $\mathcal{C} = \{\eta(x) = I\{\beta_0 + \beta_1 x_1 + \beta_2 x_2 \geq 0\}\}$: $VC(\mathcal{C}) = 3$ ($= d$)
- in general, $\mathcal{X} = \mathbb{R}^{d-1}$, $\beta \in \mathbb{R}^d$ (with $x_0 = 1$): $VC(\mathcal{C}) = d$

An empirical check of the $d = 3$ case is sketched below.
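A sketch of the shattering check, assuming scikit-learn is available; it uses a heavily under-penalized LogisticRegression as a stand-in for an exact linear-separability test, so it is a heuristic illustration rather than a proof.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression

def shattered(points):
    """Check whether linear classifiers I(b0 + b^t x >= 0) realize all labelings."""
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # constant labelings are trivially realizable via the sign of b0
        clf = LogisticRegression(C=1e9).fit(points, labels)  # ~unregularized fit
        if clf.score(points, labels) < 1.0:                  # not separable
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])           # non-collinear: shattered
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])    # XOR labeling is not separable
print(shattered(three), shattered(four))             # expected: True False
```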
Example: VC of sine classifiers: $\mathcal{X} = \mathbb{R}$, $\mathcal{C} = \{\eta(x) = I\{x \geq \sin(\theta x)\},\ \theta > 0\}$

Classifiers in $\mathcal{C}$ can arbitrarily classify any finite subset of points, hence $VC(\mathcal{C}) = \infty$
Minimax lower bound

Let $2 \leq VC(\mathcal{C}) < \infty$, $n \geq VC(\mathcal{C})$ and $R(\eta^*) > 0$. Then
$$\inf_{\tilde\eta}\ \sup_{\eta^* \in \mathcal{C},\, f(x)} \mathcal{E}(\tilde\eta, \eta^*) \geq C \sqrt{\frac{VC(\mathcal{C})}{n}}$$
(e.g., Devroye, Györfi and Lugosi '96). In particular, for linear classifiers
$$\inf_{\tilde\eta}\ \sup_{\eta^* \in \mathcal{C},\, f(x)} \mathcal{E}(\tilde\eta, \eta^*) \geq C \sqrt{\frac{d}{n}}$$
Two main approaches

1. Empirical Risk Minimization (ERM)
$$\hat\eta = \arg\min_{\eta \in \mathcal{C}} \hat R(\eta) = \arg\min_{\eta \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^n I(Y_i \neq \eta(x_i))$$
- well-developed theory (Devroye, Györfi and Lugosi '96; Vapnik '00; see also Boucheron, Bousquet and Lugosi '05 for a review):
$$\sup_{\eta^* \in \mathcal{C}} \mathcal{E}(\hat\eta, \eta^*) \leq C \sqrt{\frac{VC(\mathcal{C})}{n}} \quad (\text{optimal order})$$
- computationally infeasible; various convex surrogates are used instead (e.g., SVM), as illustrated below
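For intuition, exact ERM is easy in a toy one-parameter class, which also shows why it does not scale: the sketch below (model and sample size are illustrative) minimizes the empirical 0-1 risk over threshold classifiers $I(x \geq t)$ by brute force over the finitely many thresholds that matter.

```python
import numpy as np

# Exact ERM is tractable here only because the class is one-parameter:
# C = { eta(x) = I(x >= t) : t in R }. For linear classifiers in R^d,
# exact minimization of the 0-1 empirical risk is computationally
# infeasible, which motivates convex surrogates such as the hinge loss.
rng = np.random.default_rng(1)
n = 500
X = rng.standard_normal(n)
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-3.0 * X)))  # illustrative logistic model

# Only thresholds between consecutive sorted points matter: n + 1 candidates
candidates = np.concatenate(([-np.inf], np.sort(X)))
emp_risks = np.array([np.mean(Y != (X >= t)) for t in candidates])
t_hat = candidates[emp_risks.argmin()]
print(f"ERM threshold: {t_hat:.3f}, empirical risk: {emp_risks.min():.3f}")
```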
2. Plug-in Classifiers
- estimate $p(x)$ from the data: parametric, e.g. logistic regression $\ln\frac{p(x)}{1 - p(x)} = \beta^t x$, or nonparametric (Yang '99; Koltchinskii and Beznosova '05; Audibert and Tsybakov '07)
- plug in: $\hat\eta(x) = I(\hat p(x) \geq 1/2)$

Logistic regression classifier
1. $\ln\frac{p(x)}{1 - p(x)} = \beta^t x$
2. estimate $\beta$ by MLE
3. plug in: $\hat\eta(x) = I(\hat p(x) \geq 1/2) = I(\hat\beta^t x \geq 0)$ – a linear classifier
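A minimal sketch of these three steps, assuming scikit-learn; a very large C makes its default ridge penalty negligible, so the fit approximates the plain MLE. The true coefficients and design are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
beta = np.array([1.0, -2.0])                       # hypothetical true coefficients
X_tr = rng.standard_normal((500, 2))
Y_tr = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_tr @ beta)))

# Step 2: estimate beta by (approximately unpenalized) MLE
mle = LogisticRegression(C=1e6, fit_intercept=False).fit(X_tr, Y_tr)
beta_hat = mle.coef_.ravel()

# Step 3: the plug-in rule I(p_hat(x) >= 1/2) is the linear classifier
# I(beta_hat^t x >= 0) -- both give identical decisions
x_new = np.array([[0.5, 0.2]])
print(int((x_new @ beta_hat >= 0).item()), mle.predict(x_new)[0])
```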
Big Data era – curse of dimensionality

For large $d$, classification without feature (model) selection is as bad as pure random guessing (e.g., Bickel and Levina '04; Fan and Fan '08)

Sparse logistic regression classifier
1. model/feature selection – $\hat M$
2. plug in: $\hat\eta_{\hat M}(x) = I(\hat\beta_{\hat M}^t x \geq 0)$
Sparse logistic regression

$(X, Y) \sim F$: $Y \mid X = x \sim B(1, p(x))$, $X \in \mathbb{R}^d \sim f(x)$
$$\mathrm{logit}(p(x)) = \ln\frac{p(x)}{1 - p(x)} = \beta^t x$$
sparsity assumption: $||\beta||_0 \leq d_0$

Lemma (thanks to Noga Alon)
Let $\mathcal{C}(d_0) = \{\eta(x) = I\{\beta^t x \geq 0\} : \beta \in \mathbb{R}^d,\ ||\beta||_0 \leq d_0\}$. Then
$$d_0 \log_2\left(\frac{2d}{d_0}\right) \leq VC(\mathcal{C}(d_0)) \leq 2 d_0 \log_2\left(\frac{de}{d_0}\right), \quad \text{i.e.}\quad VC(\mathcal{C}(d_0)) \sim d_0 \ln\left(\frac{de}{d_0}\right)$$
Model/feature selection by penalized MLE

For a given model $M \subseteq \{1, \ldots, d\}$, the MLE is
$$\hat\beta_M = \arg\max_{\beta \in \mathcal{B}_M} \sum_{i=1}^n \left\{\beta_M^t x_i\, Y_i - \ln\left(1 + \exp(\beta_M^t x_i)\right)\right\},$$
where $\mathcal{B}_M = \{\beta \in \mathbb{R}^d : \beta_j = 0 \text{ iff } j \notin M\}$

Select the model by penalized maximum likelihood:
$$\hat M = \arg\min_M \left\{\sum_{i=1}^n \left(\ln\left(1 + \exp(\hat\beta_M^t x_i)\right) - \hat\beta_M^t x_i\, Y_i\right) + Pen(|M|)\right\}$$
and classify by the plug-in rule
$$\hat p_{\hat M}(x) = \frac{\exp(\hat\beta_{\hat M}^t x)}{1 + \exp(\hat\beta_{\hat M}^t x)}, \qquad \hat\eta_{\hat M}(x) = I(\hat p_{\hat M}(x) \geq 1/2) = I(\hat\beta_{\hat M}^t x \geq 0)$$
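A minimal sketch of this procedure, assuming scikit-learn and searching all models exhaustively, which is feasible only for small $d$; the RIC-type penalty and the data-generating model are illustrative choices.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def neg_loglik(X, Y, beta):
    """Negative logistic log-likelihood: sum_i ln(1 + exp(beta^t x_i)) - Y_i beta^t x_i."""
    xb = X @ beta
    return np.sum(np.logaddexp(0.0, xb) - Y * xb)

def select_model(X, Y, pen):
    """Exhaustive penalized-MLE model selection over all nonempty M."""
    n, d = X.shape
    best_M, best_crit = (), np.inf
    for k in range(1, d + 1):
        for M in combinations(range(d), k):
            cols = list(M)
            # large C ~ unpenalized MLE restricted to the model M
            mle = LogisticRegression(C=1e6, fit_intercept=False).fit(X[:, cols], Y)
            beta_M = np.zeros(d)
            beta_M[cols] = mle.coef_.ravel()
            crit = neg_loglik(X, Y, beta_M) + pen(k)
            if crit < best_crit:
                best_M, best_crit = M, crit
    return best_M

rng = np.random.default_rng(3)
n, d = 300, 8
beta = np.zeros(d)
beta[[0, 3]] = [2.0, -1.5]                         # sparse truth: d0 = 2
X = rng.standard_normal((n, d))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

# RIC-type linear penalty Pen(|M|) = |M| ln d (one illustrative choice)
print(select_model(X, Y, lambda k: k * np.log(d)))  # ideally recovers (0, 3)
```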
Complexity Penalties

Linear-type penalties $Pen(|M|) = \lambda |M|$:
- $\lambda = 1$: AIC (Akaike '73)
- $\lambda = \ln(n)/2$: BIC (Schwarz '78)
- $\lambda = \ln d$: RIC (Foster and George '94)

$k \ln(d/k)$-type nonlinear penalties $Pen(|M|) \sim C |M| \ln(de/|M|)$ (Birgé and Massart '01, '07; Bunea et al. '07; AG '10 for Gaussian regression; AG '16 for GLM):
$$k \ln(d/k) \sim \ln\binom{d}{k}, \quad \text{the log-number of models of size } k$$
In addition, for classification, $k \ln(d/k) \sim VC(\mathcal{C}(k))$ (recall the Lemma)
Various complexity penalties

[Figure: $Pen(k)$ as a function of the model size $k$ for the AIC, RIC and $2k\ln(de/k)$ penalties]
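A short numerical sketch of these penalty curves ($d = 1000$ is an arbitrary choice; requires SciPy), which also checks the classical bound $\ln\binom{d}{k} \leq k \ln(de/k)$ relating the nonlinear penalty to the log-number of models of size $k$.

```python
import numpy as np
from scipy.special import gammaln

d = 1000
k = np.arange(1, d + 1)

aic = k * 1.0                              # Pen(k) = k
ric = k * np.log(d)                        # Pen(k) = k ln d
nonlinear = 2 * k * np.log(d * np.e / k)   # Pen(k) = 2k ln(de/k)

# ln C(d, k): log of the number of models of size k, via log-gamma
log_n_models = gammaln(d + 1) - gammaln(k + 1) - gammaln(d - k + 1)

# the classical bound ln C(d, k) <= k ln(de/k) holds for every k
assert np.all(log_n_models <= k * np.log(d * np.e / k) + 1e-9)
print(aic[9], ric[9], nonlinear[9], log_n_models[9])  # values at k = 10
```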
Let $\mathrm{supp}(f(x))$ be bounded; w.l.o.g. $||x||_2 \leq 1$ for all $x \in \mathcal{X}$

Assumption (boundedness)
There exists $0 < \delta < 1/2$ such that $\delta < p(x) < 1 - \delta$ or, equivalently, there exists $C_0 > 0$ such that $|\beta^t x| < C_0$ for all $x \in \mathcal{X}$.

The assumption prevents the variance $Var(Y \mid X = x) = p(x)(1 - p(x))$ from getting arbitrarily close to zero.