High-dimensional classification by sparse logistic regression

Felix Abramovich, Tel Aviv University
(based on joint work with Vadim Grinshtein, The Open University of Israel, and Tomer Levy, Tel Aviv University)
Outline

1. Review of (binary) classification
2. High-dimensional (binary) classification by sparse logistic regression
   - model, feature selection by penalized maximum likelihood
   - theory: misclassification excess bounds, adaptive minimax classifiers
   - computational issues: logistic Lasso and Slope
3. Multiclass extensions
   - sparse multinomial logistic regression
   - theory
   - multinomial logistic group Lasso and Slope
Binary Classification

$(X, Y) \sim F$: $Y \mid X = x \sim B(1, p(x))$, $X \in \mathbb{R}^d \sim f(x)$

Classifier: $\eta: \mathbb{R}^d \to \{0, 1\}$

Misclassification error: $R(\eta) = P(Y \neq \eta(X))$

Bayes classifier: $\eta^*(x) = \arg\min_\eta R(\eta)$:
$$\eta^*(x) = I\{p(x) \geq 1/2\}, \qquad R(\eta^*) = E_X\left(\min(p(X), 1 - p(X))\right)$$

Data: $D = (X_1, Y_1), \ldots, (X_n, Y_n) \sim F$

(Conditional) misclassification error: $R(\hat\eta) = P(Y \neq \hat\eta(X) \mid D)$

Misclassification excess risk: $\mathcal{E}(\hat\eta, \eta^*) = E R(\hat\eta) - R(\eta^*)$
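To make these quantities concrete, here is a minimal Monte Carlo sketch. The logistic form of $p(x)$, the coefficients and the normal design are illustrative assumptions; it estimates the Bayes risk $E_X \min(p(X), 1 - p(X))$ and shows that an arbitrary competing linear classifier has nonnegative excess risk.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])  # hypothetical true coefficients of a logistic model

def p(X):
    """P(Y = 1 | X = x) = 1 / (1 + exp(-beta^t x)) under the assumed logistic model."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

# X ~ f(x): standard bivariate normal here (an arbitrary illustrative choice)
X = rng.standard_normal((200_000, 2))
pX = p(X)

# Bayes risk R(eta*) = E_X min(p(X), 1 - p(X))
bayes_risk = np.mean(np.minimum(pX, 1.0 - pX))

# Risk of another classifier, e.g. eta(x) = I(x_1 >= 0): given x, it errs
# with probability 1 - p(x) when it predicts 1, and p(x) when it predicts 0
eta = X[:, 0] >= 0
risk_eta = np.mean(np.where(eta, 1.0 - pX, pX))

print(f"R(eta*) = {bayes_risk:.4f}, R(eta) = {risk_eta:.4f}, "
      f"excess = {risk_eta - bayes_risk:.4f}")  # excess is always >= 0
```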
Vapnik-Chervonenkis (VC) dimension

Definition. Let $\mathcal{C}$ be a set of classifiers. $VC(\mathcal{C})$ is the maximal number of points in $\mathcal{X}$ that can be arbitrarily classified (shattered) by classifiers in $\mathcal{C}$.

Example: VC dimension of linear classifiers $\mathcal{C} = \{\eta(x) = I\{\beta^t x \geq 0\},\ \beta \in \mathbb{R}^d\}$
- $\mathcal{X} = \mathbb{R}^2$, $\mathcal{C} = \{\eta(x) = I\{\beta_0 + \beta_1 x_1 + \beta_2 x_2 \geq 0\}\}$: $VC(\mathcal{C}) = 3$ ($= d$)
- in general, $\mathcal{X} = \mathbb{R}^{d-1}$, $\beta \in \mathbb{R}^d$ (with $x_0 = 1$): $VC(\mathcal{C}) = d$

An empirical check of the $d = 3$ case is sketched below.
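A sketch of the shattering check, assuming scikit-learn is available; it uses a heavily under-penalized LogisticRegression as a stand-in for an exact linear-separability test, so it is a heuristic illustration rather than a proof.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression

def shattered(points):
    """Check whether linear classifiers I(b0 + b^t x >= 0) realize all labelings."""
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # constant labelings are trivially realizable via the sign of b0
        clf = LogisticRegression(C=1e9).fit(points, labels)  # ~unregularized fit
        if clf.score(points, labels) < 1.0:                  # not separable
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])           # non-collinear: shattered
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])    # XOR labeling is not separable
print(shattered(three), shattered(four))             # expected: True False
```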
Example: VC of sine classifiers: $\mathcal{X} = \mathbb{R}$, $\mathcal{C} = \{\eta(x) = I\{x \geq \sin(\theta x)\},\ \theta > 0\}$

Classifiers in $\mathcal{C}$ can arbitrarily classify any finite subset of points, hence $VC(\mathcal{C}) = \infty$
Minimax lower bound

Let $2 \leq VC(\mathcal{C}) < \infty$, $n \geq VC(\mathcal{C})$ and $R(\eta^*) > 0$. Then
$$\inf_{\tilde\eta}\ \sup_{\eta^* \in \mathcal{C},\, f(x)} \mathcal{E}(\tilde\eta, \eta^*) \geq C \sqrt{\frac{VC(\mathcal{C})}{n}}$$
(e.g., Devroye, Györfi and Lugosi '96). In particular, for linear classifiers
$$\inf_{\tilde\eta}\ \sup_{\eta^* \in \mathcal{C},\, f(x)} \mathcal{E}(\tilde\eta, \eta^*) \geq C \sqrt{\frac{d}{n}}$$
Two main approaches

1. Empirical Risk Minimization (ERM)
$$\hat\eta = \arg\min_{\eta \in \mathcal{C}} \hat R(\eta) = \arg\min_{\eta \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^n I(Y_i \neq \eta(x_i))$$
- well-developed theory (Devroye, Györfi and Lugosi '96; Vapnik '00; see also Boucheron, Bousquet and Lugosi '05 for a review):
$$\sup_{\eta^* \in \mathcal{C}} \mathcal{E}(\hat\eta, \eta^*) \leq C \sqrt{\frac{VC(\mathcal{C})}{n}} \quad (\text{optimal order})$$
- computationally infeasible; various convex surrogates are used instead (e.g., SVM), as illustrated below
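For intuition, exact ERM is easy in a toy one-parameter class, which also shows why it does not scale: the sketch below (model and sample size are illustrative) minimizes the empirical 0-1 risk over threshold classifiers $I(x \geq t)$ by brute force over the finitely many thresholds that matter.

```python
import numpy as np

# Exact ERM is tractable here only because the class is one-parameter:
# C = { eta(x) = I(x >= t) : t in R }. For linear classifiers in R^d,
# exact minimization of the 0-1 empirical risk is computationally
# infeasible, which motivates convex surrogates such as the hinge loss.
rng = np.random.default_rng(1)
n = 500
X = rng.standard_normal(n)
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-3.0 * X)))  # illustrative logistic model

# Only thresholds between consecutive sorted points matter: n + 1 candidates
candidates = np.concatenate(([-np.inf], np.sort(X)))
emp_risks = np.array([np.mean(Y != (X >= t)) for t in candidates])
t_hat = candidates[emp_risks.argmin()]
print(f"ERM threshold: {t_hat:.3f}, empirical risk: {emp_risks.min():.3f}")
```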
2. Plug-in Classifiers
- estimate $p(x)$ from the data: parametric, e.g. logistic regression $\ln\frac{p(x)}{1 - p(x)} = \beta^t x$, or nonparametric (Yang '99; Koltchinskii and Beznosova '05; Audibert and Tsybakov '07)
- plug in: $\hat\eta(x) = I(\hat p(x) \geq 1/2)$

Logistic regression classifier
1. $\ln\frac{p(x)}{1 - p(x)} = \beta^t x$
2. estimate $\beta$ by MLE
3. plug in: $\hat\eta(x) = I(\hat p(x) \geq 1/2) = I(\hat\beta^t x \geq 0)$ – a linear classifier
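A minimal sketch of these three steps, assuming scikit-learn; a very large C makes its default ridge penalty negligible, so the fit approximates the plain MLE. The true coefficients and design are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
beta = np.array([1.0, -2.0])                       # hypothetical true coefficients
X_tr = rng.standard_normal((500, 2))
Y_tr = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_tr @ beta)))

# Step 2: estimate beta by (approximately unpenalized) MLE
mle = LogisticRegression(C=1e6, fit_intercept=False).fit(X_tr, Y_tr)
beta_hat = mle.coef_.ravel()

# Step 3: the plug-in rule I(p_hat(x) >= 1/2) is the linear classifier
# I(beta_hat^t x >= 0) -- both give identical decisions
x_new = np.array([[0.5, 0.2]])
print(int((x_new @ beta_hat >= 0).item()), mle.predict(x_new)[0])
```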
Big Data era – curse of dimensionality

For large $d$, classification without feature (model) selection is as bad as pure random guessing (e.g., Bickel and Levina '04; Fan and Fan '08)

Sparse logistic regression classifier
1. model/feature selection – $\hat M$
2. plug in: $\hat\eta_{\hat M}(x) = I(\hat\beta_{\hat M}^t x \geq 0)$
Sparse logistic regression

$(X, Y) \sim F$: $Y \mid X = x \sim B(1, p(x))$, $X \in \mathbb{R}^d \sim f(x)$
$$\mathrm{logit}(p(x)) = \ln\frac{p(x)}{1 - p(x)} = \beta^t x$$
sparsity assumption: $||\beta||_0 \leq d_0$

Lemma (thanks to Noga Alon)
Let $\mathcal{C}(d_0) = \{\eta(x) = I\{\beta^t x \geq 0\} : \beta \in \mathbb{R}^d,\ ||\beta||_0 \leq d_0\}$. Then
$$d_0 \log_2\left(\frac{2d}{d_0}\right) \leq VC(\mathcal{C}(d_0)) \leq 2 d_0 \log_2\left(\frac{de}{d_0}\right), \quad \text{i.e.}\quad VC(\mathcal{C}(d_0)) \sim d_0 \ln\left(\frac{de}{d_0}\right)$$
Model/feature selection by penalized MLE

For a given model $M \subseteq \{1, \ldots, d\}$, the MLE is
$$\hat\beta_M = \arg\max_{\beta \in \mathcal{B}_M} \sum_{i=1}^n \left\{\beta_M^t x_i\, Y_i - \ln\left(1 + \exp(\beta_M^t x_i)\right)\right\},$$
where $\mathcal{B}_M = \{\beta \in \mathbb{R}^d : \beta_j = 0 \text{ iff } j \notin M\}$

Select the model by penalized maximum likelihood:
$$\hat M = \arg\min_M \left\{\sum_{i=1}^n \left(\ln\left(1 + \exp(\hat\beta_M^t x_i)\right) - \hat\beta_M^t x_i\, Y_i\right) + Pen(|M|)\right\}$$
and classify by the plug-in rule
$$\hat p_{\hat M}(x) = \frac{\exp(\hat\beta_{\hat M}^t x)}{1 + \exp(\hat\beta_{\hat M}^t x)}, \qquad \hat\eta_{\hat M}(x) = I(\hat p_{\hat M}(x) \geq 1/2) = I(\hat\beta_{\hat M}^t x \geq 0)$$
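A minimal sketch of this procedure, assuming scikit-learn and searching all models exhaustively, which is feasible only for small $d$; the RIC-type penalty and the data-generating model are illustrative choices.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def neg_loglik(X, Y, beta):
    """Negative logistic log-likelihood: sum_i ln(1 + exp(beta^t x_i)) - Y_i beta^t x_i."""
    xb = X @ beta
    return np.sum(np.logaddexp(0.0, xb) - Y * xb)

def select_model(X, Y, pen):
    """Exhaustive penalized-MLE model selection over all nonempty M."""
    n, d = X.shape
    best_M, best_crit = (), np.inf
    for k in range(1, d + 1):
        for M in combinations(range(d), k):
            cols = list(M)
            # large C ~ unpenalized MLE restricted to the model M
            mle = LogisticRegression(C=1e6, fit_intercept=False).fit(X[:, cols], Y)
            beta_M = np.zeros(d)
            beta_M[cols] = mle.coef_.ravel()
            crit = neg_loglik(X, Y, beta_M) + pen(k)
            if crit < best_crit:
                best_M, best_crit = M, crit
    return best_M

rng = np.random.default_rng(3)
n, d = 300, 8
beta = np.zeros(d)
beta[[0, 3]] = [2.0, -1.5]                         # sparse truth: d0 = 2
X = rng.standard_normal((n, d))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

# RIC-type linear penalty Pen(|M|) = |M| ln d (one illustrative choice)
print(select_model(X, Y, lambda k: k * np.log(d)))  # ideally recovers (0, 3)
```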
Complexity Penalties

Linear-type penalties $Pen(|M|) = \lambda |M|$:
- $\lambda = 1$: AIC (Akaike '73)
- $\lambda = \ln(n)/2$: BIC (Schwarz '78)
- $\lambda = \ln d$: RIC (Foster and George '94)

$k \ln(d/k)$-type nonlinear penalties $Pen(|M|) \sim C |M| \ln(de/|M|)$ (Birgé and Massart '01, '07; Bunea et al. '07; AG '10 for Gaussian regression; AG '16 for GLM):
$$k \ln(d/k) \sim \ln\binom{d}{k}, \quad \text{the log-number of models of size } k$$
In addition, for classification, $k \ln(d/k) \sim VC(\mathcal{C}(k))$ (recall the Lemma)
Various complexity penalties

[Figure: $Pen(k)$ as a function of the model size $k$ for the AIC, RIC and $2k\ln(de/k)$ penalties]
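A short numerical sketch of these penalty curves ($d = 1000$ is an arbitrary choice; requires SciPy), which also checks the classical bound $\ln\binom{d}{k} \leq k \ln(de/k)$ relating the nonlinear penalty to the log-number of models of size $k$.

```python
import numpy as np
from scipy.special import gammaln

d = 1000
k = np.arange(1, d + 1)

aic = k * 1.0                              # Pen(k) = k
ric = k * np.log(d)                        # Pen(k) = k ln d
nonlinear = 2 * k * np.log(d * np.e / k)   # Pen(k) = 2k ln(de/k)

# ln C(d, k): log of the number of models of size k, via log-gamma
log_n_models = gammaln(d + 1) - gammaln(k + 1) - gammaln(d - k + 1)

# the classical bound ln C(d, k) <= k ln(de/k) holds for every k
assert np.all(log_n_models <= k * np.log(d * np.e / k) + 1e-9)
print(aic[9], ric[9], nonlinear[9], log_n_models[9])  # values at k = 10
```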
Let $\mathrm{supp}(f(x))$ be bounded; w.l.o.g. $||x||_2 \leq 1$ for all $x \in \mathcal{X}$

Assumption (boundedness)
There exists $0 < \delta < 1/2$ such that $\delta < p(x) < 1 - \delta$ or, equivalently, there exists $C_0 > 0$ such that $|\beta^t x| < C_0$ for all $x \in \mathcal{X}$.

The assumption prevents the variance $Var(Y \mid X = x) = p(x)(1 - p(x))$ from getting arbitrarily close to zero.