XLII Conference on Mathematical Statistics
CCnet: joint multi-label classification and feature selection using classifier chains and elastic net regularization
Paweł Teisseyre
Institute of Computer Science, Polish Academy of Sciences
Outline
◮ Multi-label classification.
◮ Novel method: CCnet.
◮ Theoretical results.
◮ Results of experiments.
Single-label (binary) classification:

x_1   x_2   ...   x_p   y
1.0   2.2   ...   4.2   1
2.4   1.3   ...   3.1   1
0.9   1.4   ...   3.2   0
...   ...   ...   ...   ...
1.7   3.5   ...   4.2   0
3.9   2.5   ...   4.1   ?

Table: Single-label classification.

◮ $y \in \{0, 1\}$: target variable (label).
◮ $x = (x_1, \ldots, x_p)^T$: vector of explanatory variables (features).

TASK: build a model which predicts $y$ using $x$.
Multi-label classification:

x_1   x_2   ...   x_p   y_1   y_2   ...   y_K
1.0   2.2   ...   4.2   1     0     ...   1
2.4   1.3   ...   3.1   1     0     ...   1
0.9   1.4   ...   3.2   0     0     ...   1
...   ...   ...   ...   ...   ...   ...   ...
1.7   3.5   ...   4.2   0     1     ...   0
3.9   2.5   ...   4.1   ?     ?     ...   ?

Table: Multi-label classification.

◮ $y = (y_1, \ldots, y_K)^T$: vector of target variables (labels).
◮ $x = (x_1, \ldots, x_p)^T$: vector of explanatory variables (features).

TASK: build a model which predicts $y$ using $x$.
Motivating example: modelling multimorbidity¹

BMI   Weight   Glucose   ...   Diabetes   Hypotension   Liver disease   ...
31    84       10        ...   1          0             1               ...
26    63       6         ...   1          0             0               ...
27    60       7         ...   0          0             0               ...

Features x: characteristics of patients. Labels y: occurrences of diseases.

◮ Task 1: predict which diseases are likely to occur based on patient characteristics (PREDICTION).
◮ Task 2: select features that influence the occurrence of diseases (FEATURE SELECTION).

¹ Multimorbidity: co-occurrence of two or more diseases in one patient.
Multi-label classification
Standard approach:
1. Estimate the posterior probability $p(y \mid x)$.
2. Make a prediction for a new observation $x_0$:
$$\hat{y}(x_0) = \arg\max_{y \in \{0,1\}^K} \hat{p}(y \mid x_0),$$
where $\hat{p}(y \mid x_0)$ is the estimated probability.

Both steps are more difficult than in single-label classification: $y$ ranges over $2^K$ label vectors rather than two values.
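The prediction step above can be illustrated with a brute-force search over all $2^K$ label vectors. This is only a minimal sketch: `predict_joint_mode` and `toy_posterior` are hypothetical names, and the toy posterior (independent labels with probabilities read directly from `x0`) is an assumption made purely for illustration.

```python
from itertools import product

def predict_joint_mode(p_hat, x0, K):
    """Return the label vector in {0,1}^K that maximizes p_hat(y, x0)."""
    # Exhaustive search: evaluates the estimated posterior 2**K times.
    return max(product((0, 1), repeat=K), key=lambda y: p_hat(y, x0))

def toy_posterior(y, x0):
    # Assumed toy model: labels independent, P(y_k = 1 | x0) = x0[k].
    prob = 1.0
    for y_k, q_k in zip(y, x0):
        prob *= q_k if y_k == 1 else 1.0 - q_k
    return prob

y_hat = predict_joint_mode(toy_posterior, x0=(0.9, 0.2, 0.6), K=3)
```

The search cost doubles with every additional label, which is one reason the prediction step is harder than in the single-label case.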
Posterior probability estimation
Possible approaches:
1. Direct modelling of the posterior probability: assume some parametric form of $p(y \mid x)$, e.g. the Ising model.
2. Binary Relevance: assume conditional independence of labels:
$$p(y \mid x) = p(y_1, \ldots, y_K \mid x) = \prod_{k=1}^{K} p(y_k \mid x)$$
and estimate the marginal probabilities.
3. Classifier chains: use the chain rule
$$p(y \mid x) = p(y_1, \ldots, y_K \mid x) = p(y_1 \mid x) \prod_{k=2}^{K} p(y_k \mid x, y_1, \ldots, y_{k-1})$$
and estimate the conditional probabilities.
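The difference between approaches 2 and 3 can be checked on a small example. The sketch below uses a made-up joint distribution over two dependent labels for one fixed x (the numbers are an assumption for illustration only): the Binary Relevance product of marginals misses the dependence, while the chain-rule factorization reproduces the joint exactly.

```python
# Assumed toy joint distribution p(y1, y2 | x) for one fixed x; labels are dependent.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(k, value):
    """p(y_k = value | x), obtained by summing the joint over the other label."""
    return sum(p for y, p in joint.items() if y[k] == value)

def binary_relevance(y):
    # Product of marginals: exact only under conditional independence.
    return marginal(0, y[0]) * marginal(1, y[1])

def classifier_chain(y):
    # Chain rule: p(y1 | x) * p(y2 | x, y1).
    p_y1 = marginal(0, y[0])
    p_y2_given_y1 = joint[y] / p_y1
    return p_y1 * p_y2_given_y1

br = {y: binary_relevance(y) for y in joint}
cc = {y: classifier_chain(y) for y in joint}
```

Here every entry of `br` equals 0.25, whereas `cc` matches `joint`; in practice the conditional probabilities are estimated from data rather than read off a known joint.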
CCnet
1. Use the chain rule for the posterior probability:
$$p(y \mid x, \theta) = p(y_1 \mid x, \theta_1) \prod_{k=2}^{K} p(y_k \mid y_{-k}, x, \theta_k),$$
where $y_{-k} = (y_1, \ldots, y_{k-1})^T$ and $\theta = (\theta_1, \ldots, \theta_K)^T$.
2. Assume that the conditional probabilities are of the logistic form.
3. Use the penalized maximum likelihood method (with elastic-net penalty) to estimate the parameters $\theta_k$:
$$\hat{\theta}_k = \arg\min_{\theta_k} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log p\big(y_k^{(i)} \mid y_{-k}^{(i)}, x^{(i)}, \theta_k\big) + \lambda_1 \|\theta_k\|_1 + \lambda_2 \|\theta_k\|_2^2 \right\},$$
for $k = 1, \ldots, K$ (with $y_{-1}$ empty), based on training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$.
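The three steps above can be sketched in plain Python. This is not the author's implementation. The solver (proximal gradient descent with soft-thresholding), the absence of an intercept, the step size, the iteration count, the toy data, and all function names are assumptions chosen to keep the example short.

```python
import math

def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of t * |v| (handles the l1 part of the penalty).
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_elastic_net_logistic(Z, y, lam1, lam2, step=0.1, n_iter=2000):
    """Minimize -(1/n) sum_i log p(y_i | z_i, theta)
    + lam1 * ||theta||_1 + lam2 * ||theta||_2^2 by proximal gradient descent."""
    n, d = len(Z), len(Z[0])
    theta = [0.0] * d
    for _ in range(n_iter):
        # Gradient of the smooth part: logistic loss plus the squared l2 term.
        grad = [2.0 * lam2 * t for t in theta]
        for z_i, y_i in zip(Z, y):
            resid = sigmoid(sum(t * z for t, z in zip(theta, z_i))) - y_i
            for j in range(d):
                grad[j] += resid * z_i[j] / n
        theta = [soft_threshold(t - step * g, step * lam1)
                 for t, g in zip(theta, grad)]
    return theta

def fit_chain(X, Y, lam1, lam2):
    """Fit one penalized logistic model per label; model k uses x and y_1..y_{k-1}."""
    thetas = []
    for k in range(len(Y[0])):
        Z = [list(x_i) + list(y_i[:k]) for x_i, y_i in zip(X, Y)]
        y_k = [y_i[k] for y_i in Y]
        thetas.append(fit_elastic_net_logistic(Z, y_k, lam1, lam2))
    return thetas

def chain_probability(thetas, x, y):
    """Evaluate p(y | x, theta) as the chain-rule product of logistic terms."""
    prob = 1.0
    for k, theta in enumerate(thetas):
        z = list(x) + list(y[:k])
        p1 = sigmoid(sum(t * v for t, v in zip(theta, z)))
        prob *= p1 if y[k] == 1 else 1.0 - p1
    return prob

# Made-up training data: first feature is a constant 1, K = 2 labels.
X = [(1.0, 0.5), (1.0, -1.2), (1.0, 2.0), (1.0, -0.3), (1.0, 1.1), (1.0, -2.0)]
Y = [(1, 1), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0)]
thetas = fit_chain(X, Y, lam1=0.01, lam2=0.01)
```

Because of the l1 term, coefficients of uninformative features (and of uninformative preceding labels) can be set exactly to zero, which is how the fit doubles as feature selection.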
Theoretical results
◮ Stability of CCnet with respect to subset loss: a small perturbation in the training data does not substantially affect the value of the subset loss function.
◮ Generalization error bound for CCnet: we use an idea described in Bousquet & Elisseeff (JMLR 2002) to show that the difference between the expected error and the empirical error can be bounded by a term related to the stability.
Loss functions
Let:
$$g(x, y, \theta) = p(y \mid x, \theta) - \max_{y' \neq y} p(y' \mid x, \theta).$$
◮ Subset loss (equal to 0 if all labels are predicted correctly and 1 otherwise):
$$l(x, y, \theta) = \begin{cases} 1 & \text{if } g(x, y, \theta) < 0 \\ 0 & \text{if } g(x, y, \theta) \geq 0. \end{cases} \qquad (1)$$
◮ Modification of subset loss:
$$l_\gamma(x, y, \theta) = \begin{cases} 1 & \text{if } g(x, y, \theta) < 0 \\ 1 - g(x, y, \theta)/\gamma & \text{if } 0 \leq g(x, y, \theta) < \gamma \\ 0 & \text{if } g(x, y, \theta) \geq \gamma. \end{cases}$$
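The two losses can be written down directly from the definitions above. The sketch below is illustrative only: the function names are hypothetical, the margin is computed by enumerating all alternative label vectors, and the toy posterior is a made-up distribution over K = 2 labels.

```python
from itertools import product

def margin(p, x, y, K):
    """g(x, y, theta) = p(y | x) - max over y' != y of p(y' | x).
    `p` is any callable posterior p(y, x); theta is implicit in it."""
    y = tuple(y)
    best_other = max(p(y_alt, x)
                     for y_alt in product((0, 1), repeat=K) if y_alt != y)
    return p(y, x) - best_other

def subset_loss(g):
    # 1 if the true label vector is not the most probable one, 0 otherwise.
    return 1.0 if g < 0 else 0.0

def modified_subset_loss(g, gamma):
    # Piecewise-linear version: decreases from 1 to 0 as g goes from 0 to gamma.
    if g < 0:
        return 1.0
    if g < gamma:
        return 1.0 - g / gamma
    return 0.0

# Assumed toy posterior over K = 2 labels (x is ignored).
toy = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.6}
g_correct = margin(lambda y, x: toy[y], None, (1, 1), K=2)
g_wrong = margin(lambda y, x: toy[y], None, (0, 1), K=2)
```

For the toy posterior the margin of the most probable vector (1, 1) is about 0.4, so its subset loss is 0 while its modified loss with gamma = 0.8 is about 0.5: the modified loss also penalizes correct predictions made with a small margin.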