1  Redundant Feature Elimination for Multi-Class Problems

Annalisa Appice, Michelangelo Ceci
Dipartimento di Informatica, Università degli Studi di Bari, Italy
Simon Rawles, Peter Flach
Department of Computer Science, University of Bristol, UK
2  Redundant feature reduction

• REFER: an efficient, scalable, logic-based method for eliminating Boolean features which are redundant for multi-class classifier learning.
  – Why? Size of hypothesis space, predictive performance, model comprehensibility.
  – Distinct from feature selection.
3  Overview of this talk

• Redundant feature reduction
  – What is feature redundancy?
  – Doing multi-class reduction
• Related approaches
• Theoretical and experimental results
• Summary
• Current and future work
4  Example: Redundancy of features

        f1  f2  f3  class
   e1    1   1   0    a
   e2    0   1   0    a
   e3    0   0   0    a
   e4    0   0   0    b
   e5    1   0   0    b

A fixed number of Boolean features; each example has one of several class labels (‘multi-class’).
5  Discriminating a against b

(Table as on slide 4.)
True values in examples of class a make a feature better for distinguishing a from b in a classification rule.
6  Discriminating a against b

(Table as on slide 4.)
False values in examples of class b make a feature better for distinguishing a from b in a rule.
7  Discriminating a against b

(Table as on slide 4.)
f2 covers f1, and f3 is useless (never true on class a); hence f1 and f3 are redundant. Negated features are not automatically considered.
8  More formally...

For discriminating class a examples from class b:
• f covers g if T_a(g) ⊆ T_a(f) and F_b(g) ⊆ F_b(f).
• A feature is redundant if another feature covers it.

        f1  f2  class
   e1    1   1    a       T_a(f2) = {e1, e2}     T_a(f1) = {e1}
   e2    0   1    a       F_b(f2) = {e4, e5}     F_b(f1) = {e4}
   e3    0   0    a
   e4    0   0    b       (a is the ‘positive class’ here)
   e5    1   0    b
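A minimal executable sketch of this coverage test and the resulting redundancy elimination, assuming a 0/1 feature matrix X and a label vector y; the function names, data layout, and tie-break for mutually covering features are illustrative choices, not taken from the paper:

```python
import numpy as np

def covers(X, y, f, g, pos, neg):
    """f covers g for discriminating class `pos` from class `neg`:
    T_pos(g) ⊆ T_pos(f) and F_neg(g) ⊆ F_neg(f).
    X is an (examples x features) 0/1 matrix, y the class labels,
    f and g are column indices."""
    T = lambda j: set(np.where((y == pos) & (X[:, j] == 1))[0])
    F = lambda j: set(np.where((y == neg) & (X[:, j] == 0))[0])
    return T(g) <= T(f) and F(g) <= F(f)

def reduce_two_class(X, y, feats, pos, neg):
    """Drop every feature covered by another, keeping one of any
    mutually covering pair (a tie-break the slides leave open)."""
    kept = []
    for g in feats:
        if any(covers(X, y, f, g, pos, neg) for f in kept):
            continue                                   # g is redundant
        kept = [f for f in kept if not covers(X, y, g, f, pos, neg)]
        kept.append(g)                                 # g may supersede kept ones
    return kept

# The table from the example slides: only f2 (index 1) survives.
X = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0], [1, 0, 0]])
y = np.array(["a", "a", "a", "b", "b"])
print(reduce_two_class(X, y, [0, 1, 2], "a", "b"))     # -> [1]
```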
9  Neighbourhoods of examples

• A way to upgrade to multi-class data.
• Each class is partitioned into subsets of similar examples (neighbourhoods).
  – REFER-N finds non-redundant features between each pair of neighbourhoods of differing class, in turn.
  – It builds up the list of non-redundant features as it goes.
• Efficient, achieves more reduction, logic-based.
• (Code sketches of both steps follow the figures below.)
10–25  Neighbourhood construction

[Figure, built up over slides 10–25: the examples are grouped one neighbourhood at a time (numbered 1–5) into groups of similar examples with the same class label.]
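A sketch of the construction step shown in the figure, continuing the earlier code. The similarity criterion (Hamming distance to a seed of at most `max_dist`) and the deterministic seed choice are assumptions; the slides show the grouping but not the measure used:

```python
import numpy as np

def build_neighbourhoods(X, y, max_dist=2):
    """Greedily partition each class into groups of similar examples.
    Returns a list of (class_label, member_index_array) pairs."""
    neighbourhoods = []
    unassigned = set(range(len(y)))
    while unassigned:
        seed = min(unassigned)                 # assumed deterministic seed choice
        members = [i for i in unassigned
                   if y[i] == y[seed]                       # same class label
                   and np.sum(X[i] != X[seed]) <= max_dist] # similar to the seed
        neighbourhoods.append((y[seed], np.array(members)))
        unassigned -= set(members)
    return neighbourhoods
```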
26–28  Neighbourhood comparison

[Figure, built up over slides 26–28: all pairs of neighbourhoods of differing class are compared.]
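Putting the pieces together, a sketch of the REFER-N main loop: run the two-class reduction between every pair of neighbourhoods with differing class labels and accumulate the union of the surviving features. It reuses `covers`/`reduce_two_class` and `build_neighbourhoods` from the sketches above; ordering already-selected features first is one way to implement the stated preference for features already found non-redundant, while the union step and pair ordering are otherwise assumptions:

```python
from itertools import combinations
import numpy as np

def refer_n(X, y, max_dist=2):
    """REFER-N sketch: pairwise two-class reduction over neighbourhoods."""
    selected = set()
    pairs = combinations(build_neighbourhoods(X, y, max_dist), 2)
    for (ca, ia), (cb, ib) in pairs:
        if ca == cb:
            continue                          # only pairs of differing class
        idx = np.concatenate([ia, ib])        # examples of this neighbourhood pair
        # Prefer features already found non-redundant (they come first).
        feats = sorted(range(X.shape[1]), key=lambda j: j not in selected)
        selected |= set(reduce_two_class(X[idx], y[idx], feats, ca, cb))
    return sorted(selected)
```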
29  Ancestry of REFER

• REDUCE (Lavrač et al. 1999)
  – Feature reduction for propositionalised ILP datasets
  – Preserves learnability of a complete and consistent hypothesis
• REFER uses a variant of REDUCE
  – Redundant features are found between the examples in each neighbourhood pair
  – Prefers features already found non-redundant
30  Related multi-class filters

• FOCUS for noise-free Boolean data (Almuallim & Dietterich 1991)
  – Exhaustive evaluation of all feature subsets
  – Time complexity of O(n^p)
• SCRAP relevance filter (Raman 2003)
  – Also uses a neighbourhood approach
  – No guarantee that selected features (still) discriminate among all classes
31  Theoretical results

• REFER preserves the learnability of a complete and consistent theory.
  – If a complete and consistent rule existed in the original data, it still exists in the reduced data.
• REFER is efficient. Its time complexity is
  – linear in the number of examples,
  – quadratic in the number of features.
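Reading the two stated orders together (with m examples and p features), the overall cost has the form below; the combined expression is our inference from the slide's claims, not a formula stated on it:

```latex
% m = number of examples, p = number of features.
% Each coverage test is one pass over the examples, O(m),
% and at most O(p^2) ordered feature pairs are tested:
T_{\mathrm{REFER}}(m, p) = O\!\left(m \cdot p^{2}\right)
```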
32  Experimental results

• Mutagenesis data from SINUS
  – Feature set greatly reduced (13,118 → 44)
  – Accuracy still competitive (approx. 85%)

[Figure: number of reduced features (0–120) plotted against number of original features (0–15,000), for REFER vs. REDUCE.]
33  Experimental results

• Thirteen UCI benchmark datasets
  – Compared with LVF, CFS and ReliefF using discrete/discretised data
  – Generally conservative
  – Faster: fastest on 8 out of 13 datasets, very close on 3 more
  – Competitive predictive accuracy using several classifiers:

                 JRip   NB   C4.5   SVM
   Winner           6    7      3     6
   Within 1%        3    0      2     4
34  Experimental results

• Reuters-21578: large-scale, high-dimensionality, sparse data
  – 16,582 preprocessed features were reduced to 1,450.
• REFER supports parallel execution well: it runs in parallel on subsets of the feature set and then once more on the combination of the survivors (a sketch follows).
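A sketch of that parallel scheme, reusing the `refer_n` sketch from earlier. The chunking strategy and pool size are assumptions; only the split-reduce-recombine structure comes from the slide:

```python
from multiprocessing import Pool
import numpy as np

def _reduce_chunk(args):
    """Run REFER on one vertical slice of the data (a subset of features)."""
    X, y, feat_idx = args
    kept = refer_n(X[:, feat_idx], y)            # local column indices
    return [int(feat_idx[j]) for j in kept]      # map back to global indices

def refer_parallel(X, y, n_chunks=4):
    """Reduce disjoint feature subsets independently, then reduce once more
    on the combination of the survivors.
    (On Windows, call this under an `if __name__ == "__main__":` guard.)"""
    chunks = np.array_split(np.arange(X.shape[1]), n_chunks)
    with Pool(n_chunks) as pool:
        partial = pool.map(_reduce_chunk, [(X, y, c) for c in chunks])
    survivors = np.array(sorted(set().union(*partial)), dtype=int)
    final = refer_n(X[:, survivors], y)          # second pass on the combination
    return [int(survivors[j]) for j in final]
```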
35  Summary

• A method for eliminating redundant Boolean features for multi-class classification tasks.
• Uses logical coverage of examples.
• Efficient and scalable, requiring less time than the three feature selection algorithms we compared against.
• Amenable to parallel execution.
36  Current and future investigations

• Interaction between feature selection and feature reduction
  – Benefits of combining the two
• Noise handling using non-pure neighbourhoods (‘relaxed REFER’)
  – Overcoming sensitivity to noise
• REFER for example reduction
37  Questions
39  Average reduction on UCI data

[Figure: number of reduced features (0–160) plotted against number of original features (0–200), for LVF, CFS, ReliefF and REFER.]
40  Effect of choice of starting point

[Figure: two bar charts over the datasets Aud, Brid, Car, F1C, F1M, F3C, F3M, Mus, Nur, Post, Tic, Pim, Yea, showing the number of reduced features (top, 0–120) and the number of neighbourhoods constructed (bottom, 0–1000).]
41  Comparison of running times

                                               Time (s)
   Dataset         # instances  # features    LVF    CFS   ReliefF    REFER
   Audiology               398         184   3.37   0.80      3.84     0.72
   Bridge                  108          83   0.89   0.38      0.67     0.22
   Car                    1728          21   1.94   0.44     15.92     0.50
   Flare1066/C            1066          40   2.62   0.48     11.51     0.61
   Flare1066/M            1066          42   0.82   0.51     11.63     0.20
   Flare323/C              323          37   0.72   0.38      1.19     0.12
   Flare323/M              323          36   0.80   0.39      1.25     0.21
   Mushroom               8124         116  29.48   5.30   1838.36     1.66
   Nursery               12960          27  34.24   1.64   1038.31    20.38
   Post-operative           90          23   0.33   0.30      0.32     0.08
   Tic-tac-toe             950          27   1.03   0.37      5.49     0.20
   Pima                    768         120   12.2      1      14.1   2.6537
   Yeast                  1484         120     55   19.1      57.1  26.7132

Machine spec: Pentium IV 1.4 GHz PC running Windows XP.