Ph.D. THESIS
New approaches for improving data mining feature selection techniques
Supervised by: Pr. Y. Slimani, Pr. G. Goncalves
Elaborated by: M. A. Esseghir
21/09/06
M. T. Hsu
Outline
• Introduction
• Feature selection problem
• Existing approaches
• The proposed approaches
• Research topics and perspectives
• Conclusion
Introduction
• The ability of machines to store ever-growing volumes of data outpaces their ability to analyze them.
• Applied data mining techniques:
    • Computational cost: tied to the number of features
    • Classification accuracy: affected by high dimensionality
• Identification of representative features to build classification models.
Feature Selection (FS) problem
Definition 1: "Feature selection studies how to select a subset or list of attributes or variables that are used to construct models describing data." (Huan Liu, IEEE senior member)
Definition 2: A process that chooses an optimal subset of features according to a certain criterion.
Objectives
• Identify salient features.
• Discard irrelevant, redundant, and noisy data.
• Enhance model comprehensibility.
• Avoid model overfitting.
• Improve classification and response time (time and complexity) capabilities.
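
Definition 2 can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the slides: F denotes the full feature set and J the chosen evaluation criterion, taken here as a quantity to maximize.

```latex
S^{*} \;=\; \operatorname*{arg\,max}_{S \subseteq F} \; J(S)
```

Under this reading, a filter and a wrapper differ only in how J is computed: from intrinsic properties of the data, or from the performance of a learning algorithm trained on S.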
Existing Approaches
• Wrappers and Filters
• Filters: select subsets using their general characteristics (intrinsic properties).
    • Search: forward and backward search based on one criterion.
    • Dependency measures
    • Information measures
    • Consistency measures
• Wrappers: apply a learning algorithm to evaluate selected subsets.
    • Search: exhaustive, random, heuristic (GA, SA, HC, GrS).
    • Evaluation: ANN, ID3, C4.5, NB, SVM.
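
To make the filter/wrapper contrast concrete, here is a minimal sketch under stated assumptions. The filter scores each feature on its own by absolute Pearson correlation with the class label (one possible dependency measure), while the wrapper scores a whole subset by the leave-one-out accuracy of a 1-nearest-neighbour classifier. The toy data, the function names, and the choice of 1-NN are illustrative assumptions, not taken from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def filter_rank(X, y):
    """Filter: rank features independently by |correlation with the label|."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def wrapper_score(X, y, subset):
    """Wrapper: leave-one-out accuracy of 1-NN using only the given features."""
    if not subset:
        return 0.0
    correct = 0
    for i, row in enumerate(X):
        nearest, nearest_dist = None, float("inf")
        for k, other in enumerate(X):
            if k == i:
                continue  # leave the query sample out
            dist = sum((row[f] - other[f]) ** 2 for f in subset)
            if dist < nearest_dist:
                nearest, nearest_dist = k, dist
        correct += int(y[nearest] == y[i])
    return correct / len(X)

# Toy data (assumed): feature 0 is informative, 1 is noisy, 2 is constant.
X = [
    [0.1, 5.0, 1.0],
    [0.2, 1.0, 1.0],
    [0.3, 4.0, 1.0],
    [0.9, 1.5, 1.0],
    [1.0, 4.5, 1.0],
    [1.1, 0.5, 1.0],
]
y = [0, 0, 0, 1, 1, 1]

ranking = filter_rank(X, y)
informative = wrapper_score(X, y, [0])
noisy = wrapper_score(X, y, [1])
```

The filter never trains a model, so it is cheap but blind to the classifier; the wrapper pays for a full evaluation on every candidate subset.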
FS process
[Figure: FS process, attributed to Huan Liu; the labels that survive are "Exploration" and "Validation".]
Existing Approaches (2)
Filters
    Advantages:
    • Simple to implement
    • Low search cost, O(N^2)
    Drawbacks:
    • Weaker performance
    • Independent criterion
    • One feature at a time
Wrappers
    Advantages:
    • High-quality subsets
    • Improves classification
    • All features are considered
    Drawbacks:
    • Exponential exploration search (2^n)
    • High evaluation cost
    • Ill-suited to large data sets
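
The gap between the O(N^2) and 2^n search costs can be checked by counting subset evaluations. The sketch below assumes sequential forward selection run to completion as the one-feature-at-a-time search and full enumeration as the exhaustive search. `toy_score` is a hypothetical stand-in for any evaluation criterion; it is not from the slides.

```python
from itertools import combinations

def forward_selection(n_features, score):
    """Add one feature per step; return the best subset seen and the evaluation count."""
    selected, remaining = [], set(range(n_features))
    best_subset, best_score, evaluations = [], float("-inf"), 0
    while remaining:
        step_feature, step_score = None, float("-inf")
        for f in sorted(remaining):
            evaluations += 1
            s = score(selected + [f])
            if s > step_score:
                step_feature, step_score = f, s
        selected.append(step_feature)
        remaining.remove(step_feature)
        if step_score > best_score:
            best_subset, best_score = list(selected), step_score
    return best_subset, evaluations

def exhaustive_selection(n_features, score):
    """Score every non-empty subset; return the best one and the evaluation count."""
    best_subset, best_score, evaluations = [], float("-inf"), 0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            evaluations += 1
            s = score(list(subset))
            if s > best_score:
                best_subset, best_score = list(subset), s
    return best_subset, evaluations

def toy_score(subset):
    """Hypothetical criterion: features 1 and 3 are useful, each feature costs 0.1."""
    return len({1, 3} & set(subset)) - 0.1 * len(subset)

n = 6
fwd_subset, fwd_evals = forward_selection(n, toy_score)
exh_subset, exh_evals = exhaustive_selection(n, toy_score)
```

With n = 6 the forward search performs n(n+1)/2 = 21 evaluations against 2^6 - 1 = 63 for the exhaustive search. On this separable toy criterion both reach the same subset, which is not guaranteed in general when features interact.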
The proposed approaches
• Genetic Algorithm (GA)
    • Standard
    • Memetic algorithms: hybrid global + local search
• Parallel FS for high-dimensional data
    • Island model
    • Multi-agent system
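
A rough sketch of how a GA can be applied to feature selection, assuming a bit-mask encoding (bit i set means feature i is selected), tournament selection, one-point crossover, bit-flip mutation, and elitism. The optional memetic step applies one pass of bit-flip hill climbing to each child. All names, parameter values, and the `fitness` function are illustrative assumptions; the slides do not specify the operators used in the thesis.

```python
import random

def fitness(mask, useful=frozenset({1, 3, 4}), penalty=0.1):
    """Hypothetical stand-in for a wrapper evaluation: reward useful features, penalize size."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & useful) - penalty * len(chosen)

def hill_climb(mask, fit):
    """Local search: try flipping each bit once, keep a flip only if fitness improves."""
    mask = list(mask)
    current = fit(mask)
    for i in range(len(mask)):
        mask[i] ^= 1
        candidate = fit(mask)
        if candidate > current:
            current = candidate
        else:
            mask[i] ^= 1  # undo the flip
    return mask

def genetic_fs(n_features, fit, pop_size=20, generations=30,
               p_mut=0.05, memetic=False, seed=0):
    """Return the best mask found and the best fitness recorded at each generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fit, reverse=True)
        history.append(fit(pop[0]))
        next_pop = [list(pop[0])]  # elitism: carry the best individual over
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fit)  # tournament of size 3
            p2 = max(rng.sample(pop, 3), key=fit)
            cut = rng.randint(1, n_features - 1)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            if memetic:
                child = hill_climb(child, fit)
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fit)
    history.append(fit(best))
    return best, history

best, history = genetic_fs(8, fitness, memetic=True)
```

Because of elitism, the recorded best fitness never decreases from one generation to the next. The toy fitness is separable, so the local search alone solves it; a real wrapper fitness would not be that forgiving.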
The proposed approaches (2)
• Ant Colony Optimizer (ACO):
    • AS and ACS adaptation:
        • Complete graph
        • Nodes correspond to attributes
        • Polarized edges
• Hybrid search:
    • Combining wrappers and filters
    • Correlation-guided search
    • Discarding redundant features
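
The graph formulation can be sketched as follows, with several simplifying assumptions. Features are nodes of a complete graph, pheromone is stored on edges, and each ant walks until it has visited a fixed number of nodes. Evaporation and deposit follow the general Ant System pattern. Edge polarization, heuristic desirability, and the ACS local update are left out, and `toy_fitness` and every parameter value are hypothetical. This is not the adaptation proposed in the thesis.

```python
import random

def toy_fitness(subset, useful=frozenset({1, 3, 4})):
    """Hypothetical stand-in for a subset evaluation: count of useful features."""
    return float(len(set(subset) & useful))

def aco_fs(n_features, fit, subset_size, n_ants=10, iterations=30,
           rho=0.1, seed=0):
    """Return the best subset found (sorted) and its fitness."""
    rng = random.Random(seed)
    # Pheromone on every edge of the complete graph, initialized uniformly.
    tau = [[1.0] * n_features for _ in range(n_features)]
    best_subset, best_fit = None, float("-inf")
    for _ in range(iterations):
        tours = []
        for _ in range(n_ants):
            current = rng.randrange(n_features)
            tour = [current]
            while len(tour) < subset_size:
                candidates = [j for j in range(n_features) if j not in tour]
                weights = [tau[current][j] for j in candidates]
                current = rng.choices(candidates, weights=weights)[0]
                tour.append(current)
            tours.append((tour, fit(tour)))
        # Evaporation on all edges.
        for i in range(n_features):
            for j in range(n_features):
                tau[i][j] *= 1.0 - rho
        # Deposit proportional to fitness along each ant's path.
        for tour, f in tours:
            for a, b in zip(tour, tour[1:]):
                tau[a][b] += max(f, 0.0)
                tau[b][a] += max(f, 0.0)
            if f > best_fit:
                best_subset, best_fit = sorted(tour), f
    return best_subset, best_fit

subset, score = aco_fs(8, toy_fitness, subset_size=3)
```

Edges that appear in good subsets accumulate pheromone and are followed more often in later iterations, which is what biases the search toward promising feature combinations.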
Research topics and perspectives
• New feature selection search strategies, based on metaheuristic adaptations, such as:
    • Multi-agent genetic algorithms
    • Ant colony optimization (ACO)
    • Particle swarm optimizer (PSO)
    • Cultural algorithms
• Improving evaluation quality: multi-objective optimization.
• Parallelization, distribution, load balancing, integration into a common framework (DM grid service).
Research topics (2)
• Hybridization of wrapper and filter approaches.
• New feature selection approaches for unsupervised classification.
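
One possible shape for a wrapper/filter hybrid, given as a sketch under stated assumptions: a filter stage first discards features that are highly correlated with an already kept feature, then a greedy wrapper stage searches only among the survivors. The correlation threshold, the toy data, and `toy_score` (a stand-in for a classifier-based evaluation) are illustrative choices, not taken from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def drop_redundant(X, threshold=0.95):
    """Filter stage: keep a feature only if it is not near-collinear with a kept one."""
    n_features = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_features)]
    kept = []
    for j in range(n_features):
        if all(abs(pearson(cols[j], cols[k])) < threshold for k in kept):
            kept.append(j)
    return kept

def greedy_wrapper(candidates, score):
    """Wrapper stage: forward selection over the candidates, stopping when nothing improves."""
    selected, best = [], float("-inf")
    improved = True
    while improved:
        improved, choice = False, None
        for f in candidates:
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best:
                best, choice, improved = s, f, True
        if improved:
            selected.append(choice)
    return selected, best

def toy_score(subset):
    """Hypothetical evaluation: features 0 and 2 are useful, each feature costs 0.1."""
    return len({0, 2} & set(subset)) - 0.1 * len(subset)

# Toy data (assumed): feature 1 is exactly twice feature 0, so it is redundant.
X = [
    [1.0, 2.0, 0.3],
    [2.0, 4.0, 0.9],
    [3.0, 6.0, 0.1],
    [4.0, 8.0, 0.7],
]

kept = drop_redundant(X)
selected, best = greedy_wrapper(kept, toy_score)
```

The cheap filter stage shrinks the search space so that the costly wrapper evaluations are spent only on non-redundant candidates.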
Conclusion
• FS is a multi-disciplinary research topic:
    • Statistics; optimization; data mining
• FS is an essential KDD step to face new data mining challenges:
    • High dimensionality, biological data, streaming data mining
• FS poses new challenges to the data mining community:
    • New efficient search strategies, hybrid strategies