New approaches for improving data mining feature selection techniques



  1. Ph.D. Thesis: New approaches for improving data mining feature selection techniques
     Elaborated by: M. A. Esseghir
     Supervised by: Pr. Y. Slimani, Pr. G. Goncalves
     21/09/06, M. T. Hsu

  2. Outline
     - Introduction
     - Feature selection problem
     - Existing approaches
     - The proposed approaches
     - Research topics and perspectives
     - Conclusion

  3. Introduction
     - The ability of machines to store ever-increasing data volumes outpaces their ability to analyze them.
     - Applied data mining techniques:
       - Computational cost: number of features
       - Classification accuracy: high dimensionality
       - Identification of representative features to build classification models.

  4. Feature Selection (FS) problem
     Definition: "Feature selection studies how to select a subset or list of attributes or variables that are used to construct models describing data." (Huan Liu, IEEE senior member)
     Definition 2: A process that chooses an optimal subset of features according to a certain criterion.
     Objectives
     - Identify salient features.
     - Discard irrelevant, redundant, and noisy data.
     - Enhance the comprehensibility of the models.
     - Avoid model overfitting.
     - Improve classification and response-time (time and complexity) capabilities.
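
Definition 2 can be written compactly as an optimization problem. The symbols are assumptions introduced here, not taken from the slides: F is the full feature set, S a candidate subset, and J the chosen evaluation criterion (higher is better).

```latex
S^{*} = \arg\max_{S \subseteq F} J(S)
```

Filters and wrappers differ mainly in how J is computed, and search strategies differ in which subsets S they actually visit.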

  5. Existing Approaches
     - Wrappers and Filters
     - Filters: select subsets using their general characteristics (intrinsic properties).
       - Search: forward and backward search based on one criterion.
       - Dependency measures
       - Information measures
       - Consistency measures
     - Wrappers: apply a learning algorithm to evaluate selected subsets.
       - Search: exhaustive, random, heuristic (GA, SA, HC, GrS).
       - Evaluation: ANN, ID3, C4.5, NB, SVM.
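
The filter/wrapper distinction on this slide can be illustrated with a minimal sketch. It is not the thesis implementation: the filter below ranks features by information gain (one possible information measure), and the wrapper runs a greedy forward search scored by leave-one-out accuracy of a 1-nearest-neighbour classifier, which is swapped in for the learners listed on the slide only to keep the example short. The toy data `X`, `y` and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy discrete data (hypothetical): feature 0 equals the label,
# feature 1 is noise, feature 2 is constant.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 1]]
y = [0, 0, 1, 1, 0, 1]

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Entropy reduction obtained by splitting the labels on one feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for x, lab in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def filter_rank(X, y):
    """Filter: rank feature indices by information gain, best first."""
    n_features = len(X[0])
    gains = [information_gain([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)

def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of 1-NN (Hamming distance) on the chosen features."""
    correct = 0
    for i, row in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum(row[f] != X[j][f] for f in subset))
        correct += y[nearest] == y[i]
    return correct / len(X)

def wrapper_forward(X, y):
    """Wrapper: add the feature that most improves accuracy; stop when none does."""
    selected, best = [], 0.0
    remaining = set(range(len(X[0])))
    while remaining:
        # Ties on score are broken in favour of the lowest feature index.
        score, _, f = max((loo_accuracy(X, y, selected + [f]), -f, f) for f in remaining)
        if score <= best:
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected, best

print(filter_rank(X, y))
print(wrapper_forward(X, y))
```

The filter never trains a model, while the wrapper retrains and re-scores the classifier for every candidate subset; that is the cost difference the next slide refers to.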

  6. FS process (figure, after Huan Liu; stages shown: Exploration, Validation)

  7. Existing Approaches (2)
     Filters
     - Advantages: simple to implement; low search cost, O(N^2); independent criterion.
     - Drawbacks: do not perform as well; evaluate one feature at a time.
     Wrappers
     - Advantages: high subset quality; improve classification; all features are considered.
     - Drawbacks: exponential exploration search (2^n); high evaluation cost; unsuited to large data sets.
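
To make the gap between the two search costs on this slide concrete, here is a small sketch that counts subset evaluations. The exact formulas are assumptions: the quadratic count supposes a greedy forward search that runs until all n features are added, and the exponential count supposes an exhaustive search over every non-empty subset.

```python
def forward_search_evaluations(n):
    """Greedy forward search: n candidates in step 1, n-1 in step 2, ..., 1 in step n."""
    return n * (n + 1) // 2

def exhaustive_evaluations(n):
    """Exhaustive search: every non-empty subset of n features."""
    return 2 ** n - 1

for n in (10, 20, 30):
    print(n, forward_search_evaluations(n), exhaustive_evaluations(n))
# prints:
# 10 55 1023
# 20 210 1048575
# 30 465 1073741823
```

With a wrapper, each of these evaluations also involves training a classifier, which is why heuristic search strategies are used instead of exhaustive exploration.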

  8. The proposed approaches
     - Genetic Algorithm (GA)
       - Standard
       - Memetic algorithms: hybrid global + local search
     - Parallel FS for high-dimensional data
       - Island model
       - Multi-agent system
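
A standard GA for feature selection is commonly encoded with one bit per feature. The following is a minimal sketch under that assumption, not the algorithm from the thesis. The fitness function is a toy stand-in (the hypothetical `RELEVANT` set plays the role of a wrapper's classifier score), and the operators chosen here (tournament selection, one-point crossover, bit-flip mutation, single-individual elitism) are illustrative choices.

```python
import random

N_FEATURES = 8
RELEVANT = {0, 3, 5}  # hypothetical ground truth, used only by the toy fitness

def fitness(mask):
    """Toy stand-in for a wrapper score: reward relevant features, penalize size."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in RELEVANT)
    return hits - 0.1 * sum(mask)

def tournament(pop, rng, k=3):
    """Pick the fittest of k individuals drawn at random."""
    return max(rng.sample(pop, k), key=fitness)

def crossover(a, b, rng):
    """One-point crossover of two bit tuples."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(mask, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return tuple(bit ^ 1 if rng.random() < rate else bit for bit in mask)

def genetic_feature_selection(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(N_FEATURES)) for _ in range(pop_size)]
    history = [max(map(fitness, pop))]  # best fitness per generation
    for _ in range(generations):
        elite = max(pop, key=fitness)   # elitism: the best individual survives unchanged
        children = [
            mutate(crossover(tournament(pop, rng), tournament(pop, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        pop = [elite] + children
        history.append(fitness(max(pop, key=fitness)))
    return max(pop, key=fitness), history

best, history = genetic_feature_selection()
print(best, history[-1])
```

A memetic variant would apply a local search, for example bit-flip hill climbing, to each child before it enters the population. An island model would run several such populations in parallel and exchange individuals periodically.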

  9. The proposed approaches (2)
     - Ant Colony Optimizer (ACO):
       - AS and ACS adaptation:
         - Complete graph
         - Nodes correspond to attributes
         - Polarized edges
       - Hybrid search:
         - Combining wrappers and filters
         - Correlation-guided search
         - Discarding redundant features.
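
The slide does not say how correlation guides the search or how redundancy is detected. One common way to discard redundant features, shown here purely as an assumed illustration, is to visit features in decreasing order of absolute Pearson correlation with the target and keep a feature only if it is not strongly correlated with one already kept. The threshold of 0.9, the toy columns, and the function names are all hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def discard_redundant(columns, target, threshold=0.9):
    """Keep features greedily, most target-correlated first, skipping near-duplicates."""
    order = sorted(range(len(columns)),
                   key=lambda j: abs(pearson(columns[j], target)),
                   reverse=True)
    kept = []
    for j in order:
        if all(abs(pearson(columns[j], columns[k])) < threshold for k in kept):
            kept.append(j)
    return sorted(kept)

# Toy data (hypothetical): feature 1 is nearly a copy of feature 0.
target = [1, 2, 3, 4, 5, 6]
columns = [
    [1, 2, 3, 4, 5, 6],   # feature 0: strongly relevant
    [1, 2, 3, 4, 5, 7],   # feature 1: redundant with feature 0
    [1, 0, 1, 0, 1, 0],   # feature 2: weakly relevant, not redundant
]
print(discard_redundant(columns, target))  # prints [0, 2]
```

In a hybrid scheme, a cheap filter step like this one could prune the candidate set before a wrapper evaluates the remaining subsets.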

  10. Research topics and perspectives
      - New feature selection search strategies, based on metaheuristic adaptations, such as:
        - Multi-agent genetic algorithms
        - Ant colony optimization (ACO)
        - Particle swarm optimizer (PSO)
        - Cultural algorithms.
      - Improving evaluation quality: multi-objective optimization.
      - Parallelization, distribution, load balancing, integration into a common framework (DM grid service)
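
The slide does not name the objectives of the multi-objective formulation. As a sketch, assume two of them: maximize classification accuracy and minimize the number of selected features. Candidate subsets can then be compared by Pareto dominance; the candidate values below are made up for illustration.

```python
def dominates(a, b):
    """a and b are (accuracy, n_features) pairs; a dominates b if it is
    no worse on both objectives and strictly better on at least one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Hypothetical (accuracy, n_features) scores of five feature subsets.
candidates = [(0.90, 5), (0.92, 8), (0.85, 3), (0.90, 7), (0.80, 4)]
print(pareto_front(candidates))  # prints [(0.9, 5), (0.92, 8), (0.85, 3)]
```

Instead of collapsing both objectives into one score, the search returns the whole front and leaves the accuracy/size trade-off to the user.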

  11. Research topics (2)
      - Hybridization of wrapper and filter approaches.
      - New feature selection approaches for unsupervised classification.

  12. Conclusion
      - FS is a multi-disciplinary research topic:
        - Statistics; Optimization; Data mining
      - FS is an essential KDD step to face new data mining challenges.
        - High dimensionality, biological data, streaming data mining
      - FS poses new challenges to the data mining community.
        - New efficient search strategies, hybrid strategies.
