A Distributed-Population Genetic Algorithm for Discovering Interesting Prediction Rules
Edgar Noda (1), Alex A. Freitas (2), Akebo Yamakami (1)
(1) School of Electrical and Computer Engineering (FEEC), State University of Campinas (Unicamp), Brazil
(2) Computing Laboratory, University of Kent at Canterbury, UK
Introduction
  Data mining:
    – Extraction of knowledge from data.
    – Data mining tasks:
        • Classification.
            – Prediction of a single goal attribute.
        • Dependence modeling.
            – A generalization of classification with more than one possible goal attribute.
  Form of prediction rules:
    – IF conditions on the values of predicting attributes are true THEN predict a value for some goal attribute.
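The IF-THEN rule form above can be sketched as a small data structure. This is only an illustrative sketch: the class name `Rule`, its fields, and the attribute names in the usage note are assumptions introduced here, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """IF all antecedent conditions hold THEN predict goal_attr = goal_value."""
    antecedent: dict  # predicting attribute name -> required value
    goal_attr: str    # the goal attribute the rule predicts
    goal_value: str   # the value predicted for that goal attribute

    def covers(self, record):
        # A record is covered when every condition in the antecedent is satisfied.
        return all(record.get(attr) == value
                   for attr, value in self.antecedent.items())

    def predict(self, record):
        # Return the predicted goal value, or None when the rule does not apply.
        return self.goal_value if self.covers(record) else None
```

For example, with hypothetical attributes, `Rule({"milk": "yes", "hair": "yes"}, "type", "mammal")` predicts `"mammal"` for any record whose `milk` and `hair` values are both `"yes"`, and returns `None` otherwise. In dependence modeling, different rules may use different attributes as `goal_attr`.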
Discovered Knowledge
  Desirable properties (in principle, three):
    – 1. Predictive accuracy.
        • The most emphasized property in the literature.
        • Discovered knowledge should have high predictive accuracy.
    – 2. Comprehensibility.
        • High-level rules.
        • The output of rule discovery algorithms tends to be more comprehensible than the output of other kinds of algorithms.
Discovered Knowledge
  Desirable properties (continued):
    – 3. Interestingness.
        • Discovered knowledge should be interesting to the user.
        • Of the three desirable properties, interestingness seems to be the most difficult to quantify and to achieve.
        • By "interesting" we mean that discovered knowledge should be novel or surprising to the user.
        • The notion of interestingness goes beyond the notions of predictive accuracy and comprehensibility.
Motivation for using a Genetic Algorithm (GA) in rule discovery
  Genetic Algorithm:
    – A GA is essentially a search algorithm inspired by the principle of natural selection.
    – In general, GAs tend to cope better with attribute interaction problems than greedy rule induction algorithms.
    – GAs perform a global search.
    – GAs use stochastic search operators, which helps make them more robust and less sensitive to noise.
    – The execution of a GA can be regarded as a parallel search acting upon a population of candidate rules.
Motivation for using a Genetic Algorithm (GA) in rule discovery
  Distributed Genetic Algorithm (DGA):
    – The basic idea is to partition the population into several small, semi-isolated subpopulations.
    – Each subpopulation is associated with an independent GA, possibly exploring a different promising region of the search space.
    – Occasionally, subpopulations interact with one another by exchanging a few individuals, simulating a seasonal migratory process.
    – The injected individuals help ensure that good genetic material is shared from time to time.
    – This approach also helps reduce the premature convergence problem and restricts the occurrence of "illegal mating".
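The exchange of a few individuals between semi-isolated subpopulations can be sketched as follows. The ring topology, the replace-the-worst policy, and the names `migrate_ring` and `n_migrants` are assumptions made for illustration; the slides do not specify them for the generic DGA.

```python
def migrate_ring(islands, fitness, n_migrants=1):
    """Copy each island's best individuals to the next island in a ring.

    islands: list of subpopulations (each a list of individuals).
    fitness: function mapping an individual to a number (higher is better).
    """
    # Pick all emigrants first, so an individual that arrives in this round
    # is not forwarded again within the same round.
    emigrants = [sorted(pop, key=fitness, reverse=True)[:n_migrants]
                 for pop in islands]
    for i, migrants in enumerate(emigrants):
        target = islands[(i + 1) % len(islands)]
        target.sort(key=fitness)                  # worst individuals first
        target[:len(migrants)] = list(migrants)   # immigrants replace the worst
    return islands
```

Between migration rounds each island would run its own GA independently; only this step couples the subpopulations.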
GA-Nuggets
  Overview:
    – Designed for the dependence modeling task.
    – Individual encoding:
        • Genotype: fixed-length individual.
        • Phenotype: rules with a variable number of attributes.
    – Fitness function, in two parts:
        • Degree of interestingness.
            » Objective (information-theoretic) measure.
            » Antecedent and consequent interestingness.
        • Predictive accuracy.
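One way a fixed-length genotype can yield rules with a variable number of conditions is to give every predicting attribute a gene that carries an on/off flag. The sketch below assumes that layout; the `(value, active)` gene format and the function name `decode` are illustrative assumptions, not the encoding stated in the slides.

```python
def decode(genotype, attributes):
    """Map a fixed-length genotype to a variable-length rule antecedent.

    genotype: one (value, active) pair per attribute, in the same order
              as `attributes` (an assumed gene layout).
    Returns a dict holding only the conditions whose gene is active.
    """
    return {attr: value
            for attr, (value, active) in zip(attributes, genotype)
            if active}
```

With hypothetical attributes `["hair", "milk", "legs"]`, the genotype `[("yes", True), ("no", False), ("4", True)]` decodes to the two-condition antecedent `{"hair": "yes", "legs": "4"}`. Switching a flag on or off changes the rule length without changing the genotype length.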
GA-Nuggets
  The fitness function:

    $$\mathrm{Fitness} = \frac{w_1 \cdot \dfrac{AntInt + ConsInt}{2} + w_2 \cdot PredAcc}{w_1 + w_2}$$

    – AntInt: antecedent degree of interestingness.
    – ConsInt: consequent degree of interestingness.
    – PredAcc: predictive accuracy.
    – $w_1$ and $w_2$ are user-defined weights.
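Assuming the fitness formula reads as a weighted average of the mean interestingness, (AntInt + ConsInt) / 2, and PredAcc, with weights w1 and w2, it can be computed as in this sketch. The function and parameter names are chosen here for illustration, and the weight values in the usage note are arbitrary.

```python
def rule_fitness(ant_int, cons_int, pred_acc, w1, w2):
    """Weighted average of mean interestingness and predictive accuracy.

    ant_int, cons_int, pred_acc are assumed to be already normalized scores;
    w1 and w2 are the user-defined weights.
    """
    interestingness = (ant_int + cons_int) / 2
    return (w1 * interestingness + w2 * pred_acc) / (w1 + w2)
```

For instance, `rule_fitness(0.4, 0.6, 0.8, w1=1, w2=2)` weights predictive accuracy twice as heavily as interestingness and evaluates to approximately 0.7.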
GA-Nuggets
  Selection method:
    – Tournament selection (tournament size 2).
  Genetic operators:
    – Uniform crossover.
    – Mutation.
    – Condition insertion / removal operators.
        • These influence the size of the discovered prediction rules.
    – Consequent formation.
    – All operators guarantee that the genetic material remains valid.
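Tournament selection and uniform crossover are standard GA operators; a generic sketch is given below. It is not the authors' implementation: the function names, the `swap_prob` value of 0.5, and the list-based genotype are assumptions, and the rule-specific operators (condition insertion / removal, consequent formation) are not covered.

```python
import random


def tournament_select(population, fitness, k=2, rng=random):
    """Draw k distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)


def uniform_crossover(parent_a, parent_b, swap_prob=0.5, rng=random):
    """Swap each gene position between two children with probability swap_prob."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible. Because crossover swaps genes position by position, a child always holds a value for every attribute position, which is one simple way to keep fixed-length genotypes valid.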
DGA-Nuggets
  Fitness, selection and genetic operators:
    – The same as in the single-population version.
  Subpopulations:
    – Each subpopulation has its own fitness function (each searches for rules predicting a different goal attribute).
    – Number of subpopulations = number of possible goal attributes.
  Migration policy:
    – Migration takes place every m generations.
    – Each subpopulation sends its best individual according to the "foreign" fitness, that is, the fitness function of another subpopulation.
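The migration policy above can be sketched as follows, under several assumptions that the slides leave open: every subpopulation sends one migrant to every other subpopulation, the migrant is chosen with the receiving subpopulation's fitness function, and migrants are simply appended on arrival. The names `should_migrate` and `migrate_foreign_best` are hypothetical.

```python
def should_migrate(generation, m):
    """True every m generations (never at generation 0)."""
    return generation > 0 and generation % m == 0


def migrate_foreign_best(subpops, fitness_fns):
    """subpops[i] is evolved under fitness_fns[i] (one goal attribute each).

    Each subpopulation i sends to each other subpopulation j the individual
    that scores best under j's ("foreign") fitness function.
    """
    n = len(subpops)
    # Choose all migrants before inserting any, so new arrivals are not
    # re-sent within the same migration round.
    outgoing = [(j, max(subpops[i], key=fitness_fns[j]))
                for i in range(n) for j in range(n) if i != j]
    for j, migrant in outgoing:
        subpops[j].append(migrant)  # assumed policy: append, no replacement
    return subpops
```

Selecting migrants by the foreign fitness means a subpopulation exports the individual most useful to the receiver rather than its own overall best. How a migrant's consequent is adapted to the receiver's goal attribute is not shown here.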
Computational Results
  Datasets:
    – Obtained from the UCI repository of machine learning databases (http://www.ics.uci.edu/AI/Machine-Learning.html). The data sets used are Zoo, Car Evaluation, Auto Imports and Nursery.
        • Zoo: 101 instances and 18 attributes.
        • Car Evaluation: 1728 instances and 6 attributes.
        • Auto-imports 85M: 205 instances and 26 categorical attributes.
        • Nursery school: 12960 instances and 9 attributes.
Computational Results
  Summary of results:
    – Predictive accuracy.
        • DGA-Nuggets obtained somewhat better results than single-population GA-Nuggets.
        • In one case GA-Nuggets found rules with significantly higher predictive accuracy, whereas DGA-Nuggets significantly outperformed the single-population GA in six cases.
    – Degree of interestingness.
        • DGA-Nuggets obtained considerably better results than single-population GA-Nuggets.
        • DGA-Nuggets outperformed GA-Nuggets in 22 out of 44 cases (considering all the discovered rules in all four data sets), whereas the reverse was true in just five out of 44 cases. In the other cases the difference between the two algorithms was not statistically significant.
Discussion
Future Works