Presentation Outline
• Introduction to Data Mining
• Rule Induction for Classification
• AntMiner
  – Overview: Input/Output
  – Rule Construction
  – Quality Measurement
  – Pheromone: Initial/Updating
  – Experiments/Results
  – Performance/Complexity
• Swarm-based Genetic Programming
  – Introduction to GP, Symbolic Regression
  – Crossover problems
  – Ant Colony Crossover
  – Experiments and Results

Introduction
• Data Mining tries to find, in large databases:
  – hidden knowledge
  – unexpected patterns
  – new rules
• "Discovery of useful summaries of data"
• It is a key element of a much more elaborate process: Knowledge Discovery in Databases (KDD)

Goals of Rule Induction
• Stage of Data Mining: Rule Induction
• Find rules to describe data in some way
  – Not only accurate…
  – …but also comprehensible for a human user…
  – …to support decision making

Focus in this Talk
• Rule Induction for Classification using ACO
  – Given: training set (instances/cases to classify)
  – Goal: to come up with (preferably simple) rules to classify data
• Algorithm by Parpinelli, Lopes and Freitas: AntMiner
• ACO + Genetic Programming
  – Symbolic regression

Rule Induction
• Possible Outputs for Rule Induction
  – decision trees
  – (ordered) decision lists [here]
  – …
• Form of an ordered decision list:
  if <attribute1>=<value1> and <attribute2>=<value2> and… then <class>=<class1> else if…

AntMiner Input
• Training set / test set
• Attribute / value pairs
• Given classes / classification

AntMiner Output
• Ordered decision list
  – Ordered list of IF-THEN rules of the form IF <condition> THEN <class>
    • <condition> = <term1> AND <term2> AND…
      – <term> = <attribute> '=' <value>
  – Plus a default rule (majority value)
  – The first matching rule "fires".
• Only discrete attributes are supported so far.
  – Continuous values must be discretized beforehand.
• This is a quite limited version of a decision list.

Prerequisites for an ACO (Review)
• Problem-dependent heuristic function (η) for measuring the quality of items that could be added to the partial solution so far.
• Pheromone updating rule (τ)
• Probabilistic transition rule based on η and τ
• Difference from most ACO algorithms mentioned in class: it does not use a graph representation of the problem.
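
The output format described above (ordered IF-THEN rules, first matching rule fires, default rule otherwise) can be sketched as follows. This is a minimal illustration, not AntMiner's actual data structure: the names `covers` and `classify` and the two example rules are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# A rule is (antecedent, predicted_class); the antecedent maps
# attribute -> required value, i.e. <term1> AND <term2> AND ...
Rule = Tuple[Dict[str, str], str]


def covers(antecedent: Dict[str, str], case: Dict[str, str]) -> bool:
    """True if every <attribute>=<value> term of the antecedent holds for the case."""
    return all(case.get(attr) == value for attr, value in antecedent.items())


def classify(rules: List[Rule], default_class: str, case: Dict[str, str]) -> str:
    """Apply the rules in order; the first rule that covers the case fires."""
    for antecedent, predicted in rules:
        if covers(antecedent, case):
            return predicted
    return default_class  # default rule (majority value)


# Illustrative rules only (hypothetical, chosen for this sketch).
rules: List[Rule] = [
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "rainy", "windy": "FALSE"}, "yes"),
]

print(classify(rules, "no", {"outlook": "rainy", "windy": "FALSE"}))
print(classify(rules, "no", {"outlook": "sunny", "windy": "TRUE"}))
```

Because the list is ordered, a later rule is only ever consulted for cases that no earlier rule covers.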

AntMiner Algorithm: Top-Level
• Pseudo-code for finding one rule set:
  trainingSet = {all training cases}
  discoveredRuleList = [ ]
  WHILE (|trainingSet| still too big)
    Initialize pheromone (equally distributed)
    Ants try to find a good classification rule by the ACO heuristic
    Add best rule found to discoveredRuleList
    Remove correctly covered examples from trainingSet

AntMiner Algorithm: Mid-Level
• Pseudo-code for finding one rule:
  Repeat
    Start new ant with empty rule (antecedent)
    Construct rule by adding one term at a time, then choose the rule consequent
    Prune rule
    Increase pheromone on the trail the ant used, according to the quality of the rule
  Until (maximum number z of ants exceeded) or (no improvement during the last k iterations)
• In effect, the population consists of only one ant working at a time.

AntMiner Algorithm: Bottom-Level
• Repeat as long as possible:
  – Add one condition to the rule.
    • Use a probabilistic approach based on pheromone concentration and the heuristic.
    • Do not use an attribute twice.
    • The resulting rule must cover at least a minimum number of cases.
• After the antecedent is finished, calculate the resulting class.
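
The top-level loop above can be sketched as a small covering procedure. This is a sketch under stated assumptions, not the authors' implementation: the ant-based search is passed in as a callback `find_best_rule`, and `most_frequent_term_rule` is a deliberately trivial, hypothetical stand-in for it so that the example runs. Taking the default class from the cases left over at the end is also an assumption.

```python
from collections import Counter


def covers(antecedent, case):
    """True if all attribute=value terms of the antecedent hold for the case."""
    return all(case.get(attr) == value for attr, value in antecedent.items())


def most_frequent_term_rule(cases):
    """Hypothetical stand-in for the ant search: pick the single most
    frequent (attribute, value, class) combination as a one-term rule."""
    counts = Counter(
        (attr, value, case["class"])
        for case in cases
        for attr, value in case.items()
        if attr != "class"
    )
    (attr, value, label), _ = counts.most_common(1)[0]
    return {attr: value}, label


def discover_rule_list(training_set, find_best_rule, max_uncovered_cases):
    """WHILE the training set is still too big: find a rule, store it,
    and remove the cases it covers correctly."""
    remaining = list(training_set)
    discovered = []
    while len(remaining) > max_uncovered_cases:
        antecedent, predicted = find_best_rule(remaining)
        still_uncovered = [
            c for c in remaining
            if not (covers(antecedent, c) and c["class"] == predicted)
        ]
        if len(still_uncovered) == len(remaining):
            break  # safeguard: the rule removed nothing, so stop
        discovered.append((antecedent, predicted))
        remaining = still_uncovered
    # Default rule: majority class of the leftover cases (assumption).
    pool = remaining or list(training_set)
    default_class = Counter(c["class"] for c in pool).most_common(1)[0][0]
    return discovered, default_class


cases = [
    {"outlook": "overcast", "class": "yes"},
    {"outlook": "overcast", "class": "yes"},
    {"outlook": "sunny", "class": "no"},
]
rule_list, default_class = discover_rule_list(cases, most_frequent_term_rule, 1)
print(rule_list, default_class)
```

Swapping in a real rule finder only requires a function with the same shape: it receives the remaining cases and returns an `(antecedent, predicted_class)` pair.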

Rule Construction
• Probability for adding <A_i>=<V_ij>:
  $P_{ij} = \eta_{ij}\,\tau_{ij}(t)$ [normalized]
• where
  – A_i is the i-th attribute
  – V_ij is the j-th possible value of the i-th attribute
  – η is the heuristic function, τ the pheromone trail

Heuristic Function (η)
• Analogous to:
  – Proximity function in TSP
  – Colouring matrix in the graph colouring problem.
• Uses information theory (entropy).
  – Split instances using the rule.
  – Quality corresponds to the entropy of the remaining "buckets"; the less, the better.
  $H(W \mid A_i = V_{ij}) = -\sum_{w=1}^{k} P(w \mid A_i = V_{ij}) \cdot \log_2 P(w \mid A_i = V_{ij})$
  $\eta_{ij} = \log_2 k - H(W \mid A_i = V_{ij})$ [normalized]
  where k is the number of classes

Information Heuristic Example
For T: high = >80, mild = 70<T≤80, cold = 0<T≤70 (for later)
P(play|outlook=sunny) = 2/14 = 0.143, P(don't play|outlook=sunny) = 3/14 = 0.214
H(W, outlook=sunny) = −0.143·log_2(0.143) − 0.214·log_2(0.214) = 0.877
η = log_2 k − H(W, outlook=sunny) = 1 − 0.877 = 0.123
(Both probabilities are taken over all 14 cases. Conditioning on the 5 sunny cases instead gives 2/5 and 3/5, so H = 0.971 and η = 0.029.)
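
The heuristic and the transition rule above can be combined in a short sketch. It follows the conditional-entropy definition of η, so the probabilities are computed within the cases that have the given attribute value. The class counts per outlook value are taken from the play-outside data set used in the talk's sample run (sunny: 2 yes / 3 no, overcast: 4 / 0, rainy: 3 / 2). The function names and the uniform fallback for an all-zero weight sum are assumptions of this sketch.

```python
import math
import random


def entropy(class_counts):
    """H(W | A_i = V_ij) in bits, from the class counts of the covered cases."""
    total = sum(class_counts)
    return -sum((n / total) * math.log2(n / total) for n in class_counts if n > 0)


def heuristic(class_counts, num_classes):
    """eta_ij = log2(k) - H(W | A_i = V_ij), before normalization."""
    return math.log2(num_classes) - entropy(class_counts)


def term_probabilities(eta, tau, used_attributes):
    """P_ij proportional to eta_ij * tau_ij, over terms whose attribute is unused."""
    weights = {
        term: eta[term] * tau[term]
        for term in eta
        if term[0] not in used_attributes
    }
    total = sum(weights.values())
    if total == 0:  # assumption: fall back to a uniform choice
        return {term: 1.0 / len(weights) for term in weights}
    return {term: w / total for term, w in weights.items()}


def choose_term(probabilities, rng):
    """Sample one (attribute, value) term according to P_ij."""
    terms = list(probabilities)
    return rng.choices(terms, weights=[probabilities[t] for t in terms], k=1)[0]


# (yes, no) counts per outlook value; two classes, so k = 2.
counts = {
    ("outlook", "sunny"): (2, 3),
    ("outlook", "overcast"): (4, 0),
    ("outlook", "rainy"): (3, 2),
}
eta = {term: heuristic(c, 2) for term, c in counts.items()}
tau = {term: 1.0 / len(counts) for term in counts}  # equal initial pheromone
probs = term_probabilities(eta, tau, used_attributes=set())
print({term: round(p, 3) for term, p in probs.items()})
print(choose_term(probs, random.Random(0)))
```

With equal pheromone, the choice is driven entirely by η, so the pure bucket outlook=overcast receives most of the probability mass.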

Information Heuristic Example
For H: high = >85, normal = 0<H≤85 (for later)
P(play|outlook=overcast) = 4/14 = 0.286, P(don't play|outlook=overcast) = 0/14 = 0
H(W, outlook=overcast) = −0.286·log_2(0.286) = 0.516
η = log_2 k − H(W, outlook=overcast) = 1 − 0.516 = 0.484
(Again taken over all 14 cases. Conditioning on the 4 overcast cases gives P(play) = 1, so H = 0 and η = 1.)

Quality Function
• Measuring the classification quality of a rule / several rules.
  – For one rule: sensitivity · specificity
    $Q = \frac{TP}{TP + FN} \cdot \frac{TN}{FP + TN}$
    where T = true, F = false, P = positive, N = negative
  – The bigger the value of Q, the better.
• Measuring the simplicity of a rule set:
  – number of rules · average number of terms per rule
  – The smaller the value, the simpler and thus the better.

Rule Pruning
• Iteratively remove one term at a time from the rule while this improves the classification accuracy of the rule.
  – The majority class might change.
  – If several terms qualify, remove the one that improves the accuracy the most.
  – Simplicity improves in any case.
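
The quality function Q above can be computed directly from the confusion counts of a single rule. The sketch below is illustrative: the helper names and the four toy cases are assumptions, and returning 0 when a denominator is empty is a choice made here, not something stated in the talk.

```python
def covers(antecedent, case):
    """True if all attribute=value terms of the antecedent hold for the case."""
    return all(case.get(attr) == value for attr, value in antecedent.items())


def confusion_counts(antecedent, predicted, cases):
    """Count TP, FP, FN, TN of one rule, treating `predicted` as the positive class."""
    tp = fp = fn = tn = 0
    for case in cases:
        is_covered = covers(antecedent, case)
        is_positive = case["class"] == predicted
        if is_covered and is_positive:
            tp += 1
        elif is_covered:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rule_quality(tp, fp, fn, tn):
    """Q = sensitivity * specificity = TP/(TP+FN) * TN/(FP+TN)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # assumption: 0 if undefined
    specificity = tn / (fp + tn) if fp + tn else 0.0
    return sensitivity * specificity


# Toy cases (hypothetical), one of each confusion type for the rule below.
toy_cases = [
    {"outlook": "overcast", "class": "yes"},  # covered, positive  -> TP
    {"outlook": "overcast", "class": "no"},   # covered, negative  -> FP
    {"outlook": "sunny", "class": "yes"},     # uncovered, positive -> FN
    {"outlook": "sunny", "class": "no"},      # uncovered, negative -> TN
]
counts = confusion_counts({"outlook": "overcast"}, "yes", toy_cases)
print(counts, rule_quality(*counts))
```

The same two functions are enough to drive pruning: drop one term, recompute Q, and keep the change only if the value improves.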

Pheromone
• Initial pheromone value:
  $\tau_{ij}(t = 0) = \frac{1}{\sum_{i=1}^{a} b_i}$ [normalized]
  where a is the total number of attributes and b_i is the number of possible values of A_i.

Pheromone Updating (τ)
• Values before (1).
• First increase the pheromone of the used terms according to rule quality (2):
  $\tau_{ij}(t + 1) = \tau_{ij}(t) \cdot (1 + Q)$
• Then normalize the pheromone level of all terms → pheromone evaporation (3)

Using the Discovered Rules
• Apply in the order they were discovered.
• The first rule that covers the case is applied.
• If no rule covers the case, apply the default result (majority value).
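
The pheromone initialization and update above can be sketched in a few lines. Terms are represented as `(attribute, value)` pairs, and the function names and the two-attribute example are assumptions for illustration. Evaporation is not modeled separately: it results from the normalization step, as described above.

```python
def initial_pheromone(values_per_attribute):
    """tau_ij(t=0) = 1 / sum_i b_i: every term starts with the same amount."""
    total_terms = sum(len(values) for values in values_per_attribute.values())
    return {
        (attr, value): 1.0 / total_terms
        for attr, values in values_per_attribute.items()
        for value in values
    }


def update_pheromone(tau, used_terms, quality):
    """Multiply used terms by (1 + Q), then normalize all terms to sum to 1.
    The normalization lowers every unused term, which acts as evaporation."""
    boosted = {
        term: level * (1 + quality) if term in used_terms else level
        for term, level in tau.items()
    }
    total = sum(boosted.values())
    return {term: level / total for term, level in boosted.items()}


tau0 = initial_pheromone({
    "outlook": ["sunny", "overcast", "rainy"],
    "windy": ["TRUE", "FALSE"],
})
tau1 = update_pheromone(tau0, used_terms={("outlook", "overcast")}, quality=0.5)
print(round(tau1[("outlook", "overcast")], 4), round(tau1[("windy", "TRUE")], 4))
```

In this example every term starts at 0.2; after the update the used term rises to about 0.273 and each unused term falls to about 0.182.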

Possible Discretization of Continuous Attributes
• Use C4.5-Disc
• Quick overview:
  – Extract a reduced data set that contains only the attribute to discretize and the desired classification.
  – From that, build a decision tree using the C4.5 algorithm (another rule induction algorithm).
  – Result: decision tree with binary decisions: x ≤ a → go left; x > a → go right
  – Each path corresponds to the definition of a categorical interval.

AntMiner's Parameters
• Number of ants (3000 used in experiments). It also limits the maximum number of rules found for a classification. The limit is not necessarily reached, because the algorithm might converge earlier.
• Minimum number of cases per rule (10). Each rule must cover at least this many cases. Avoids overfitting.
• Maximum number of uncovered cases in the training set (10). The algorithm stops when fewer cases than this remain.
• Number of rules to test for the convergence of the ants (10). The algorithm waits this many rules for an improvement.

Sample Run Start
• Deciding whether to play outside
  – Attributes: outlook, temperature, humidity, windy, play
  – Classes: play (yes), do not play (no)
  – sunny,hot,high,FALSE,no (1)
  – sunny,hot,high,TRUE,no (2)
  – overcast,hot,normal,FALSE,yes (3)
  – rainy,mild,high,FALSE,yes (4)
  – rainy,cool,normal,FALSE,yes (5)
  – rainy,cool,normal,TRUE,no (6)
  – overcast,cool,normal,TRUE,yes (7)
  – sunny,mild,high,FALSE,no (8)
  – sunny,cool,normal,FALSE,yes (9)
  – rainy,mild,normal,FALSE,yes (10)
  – sunny,mild,normal,TRUE,yes (11)
  – overcast,mild,high,TRUE,yes (12)
  – overcast,hot,normal,FALSE,yes (13)
  – rainy,mild,high,TRUE,no (14)

Sample Run
• Sample run for finding one rule set.
• Start: I={all}, R={}
• Ant 1: Chooses probabilistically outlook=overcast (then play=yes)
• Ant 1: Chooses values for the other attributes…
• Ant 1: Finishes because all attributes are used.
• Ant 1: The last three conditions are pruned away.
• I={1,2,4,5,6,8,9,10,11,14}, R={outlook=overcast → yes}
• Ant 2: Chooses outlook=rainy (then play=yes)
• Rule is not good enough (3:2)
• Ant 2: Chooses windy=true (then play=no)
• Ant 2 finishes because otherwise the covered set would be too small.
• No pruning possible either.
• …
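
The bookkeeping in the sample run above can be checked mechanically. The sketch below encodes the 14 cases exactly as listed and recomputes the remaining set I after the first rule, as well as the class counts Ant 2 sees. The helper `covered` is a name introduced for this sketch.

```python
from collections import Counter

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")

# The 14 cases of the sample run, in the order they are numbered.
ROWS = [
    "sunny,hot,high,FALSE,no",
    "sunny,hot,high,TRUE,no",
    "overcast,hot,normal,FALSE,yes",
    "rainy,mild,high,FALSE,yes",
    "rainy,cool,normal,FALSE,yes",
    "rainy,cool,normal,TRUE,no",
    "overcast,cool,normal,TRUE,yes",
    "sunny,mild,high,FALSE,no",
    "sunny,cool,normal,FALSE,yes",
    "rainy,mild,normal,FALSE,yes",
    "sunny,mild,normal,TRUE,yes",
    "overcast,mild,high,TRUE,yes",
    "overcast,hot,normal,FALSE,yes",
    "rainy,mild,high,TRUE,no",
]

cases = []
for number, row in enumerate(ROWS, start=1):
    *values, label = row.split(",")
    cases.append({"id": number, "play": label, **dict(zip(ATTRIBUTES, values))})


def covered(case_list, **conditions):
    """Cases that satisfy every attribute=value condition."""
    return [c for c in case_list if all(c[a] == v for a, v in conditions.items())]


# Rule 1: outlook=overcast -> yes. Remove the cases it covers correctly.
rule1_cases = covered(cases, outlook="overcast")
remaining = [
    c for c in cases
    if not (c in rule1_cases and c["play"] == "yes")
]
remaining_ids = [c["id"] for c in remaining]
print(remaining_ids)

# Ant 2: outlook=rainy alone gives a 3:2 split; adding windy=TRUE makes it pure.
rainy = Counter(c["play"] for c in covered(remaining, outlook="rainy"))
rainy_windy = Counter(c["play"] for c in covered(remaining, outlook="rainy", windy="TRUE"))
print(dict(rainy), dict(rainy_windy))
```

The recomputed set matches I={1,2,4,5,6,8,9,10,11,14}, and the rule outlook=rainy, windy=TRUE covers only two cases, both with play=no.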

Sample Run Result
• Possible result (not the simplest):
  – outlook=overcast → play=yes
  – outlook=rainy, windy=false → play=yes
  – outlook=sunny, humidity=normal → play=yes
  – otherwise → play=no

Comparison to CN2 Algorithm
• Uses beam search (limited breadth-first search with beam width b).
• Adds all possible terms to the current partial rules, evaluates them, and retains only the b best ones.
• No feedback for constructing new rules.
• Output format is the same (ordered rule list).
• Uses an entropy heuristic as well.

Experiment Setup
• Dimensions, roughly: 100…1000 cases, 9…34 attributes, 2…6 classes
• Tests run using a 10-fold cross-validation procedure
  – Divide the data into 10 partitions.
  – For each partition:
    • Treat it as the test data and use the other 90% as the training data.
    • Measure the performance.
  – Take the average value.
• Averaging over the 10 folds helps to obtain statistically more reliable results.
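
The 10-fold cross-validation procedure above can be sketched with the standard library alone. The shuffling, the seed, and the `evaluate` callback (which would train on one index list and return a score on the other) are assumptions of this sketch. The talk does not say how the partitions were drawn.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists: each partition is the test set once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)        # assumption: random partitions
    folds = [indices[i::k] for i in range(k)]   # k nearly equal partitions
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_cases, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_cases, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Dummy evaluation (hypothetical): the fraction of all cases held out per fold.
print(cross_validate(100, lambda train, test: len(test) / 100))
```

With 100 cases, each fold trains on 90 indices and tests on the other 10, and every case is used for testing exactly once.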
