Data Mining: Classification Jay Urbain, PhD Credits: Tom Mitchell,

Occam's Razor
• Why prefer short hypotheses?
• Argument in favor:
  • There are fewer short hypotheses than long hypotheses
  • A short hypothesis that fits the data is unlikely to be a coincidence
  • A long hypothesis that fits the data might be a coincidence
• Argument opposed:
  • There are many ways to define small sets of hypotheses
  • e.g., all trees with a prime number of nodes that use attributes beginning with "Z"
  • What's so special about small sets based on the size of the hypothesis?

Overfitting in Decision Trees
• Consider adding a noisy training example D15:
  • Sunny, Hot, Normal, Strong, PlayTennis = No

Overfitting n Consider error of hypothesis h over: n Training data: error train (h ) n Entire distribution D of data: error D (h) n Hypothesis h ϵ H overfits training data if there is an alternative hypothesis h ’ ϵ H such that:

Overfitting

Avoiding Overfitting
• How to avoid overfitting:
  1. Stop growing the tree when a data split is not statistically significant
  2. Grow the full tree, then prune
  3. Grow many trees with randomly selected, reduced sets of attributes, and learn a function to weight the trees (random forests, a state-of-the-art technique)
• How to select the best tree:
  • Measure performance over the training data
  • Measure performance over a separate validation set
  • Minimum Description Length (MDL): minimize size(tree) + size(misclassifications)

Reduced-Error Pruning
• Start with the completed tree
• Split the data into a training set and a validation set
• Do until further pruning is harmful:
  1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  2. Greedily remove the one that most improves validation-set accuracy
• Produces the smallest version of the most accurate subtree

Rule Post-Pruning
1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others
3. Evaluate the impact on the validation set
4. Sort the final rules into the desired sequence for use
• Most frequently used method (e.g., C4.5)

Summary n Learning needed for unknown environments n For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples that performs well on unseen data n Decision tree learning using information gain n Learning performance = prediction accuracy measured on test set n Overfitting hampers a machine learners ability to generalize to unseen examples


Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes' theorem
• Performance: a simple Bayesian classifier, naïve Bayes, makes strong independence assumptions but can perform well
  • Comparable performance with basic decision trees and selected neural network classifiers
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct (online learning); prior knowledge can be combined with observed data
• Standard: provides a theoretically sound standard of optimal decision making against which other methods can be measured

Naïve Bayes Classifier
• Based on Bayes' theorem
• Combines the probability of each feature with respect to a class label
• Makes a strong independence assumption: features are treated as independent of one another given the class
• Sample applications:
  • Classify email as spam based on sender and text
  • Diagnose meningitis based on chest X-ray and symptoms
  • Classify fruit from shape and color
  • Determine lifestyle from education and salary
April 13, 2015

Naïve Bayes Classifier
• Let's say we have a hypothesis H, and we want to calculate the probability of the hypothesis being correct
• Hypothesis: given features $x_1$, $x_2$ => object is a peach
• Calculate the probability that $x_1$, $x_2$ is a peach
  • P(H: $x_1$, $x_2$ is a peach)
  • P(H: $x_1$, $x_2$ is an apricot)
1. Calculate each of these probabilities
2. Choose the highest probability

Naïve Bayes Classifier
• P(H|X): posterior probability of hypothesis H
  • X: {$x_1$, $x_2$, …, $x_n$}
  • Shows the confidence/probability of H given X
  • $x_1$: shape = round, $x_2$: color = orange
  • H: $x_1$, $x_2$ is a peach
• P(H): prior probability of hypothesis H
  • Represents the probability of H just happening, regardless of the data
  • E.g., what is the probability of picking a peach from a fruit bin without knowledge of shape and color?

Bayes Theorem - Learning
• P(X|H): likelihood, the evidence X conditioned on hypothesis H
  • Shows the confidence (probability) of X given H
  • Given that H is true (X is a peach), calculate the probability that X is round and orange, i.e., $x_1$ = round, $x_2$ = orange
• P(X): prior probability of X
  • Represents the probability that a sample is round and orange

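Bayes Theorem - Classification
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
• P(X|H) is the likelihood, P(H) the prior of the hypothesis, P(X) the prior of the evidence, and P(H|X) the posterior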

Naïve Bayes Classification
• Hypothesis H is the class $C_i$
• Note: P(X) can be ignored (for classification) as it is constant for all classes
• Under the independence assumption, $P(X \mid C_i)$ is:
  $$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
• Therefore:
  $$P(C_i \mid X) \propto P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)$$
• $P(C_i)$ is the ratio of the number of samples in class $C_i$ to the number of all samples

Naïve Bayes Classification
• For a categorical attribute:
  • $P(x_k \mid C_i)$ is the relative frequency of samples having value $x_k$ within class $C_i$
• For a continuous (numeric) attribute:
  • $P(x_k \mid C_i)$ is calculated via a Gaussian density function with mean $\mu$ and standard deviation $\sigma$:
    $$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
    $$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
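
The Gaussian case can be sketched in a few lines of Python. This is only an illustration: the function name `gaussian_density` and the class statistics (mean 38, standard deviation 12) are assumptions chosen for the example, not values from the slides.

```python
import math

def gaussian_density(x: float, mu: float, sigma: float) -> float:
    """Normal density g(x, mu, sigma); sigma is the standard deviation."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical per-class statistics for one numeric attribute:
# P(x_k = 35 | C_i) is approximated by the density at 35.
likelihood = gaussian_density(35.0, mu=38.0, sigma=12.0)
```

Note that the result is a density, not a probability, so it may exceed 1 for small sigma; only the comparison across classes matters.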

Naïve Bayes Classification
• Having pre-calculated all $P(x_k \mid C_i)$, an unknown example X is classified as follows:
  1. For all classes, calculate $P(C_i \mid X)$
  2. Assign X to the class with the highest $P(C_i \mid X)$
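
The two steps can be sketched for categorical attributes as follows. This is a minimal sketch, not a definitive implementation: the helper names `train_naive_bayes` and `classify` and the six-row fruit data set are hypothetical, and the score is the unnormalized product of the class prior and the per-attribute frequencies (P(X) is dropped because it is the same for every class).

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(k, label)][value] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    """Return the class with the highest P(C_i) * prod_k P(x_k | C_i)."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                              # prior P(C_i)
        for k, value in enumerate(x):
            score *= value_counts[(k, c)][value] / n_c   # P(x_k | C_i)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (shape, color) -> fruit
rows = [("round", "orange"), ("round", "orange"), ("round", "red"),
        ("oval", "orange"), ("oval", "orange"), ("round", "orange")]
labels = ["peach", "peach", "peach", "apricot", "apricot", "apricot"]
model = train_naive_bayes(rows, labels)
label, scores = classify(("round", "orange"), *model)
```

On this toy data the peach score is 1/2 × 1 × 2/3 and the apricot score is 1/2 × 1/3 × 1, so the sample is assigned to peach.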

Play Tennis? An incoming sample: X = <sunny, cool, high, true>

Play Tennis Example: estimating $P(x_i \mid C)$

Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
  $$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
• Ex.: suppose a data set with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplacian smoothing)
  • Add 1 to each case:
    Prob(income = low) = (0 + 1)/(1000 + 3) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  • The "corrected" probability estimates are close to their "uncorrected" counterparts
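
The income example can be reproduced with exact fractions. The function name `laplace_estimates` and its `pseudo` parameter are illustrative assumptions; the counts are the ones given on the slide.

```python
from fractions import Fraction

def laplace_estimates(counts, pseudo=1):
    """Add `pseudo` to every count and renormalize over all values."""
    total = sum(counts.values()) + pseudo * len(counts)
    return {value: Fraction(c + pseudo, total) for value, c in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_estimates(income_counts)
# probs["low"] is 1/1003 instead of 0, so the product over attributes
# can no longer be forced to zero by this attribute.
```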

Avoiding the Zero-Probability Problem
• Standard approach: use logarithms
• The log of a product is the sum of the logs:
  $$\log_2 \prod_{k=1}^{n} P(x_k \mid C_i) = \sum_{k=1}^{n} \log_2 P(x_k \mid C_i)$$
• This sum is used in place of the product when scoring $P(C_i \mid X)$
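
A small sketch of why the log form helps in practice: multiplying many small probabilities underflows to zero in floating point, while the sum of logs stays finite. The function names and the 100 probabilities of 1e-5 are hypothetical. Every probability must still be non-zero (for example via the Laplacian correction), because the log of zero is undefined.

```python
import math

def product_score(probs):
    """Direct product of the conditional probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(probs):
    """Sum of base-2 logs; same ranking as the product, no underflow."""
    return sum(math.log2(p) for p in probs)

probs = [1e-5] * 100                 # hypothetical P(x_k | C_i) values
underflowed = product_score(probs)   # 1e-500 is not representable as a float
stable = log_score(probs)
```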

Naïve Bayes Classifier: Comments
• Advantages
  • Easy to implement
  • Good results obtained in most cases
• Disadvantages
  • Assumes class-conditional independence, which can cost accuracy
  • In practice, dependencies exist among variables
    • E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    • Dependencies among these cannot be modeled by a naïve Bayes classifier
• How to deal with these dependencies? Bayesian belief networks

Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  • Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage (support) and accuracy (confidence)
  • n_covers = # of tuples covered by R
  • n_correct = # of tuples correctly classified by R
  coverage(R) = n_covers / |D|   /* D: training data set */
  accuracy(R) = n_correct / n_covers
• If more than one rule is triggered, a conflict resolution or priority scheme is needed
  • Size ordering: assign the highest priority to the triggering rule that has the "toughest", most specific requirement (i.e., the most attribute tests)
  • Class-based ordering: decreasing order of prevalence or misclassification cost per class
  • Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
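
Coverage and accuracy of a rule can be computed directly from their definitions. In this sketch the rule representation (a dict of attribute tests plus a predicted class), the function name `rule_stats`, and the four-row data set are all hypothetical.

```python
def rule_stats(antecedent, predicted, data, target):
    """Return (coverage, accuracy) of IF antecedent THEN target = predicted."""
    covered = [row for row in data
               if all(row[attr] == value for attr, value in antecedent.items())]
    n_covers = len(covered)
    n_correct = sum(1 for row in covered if row[target] == predicted)
    coverage = n_covers / len(data)                       # n_covers / |D|
    accuracy = n_correct / n_covers if n_covers else 0.0  # n_correct / n_covers
    return coverage, accuracy

# Hypothetical training tuples
data = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
coverage, accuracy = rule_stats({"age": "youth", "student": "yes"}, "yes",
                                data, "buys_computer")
```

Here the rule covers two of the four tuples and classifies one of those two correctly.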

Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
[Figure: buys_computer decision tree. Root "age?" with branches <=30, 31..40, and >40; <=30 leads to "student?" (no → no, yes → yes); 31..40 leads to yes; >40 leads to "credit rating?" (excellent → no, fair → yes)]
• Example: rule extraction from our buys_computer decision tree
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Induction: Sequential Covering Method
• Sequential covering algorithm: extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class $C_i$ will cover many tuples of $C_i$ but none (or few) of the tuples of other classes
• Steps:
  • Rules are learned one at a time
  • Each time a rule is learned, the tuples covered by the rule are removed
  • Repeat the process on the remaining tuples until a termination condition is met, e.g., when there are no more training examples, or when the quality of a rule returned is below a user-specified threshold

Sequential Covering Algorithm
  while (enough target tuples left)
      generate a rule
      remove positive target tuples satisfying this rule
[Figure: positive examples, with the regions covered by Rule 1, Rule 2, and Rule 3]

Rule Generation
• To generate a rule:
  while (true)
      find the best predicate p
      if foil-gain(p) > threshold then add p to current rule
      else break
[Figure: positive and negative examples, with successively narrower rules A3=1, A3=1 && A1=2, and A3=1 && A1=2 && A8=5]

How to Learn-One-Rule?
• Start with the most general rule possible: condition = empty
• Add new attributes by adopting a greedy depth-first strategy
  • Pick the attribute that most improves the rule quality
• Rule-quality measures: consider both coverage and accuracy
  • FOIL gain (in FOIL & RIPPER): assesses info_gain by extending the condition, where pos/neg are the # of positive/negative tuples covered before the extension and pos'/neg' the # covered after it
    $$FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$$
  • Favors rules that have high accuracy and cover many positive tuples
• Rule pruning based on an independent set of test tuples
    $$FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$$
  • pos/neg are the # of positive/negative tuples covered by R
  • If FOIL_Prune is higher for the pruned version of R, prune R
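
Both measures are simple enough to check numerically. The sketch below assumes the usual reading of the symbols (pos/neg counted before a condition is added, pos'/neg' after); the function names and the example counts are hypothetical.

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain for extending a rule covering (pos, neg) to (pos_new, neg_new)."""
    if pos_new == 0:
        return 0.0  # an extension covering no positives gains nothing
    before = math.log2(pos / (pos + neg))
    after = math.log2(pos_new / (pos_new + neg_new))
    return pos_new * (after - before)

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), measured on a pruning set."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: the rule covers 10 positive and 10 negative tuples;
# after adding a condition it covers 8 positive and 2 negative tuples.
gain = foil_gain(10, 10, 8, 2)
prune_score = foil_prune(8, 2)
```

The gain is positive here because the extended rule is more accurate (80% vs. 50%) while still covering most of the positive tuples.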

Model Evaluation and Selection
• Evaluation metrics: how can we measure accuracy? Other metrics to consider?
• Use a validation (test) set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier's accuracy:
  • Holdout method, random subsampling
  • Cross-validation
  • Bootstrap
• Comparing classifiers:
  • Confidence intervals
  • Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix:
  Actual class \ Predicted class | C1                   | ¬C1
  C1                             | True Positives (TP)  | False Negatives (FN)
  ¬C1                            | False Positives (FP) | True Negatives (TN)
Example of a confusion matrix (the diagonal entries, 6954 and 2588, are the correct classifications):
  Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
  buy_computer = yes             | 6954               | 46                | 7000
  buy_computer = no              | 412                | 2588              | 3000
  Total                          | 7366               | 2634              | 10000
• Given m classes, an entry $CM_{i,j}$ in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
  A \ P | C  | ¬C |
  C     | TP | FN | P
  ¬C    | FP | TN | N
        | P' | N' | All
• Classifier accuracy: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
• Error rate: 1 - accuracy, or
  Error rate = (FP + FN)/All
• Class imbalance problem:
  • One class may be rare, e.g., fraud or HIV-positive
  • Significant majority of the negative class and minority of the positive class
  • Sensitivity: true positive recognition rate
    Sensitivity = TP/P
  • Specificity: true negative recognition rate
    Specificity = TN/N

Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision: exactness; what % of tuples that the classifier labeled as positive are actually positive?
  $$precision = \frac{TP}{TP + FP}$$
• Recall: completeness; what % of positive tuples did the classifier label as positive?
  $$recall = \frac{TP}{TP + FN}$$
• Perfect score is 1.0
• Inverse relationship between precision and recall
• F measure ($F_1$ or F-score): harmonic mean of precision and recall
  $$F = \frac{2 \times precision \times recall}{precision + recall}$$
• $F_\beta$: weighted measure of precision and recall
  $$F_\beta = \frac{(1 + \beta^2) \times precision \times recall}{\beta^2 \times precision + recall}$$
  • β = 0.5 weights precision twice as much as recall

Classifier Evaluation Metrics: Example
  Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
  cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity, TP/P)
  cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity, TN/N)
  Total                          | 230          | 9770        | 10000 | 96.50 (accuracy, (TP + TN)/All)
• Precision = 90/230 = 39.13%
• Recall = 90/300 = 30.00%
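
The cancer example can be recomputed from its four counts (TP = 90, FN = 210, FP = 140, TN = 9560). The function name `evaluation_metrics` and the dictionary layout are assumptions made for this sketch; the F-measure uses the standard weighted harmonic-mean definition with parameter `beta`.

```python
def evaluation_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the evaluation measures from the confusion-matrix counts."""
    p, n = tp + fn, fp + tn              # actual positives / actual negatives
    precision = tp / (tp + fp)
    recall = tp / p                      # identical to sensitivity
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }

cancer = evaluation_metrics(tp=90, fn=210, fp=140, tn=9560)
```

The accuracy works out to (90 + 9560)/10000 = 96.50% even though only 30% of the cancer cases are found, which is why accuracy alone is misleading on class-imbalanced data.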

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
  • Given data is randomly partitioned into two independent sets
    • Training set (e.g., 2/3) for model construction
    • Test set (e.g., 1/3) for accuracy estimation
  • Random sampling: a variation of holdout
    • Repeat holdout k times; accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
  • Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  • At the i-th iteration, use $D_i$ as the test set and the others as the training set
  • Leave-one-out: k folds where k = # of tuples, for small-sized data
  • Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data
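
The k-fold partitioning step can be sketched with index lists. This is an unstratified version for illustration only; the function name `k_fold_indices` and the `seed` parameter are assumptions, and the model training and scoring that would use each split are omitted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]                         # D_i is the test set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n=23, k=10))
```

Every tuple appears in exactly one test fold, so averaging the k test accuracies uses each tuple once for testing. Setting k equal to n gives leave-one-out.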

Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
  • Works well with small data sets
  • Samples the given training tuples uniformly with replacement
    • i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
  • A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since $(1 - 1/d)^d \approx e^{-1} = 0.368$)
  • Repeat the sampling procedure k times; overall accuracy of the model:
    $$Acc(M) = \frac{1}{k} \sum_{i=1}^{k} \left( 0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set} \right)$$
  • Note: training-set accuracy is used in the model accuracy
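
The 63.2% / 36.8% split can be checked empirically. In this sketch the function name `bootstrap_split`, the data set size d = 10,000, and the fixed seed are assumptions for illustration.

```python
import math
import random

def bootstrap_split(d, seed=0):
    """Sample d indices with replacement; unsampled indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test

d = 10_000
train, test = bootstrap_split(d)
left_out_fraction = len(test) / d        # expected to be close to 0.368
analytic = (1 - 1 / d) ** d              # probability a tuple is never chosen
```

For large d the analytic value approaches exp(-1), and the observed left-out fraction should fall near it.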

Estimating Confidence Intervals: Classifier Models M1 vs. M2
• Suppose we have 2 classifiers, $M_1$ and $M_2$; which one is better?
• Use 10-fold cross-validation to obtain the mean error rates $\overline{err}(M_1)$ and $\overline{err}(M_2)$
• These mean error rates are just estimates of the error on the true population of future data cases
• What if the difference between the 2 error rates can be attributed to chance alone?
  • Use a test of statistical significance
  • Obtain confidence limits for our error estimates

Estimating Confidence Intervals: Null Hypothesis
• Perform 10-fold cross-validation
• Assume samples follow a t distribution with k-1 degrees of freedom (here, k = 10)
• Use the t-test (or Student's t-test)
• Null hypothesis: $M_1$ & $M_2$ are the same
• If we can reject the null hypothesis, then
  • we conclude that the difference between $M_1$ & $M_2$ is statistically significant
  • Choose the model with the lower error rate

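Estimating Confidence Intervals: t-test
• If only 1 test set is available: pairwise comparison
  • For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain $err(M_1)_i$ and $err(M_2)_i$
  • Average over 10 rounds to get $\overline{err}(M_1)$ and $\overline{err}(M_2)$
  • The t-test computes the t-statistic with k-1 degrees of freedom:
    $$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$$
    where
    $$var(M_1 - M_2) = \frac{1}{k} \sum_{i=1}^{k} \left[ err(M_1)_i - err(M_2)_i - \left( \overline{err}(M_1) - \overline{err}(M_2) \right) \right]^2$$
• If two test sets are available: use the non-paired (2-sample) t-test, where
    $$var(M_1 - M_2) = \sqrt{\frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}}$$
  and $k_1$ & $k_2$ are the # of cross-validation samples used for $M_1$ & $M_2$, respectively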
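
A paired t-statistic over per-fold error rates can be sketched as below. This is a sketch under stated assumptions: the variance of the differences is divided by k, which is the form commonly shown with this material, whereas many statistics texts divide by k-1. The function name and the four per-fold error rates are hypothetical.

```python
import math

def paired_t_statistic(err1, err2):
    """t = mean difference / sqrt(var / k), with var = mean squared deviation."""
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean_diff = sum(diffs) / k
    var = sum((d - mean_diff) ** 2 for d in diffs) / k   # assumes 1/k, not 1/(k-1)
    return mean_diff / math.sqrt(var / k)

# Hypothetical per-fold error rates for two models on the same partitions
err_m1 = [0.20, 0.30, 0.25, 0.35]
err_m2 = [0.10, 0.25, 0.20, 0.20]
t = paired_t_statistic(err_m1, err_m2)
```

The resulting value would then be compared against the t-distribution table entry for k-1 degrees of freedom. The function divides by zero if all fold differences are identical, so a real implementation would need to guard that case.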

Estimating Confidence Intervals: Table for t-distribution
• The t-distribution is symmetric
• Significance level: e.g., sig = 0.05 or 5% means $M_1$ & $M_2$ are significantly different for 95% of the population
• Confidence limit: z = sig/2

Estimating Confidence Intervals: Statistical Significance
• Are $M_1$ & $M_2$ significantly different?
  • Compute t. Select a significance level (e.g., sig = 5%)
  • Consult the table for the t-distribution: find the t value corresponding to k-1 degrees of freedom (here, 9)
  • The t-distribution is symmetric: typically the upper % points of the distribution are shown → look up the value for confidence limit z = sig/2 (here, 0.025)
  • If t > z or t < -z, then the t value lies in the rejection region:
    • Reject the null hypothesis that the mean error rates of $M_1$ & $M_2$ are the same
    • Conclude: statistically significant difference between $M_1$ & $M_2$
  • Otherwise, conclude that any difference is due to chance

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
• Reading the plot:
  • Vertical axis represents the true positive rate
  • Horizontal axis represents the false positive rate
  • The plot also shows a diagonal line
  • A model with perfect accuracy will have an area of 1.0
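
The ranking procedure can be sketched as follows: sort the test tuples by decreasing score, step up for each positive and right for each negative, then integrate with the trapezoid rule. The function names, the scores, and the labels are hypothetical, and the sketch assumes no tied scores.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points for tuples ranked by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    p = sum(labels)                 # number of positive tuples (label 1)
    n = len(labels) - p             # number of negative tuples (label 0)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels (1 = positive)
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.54, 0.53, 0.51, 0.50, 0.40]
labels = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
auc = area_under_curve(roc_points(scores, labels))
```

A ranking that places every positive tuple above every negative one yields an area of 1.0, while a random ranking hovers around the diagonal at 0.5.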

Issues Affecting Model Selection
• Accuracy
  • Classifier accuracy: predicting the class label
• Speed
  • Time to construct the model (training time)
  • Time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
  • Understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Ensemble Methods: Increasing the Accuracy
• Ensemble methods
  • Use a combination of models to increase accuracy
  • Combine a series of k learned models, $M_1$, $M_2$, …, $M_k$, with the aim of creating an improved model M*
• Popular ensemble methods
  • Bagging: averaging the prediction over a collection of classifiers
  • Boosting: weighted vote with a collection of classifiers
  • Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation
• Analogy: diagnosis based on multiple doctors' majority vote
• Training
  • Given a set D of d tuples, at each iteration i, a training set $D_i$ of d tuples is sampled with replacement from D (i.e., bootstrap)
  • A classifier model $M_i$ is learned for each training set $D_i$
• Classification: classify an unknown sample X
  • Each classifier $M_i$ returns its class prediction
  • The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
• Accuracy
  • Often significantly better than a single classifier derived from D
  • For noisy data: not considerably worse, more robust
  • Proved improved accuracy in prediction
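
The training and voting steps can be sketched with a deliberately simple base learner. Everything named here is an assumption for illustration: `learn_stump` is a hypothetical one-dimensional threshold classifier (it assumes class 1 has the larger attribute values), and the 20-tuple data set is made up.

```python
import random
from collections import Counter

def learn_stump(sample):
    """Hypothetical base learner: threshold halfway between the class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:                    # bootstrap sample has one class only
        only = 1 if xs1 else 0
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def bagging_fit(data, learn, k, seed=0):
    """Learn k models, each on a bootstrap sample D_i of the same size as D."""
    rng = random.Random(seed)
    d = len(data)
    return [learn([data[rng.randrange(d)] for _ in range(d)]) for _ in range(k)]

def bagging_predict(models, x):
    """M*: each model votes; return the class with the most votes."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: values 0..9 belong to class 0, values 10..19 to class 1
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
models = bagging_fit(data, learn_stump, k=11)
```

For numeric prediction, the vote in `bagging_predict` would be replaced by the average of the individual predictions.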

Boosting
• Analogy: consult several doctors, based on a combination of weighted diagnoses, with the weight assigned based on previous diagnosis accuracy
• How does boosting work?
  • Weights are assigned to each training tuple
  • A series of k classifiers is iteratively learned
  • After a classifier $M_i$ is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$, to pay more attention to the training tuples that were misclassified by $M_i$
  • The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
• The boosting algorithm can be extended for numeric prediction
• Compared with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
• Note: weighted majority

AdaBoost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, ($X_1$, $y_1$), …, ($X_d$, $y_d$)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
  • Tuples from D are sampled (with replacement) to form a training set $D_i$ of the same size
  • Each tuple's chance of being selected is based on its weight
  • A classification model $M_i$ is derived from $D_i$
  • Its error rate is calculated using $D_i$ as a test set
  • If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err($X_j$) is the misclassification error of tuple $X_j$. Classifier $M_i$'s error rate is the sum of the weights of the misclassified tuples:
  $$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$
• The weight of classifier $M_i$'s vote is
  $$\log \frac{1 - error(M_i)}{error(M_i)}$$
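
The error rate and vote weight follow directly from the two formulas. The weight update in the sketch below is an assumption: it uses one common scheme (multiply the weights of correctly classified tuples by error/(1 - error), then renormalize), which increases the relative weight of misclassified tuples as described. The natural logarithm, the function names, and the five-tuple outcome are also assumptions.

```python
import math

def classifier_error(weights, misclassified):
    """error(M_i): sum of the weights of the misclassified tuples."""
    return sum(w for w, miss in zip(weights, misclassified) if miss)

def vote_weight(error):
    """Weight of the classifier's vote: log((1 - error) / error)."""
    return math.log((1.0 - error) / error)

def update_weights(weights, misclassified, error):
    """Assumed update: shrink correctly classified tuples, then renormalize."""
    factor = error / (1.0 - error)
    scaled = [w if miss else w * factor
              for w, miss in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]

d = 5
weights = [1.0 / d] * d                               # initial weights 1/d
misclassified = [True, False, False, False, False]    # hypothetical outcome
error = classifier_error(weights, misclassified)
alpha = vote_weight(error)
new_weights = update_weights(weights, misclassified, error)
```

With one of five equally weighted tuples misclassified, the error is 0.2, and after the update the misclassified tuple carries half of the total weight.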

Random Forest (Breiman, 2001)
• Random forest:
  • Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split
  • During classification, each tree votes and the most popular class is returned
• Two methods to construct a random forest:
  • Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node. The CART (Gini) or information gain methodology is used to grow the trees to maximum size
  • Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)
• Comparable in accuracy to AdaBoost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting

Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
  • Oversampling: re-sampling of data from the positive class
  • Under-sampling: randomly eliminate tuples from the negative class
  • Threshold-moving: moves the decision threshold, t, so that the rare-class tuples are easier to classify, and hence there is less chance of costly false negative errors
  • Ensemble techniques: ensemble multiple classifiers, as introduced above
• The class-imbalance problem remains difficult on multiclass tasks

Summary (I)
• Classification is a form of data analysis that extracts models describing important data classes.
• Effective and scalable methods have been developed for decision tree induction, naïve Bayesian classification, rule-based classification, and many other classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and $F_\beta$ measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation.
• Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.

Summary (II)
• Significance tests and ROC curves are useful for model selection.
• There have been numerous comparisons of the different classification methods; the matter remains a research topic.
• No single method has been found to be superior over all others for all data sets.
• Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade-offs, further complicating the quest for an overall superior method.

References (1)
• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
• C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern Analysis for Effective Classification. ICDE'07.
• H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct Discriminative Pattern Mining for Effective Classification. ICDE'08.
• W. Cohen. Fast effective rule induction. ICML'95.
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.

References (2)
• A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001.
• U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI'94.
• Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997.
• J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. VLDB'98.
• J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree Construction. SIGMOD'99.
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
• D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.
• W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.

References (3)
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
• J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
• S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.

References (4)
• R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. VLDB'98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.

Decision Tree Induction: An Example
- Training data set: Buys_computer
- The data set follows an example of Quinlan's ID3 (Playing Tennis)

  age     income   student   credit_rating   buys_computer
  <=30    high     no        fair            no
  <=30    high     no        excellent       no
  31…40   high     no        fair            yes
  >40     medium   no        fair            yes
  >40     low      yes       fair            yes
  >40     low      yes       excellent       no
  31…40   low      yes       excellent       yes
  <=30    medium   no        fair            no
  <=30    low      yes       fair            yes
  >40     medium   yes       fair            yes
  <=30    medium   yes       excellent       yes
  31…40   medium   no        excellent       yes
  31…40   high     yes       fair            yes
  >40     medium   no        excellent       no

- Resulting tree:

  age?
    <=30   -> student?
                no  -> no
                yes -> yes
    31…40  -> yes
    >40    -> credit rating?
                excellent -> no
                fair      -> yes
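To make the resulting tree concrete, here is a minimal sketch of how it could be stored and used to classify a tuple. The nested-tuple representation and the names `TREE` and `classify` are assumptions made for illustration; they do not come from the slides.

```python
# Hypothetical encoding of the tree on this slide: an internal node is an
# (attribute, {value: subtree}) pair and a leaf is a class-label string.
# "31..40" is an ASCII spelling of the table's middle age bracket.
TREE = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})


def classify(tree, record):
    """Follow the branch matching the record until a leaf label is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree


# Three tuples copied from the training table (class label left out).
examples = [
    {"age": "<=30", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "31..40", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": ">40", "income": "low", "student": "yes", "credit_rating": "excellent"},
]
predictions = [classify(TREE, record) for record in examples]
print(predictions)
```

Note that income is never tested: the tree only consults the attributes that appear on the path from the root to a leaf.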

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - Tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
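The greedy procedure and its stopping conditions can be sketched in a few lines. This is an illustrative ID3-style sketch, not the slides' own code: the function names, the row layout, and the nested-tuple tree format are all assumptions, and "31..40" is an ASCII spelling of the middle age bracket.

```python
import math
from collections import Counter

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in DATA]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def partition(rows, attribute):
    """Group rows by their value for the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def information_gain(rows, attribute, target):
    """Entropy of the node minus the size-weighted entropy after the split."""
    remainder = sum(
        len(subset) / len(rows) * entropy([r[target] for r in subset])
        for subset in partition(rows, attribute).values()
    )
    return entropy([r[target] for r in rows]) - remainder


def build_tree(rows, attributes, target):
    """Return a leaf label or an (attribute, {value: subtree}) pair."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:        # stop: all samples belong to one class
        return labels[0]
    if not attributes:               # stop: no attributes left, majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Branches are created only for values present in `rows`, so the
    # "no samples left" condition never yields an empty partition here.
    return (best, {value: build_tree(subset, remaining, target)
                   for value, subset in partition(rows, best).items()})


tree = build_tree(ROWS, list(COLUMNS[:-1]), "buys_computer")
print(tree)
```

On the buys_computer data this sketch selects age at the root, then student and credit_rating in the two impure branches.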

Brief Review of Entropy
- [Figure: entropy for m = 2]
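For the two-class case (m = 2), entropy depends on a single probability p. The sketch below assumes the standard binary entropy formula with base-2 logarithms; the function name `binary_entropy` is chosen here for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) is taken to be 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f}")
```

The value peaks at p = 0.5, where the two classes are equally likely, and falls to zero when one class is certain.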

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
- Expected information (entropy) needed to classify a tuple in $D$:
  $$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
- Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$:
  $$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
- Information gained by branching on attribute $A$:
  $$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
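The three formulas translate almost directly into code when each partition is described by its per-class tuple counts. This is a sketch under that assumption; the function names and the two toy splits are made up for illustration.

```python
import math


def info(counts):
    """Info(D): entropy in bits from a list of per-class tuple counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def info_after_split(partitions):
    """Info_A(D): size-weighted entropy of the partitions D_1..D_v.

    Each partition is a list of per-class counts, e.g. [yes_count, no_count].
    """
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)


def gain(partitions):
    """Gain(A) = Info(D) - Info_A(D); D's class counts are the column sums."""
    parent = [sum(column) for column in zip(*partitions)]
    return info(parent) - info_after_split(partitions)


# Hypothetical splits of 8 tuples (4 per class) into two partitions.
pure_split = [[4, 0], [0, 4]]      # each partition holds a single class
useless_split = [[2, 2], [2, 2]]   # each partition mirrors the parent
print(gain(pure_split), gain(useless_split))
```

A split that isolates the classes recovers all of the parent's entropy as gain, while a split that leaves the class mix unchanged gains nothing.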

Attribute Selection: Information Gain
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- Training data: the buys_computer table from "Decision Tree Induction: An Example"

  $$\mathrm{Info}(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

  age     p_i   n_i   I(p_i, n_i)
  <=30    2     3     0.971
  31…40   4     0     0
  >40     3     2     0.971

  $$\mathrm{Info}_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

- $\frac{5}{14} I(2,3)$ means "age <=30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

  $$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.246$$

- Similarly,

  $$\mathrm{Gain}(income) = 0.029$$
  $$\mathrm{Gain}(student) = 0.151$$
  $$\mathrm{Gain}(credit\_rating) = 0.048$$
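The four gains can be checked directly against the training table. The sketch below is illustrative: the names and the row layout are assumptions, and "31..40" is an ASCII spelling of the middle age bracket. Because the slide subtracts already-rounded values (0.940 − 0.694), full-precision results can differ from its figures in the third decimal place.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def info(labels):
    """Info(D): entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, column):
    """Gain(A) for the attribute stored at index `column`; the label is last."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    info_a = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([row[-1] for row in rows]) - info_a


gains = {name: gain(ROWS, index) for index, name in enumerate(COLUMNS[:-1])}
for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f}")
best_attribute = max(gains, key=gains.get)
```

Age has the largest gain, which is why it becomes the root test of the tree.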

Computing Information-Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
  - The point with the minimum expected information requirement for A is selected as the split-point for A
- Split:
  - D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
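The split-point search described above can be sketched as follows. The function names and the sample ages and labels are hypothetical, and skipping midpoints between equal values is a design choice of this sketch rather than something the slide specifies.

```python
import math
from collections import Counter


def info(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split_point, expected_info) minimizing Info_A(D) over midpoints.

    Returns None when all values are equal, since no split is possible.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        low, high = pairs[i][0], pairs[i + 1][0]
        if low == high:
            continue  # identical adjacent values cannot be separated
        split = (low + high) / 2
        d1 = [label for value, label in pairs if value <= split]  # A <= split
        d2 = [label for value, label in pairs if value > split]   # A > split
        expected = len(d1) / n * info(d1) + len(d2) / n * info(d2)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best


# Hypothetical ages and class labels, for illustration only.
ages = [23, 25, 31, 38, 42, 47]
buys = ["no", "no", "yes", "yes", "yes", "no"]
split_point, expected_info = best_split_point(ages, buys)
print(split_point, round(expected_info, 3))
```

Each candidate midpoint is scored with the same weighted-entropy formula used for categorical attributes, treating the split as a two-way partition.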

Gain Ratio for Attribute Selection (C4.5)
- Information gain measure is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
  $$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
- GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Ex.
  - gain_ratio(income) = 0.029/1.557 = 0.019
- The attribute with the maximum gain ratio is selected as the splitting attribute
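The income example can be reproduced from the income column of the buys_computer table. This is a sketch with illustrative names; the gain of 0.029 is taken from the slides rather than recomputed here.

```python
import math
from collections import Counter


def split_info(values):
    """SplitInfo_A(D): entropy in bits of the partition sizes induced by A."""
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())


# Income column of the 14 training tuples, in table order.
income = ["high", "high", "high", "medium", "low", "low", "low",
          "medium", "low", "medium", "medium", "medium", "high", "medium"]

gain_income = 0.029  # Gain(income) as reported on the information gain slide
split_info_income = split_info(income)
gain_ratio_income = gain_income / split_info_income
print(f"SplitInfo = {split_info_income:.3f}, GainRatio = {gain_ratio_income:.3f}")
```

SplitInfo depends only on how many tuples fall into each partition, not on their classes, so an attribute with many small partitions gets a large denominator and a lower ratio.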

Data Mining: Classification. Jay Urbain, PhD. Credits: Tom Mitchell, Machine Learning, CMU; Nazli Goharian, IIT/Georgetown; Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining.
