Supervised Learning: Classification (Sept. 24, 2018)
Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
• Numeric prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – Credit/loan approval
  – Medical diagnosis: is a tumor cancerous or benign?
  – Fraud detection: is a transaction fraudulent?
  – Web page categorization: which category does a page belong to?
Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • Accuracy rate is the percentage of test-set samples that are correctly classified by the model
    • The test set must be independent of the training set (otherwise overfitting results)
  – If the accuracy is acceptable, use the model to classify new data
• Note: if the test set is used to select models, it is called a validation (test) set
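The two steps can be sketched in plain Python with the toy "tenured" data from the next two slides. This is only an illustration: the classifier is the rule shown on the slide, hard-coded rather than learned by any algorithm.

```python
# Step 1: model construction on the training set
train = [  # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def model(rank, years):
    """The classification rule from the slide: IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: model usage -- first estimate accuracy on an independent test set
test = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(model(rank, years) == label for _, rank, years, label in test)
accuracy = correct / len(test)
print(accuracy)  # 0.75 -- the rule misclassifies only Merlisa
```

If the 0.75 accuracy is deemed acceptable, the model is then applied to genuinely unseen data such as (Jeff, Professor, 4).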
Process (1): Model Construction

Training data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

A classification algorithm applied to the training data yields the classifier (model), e.g.:

  IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction

The classifier is first checked against the test data, then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured?

Test data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes
Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary
Decision Tree Induction: An Example
• Training data set: Buys_computer
• The data set follows an example of Quinlan's ID3 (Playing Tennis)

  age    income  student  credit_rating  buys_computer
  <=30   high    no       fair           no
  <=30   high    no       excellent      no
  31…40  high    no       fair           yes
  >40    medium  no       fair           yes
  >40    low     yes      fair           yes
  >40    low     yes      excellent      no
  31…40  low     yes      excellent      yes
  <=30   medium  no       fair           no
  <=30   low     yes      fair           yes
  >40    medium  yes      fair           yes
  <=30   medium  yes      excellent      yes
  31…40  medium  no       excellent      yes
  31…40  high    yes      fair           yes
  >40    medium  no       excellent      no

• Resulting tree:

  age?
  ├─ <=30  → student?
  │          ├─ no  → no
  │          └─ yes → yes
  ├─ 31…40 → yes
  └─ >40   → credit_rating?
             ├─ excellent → no
             └─ fair      → yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – Tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  – There are no samples left
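The basic algorithm above can be sketched in plain Python. This is an illustrative ID3-style implementation (function names are mine, not from the slides), using information gain as the selection measure and the stopping conditions listed above; on the buys_computer data it picks age as the root test attribute, matching the tree on the previous slide:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(A) = Info(D) - Info_A(D), the expected reduction in entropy
    n = len(labels)
    parts = {}
    for row, lab in zip(rows, labels):
        parts.setdefault(row[attr], []).append(lab)
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - info_a

def build_tree(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer induction."""
    if len(set(labels)) == 1:                        # stop: node is pure
        return labels[0]
    if not attrs:                                    # stop: no attributes left
        return Counter(labels).most_common(1)[0][0]  # majority voting
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep], rest)
    return (best, branches)

# buys_computer training data from the slides ("31..40" stands in for "31...40")
attrs = ["age", "income", "student", "credit_rating"]
raw = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
rows = [dict(zip(attrs, r[:4])) for r in raw]
labels = [r[4] for r in raw]

tree = build_tree(rows, labels, attrs)
print(tree[0])  # age
```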
Brief Review of Entropy

[The slide showed a plot of the entropy function for a two-class distribution (m = 2): entropy is 0 when one class has probability 0 or 1, and maximal (1 bit) when both classes are equally likely.]
Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|
• Expected information (entropy) needed to classify a tuple in D:

    Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

• Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|D_j|/|D|) × Info(D_j)

• Information gained by branching on attribute A:

    Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

• Splitting on age:

    age    p_i  n_i  I(p_i, n_i)
    <=30   2    3    0.971
    31…40  4    0    0
    >40    3    2    0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  The (5/14) I(2,3) term means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

    Gain(age) = Info(D) − Info_age(D) = 0.246

• Similarly (on the training data shown earlier):
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
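These numbers are easy to verify. In the snippet below, the helper I(...) mirrors the slide's I(p, n) notation (the helper itself is mine):

```python
from math import log2

def I(*counts):
    """Entropy of a node with the given class counts -- I(p, n) on the slide."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

info_D = I(9, 5)                                             # ~ 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)  # ~ 0.694
gain_age = info_D - info_age
# ~ 0.2468 exactly; the slide's 0.246 comes from using the rounded
# intermediate values 0.940 - 0.694
```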
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
  – Sort the values of A in increasing order
  – Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    • (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
  – The point with the minimum expected information requirement for A is selected as the split point for A
• Split:
  – D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
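A sketch of this procedure (the function name is illustrative, not from the slides): sort the values, enumerate the midpoints between adjacent distinct values, and keep the one with the minimum expected information requirement:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the midpoint minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no midpoint between equal adjacent values
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= mid]   # D1: A <= split-point
        right = [lab for v, lab in pairs if v > mid]   # D2: A > split-point
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        best = min(best, (info, mid))
    return best[1]

# made-up ages: the classes separate cleanly between 30 and 40,
# so the chosen split point is their midpoint, 35.0
print(best_split_point([20, 25, 30, 40, 50, 60],
                       ["no", "no", "no", "yes", "yes", "yes"]))  # 35.0
```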
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

    SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j|/|D|) × log2(|D_j|/|D|)

  – GainRatio(A) = Gain(A)/SplitInfo_A(D)
• Ex.: income splits the 14 tuples into 4 (high), 6 (medium), and 4 (low), so SplitInfo_income(D) = 1.557, and
    gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute
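The 1.557 figure follows from the income partition sizes in the buys_computer data (4 high, 6 medium, 4 low); a quick check, with an illustrative helper:

```python
from math import log2

def split_info(part_sizes):
    """SplitInfo_A(D) for a partition of D into parts of the given sizes."""
    n = sum(part_sizes)
    return -sum(s / n * log2(s / n) for s in part_sizes if s)

si = split_info([4, 6, 4])      # income: high=4, medium=6, low=4 -> ~ 1.557
gain_ratio_income = 0.029 / si  # Gain(income) = 0.029 from the earlier slide -> ~ 0.019
```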
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as

    gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as

    gini_A(D) = (|D_1|/|D|) gini(D_1) + (|D_2|/|D|) gini(D_2)

• Reduction in impurity:

    Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
• Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no":

    gini(D) = 1 − (9/14)² − (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D_1: {low, medium} and 4 in D_2: {high}:

    gini_{income ∈ {low,medium}}(D) = (10/14) gini(D_1) + (4/14) gini(D_2) = 0.443

  Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
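These values can be checked directly. The class counts for the income split (D_1 = {low, medium}: 7 yes / 3 no; D_2 = {high}: 2 yes / 2 no) are tallied from the training table on the earlier slide:

```python
def gini(counts):
    """Gini index of a node with the given class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

g_D = gini([9, 5])                                     # ~ 0.459
g_income = 10/14 * gini([7, 3]) + 4/14 * gini([2, 2])  # ~ 0.443
```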
Comparing Attribute Selection Measures
• The three measures, in general, return good results, but:
  – Information gain:
    • biased towards multivalued attributes
  – Gain ratio:
    • tends to prefer unbalanced splits in which one partition is much smaller than the others
  – Gini index:
    • biased towards multivalued attributes
    • has difficulty when the number of classes is large
    • tends to favor tests that result in equal-sized partitions and purity in both partitions