

Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Implementation: Real machine learning schemes
- Decision trees: from ID3 to C4.5 (pruning, numeric attributes, ...)
- Classification rules: from PRISM to RIPPER and PART (pruning, numeric data, ...)
- Extending linear models: support vector machines and neural networks
- Instance-based learning: pruning examples, generalized exemplars, distance functions
- Numeric prediction: regression/model trees, locally weighted regression
- Clustering: hierarchical, incremental, probabilistic
- Bayesian networks: learning and prediction, fast data structures for learning

Industrial-strength algorithms
- For an algorithm to be useful in a wide range of real-world applications it must:
  - Permit numeric attributes
  - Allow missing values
  - Be robust in the presence of noise
  - Be able to approximate arbitrary concept descriptions (at least in principle)
- Basic schemes need to be extended to fulfill these requirements

Decision trees
- Extending ID3:
  - To permit numeric attributes: straightforward
  - To deal sensibly with missing values: trickier
  - Stability for noisy data: requires a pruning mechanism
- End result: C4.5 (Quinlan)
  - Best-known and (probably) most widely used learning algorithm
  - Commercial successor: C5.0

Numeric attributes
- Standard method: binary splits (e.g. temp < 45)
- Unlike nominal attributes, every numeric attribute has many possible split points
- The solution is a straightforward extension:
  - Evaluate info gain (or another measure) for every possible split point of the attribute
  - Choose the "best" split point
  - The info gain of the best split point is the info gain of the attribute
- Computationally more demanding

Weather data (again!)

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     Hot          High      False  No
  Sunny     Hot          High      True   No
  Overcast  Hot          High      False  Yes
  Rainy     Mild         High      False  Yes
  Rainy     Cool         Normal    False  Yes
  Rainy     Cool         Normal    True   No
  ...       ...          ...       ...    ...

  If outlook = sunny and humidity = high then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity = normal then play = yes
  If none of the above then play = yes
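The split-point search described on the "Numeric attributes" slide above can be sketched as follows. This is a minimal illustration under assumed names and data layout, not the C4.5 or Weka implementation: it evaluates the weighted information of every candidate split point of one numeric attribute and keeps the best one, with candidate thresholds placed halfway between adjacent distinct values, as the following slides do on the numeric weather data.

    import math

    def info(counts):
        """Entropy, in bits, of a list of class counts."""
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    def best_split_point(values, labels):
        """Try every candidate split point of a numeric attribute and return
        the threshold with the lowest weighted info (highest info gain).
        Candidates lie halfway between adjacent distinct sorted values."""
        pairs = sorted(zip(values, labels))
        classes = sorted(set(labels))
        n = len(pairs)
        best = None
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue                    # equal values: no split point here
            threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [lab for _, lab in pairs[:i + 1]]
            right = [lab for _, lab in pairs[i + 1:]]
            weighted = (len(left) * info([left.count(c) for c in classes])
                        + len(right) * info([right.count(c) for c in classes])) / n
            if best is None or weighted < best[1]:
                best = (threshold, weighted)
        return best

The attribute's info gain is then the node's own info minus the weighted info of its best threshold, and that number is what gets compared against the other attributes when choosing the split.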

Weather data (again!)

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     85           85        False  No
  Sunny     80           90        True   No
  Overcast  83           86        False  Yes
  Rainy     70           96        False  Yes
  Rainy     68           80        False  Yes
  Rainy     65           70        True   No
  ...       ...          ...       ...    ...

  If outlook = sunny and humidity > 83 then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity < 85 then play = yes
  If none of the above then play = yes

Example
- Split on the temperature attribute:

    64   65   68   69   70   71   72   72   75   75   80   81   83   85
    Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

- E.g. temperature < 71.5: yes/4, no/2
       temperature ≥ 71.5: yes/5, no/3
- Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
- Place split points halfway between values
- Can evaluate all split points in one pass!
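The "one pass" remark can be made concrete. The sketch below is an illustration only (the variable names and data layout are chosen here; this is not the book's or Weka's code): it sweeps once over the temperature column sorted as above, moving one instance at a time from the right partition to the left and updating the class counts incrementally, so every candidate threshold is evaluated without re-scanning the data.

    import math

    # Temperature values and play labels from the slide, already sorted.
    temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    labels = ["yes", "no", "yes", "yes", "yes", "no", "no",
              "yes", "yes", "yes", "no", "yes", "yes", "no"]

    def info(counts):
        """Entropy, in bits, of a list of class counts."""
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    n = len(temps)
    left = {"yes": 0, "no": 0}
    right = {"yes": labels.count("yes"), "no": labels.count("no")}
    best = None
    for i in range(n - 1):
        # Move instance i from the right partition to the left partition.
        left[labels[i]] += 1
        right[labels[i]] -= 1
        if temps[i] == temps[i + 1]:
            continue                               # no threshold between equal values
        threshold = (temps[i] + temps[i + 1]) / 2  # halfway between values
        weighted = ((i + 1) * info(list(left.values()))
                    + (n - i - 1) * info(list(right.values()))) / n
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    # At threshold 71.5 the weighted info is 6/14*info([4,2]) + 8/14*info([5,3]),
    # about 0.939 bits, matching the figure on the slide.

The entropies are recomputed from the running counts at each threshold, but every instance is touched only once per attribute, which is what makes the single pass possible.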
Can avoid repeated sorting
- Sort instances by the values of the numeric attribute
  - Time complexity for sorting: O(n log n)
- Does this have to be repeated at each node of the tree?
- No! The sort order for the children can be derived from the sort order for the parent
  - Time complexity of the derivation: O(n)
  - Drawback: need to create and store an array of sorted indices for each numeric attribute

Binary vs. multi-way splits
- Splitting (multi-way) on a nominal attribute exhausts all the information in that attribute
  - A nominal attribute is tested (at most) once on any path in the tree
- Not so for binary splits on numeric attributes!
  - A numeric attribute may be tested several times along a path in the tree
- Disadvantage: the tree is hard to read
- Remedy:
  - Pre-discretize numeric attributes, or
  - Use multi-way splits instead of binary ones

Computing multi-way splits
- Simple and efficient way of generating multi-way splits: a greedy algorithm
- Dynamic programming can find the optimum multi-way split in O(n²) time (see the code sketch below)
  - imp(k, i, j) is the impurity of the best split of values x_i ... x_j into k sub-intervals
  - imp(k, 1, i) = min over 0 < j < i of imp(k-1, 1, j) + imp(1, j+1, i)
  - imp(k, 1, N) gives the best k-way split
- In practice, the greedy algorithm works as well

Missing values
- Split instances with missing values into pieces
  - A piece going down a branch receives a weight proportional to the popularity of the branch
  - The weights sum to 1
- Info gain works with fractional instances
  - Use sums of weights instead of counts
- During classification, split the test instance into pieces in the same way
  - Merge the branches' probability distributions using the weights
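A small sketch of this weighting scheme follows. It is illustrative only: the dict-based instance representation, the field names, and the helper are assumptions made for this example, not Weka's data structures.

    def split_with_missing(instances, attribute, branches):
        """Distribute instances over the branches of a test on `attribute`.
        Each instance is a dict: {"values": {...}, "label": ..., "weight": ...}.
        An instance whose value for `attribute` is missing (None) is split
        into fractional pieces, one per branch, weighted by the branch's
        popularity among the instances with known values; the pieces'
        weights sum to the instance's own weight."""
        known = [x for x in instances if x["values"].get(attribute) is not None]
        total = sum(x["weight"] for x in known)   # sketch assumes some values are known
        share = {b: sum(x["weight"] for x in known
                        if x["values"][attribute] == b) / total
                 for b in branches}
        parts = {b: [] for b in branches}
        for x in instances:
            v = x["values"].get(attribute)
            if v is not None:
                parts[v].append(x)
            else:
                for b in branches:
                    parts[b].append({"values": x["values"], "label": x["label"],
                                     "weight": x["weight"] * share[b]})
        return parts

Info gain is then computed from the sums of these weights rather than from raw counts, and at prediction time the class distributions coming back from the branches are merged using the same weights.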

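Returning to the dynamic-programming recurrence on the "Computing multi-way splits" slide above: the sketch below is one way to spell it out, with an entropy-based interval impurity picked purely for concreteness (the slide does not fix the impurity measure, and the function names here are invented for the example).

    import math

    def interval_impurity(labels):
        """Impurity of one interval: class entropy weighted by the number of
        instances in it (one possible choice of impurity measure)."""
        n = len(labels)
        ent = 0.0
        for c in set(labels):
            p = labels.count(c) / n
            ent -= p * math.log2(p)
        return n * ent

    def best_k_way_split(labels, k):
        """imp[m][i] = impurity of the best split of the first i values into m
        sub-intervals, following the recurrence
            imp(k, 1, i) = min over 0 < j < i of imp(k-1, 1, j) + imp(1, j+1, i).
        `labels` must be ordered by the sorted values of the numeric attribute."""
        n = len(labels)
        INF = float("inf")
        imp = [[INF] * (n + 1) for _ in range(k + 1)]
        imp[0][0] = 0.0
        for m in range(1, k + 1):
            for i in range(m, n + 1):
                for j in range(m - 1, i):
                    cand = imp[m - 1][j] + interval_impurity(labels[j:i])
                    if cand < imp[m][i]:
                        imp[m][i] = cand
        return imp[k][n]    # imp(k, 1, N): impurity of the best k-way split

As written this recomputes each interval's impurity from scratch; precomputing class counts over prefixes of the sorted values brings the cost per level down to O(n²), in line with the bound quoted on the slide.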