

1. Tree Algorithms in Data Mining: Comparison of rpart and RWeka ... and Beyond. Achim Zeileis, http://statmath.wu.ac.at/~zeileis/

2. Motivation. For publishing new tree algorithms, benchmarks against established methods are necessary. When developing the tools in party, we benchmarked against rpart, the open-source implementation of CART. Statistical journals were usually happy with that. The usual comment from machine learners: you have to benchmark against C4.5, it's much better than CART! Quinlan provided source code for C4.5, but not under a license that would allow its use. Weka had an open-source Java implementation, but it was hard to access from R. When we developed RWeka, we were finally able to set up benchmarks of CART and C4.5 within R.

3. Tree algorithms. CART/RPart (rpart): classification and regression trees (Breiman, Friedman, Olshen, Stone 1984), with cross-validation-based cost-complexity pruning: RPart0 prunes at the best prediction error, RPart1 at the highest complexity parameter within 1 standard error. C4.5/J4.8 (RWeka): C4.5 (Quinlan 1993); tree size is determined by the confidence threshold C and the minimal leaf size M: J4.8 uses the standard heuristics C = 0.25, M = 2; J4.8(cv) cross-validates over C = 0.01, ..., 0.5 and M = 2, ..., 20. QUEST (LohTools): quick, unbiased and efficient statistical trees (Loh, Shih 1997), which popularized the concept of unbiased recursive partitioning in statistics; a hand-crafted convenience interface to the original binaries. CTree (party): conditional inference trees (Hothorn, Hornik, Zeileis 2006), unbiased recursive partitioning based on permutation tests. A minimal R sketch of how these learners are called follows below.
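As an illustration, a minimal sketch of calling three of these learners from R (package interfaces as in released versions of rpart, RWeka, and party; the LohTools wrapper for QUEST is omitted since it only interfaces local binaries):

    ## Fit three of the benchmarked tree learners on one data set,
    ## here the Pima Indians diabetes data from mlbench.
    library("rpart")   # CART
    library("RWeka")   # J4.8, the open-source C4.5 implementation
    library("party")   # conditional inference trees
    data("PimaIndiansDiabetes", package = "mlbench")

    ## CART with cross-validation-based cost-complexity pruning;
    ## RPart0 prunes at the cp with minimal cross-validated error.
    rp  <- rpart(diabetes ~ ., data = PimaIndiansDiabetes)
    cp0 <- rp$cptable[which.min(rp$cptable[, "xerror"]), "CP"]
    rp0 <- prune(rp, cp = cp0)

    ## J4.8 with the standard heuristics C = 0.25, M = 2.
    j48 <- J48(diabetes ~ ., data = PimaIndiansDiabetes,
               control = Weka_control(C = 0.25, M = 2))

    ## Conditional inference tree (permutation-test-based splits).
    ct  <- ctree(diabetes ~ ., data = PimaIndiansDiabetes)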

4. UCI data sets (mlbench)

    Data set                 # obs.   # cat. inputs   # num. inputs
    breast cancer               699          9                –
    chess                      3196         36                –
    circle*                    1000          –                2
    credit                      690          –               24
    heart                       303          8                5
    hepatitis                   155         13                6
    house votes 84              435         16                –
    ionosphere                  351          1               32
    liver                       345          –                6
    Pima Indians diabetes       768          –                8
    promotergene                106         57                –
    ringnorm*                  1000          –               20
    sonar                       208          –               60
    spirals*                   1000          –                2
    threenorm*                 1000          –               20
    tictactoe                   958          9                –
    titanic                    2201          3                –
    twonorm*                   1000          –               20

    (* artificial problem; see the simulator sketch below)
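The starred problems are artificial and can be regenerated with the mlbench simulators; a sketch with the sample sizes from the table (the spirals noise level is an assumption, the dimensions match the table):

    ## Draw the five artificial problems, 1000 observations each.
    library("mlbench")
    circle    <- mlbench.circle(1000, d = 2)
    ringnorm  <- mlbench.ringnorm(1000, d = 20)
    spirals   <- mlbench.spirals(1000, sd = 0.05)  # sd: illustrative choice
    threenorm <- mlbench.threenorm(1000, d = 20)
    twonorm   <- mlbench.twonorm(1000, d = 20)
    ## Each returns inputs and class labels; convert via as.data.frame().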

5. Analysis. 6 tree algorithms, 18 data sets, 500 bootstrap samples for each combination. Performance measure: out-of-bag misclassification rate. Complexity measure: number of splits + number of leaves. Individual results: simultaneous pairwise confidence intervals (Tukey all-pair comparisons). Aggregated results: Bradley-Terry model (alternatively: median linear consensus ranking, ...). A sketch of the resampling loop is given below.
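A minimal sketch of the resampling scheme for a single learner and data set, assuming the out-of-bag cases are those never drawn into the bootstrap sample:

    ## 500 bootstrap samples; performance = out-of-bag misclassification,
    ## complexity = number of splits + number of leaves.
    library("rpart")
    data("PimaIndiansDiabetes", package = "mlbench")
    dat <- PimaIndiansDiabetes

    set.seed(1)
    B <- 500
    miscl <- compl <- numeric(B)
    for (b in seq_len(B)) {
      idx  <- sample(nrow(dat), replace = TRUE)  # bootstrap sample
      oob  <- dat[-unique(idx), ]                # out-of-bag cases
      fit  <- rpart(diabetes ~ ., data = dat[idx, ])
      pred <- predict(fit, newdata = oob, type = "class")
      miscl[b] <- mean(pred != oob$diabetes)     # OOB error rate
      compl[b] <- nrow(fit$frame)  # internal nodes + leaves = splits + leaves
    }
    mean(miscl)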

6. Individual results: Pima Indians diabetes. [Plot: simultaneous confidence intervals for all pairwise algorithm differences, J4.8 through CTree; x-axis: misclassification difference (in percent), roughly −2.5 to 0.5.]

7. Individual results: Pima Indians diabetes. [Plot: simultaneous confidence intervals for all pairwise algorithm differences; x-axis: complexity difference, roughly −80 to 20.]

8. Individual results: Breast cancer. [Plot: simultaneous confidence intervals for all pairwise algorithm differences; x-axis: misclassification difference (in percent), roughly −1 to 1.]

9. Individual results: Breast cancer. [Plot: simultaneous confidence intervals for all pairwise algorithm differences; x-axis: complexity difference, roughly −15 to 10.]

10. Aggregated results: Misclassification. [Plot: Bradley-Terry worth parameters (y-axis, 0 to 0.6) for J4.8, J4.8(cv), RPart0, RPart1, QUEST, CTree.]

11. Aggregated results: Complexity. [Plot: Bradley-Terry worth parameters (y-axis, 0 to 0.6) for J4.8, J4.8(cv), RPart0, RPart1, QUEST, CTree.]

12. Summary. No clear preference between CART/RPart and C4.5/J4.8. Other tree algorithms perform similarly well. Cross-validated trees perform better than their non-cross-validated counterparts. The 1-standard-error rule does not seem to be supported. And now for something different: before, pairwise comparisons of tree algorithms; now, a tree algorithm for pairwise comparison data.

13. Model-based recursive partitioning. Generic algorithm: (1) Fit a parametric model for Y. (2) Assess stability of the model parameters over each splitting variable Z_j. (3) Split the sample along the Z_j* with the strongest association; choose the breakpoint with the highest improvement of the model fit. (4) Repeat steps 1–3 recursively in the subsamples until no more significant instabilities remain. Application: use Bradley-Terry models in step 1. Implementation: psychotree on R-Forge. A sketch with mob() from party follows below.
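A minimal sketch of the generic algorithm with mob() from party, here with a logistic regression model for illustration; the formula separates the model part from the partitioning variables with "|" (variable selection is an illustrative choice):

    ## Model-based partitioning: logistic model for diabetes given
    ## glucose, with parameter stability assessed over the
    ## partitioning variables pregnant, mass, and age.
    library("party")
    data("PimaIndiansDiabetes", package = "mlbench")
    pid <- mob(diabetes ~ glucose | pregnant + mass + age,
               data = PimaIndiansDiabetes,
               model = glinearModel, family = binomial())
    plot(pid)  # fitted model in each terminal node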

14. Germany's Next Topmodel. Study at the Department of Psychology, Universität Tübingen. 192 subjects rated the attractiveness of the candidates in the 2nd season of Germany's Next Topmodel. 6 finalists: Barbara Meier, Anni Wendler, Hana Nitsche, Fiona Erdmann, Mandy Graff, and Anja Platzer. Pairwise comparisons (with forced choice). Subject covariates: gender, age, and questions about interest in the show. A sketch of the Bradley-Terry tree fit follows below.
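A sketch of the corresponding Bradley-Terry tree, assuming the data ship as Topmodel2007 in psychotree (as in released versions of the package; minsize is an illustrative choice):

    ## Bradley-Terry tree for the paired-comparison preferences,
    ## partitioned by the subject covariates.
    library("psychotree")
    data("Topmodel2007", package = "psychotree")
    tm <- bttree(preference ~ ., data = Topmodel2007, minsize = 5)
    plot(tm)  # worth parameters of the six candidates per terminal node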

15. Germany's Next Topmodel. [Image slide; content not recovered.]

16. Germany's Next Topmodel. [Plot: fitted Bradley-Terry tree. Root split on age (p < 0.001; ≤ 52 vs. > 52); the younger group splits on q2 (p = 0.017; yes → terminal node 3, n = 35), and the q2 = no branch splits on gender (p = 0.007; male → node 5, n = 71; female → node 6, n = 56); subjects over 52 form node 7 (n = 30). Each terminal node shows worth parameters for the six candidates (B, Ann, H, F, M, Anj).]

17. References

Hothorn T, Leisch F, Zeileis A, Hornik K (2005). "The Design and Analysis of Benchmark Experiments." Journal of Computational and Graphical Statistics, 14(3), 675–699. doi:10.1198/106186005X59630

Schauerhuber M, Zeileis A, Meyer D (2008). "Benchmarking Open-Source Tree Learners in R/RWeka." In C Preisach, H Burkhardt, L Schmidt-Thieme, R Decker (eds.), Data Analysis, Machine Learning and Applications (Proceedings of the 31st Annual Conference of the Gesellschaft für Klassifikation e.V., Albert-Ludwigs-Universität Freiburg, March 7–9, 2007), pp. 389–396.

Hornik K, Buchta C, Zeileis A (2009). "Open-Source Machine Learning: R Meets Weka." Computational Statistics, 24(2), 225–232. doi:10.1007/s00180-008-0119-7

Strobl C, Wickelmaier F, Zeileis A (2009). "Accounting for Individual Differences in Bradley-Terry Models by Means of Recursive Partitioning." Technical Report 54, Department of Statistics, Ludwig-Maximilians-Universität München. URL http://epub.ub.uni-muenchen.de/10588/
