Machine Learning in R: The mlr package
Lars Kotthoff¹
University of Wyoming, larsko@uwyo.edu
St Andrews, 24 July 2018
¹ with slides from Bernd Bischl
Outline
▷ Overview
▷ Basic Usage
▷ Wrappers
▷ Preprocessing with mlrCPO
▷ Feature Importance
▷ Parameter Optimization
Don’t reinvent the wheel.
Motivation
The good news
▷ hundreds of packages available in R
▷ often high-quality implementations of state-of-the-art methods
The bad news
▷ no common API (although very similar in many cases)
▷ not all learners work with all kinds of data and predictions
▷ what data, predictions, hyperparameters, etc. are supported is not easily available
⇒ mlr provides a domain-specific language for ML in R
Overview
▷ https://github.com/mlr-org/mlr
▷ 8-10 main developers, > 50 contributors, 5 GSoC projects
▷ unified interface for the basic building blocks: tasks, learners, hyperparameters, …
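These building blocks compose into a short, uniform workflow. A minimal sketch (classif.rpart stands in for any of the integrated learners):

library(mlr)
# task: the data plus meta-information (target, type of problem)
task = makeClassifTask(data = iris, target = "Species")
# learner: a uniform wrapper around the underlying implementation
learner = makeLearner("classif.rpart")
# model: the trained learner
model = train(learner, task)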
Basic Usage

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# create task
task = makeClassifTask(id = "iris", iris, target = "Species")
# create learner
learner = makeLearner("classif.randomForest")
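A trained model can also be used directly, without a resampling helper. A minimal sketch (the odd/even row split is illustrative, not from the original slides):

# train on half the rows, predict the other half, evaluate
model = train(learner, task, subset = seq(1, 150, by = 2))
pred = predict(model, task = task, subset = seq(2, 150, by = 2))
performance(pred, measures = mmce)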
Basic Usage ## Aggregated Result: ## Runtime: 0.0425465 ## Aggr perf: mmce.test.mean=0.0400000 ## Learner: classif.randomForest ## Task: iris ## Resample Result ## mmce.test.mean=0.0400000 ## # build model and evaluate 0.0400000 ## [Resample] iter 1: mmce ## Measures: holdout ## Resampling: holdout (learner, task) 7
Basic Usage ## Aggregated Result: ## Runtime: 0.0333493 ## Aggr perf: acc.test.mean=0.9800000 ## Learner: classif.randomForest ## Task: iris ## Resample Result ## acc.test.mean=0.9800000 ## # measure accuracy 0.9800000 ## [Resample] iter 1: acc ## Measures: holdout ## Resampling: holdout (learner, task, measures = acc) 8
Basic Usage ## ## [Resample] iter 8: 0.9333333 ## [Resample] iter 9: 1.0000000 ## [Resample] iter 10: 0.9333333 ## Aggregated Result: ## [Resample] iter 7: acc.test.mean=0.9600000 ## ## Resample Result ## Task: iris ## Learner: classif.randomForest ## Aggr perf: acc.test.mean=0.9600000 ## Runtime: 0.530509 1.0000000 1.0000000 # 10 fold cross-validation 1.0000000 crossval (learner, task, measures = acc) ## Resampling: cross-validation ## Measures: acc ## [Resample] iter 1: ## [Resample] iter 2: ## [Resample] iter 6: 0.9333333 ## [Resample] iter 3: 1.0000000 ## [Resample] iter 4: 1.0000000 ## [Resample] iter 5: 0.8000000 9
Basic Usage ## Aggregated Result: 1.0000000 0.0000000 ## [Resample] iter 7: 0.9444444 0.0555556 ## [Resample] iter 8: 0.8947368 0.1052632 ## acc.test.mean=0.9535819,mmce.test.mean=0.0464181 0.9473684 0.0526316 ## ## Resample Result ## Task: iris ## Learner: classif.randomForest ## Aggr perf: acc.test.mean=0.9535819,mmce.test.mean=0.0464181 ## Runtime: 0.28359 ## [Resample] iter 6: ## [Resample] iter 5: # more general -- resample description mmce rdesc = makeResampleDesc (”CV”, iters = 8) resample (learner, task, rdesc, measures = list (acc, mmce)) ## Resampling: cross-validation ## Measures: acc ## [Resample] iter 1: 1.0000000 0.0000000 0.9473684 0.0526316 ## [Resample] iter 2: 0.9473684 0.0526316 ## [Resample] iter 3: 0.9473684 0.0526316 ## [Resample] iter 4: 10
Finding Your Way Around

listMeasures(task)
##  [1] "featperc"          "multiclass.aunu"   "lsr"
##  [4] "bac"               "qsr"               "timeboth"
##  [7] "multiclass.aunp"   "timetrain"         "mmce"
## [10] "ber"               "timepredict"       "multiclass.brier"
## [13] "ssr"               "acc"               "logloss"
## [16] "wkappa"            "multiclass.au1p"   "multiclass.au1u"
## [19] "kappa"

listLearners(task)[1:5, c(1,3,4)]
##                class short.name      package
## 1 classif.adaboostm1 adaboostm1        RWeka
## 2   classif.boosting     adabag adabag,rpart
## 3        classif.C50        C50          C50
## 4    classif.cforest    cforest        party
## 5      classif.ctree      ctree        party
Integrated Learners
Classification
▷ LDA, QDA, RDA, MDA
▷ Trees and forests
▷ Boosting (different variants)
▷ SVMs (different variants)
▷ …
Regression
▷ Linear, lasso and ridge
▷ Boosting
▷ Trees and forests
▷ Gaussian processes
▷ …
Survival
▷ Cox-PH
▷ Cox-Boost
▷ Random survival forest
▷ Penalized regression
▷ …
Clustering
▷ K-Means
▷ EM
▷ DBscan
▷ X-Means
▷ …
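All learners follow the same type.algorithm naming scheme, so switching task types is uniform. A small sketch:

makeLearner("regr.lm")         # linear regression
makeLearner("surv.coxph")      # Cox proportional hazards
makeLearner("cluster.kmeans")  # k-means clustering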
Learner Hyperparameters

getParamSet(learner)
##                      Type  len   Def   Constr Req Tunable Trafo
## ntree             integer    -   500 1 to Inf   -    TRUE     -
## mtry              integer    -     - 1 to Inf   -    TRUE     -
## replace           logical    -  TRUE        -   -    TRUE     -
## classwt     numericvector <NA>     - 0 to Inf   -    TRUE     -
## cutoff      numericvector <NA>     -   0 to 1   -    TRUE     -
## strata            untyped    -     -        -   -   FALSE     -
## sampsize    integervector <NA>     - 1 to Inf   -    TRUE     -
## nodesize          integer    -     1 1 to Inf   -    TRUE     -
## maxnodes          integer    -     - 1 to Inf   -    TRUE     -
## importance        logical    - FALSE        -   -    TRUE     -
## localImp          logical    - FALSE        -   -    TRUE     -
## proximity         logical    - FALSE        -   -   FALSE     -
## oob.prox          logical    -     -        -   Y   FALSE     -
## norm.votes        logical    -  TRUE        -   -   FALSE     -
## do.trace          logical    - FALSE        -   -   FALSE     -
## keep.forest       logical    -  TRUE        -   -   FALSE     -
## keep.inbag        logical    - FALSE        -   -   FALSE     -
Learner Hyperparameters

# set hyperparameters when creating the learner
lrn = makeLearner("classif.randomForest", ntree = 100, mtry = 10)
# or equivalently, on an existing learner
lrn = setHyperPars(lrn, ntree = 100, mtry = 10)
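getHyperPars shows what is currently set. A quick check (output sketched as a printed R list):

getHyperPars(lrn)
## $ntree
## [1] 100
##
## $mtry
## [1] 10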
Wrappers
▷ extend the functionality of learners
▷ e.g. wrap a learner that cannot handle missing values with an impute wrapper
▷ hyperparameter spaces of learner and wrapper are joined
▷ can be nested
Wrappers
Available Wrappers
▷ Preprocessing: PCA, normalization (z-transformation)
▷ Parameter Tuning: grid, optim, random search, genetic algorithms, CMAES, iRace, MBO
▷ Filter: correlation- and entropy-based, χ²-test, mRMR, …
▷ Feature Selection: (floating) sequential forward/backward, exhaustive search, genetic algorithms, …
▷ Impute: dummy variables, imputations with mean, median, min, max, empirical distribution or other learners (see the sketch after this list)
▷ Bagging to fuse learners on bootstrapped samples
▷ Stacking to combine models in heterogeneous ensembles
▷ Over- and Undersampling for unbalanced classification
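As an example of the impute wrapper, a minimal sketch (randomForest itself cannot handle missing values; the choice of imputation methods here is illustrative):

wrapped = makeImputeWrapper(
  makeLearner("classif.randomForest"),
  classes = list(numeric = imputeMedian(), factor = imputeMode())
)
# the wrapped learner is used like any other learner
holdout(wrapped, task)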
Preprocessing with mlrCPO
▷ Composable Preprocessing Operators for mlr – https://github.com/mlr-org/mlrCPO
▷ separate R package, mlrCPO, due to complexity
▷ preprocessing operations (e.g. imputation or PCA) as R objects with their own hyperparameters

operation = cpoScale()
print(operation)
## scale(center = TRUE, scale = TRUE)
Preprocessing with mlrCPO
▷ objects are handled using the “piping” operator %>>%
▷ composition:
  imputing.pca = cpoImputeMedian() %>>% cpoPca()
▷ application to data:
  task %>>% imputing.pca
▷ combination with a Learner to form a machine learning pipeline:
  pca.rf = imputing.pca %>>% makeLearner("classif.randomForest")
mlrCPO Example: Titanic

# drop uninteresting columns
dropcol.cpo = cpoSelect(names = c("Cabin", "Ticket", "Name"), invert = TRUE)
# impute
impute.cpo = cpoImputeMedian(affect.type = "numeric") %>>%
  cpoImputeConstant("__miss__", affect.type = "factor")
mlrCPO Example: Titanic

train.task = makeClassifTask("Titanic", train.data, target = "Survived")
pp.task = train.task %>>% dropcol.cpo %>>% impute.cpo
print(pp.task)
## Supervised task: Titanic
## Type: classif
## Target: Survived
## Observations: 872
## Features:
##    numerics     factors     ordered functionals
##           3           4           0           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
##   0   1
## 541 331
## Positive class: 0
Combination with Learners
▷ attach one or more CPOs to a learner to build machine learning pipelines
▷ automatically handles preprocessing of test data

learner = dropcol.cpo %>>% impute.cpo %>>%
  makeLearner("classif.randomForest", predict.type = "prob")
# train using the task that was not preprocessed
pp.mod = train(learner, train.task)
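At prediction time the attached CPOs are applied to the new data automatically. A sketch, assuming test.data holds raw (unpreprocessed) test rows:

pred = predict(pp.mod, newdata = test.data)
# class probabilities, since predict.type = "prob"
head(getPredictionProbabilities(pred))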
mlrCPO Summary
▷ listCPO() shows available CPOs
▷ currently 69 CPOs, and growing: imputation, feature type conversion, target value transformation, over/undersampling, …
▷ CPO “multiplexer” enables combination of distinct preprocessing operations, selectable through a hyperparameter (see the sketch below)
▷ custom CPOs can be created using makeCPO()
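A sketch of the multiplexer; the selected.cpo parameter name and the CPO names are assumed from the mlrCPO documentation, so treat this as illustrative:

multiplexer = cpoMultiplex(list(cpoScale(), cpoPca()))
# which operation runs is itself a hyperparameter -- and thus tunable
multiplexer = setHyperPars(multiplexer, selected.cpo = "pca")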
Feature Importance

model = train(makeLearner("classif.randomForest"), iris.task)
getFeatureImportance(model)
## FeatureImportance:
## Task: iris-example
##
## Learner: classif.randomForest
## Measure: NA
## Contrast: NA
## Aggregation: function (x)
## x
## Replace: NA
## Number of Monte-Carlo iterations: NA
## Local: FALSE
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     9.857828    2.282677     44.58139    42.51918
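getFeatureImportance uses the learner’s built-in importance. Learner-independent filter values are an alternative; a sketch (the information.gain method additionally requires the FSelector package):

fv = generateFilterValuesData(iris.task, method = "information.gain")
fv$data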