Integrating Spark MLlib into Weka
Mark Hall
Pentaho Data Mining Architect, Hitachi Vantara
Agenda
The Spark distributed processing framework was designed with iterative machine learning in mind. This session discusses:
• Integration of MLlib classification algorithms into Weka
• Consistent evaluation of algorithms on the desktop and in the cluster
• Benefits for data science practitioners
What’s Weka?
• Weka is a library containing a large collection of machine learning algorithms, implemented in Java
• Main types of learning problems that it can tackle
– Classification: given a labeled set of observations, learn to predict labels for new observations
– Regression: predict a numeric value instead of a label
– Attribute selection: find attributes of observations that are important for prediction
– Clustering: no labels, just identify groups of similar observations (clusters)
• 190 plugin “packages”
Introduction
• Distributed Weka for Spark package:
– Averaging for several classifiers
– “Dagging” for all the rest
• Ensembles via Dagging can work well, but…
– Partition size is another tuning parameter
– Small partitions might lead to poor modelling power
[Figure: a cached RDD<Instance> (Partition 0, Partition 1) goes through "map partitions"; within each partition, instances are streamed or batched to the algorithm to produce a model; the resulting RDD<Model> forms a voted ensemble]
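The Dagging idea on this slide (one model per partition, then a vote) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the nearest-mean base learner, the function names `train_nearest_mean` and `dagging_predict`, and the toy data are hypothetical and are not taken from the Distributed Weka package.

```python
from collections import Counter, defaultdict

def train_nearest_mean(partition):
    """Hypothetical base learner: stores the mean feature value per class."""
    sums = defaultdict(float)
    counts = Counter()
    for x, label in partition:
        sums[label] += x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in counts}
    # Predict the class whose stored mean is closest to x.
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def dagging_predict(models, x):
    """Majority vote over the models built on each partition."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: each inner list plays the role of one RDD partition
# holding (feature, label) pairs.
partitions = [
    [(1.0, "a"), (2.0, "a"), (9.0, "b")],
    [(1.5, "a"), (8.0, "b"), (10.0, "b")],
    [(0.5, "a"), (9.5, "b")],
]

# "Map partitions" step: one model per partition.
models = [train_nearest_mean(p) for p in partitions]
```

Each model only ever sees its own partition, which is why partition size acts as a tuning parameter: the third partition above has just two instances to learn from.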
MLlib
• Small set of algorithms
• Learning is fully distributed -> single final model
• Could have an accuracy advantage over Dagging for complex problems
• Definitely has an advantage when model comprehensibility is important
• New distributedWekaSparkDev package
– No-coding MLlib integration
MLlib Integration in Weka: Desktop Mode
• Weka wrapper classifiers for MLlib supervised learning schemes
• Work like any other Weka classifier
• Operate on datasets that fit into main memory on the desktop
• Can be:
– Used within Weka’s evaluation framework
– Used as base classifiers in meta learners
– Combined with preprocessing filters
– Used in standard Knowledge Flow processes
– Used in repeated cross-validation experiments in the Experimenter
Under the Hood
• Weka MLlib classifiers accept standard Instances objects
• Weka filters are applied automatically (where necessary)
– MLlibNaiveBayes wrapper discretizes numeric fields for Bernoulli model
• Instances are parallelized to RDD[Instance]
• RDD[Instance] converted to RDD[LabeledPoint]
• Local Spark cluster started on the fly
• Scoring only requires LabeledPoint data structure: no Spark cluster/infrastructure required
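The conversion chain above (instance, then labeled point, with numeric fields discretized for a Bernoulli model) can be illustrated without Spark. The sketch below is an assumption-laden stand-in: the `LabeledPoint` tuple only mimics the shape of MLlib's label-plus-features structure, and the threshold-based `binarize` helper is hypothetical, whereas the wrapper itself relies on Weka filters.

```python
from typing import List, NamedTuple

class LabeledPoint(NamedTuple):
    """Plain-Python stand-in for a label plus a numeric feature vector."""
    label: float
    features: List[float]

def to_labeled_point(row, class_index, class_values):
    """Split the class attribute from the features and encode it as an index."""
    label = float(class_values.index(row[class_index]))
    features = [float(v) for i, v in enumerate(row) if i != class_index]
    return LabeledPoint(label, features)

def binarize(point, thresholds):
    """Hypothetical discretization: 1.0 if a value exceeds its threshold, else 0.0."""
    bits = [1.0 if v > t else 0.0 for v, t in zip(point.features, thresholds)]
    return LabeledPoint(point.label, bits)

# One toy instance with two numeric attributes and a nominal class.
row = [5.1, 3.5, "setosa"]
point = to_labeled_point(row, class_index=2, class_values=["setosa", "versicolor"])
binary_point = binarize(point, thresholds=[5.0, 4.0])
```

Because the result is just a label and a feature vector, scoring a single point needs no cluster, which matches the last bullet on the slide.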
MLlib Integration in Weka: Distributed Weka
• Run in a cluster
• Data sourced from formats supported by Spark data frames
• Data frames converted to RDD[Instance], then to RDD[LabeledPoint]
• Weka filters can be applied within each RDD partition
• Implements hold-out and cross-validation for evaluation
Cross-Validation
• Cross-validation folds that are consistent for both MLlib and Weka classifiers
• Max parallelism for Weka Dagging/model averaging
– Build all training fold classifiers in one pass over the data
– Evaluate all classifiers in second pass
[Figure: the two passes over the data, labeled Pass 1 and Pass 2]
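One simple way to get folds that are identical for two different learners is to derive the fold assignment from a fixed seed and reuse it. The sketch below shows that idea only; the slides do not say how the package assigns folds, so the seeded shuffle and the names `assign_folds` and `split` are assumptions made for illustration.

```python
import random

def assign_folds(num_instances, num_folds, seed):
    """Return one fold id per instance; the same seed always yields the same folds."""
    fold_ids = [i % num_folds for i in range(num_instances)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids

def split(data, fold_ids, test_fold):
    """Partition data into (train, test) for the given test fold."""
    train = [x for x, f in zip(data, fold_ids) if f != test_fold]
    test = [x for x, f in zip(data, fold_ids) if f == test_fold]
    return train, test

data = list(range(10))
# Two independent calls with the same seed, e.g. one per toolkit being compared.
folds_for_weka = assign_folds(len(data), num_folds=3, seed=1)
folds_for_mllib = assign_folds(len(data), num_folds=3, seed=1)
```

With identical fold ids, any difference in the evaluation results comes from the learners and not from the data split.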
Cross-Validation for MLlib
• MLlib classifiers
– New RDD for each training fold
– Assemble partial folds for fold k
– Each fold processed sequentially in turn
[Figure: each partition of the original dataset RDD (Partition 1, Partition 2) holds partial folds Fold 1, Fold 2 and Fold 3; the Fold 1 parts are assembled into the RDD for test fold 1, and the Fold 2 and Fold 3 parts into the RDD for training fold 1, which MLlib uses to build model M1]
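The assembly of partial folds can be sketched with plain lists standing in for RDD partitions. This is a minimal illustration, assuming a round-robin split inside each partition; the helper names `partial_folds` and `assemble` are hypothetical and the actual package may divide partitions differently.

```python
def partial_folds(partition, num_folds):
    """Split one partition into num_folds partial folds (round-robin here)."""
    return [partition[k::num_folds] for k in range(num_folds)]

def assemble(partitions, num_folds, test_fold):
    """Gather partial fold test_fold from every partition as the test set
    and all other partial folds as the training set."""
    train, test = [], []
    for partition in partitions:
        for k, chunk in enumerate(partial_folds(partition, num_folds)):
            (test if k == test_fold else train).extend(chunk)
    return train, test

# Two toy partitions of the original dataset.
partitions = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]

# Folds are handled one after another, each with its own train/test pair.
results = [assemble(partitions, 3, k) for k in range(3)]
```

Every instance appears in exactly one test set across the three folds, and each training set is built fresh per fold, mirroring the "new RDD for each training fold" bullet.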
Demonstration
• MLlib classifiers running in Weka Explorer and Knowledge Flow
• Comparing MLlib schemes against Weka, R and Python equivalents in the Weka Experimenter
• Deploying an MLlib model in PDI’s WekaScoring step
Summary
What we covered today:
• Integration of MLlib algorithms continues Weka’s interoperability theme
• Provides convenient no-coding access to MLlib algorithms for desktop and cluster-based execution
• Simplifies the data scientist’s job when considering multiple tools
– Weka vs R vs Scikit-learn vs MLlib within one unified experimental framework
Next Steps
Want to learn more?
• http://wiki.pentaho.com/display/DATAMINING/Pentaho+Data+Mining+Community+Documentation
• http://markahall.blogspot.co.nz/2017/07/integrating-spark-mllib-into-weka.html