integrating spark mllib into weka

Integrating Spark MLlib into Weka (Mark Hall, Pentaho Data Mining)


  1. Integrating Spark MLlib into Weka
     Mark Hall, Pentaho Data Mining Architect, Hitachi Vantara

  2. Agenda
     The Spark distributed processing framework was designed with iterative machine learning in mind. This session discusses:
     • Integration of MLlib classification algorithms into Weka
     • Consistent evaluation of algorithms on the desktop and in the cluster
     • Benefits for data science practitioners

  3. What’s Weka?
     • Weka is a library containing a large collection of machine learning algorithms, implemented in Java
     • Main types of learning problems that it can tackle
       – Classification: given a labeled set of observations, learn to predict labels for new observations
       – Regression: predict a numeric value instead of a label
       – Attribute selection: find attributes of observations that are important for prediction
       – Clustering: no labels, just identify groups of similar observations (clusters)
     • 190 plugin “packages”

  4. Introduction
     • Distributed Weka for Spark package:
     • Averaging for several classifiers
     • “Dagging” for all the rest
     • Ensembles via Dagging can work well, but…
       – Partition size is another tuning parameter
       – Small partitions might lead to poor modelling power
     [Diagram: RDD<Instance> → Map partitions → RDD<Model>; in Partition 0 and Partition 1, instances are streamed or batched to the algorithm to produce a model; further labels: “Cached”, “Voted ensemble”]
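The Dagging idea on this slide (one model per partition, combined by voting) can be sketched in plain Python. This is an illustration only, not the Distributed Weka implementation: the round-robin split stands in for RDD partitions, and the nearest-centroid base learner and all function names are assumptions chosen to keep the example short.

```python
from collections import Counter

def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into disjoint partitions (stand-in for RDD partitions)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def train_centroid_model(partition):
    """Toy base learner (assumption): one mean vector (centroid) per class label."""
    sums, counts = {}, Counter()
    for features, label in partition:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_one(model, features):
    """Predict the label whose centroid is nearest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

def dagging_predict(models, features):
    """Majority vote over the per-partition models."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]

rows = [([0.0, 0.1], "a"), ([0.2, 0.0], "a"), ([0.1, 0.3], "a"), ([0.3, 0.2], "a"),
        ([1.0, 0.9], "b"), ([0.8, 1.0], "b"), ([0.9, 0.7], "b"), ([1.1, 1.0], "b")]

# One model per partition; each model only ever sees its own slice of the data.
models = [train_centroid_model(p) for p in split_into_partitions(rows, 2)]
label_near_a = dagging_predict(models, [0.05, 0.05])
label_near_b = dagging_predict(models, [0.95, 0.95])
```

Raising `num_partitions` shrinks the data each base model sees, which is the tuning trade-off the slide warns about.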

  5. MLlib
     • Small set of algorithms
     • Learning is fully distributed → single final model
     • Could have an accuracy advantage over Dagging for complex problems
     • Definitely has an advantage when model comprehensibility is important
     • New distributedWekaSparkDev package
       – No-coding MLlib integration

  6. MLlib Integration in Weka: Desktop Mode
     • Weka wrapper classifiers for MLlib supervised learning schemes
     • Work like any other Weka classifier
     • Operate on datasets that fit into main memory on the desktop
     • Can be used within Weka’s evaluation framework, used as base classifiers in meta learners, combined with preprocessing filters, used in standard Knowledge Flow processes and used in repeated cross-validation experiments in the Experimenter

  7. Under the Hood
     • Weka MLlib classifiers accept standard Instances objects
     • Weka filters are applied automatically (where necessary)
       – MLlibNaiveBayes wrapper discretizes numeric fields for Bernoulli model
     • Instances are parallelized to RDD[Instance]
     • RDD[Instance] converted to RDD[LabeledPoint]
     • Local Spark cluster started on the fly
     • Scoring only requires LabeledPoint data structure
       – no Spark cluster/infrastructure required
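The conversion steps on this slide can be illustrated with a small plain-Python sketch. It does not use Weka or Spark: the `LabeledPoint` namedtuple is only a stand-in for MLlib's class of the same name, and the fixed-threshold `binarize` rule is a hypothetical placeholder, since the slides do not say how the wrapper discretizes numeric fields.

```python
from collections import namedtuple

# Stand-in for MLlib's LabeledPoint: a numeric label plus a feature vector.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

def binarize(values, threshold):
    """Map numeric values to 0/1 for a Bernoulli-style model (assumed rule)."""
    return [1.0 if v > threshold else 0.0 for v in values]

def to_labeled_points(instances, class_values, threshold=0.5):
    """Convert (features, class_name) pairs into LabeledPoint-like tuples."""
    # Class names become numeric indices, in the order given by class_values.
    index_of = {name: float(i) for i, name in enumerate(class_values)}
    return [LabeledPoint(index_of[cls], binarize(feats, threshold))
            for feats, cls in instances]

instances = [([0.9, 0.1], "yes"), ([0.2, 0.7], "no")]
points = to_labeled_points(instances, class_values=["no", "yes"])
```

Because the result is an ordinary in-memory structure, scoring a single converted point needs nothing beyond the structure itself, which matches the slide's final bullet.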

  8. MLlib Integration in Weka: Distributed Weka
     • Run in a cluster
     • Data sourced from Spark data frame-supported formats
     • Data frames converted to RDD[Instance], then to RDD[LabeledPoint]
     • Weka filters can be applied within each RDD partition
     • Implements hold-out and cross-validation for evaluation

  9. Cross-Validation
     • X-val folds that are consistent for both MLlib and Weka classifiers
     • Max parallelism for Weka Dagging/model averaging
       – Build all training fold classifiers in one pass over the data
       – Evaluate all classifiers in second pass
     [Diagram: Pass 1, Pass 2]
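A rough plain-Python sketch of the two-pass scheme described on this slide follows. It assumes a seeded shuffle for fold assignment and a toy majority-class learner that can be updated one row at a time; neither is taken from the Weka or MLlib code, and the function names are hypothetical.

```python
import random
from collections import Counter

def assign_folds(num_rows, num_folds, seed):
    """Deterministic fold id per row: the same seed yields the same folds for any learner."""
    fold_ids = [i % num_folds for i in range(num_rows)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids

def two_pass_cross_validation(rows, num_folds, seed=1):
    fold_ids = assign_folds(len(rows), num_folds, seed)
    # Pass 1: a single scan updates every training-fold model that excludes the row's fold.
    class_counts = [Counter() for _ in range(num_folds)]
    for (features, label), fold in zip(rows, fold_ids):
        for k in range(num_folds):
            if k != fold:
                class_counts[k][label] += 1
    # Pass 2: a second scan scores each row with the one model that never saw it.
    correct = 0
    for (features, label), fold in zip(rows, fold_ids):
        predicted = class_counts[fold].most_common(1)[0][0]
        correct += predicted == label
    return correct / len(rows), fold_ids

# Nine toy rows: rows 0, 3 and 6 are class "b", the rest are class "a".
rows = [([float(i)], "a" if i % 3 else "b") for i in range(9)]
accuracy, fold_ids = two_pass_cross_validation(rows, num_folds=3, seed=42)
```

Keeping the fold assignment in its own seeded function is what lets different learners be compared on identical splits.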

  10. Cross-Validation for MLlib
      • MLlib classifiers
        – new RDD for each training fold
        – Assemble partial folds for fold k
        – Each fold processed sequentially in turn
      [Diagram: the partitions of the original dataset RDD (Partition 1, Partition 2) each hold parts of Fold 1, Fold 2 and Fold 3; these are assembled into an RDD for test fold 1 and an RDD for training fold 1, from which MLlib builds model M1]
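The fold assembly described here can be sketched in plain Python, with lists standing in for RDD partitions. The `(fold_id, row)` layout and the `assemble_fold` helper are assumptions made for illustration, not the package's actual data structures.

```python
def assemble_fold(partitions, k):
    """Gather the partial fold k from every partition as the test set; the rest is training."""
    train, test = [], []
    for partition in partitions:
        for fold_id, row in partition:
            (test if fold_id == k else train).append(row)
    return train, test

# Each partition holds a slice of every fold, stored as (fold_id, row) pairs.
partitions = [
    [(0, "p1-a"), (1, "p1-b"), (2, "p1-c")],
    [(0, "p2-a"), (1, "p2-b"), (2, "p2-c")],
]

# Folds are handled one after another, each with its own training and test collection.
splits = [assemble_fold(partitions, k) for k in range(3)]
```

Unlike the two-pass Weka scheme, each fold here needs its own scan of the data, which reflects the slide's point that MLlib folds are processed sequentially.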

  11. Demonstration
      • MLlib classifiers running in the Weka Explorer and Knowledge Flow
      • Comparing MLlib schemes against Weka, R and Python equivalents in the Weka Experimenter
      • Deploying an MLlib model in PDI’s WekaScoring step

  12. Summary
      What we covered today:
      • Integration of MLlib algorithms continues Weka’s interoperability theme
      • Provides convenient no-coding access to MLlib algorithms for desktop and cluster-based execution
      • Simplifies the data scientist’s job when considering multiple tools
        – Weka vs R vs scikit-learn vs MLlib within one unified experimental framework

  13. Next Steps
      Want to learn more?
      • http://wiki.pentaho.com/display/DATAMINING/Pentaho+Data+Mining+Community+Documentation
      • http://markahall.blogspot.co.nz/2017/07/integrating-spark-mllib-into-weka.html
