
MLeap: Release Spark ML Pipelines (Hollin Wilkins & Mikhail Semeniuk)



  1. MLeap: Release Spark ML Pipelines
     Hollin Wilkins & Mikhail Semeniuk
     SATURDAY

  2. Introduction
     Two background timelines leading up to MLeap:
     - Web Dev @ Cornell; studied some General Biology
     - Rails consulting for TrueCar and other companies
     - Erlang for highly-concurrent game servers; C/C# game engines
     - Implement ML model for ClearBook in Ruby, Python and Java
     - Join TrueCar for Rails/mobile development
     - How do we build real-time APIs based on Spark models?
     - Begin work on machine learning platform based on Spark/Hadoop
     - Try several ideas, fail, learn a ton, begin MLeap project
     - SparkContext prohibitive
     - MLeap

     - Studied some Math and Economics @ University of Minnesota
     - Predictive modeling of at-risk patients @ UHG
     - R/SAS batch pipelines
     - SQL-based batch pipelines + Python for linear regressions in API layer
     - Joins TrueCar to do new car price modeling
     - Massive pre-computation in Hadoop/ElasticSearch
     - Continues to train models in SAS/R
     - Portions of models still translated, now to Java
     - Spark enables data processing and training of models in same environment

  3. Problem Statement: Deploying machine learning algorithms to a production environment is a lot more difficult than it has to be and is a common source of friction at data-driven organizations.
     - Action: Data scientists write data pipelines to construct research datasets
       Reaction: Engineers re-write the data pipelines for a production-ready system
     - Action: Engineers write scalable libraries for computing features and algorithms
       Reaction: Data scientists largely don’t use those libraries and maintain/re-write their own copy of the code
     - Action: Data scientists largely focus on linear/logistic regressions due to engineering constraints
       Reaction: Talented engineers get largely tired of coding up linear regressions and updating coefficients
     Everyone wants to do better! The winning technology will be the one that enables Engineers and Data Scientists to collaborate and work across a single platform.

  4. Existing Solutions: You won’t believe how many companies are still deploying algorithms in a SQL environment! And these are billion dollar operations.
     Solutions compared:
     - Hard-Coded Models (SQL, Java, Ruby)
     - PMML
     - Emerging Solutions (yHat, DataRobot)
     - Enterprise Solutions (Microsoft, IBM, SAS)
     - MLeap
     Comparison criteria:
     - Quick to Implement
     - Open Sourced
     - Committed to Spark/Hadoop
     - API Server Infrastructure
     Lesson Learned: Push code down to where the data is, not the other way around!

  5. MLeap Solution
     • Born out of the need to deploy models quickly to a real-time API server
     • Leverage the Hadoop/Spark ecosystem for training, get rid of the Spark dependency for execution
     • Easily reuse models through serialization and execution without Spark

  6. MLeap Components
     • core - provides linear algebra system, regression models, and feature builders
     • runtime - provides DataFrame-like “LeapFrame” and transformers for it
     • spark - provides easy conversion from Spark transformers to MLeap transformers
     • serialization - common serialization format for Spark and MLeap (Bundle.ML)
     Diagram: mleap-core, mleap-runtime, mleap-spark, mleap-serialization (Bundle.ML)
     New features: expanded serialization formats to include both JSON and protobuf for large models (e.g. random forests with thousands of features)

  7. mleap-core: MLeap Core Components
     • Regressions: LinearRegression, RandomForest, Gradient Boosted Reg. Trees
     • Classifiers: RandomForest, LogisticRegression
     • Linear Algebra: Dense/Sparse Vectors, BLAS from Spark
     • Feature Builders: VectorAssembler, StringIndexer, StandardScaler
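The dense/sparse vector building blocks on slide 7 can be illustrated with a small sketch. This is plain Python, not MLeap's Scala code; the class names `DenseVector` and `SparseVector` and the helpers `dot` and `predict_linear` are hypothetical names chosen for illustration. It shows why a sparse representation helps with wide one-hot features: the dot product only touches the non-zero entries.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class DenseVector:
    """Hypothetical dense vector: one float per position."""
    values: Sequence[float]


@dataclass(frozen=True)
class SparseVector:
    """Hypothetical sparse vector: only non-zero positions are stored."""
    size: int
    indices: Sequence[int]
    values: Sequence[float]

    def to_dense(self) -> DenseVector:
        out = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            out[i] = v
        return DenseVector(out)


def dot(weights: DenseVector, features: Union[DenseVector, SparseVector]) -> float:
    """Dot product that skips zero entries when the features are sparse."""
    if isinstance(features, SparseVector):
        return sum(weights.values[i] * v
                   for i, v in zip(features.indices, features.values))
    return sum(w * x for w, x in zip(weights.values, features.values))


def predict_linear(weights: DenseVector, intercept: float,
                   features: Union[DenseVector, SparseVector]) -> float:
    """Linear regression prediction: w . x + b."""
    return dot(weights, features) + intercept


# Illustrative values only, not taken from the talk.
weights = DenseVector([0.5, -1.0, 2.0, 0.0])
sparse = SparseVector(size=4, indices=[0, 2], values=[2.0, 3.0])
print(predict_linear(weights, 1.0, sparse))
print(predict_linear(weights, 1.0, sparse.to_dense()))
```

Both calls give the same prediction; the sparse path simply does less work when most entries are zero.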

  8. mleap-runtime: MLeap Runtime
     • Provides LeapFrame, which stores data for transformations by MLeap transformers
     • MLeap transformers use mleap-core building blocks to transform LeapFrame
     • MLeap transformers correspond one-to-one with Spark transformers
     • No dependencies on Spark
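The runtime idea on slide 8, a small in-memory frame that transformers map to a new frame with no cluster context, can be sketched as follows. This is a toy in plain Python, not the LeapFrame API; `Frame`, `Scale`, and `Pipeline` are hypothetical names, and the column names are made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    """Toy stand-in for a LeapFrame: column names plus immutable rows."""
    columns: Tuple[str, ...]
    rows: Tuple[Tuple[Any, ...], ...]

    def with_column(self, name: str, inputs: Sequence[str],
                    fn: Callable[..., Any]) -> "Frame":
        """Return a new frame with one extra column computed row by row."""
        idx = [self.columns.index(c) for c in inputs]
        new_rows = tuple(row + (fn(*[row[i] for i in idx]),) for row in self.rows)
        return Frame(self.columns + (name,), new_rows)

    def select(self, name: str) -> List[Any]:
        i = self.columns.index(name)
        return [row[i] for row in self.rows]


class Scale:
    """Hypothetical transformer: multiplies one column by a constant."""

    def __init__(self, input_col: str, output_col: str, factor: float) -> None:
        self.input_col, self.output_col, self.factor = input_col, output_col, factor

    def transform(self, frame: Frame) -> Frame:
        return frame.with_column(self.output_col, [self.input_col],
                                 lambda x: x * self.factor)


class Pipeline:
    """Applies transformers in order; each one maps a frame to a frame."""

    def __init__(self, stages: Sequence[Any]) -> None:
        self.stages = list(stages)

    def transform(self, frame: Frame) -> Frame:
        for stage in self.stages:
            frame = stage.transform(frame)
        return frame


frame = Frame(("sqft",), ((1000.0,), (2000.0,)))
result = Pipeline([Scale("sqft", "sqft_half", 0.5)]).transform(frame)
print(result.columns, result.select("sqft_half"))
```

Because every stage is a pure function over in-memory rows, a single request can be scored without starting any distributed job, which is the property the slide attributes to the runtime.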

  9. Feature Pipeline (diagram)
     • Continuous branch: Continuous Feature -> VectorAssembler -> Continuous Feature Vector -> StandardScaler -> Scaled Continuous Feature Vector
     • Categorical branch (repeated per feature): Categorical Feature -> StringIndexer -> Categorical Feature Index -> OneHotEncoder -> Categorical Feature One Hot Vector; the one-hot vectors go through VectorAssembler to form the Categorical Feature Vector
     • Both branches: VectorAssembler -> Final Feature Vector
     Regression Pipeline: Final Feature Vector -> LinearRegression -> Prediction
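The continuous branch and the regression step of the diagram on slide 9 can be written out as three small functions. This is a minimal sketch in plain Python under stated assumptions: the function names, the `model` dictionary layout, and all numbers are hypothetical, and the one-hot vectors are taken as already computed.

```python
from typing import Dict, List, Sequence, Union

Number = Union[int, float]


def assemble(parts: Sequence[Union[Number, Sequence[Number]]]) -> List[float]:
    """VectorAssembler-style step: flatten scalars and vectors into one vector."""
    out: List[float] = []
    for part in parts:
        if isinstance(part, (int, float)):
            out.append(float(part))
        else:
            out.extend(float(x) for x in part)
    return out


def standard_scale(vector: Sequence[float], means: Sequence[float],
                   stds: Sequence[float]) -> List[float]:
    """StandardScaler-style step: (x - mean) / std per position."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]


def linear_regression(features: Sequence[float], coefficients: Sequence[float],
                      intercept: float) -> float:
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


def predict(continuous: Sequence[Number], one_hot_parts: Sequence[Sequence[float]],
            model: Dict[str, object]) -> float:
    """Continuous -> assemble -> scale; then assemble with one-hot parts; then regress."""
    scaled = standard_scale(assemble(continuous), model["means"], model["stds"])
    final = assemble([scaled, *one_hot_parts])
    return linear_regression(final, model["coefficients"], model["intercept"])


# Illustrative parameters only; a real model would be fitted during training.
model = {
    "means": [1.0, 4.0],
    "stds": [2.0, 3.0],
    "coefficients": [10.0, 5.0, 1.0, 2.0],
    "intercept": 100.0,
}
prediction = predict([3.0, 10.0], [[0.0, 1.0]], model)
print(prediction)
```

At scoring time the whole pipeline reduces to a handful of arithmetic operations on a short vector, so no distributed execution is needed for a single row.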

  10. Categorical Pipeline (diagram)
      Categorical Feature -> StringIndexer -> Categorical Feature Index -> OneHotEncoder -> Categorical Feature One Hot Vector
      Each stage takes a LeapFrame and returns a LeapFrame: LeapFrame -> StringIndexer -> LeapFrame -> OneHotEncoder -> LeapFrame
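The two stages of the categorical pipeline on slide 10 can be sketched in a few lines. This is plain Python, not the Spark or MLeap implementation; the function names and the `room_type` column are hypothetical. The sketch assumes labels are indexed by descending frequency with ties broken alphabetically, and it keeps every category in the one-hot vector.

```python
from collections import Counter
from typing import Dict, Iterable, List


def fit_string_indexer(values: Iterable[str]) -> Dict[str, int]:
    """Assign index 0 to the most frequent label (assumed ordering for this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index: int, size: int) -> List[float]:
    """Vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def transform(rows: List[dict], column: str, labels: Dict[str, int]) -> List[dict]:
    """Add '<column>_index' and '<column>_onehot' to every row.

    Labels not seen during fitting raise KeyError in this sketch.
    """
    out = []
    for row in rows:
        idx = labels[row[column]]
        out.append({**row,
                    column + "_index": idx,
                    column + "_onehot": one_hot(idx, len(labels))})
    return out


# Made-up rows for illustration.
rows = [{"room_type": "entire"}, {"room_type": "private"},
        {"room_type": "entire"}, {"room_type": "shared"}]
labels = fit_string_indexer(r["room_type"] for r in rows)
encoded = transform(rows, "room_type", labels)
print(labels)
print(encoded[1])
```

The fitted label map is the only state the indexer needs at scoring time, which is what makes this kind of transformer easy to serialize.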

  11. MLeap Serialization (Bundle.ML)
      • Provides common serialization for both Spark and MLeap
      • 100% protobuf/JSON based for easy reading, compact data, and portability
      • No dependencies on Parquet *
      • Can be written to zip files, file system, HDFS, anywhere with an FS-like structure
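The idea on slide 11 of a JSON-based model written into a zip container can be illustrated with the Python standard library. This is only a sketch of the concept and does not reproduce the Bundle.ML layout: the entry name `model.json`, the dictionary keys, and the helper names `write_bundle` and `read_bundle` are assumptions made for this example.

```python
import io
import json
import zipfile
from typing import BinaryIO, Dict


def write_bundle(target: BinaryIO, model: Dict[str, object]) -> None:
    """Write the model as a human-readable JSON entry inside a zip archive."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("model.json", json.dumps(model, indent=2))


def read_bundle(source: BinaryIO) -> Dict[str, object]:
    """Load the model back from the zip archive."""
    with zipfile.ZipFile(source, "r") as archive:
        return json.loads(archive.read("model.json"))


# Illustrative model parameters, not taken from the talk.
model = {"op": "linear_regression",
         "coefficients": [10.0, 5.0],
         "intercept": 100.0}

buffer = io.BytesIO()          # any file-like target works, including a real file
write_bundle(buffer, model)
restored = read_bundle(buffer)
print(restored == model)
```

Because the payload is plain JSON, any runtime that can read a zip file can load the parameters, which is the portability argument the slide makes.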

  12. (Screenshots) String Indexer Model; Linear Regression Model; Linear Regression Model (Code)

  13. MLeap Spark
      • Train an ML pipeline with Spark then export it to MLeap
        Diagram: Spark Estimator -> Spark Model -> MLeap Model
      • Execute an MLeap pipeline against a Spark DataFrame
        Diagram: Spark DataFrame -> Spark LeapFrame -> MLeap Transformer -> Spark LeapFrame -> Spark DataFrame

  14. Benchmarks
      • Spark: 23.4 ms/transform
      • MLeap: 0.011 ms/transform
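To put the two figures on slide 14 in proportion, the short calculation below derives the speedup and the implied throughput. It only uses the two numbers from the slide and assumes transforms run one after another on a single thread, which the slide does not state.

```python
# Per-transform latencies from the benchmark slide, in milliseconds.
spark_ms = 23.4
mleap_ms = 0.011

# How many times faster one MLeap transform is than one Spark transform.
speedup = spark_ms / mleap_ms

# Implied sequential throughput (assumption: one transform at a time).
spark_per_sec = 1000.0 / spark_ms
mleap_per_sec = 1000.0 / mleap_ms

print(f"speedup: about {speedup:.0f}x")                # about 2127x
print(f"Spark: about {spark_per_sec:.1f} transforms/s")  # about 42.7
print(f"MLeap: about {mleap_per_sec:.0f} transforms/s")  # about 90909
```

A difference of roughly three orders of magnitude is what separates a batch-oriented transform from one that fits comfortably inside a real-time API call.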

  15. (Architecture diagram) Components shown:
      - Web Services, Mobile Apps
      - REST API, Java API
      - MLeap API + Server, hosting Algo 1, Algo 2, ..., Algo n
      - MLeap Transformers + Serialization
      - Spark Jobs, Map/Reduce, MLlib, Scikit + other
      - Spark + Hadoop + HDFS
      - Pipeline Code, Notebooks

  16. Demo Usage of MLeap
      • Train a sample listing price model using linear regression and random forest against some Airbnb training data
      • Deploy both models to a local API server
      • Get real-time results
      • IN UNDER 5 MINUTES!

  17. Future of MLeap
      • Unify linear algebra and core libraries with Spark
      • Python/R interface
      • Deploy easily to embedded systems and outside of JVM
      • Full support for all Spark transformers

  18. MLeap Development
      • Currently 5 people working on projects across 4 different companies
      • Talk to us if you are interested in deploying this technology at your company
      • MLeap Demo Project: https://github.com/TrueCar/mleap-demo

  19. Thank You!
      Hollin Wilkins
      - email: hollinrwilkins@gmail.com
      - github: https://github.com/hollinwilkins
      - twitter: https://twitter.com/HollinWilkins
      Mikhail Semeniuk
      - email: seme0021@gmail.com
      - github: https://github.com/seme0021
      - twitter: https://twitter.com/MikhailSemeniuk
