On the representation and reuse of machine learning models
Villu Ruusmann
Openscoring OÜ
https://github.com/jpmml
Def: "Model" Output = func(Input) 3
Def: "Representation" Generic Specific Data Application structure code 4
The problem "Train once, deploy anywhere" 5
A solution
Matching the model representation (MR) with the task at hand:
1. Storing a generic and stable MR
2. Generating a wide variety of more specific and volatile MRs upon request (see the sketch below)
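As a rough illustration of step 2 (not from the original deck), the sketch below compiles the stored, generic PMML markup of a simple linear regression model into a more specific and volatile representation, namely a plain Python closure. The file name "linreg.pmml" and the fixed PMML 4.3 namespace are assumptions.

# A toy "specific MR" generator: compiles a PMML RegressionTable into a Python closure
import xml.etree.ElementTree as ET

NS = "{http://www.dmg.org/PMML-4_3}"  # PMML 4.3 namespace

def compile_regression(pmml_path):
    root = ET.parse(pmml_path).getroot()
    table = root.find(".//" + NS + "RegressionTable")
    intercept = float(table.get("intercept"))
    terms = [(p.get("name"), float(p.get("coefficient")))
             for p in table.findall(NS + "NumericPredictor")]
    # The returned closure is the specific, volatile MR: cheap to regenerate on request
    def predict(row):
        return intercept + sum(coef * row[name] for name, coef in terms)
    return predict

predict_price = compile_regression("linreg.pmml")  # hypothetical model file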
The Predictive Model Markup Language (PMML)
● XML dialect for marking up models and associated data transformations
● Version 1.0 in 1999, version 4.3 in 2016
● "Conventions over configuration"
● 17 top-level model types + ensembling
http://dmg.org/
http://dmg.org/pmml/pmml-v4-3.html
http://dmg.org/pmml/products.html
A continuum from black to white boxes
Introducing transparency in the form of rich, easy-to-use, well-documented APIs:
1. Unmarshalling and marshalling
2. Static analyses. Ex: schema querying
3. Dynamic analyses. Ex: scoring
4. Tracing and explaining individual predictions
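Even without a full PMML toolkit, points 1 and 2 can be approximated in a few lines of Python. The sketch below (my own, reusing the "audit_tree.pmml" file produced later in the deck and assuming the PMML 4.3 namespace) unmarshals the document with the standard library and queries its data schema.

# Minimal unmarshalling plus schema querying of a PMML document
import xml.etree.ElementTree as ET

NS = "{http://www.dmg.org/PMML-4_3}"

root = ET.parse("audit_tree.pmml").getroot()
for data_field in root.findall("./" + NS + "DataDictionary/" + NS + "DataField"):
    print(data_field.get("name"), data_field.get("dataType"), data_field.get("optype"))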
The Zen of Machine Learning
"Making the model requires large data and many cpus. Using it does not" --darren
https://www.mail-archive.com/user@spark.apache.org/msg40636.html
Model training workflow
(diagram) Real-world feature space → ML-platform feature space → ML-platform model
Model deployment workflow
(diagram) Real-world feature space → ML-platform feature space → ML-platform model, versus Real-world feature space → Real-world model
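A small Scikit-Learn sketch (mine, not from the slides, with made-up records) of why the two workflows differ: the trained model lives in the ML-platform feature space, so at deployment time a real-world record must be pushed through exactly the same feature transformation before it can be scored.

# Training: real-world records are mapped into an encoded ML-platform feature space
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train_records = [
    {"Age": 38, "Gender": "Male"},
    {"Age": 52, "Gender": "Female"},
]
labels = [0, 1]

vectorizer = DictVectorizer(sparse = False)
X = vectorizer.fit_transform(train_records)   # ML-platform feature space
model = LogisticRegression().fit(X, labels)   # ML-platform model

# Deployment: a real-world record only becomes scorable after the same mapping
new_record = {"Age": 45, "Gender": "Female"}
prediction = model.predict(vectorizer.transform([new_record]))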
Model resources
(diagram) Training (R code, Scikit-Learn code, Apache Spark ML code) → Versioned storage (original PMML markup → optimized PMML markup) → Deployment (Java code, Python code)
Comparison of model persistence options
● Model data structure stability: R: Fair to excellent; Scikit-Learn: Fair; Apache Spark ML: Poor
● Native serialization data format: R: RDS (binary); Scikit-Learn: Pickle (binary); Apache Spark ML: SER (binary) and JSON (text)
● Export to PMML: R: Few external packages; Scikit-Learn: N/A; Apache Spark ML: Built-in (trait PMMLWritable)
● Import from PMML: R: Few external packages; Scikit-Learn and Apache Spark ML: N/A
● JPMML projects: R: JPMML-R and r2pmml; Scikit-Learn: JPMML-SkLearn and sklearn2pmml; Apache Spark ML: JPMML-SparkML and JPMML-SparkML-Package
PMML production: R

library("r2pmml")

auto <- read.csv("Auto.csv")
auto$origin <- as.factor(auto$origin)

auto.formula <- formula(mpg ~ (.) ^ 2 +          # simple features and their two way interactions
  I(displacement / cylinders) + I(log(weight)))  # derived features

auto.lm <- lm(auto.formula, data = auto)
r2pmml(auto.lm, "auto_lm.pmml", dataset = auto)

auto.glm <- glm(auto.formula, data = auto, family = "gaussian")
r2pmml(auto.glm, "auto_glm.pmml", dataset = auto)
R quirks
● No pipeline concept. Some workflow standardization efforts by third parties. Ex: the caret package
● Many (equally right) ways of doing the same thing. Ex: "formula interface" vs. "matrix interface"
● High variance in the design and quality of packages. Ex: academia vs. industry
● Model objects may enclose the training data set
PMML production: Scikit-Learn

import pandas

from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

audit_df = pandas.read_csv("Audit.csv")

audit_mapper = DataFrameMapper([
    (["Age", "Income", "Hours"], ContinuousDomain()),
    (["Employment", "Education", "Marital", "Occupation"], [CategoricalDomain(), LabelBinarizer()]),
    (["Gender", "Deductions"], [CategoricalDomain(), LabelEncoder()]),
    ("Adjusted", None)])
audit = audit_mapper.fit_transform(audit_df)

audit_classifier = DecisionTreeClassifier(min_samples_split = 10)
audit_classifier.fit(audit[:, 0:48], audit[:, 48].astype(int))

sklearn2pmml(audit_classifier, audit_mapper, "audit_tree.pmml")
Scikit-Learn quirks
● Completely schema-less at the algorithm level. Ex: no identification of columns, no tracking of column groups
● Very limited, simple data structures. Mix of Python and C
● No built-in persistence mechanism. Serialization in the generic pickle data format; upon de-serialization, hope that class definitions haven't changed in the meantime
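A sketch of that last point, reusing the fitted audit_classifier from the previous slide (the ".pkl" file name is mine): the de facto persistence route is a plain pickle of the live estimator object, which silently couples the artifact to the class definitions and library versions that produced it.

# Native Scikit-Learn persistence: a generic pickle of the live Python object graph
import pickle

with open("audit_tree.pkl", "wb") as f:
    pickle.dump(audit_classifier, f)   # fitted DecisionTreeClassifier from the previous slide

# Later, possibly on another machine with a different scikit-learn version
with open("audit_tree.pkl", "rb") as f:
    classifier = pickle.load(f)        # only safe if the class definitions are unchanged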
PMML production: Apache Spark ML

// $ spark-shell --packages org.jpmml:jpmml-sparkml-package:1.0-SNAPSHOT ..

import java.nio.file.{Files, Paths}

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.jpmml.sparkml.ConverterUtil

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Wine.csv")

val formula = new RFormula().setFormula("quality ~ .")
val regressor = new DecisionTreeRegressor()
val pipeline = new Pipeline().setStages(Array(formula, regressor))
val pipelineModel = pipeline.fit(df)

val pmmlBytes = ConverterUtil.toPMMLByteArray(df.schema, pipelineModel)
Files.write(Paths.get("wine_tree.pmml"), pmmlBytes)
Apache Spark ML quirks
● Split schema. Static definition via Dataset#schema(), dynamic definition via Dataset column metadata
● Models make predictions in the transformed output space
● High internal complexity and overhead. Ex: temporary Dataset columns for feature transformation
● Built-in PMML export capabilities leak the JPMML-Model library onto the application classpath
PMML consumption: Apache Spark ML

// $ spark-submit --packages org.jpmml:jpmml-spark:1.0-SNAPSHOT ..

import java.io.File;
import java.util.Arrays;

import org.apache.spark.ml.Transformer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.spark.EvaluatorUtil;
import org.jpmml.spark.TransformerBuilder;

Evaluator evaluator = EvaluatorUtil.createEvaluator(new File("audit_tree.pmml"));

TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withLabelCol("Adjusted")                                            // String column
    .withProbabilityCol("Adjusted_probability", Arrays.asList("0", "1")) // Vector column
    .exploded(true);
Transformer pmmlTransformer = pmmlTransformerBuilder.build();

Dataset<Row> input = ...;
Dataset<Row> output = pmmlTransformer.transform(input);
Comparison of feature spaces
● Feature identification: R: Named; Scikit-Learn: Positional; Apache Spark ML: Pseudo-named
● Feature data type: R: Any; Scikit-Learn: Float, Double; Apache Spark ML: Double
● Feature operational type: R: Continuous, Categorical, Ordinal; Scikit-Learn: Continuous; Apache Spark ML: Continuous, pseudo-categorical
● Dataset abstraction: R: List<Map<String, ?>>; Scikit-Learn: float[][] or double[][]; Apache Spark ML: List<double[]>
● Effect of transformations on dataset size: R: Low; Scikit-Learn and Apache Spark ML: Medium (sparse) to high (dense)
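To make the first and fourth rows concrete, here is a small sketch (mine, with made-up values) of the two dataset abstractions: a named record versus a positional row in a homogeneous double matrix whose column layout must be tracked separately.

import numpy

# Named identification: each feature is addressed by name and may have any data type
record = {"Age": 38, "Gender": "Male", "Income": 81838.0}

# Positional identification: features are columns of a homogeneous double matrix,
# so the name-to-column mapping has to be maintained outside the matrix itself
columns = ["Age", "Gender=Male", "Gender=Female", "Income"]
row = numpy.array([[38.0, 1.0, 0.0, 81838.0]], dtype = numpy.float64)

assert row[0, columns.index("Age")] == record["Age"]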
Feature declaration

<DataField name="Age" dataType="float" optype="continuous">
  <Interval closure="closedClosed" leftMargin="17.0" rightMargin="83.0"/>
</DataField>
<DataField name="Gender" dataType="string" optype="categorical">
  <Value value="Male"/>
  <Value value="Female"/>
  <Value value="N/A" property="missing"/>
</DataField>
http://dmg.org/pmml/v4-3/DataDictionary.html

<MiningField name="Age" outliers="asExtremeValues" lowValue="18.0" highValue="75.0"/>
<MiningField name="Gender"/>
http://dmg.org/pmml/v4-3/MiningSchema.html
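The sketch below (helper names are mine, not a PMML library API) spells out what these declarations imply at scoring time: Age is valid only within the declared interval, while for Gender the value "N/A" maps to missing and anything outside the listed categories is invalid.

def check_age(value):
    if value is None:
        return "missing"
    return "valid" if 17.0 <= float(value) <= 83.0 else "invalid"

def check_gender(value):
    if value is None or value == "N/A":
        return "missing"
    return "valid" if value in ("Male", "Female") else "invalid"

print(check_age(27.5))       # valid
print(check_age(150))        # invalid
print(check_gender("N/A"))   # missing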
Feature statistics

<UnivariateStats field="Age">
  <NumericInfo mean="38.30279" standardDeviation="13.01375" median="37.0"/>
  <ContStats>
    <Interval closure="openClosed" leftMargin="17.0" rightMargin="23.6"/>
    <!-- Intervals 2 through 9 omitted for clarity -->
    <Interval closure="openClosed" leftMargin="76.4" rightMargin="83.0"/>
    <Array type="int">261 360 297 340 280 156 135 51 13 6</Array>
  </ContStats>
</UnivariateStats>
<UnivariateStats field="Gender">
  <Counts totalFreq="1899" missingFreq="0" invalidFreq="0"/>
  <DiscrStats>
    <Array type="string">Male Female</Array>
    <Array type="int">1307 592</Array>
  </DiscrStats>
</UnivariateStats>
http://dmg.org/pmml/v4-3/Statistics.html
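Assuming the same Audit.csv used in the Scikit-Learn example, the statistics above could be reproduced roughly as follows (a sketch of the calculation, not of how any particular converter computes it):

import numpy
import pandas

audit_df = pandas.read_csv("Audit.csv")

age = audit_df["Age"]
print(age.mean(), age.std(), age.median())                     # NumericInfo
print(numpy.histogram(age, bins = 10, range = (17.0, 83.0)))   # ContStats: 10 equal-width interval counts

gender = audit_df["Gender"]
print(len(audit_df), gender.isna().sum())                      # Counts: totalFreq, missingFreq
print(gender.value_counts())                                   # DiscrStats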
Comparison of tree models
● Algorithms: R: No built-in, many external; Scikit-Learn: Few built-in; Apache Spark ML: Single built-in
● Split type(s): R: Binary or multi-way, simple and derived features; Scikit-Learn and Apache Spark ML: Binary, simple features
● Continuous features: R: Rel. op. (<, <=); Scikit-Learn and Apache Spark ML: Rel. op. (<=)
● Categorical features: R: Set op. (%in%); Scikit-Learn: Pseudo-rel. op. (==); Apache Spark ML: Pseudo-set op.
● Reuse: R: Hard; Scikit-Learn and Apache Spark ML: Easy to medium
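A sketch (mine, with illustrative category names) of the categorical split encodings in the fourth row: an R-style set-operator split tests the raw category for set membership, whereas a Scikit-Learn-style split tests a pre-binarized indicator with a pseudo-relational operator.

employment = "Private"

# R-style set operator split on the raw category value
goes_left = employment in {"Private", "Consultant"}

# Scikit-Learn-style pseudo-relational split on a label/one-hot encoded value,
# e.g. a tree node testing Employment=Private == 1
employment_private = 1.0 if employment == "Private" else 0.0
goes_left_encoded = (employment_private == 1.0)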