Complexity vs. Performance: Empirical Analysis of Machine Learning as a Service
Yuanshun Yao, Zhujun Xiao, Bolun Wang*, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao
The University of Chicago, *University of California, Santa Barbara
ysyao@cs.uchicago.edu
ML in Network Research
• Congestion control protocols: Sivaraman et al., SIGCOMM'14; Winstein & Balakrishnan, SIGCOMM'13
• Network link prediction: Liu et al., IMC'16
• User behavior analysis: Wang et al., IMC'14; Zhao et al., IMC'12; Zannettou et al., IMC'17
• …
Running ML is Hard
• Solution: Machine Learning as a Service (ML-as-a-Service)
[Diagram: dataset → trained model]
ML-as-a-Service
[Diagram: the user supplies training data and input (model choice, parameters, etc.) to the ML-as-a-Service platform]
Why Study ML-as-a-Service?
• Is my model good enough?
• Q: How well do these platforms perform?
• Q: How much does the amount of user control impact ML performance?
ML-as-a-Service Platforms
• ABM, Google Prediction, Amazon ML, PredictionIO (PIO), BigML, Microsoft (Azure ML)
• Arranged along an axis from less to more amount of user input
Control in ML
• Pipeline: training data → trained model, with four control dimensions (sketched in code below)
• Data Cleaning: handle invalid, duplicate, and missing data
• Feature Selection: mutual information, Pearson correlation, chi-square, …
• Classifier Choice: Logistic Regression, Decision Tree, kNN, …
• Parameter Tuning: e.g., for Logistic Regression: L1/L2 regularization, max_iter, …
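A minimal scikit-learn sketch of these four control dimensions. The CSV file, its "label" column, and the k=10 feature budget are hypothetical placeholders, not details from the talk.

```python
# Sketch of the four control dimensions named on this slide.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("dataset.csv")  # hypothetical tabular binary-classification data

# 1. Data cleaning: drop duplicate and missing rows.
df = df.drop_duplicates().dropna()
X, y = df.drop(columns=["label"]), df["label"]

# 2. Feature selection: keep the k features with the highest mutual information.
X = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# 3. Classifier choice: pick one of several classifier families.
classifiers = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(),
    "knn": KNeighborsClassifier(),
}

# 4. Parameter tuning: e.g., regularization type/strength and iteration budget
#    for logistic regression.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=500)
model.fit(X, y)
```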
Control in ML-as-a-Service

                      ABM   Google   Amazon   PIO   BigML   Microsoft
  Data Cleaning        ✖      ✖        ✖       ✖      ✖        ✖
  Feature Selection    ✖      ✖        ✖       ✖      ✖        ✔
  Classifier Choice    ✖      ✖        ✖       ✔      ✔        ✔
  Parameter Tuning     ✖      ✖        ✔       ✔      ✔        ✔
                      low  ←  user control / complexity  →  high

Complexity vs. Performance?
Performance Measurement
Characterizing Performance
• Theoretical modeling is hard
  • The output of an ML model depends on the dataset
  • No access to the platforms' implementation details
• Instead: empirical, data-driven analysis
  • Simulate a real-world scenario from end to end
  • Needs a large number of diverse datasets
  • Focus on binary classification
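A minimal end-to-end sketch of one such measurement, assuming scikit-learn and the F-score metric used later in the talk; the dataset file name and its "label" column are placeholders.

```python
# End-to-end empirical evaluation of one binary-classification dataset:
# split, train, predict, score.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("some_uci_dataset.csv")          # placeholder dataset
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F-score:", f1_score(y_test, model.predict(X_test)))
```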
Dataset
• 119 datasets from diverse application domains
• Sample size: 15 - 245K; number of features: 1 - 4K
• 79% are from the UCI ML Repository
• Domain breakdown: Life Science 37%, Computer Applications 15%, Artificial Test 14%, Other 11%, Social Science 9%, Physical Science 8%, Financial & Business 6%
Methodology
• Tune all available control dimensions on each platform (see the tuning-loop sketch below)
  • Classifier Choice (✔): Logistic Regression, kNN, SVM, …
  • Parameter Tuning (✔): L1_reg, L2_reg, Max_iter, …
  • Feature Selection: ✖ for the example platform shown
[Diagram: training data → platform API → trained model; testing data → API → predictions]
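A sketch of that tuning loop. scikit-learn serves here as a local stand-in for the platforms' training APIs, and the classifier and parameter grids are illustrative assumptions rather than the study's exact search space.

```python
# Exhaustively try classifier choices and parameter settings, the way the
# measurement tunes every control dimension a platform exposes.
from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative search space (assumed, not the paper's exact grid).
search_space = {
    LogisticRegression: {"penalty": ["l1", "l2"], "C": [0.01, 1, 100],
                         "solver": ["liblinear"], "max_iter": [1000]},
    KNeighborsClassifier: {"n_neighbors": [1, 5, 15]},
    SVC: {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
}

def all_configs(space):
    """Yield (classifier_class, kwargs) for every parameter combination."""
    for cls, grid in space.items():
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            yield cls, dict(zip(keys, values))

def tune(X_train, y_train, X_test, y_test):
    """Train every configuration and record its F-score on the test set."""
    scores = {}
    for cls, kwargs in all_configs(search_space):
        model = cls(**kwargs).fit(X_train, y_train)
        config = (cls.__name__, tuple(sorted(kwargs.items())))
        scores[config] = f1_score(y_test, model.predict(X_test))
    return scores
```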
Trade-offs between Complexity and Performance
Complexity vs. Performance
• Q: How does the complexity correlate with performance?
• A: high complexity -> high performance
[Chart: optimized average F-score (0.5 - 1.0) per platform, ordered from low to high complexity: ABM, Google, Amazon, BigML, PIO, Microsoft, Scikit]
Complexity vs. Risk
• Q: How does the risk correlate with complexity?
• A: high complexity -> high risk
[Chart: performance variance in F-score (0 - 0.5) per platform, ordered from low to high complexity: ABM, Google, Amazon, BigML, PIO, Microsoft, Scikit]
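Both plotted quantities can be read off the per-configuration F-scores returned by the tuning loop above; a small sketch, with function and field names of my own choosing.

```python
# Summarize one platform's scores on one dataset: the "optimized" performance
# is the best score across configurations; the spread across configurations
# is a proxy for risk.
from statistics import mean, pvariance

def summarize(scores):
    """scores: dict mapping configuration -> F-score."""
    values = list(scores.values())
    return {
        "optimized_f1": max(values),      # complexity pays off when tuned well
        "average_f1": mean(values),
        "variance_f1": pvariance(values), # wide spread = risk of a bad config
    }
```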
Understanding Server-side Optimization
Reverse-engineering Optimization
• Q: Does the server side adapt to different datasets?
• Reverse-engineer it using synthetic datasets
  • Create synthetic datasets
  • Use prediction results to infer classifier information
[Plots: two 2-D probe datasets, "Linear" and "Circular", each showing Class 0 and Class 1 over Feature #1 vs. Feature #2]
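A sketch of how such probe datasets can be generated with scikit-learn's dataset generators; the sample counts, noise levels, and file names are illustrative assumptions, not the study's exact settings.

```python
# Two synthetic probe datasets: one linearly separable, one circular, so that
# a linear classifier fits the first but not the second.
import numpy as np
import pandas as pd
from sklearn.datasets import make_circles, make_classification

# Linearly separable 2-D dataset.
X_lin, y_lin = make_classification(n_samples=1000, n_features=2,
                                   n_informative=2, n_redundant=0,
                                   class_sep=2.0, random_state=0)

# Circular dataset: one class inside, the other on the outer ring.
X_cir, y_cir = make_circles(n_samples=1000, noise=0.05, factor=0.4,
                            random_state=0)

# Save as CSV for upload to a platform's training API.
for name, (X, y) in {"linear": (X_lin, y_lin),
                     "circular": (X_cir, y_cir)}.items():
    pd.DataFrame(np.column_stack([X, y]),
                 columns=["feature_1", "feature_2", "label"]
                 ).to_csv(f"{name}.csv", index=False)
```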
Understanding Optimization
[Plots: Google's decision boundaries on the linear and circular probe datasets (Feature #1 vs. Feature #2, Class 0 vs. Class 1)]
• Google switches between classifiers based on the dataset
• Use supervised learning to infer the classifier family used
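One simple way to approximate this inference step (my own sketch, not necessarily the authors' exact procedure): query the service on a dense grid of points and measure how well its predicted boundary agrees with locally trained classifiers from known families. `query_service` is a hypothetical wrapper around a platform's prediction API.

```python
# Guess the classifier family behind a black-box service by comparing its
# predictions on a dense 2-D grid with those of locally trained candidates.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def infer_family(X_train, y_train, query_service):
    # Dense grid covering the probe dataset's feature space.
    xs = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 100)
    ys = np.linspace(X_train[:, 1].min(), X_train[:, 1].max(), 100)
    grid = np.array([[x, y] for x in xs for y in ys])

    # Labels the remote service assigns to the grid (assumed returned as an array).
    service_labels = query_service(grid)

    candidates = {
        "linear": LogisticRegression(max_iter=1000),
        "tree": DecisionTreeClassifier(),
        "knn": KNeighborsClassifier(),
    }
    agreement = {}
    for family, clf in candidates.items():
        local_labels = clf.fit(X_train, y_train).predict(grid)
        agreement[family] = np.mean(local_labels == service_labels)
    # Best guess: the family whose boundary agrees most with the service's.
    return max(agreement, key=agreement.get)
```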
Takeaways
• ML-as-a-Service is an attractive tool to reduce workload
• But user control still has a large impact on performance
• Fully automated systems are less risky
Thank you! Questions?