  1. Intel HPC Developer Convention, Salt Lake City 2016. Machine Learning Track: Data Analytics, Machine Learning and HPC in today’s changing application environment. Franz J. Király

  2. An overview of data analytics (practical). [Slide diagram, „The Scientific Method“: from DATA and scientific questions, via statistical exploration, quantitative modelling (descriptive/explanatory, predictive/inferential), and scientific and statistical validation, to knowledge; statistical programming methods such as R and Python support the steps.]

  3. Data analytics and data science in a broader context. [Slide diagram: raw data → clean data → data analytics (statistics, modelling, data mining, machine learning) → knowledge.] A lot of problems and subtleties arise at the raw-data and cleaning stages already; often, most of the manpower in a „data“ project needs to go there first, before one can attempt reliable data analytics. Relevant findings and the underlying arguments need to be explained well and properly.

  4. Big Data?

  5. What „Big Data“ may mean in practice. [Slide chart: number of features (100 → 1.000 → 10.000) against number of data samples (1.000 → 10.000.000.000), mapping strategies that stop working in reasonable time to solution strategies.]
     From roughly 100–1.000 features, manual exploratory data analysis stops working; solution strategies: feature extraction, feature selection.
     From roughly 1.000–10.000 samples, super-linear algorithms such as kernel methods and OLS stop working; solution strategies (around the same order): random forests, L1/LASSO.
     From roughly 10.000.000 samples, super-linear algorithms in general stop working; solution strategies: large-scale strategies for super-linear algorithms, such as on-line models and distributed computing.
     From roughly 10.000.000.000 samples, linear algorithms, including reading in all the data, stop working; solution strategy: sub-sampling.

  6. Large-scale motifs in data science = where high-performance computing is helpful/impactful.
     „Big models“ = the „classic“, beloved by everyone: not necessarily a lot of data, but computationally intensive models. Classical example: finite elements and other numerical models. New fancy example: large neural networks, aka „deep learning“. Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes.
     „Big data“ = what it says, a lot of data (ca. 1 million samples or more). The computational challenge arises from processing all of the data. Example: a histogram or linear regression with huge amounts of data. Common HPC motif: divide/conquer the training/fitting of the model, e.g. batchwise/epoch fitting.
     Model validation and model selection = this talk’s focus. Answers the question: which model is best for your data? Demanding even for simple models and small amounts of data! Example: is deep learning better than logistic regression, or than guessing?

  7. Meta-modelling: stylized case studies.
     Customer: a hospital specializing in the treatment of patients with a certain disease. Patients with this disease are at risk of experiencing an adverse event (e.g. death). Scientific question: depending on patient characteristics, predict the event risk. Data set: complete clinical records of 1.000 patients, including the event if it occurred.
     Customer: a retailer who wants to accurately model the behaviour of customers. Customers can buy (or not buy) any of a number of products, or churn. Scientific question: predict future customer behaviour given past behaviour. Data set: complete customer and purchase records of 100.000 customers.
     Customer: a manufacturer who wishes to find the best parameter settings for machines. Parameters influence the amount/quality of the product (or whether the machine breaks). Scientific question: find the parameter settings which optimize the above. Data set: outcomes for 10.000 parameter settings on those machines.
     Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the „real world“. Not of interest: which algorithm/strategy, out of many, exactly solves the task.

  8. Model validation and model selection = data-centric and data-dependent modelling, a scientific necessity implied by the scientific method and the following:
     1. There is no model that is good for all data (otherwise the concept of a model would be unnecessary).
     2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one (any such belief is not empirically justified, hence pseudoscientific).
     3. No model can be trusted unless its validity has been verified by a model-independent argument (otherwise the justification of validity is circular, hence faulty).
     Machine learning provides algorithms & theory for meta-modelling, and powerful algorithms motivated by meta-modelling optimality.

  9. Machine Learning and Meta-Modelling in a Nutshell

  10. Leitmotifs of Machine Learning, from the intersection of engineering, statistics and computer science.
     Engineering & statistics idea: statistical models are objects in their own right, „learning machines“ = modelling strategies.
     Engineering & computer science idea: any abstract algorithm can be a modelling strategy/learning machine, „computational learning“, with a possibly non-explicit model.
     Computer science & statistics idea: the future performance of an algorithm/learning machine can (and should) be estimated: „model validation“ and „model selection“.

  11. Problem types in Machine Learning. Supervised learning: some data is labelled by an expert/oracle. Task: predict the label from the covariates. Statistical models are usually discriminative. Examples: regression, classification.
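
A minimal sketch of the supervised setting in Python, using scikit-learn as an assumed library choice (the deck itself only names R and Python): a classifier is trained on expert-labelled data and then predicts labels for unseen points.

    # Supervised learning: predict a label from covariates (here: classification).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # labelled data
    clf = LogisticRegression()
    clf.fit(X[:150], y[:150])      # learn from the labelled training part
    print(clf.predict(X[150:]))    # predict labels for new, unseen covariates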

  12. Problem types in Machine Learning. Unsupervised learning: the training data is not pre-labelled. Task: find „structure“ or „patterns“ in the data. Statistical models are usually generative. Examples: clustering, dimension reduction.
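
The unsupervised counterpart, in the same hedged sketch style (scikit-learn assumed): no labels are given, and a clustering algorithm proposes structure on its own.

    # Unsupervised learning: find "structure" in unlabelled data (here: clustering).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # true labels discarded
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_[:10])   # cluster assignments found without any labels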

  13. Advanced learning tasks.
     Complications in the labelling:
     Semi-supervised learning: some training data are labelled, some are not.
     Reinforcement learning: data are not directly labelled, only by indirect gain/loss.
     Anomaly detection: all or most data are „positive examples“; the task is to flag „test negatives“.
     Complications through correlated data and/or time:
     On-line learning: the data is revealed over time; models need to update.
     Forecasting: each data point has a time stamp; predict the temporal future.
     Transfer learning: the data comes in dissimilar batches; train and test may be distinct.

  14. What is a Learning Machine? … an algorithm that solves, e.g., the previous tasks. [Slide diagram, illustrating a supervised learning machine: observations = „training data“ → model fitting („learning“, controlled by model tuning parameters) → fitted model; new data → fitted model → predictions, e.g. to base decisions on.] Examples: generalized linear model, linear regression, support vector machine, neural networks (= „deep learning“), random forests, gradient boosting, …
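
The learning-machine abstraction maps directly onto a uniform fit/predict interface; a sketch under the same scikit-learn assumption shows two quite different machines used interchangeably, with tuning parameters set at construction time.

    # Different learning machines share one interface: construct (with tuning
    # parameters), fit on training data, predict on new data.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 3))                  # "training data"
    y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
    X_new = rng.normal(size=(3, 3))                      # new data

    for machine in (RandomForestRegressor(n_estimators=100), SVR(C=1.0)):
        fitted = machine.fit(X_train, y_train)                # "learning"
        print(type(machine).__name__, fitted.predict(X_new))  # predictions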

  15. Example: Linear Regression. [Slide diagram: the same learning-machine picture instantiated with linear regression: observations = „training data“ → model fitting („learning“) → fitted model → predictions on new data.] Tuning parameter: fit an intercept or not?
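
The intercept question from the slide is exactly such a tuning parameter; a sketch (scikit-learn assumed):

    # "Fit intercept or not?" as a tuning parameter of the linear regression machine.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.uniform(0.0, 10.0, size=(50, 1))
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=50)  # true slope 3, intercept 2

    with_icpt = LinearRegression(fit_intercept=True).fit(X, y)
    no_icpt = LinearRegression(fit_intercept=False).fit(X, y)
    print(with_icpt.coef_, with_icpt.intercept_)  # close to [3.0] and 2.0
    print(no_icpt.coef_, no_icpt.intercept_)      # slope distorted to absorb the intercept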

  16. Model validation: does the model make sense? [Slide diagram: „training data“ → learning machine (e.g. regression, GLM, advanced methods) → learnt model; the learnt model makes „out-of-sample“ predictions for the „hold-out“ „test data“, which are compared with the „test labels“ = „the truth“ to quantify prediction error; this contrasts with „in-sample“ evaluation on the training data.]
     Predictive models need to be validated on unseen data! The only (general) way to test goodness of prediction is to actually observe prediction, which means the part of the data used for testing must not have been seen by the algorithm before. (Note: this includes the case where the machine = linear regression, deep learning, etc.)
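
A sketch of the hold-out principle (scikit-learn assumed): fit on one part of the data, quantify prediction error only on the part the machine has never seen.

    # Hold-out validation: out-of-sample error on data unseen during fitting.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_train, y_train)   # sees only the training part
    y_pred = model.predict(X_test)                     # "out-of-sample" predictions
    print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)  # compare with "the truth"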

  17. „Re-sampling“: multiple algorithms are compared on multiple data splits/sub-datasets; this is the state-of-the-art principle in model validation, model comparison and meta-modelling. [Slide diagram: all data is split repeatedly into training data 1/2/3 and test data 1/2/3; Predictors 1, 2, 3 are fitted on each training set and evaluated on the corresponding test set; the errors are aggregated for comparison.]
     Types of re-sampling, how to obtain the training/test splits, and pros/cons:
     k-fold cross-validation (often k=5): 1. divide the data into k (almost) equal parts; 2. obtain k train/test splits where each part is the test data exactly once and the rest of the data is the training set. A good compromise between runtime and accuracy when k is small compared to the data size.
     Leave-one-out = [number of data points]-fold cross-validation. Very accurate, high run-time.
     Repeated sub-sampling (parameters: training/test size, number of repetitions): 1. obtain a random sub-sample of training/test data of the specified sizes (train/test need not cover all of the data); 2. repeat 1. the desired number of times. Can be arbitrarily quick, but also arbitrarily inaccurate, depending on the parameter choice; can be combined with k-fold.
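
Both re-sampling schemes have standard implementations; a sketch (scikit-learn assumed) comparing 5-fold cross-validation with repeated sub-sampling on the same learning machine:

    # k-fold CV vs. repeated sub-sampling, as per the list above.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

    kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # each part is test exactly once
    subsample = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)  # random splits
    for name, cv in (("5-fold CV", kfold), ("sub-sampling", subsample)):
        rmse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                                scoring="neg_root_mean_squared_error")
        print(f"{name}: RMSE {rmse.mean():.3f} ± {rmse.std():.3f}")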

  18. Quantitative model comparison: a „benchmarking experiment“ results in a table like this (the models in the slide are shown as pictograms; the „?“ row is the uninformed guess):

                  RMSE          MAE
     model 1      15.3 ± 1.4    12.3 ± 1.7
     model 2       9.5 ± 0.7     7.3 ± 0.9
     model 3      13.6 ± 0.9    11.4 ± 0.8
     ? (guess)    20.1 ± 1.2    18.1 ± 1.1

     Confidence regions (or paired tests) are used to compare models to each other: A is better than B / B is better than A / A and B are equally good. An uninformed model (stupid model/random guess) needs to be included, otherwise the statement „is better than an uninformed guess“ cannot be made. „Useful model“ = (significantly) better than the uninformed baseline.
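
A small benchmarking experiment in the same hedged sketch style (scikit-learn assumed): several learning machines, including an uninformed baseline, evaluated by cross-validated RMSE mean ± standard deviation; the DummyRegressor plays the „stupid model“ row of the table.

    # Benchmarking experiment: models vs. an uninformed baseline, RMSE mean ± std.
    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

    models = {
        "linear regression": LinearRegression(),
        "random forest": RandomForestRegressor(random_state=0),
        "uninformed (mean)": DummyRegressor(strategy="mean"),  # the baseline row
    }
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for name, model in models.items():
        rmse = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
        print(f"{name}: {rmse.mean():.2f} ± {rmse.std():.2f}")

A model then counts as „useful“ in the slide's sense only if its error is significantly below the baseline row.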
