Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Setting the Scene The Data Science Process Supervised and Unsupervised Learning Introduction
Overview
- Setting the scene
- Data science
- The analytics process model
- Data scientists
- Example applications
2
Setting the Scene 3
Living in a data flooded world https://deepmind.com/blog/alphago-zero-learning-scratch/ … 2015 4
Living in a data flooded world https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/ … 2017 5
Living in a data flooded world https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ … 2019 6
Living in a data flooded world https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery … 2020 7
Living in a data flooded world https://www.vox.com/recode/2020/1/28/21110902/artificial-intelligence-ai-coronavirus-wuhan https://www.vox.com/2020/1/31/21117102/artificial-intelligence-drug-discovery-exscientia https://qz.com/1791222/how-artificial-intelligence-provided-early-warning-of-wuhan-virus/ 8
Living in a data flooded world https://www.buzzfeednews.com/article/ryanmac/clearview-ai-cops-run-wild-facial-recognition-lawsuits https://www.quantamagazine.org/artificial-intelligence-will-do-what-we-ask-thats-a-problem-20200130/ 9
Living in a data flooded world https://www.latimes.com/business/story/2020-01-21/ralphs-privacy-disclosure 10
Living in a data flooded world
The Economics of AI Today
“Every day we hear claims that Artificial Intelligence (AI) systems are about to transform the economy, creating mass unemployment and vast monopolies. But what do professional economists think about this? Contrary to the idea of the impending job apocalypse, this model identifies some channels through which AI systems could increase demand for labor. At the same time, and contrary to a standard assumption in economics that new technologies always increase labor demand through augmentation, the task-based model recognizes that the net effect of new technology on labor demand could be negative. This could, for example, happen if firms adopt “mediocre” AI systems that are productive enough to displace workers, but not productive enough to increase labor demand through the other channels.”
– https://thegradient.pub/the-economics-of-ai-today/
11
Living in a data flooded world https://www.nature.com/articles/nature21056.epdf 12
Living in a data flooded world, continued https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph 13
Living in a data flooded world, continued http://theconversation.com/facial-analysis-ai-is-being-used-in-job-interviews-it-will-probably-reinforce-inequality-124790 14
Living in a data flooded world, continued https://github.com/ipsingh06/ml-desnapify http://media.idlab.ugent.be/2019/12/05/safe-sexting-in-a-world-of-ai/ 15
Living in a data flooded (real) world 16
Data Science 17
Data science
- Data contains value and knowledge
- But to extract this knowledge, you need to be able to:
  - Store it
  - Manage it
  - Analyze it
- Terms often used interchangeably: Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science ≈ Knowledge Discovery ≈ Artificial Intelligence ≈ Deep Learning
- Don’t worry too much about this and don’t be too swayed by Venn diagrams or infographics
18
Data science https://vas3k.com/blog/machine_learning/ What even is this? 19
Data science 20
We focus on analytics from a business perspective
- Given lots of (possibly huge) data, discover patterns and models from data:
  - Instead of hand-coding, let the data speak
  - To help predict something, explain something, decide something (and more?)
- Using:
  1. Data
  2. An algorithm
  3. A purpose
- Patterns and models which are:
  - Valid
  - Useful
  - Unexpected
  - Understandable
21
Using data 22
Using data
- Structured, unstructured?
  - Tabular, relational, text, imagery, audio
- Non-tabular data
  - Making it tabular (“featurization”)
  - Using techniques that can directly utilize data as-is (and even then, some raw structure will be imposed)
23
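To make “featurization” concrete, here is a minimal sketch that turns raw text into a tabular data set using word counts (a bag-of-words representation). The function name `bag_of_words` and the two example documents are assumptions made for illustration; real pipelines would use more careful tokenization.

```python
from collections import Counter


def bag_of_words(documents):
    """Turn raw text documents into a table of word counts (one row per document)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The sorted vocabulary becomes the column (feature) names of the table.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical example documents (not from the slides).
documents = ["good service good price", "bad service"]
vocabulary, rows = bag_of_words(documents)
```

Each document is now one row and each word one numeric column, so any technique that expects tabular input can be applied. Note that word order is lost: this is one example of the structure a featurization step imposes.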
Basic terminology
A tabular data set (“structured data”):
- Has instances (examples, rows, observations, customers, …)
- And features (attributes, fields, variables, predictors, covariates, explanatory variables, regressors, independent variables)
- These features can be:
  - Numeric (continuous)
  - Categorical (discrete, factor), either nominal (binary as a special case) or ordinal
- A target (label, class, dependent variable, response variable) can also be present
  - Numeric, categorical, …
27
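The terminology above can be sketched on a tiny, made-up customer table. All names and values here (`customers`, `plan`, `satisfaction`, `churned`, `SATISFACTION_ORDER`) are hypothetical. The sketch shows one common way to encode each feature type: numeric features are kept as-is, ordinal features are mapped to ranks, and nominal features are one-hot encoded.

```python
# Each dict is one instance (row); "churned" is the target, the rest are features.
customers = [
    {"age": 34, "plan": "basic", "satisfaction": "low", "churned": 1},
    {"age": 51, "plan": "premium", "satisfaction": "high", "churned": 0},
    {"age": 27, "plan": "basic", "satisfaction": "medium", "churned": 0},
]

# Ordinal feature: the categories have a meaningful order, so map them to ranks.
SATISFACTION_ORDER = {"low": 0, "medium": 1, "high": 2}


def encode(instances):
    """Return column names, a numeric feature matrix X and a target vector y."""
    # Nominal feature: no order between categories, so use one 0/1 column per category.
    plans = sorted({row["plan"] for row in instances})
    X, y = [], []
    for row in instances:
        one_hot = [1 if row["plan"] == plan else 0 for plan in plans]
        X.append([row["age"], SATISFACTION_ORDER[row["satisfaction"]]] + one_hot)
        y.append(row["churned"])
    columns = ["age", "satisfaction"] + [f"plan={plan}" for plan in plans]
    return columns, X, y


columns, X, y = encode(customers)
```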
Using algorithms https://vas3k.com/blog/machine_learning/ 28
Using algorithms
- Unsupervised machine learning
  - No target variable necessary
  - Find structure, patterns in data
  - E.g. clustering, association rule mining, sequence rule mining
- Supervised machine learning
  - Target variable available
  - Relate predictor variables to target
  - Churn prediction, fraud detection, response modeling, credit risk modeling
- (There are more types than these two)
30
Supervised learning
- Regression: continuous label
- Classification: categorical label
- For classification:
  - Binary classification (positive/negative outcome)
  - Multiclass classification (more than two possible outcomes)
  - Ordinal classification (target is ordinal)
  - Multilabel classification (multiple outcomes are possible)
- For regression:
  - Absolute values
  - Delta values
  - Quantile regression
- Single- versus multi-output models are possible as well
- (Definitions in literature and documentation can differ a bit)
- Binary classification forms the majority of applied settings
31
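The difference between classification and regression lies in the type of label, not necessarily in the technique. As a hedged illustration (k-nearest neighbors is chosen here only because it is short; the slides do not prescribe it), the sketch below uses the same neighbor search for both tasks: a majority vote for a categorical label and an average for a continuous label. The data values are invented.

```python
import math
from collections import Counter


def nearest_neighbors(train_X, query, k):
    """Return the indices of the k training instances closest to the query."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Classification: predict the most frequent label among the neighbors."""
    neighbors = nearest_neighbors(train_X, query, k)
    return Counter(train_y[i] for i in neighbors).most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Regression: predict the mean of the neighbors' numeric labels."""
    neighbors = nearest_neighbors(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k


# Hypothetical data: one numeric feature, one categorical and one continuous target.
train_X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
labels = ["no", "no", "no", "yes", "yes", "yes"]          # categorical label
amounts = [10.0, 20.0, 30.0, 100.0, 110.0, 120.0]         # continuous label

predicted_class = knn_classify(train_X, labels, [2.5])
predicted_amount = knn_regress(train_X, amounts, [2.5])
```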
Supervised learning 32
Unsupervised learning
- Extract patterns from the data as is
  - Clustering: construct groups over the data set
  - Association/sequence/… rule mining: find rules that describe the data
  - Anomaly detection: find outliers in the data set
  - Dimensionality reduction: reduce number of features
- Note that most of these are frequency / distance based
33
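As an illustration of a distance-based unsupervised technique, the following is a minimal k-means clustering sketch. No target variable is used: groups are formed purely from distances between instances. The points and the choice of initial centroids are assumptions for the example, and a fixed number of iterations stands in for a proper convergence check.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins the cluster of its nearest centroid.
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    assignments = [nearest_centroid(point, centroids) for point in points]
    return centroids, assignments


# Hypothetical 2D points forming two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])
```

The result depends on the initial centroids and on the scale of the features, which is one reason distance-based methods usually need careful pre-processing.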
Unsupervised learning 34
Purpose
“So then, unsupervised learning for “descriptive analytics” and supervised learning techniques for “predictive analytics”?”
Kind of, but… what’s the business question/problem?
35
Purpose
- Exploratory analytics: plots, distributions, quick charts, basic correlations… very visual
  - But supervised techniques can also be used for exploratory insights
- Descriptive analytics: yes, unsupervised techniques are commonly used
  - But it depends on the style of pattern you want to obtain; also, you often already have a hypothesis in mind
- Explanatory analytics: unsupervised again?
  - Depending on the target definition and model type used, a supervised model can also be used as an explanatory means
- Predictive analytics: supervised for sure?
  - Though unsupervised techniques can be used as a pre-processing or featurization technique
  - Also consider whether your goal is really predictive
- Prescriptive analytics: “what should I do?”
  - What-if analysis on a trained supervised model
  - Or using good old operations research
36
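A what-if analysis on a trained supervised model can be sketched as follows. The function `churn_probability` is only a stand-in for a trained model: its coefficients, the feature names and the candidate discounts are all invented for illustration. The idea is to score each candidate action with the model and pick the one with the best expected outcome.

```python
import math


def churn_probability(monthly_fee, support_calls):
    """Stand-in for a trained churn model (the coefficients are made up)."""
    score = -3.0 + 0.05 * monthly_fee + 0.4 * support_calls
    return 1.0 / (1.0 + math.exp(-score))


def expected_revenue(monthly_fee, support_calls, discount):
    """Expected monthly revenue if the customer is offered the given discount."""
    fee = monthly_fee - discount
    return (1.0 - churn_probability(fee, support_calls)) * fee


def best_discount(monthly_fee, support_calls, candidate_discounts):
    """What-if analysis: try every candidate action and keep the best one."""
    return max(
        candidate_discounts,
        key=lambda discount: expected_revenue(monthly_fee, support_calls, discount),
    )


# Hypothetical customer paying 60 per month with 2 support calls.
choice = best_discount(60.0, 2, [0.0, 10.0, 20.0, 30.0])
```

Under these assumed coefficients the middle discount of 20 wins: a smaller discount leaves churn risk too high, while a larger one gives away more revenue than the retained customers bring in.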
Purpose
“Assume I have a trained, validated model which works well. How would the model be used? Which features can I give it at the time of usage? Do I want it to make predictions going forward? What’s my end goal?”
Purpose is key!
While machine learning is a powerful tool, keep in mind that a large majority of ML/AI use cases in business are not really about ML, but about automation!
37
Purpose https://developers.google.com/machine-learning/guides/rules-of-ml 38
Purpose
“An interesting finding is that increasing the performance of a model does not necessarily translate into a gain in [business] value. In general we found that often the best problem is not the one that comes to mind immediately and that changing the set up is a very effective way to unlock value.”
– https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/
39
Key criteria
In any case, we want the models and patterns that we find to be:
- Valid: hold on new data with some certainty, i.e. generalizable
  - Over time, seasonal effects, overfitting, sub-groups, regional differences…
- Useful: should be possible to act on the item, i.e. actionable
  - Business question, implementation, maintenance costs, ease-of-use…
- Unexpected: non-obvious to the system, i.e. interesting
  - Balance between trust and discovery…
  - Big and “weird” data
- Understandable: humans should be able to interpret the pattern
  - Black box vs. white box, trust, validity…
40
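The “valid” criterion can be checked by evaluating a model on data it has not seen. The sketch below is a deliberately extreme illustration with invented data: a “memorizer” that overfits by storing every training row, compared with a simple threshold rule. Both are perfect on the training set, but only one holds up on the held-out set. The function names and the alternating split are assumptions for this example.

```python
def accuracy(model, X, y):
    """Fraction of instances for which the model predicts the correct label."""
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)


def fit_memorizer(train_X, train_y, default=0):
    """Overfit on purpose: remember every training row, answer `default` otherwise."""
    table = {tuple(x): label for x, label in zip(train_X, train_y)}
    return lambda x: table.get(tuple(x), default)


def fit_threshold(train_X, train_y):
    """Predict 1 when the feature exceeds the midpoint between the two class means."""
    zeros = [x[0] for x, label in zip(train_X, train_y) if label == 0]
    ones = [x[0] for x, label in zip(train_X, train_y) if label == 1]
    midpoint = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x[0] > midpoint else 0


# Hypothetical data set with a single numeric feature and a binary target.
X = [[1], [2], [3], [4], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out every second instance; the models never see these during fitting.
train_X, train_y = X[0::2], y[0::2]
test_X, test_y = X[1::2], y[1::2]

memorizer = fit_memorizer(train_X, train_y)
threshold = fit_threshold(train_X, train_y)

train_scores = (accuracy(memorizer, train_X, train_y), accuracy(threshold, train_X, train_y))
test_scores = (accuracy(memorizer, test_X, test_y), accuracy(threshold, test_X, test_y))
```

Training accuracy alone cannot distinguish the two models; the held-out accuracy can. In practice the split should also reflect the concerns listed above, for example splitting over time when the model will be used on future data.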
Valid → generalizable
https://www.gwern.net/Tanks
“RL agent in Udacity self-driving car rewarded for speed learns to spin in circles”
“NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days survival, yields an optimal plan of killing 2/3 crew & keep survivor alive as long as possible”
41