DSC 102: Systems for Scalable Analytics
Arun Kumar

Topic 4: ML Data Preparation and Model Selection
Chapter 8 (Sections 8.1, 8.2, 8.3, 8.4) of the MLSys Book
DSC 102 will get you thinking about the fundamentals of scalable analytics systems:
1. "Systems": What resources does a computer have? How to store and compute efficiently over large data? What is cloud computing?
2. "Scalability": How to scale and parallelize data-intensive computations?
3. Scalable Systems for "Analytics":
   3.1. Source: Data acquisition & preparation for ML
   3.2. Build: Dataflow & deep learning systems
   3.3. Deploying ML models
4. Hands-on experience with tools for scalable analytics
The Lifecycle of ML-based Analytics

[Figure: lifecycle diagram spanning Data acquisition -> Data preparation -> Feature Engineering -> Model Selection -> Training & Inference -> Model Serving -> Monitoring]
Data Science in the Real World

Q: How do real-world data scientists spend their time?

[Chart: CrowdFlower Data Science Report 2016]
https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
Data Science in the Real World

Q: How do real-world data scientists spend their time?

[Chart: Kaggle State of ML and Data Science Survey 2018]
Data Science in the Real World

Q: How do real-world data scientists spend their time?

[Chart: IDC-Alteryx State of Data Science and Analytics Report 2019]
Sourcing Stage of Data Science

❖ Data science does not exist in a vacuum. It must interplay with the data-generating process and the prediction application.
❖ Sourcing: The stage of the data science lifecycle where you go from raw datasets to "analytics/ML-ready" datasets.
❖ What makes sourcing challenging/time-consuming?
   ❖ Data access/availability constraints
   ❖ Heterogeneity of data sources/formats/types
   ❖ Messy, incomplete, ambiguous, and/or erroneous data
   ❖ Poor data governance in the organization
   ❖ Bespoke/diverse kinds of prediction applications
   ❖ Evolution of the data-generating process/application
   ❖ Large scale of data
Sourcing Stage of Data Science

❖ Sourcing: The stage of the data science lifecycle where you go from raw datasets to "analytics/ML-ready" datasets.
❖ At a high level, roughly 5 kinds of activities take raw data sources/repos to analytics/ML-ready data:
   1. Acquiring
   2. Organizing
   3. Cleaning
   4. Labeling (sometimes)
   5. Feature Engineering (aka Feature Extraction)
Acquiring Data

[Figure: the 5-step sourcing pipeline from raw data sources/repos to analytics/ML-ready data, with step 1 (Acquiring) highlighted]
Acquiring Data: Data Sources

❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources.
❖ Data scientists must know how to access and get datasets! A sketch of typical acquisition code follows this list.
   ❖ Structured data: Exported from RDBMSs (e.g., Redshift), often with SQL
   ❖ Semistructured data: Exported from "NoSQL" stores (e.g., MongoDB)
   ❖ Log files, text files, documents, multimedia files, etc.: Typically stored on HDFS, S3, etc.
   ❖ Graph/network data: Managed by Neo4j

Ad: Take DSC 104 to learn about semistructured and graph databases.
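As a hedged illustration, here is a minimal Python sketch of acquiring data from two of these source types; the connection string, table, bucket, and file names are all hypothetical.

    # A minimal sketch of pulling raw data from two common source types.
    import pandas as pd
    import sqlalchemy
    import boto3

    # Structured data: export a table from an RDBMS with SQL.
    engine = sqlalchemy.create_engine("postgresql://user:pass@dbhost:5432/sales")  # hypothetical DSN
    orders = pd.read_sql("SELECT order_id, user_id, amount FROM orders", engine)  # hypothetical table

    # Files (logs, images, etc.): download from an object store such as S3.
    s3 = boto3.client("s3")
    s3.download_file("my-raw-data-bucket", "logs/clicks.json", "clicks.json")  # hypothetical bucket/key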
Acquiring Data: Examples

Example: Recommendation system (e.g., Netflix)
   Prediction App: Identify top movies to display for a user
   Data Sources: User data, movie data, movie images, past click logs

Example: Social media analytics for social science
   Prediction App: Predict which tweets will go viral
   Data Sources: Entity graph data, tweets as JSON dictionaries, structured metadata
Acquiring Data

❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources.
❖ Data scientists must know how to access and get datasets!

Potential challenges and mitigation:
❖ Access control: Learn the organization's data security and authentication policies
❖ Data heterogeneity: Do you really need all data sources/types?
❖ Data volume: Do you really need all the data?
❖ Scale: Avoid sequential file copying
❖ Manual errors: Use automated "data pipeline" services such as Apache Airflow (later); a minimal sketch follows
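A hedged sketch of such an automated pipeline as an Airflow DAG, assuming a recent Airflow 2.x install; the DAG id, task names, and task bodies are hypothetical placeholders.

    # A minimal Airflow DAG automating a two-step acquisition pipeline.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def export_from_rdbms():
        ...  # e.g., run a SQL export and stage the result on S3

    def validate_export():
        ...  # e.g., check row counts / file sizes before downstream use

    with DAG(dag_id="acquire_raw_data",
             start_date=datetime(2024, 1, 1),
             schedule="@daily",
             catchup=False) as dag:
        export = PythonOperator(task_id="export_from_rdbms", python_callable=export_from_rdbms)
        validate = PythonOperator(task_id="validate_export", python_callable=validate_export)
        export >> validate  # declared dependency; Airflow handles retries/alerting

Unlike manual copying, the scheduler reruns failed tasks and records what ran when, which is exactly the kind of manual error this mitigates.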
Organizing Data

[Figure: the 5-step sourcing pipeline from raw data sources/repos to analytics/ML-ready data, with step 2 (Organizing) highlighted]
(Re-)Organizing Data

❖ Given diverse data sources/file formats, the data scientist must reorganize them into a usable format for analytics/ML.
❖ The organization is specific to the analytics/ML task at hand.
❖ Might need SQL, MapReduce (later), and file handling.
❖ Examples of usable organization:

Prediction App: Fraud detection in banking
   Flatten JSON records; joins to denormalize -> large single-table CSV file, say, on HDFS

Prediction App: Image captioning on social media
   Fuse JSON records; extract image tensors -> large binary file with 1 image tensor and 1 string per line
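A hedged sketch of the fraud-detection re-organization above using pandas; the file names and the join key are hypothetical.

    # Flatten nested JSON records and denormalize via a join into one CSV.
    import json
    import pandas as pd

    # Flatten: nested JSON transaction records become a flat table.
    with open("transactions.json") as f:
        records = [json.loads(line) for line in f]
    txns = pd.json_normalize(records)  # nested fields become dotted columns

    # Denormalize: join in customer attributes so everything is in one table.
    customers = pd.read_csv("customers.csv")
    flat = txns.merge(customers, on="customer_id", how="left")

    # Write out one large single-table CSV for the downstream ML tooling.
    flat.to_csv("fraud_ml_ready.csv", index=False)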
(Re-)Organizing Data: Tips

❖ Data re-organization these days often involves a lot of coding (Python, SQL, Java) and scripting (bash).

Some suggested best practices:
❖ Documentation: Maintain notes/READMEs with your code
❖ Automation: Use scripts (meta-programs) to automate the orchestration of data re-org. code
❖ Provenance: Manage metadata on where your data records/variables come from and why they are there
❖ Versioning: You might do data re-org. many times; manage metadata on what version has what and when
(Re-)Organizing Data: Schematization

❖ Increasingly, "ML platforms" in industry are imposing more discipline on what re-organized data must look like.
❖ Lightweight and flexible schemas are becoming common.
❖ Makes it easier to automate data validation.

https://www.tensorflow.org/tfx/guide
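For instance, a minimal sketch of schema-based validation in the TFX ecosystem using TensorFlow Data Validation (TFDV); the file paths are hypothetical, and display_anomalies assumes a notebook environment.

    # Infer a lightweight schema from training data, then validate new data.
    import tensorflow_data_validation as tfdv

    train_stats = tfdv.generate_statistics_from_csv("train.csv")
    schema = tfdv.infer_schema(statistics=train_stats)

    # Anomalies flag missing columns, type mismatches, out-of-domain values, etc.
    new_stats = tfdv.generate_statistics_from_csv("new_batch.csv")
    anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
    tfdv.display_anomalies(anomalies)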
(Re-)Organizing Data

❖ Custom ML platforms are proliferating in industry, each with its own approach to organizing and cataloging ML data!

https://eng.uber.com/michelangelo/
Data Cleaning

[Figure: the 5-step sourcing pipeline from raw data sources/repos to analytics/ML-ready data, with step 3 (Cleaning) highlighted]
Data Cleaning

❖ Real-world datasets often have errors, ambiguity, incompleteness, inconsistency, and other quality issues.
❖ Data cleaning: The process of fixing data quality issues to ensure errors do not cascade into/corrupt analytics/ML results.
❖ Diverse sources/causes of data quality issues:
   ❖ Human-generated data: Mistakes, misunderstandings
   ❖ Hardware-generated data: Noise, failures
   ❖ Software-generated data: Bugs, errors, semantic issues
   ❖ Attribute encoding/formatting conventions (e.g., dates)
   ❖ Attribute unit/semantics conventions (e.g., km vs. mi)
   ❖ Data integration: Duplicate entities, value differences
   ❖ Evolution of data schemas in the application
Data Cleaning Task: Missing Values

❖ A long-standing problem studied in statistics and DB/AI.
❖ Various assumptions on the "missingness" property, in terms of how missingness correlates with the missing vs. observed values in the dataset:
   ❖ Missing Completely at Random (MCAR): Missingness has no (causal) relationship with either missing or observed values
   ❖ Missing at Random (MAR): Missingness has systematic relationships with the observed values
   ❖ Missing Not at Random (MNAR): Missingness itself depends on the value that is missing
❖ Common recipe (sketched below): add a 0/1 missingness indicator variable and impute the missing values:
   ❖ Statistical approaches: use distributional properties
   ❖ ML/DL-based approaches: self-supervised prediction of missing values
   ❖ Some ML packages offer these at scale (e.g., Dask-ML)
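A hedged sketch of the "indicator + impute" recipe with scikit-learn; the toy matrix is made up.

    # Mean imputation plus 0/1 missingness indicator columns.
    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 20.0],
                  [2.0, np.nan],
                  [np.nan, 40.0]])

    # add_indicator=True appends one 0/1 column per feature that had missing
    # values, so the model can still "see" that a value was imputed.
    imputer = SimpleImputer(strategy="mean", add_indicator=True)
    X_clean = imputer.fit_transform(X)
    # Columns: [feat0 imputed, feat1 imputed, feat0 was-missing, feat1 was-missing]
    print(X_clean)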
Data Cleaning Task: Entity Matching

❖ A common cleaning task for multi-source datasets.
❖ Duplicates of real-world entities can arise when using data drawn from multiple sources.
❖ Often need to match and deduplicate entities in the unified data; otherwise, query/ML answers can be wrong!
❖ Aka entity deduplication / record linkage / entity linkage.

Customers1: FullName = Aisha Williams | Age = 27 | City = San Diego | State = CA
Customers2: LastName = Williams | FirstName = Aisha | MI = R | Age = 27 | Zipcode = 92122

Q: Are these the same person ("entity")?
General Workflow of Entity Matching

❖ 3 main stages: Blocking -> Pairwise check -> Clustering (toy end-to-end sketch below)
❖ Pairwise check:
   ❖ Given 2 records, how likely is it that they are the same entity? SOTA: "entity embeddings" + deep learning
❖ Blocking:
   ❖ Pairwise checks over a whole table cost too much: O(n^2)
   ❖ Create "blocks"/subsets of records; do pairwise checks only within a block
   ❖ Domain-specific heuristics prune "obvious" non-matches using similarity/distance metrics (e.g., edit distance on Name)
❖ Clustering:
   ❖ Given the pairwise scores, consolidate records into entities
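A simplified, self-contained sketch of the 3-stage workflow; the records are made up, and a toy string-similarity score stands in for the learned (embedding + DL) matchers used in practice.

    # Toy entity matching: block on zipcode, score pairs, cluster via union-find.
    from difflib import SequenceMatcher
    from itertools import combinations
    from collections import defaultdict

    records = [  # hypothetical records: (id, name, zipcode)
        (0, "Aisha Williams", "92122"),
        (1, "Aisha R. Williams", "92122"),
        (2, "Bob Chen", "92093"),
    ]

    # 1) Blocking: only compare records sharing a zipcode, avoiding O(n^2) pairs.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[2]].append(rec)

    # 2) Pairwise check: string similarity as a stand-in for a learned matcher.
    matches = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            if SequenceMatcher(None, r1[1], r2[1]).ratio() > 0.8:
                matches.append((r1[0], r2[0]))

    # 3) Clustering: union-find consolidates matched pairs into entities.
    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in matches:
        parent[find(a)] = find(b)
    print([find(i) for i in range(len(records))])  # entity id per record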
Data Cleaning

Q: How can we even hope to automate data cleaning?

❖ Many approaches studied by the DB and AI communities:
   ❖ Integrity constraints: E.g., if ZipCode is the same across customer records, State must be the same too (checked in the sketch below)
   ❖ Business logic/rules: programs encoding domain knowledge
   ❖ Supervised ML: E.g., predict missing values
❖ Unfortunately, data quality issues are often so peculiar and specific to the dataset/application that human intervention (by the data scientist) is often the only reliable way in practice! ☺
   ❖ Crowdsourcing / expertsourcing is another alternative

Data cleaning in practice is "death by a thousand cuts"! :)
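A minimal sketch of checking that ZipCode -> State integrity constraint (a functional dependency) with pandas; the toy rows are made up.

    # Flag zipcodes that map to more than one state.
    import pandas as pd

    df = pd.DataFrame({
        "ZipCode": ["92122", "92122", "92093"],
        "State":   ["CA",    "NY",    "CA"],   # the "NY" row violates the constraint
    })

    states_per_zip = df.groupby("ZipCode")["State"].nunique()
    violations = states_per_zip[states_per_zip > 1]
    print(violations)  # flags ZipCode 92122 for human review or rule-based repair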