

  1. CSE 232A: Database System Implementation
     Arun Kumar
     Topic 8: Data Systems for ML Workloads
     Book: “Data Management in ML Systems” by Morgan & Claypool Publishers

  2. “Big Data” Systems
     ❖ Parallel RDBMSs and Cloud-Native RDBMSs
     ❖ Beyond RDBMSs: A Brief History
     ❖ “Big Data” Systems
     ❖ The MapReduce/Hadoop Craze
     ❖ Spark and Other Dataflow Systems
     ❖ Key-Value NoSQL Systems
     ❖ Graph Processing Systems
     ❖ Advanced Analytics/ML Systems

  3. Lifecycle/Tasks of ML-based Analytics (lifecycle figure): Data acquisition, Data preparation, Feature Engineering, Model Selection, Training, Inference, Monitoring

  4. ML 101: Popular Forms of ML
     Generalized Linear Models (GLMs); from statistics
     Bayesian Networks; inspired by causal reasoning
     Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
     Support Vector Machines (SVMs); inspired by psychology
     Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
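For concreteness, here is a minimal sketch (assuming scikit-learn; the hyperparameters are illustrative and not from the slide) of how several of these model families are instantiated in a common in-memory ML library:

```python
# Minimal sketch: instantiating several popular ML model families in scikit-learn.
# Hyperparameters below are illustrative, not tuned.
from sklearn.linear_model import LogisticRegression          # a GLM
from sklearn.tree import DecisionTreeClassifier              # CART-style decision tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC                                   # support vector machine
from sklearn.neural_network import MLPClassifier              # multi-layer perceptron (ANN)

models = {
    "GLM (logistic regression)": LogisticRegression(max_iter=1000),
    "Decision tree (CART)": DecisionTreeClassifier(max_depth=5),
    "Random forest": RandomForestClassifier(n_estimators=100),
    "Gradient-boosted trees": GradientBoostingClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf"),
    "MLP (ANN)": MLPClassifier(hidden_layer_sizes=(64, 32)),
}

# All of these expose the same high-level interface:
#   model.fit(X_train, y_train); model.predict(X_test)
```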

  5. Advanced Analytics/ML Systems
     Q: What is a Machine Learning (ML) System?
     ❖ A data processing system (aka data system) for mathematically advanced data analysis ops (inferential or predictive), i.e., beyond just SQL aggregates
     ❖ Statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
     ❖ High-level APIs for expressing statistical/ML/DL computations over large datasets
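As one hedged illustration of such a high-level API, the sketch below (assuming a running PySpark installation; the file path and column names are hypothetical) expresses model training over a potentially large, distributed dataset without the user writing any parallel code:

```python
# Minimal sketch: an ML computation over a (potentially large) dataset via a
# high-level API, here Spark MLlib. Path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-system-example").getOrCreate()

df = spark.read.parquet("hdfs:///data/labeled_examples.parquet")  # hypothetical path
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# The system, not the user, decides how to partition the data and parallelize training.
model = LogisticRegression(maxIter=20).fit(train)
```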

  6. Data Management Concerns in ML
     Q: How do “ML Systems” relate to ML?
     ML Systems : ML :: Computer Systems : TCS
     Key concerns in ML:
       Accuracy
       Runtime efficiency (sometimes)
     Additional key practical concerns in ML Systems (long-standing concerns in the DB systems world!):
       Scalability (and efficiency at scale) - Q: What if the dataset is larger than single-node RAM?
       Usability - Q: How are the features and models configured?
       Manageability - Q: How does it fit within production systems and workflows?
       Developability - Q: How to simplify the implementation of such systems?
     Can often trade off accuracy a bit to gain on the rest!
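To make the larger-than-RAM question concrete, here is a minimal sketch (assuming scikit-learn and pandas; the file name, columns, and chunk size are hypothetical) of out-of-core training: the data is streamed in chunks and the model is updated incrementally, so the full dataset never has to fit in single-node memory:

```python
# Minimal sketch: out-of-core (larger-than-RAM) training by streaming chunks.
# File name, column names, and chunk size are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # logistic regression trained with SGD
classes = np.array([0, 1])             # all class labels must be declared up front

for chunk in pd.read_csv("examples.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)  # incremental update; chunk is then discarded
```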

  7. Conceptual System Stack Analogy
     Level                  | Relational DB Systems                   | ML Systems
     Theory                 | First-Order Logic, Complexity Theory    | Learning Theory, Optimization Theory
     Program Formalism      | Relational Algebra                      | Matrix Algebra, Gradient Descent
     Program Specification  | Declarative Query Language              | TensorFlow? R? Scikit-learn?
     Program Modification   | Query Optimization                      | ???
     Execution Primitives   | Parallel Relational Operator Dataflows  | Depends on ML Algorithm
     Hardware               | CPU, GPU, FPGA, NVM, RDMA, etc. (common to both)
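As a loose illustration of the “Program Specification” row only (a sketch; the SQL table and the synthetic data below are hypothetical), both sides let the user state what to compute at a high level and leave the how to the system:

```python
# Loose illustration of the "Program Specification" level in the analogy.
# The SQL table/columns are hypothetical; the ML data is synthetic.

# Relational side: a declarative query; the optimizer picks the physical plan.
sql_spec = """
SELECT region, AVG(sales)
FROM orders
GROUP BY region;
"""

# ML side: a high-level model specification; the library decides how to fit it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)
```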

  8. Categorizing ML Systems
     ❖ Orthogonal Dimensions of Categorization:
       1. Scalability: In-memory libraries vs. scalable ML systems (work on larger-than-memory datasets)
       2. Target Workloads: General ML library vs. decision tree-oriented vs. deep learning-oriented, etc.
       3. Implementation Reuse: Layered on top of a scalable data system vs. custom from-scratch framework

  9. Major Existing ML Systems
     General ML libraries: in-memory; disk-based files; layered on RDBMS/Spark; cloud-native; “AutoML” platforms
     Decision tree-oriented
     Deep learning-oriented
     (Example systems in each category appeared as images in the original slide.)

  10. ML as Numeric Optimization
      ❖ Recall that an ML model is a parametric function: f : D_W × D_X → D_Y
      ❖ Training: the process of fitting model parameters from data
      ❖ Training can be expressed in this form for many ML models; aka “empirical risk minimization” (ERM), with L() aka the “loss” function:
        L(W) = Σ_{i=1}^{n} l(y_i, f(W, x_i)), where (x_i, y_i) is a training example
      ❖ l() is a differentiable function; can be compositions
      ❖ GLMs, linear SVMs, and ANNs fit the above template
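As a hedged, minimal sketch of this ERM template (synthetic data, a linear model f(W, x) = W·x, squared loss, and an illustrative step size; none of these choices come from the slide), plain batch gradient descent directly minimizes L(W):

```python
# Minimal sketch: empirical risk minimization L(W) = sum_i l(y_i, f(W, x_i))
# for a linear model f(W, x) = W . x with squared loss, via batch gradient descent.
import numpy as np

rng = np.random.default_rng(42)
n, d = 1000, 5
X = rng.normal(size=(n, d))                  # training features x_i
true_W = rng.normal(size=d)
y = X @ true_W + 0.1 * rng.normal(size=n)    # training labels y_i

W = np.zeros(d)
lr = 0.1                                     # illustrative step size
for _ in range(200):
    preds = X @ W                            # f(W, x_i) for all i
    grad = (2.0 / n) * X.T @ (preds - y)     # gradient of the loss, scaled by 1/n for convenience
    W -= lr * grad

final_loss = np.mean((X @ W - y) ** 2)       # empirical risk after training
```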
