

  1. CSE 232A: Database System Implementation
     Arun Kumar
     Topic 8: Data Systems for ML Workloads
     Book: “Data Management in ML Systems” by Morgan & Claypool Publishers

  2. “Big Data” Systems
     ❖ Parallel RDBMSs and Cloud-Native RDBMSs
     ❖ Beyond RDBMSs: A Brief History
     ❖ “Big Data” Systems
     ❖ The MapReduce/Hadoop Craze
     ❖ Spark and Other Dataflow Systems
     ❖ Key-Value NoSQL Systems
     ❖ Graph Processing Systems
     ❖ Advanced Analytics/ML Systems

  3. Lifecycle/Tasks of ML-based Analytics (lifecycle figure): Data acquisition, Data preparation, Feature Engineering, Model Selection, Training, Inference, Monitoring

  4. ML 101: Popular Forms of ML
     Generalized Linear Models (GLMs); from statistics
     Bayesian Networks; inspired by causal reasoning
     Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
     Support Vector Machines (SVMs); inspired by psychology
     Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
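For concreteness, here is a minimal sketch (assuming scikit-learn; the hyperparameters are illustrative and not from the slide) of how several of these model families are instantiated in a common in-memory ML library:

```python
# Minimal sketch: instantiating several popular ML model families in scikit-learn.
# Hyperparameters below are illustrative, not tuned.
from sklearn.linear_model import LogisticRegression          # a GLM
from sklearn.tree import DecisionTreeClassifier              # CART-style decision tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC                                   # support vector machine
from sklearn.neural_network import MLPClassifier              # multi-layer perceptron (ANN)

models = {
    "GLM (logistic regression)": LogisticRegression(max_iter=1000),
    "Decision tree (CART)": DecisionTreeClassifier(max_depth=5),
    "Random forest": RandomForestClassifier(n_estimators=100),
    "Gradient-boosted trees": GradientBoostingClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf"),
    "MLP (ANN)": MLPClassifier(hidden_layer_sizes=(64, 32)),
}

# All of these expose the same high-level interface:
#   model.fit(X_train, y_train); model.predict(X_test)
```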

  5. Advanced Analytics/ML Systems
     Q: What is a Machine Learning (ML) System?
     ❖ A data processing system (aka data system) for mathematically advanced data analysis ops (inferential or predictive), i.e., beyond just SQL aggregates
     ❖ Statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
     ❖ High-level APIs for expressing statistical/ML/DL computations over large datasets
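As one hedged illustration of such a high-level API, the sketch below (assuming a running PySpark installation; the file path and column names are hypothetical) expresses model training over a potentially large, distributed dataset without the user writing any parallel code:

```python
# Minimal sketch: an ML computation over a (potentially large) dataset via a
# high-level API, here Spark MLlib. Path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-system-example").getOrCreate()

df = spark.read.parquet("hdfs:///data/labeled_examples.parquet")  # hypothetical path
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# The system, not the user, decides how to partition the data and parallelize training.
model = LogisticRegression(maxIter=20).fit(train)
```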

  6. Data Management Concerns in ML
     Q: How do “ML Systems” relate to ML?
     ML Systems : ML :: Computer Systems : TCS
     Key concerns in ML:
       Accuracy
       Runtime efficiency (sometimes)
     Additional key practical concerns in ML Systems (long-standing concerns in the DB systems world!):
       Scalability (and efficiency at scale) - Q: What if the dataset is larger than single-node RAM?
       Usability - Q: How are the features and models configured?
       Manageability - Q: How does it fit within production systems and workflows?
       Developability - Q: How to simplify the implementation of such systems?
     Can often trade off accuracy a bit to gain on the rest!
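To make the larger-than-RAM question concrete, here is a minimal sketch (assuming scikit-learn and pandas; the file name, columns, and chunk size are hypothetical) of out-of-core training: the data is streamed in chunks and the model is updated incrementally, so the full dataset never has to fit in single-node memory:

```python
# Minimal sketch: out-of-core (larger-than-RAM) training by streaming chunks.
# File name, column names, and chunk size are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # logistic regression trained with SGD
classes = np.array([0, 1])             # all class labels must be declared up front

for chunk in pd.read_csv("examples.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)  # incremental update; chunk is then discarded
```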

  7. Conceptual System Stack Analogy
     Level                  | Relational DB Systems                   | ML Systems
     Theory                 | First-Order Logic, Complexity Theory    | Learning Theory, Optimization Theory
     Program Formalism      | Relational Algebra                      | Matrix Algebra, Gradient Descent
     Program Specification  | Declarative Query Language              | TensorFlow? R? Scikit-learn?
     Program Modification   | Query Optimization                      | ???
     Execution Primitives   | Parallel Relational Operator Dataflows  | Depends on ML Algorithm
     Hardware               | CPU, GPU, FPGA, NVM, RDMA, etc. (common to both)
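As a loose illustration of the “Program Specification” row only (a sketch; the SQL table and the synthetic data below are hypothetical), both sides let the user state what to compute at a high level and leave the how to the system:

```python
# Loose illustration of the "Program Specification" level in the analogy.
# The SQL table/columns are hypothetical; the ML data is synthetic.

# Relational side: a declarative query; the optimizer picks the physical plan.
sql_spec = """
SELECT region, AVG(sales)
FROM orders
GROUP BY region;
"""

# ML side: a high-level model specification; the library decides how to fit it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)
```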

  8. Categorizing ML Systems
     ❖ Orthogonal Dimensions of Categorization:
       1. Scalability: In-memory libraries vs. scalable ML systems (work on larger-than-memory datasets)
       2. Target Workloads: General ML library vs. decision tree-oriented vs. deep learning-oriented, etc.
       3. Implementation Reuse: Layered on top of a scalable data system vs. custom from-scratch framework

  9. Major Existing ML Systems
     General ML libraries: in-memory; disk-based files; layered on RDBMS/Spark; cloud-native; “AutoML” platforms
     Decision tree-oriented
     Deep learning-oriented
     (Example systems in each category appeared as images in the original slide.)

  10. ML as Numeric Optimization
      ❖ Recall that an ML model is a parametric function: f : D_W × D_X → D_Y
      ❖ Training: the process of fitting model parameters from data
      ❖ Training can be expressed in this form for many ML models; aka “empirical risk minimization” (ERM), with L() aka the “loss” function:
        L(W) = Σ_{i=1}^{n} l(y_i, f(W, x_i)), where (x_i, y_i) is a training example
      ❖ l() is a differentiable function; can be compositions
      ❖ GLMs, linear SVMs, and ANNs fit the above template
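As a hedged, minimal sketch of this ERM template (synthetic data, a linear model f(W, x) = W·x, squared loss, and an illustrative step size; none of these choices come from the slide), plain batch gradient descent directly minimizes L(W):

```python
# Minimal sketch: empirical risk minimization L(W) = sum_i l(y_i, f(W, x_i))
# for a linear model f(W, x) = W . x with squared loss, via batch gradient descent.
import numpy as np

rng = np.random.default_rng(42)
n, d = 1000, 5
X = rng.normal(size=(n, d))                  # training features x_i
true_W = rng.normal(size=d)
y = X @ true_W + 0.1 * rng.normal(size=n)    # training labels y_i

W = np.zeros(d)
lr = 0.1                                     # illustrative step size
for _ in range(200):
    preds = X @ W                            # f(W, x_i) for all i
    grad = (2.0 / n) * X.T @ (preds - y)     # gradient of the loss, scaled by 1/n for convenience
    W -= lr * grad

final_loss = np.mean((X @ W - y) ** 2)       # empirical risk after training
```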
