CSE 291D/234: Data Systems for Machine Learning
Arun Kumar
Topic 1: Classical ML Training at Scale
Chapters 2, 5, and 6 of the MLSys book
Academic ML 101
“Classical” ML:
❖ Generalized Linear Models (GLMs); from statistics
❖ Bayesian Networks; inspired by causal reasoning
❖ Decision tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
❖ Support Vector Machines (SVMs); inspired by psychology
❖ Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
Real-World ML 101
[Chart from the 2019 Kaggle survey: GLMs, tree learners, and deep learning are the most widely used model families in practice]
https://www.kaggle.com/c/kaggle-survey-2019
Scalable ML Training in the Lifecycle
[Lifecycle diagram: data acquisition, data preparation, feature engineering, model selection, training & inference, serving, monitoring]
Scalable ML Training in the Big Picture
ML Systems
Q: What is a Machine Learning (ML) System?
❖ A data processing system (aka data system) for mathematically advanced data analysis operations (inferential or predictive):
❖ Statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
❖ High-level APIs to express ML computations over (large) datasets
❖ Execution engine to run ML computations efficiently and in a scalable manner
But what exactly does it mean for an ML system to be “scalable”?
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
Background: Memory Hierarchy
❖ Processor caches: ~MBs capacity; ~100 GB/s; ~$2/MB
❖ Main memory (DRAM): ~10 GBs; ~10 GB/s; ~$5/GB; access latency ~100s of CPU cycles
❖ Flash storage: ~TBs; ~GB/s; ~$200/TB; access latency ~10^5–10^6 cycles
❖ Magnetic hard disk drive (HDD): ~10 TBs; ~200 MB/s; ~$40/TB; access latency ~10^7–10^8 cycles
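As a rough back-of-the-envelope illustration (added here; the 1 TB figure is just an assumed example), a single sequential scan of a 1 TB dataset takes roughly 10^12 / (2 x 10^8) ≈ 5,000 s (over 80 minutes) from an HDD at ~200 MB/s, but only about 10^12 / 10^10 ≈ 100 s from DRAM at ~10 GB/s. Iterative ML algorithms that scan the data many times pay this ~50x gap on every pass, which is why where the data sits in this hierarchy matters so much.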
Memory Hierarchy in Action
Q: What does this program do when run with 'python'? (Assume tmp.csv is in the current working directory)

tmp.py:
import pandas as p
m = p.read_csv('tmp.csv', header=None)
s = m.sum().sum()
print(s)

tmp.csv:
1,2,3
4,5,6
Memory Hierarchy in Action
[Diagram: rough sequence of events when the program is executed — tmp.py and tmp.csv are read from disk (I/O for code and data), commands are interpreted, data moves over the bus into DRAM and the caches/registers, the processor (control unit and ALU) computes '21', and the result is stored, retrieved, and sent to the monitor for display]
Q: What if this does not fit in DRAM?
Scalable ML Systems
❖ ML systems that do not require the (training) dataset to fit entirely in main memory (DRAM) of one node
❖ Conversely, if the system thrashes when the data file does not fit in RAM, it is not scalable
Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor
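A minimal sketch of this basic idea for the earlier tmp.py program (an illustration added here, using pandas' chunksize iterator; the chunk size of 2 rows is an arbitrary choice): the sum is accumulated one chunk at a time, so only a small piece of tmp.csv needs to sit in DRAM at any moment.

import pandas as pd

total = 0
# Stage reads: read_csv with chunksize yields one small DataFrame at a time
# instead of materializing the whole file in DRAM
for chunk in pd.read_csv('tmp.csv', header=None, chunksize=2):
    total += chunk.sum().sum()
print(total)  # 21 for the example tmp.csv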
Scalable ML Systems
4 main approaches to scale ML to large data:
❖ Single-node disk: Paged access from file on local disk
❖ Remote read: Paged access from disk(s) over network
❖ Distributed memory: Fits on a cluster's total DRAM
❖ Distributed disk: Fits on a cluster's full set of disks
Evolution of Scalable ML Systems
[Timeline figure spanning the 1980s–mid 1990s through the mid 2010s: in-RDBMS ML systems (late 1990s to mid 2000s), ML on dataflow systems and the parameter server (late 2000s to early 2010s), and deep learning systems and cloud ML (mid 2010s onward); the axes contrast the scalability, manageability, and developability of ML system abstractions]
Major Existing ML Systems
[Slide of example systems shown as logos, grouped by category:]
❖ General ML libraries: in-memory; disk-based files; layered on RDBMS/Spark; cloud-native
❖ “AutoML” platforms
❖ Decision tree-oriented
❖ Deep learning-oriented
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
ML Algorithm = Program Over Data
Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor
❖ To scale an ML program's computations, split them up to operate over “chunks” of data at a time
❖ How to split up an ML program this way can be non-trivial!
❖ Depends on the data access pattern of the algorithm
❖ A large class of ML algorithms do just sequential scans for iterative numerical optimization
<latexit sha1_base64="uvgXp+dpvG7N5TecG7ye7m4Eg=">AB+XicbVC7TsMwFL0pr1JeAUYWiwqJqUp4CMYKFsYi9SW1IXJcp7XqOJHtFKqof8LCAEKs/Akbf4PTdoCWI1k6Oude3eMTJwp7TjfVmFldW19o7hZ2tre2d2z9w+aKk4loQ0S81i2A6woZ4I2NOcthNJcRw2gqGt7nfGlGpWCzqepxQL8J9wUJGsDaSb9vdCOtBEGaPk4c6evKZb5edijMFWibunJRhjpvf3V7MUkjKjThWKmO6yTay7DUjHA6KXVTRNMhrhPO4YKHFHlZdPkE3RilB4KY2me0Giq/t7IcKTUOArMZJ5TLXq5+J/XSXV47WVMJKmgswOhSlHOkZ5DajHJCWajw3BRDKTFZEBlphoU1bJlOAufnmZNM8q7nl8v6iXL2Z1GEIziGU3DhCqpwBzVoAIERPMrvFmZ9WK9Wx+z0YI13zmEP7A+fwCVhJOh</latexit> <latexit sha1_base64="wtingO0LqaDZ0Zshlk1cGJ54IY=">ACO3icbVDLSgMxFM3UV62vqks3wSJUKWXGB7opFN24rGJroa1DJs20oUlmSDJqGea/3PgT7ty4caGIW/emD6haDwROzrmXe+/xQkaVtu1nKzUzOze/kF7MLC2vrK5l1zdqKogkJlUcsEDWPaQIo4JUNdWM1ENJEPcYufZ6ZwP/+pZIRQNxpfshaXHUEdSnGkjudnLJke6/nxXKzB0tw+KU6RrLDqUjceOInsKki7sa05CQ3ArJ836UF6OcnFYV7l+7utmcXbSHgNPEGZMcGKPiZp+a7QBHnAiNGVKq4dihbpkVNMWMJlmpEiIcA91SMNQgThRrXh4ewJ3jNKGfiDNExoO1Z8dMeJK9blnKgd7qr/eQPzPa0TaP2nFVISRJgKPBvkRgzqAgyBhm0qCNesbgrCkZleIu0girE3cGROC8/fkaVLbLzoHxaOLw1z5dBxHGmyBbZAHDjgGZXAOKqAKMHgAL+ANvFuP1qv1YX2OSlPWuGcT/IL19Q0ZDK6q</latexit> Numerical Optimization in ML ❖ Many regression and classification models in ML are formulated as a (constrained) minimization problem ❖ E.g., logistic and linear regression, linear SVM, etc. ❖ Aka “Empirical Risk Minimization” (ERM) D Y X1 X2 X3 n w ∗ = argmin w X 0 1b 1c 1d l ( y i , f ( w , x i )) 1 2b 2c 2d i =1 1 3b 3c 3d 0 4b 4c 4d ❖ GLMs define hyperplanes … … … … and use f() that is a scalar function of distances: w T x i 18
<latexit sha1_base64="JXVS3ntMe9/vLgGdAsMHDEbfGl8=">ACOXicbVDLSgMxFM34rPVdekmWIQWSpnxgW4KRTcuXFSwD+i0QybNtKGZzJBk1DLMb7nxL9wJblwo4tYfMG1noa0HAodziX3HjdkVCrTfDEWFpeWV1Yza9n1jc2t7dzObkMGkcCkjgMWiJaLJGUk7qipFWKAjyXUa7vBy7DfviJA04LdqFJKOj/qcehQjpSUnV7M5chmC1wXbR2rgevF90o0Lw2JShBVoy8h3Ylqxki6HaZIVRg4tQW9uoPTg0GLRyeXNsjkBnCdWSvIgRc3JPdu9AEc+4QozJGXbMkPViZFQFDOSZO1IkhDhIeqTtqYc+UR24snlCTzUSg96gdCPKzhRf0/EyJdy5Ls6Od5Wznpj8T+vHSnvBNTHkaKcDz9yIsYVAEc1wh7VBCs2EgThAXVu0I8QAJhpcvO6hKs2ZPnSeOobB2XT29O8tWLtI4M2AcHoAscAaq4ArUQB1g8AhewTv4MJ6MN+PT+JpGF4x0Zg/8gfH9A82Oq7Y=</latexit> <latexit sha1_base64="ayIl16A2siz4c7l2gSKsLvcCX4=">ACOXicbVDLahtBEJyVk1hRXop9zGWICEiEiF3HJj4K+5JDgpED9AqonfUKw2anV1mei3Eot/yxX+RW8AXHxJCrvmBjB6HSE7BQFVzXRXlClpyfe/e6WDBw8fHZYfV548fb8RfXlUdemuRHYEalKT8Ci0pq7JAkhf3MICSRwl40u1z5vSs0Vqb6Cy0yHCYw0TKWAshJo2o7TICmUVzMl1+L+uxt0FjyUGFMYEw657u897xEAl4qCFSwD/V9wONUbXmN/01+H0SbEmNbdEeVb+F41TkCWoSCqwdBH5GwIMSaFwWQlzixmIGUxw4KiGBO2wWF+5G+cMuZxatzTxNfqvxMFJNYuksglV4vafW8l/s8b5BSfDwups5xQi81Hca4pXxVIx9Lg4LUwhEQRrpduZiCAUGu7IorIdg/+T7pnjSD982z6e1sW2jJ7xV6zOgvYB9ZiH1mbdZhg1+yW/WA/vRvzvl/d5ES9525pjtwPvzF5+7rMc=</latexit> <latexit sha1_base64="Gn3z209XWihwvStV6+mufXVXz8s=">ACH3icbVDLSgMxFM3UV62vUZdugkVoZQZ35tC0Y0LFxXsA9o6ZNJMG5rJDElGLcP8iRt/xY0LRcRd/8b0saitBwKHc84l9x43ZFQqyxoaqaXldW19HpmY3Nre8fc3avJIBKYVHAtFwkSMclJVDHSCAVBvstI3e1fj/z6IxGSBvxeDULS9lGXU49ipLTkmOe3uZaPVM/14qckD0uwJSPfiWnJTh4ZLmBQwvQm8kUnh2aztm1ipaY8BFYk9JFkxRcyfVifAkU+4wgxJ2bStULVjJBTFjCSZViRJiHAfdUlTU458Itvx+L4EHmlA71A6McVHKuzEzHypRz4rk6O9pTz3kj8z2tGyrtsx5SHkSIcTz7yIgZVAEdlwQ4VBCs20ARhQfWuEPeQFjpSjO6BHv+5EVSOy7aJ8Wzu9Ns+WpaRxocgEOQAza4AGVwAyqgCjB4AW/gA3war8a78WV8T6IpYzqzD/7AGP4C0FChmg=</latexit> Batch Gradient Descent for ML n X L ( w ) = l ( y i , f ( w , x i )) i =1 ❖ For many ML models, loss function l() is convex ; so is L() ❖ But closed-form minimization is typically infeasible ❖ Batch Gradient Descent: ❖ Iterative numerical procedure to find an optimal w ❖ Initialize w to some value w (0) n ❖ Compute gradient : X r L ( w ( k ) ) = r l ( y i , f ( w ( k ) , x i )) i =1 ❖ Descend along gradient: w ( k +1) w ( k ) � η r L ( w ( k ) ) (Aka Update Rule ) ❖ Repeat until we get close to w* , aka convergence 19