

  1. Data Lake to AI on GPUs

  2. CPUs can no longer handle the growing data demands of data science workloads. Slow Process: preparing data and training models can take days or even weeks. Suboptimal Infrastructure: hundreds to tens of thousands of CPU servers are needed in data centers.

  3. GPUs are well known for accelerating the training of machine learning and deep learning models. Machine Learning: performance improvements increase at scale. Deep Learning (Neural Networks): 40x improvement over CPU.

  4. But data preparation still happens on CPUs, and can't keep up with GPU-accelerated machine learning.
      • Apache Spark: Query → ETL → ML Train, all on CPUs
      • Apache Spark + GPU ML: Query and ETL on CPUs, ML Train on GPUs
      Enterprise GPU users find it challenging to "Feed the Beast".

  5. An end-to-end analytics solution on GPUs is the only way to maximize GPU power. RAPIDS (All GPU): Query → ETL → ML Train. The expertise required spans three stacks:
      • Python, Data Science, Machine Learning
      • GPU DBMS, GPU Columnar Analytics, Data Lakes
      • CUDA, Machine Learning, Deep Learning

  6. RAPIDS, the end-to-end GPU analytics ecosystem: a set of open source libraries for GPU-accelerating data preparation and machine learning. The stack spans Data Preparation, Model Training, and Visualization, with cuDF (data preparation), cuML (machine learning), and cuGraph (graph analytics), all in GPU memory.

      import cudf
      from cuml import KNN
      import numpy as np

      np_float = np.array([
          [1, 2, 3],  # Point 1
          [1, 2, 3],  # Point 2
          [1, 2, 3],  # Point 3
      ]).astype('float32')

      gdf_float = cudf.DataFrame()
      gdf_float['dim_0'] = np.ascontiguousarray(np_float[:, 0])
      gdf_float['dim_1'] = np.ascontiguousarray(np_float[:, 1])
      gdf_float['dim_2'] = np.ascontiguousarray(np_float[:, 2])

      print('n_samples = 3, n_dims = 3')
      print(gdf_float)

      knn_float = KNN(n_gpus=1)
      knn_float.fit(gdf_float)
      Distance, Index = knn_float.query(gdf_float, k=3)  # Get 3 nearest neighbors

      print(Index)
      print(Distance)
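
  The slide names cuGraph without showing code for it. As a minimal sketch of the same pattern, assuming a toy edge list (the data and column names below are invented for illustration), graph analytics runs on a cuDF DataFrame entirely in GPU memory:

      import cudf
      import cugraph

      # Toy edge list standing in for data loaded from a data lake
      edges = cudf.DataFrame({'src': [0, 1, 2, 2], 'dst': [1, 2, 0, 3]})

      G = cugraph.Graph()
      G.from_cudf_edgelist(edges, source='src', destination='dst')

      # Returns a cuDF DataFrame with one PageRank score per vertex
      pagerank_df = cugraph.pagerank(G)
      print(pagerank_df)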

  7. BlazingSQL: The GPU SQL Engine on RAPIDS. A SQL engine built on RAPIDS. Query enterprise data lakes lightning fast with full interoperability with the RAPIDS stack.

  8. BlazingSQL, the GPU SQL Engine for RAPIDS. A SQL engine built on RAPIDS: query enterprise data lakes lightning fast with full interoperability with the RAPIDS stack (cuDF data preparation, cuML machine learning, cuGraph graph analytics), all in GPU memory.

      from blazingsql import BlazingContext

      bc = BlazingContext()

      # Register filesystem
      bc.hdfs('data', host='129.13.0.12', port=54310)

      # Create table
      bc.create_table('performance', file_type='parquet', path='hdfs://data/performance/')

      # Execute query; the result is a GPU DataFrame (GDF)
      result_gdf = bc.run_query('SELECT * FROM performance WHERE YEAR(maturity_date) > 2005')
      print(result_gdf)
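
  To illustrate the interoperability claim, a minimal sketch: the cuDF DataFrame returned by bc.run_query() can be passed straight into cuML without leaving GPU memory. The feature columns selected here are hypothetical, chosen only for illustration:

      from cuml import KNN

      # Hypothetical numeric feature columns from the query result above
      features = result_gdf[['loan_age', 'borrower_credit_score']]

      knn = KNN(n_gpus=1)
      knn.fit(features)
      Distance, Index = knn.query(features, k=5)  # 5 nearest neighbors per row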

  9. Getting Started Demo

  10. BlazingSQL + XGBoost Loan Risk Demo: train a model to assess the risk of new mortgage loans based on Fannie Mae loan performance data.
      Pipeline: Mortgage Data → ETL/Feature Engineering → XGBoost Training
      Mortgage Data: 4.22M loans, 148M performance records, CSV files on HDFS
      GPU cluster: 1 node, 8 vCPUs, 30GB RAM, 1 Tesla T4 GPU (2560 CUDA cores, 16GB VRAM)
      CPU cluster: 4 nodes, 16 vCPUs per node
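
  The deck does not include the training code itself. As a hedged sketch under stated assumptions (stand-in random data, a binary default-risk target, and parameter values chosen only for illustration), GPU-accelerated XGBoost training of this kind looks like the following:

      import numpy as np
      import xgboost as xgb

      # Stand-in arrays; in the demo, features come from the BlazingSQL ETL phase
      X_train = np.random.rand(1000, 10).astype('float32')
      y_train = np.random.randint(0, 2, size=1000)

      dtrain = xgb.DMatrix(X_train, label=y_train)
      params = {
          'objective': 'binary:logistic',  # loan default as a binary target (assumption)
          'tree_method': 'gpu_hist',       # run the histogram algorithm on the GPU
          'max_depth': 8,
      }
      model = xgb.train(params, dtrain, num_boost_round=100)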

  11. RAPIDS + BlazingSQL outperforms traditional CPU pipelines.
      [Chart: Demo Timings (ETL Phase), time in seconds from 0'' to 3000'', comparing 3.8GB and 15.6GB workloads on 1 x T4 versus a 4-node CPU cluster.]

  12. Scale up the data on a DGX: 4 x V100 GPUs.

  13. BlazingSQL + Graphistry Netflow Analysis: visually analyze the VAST netflow data set inside Graphistry in order to quickly detect anomalous events.
      Netflow Data: 65M events, 1,440 devices, 2 weeks
      Pipeline: Netflow Data → ETL → Visualization
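
  A minimal pyGraphistry sketch of the visualization step, assuming placeholder credentials, a small pandas edge list standing in for the ETL output, and invented column names:

      import pandas as pd
      import graphistry

      # Placeholder credentials; a real run needs a Graphistry account or server
      graphistry.register(key='YOUR_API_KEY')

      # Stand-in for the ETL output: one row per netflow event
      edges_df = pd.DataFrame({
          'src_ip': ['10.0.0.1', '10.0.0.2', '10.0.0.1'],
          'dst_ip': ['10.0.0.9', '10.0.0.9', '10.0.0.2'],
      })

      # Bind edge endpoints and open the interactive visualization
      graphistry.bind(source='src_ip', destination='dst_ip').edges(edges_df).plot()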

  14. Benchmarks. [Chart: Netflow Demo Timings (ETL Only).]

  15. Benefits of BlazingSQL
      • Blazing Fast: massive time savings with our GPU-accelerated ETL pipeline.
      • Data Lake to RAPIDS: query data from data lakes directly with SQL into GPU memory, and let RAPIDS do the rest.
      • Minimal Code Changes Required: RAPIDS with BlazingSQL mirrors pandas and SQL interfaces for seamless onboarding.
      • Stateless and Simple: keeping the underlying services stateless reduces complexity and increases extensibility.
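
  As a small sketch of the "minimal code changes" point (the toy table and column names are invented): cuDF deliberately mirrors the pandas API, so porting often amounts to swapping the import.

      import cudf  # in place of: import pandas as pd

      # Toy data standing in for a loans table
      df = cudf.DataFrame({
          'state': ['CA', 'TX', 'CA'],
          'loan_amount': [250000, 90000, 410000],
      })

      # Same filtering and aggregation syntax as pandas
      risky = df[df['loan_amount'] > 100000]
      print(risky.groupby('state')['loan_amount'].mean())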

  16. Upcoming BlazingSQL Releases
      • V0.1, Query GDFs: use the PyBlazing connection to execute SQL queries on GDFs that are loaded by the cuDF API.
      • V0.2, Direct Query Flat Files: integrate the FileSystem API, adding the ability to directly query flat files (Apache Parquet & CSV) inside distributed file systems.
      • V0.3, String Support: string support and string operation support.
      • V0.4, Physical Plan Optimizer: partition culling for WHERE clauses and joins.
      • V0.5, Distributed Scheduler: SQL queries are fanned out across multiple GPUs and servers.

  17. Get Started: BlazingSQL is quick to get up and running using either DockerHub or a Conda install.
