Data Lake to AI on GPUs
CPUs can no longer handle the growing data demands of data science workloads.
• Slow Process: preparing data and training models can take days or even weeks.
• Suboptimal Infrastructure: hundreds to tens of thousands of CPU servers are needed in data centers.
GPUs are well known for accelerating the training of machine learning and deep learning models.
• Machine Learning: up to a 40x improvement over CPU.
• Deep Learning (Neural Networks): performance improvements increase at scale.
But data preparation still happens on CPUs, and can’t keep up with GPU-accelerated machine learning.
• Apache Spark: Query → ETL → ML Train, all on CPUs.
• Apache Spark + GPU ML: Query and ETL on CPUs, ML Train on GPUs.
Enterprise GPU users find it challenging to “Feed the Beast”.
An end-to-end analytics solution on GPUs is the only way to maximize GPU power.
RAPIDS (All GPU): Query → ETL → ML Train.
Building such a stack requires expertise across:
· Python, Data Science, Machine Learning
· GPU DBMS, GPU Columnar Analytics, Data Lakes
· CUDA, Machine Learning, Deep Learning
RAPIDS, the end-to-end GPU analytics ecosystem

A set of open source libraries for GPU-accelerated data preparation and machine learning, with data staying in GPU memory across the whole pipeline (Data Preparation → Model Training → Visualization):
• cuDF: data preparation
• cuML: machine learning
• cuGraph: graph analytics

import cudf
from cuml import KNN
import numpy as np

np_float = np.array([
    [1, 2, 3],  # Point 1
    [1, 2, 3],  # Point 2
    [1, 2, 3],  # Point 3
]).astype('float32')

gdf_float = cudf.DataFrame()
gdf_float['dim_0'] = np.ascontiguousarray(np_float[:, 0])
gdf_float['dim_1'] = np.ascontiguousarray(np_float[:, 1])
gdf_float['dim_2'] = np.ascontiguousarray(np_float[:, 2])

print('n_samples = 3, n_dims = 3')
print(gdf_float)

knn_float = KNN(n_gpus=1)
knn_float.fit(gdf_float)
distance, index = knn_float.query(gdf_float, k=3)  # Get 3 nearest neighbors

print(index)
print(distance)
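The snippet above covers cuDF and cuML; cuGraph follows the same GPU-DataFrame pattern. A minimal sketch of graph analytics on an edge list; the tiny hand-built graph and the 'src'/'dst' column names are illustrative:

import cudf
import cugraph

# A small edge list built directly in GPU memory (illustrative data)
edges = cudf.DataFrame({
    'src': [0, 1, 2, 2],
    'dst': [1, 2, 0, 3],
})

# Build a graph from the cuDF edge list and run PageRank on the GPU
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source='src', destination='dst')
scores = cugraph.pagerank(G)  # returns a cuDF DataFrame: vertex, pagerank
print(scores.sort_values('pagerank', ascending=False))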
BlazingSQL: The GPU SQL Engine on RAPIDS

A SQL engine built on RAPIDS. Query enterprise data lakes lightning fast, with full interoperability with the RAPIDS stack (cuDF for data preparation, cuML for machine learning, cuGraph for graph analytics), and results land directly in GPU memory:

from blazingsql import BlazingContext

bc = BlazingContext()

# Register a filesystem
bc.hdfs('data', host='129.13.0.12', port=54310)

# Create a table
bc.create_table('performance', file_type='parquet', path='hdfs://data/performance/')

# Execute a query; the result is a GPU DataFrame (GDF)
result_gdf = bc.run_query('SELECT * FROM performance WHERE YEAR(maturity_date) > 2005')
print(result_gdf)
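Because run_query returns a GPU DataFrame, the result plugs straight into the rest of the RAPIDS stack without leaving GPU memory. A minimal sketch continuing from the snippet above; the 'loan_age' and 'interest_rate' columns are hypothetical, not the actual table schema:

from cuml import KNN

# result_gdf is a cuDF DataFrame, so pandas-style filtering runs on the GPU
# ('loan_age' is a hypothetical column name)
recent = result_gdf[result_gdf['loan_age'] > 12]

# ...and the filtered frame can be handed to cuML directly, still on the GPU
knn = KNN(n_gpus=1)
knn.fit(recent[['loan_age', 'interest_rate']])
distance, index = knn.query(recent[['loan_age', 'interest_rate']], k=3)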
Getting Started Demo
BlazingSQL + XGBoost Loan Risk Demo

Train a model to assess the risk of new mortgage loans based on Fannie Mae loan performance data.
Pipeline: Mortgage Data → ETL/Feature Engineering → XGBoost Training
Data: 4.22M loans, 148M performance records, CSV files on HDFS
GPU cluster: 1 node, 8 vCPUs, 1 Tesla T4 GPU (2560 CUDA cores, 16GB VRAM)
CPU cluster: 4 nodes, 16 vCPUs per node, 30GB RAM
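A hedged sketch of how the demo's two phases fit together: BlazingSQL runs the SQL ETL/feature engineering, and the resulting GPU DataFrame feeds XGBoost's GPU tree method. The table name, columns, and 'delinquent' label are illustrative, not the demo's actual schema:

import xgboost as xgb
from blazingsql import BlazingContext

bc = BlazingContext()
bc.create_table('perf', file_type='csv', path='hdfs://data/performance/')

# Feature engineering in SQL; the selected columns are placeholders
features = bc.run_query(
    'SELECT loan_age, interest_rate, delinquent FROM perf'
)

# Train on the GPU with the gpu_hist tree method
# (converting through pandas here for broad XGBoost-version compatibility)
pdf = features.to_pandas()
dtrain = xgb.DMatrix(pdf.drop(columns='delinquent'), label=pdf['delinquent'])
params = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
model = xgb.train(params, dtrain, num_boost_round=100)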
RAPIDS + BlazingSQL outperforms traditional CPU pipelines.
[Chart: demo timings for the ETL phase, in seconds (0–3000), comparing the 3.8GB and 15.6GB datasets on 1 x T4 versus the 4-node CPU cluster.]
Scale up the data on a DGX: 4 x V100 GPUs.
BlazingSQL + Graphistry Netflow Analysis

Visually analyze the VAST netflow data set inside Graphistry in order to quickly detect anomalous events.
Pipeline: Netflow Data → ETL → Visualization
Data: 65M events, 1,440 devices, 2 weeks
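A minimal sketch of this pipeline's shape, assuming netflow rows with source/destination IP columns (names illustrative) and a Graphistry account; the exact register() arguments vary by PyGraphistry version:

import graphistry
from blazingsql import BlazingContext

# Authenticate with Graphistry (placeholder credentials)
graphistry.register(key='YOUR_API_KEY')

bc = BlazingContext()
bc.create_table('netflow', file_type='csv', path='hdfs://data/netflow/')

# ETL on the GPU: aggregate flow counts between device pairs
edges = bc.run_query(
    'SELECT source_ip, dest_ip, COUNT(*) AS flows '
    'FROM netflow GROUP BY source_ip, dest_ip'
)

# Hand the edge list to Graphistry for visual anomaly hunting
g = graphistry.bind(source='source_ip', destination='dest_ip')
g.edges(edges.to_pandas()).plot()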
Benchmarks
[Chart: netflow demo timings, ETL only.]
Benefits of BlazingSQL
• Blazing Fast: massive time savings with our GPU-accelerated ETL pipeline.
• Data Lake to RAPIDS: query data from data lakes directly with SQL into GPU memory, and let RAPIDS do the rest.
• Minimal Code Changes Required: RAPIDS with BlazingSQL mirrors the pandas and SQL interfaces for seamless onboarding (see the sketch below).
• Stateless and Simple: keeping the underlying services stateless reduces complexity and increases extensibility.
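To make the "minimal code changes" point concrete, here is a sketch of the same filter written pandas-style in cuDF and in SQL through BlazingSQL; both return a GPU DataFrame. The table and column names are illustrative:

import cudf
from blazingsql import BlazingContext

gdf = cudf.DataFrame({
    'year': [2004, 2006, 2010],
    'balance': [10.0, 20.0, 30.0],
})

# pandas-style filtering with cuDF
subset_pandas_style = gdf[gdf['year'] > 2005]

# The same filter in SQL: BlazingSQL can query a GDF registered as a table
bc = BlazingContext()
bc.create_table('loans', gdf)
subset_sql = bc.run_query('SELECT * FROM loans WHERE year > 2005')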
Upcoming BlazingSQL Releases
• V0.1 Query GDFs: use the PyBlazing connection to execute SQL queries on GDFs loaded by the cuDF API.
• V0.2 Direct Query Flat Files: integrate the FileSystem API, adding the ability to directly query flat files (Apache Parquet & CSV) inside distributed file systems.
• V0.3 Query String Support: string support and string operation support.
• V0.4 Physical Plan Optimizer: partition culling for WHERE clauses and joins.
• V0.5 Distributed Scheduler: SQL queries are fanned out across multiple GPUs and servers.
Get Started
BlazingSQL is quick to get up and running using either DockerHub or a Conda install.