GOAI: ONE YEAR LATER
Joshua Patterson, Director of AI Infrastructure
3/27/18 | @datametrician
THE WORLD WE ANALYZE
Realities of Data
IN A FINITE CRISIS
CPU Performance Has Plateaued
[Chart, 1980-2020: transistor counts (thousands) vs. single-threaded performance; single-threaded performance once grew ~1.5X per year but now improves only ~1.1X per year]
IN A FINITE CRISIS
GPU Performance Grows
[Chart, 1980-2020: GPU-computing performance growing ~1.5X per year, headed for 1000X by 2025, while single-threaded CPU performance improves only ~1.1X per year]
IN A FINITE CRISIS
CPU Performance Has Plateaued
[Chart, 2008-2017: peak double-precision TFLOPS, NVIDIA GPU vs. x86 CPU]
PRE-GOAI: 52 WEEKS LATER
Fast Was Made Slow
[Diagram: each application (Continuum, Gunrock, H2O.ai, Graphistry, BlazingDB, MapD) reads or loads data and then copies and converts it between CPU and GPU memory before the next application can use it]
GPU LEADERS UNITE
Could We Do Better Than Big Data?
TRADITIONAL DATA SCIENCE ON GPUS
Lots of glue code, plagued by copy-and-converts
[Diagram comparing pipelines:]
• Hadoop processing, reading from disk: HDFS read/write between every stage (Query, ETL, ML Train)
• Spark in-memory processing: 25-100x improvement, less code, language flexible, primarily in-memory
• GPU/Spark in-memory processing: 5-10x improvement, more code, language rigid, substantially on GPU, but with CPU-GPU copies between Query, ETL, and ML Train
• Each system has a different internal memory format with mostly overlapping functionality
• Depending on the workflow, 80+% of time and computation is wasted on the serialization, deserialization, and copying of data
PRE-GOAI: 52 WEEKS LATER
What We Want
[Diagram: the applications (Continuum, Gunrock, H2O.ai, Graphistry, BlazingDB, MapD) read or load data once, then share a single buffer in GPU memory instead of copying back and forth between CPU and GPU]
GOAI AND THE BIG DATA ECOSYSTEM
No copy-and-converts on the GPU, compatible with Apache Arrow
[Diagram: Continuum, Gunrock, H2O.ai, Graphistry, Simantex, nvGRAPH, BlazingDB, and MapD all operating on a shared GPU Data Frame, with full interoperability and no copy-and-converts]
• All systems use the same memory format, so the overhead of cross-system communication is minimized and projects can share features and functionality
• Most, if not all, of the GPU Data Frame functionality is going back into Apache Arrow
• Currently three GPU Data Frame libraries: libgdf (C library), pygdf (Python library), and dask_gdf (multi-GPU, multi-node Python library); a short pygdf sketch follows
github.com/apache/arrow | github.com/gpuopenanalytics
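For orientation, a minimal sketch of the Python layer. It assumes pygdf exposes a DataFrame constructor with NumPy-backed column assignment and a to_pandas() method; the import path, column names, and data below are illustrative, not from the deck.

```python
import numpy as np
import pygdf  # GOAI's Python GPU DataFrame library (DataFrame assumed at package top level)

# Build a GPU DataFrame; columns live in GPU memory in the shared
# GDF/Arrow-compatible layout the GOAI projects interoperate on.
gdf = pygdf.DataFrame()
gdf['user_id'] = np.arange(1000000, dtype=np.int32)
gdf['amount'] = np.random.rand(1000000).astype(np.float32)

# Other GOAI tools can consume the same GPU-resident columns without a
# copy/convert hop through host memory; to_pandas() is used here only
# to inspect a few rows on the CPU.
print(gdf.to_pandas().head())
```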
DATA SCIENCE ON GPUS WITH GOAI + GDF
Faster Data Access, Less Data Movement
[Diagram comparing pipelines:]
• Hadoop processing, reading from disk: HDFS read/write between every stage (Query, ETL, ML Train)
• Spark in-memory processing: 25-100x improvement, less code, language flexible, primarily in-memory
• GPU/Spark in-memory processing: 5-10x improvement, more code, language rigid, substantially on GPU, with CPU-GPU copies between stages
• End-to-end GPU processing (GOAI): 10-25x improvement, same code, language flexible, data shared via Arrow, primarily on GPU (read once, then Query, ETL, ML Train)
ANACONDA – PyGDF & DaskGDF
Moving From Traditional Workflows
[Diagram: Database -> ETL -> Data Frame -> Arrays / Sparse Matrix -> Model]
• Data originates from a database; nearly all data curation (joins, group-bys, unions, etc.) happens within the database
• The database has already dealt with providing structure to the data and contains nearly all usable data for ML; its output is a data frame, and additional ETL is minimal
• Manipulation of columns (encoding, variable creation) occurs on the data frame
• The data structure is then converted from a data frame to a matrix or arrays
• Training with many algorithms follows, to find the most accurate method
PYGDF & DaskGDF UDFs
Python -> GPU Acceleration
• Write a custom function in Python that gets JIT-compiled into a GPU kernel by Numba
• Functions can be applied by row, column, or groupby group (a minimal sketch follows below)
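A minimal sketch of such a UDF, assuming pygdf's apply_rows takes input column names, output column names with dtypes, and a kwargs dict, and JIT-compiles the function with Numba; the column names and data are made up, and the exact signature may differ by version.

```python
import numpy as np
import pygdf  # DataFrame assumed at the package top level

# Small GPU DataFrame built from host arrays (illustrative data).
df = pygdf.DataFrame()
df['price'] = np.random.rand(1000).astype(np.float32)
df['qty'] = np.random.randint(1, 10, size=1000).astype(np.int32)

# A plain Python function; pygdf/Numba is expected to JIT-compile it
# into a GPU kernel and launch it across the rows.
def revenue_kernel(price, qty, out):
    for i, (p, q) in enumerate(zip(price, qty)):
        out[i] = p * q

# apply_rows (assumed signature) returns a new GPU DataFrame containing
# the input columns plus the generated 'out' column.
result = df.apply_rows(revenue_kernel,
                       incols=['price', 'qty'],
                       outcols={'out': np.float32},
                       kwargs={})
print(result.to_pandas().head())
```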
ANACONDA – PyGDF & DaskGDF
To Complex Workflows
[Diagram: data lakes, databases of many types, and streams each feed ETL into data frames, then arrays / sparse matrices, then models, with a feedback loop from modeling back to ETL]
• Data originates wherever developers can find it and is stored in many formats; with many groups using data in different ways, it is stored for maximum usability, which pushes more manipulation into the ETL functions
• The output of all these sources is a set of data frames with varying structure
• The ETL process is in charge of moving data into one usable format (from CSV, XML, JSON, database formats, Hadoop formats, etc.), performing data curation on all the disparate data, creating subsets of the data for different modeling targets, and the rest of traditional ETL
• Training with many algorithms occurs, then a feedback loop back to ETL: if a subset of the data is the root cause of accuracy issues, new subsets are formed for new algorithmic approaches
PYGDF NEW JOINS
Faster Join Support Coming

TPC-H Query 21 – End-to-End Results Using 32-bit Keys*
  Time (ms)               SF1     SF10     SF100
  CPU (single-threaded)   1329    31731    465064
  V100 (PCIe3)            22      164      1521     (~300x vs. CPU)
  V100 (3x NVLink2)       12      45       466      (~3.2x vs. PCIe3)

TPC-H Query 4 – End-to-End Results Using 32-bit Keys*
  Time (ms)               SF1     SF10     SF100
  CPU (single-threaded)   150     2041     24960
  V100 (PCIe3)            13      105      946      (~26x vs. CPU)
  V100 (3x NVLink2)       7       23       308      (~3.1x vs. PCIe3)

*Assuming the input tables are loaded and pinned in system memory
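For orientation only, a hedged sketch of driving such a join from Python, assuming pygdf offers a pandas-like merge method backed by the libgdf GPU join; the method name, arguments, and data are illustrative and may differ from the version benchmarked above.

```python
import numpy as np
import pygdf  # GPU DataFrame library; merge assumed pandas-like

# Two GPU DataFrames sharing a 32-bit key column (made-up data).
orders = pygdf.DataFrame()
orders['custkey'] = np.random.randint(0, 100000, size=1000000).astype(np.int32)
orders['price'] = np.random.rand(1000000).astype(np.float32)

customers = pygdf.DataFrame()
customers['custkey'] = np.arange(100000, dtype=np.int32)
customers['nation'] = np.random.randint(0, 25, size=100000).astype(np.int32)

# The join is expected to execute on the GPU, with the result staying
# in GPU memory for downstream GOAI steps.
joined = orders.merge(customers, on=['custkey'], how='left')
print(len(joined))
```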
PYGDF NEW JOINS
Takeaways
• GPU memory capacity is not a limiting factor
• GPU query performance is up to 2-3 orders of magnitude better than CPU
• GPU query performance is dominated by CPU-GPU interconnect throughput
• NVLink systems show 3x better end-to-end query performance compared to PCIe
Thanks Nikolay Sakharnykh! S8417 – Breaking the Speed of Interconnect with Compression for Database Applications – Tuesday, Mar 27, 2:00pm, Room 210F
BLAZINGDB JOINS GOAI
Scale-Out Data Warehousing
BLAZINGDB
Scale-Out Data Warehousing
Compression/decompression on the GPU to improve throughput
[Diagram: data lake (HDD / SSD, with RAM and disk caches) holding schema, metadata, and data; Parquet in, Arrow/GDF out]
• Compression/decompression
• Filtering (predicate pushdown)
• Aggregations
• Transformations
• Joins
• Sorting/ordering
Same system, more interoperability: Parquet in, Arrow/GDF out
BLAZINGDB
The Future – Coming Soon
[Diagram: a common data layer spanning storage (the data lake), Arrow-based ingest, and the GDF]
H2O.AI
ML on GPU: First 3 Algorithms
• GLM
• K-Means
• GBM (XGBoost)
H2O.AI
ML on GPU: 2 New Algorithms
• tSVD
• PCA
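A hedged sketch of calling these solvers from Python, assuming h2o4gpu's scikit-learn-style API in which estimators such as KMeans and PCA are exposed at the package top level; class names, defaults, and the data are illustrative.

```python
import numpy as np
import h2o4gpu  # H2O.ai GPU ML library; scikit-learn-compatible API assumed

# Made-up feature matrix: 100k rows, 16 features.
X = np.random.rand(100000, 16).astype(np.float32)

# K-Means fitted on the GPU (one of the first three GPU algorithms).
kmeans = h2o4gpu.KMeans(n_clusters=8)
kmeans.fit(X)
print(kmeans.cluster_centers_.shape)

# PCA (one of the two new GPU algorithms), same sklearn-style calls.
pca = h2o4gpu.PCA(n_components=4)
reduced = pca.fit_transform(X)
print(reduced.shape)
```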
XGBOOST
Faster, More Scalable, & Better Inferencing
• Scalability increased from 16GB to 100GB datasets on DGX-1
• Performance improvements not only on a single GPU, but in multi-GPU scaling
• New GBDT inference library
Thanks Andrey Adinets, Vinay Deshpande, and Thejaswi Nanditale!
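As a hedged example of the GPU training path: XGBoost's gpu_hist tree method enables histogram-based tree construction on the GPU. Parameter spellings vary across releases, and the data here is made up.

```python
import numpy as np
import xgboost as xgb

# Made-up regression data.
X = np.random.rand(200000, 50).astype(np.float32)
y = np.random.rand(200000).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'tree_method': 'gpu_hist',  # GPU histogram-based tree construction
    'max_depth': 8,
    'objective': 'reg:linear',  # objective naming of that era
}
# Multi-GPU scaling and the GBDT inference library are configured
# separately and are not shown in this sketch.
booster = xgb.train(params, dtrain, num_boost_round=100)

preds = booster.predict(dtrain)
print(preds[:5])
```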
NVGRAPH
Arrow to Graph
• Ported nvGRAPH to run natively on the GPU DataFrame, so two columns can be used as sources and destinations to define an unweighted graph
• Breadth-First Search, Jaccard Similarity, and PageRank with Python bindings
• Developing Hornet integration for GOAI, as well as Gunrock
• 1x P100 is 2-3 orders of magnitude faster than an i7-3930K running the NetworkX Python library
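For context on the baseline, a minimal CPU PageRank with the NetworkX Python library is shown below; in the GOAI path, the same edge list would instead stay in two GPU DataFrame columns (source, destination) passed to nvGRAPH's Python bindings. The edge data is made up.

```python
import numpy as np
import networkx as nx

# Made-up edge list; with GOAI, 'src' and 'dst' would be two columns of
# a GPU DataFrame handed to nvGRAPH without leaving GPU memory.
src = np.random.randint(0, 10000, size=100000)
dst = np.random.randint(0, 10000, size=100000)

# CPU baseline: build the graph and run PageRank in NetworkX.
G = nx.DiGraph()
G.add_edges_from(zip(src.tolist(), dst.tolist()))
ranks = nx.pagerank(G, alpha=0.85)
print(max(ranks, key=ranks.get))
```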
IBM SNAPML
More Proof of ML on GPU
MAPD
In-Memory GPU Database: 100x Faster Queries + Speed-of-Thought Visualization
• MapD Core – a fast, relational, column-store database powered by GPUs
• MapD Immerse – a visual analytics engine that leverages the speed and rendering capabilities of MapD Core
MAPD
Improvements Since GTC17
MapD Immerse:
• Multi-source dashboards
• Multi-layer GeoCharts
• Auto-refresh for streaming data
• Charting: combo chart, multi-measure line chart, stacked bar
MapD Core:
• Performance: joins, string-literal comparisons
• Data ingestion: read from Kafka, compressed files, S3
• Major rendering performance improvements: O(1-10MM) polygons in ~ms
• Arrow: improved GPU memory management, pymapd with bi-directional Arrow-based ingest (sketch below)
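A hedged sketch of that pymapd path, assuming a local MapD Core instance, default credentials, and a made-up table; select_ipc returns results to pandas via Arrow, while select_ipc_gpu is assumed to return a GPU DataFrame (pygdf) that stays in GPU memory.

```python
import pymapd  # Python client for MapD Core with Arrow-based transport

# Placeholder connection settings for a local MapD Core instance.
con = pymapd.connect(user='mapd', password='HyperInteractive',
                     host='localhost', dbname='mapd')

# 'flights' is an illustrative table name, not from the deck.
query = "SELECT origin_state, COUNT(*) AS n FROM flights GROUP BY origin_state"

df_cpu = con.select_ipc(query)    # Arrow -> pandas DataFrame on the CPU
gdf = con.select_ipc_gpu(query)   # Arrow -> GPU DataFrame, stays on the GPU
print(df_cpu.head())
```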
MAPD
MapD + Presto GPU Database Performance
[Bar chart: query time in seconds for Presto on JSON, Presto on Parquet, MapD alone (~0.1 s), and Presto + MapD (~1.2 s) over 10, 30, and 60 minutes of streaming data]
• 8-GPU MapD alone is up to 40x faster than a dual 20-core CPU at inferencing on streaming data
• Faux multi-node MapD + Presto is being developed
https://github.com/NVIDIA/presto-mapd-connector
MAPD
Dashboard Comparison vs. Kibana
MAPD
MapD Immerse vs. Elastic Kibana
[Chart: time to fully load a dashboard (seconds) vs. days of data (1-31); MapD Immerse on DGX and on an AWS P2 stays under roughly 9-12 seconds, while Elastic Kibana climbs toward 200-300 seconds]
GRAPHISTRY
Accelerated visual graph analytics and investigation platform
GRAPHISTRY
Improvements Since GTC17
[Diagram: CSV, JSON, etc. -> Arrow.js -> GPU Data Frame]
https://www.npmjs.com/package/arrow
TESLA V100 32GB
World's Most Advanced Data Center GPU, With 2X the Memory
• 5,120 CUDA cores
• 640 new Tensor Cores
• 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
• 20MB SM register file | 16MB cache
• 32GB HBM2 @ 900 GB/s | 300 GB/s NVLink