GPU OPEN ANALYTICS INITIATIVE: END-TO-END ACCELERATED ANALYTICS (Brad Rees)


  1. GPU OPEN ANALYTICS INITIATIVE: END-TO-END ACCELERATED ANALYTICS
     Brad Rees, Ph.D., Senior Solution Architect, NVIDIA (The AI Computing Company)
     GTC DC, November 2017

  2. AGENDA – TWO PARTS: Discuss Analysis from the Perspective of Data Science
     • Part 1
       • Big Data and Spark
       • GPU Barriers
     • Part 2
       • GOAI
     “Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data …” - Wikipedia
     • Better Exploration ∝ Better Science
     • Faster Analytics yield better Exploration
     • Fail Fast Needs to be Embraced
     “I have not failed. I've just found 10,000 ways that won't work.” - Thomas A. Edison

  3. The Big Data Catalyst: The Glue that Binds Big Data
     • Spark has become synonymous with Hadoop and Big Data
     • It’s the interface/API for big data app-to-app communication
     • The processing layer for big data and leading ML framework

  4. SPARK IS NOT ENOUGH: We Want More Efficiency and Speed
     • Common issue is speed at scale
     • Scaling out to get the necessary speed for mission-critical workloads is prohibitively expensive
     • Clients want core ML on GPU
     Commercial, Government, HPC
     We need a GPU-equivalent to Spark … But there are some Barriers

  5. GPU ADOPTION BARRIERS
     • Too much data movement
     • Too many makeshift data formats
     • No inter-GPU communication
     • No Python API for data manipulation
     • No all-inclusive Machine Learning Library
     Concerns:
     • Too Hard to Integrate GPUs
     • Not suited for Data Science

  6. DATA MOVEMENT AND TRANSFORMATION: The bane of productivity / performance
     • Too much time spent moving data
     • Data movement and conversion hinder any performance gains
     • No Inter-GPU Communication

  7. DATA FORMATS
     Parquet, GML, CSV, Pandas, Avro, HDFS, XML, NumPy, JSON, Pickle, CSC, ProtoBuf, CSR, COO
     Plain Text vs Binary
     Compressed vs Uncompressed
     * Not a complete list

  8. ARE THE GPU BARRIERS TOO GREAT? Is there any hope?
     ☹️ Data movement
     ☹️ Data formats
     ☹️ Inter-GPU communication
     ☹️ No Python API for data manipulation
     ☹️ No all-inclusive Machine Learning Library

  9. GPU OPEN ANALYTICS INITIATIVE: Luckily others were also thinking about the problems
     • Formed in March at Strata SJ; Launched at GTC in May
     • Goal: GOAI seeks to foster open collaboration between GPU analytics projects and products to enable data scientists to efficiently combine the best tools for their workflows.

  10. ACCELERATED ANALYTICS ECOSYSTEM: Prior State (pre-March 2017)
     ● Fragmented with too many holes
     ● Still too reliant on CPU for moving data between applications
     ● 80-90% of data science is accelerated analytics, not deep learning yet
     Diagram layers:
     • INTERACTION: Graphistry, Jupyter NB, MapD Immerse
     • PROCESSING AND ANALYTICS IN GPU MEMORY (Data Manipulation): MapD Fast Data (Streaming), BlazingDB (“SQL”), NV Graph, Anaconda * (Dask “Python”)
     • DATA STRUCTURE: Many Columnar Data Frames (everyone has their own makeshift data frame)
     • STORAGE: MapD (GPU RAM), BlazingDB (Disk)
     Key: Open Source, Free to Use, Closed Source
     * Primarily x86 w/ some GPU acceleration

  11. ACCELERATED ANALYTICS ECOSYSTEM: Post-March 2017
     Diagram layers:
     • INTERACTION: Graphistry, Jupyter NB, MapD Immerse
     • PROCESSING AND ANALYTICS IN GPU MEMORY (Data Manipulation): MapD Fast Data (Streaming), BlazingDB (“SQL”), NV Graph, Anaconda (Dask “Python”), H2O (Data.Table “R”), H2O.ai (GPU MLlib)
     • DATA STRUCTURE: Standard Columnar Data Frame (Open Sourced/Free to Use from MapD)
     • STORAGE: MapD (GPU RAM), BlazingDB (Disk), MapD + BlazingDB (System Memory)
     Key: Open Source, Free to Use, Closed Source

  12. LEARNING FROM APACHE ARROW: Interoperability
     The Big Data ecosystem is facing similar issues: there is a major push in the big data world to remove the bottlenecks of copying and converting data between systems.
     • Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations.
     • Columnar layout is optimized for data locality for better performance on modern hardware like CPUs and GPUs.
     • The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.
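
The columnar layout and zero-copy reads described on this slide can be illustrated with a minimal sketch. It uses only the Python standard library (`array` and `memoryview`), not Apache Arrow itself, and the sample rows and column names are hypothetical. The point it shows is that a column stored as one contiguous typed buffer can be shared and sliced without copying or serializing.

```python
from array import array

# Row-oriented layout: one tuple per record (hypothetical sample data).
rows = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]

# Column-oriented layout: one contiguous typed buffer per column.
ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
fares = array("d", (r[1] for r in rows))   # 64-bit floats

# A memoryview exposes the column's buffer without copying it.
view = memoryview(fares)
window = view[1:3]   # slicing a memoryview is also zero-copy

# A write to the underlying buffer is visible through the view,
# which shows that no separate copy was made.
fares[1] = 25.0
total = sum(window)  # reads the two shared values at positions 1 and 2
```

A consumer that receives `window` reads the producer's memory directly, which is the property a shared in-memory format aims to provide between applications.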

  13. THE GPU DATA FRAME: First GOAI Project
     ✓ Data movement
     ✓ Data formats
     ✓ Inter-GPU communication
     ✓ Python API
     ✓ Machine Learning Library
     So … what does this get me?

  14. SEAMLESS CALLS BETWEEN APPLICATIONS: What does GOAI get me? Big improvement for Data Science
     • Load data into MapD
     • Call an H2O ML algorithm
     • All via Anaconda Python
     • Within a Jupyter Notebook
     Demos available on GOAI GitHub

  15. SEAMLESS CALLS BETWEEN APPLICATIONS: What does GOAI get me? Big improvement for Data Science
     • Load data into MapD
     • Call an H2O ML algorithm
     • All via Anaconda Python
     • Within a Jupyter Notebook
     pygdf: Python library for manipulating GDFs
     • Creating GDFs from NumPy arrays and Pandas DataFrames
     • Performing math operations on columns
     • Import/export via CUDA IPC
     • Sort, join, reductions
     • JIT compilation of group by and filter kernels using Numba
     Demos available on GOAI GitHub

  16. SIMPLE DATA CONVERSION: Convert from Pandas and NumPy
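
The slide's own conversion example is not preserved in this transcript. As a rough stand-in, the sketch below shows the general idea of converting row records into typed columns and back, using only the Python standard library. It is not the pygdf API; the record fields, the `schema` mapping, and the `to_columns` / `to_rows` helpers are all hypothetical names chosen for illustration.

```python
from array import array

# Hypothetical row-oriented records, standing in for a small table.
records = [
    {"trip_id": 1, "fare": 7.5},
    {"trip_id": 2, "fare": 12.0},
    {"trip_id": 3, "fare": 3.25},
]

# Assumed column types: array type codes for int64 and float64.
schema = {"trip_id": "q", "fare": "d"}

def to_columns(rows, schema):
    """Build one contiguous typed array per column from row records."""
    return {name: array(code, (row[name] for row in rows))
            for name, code in schema.items()}

def to_rows(columns):
    """Rebuild row records from equally sized columns."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

columns = to_columns(records, schema)
round_trip = to_rows(columns)
```

Keeping each column in a single typed buffer is what makes it practical to hand the data to another library or device without a per-row conversion step.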

  17. Several Examples Available on GOAI GitHub

  18. GOAL OF GOAI: Better Adoption with Better Usability and TCO
     • Hadoop Processing, Reading from disk
       Pipeline: HDFS Read → SQL Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → Train
     • Spark In-Memory Processing
       25-100x Improvement; Less code; Language flexible; Primarily In-Memory
       Large TCO benefit over Hadoop; Large Adoption
       Pipeline: HDFS Read → SQL Query → ETL → ML Train
     • GPU + Spark In-Memory Processing
       5-10x Improvement; More code; Language rigid; Substantially on GPU
       Small TCO benefit over Spark; Small Adoption
       Pipeline: HDFS Read → GPU Read → SQL Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
     • End-to-End GPU Processing (GOAI)
       25-100x Improvement; Same code; Language flexible; Primarily on GPU
       Large TCO benefit over Spark; Large Adoption?
       Pipeline: Arrow Read → SQL Query → ETL → ML Train

  19. INITIAL LIBRARIES: GPU Data Frame (github.com/gpuopenanalytics)
     • libgdf: C library of helper functions:
       • Copying GDF metadata block to the host and parsing it to a host-side struct
       • Importing/exporting via CUDA IPC
       • CUDA kernels to perform element-wise math operations on GDF columns
       • CUDA sort, join, and reduction operations on GDFs
     • pygdf: Python library for manipulating GDFs
       • Creating GDFs from NumPy arrays and Pandas DataFrames
       • Performing math operations on columns
       • Import/export via CUDA IPC
       • Sort, join, reductions
       • JIT compilation of group by and filter kernels using Numba
     • dask_gdf: Extension for Dask to work with distributed GDFs
       • Same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers
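
The idea of running the same operation on a data frame chunked across devices can be sketched in plain Python. This is not dask_gdf or pygdf code and runs on the CPU only. The `partition`, `partial_groupby_sum`, and `combine` helpers and the sample columns are hypothetical, and each chunk stands in for the part of a column that one GPU would hold.

```python
from collections import defaultdict

def partition(keys, values, n_chunks):
    """Split two equally sized columns into up to n_chunks contiguous chunks."""
    size = max(1, -(-len(keys) // n_chunks))  # ceiling division, at least 1
    return [(keys[i:i + size], values[i:i + size])
            for i in range(0, len(keys), size)]

def partial_groupby_sum(keys, values):
    """Group-by sum over one chunk (the work a single device would do)."""
    sums = defaultdict(float)
    for k, v in zip(keys, values):
        sums[k] += v
    return dict(sums)

def combine(partials):
    """Merge per-chunk results into the final group-by sum."""
    total = defaultdict(float)
    for part in partials:
        for k, v in part.items():
            total[k] += v
    return dict(total)

# Hypothetical columns: a key column and a value column.
keys = ["a", "b", "a", "c", "b", "a"]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

chunks = partition(keys, values, n_chunks=3)
result = combine(partial_groupby_sum(k, v) for k, v in chunks)
```

Because the sum is associative, the combined result matches a single-pass group-by over the whole column, which is why this kind of reduction splits cleanly across chunks.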

  20. ABOUT
     • ~8.5x speedup on half a DGX to produce a robust GLM via 10-fold cross-validation vs an 8 node Spark cluster
     • Python on GPU... via Numba and Pandas
     • ~100x speedup using MapD on half a DGX to analyze census data vs a 20 node Spark cluster
     • ~5X faster than Redshift to utilize full disk storage and system memory
     • >50x speedup in performing pagerank on a graph on half a DGX vs an 8 node Spark cluster
     • ~100x more cyber security data interactively visualized using an intuitive layout algorithm on a single GPU as a connected graph

  21. MapD: GPU-accelerated analytics platform
     Consists of MapD Core database and MapD Immerse
     • MapD Core database is an in-GPU-memory, columnar, open-source, GPU-accelerated SQL database.
     • MapD Enterprise brings distributed and high availability modes, GPU-accelerated backend rendering, Kerberos/LDAP security, and ODBC/JDBC.
     • MapD Immerse is a visual analytics platform on top of the MapD Core database that allows data scientists and analysts to interactively explore large datasets.

  22. 1.1 BILLION TAXI RIDES BENCHMARK
     • GPU memory based databases are 8x to 15x faster than CPU in-memory databases such as Redshift.
     • 100x to 485x faster than Spark on 11 servers
     • Open Source core DBMS
     • Free Community Edition
     [Figure: bar chart of Time in Milliseconds (axis 0 to 5000) for Query 1 to Query 4 on MapD DGX-1, Kinetica DGX-1, Redshift 6-node, and Spark 11-node. Labels above the axis limit: 8134, 19624, 85942, 10190. Other bar labels: 2970, 2250, 1560, 1250, 1209, 795, 596, 518, 372, 150, 80, 21.]
     Source: MapD Benchmarks on DGX-1 from internal NVIDIA testing following guidelines of Mark Litwintschik’s (@marklit82) blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS

  23. BlazingDB: GPU-accelerated petabyte scale data warehouse
     Consists of BlazingDB database
     • BlazingDB database is a disk-based, columnar, GPU-accelerated SQL database.
     • BlazingDB has distributed and high availability modes, JDBC, and Python/C# APIs.
     • BlazingDB offers a Community Edition that can be downloaded for free and has an Enterprise Edition that you can launch today on AWS.

  24. BlazingDB: high performance SQL on petabyte scale
     • Blazing speedup
     • BlazingDB SQL is built on a columnar relational data model.
     • Enterprise grade security through Spring Security
     • BlazingDB distributes both data and computation to multiple instances, for more data or faster query speeds
     • https://blazingdb.com/
