
HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS


  1. HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS. Mark Brooks, Principal System Engineer @ Kinetica. May 09, 2017

  2. The Challenge: How to maintain analytic performance while dealing with:
     • Larger data volumes
     • Streaming data with minimal end-to-end latency
     • Ad-hoc drill down (you can’t pre-aggregate everything)

  3. Architectural and Design Approaches
     1. One database to rule them all
     2. SQL on Hadoop (or directly on the Data Lake)
     3. Data Lake + NoSQL + Spark + Search + Cache +…
     4. Lambda Architecture
     5. Kappa Architecture
     6. Next generation hardware acceleration

  4. One Database To Rule Them All

  5. SQL on a Data Lake. Credit: https://www.slideshare.net/Bigdatapump/sql-on-hadoop-49494494

  6. Hadoop + NoSQL + Search + Memory Cache +… Credit: Matt Turck - https://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014

  7. Lambda Architecture. Credit: Nathan Marz http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html; James Kinley http://jameskinley.tumblr.com/tagged/Lambda

  8. Lambda Architecture. Credit: James Kinley http://jameskinley.tumblr.com/tagged/Lambda

  9. Kappa Architecture. Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture

  10. Kappa Architecture: "Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast?" Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture
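The idea in the quote above can be illustrated with a minimal sketch. It assumes the retained history is available as a list of log partitions and that the stream job is a simple per-key count; `process_partition`, `replay`, and the sample events are hypothetical names and data used only for illustration, not part of any product or of the original talk.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def process_partition(events):
    """The stream job's logic: count events per key for one log partition."""
    counts = Counter()
    for key, _value in events:
        counts[key] += 1
    return counts


def replay(log_partitions, parallelism):
    """Reprocess the whole retained log by running the same job on every
    partition, with `parallelism` workers running at once."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for partial in pool.map(process_partition, log_partitions):
            total.update(partial)  # merge per-partition counts
    return total


# Hypothetical retained history, split into three partitions.
log_partitions = [
    [("click", 1), ("view", 1)],
    [("click", 1), ("click", 1)],
    [("view", 1)],
]

result = replay(log_partitions, parallelism=3)
print(dict(result))
```

Raising `parallelism` changes only how many partitions are processed at once; the job logic and the result stay the same, which is the point of the Kappa approach.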

  11. Next Generation Hardware Acceleration. Consider a system with these characteristics:
     • Horizontally Scalable
     • Low end-to-end latency
     • Powerful enough to not require pre-aggregation
     This is now possible… Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture

  12. GPU Accelerated Compute
     • 1990 - 2000’s, DATA WAREHOUSE: RDBMS & Data Warehouse technologies enable organizations to store and analyze growing volumes of data on high performance machines, but at high cost.
     • 2005…, DISTRIBUTED STORAGE: Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.
     • 2010…, AFFORDABLE MEMORY: Affordable memory allows for faster data read and write. HANA, MemSQL, & Exadata provide faster analytics.
     • 2017…, GPU ACCELERATED COMPUTE: GPU cores bulk process tasks in parallel, far more efficient for many data-intensive tasks than CPUs, which process those tasks linearly.
     • AT SCALE, PROCESSING BECOMES THE BOTTLENECK

  13. Kinetica: Core. GPU-accelerated analytics database.
     • Columnar in-memory database
     • Data available much like a traditional RDBMS: rows, columns
     • Data held in-memory; persisted to disk
     • Interact with Kinetica through its native REST API, Java, Python, JavaScript, NodeJS, C++, SQL, etc., as well as with various connectors
     • Native GIS & IP address object support
     • VERY FAST: Ideal for OLAP workloads
     • Typical hardware setup: 256GB - 1TB memory with 2-4 GPUs per node
     [Diagram: HTTP head node; columnar in-memory store (A1 B1 C1 … A4 B4 C4); disk; commodity hardware w/ GPUs]
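To make "columnar" concrete, here is a minimal sketch of a column-oriented layout in plain Python. It illustrates the general technique only and says nothing about Kinetica's internal storage format; `to_columns` and the sample rows are hypothetical.

```python
def to_columns(rows):
    """Turn a list of row dicts into one list per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


# Hypothetical table with columns A, B, C, mirroring the A1..C4 grid in the slide.
rows = [
    {"A": 1, "B": 10.0, "C": "x"},
    {"A": 2, "B": 20.0, "C": "y"},
    {"A": 3, "B": 30.0, "C": "x"},
    {"A": 4, "B": 40.0, "C": "y"},
]

columns = to_columns(rows)

# An OLAP-style aggregate only has to scan the one column it needs.
total_b = sum(columns["B"])
print(total_b)
```

Because each column is stored contiguously, an aggregate over `B` never touches `A` or `C`, which is why this layout suits scan-heavy analytic queries.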

  14. Multi-Head Ingest and Scale-Out Architecture [Diagram: three nodes side by side, each with an HTTP head node, a columnar in-memory store, disk, and commodity hardware w/ GPUs; labeled ON-DEMAND SCALE OUT and MULTI-HEAD INGEST]

  15. Real-Time Data Handlers for Structured & Unstructured Data
     • APIs: Java API, C++ API, JavaScript API, Node.js API, REST API, Python API
     • VISUALIZATION via ODBC/JDBC
     • GEOSPATIAL CAPABILITIES: Geometric Objects, Tracks, WMS, WKT, Geospatial Endpoints
     • OPEN SOURCE INTEGRATION: Apache NiFi, Apache Kafka, Apache Spark, Apache Storm
     • OTHER INTEGRATION: Message Queues, ETL Tools, Streaming Tools
     [Diagram: KINETICA CLUSTER of four nodes, each with an HTTP head node, a columnar in-memory store, disk, and commodity hardware w/ GPUs; labeled On-Demand Scale]

  16. Parallel Ingest Provides High Performance Streaming. Each node of the system can share the task of data ingest, which provides more and faster throughput. Ingest can be made faster simply by adding more nodes. No compute is used on ingest! [Diagram: PARALLEL INGEST across three nodes, each 1 NODE (1TB/2GPU)]
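One way to picture ingest shared across nodes is client-side routing of records to nodes by a hash of a key. The slides do not say how records are actually routed, so the hashing scheme below is an assumption, and `shard_for`, `route`, and the record layout are hypothetical.

```python
import zlib


def shard_for(key: str, num_nodes: int) -> int:
    """Pick a node index from a stable hash of the record key (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def route(records, num_nodes):
    """Group records into one batch per node so each node ingests its share."""
    batches = [[] for _ in range(num_nodes)]
    for record in records:
        batches[shard_for(record["id"], num_nodes)].append(record)
    return batches


# Hypothetical event records.
records = [{"id": f"event-{i}", "value": i} for i in range(1000)]

batches = route(records, num_nodes=3)
print([len(batch) for batch in batches])
```

Under this scheme, adding a node only means routing with a larger `num_nodes`; each batch can then be sent to its node independently and in parallel.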

  17. Speed Layer for the Data Lake
     • Parallel ingestion of events
     • Kinetica is speed layer with real-time analytic capabilities
     • HDFS for archival store
     • Much looser coupling than traditional lambda architecture
     • Execute complex analytics on the fly
     • Batch mode Spark or MR jobs can push data to Kinetica as needed for fast query on data loaded from the data lake
     [Diagram labels: EVENTS; MESSAGE BROKERS; STREAM PROCESSING; Amazon Kinesis; Kinetica Connectors; Put, get, scan; Parallel Ingestion; HDFS / AWS S3 / GCS / Azure Data Lake; ANALYSTS; MOBILE USERS; ALERTING SYSTEMS; DASHBOARDS & APPLICATIONS]

  18. Real-Time, Advanced Analytics, Speed Layer for Teradata or Oracle
     • Parallel ingestion of events
     • Lambda-type architecture for Teradata or Oracle
     • Kinetica is speed layer with near-real-time analytic capabilities
     • Converge Machine Learning, streaming and location analytics and fast Query and Analytics with Kinetica and RDBMS
     [Diagram labels: Fast GPU accelerated, in-Memory Database; Converge ML, AI, Streaming; DATA IN MOTION AND REST; Amazon Kinesis; Kinetica Connectors; STREAM / ETL PROCESSING; DATA WAREHOUSE / TRANSACTIONAL; ANALYSTS; MOBILE USERS; ALERTING SYSTEMS; DASHBOARDS & APPLICATIONS]

  19. Advanced In-Database Analytics: ORCHESTRATION LAYER WITH USER-DEFINED FUNCTIONS (UDFs)
     1. User-defined functions (UDFs) can receive table data, do arbitrary computations, and save output to a separate table in a distributed manner.
     2. UDFs have direct access to CUDA APIs, which enables compute-to-grid analytics for logic deployed within Kinetica.
     3. Works with custom code, or packaged code. Opens the way for machine learning/artificial intelligence libraries such as TensorFlow, BIDMach, Caffe and Torch to work on data directly within Kinetica.
     4. Available now with C++ & Java bindings.
     [Diagram: PHYSICAL / VIRTUAL SERVER with Table A, Table B, Table C feeding a Proc Server (/exec/proc/ UDF_A, UDF_B, UDF_n), CUDA Libraries, and GPU; data returned to output table (Table n) for further analysis; UDFs exposed from RESTful endpoint; n number of Kinetica servers]
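The first point above (receive table data, compute, write to a separate table, in a distributed manner) can be sketched conceptually in plain Python. This is not the Kinetica UDF API, whose bindings the slide lists as C++ and Java; `run_udf`, `add_total`, and the sharded sample table are hypothetical and only model the data flow.

```python
def add_total(row):
    """Example user-defined function: derive a new column from existing ones."""
    return {"order_id": row["order_id"], "total": row["price"] * row["qty"]}


def run_udf(udf, input_shards):
    """Apply the UDF to every row of every shard; each shard's results form
    the matching shard of a separate output table."""
    return [[udf(row) for row in shard] for shard in input_shards]


# Hypothetical input table, already split into one shard per server.
input_shards = [
    [{"order_id": 1, "price": 5, "qty": 2}, {"order_id": 2, "price": 3, "qty": 4}],
    [{"order_id": 3, "price": 10, "qty": 1}],
]

output_shards = run_udf(add_total, input_shards)
print(output_shards)
```

In a real deployment each shard would be processed on the server that holds it, so the data does not have to leave the database to be transformed.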

  20. Kinetica Architecture [Diagram labels: STREAMING DATA; TRANSACTIONAL DATA (ERP / CRM); ETL / STREAM PROCESSING; Connectors; PARALLEL INGEST; ON DEMAND SCALE OUT; 1TB MEM / 2 GPU CARDS; In-Database Processing (Custom UDFs, ML Libs, BIDMach); CUSTOM LOGIC; Native APIs; SQL; Geospatial WMS; KINETICA ‘REVEAL’; BI / GIS / APPS; BI DASHBOARDS; CUSTOM APPS & GEOSPATIAL]

  21. AI & BI on One GPU-Accelerated Database [Diagram labels: BUSINESS USERS; BUSINESS INTELLIGENCE; CUSTOM APPLICATIONS; HIGH FIDELITY GEOSPATIAL PIPELINE; ODBC / JDBC; Native REST API; WMS; SQL; HIGH PERFORMANCE ANALYTICS DATABASE; UDF; BIDMach; DATA SCIENTISTS / DEVELOPERS; GPU-ACCELERATED DATA SCIENCE; MACHINE LEARNING & DEEP LEARNING; PREDICTIVE MODELS, e.g. Risk Management, Sales Volume, Fraud]

  22. 50-100x Faster on Queries with Large Datasets (WHEN COMPARED TO LEADING IN-MEMORY ALTERNATIVES)
     • Large retailer tested complex SQL queries on 3 years of retail data (150bn rows)
     • 10 node Kinetica cluster against 30TB+ cluster from next best alternative
     • GPU is able to perform many instructions in parallel. Huge performance gains on aggregations, group bys, joins, etc.
     • Kinetica sustained ingest of 1.3bn objects/minute with 70 attributes per row
     [Bar chart: Kinetica vs. Leading In-Memory DB for SELECT (Q10), GROUP BY (Q5), and SUM (Q1); axis from 0 to 50]
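For a sense of scale, the figures quoted above can be converted into per-second and per-node rates. The arithmetic below uses only the numbers on the slide; the even split of rows across nodes is an assumption made for illustration.

```python
rows_per_minute = 1_300_000_000   # sustained ingest, from the slide
attributes_per_row = 70           # from the slide
total_rows = 150_000_000_000      # 3 years of retail data, from the slide
nodes = 10                        # Kinetica cluster size, from the slide

rows_per_second = rows_per_minute / 60
values_per_minute = rows_per_minute * attributes_per_row
rows_per_node = total_rows // nodes   # assumes an even split across nodes

print(f"{rows_per_second:,.0f} rows/second")
print(f"{values_per_minute:,} attribute values/minute")
print(f"{rows_per_node:,} rows/node")
```

That works out to roughly 21.7 million rows per second, 91 billion attribute values per minute, and 15 billion rows per node.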

  23. Distributed Geospatial Pipeline: NATIVE VISUALIZATION IS DESIGNED FOR FAST MOVING, LOCATION-BASED DATA
     Native Geospatial Object Types
     • Points, Shapes, Tracks, Labels
     Native Geospatial Functions
     • Filters (by area, by series, by geometry, etc.)
     • Aggregation (histograms)
     • Geofencing - triggers
     • Video generation (based on dates/times)
     Generate Map Overlay Imagery (via WMS)
     • Rasterize points
     • Style based on attributes (class-break)
     • Heat maps

  24. Full-Text Search. Kinetica includes powerful text search functionality, including:
     • Exact Phrases
     • Boolean – AND / OR
     • Wildcards
     • Grouping
     • Fuzzy Search (Damerau-Levenshtein optimal string alignment algorithm)
     • N-Gram Term Proximity Search
     • Term Boosting Relevance Prioritization
     Example queries shown: "Rain Tire"~5, "Union Tranquility"~10, [100 TO 200]
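The fuzzy-search bullet names the optimal string alignment variant of Damerau-Levenshtein distance. A textbook implementation is sketched below to show what that distance measures; it is not Kinetica's code, and `osa_distance` is a hypothetical name.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: the minimum number of insertions,
    deletions, substitutions, and adjacent transpositions, where no substring
    is edited more than once."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "re" <-> "er".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("tire", "tier"))
print(osa_distance("kitten", "sitting"))
```

A swapped pair such as "tire" vs. "tier" counts as one edit rather than two, which is what makes this metric useful for matching typos.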
