Apache Ignite as MPP Accelerator



  1. Apache Ignite as MPP Accelerator Alexander Ermakov, CTO

  2. Agenda • About us • Why does a traditional DWH need an in-memory grid? • Real Time Analytics for Telco Cases • Integrating Apache Ignite with Arenadata DB • Using the power of in-memory computing with MPP (Example)

  3. <About us>

  4. Who are we? • Arenadata unites a keen team of developers & engineers working on building an enterprise data platform. • We are contributors to Open Source projects: • Greenplum • Apache PXF • Apache Bigtop • Members of ODPi (Linux Foundation) since 2015

  5. ODPi Compliant Platforms

  6. Arenadata Enterprise Data Platform Platform Extension Framework

  7. Arenadata - Open Source store.arenadata.io

  8. Our Partners

  9. Why does a DWH need an in-memory grid?

  10. New Generation of Business Cases
      • Reading smart meters every 15 minutes is 3000x more data intensive
      • Facebook uploads 250 million photos each day
      • Oil rigs generate 25000 data points per second
      • Cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
      • Data sources: Mobile Sensors, Video Surveillance, Social Media, Smart Grids, Medical Imaging, Oil Exploration, Stock Market, Gene Sequencing

  11. Data Value Chain • Time scale: ms, seconds, hours, weeks, months, year, years+

  12. Data Warehouse • Stages: Sources, Transport, Transform, Store, Analyze • Diagram labels: DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, CDC, OLTP, SP, Table

  13. Data Lake • Stages: Sources, Transport, Transform, Store, Analyze • Diagram labels: Queue, DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, …, CDC, OLTP, Hadoop, SP, SQL On HDFS, Table

  14. Lambda Architecture • Stages: Sources, Transport, Transform, Store, Analyze • Diagram labels: Real Time STG App, Batch App, Queue, DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, …, CDC, Hadoop, SQL On HDFS

  15. Kappa Architecture • Stages: Sources, Transport, Transform, Store, Analyze • Diagram labels: Real Time STG App, Batch App, Queue, BI, …

  16. Real Time Analytics for Telco Cases

  17. Customer Retention / Connection Breakdowns

  18. Geo Marketing

  19. Migrating from a Reactive, Static and Constrained Model… • Diagram: Ingest, Store, Analytics, Data Lake HDFS • Coding based • Hard to change • No real-time information • Labor intensive • Based on expensive ETL • Inefficient

  20. To Pro-Active, Self-Improving, Machine Learning Systems • Diagram: In-Memory Data Stream Pipeline, Real-Time Data, Expert System / Machine Learning, Data Lake HDFS • Continuous Learning • Continuous Improvement • Continuous Adapting • Multiple Data Sources • Real-Time Processing • Store Everything

  21. Sandboxes • Diagram labels: Data Feeds, Historical Data, Stream Processing, Expert Systems, Machine Learning, Data Lake HDFS, Business Value, Smart Decisions

  22. Data Streaming Reference Architecture • Diagram labels: Data Feeds, Transactional Apps, Analytic Apps, Data Stream Pipeline, Real Time Data & Distributed Computing, Expert Systems & Machine Learning, Advanced Analytics, Data Lake

  23. Data Streaming Reference Architecture • Diagram labels: Data Feeds, Transactional Apps, Analytic Apps, Data Lake

  24. Integrate Apache Ignite with Arenadata DB

  25. Arenadata Grid

  26. Arenadata Grid Use Cases

  27. Arenadata DB Architecture
      • Flexible framework for processing large datasets
      • Master Host and Standby Master Host: the Master coordinates work with Segment Hosts
      • Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
      • Segment Hosts have their own CPU, disk and memory (shared nothing)
      • High speed interconnect for continuous pipelining of data processing
      • Diagram labels: SQL, Master Host, Standby Master, Segment (x5), …
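
The shared-nothing layout on slide 27 can be illustrated with a small scatter/gather sketch. This is a toy model and not Arenadata DB code: the CRC32 stand-in hash, the function names and the sample rows are all assumptions made for illustration. Each "segment" aggregates only its own rows, and the "master" combines the partial results.

```python
import zlib


def segment_for(value, n_segments):
    # Stand-in hash (assumption): the slide does not say how rows are distributed.
    return zlib.crc32(str(value).encode("utf-8")) % n_segments


def distribute(rows, key, n_segments):
    """Place each row on exactly one segment, chosen by its distribution key."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        segments[segment_for(row[key], n_segments)].append(row)
    return segments


def segment_partial_sum(segment_rows, column):
    # A segment only sees its own rows (shared nothing).
    return sum(row[column] for row in segment_rows)


def master_total(rows, key, column, n_segments=4):
    """Scatter rows to segments, aggregate locally, gather on the master."""
    segments = distribute(rows, key, n_segments)
    partials = [segment_partial_sum(seg, column) for seg in segments]
    return sum(partials)


# Hypothetical sample data.
rows = [{"user_id": i, "amount": i * 10} for i in range(1, 9)]
total = master_total(rows, "user_id", "amount")
```

In this sketch the partial sums run one after another; on a real cluster they would run in parallel on separate hosts.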

  28. Greenplum Core Development • Zstandard support (will be added to stable at 6.0.0 due to naming convention) • PXF development, which we are betting on heavily: Ignite integration, push down feature, JDBC & Ignite stable release • A few bugs and a lot of issues

  29. Parallel Query Optimizer
      • Cost-based optimization looks for the most efficient plan
      • Physical plan contains scans, joins, sorts, aggregations, etc.
      • Global planning avoids sub-optimal ‘SQL pushing’ to segments
      • Directly inserts ‘motion’ nodes for inter-segment communication
      • Diagram (physical execution plan from SQL or MapReduce), nodes as extracted: Gather Motion 4:1 (Slice 3); Sort; HashAggregate; HashJoin; Redistribute Motion 4:4 (Slice 1); Hash; HashJoin; HashJoin; Seq Scan on line item; Hash; Seq Scan on customer; Hash; Seq Scan on orders; Broadcast Motion 4:4 (Slice 2); Seq Scan on motion
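
The role of the "motion" nodes on slide 29 can be sketched as follows. This is a simplified model and not the optimizer's implementation: the CRC32 stand-in hash, the helper names and the sample tables are assumptions. Rows are first redistributed by the join key so that matching rows meet on the same segment, each segment then runs a local hash join, and a gather step collects the results.

```python
import zlib
from collections import defaultdict


def seg_of(value, n_segments):
    # Stand-in hash (assumption); any deterministic hash would do for the sketch.
    return zlib.crc32(str(value).encode("utf-8")) % n_segments


def redistribute(segments, key):
    """Redistribute motion: send every row to the segment that owns its join key."""
    n = len(segments)
    out = [[] for _ in range(n)]
    for seg in segments:
        for row in seg:
            out[seg_of(row[key], n)].append(row)
    return out


def local_hash_join(left, right, key):
    """Hash join executed inside one segment, on that segment's rows only."""
    table = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def gather(per_segment_results):
    """Gather motion: collect all segment results on the master."""
    return [row for seg in per_segment_results for row in seg]


# Hypothetical tables, initially spread round-robin (not by cust_id).
n = 4
orders = [{"order_id": i, "cust_id": i % 3} for i in range(6)]
customers = [{"cust_id": c, "name": f"c{c}"} for c in range(3)]
orders_segs = [orders[i::n] for i in range(n)]
cust_segs = [customers[i::n] for i in range(n)]

orders_r = redistribute(orders_segs, "cust_id")
cust_r = redistribute(cust_segs, "cust_id")
joined = gather(
    local_hash_join(o, c, "cust_id") for o, c in zip(orders_r, cust_r)
)
```

A broadcast motion would instead copy the whole smaller table to every segment, which avoids moving the larger one.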

  30. MADlib: Toolkit for Advanced Big Data Analytics
      • Better Parallelism
        – Algorithms designed to leverage MPP or Hadoop architecture
      • Better Scalability
        – Algorithms scale as your data set scales
        – No data movement
      • Better Predictive Accuracy
        – Using all data, not a sample, may improve accuracy
      • Open Source
        – Available for customization and optimization by user

  31. MADlib In-Database Functions
      • Descriptive Statistics
        – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
        – Correlation
        – Summary
      • Support Modules
        – Array Operations
        – Sparse Vectors
        – Random Sampling
        – Probability Functions
      • Predictive Modeling Library
        – Machine Learning Algorithms: ARIMA, Principal Component Analysis (PCA), Association Rules (Affinity Analysis, Market Basket), Topic Modeling (Parallel LDA), Decision Trees, Ensemble Learners (Random Forests), Support Vector Machines, Conditional Random Field (CRF), Clustering (K-means), Cross Validation
        – Generalized Linear Models: Linear Regression, Logistic Regression, Multinomial Logistic Regression, Cox Proportional Hazards Regression, Elastic Net Regularization, Sandwich Estimators (Huber white, clustered, marginal effects)
        – Matrix Factorization: Singular Value Decomposition (SVD)
        – Linear Systems: Sparse and Dense Solvers

  32. Polymorphic Table Storage
      • Provide the choice of processing model for any table or any individual partition
        – Enable Information Lifecycle Management (ILM)
      • Storage types can be mixed within a table or database
        – Four table types: heap, row-oriented AO, column-oriented, external
        – Block compression: Gzip (levels 1-9), Zstd
        – Columnar compression: RLE
      • Diagram (single table): Historical data (years) on slow HDD; Actual data (months) on regular HDD; Now data (hours) on SSD

  33. Platform eXtension Framework (PXF) • An advanced version of Greenplum external tables • Supports connectors for HDFS, HBase, Hive, JDBC and Ignite (Arenadata DB) • Provides an extensible framework API to enable custom connectors

  34. PXF Profiles
      • HDFS Files
      • Ignite
      • JDBC
      • Avro
      • HBase
      • Hive
        – Text based
        – SequenceFile
        – RCFile
        – ORCFile
      Example:
        CREATE EXTERNAL TABLE pxf_sales_part (
            item_name TEXT,
            item_type TEXT,
            supplier_key INTEGER,
            item_price DOUBLE PRECISION,
            delivery_state TEXT,
            delivery_city TEXT
        )
        LOCATION ('pxf://grid_host?Profile=Ignite&IGNITE_CACHE=test&BUFFER_SIZE=10000');
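
The LOCATION string on slide 34 is a URI whose query parameters select the profile and its options. As a minimal sketch, the helpers below assemble that URI and the surrounding statement from Python. The helper names `pxf_location` and `create_external_table_sql` are hypothetical, and the output mirrors only what the slide shows; a real statement may need further clauses that the slide omits.

```python
from urllib.parse import urlencode


def pxf_location(host, profile, **options):
    """Build a pxf:// URI with the profile and its options as query parameters."""
    params = {"Profile": profile, **options}
    return f"pxf://{host}?{urlencode(params)}"


def create_external_table_sql(table, columns, location):
    """Render the statement from slide 34 (illustration only, no input escaping)."""
    cols = ",\n    ".join(f"{name} {sqltype}" for name, sqltype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n    {cols}\n)\n"
        f"LOCATION ('{location}');"
    )


# Values taken from the slide; grid_host is the slide's placeholder host name.
location = pxf_location("grid_host", "Ignite", IGNITE_CACHE="test", BUFFER_SIZE=10000)
ddl = create_external_table_sql(
    "pxf_sales_part",
    [
        ("item_name", "TEXT"),
        ("item_type", "TEXT"),
        ("supplier_key", "INTEGER"),
        ("item_price", "DOUBLE PRECISION"),
        ("delivery_state", "TEXT"),
        ("delivery_city", "TEXT"),
    ],
    location,
)
print(ddl)
```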

  35. PXF Profiles
        <profile>
          <name>Ignite</name>
          <plugins>
            <fragmenter>IgniteFragmenter</fragmenter>
            <accessor>IgniteAccessor</accessor>
            <resolver>IgniteResolver</resolver>
            <analyzer>IgniteAnalyzer</analyzer>
          </plugins>
        </profile>

  36. PXF Classes • Fragmenter – returns a list of source data fragments and their locations • Accessor – accesses a given list of fragments, reads them and returns records • Resolver – deserializes each record according to a given schema or technique • Analyzer – returns statistics about the source data
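
How the four roles from slide 36 fit together can be shown with a toy pipeline. This is a sketch of the division of labour only: PXF itself is a Java framework, and the class names, method names and JSON sample data below are assumptions made for illustration, not PXF's API.

```python
import json

# Hypothetical source: location -> raw JSON records stored there.
SOURCE = {
    "node1": [
        '{"item_name": "bolt", "supplier_key": "7"}',
        '{"item_name": "nut", "supplier_key": "8"}',
    ],
    "node2": ['{"item_name": "gear", "supplier_key": "9"}'],
}


class Fragmenter:
    """Returns the list of source fragments and their locations."""

    def __init__(self, source):
        self.source = source

    def fragments(self):
        return [{"location": loc} for loc in sorted(self.source)]


class Accessor:
    """Reads the given fragments and yields raw records."""

    def __init__(self, source):
        self.source = source

    def read(self, fragments):
        for fragment in fragments:
            yield from self.source[fragment["location"]]


class Resolver:
    """Deserializes each raw record according to a schema of (name, cast) pairs."""

    def __init__(self, schema):
        self.schema = schema

    def resolve(self, raw):
        fields = json.loads(raw)
        return tuple(cast(fields[name]) for name, cast in self.schema)


class Analyzer:
    """Returns simple statistics about the source data."""

    def __init__(self, source):
        self.source = source

    def stats(self):
        return {
            "fragments": len(self.source),
            "records": sum(len(v) for v in self.source.values()),
        }


schema = [("item_name", str), ("supplier_key", int)]
fragments = Fragmenter(SOURCE).fragments()
resolver = Resolver(schema)
rows = [resolver.resolve(raw) for raw in Accessor(SOURCE).read(fragments)]
stats = Analyzer(SOURCE).stats()
```

Because fragments carry their location, different segments could each read a different subset of fragments in parallel.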

  37. PXF Pushdown Feature
      • Single table (columns: Date, User_id, Message), partitioned by Date:
        – partition1: Date >= 01-01-2018
        – partition2: Date < 01-01-2018 and Date >= 01-01-2015
        – partition3: Date < 01-01-2015
      • Grid external table (latency: milliseconds, cost per GB: $$$); rows shown: 21-01-2018 / 16, 01-11-2018 / 500, 15-05-2018 / 2042
      • Regular ADB table (latency: seconds, cost per GB: $$); rows shown: 17-09-2017 / 15, 15-06-2016 / 55, 24-12-2015 / 3510
      • Hadoop external table (latency: tens of seconds, cost per GB: $); rows shown: 01-01-2012 / 19, 26-04-2013 / 42, 23-05-2010 / 17
      • Example predicates: … where Date > 20-01-2018; … where Date < 18-09-2017; … where Date > 16-06-2017 AND User_id < 400
      • Partition filter
      • Pushdown filter: executed in external system
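
The interplay of partition filtering and pushdown on slide 37 can be sketched as below. The partition bounds and the three example predicates come from the slide. The mapping of partition1 to the grid table, partition2 to the regular ADB table and partition3 to the Hadoop table is an assumption inferred from the sample rows, and the function `plan` is hypothetical, not how the planner is implemented.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

# lo is inclusive, hi is exclusive; None means unbounded.
PARTITIONS = [
    {"name": "partition1", "tier": "Grid external table",
     "lo": date(2018, 1, 1), "hi": None, "external": True},
    {"name": "partition2", "tier": "Regular ADB table",
     "lo": date(2015, 1, 1), "hi": date(2018, 1, 1), "external": False},
    {"name": "partition3", "tier": "Hadoop external table",
     "lo": None, "hi": date(2015, 1, 1), "external": True},
]


def plan(after=None, before=None):
    """List partitions that can hold rows with Date > after and Date < before."""
    steps = []
    for p in PARTITIONS:
        # Newest date a bounded partition can hold is hi - 1 day.
        if after is not None and p["hi"] is not None and p["hi"] - ONE_DAY <= after:
            continue  # partition filter: nothing newer than `after` here
        # Oldest date a bounded partition can hold is lo.
        if before is not None and p["lo"] is not None and p["lo"] >= before:
            continue  # partition filter: nothing older than `before` here
        where = "pushed down to external system" if p["external"] else "evaluated in ADB"
        steps.append((p["name"], where))
    return steps


# The three predicates from the slide (dates are DD-MM-YYYY there).
recent = plan(after=date(2018, 1, 20))    # where Date > 20-01-2018
older = plan(before=date(2017, 9, 18))    # where Date < 18-09-2017
mixed = plan(after=date(2017, 6, 16))     # where Date > 16-06-2017
```

Under these assumptions the first predicate touches only the in-memory grid partition, so the query never reads the slower tiers.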

  38. PXF Pushdown Feature

  39. Using the power of in-memory computing with MPP

  40. Test Bench • Arenadata Unified Data Platform • Ignite1, Ignite2 • Internal Affinity Functions • PXF interaction • Greenplum Master, Greenplum Seg1, Greenplum Seg2 • Hadoop Namenode, Hadoop Datanode1, Hadoop Datanode2

  41. Creating Table in MPP

  42. Creating External Table for Apache Ignite & Load Data

  43. Creating External Table in Hive & Load Data

  44. Exchange Partitions with External Tables

  45. Target Table

  46. Execution Plan • prt2: Greenplum Heap Partition • prt1: Ignite Cache Partition

  47. Thank you! Questions?
