Apache Ignite as MPP Accelerator
Alexander Ermakov, CTO
Agenda
• About us
• Why does a traditional DWH need an in-memory grid?
• Real Time Analytics for Telco Cases
• Integrating Apache Ignite with Arenadata DB
• Using the power of in-memory computing with MPP (Example)
<About us>
Who are we?
• Arenadata unites a keen team of developers & engineers building an enterprise data platform.
• We are contributors to open source projects:
  – Greenplum
  – Apache PXF
  – Apache Bigtop
• Members of ODPi (Linux Foundation) since 2015
ODPi Compliant Platforms
Arenadata Enterprise Data Platform Platform Extension Framework
Arenadata - Open Source store.arenadata.io
Our Partners
Why does a DWH need an in-memory grid?
New Generation of Business Cases
• READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE
• FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY
• COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014
• OIL RIGS GENERATE 25000 DATA POINTS PER SECOND
Data sources: Mobile Sensors, Video Surveillance, Social Media, Smart Grids, Medical Imaging, Oil Exploration, Stock Market, Gene Sequencing
Data Value Chain
Time scale: ms, seconds, hours, weeks, months, year, years+
Data Warehouse
Stages: Sources, Transport, Transform, Store, Analyze
Diagram labels: DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, CDC, OLTP, SP, Table
Data Lake
Stages: Sources, Transport, Transform, Store, Analyze
Diagram labels: Queue, DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, …, CDC, OLTP, Hadoop, SP, SQL On HDFS, Table
Lambda Architecture
Stages: Sources, Transport, Transform, Store, Analyze
Diagram labels: Real Time STG App, Batch App, Queue, DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, …, CDC, Hadoop, SQL On HDFS
Kappa Architecture
Stages: Sources, Transport, Transform, Store, Analyze
Diagram labels: Real Time STG App, Batch App, Queue, BI, …
Real Time Analytics for Telco Cases
Customer Retention / Connection Breakdowns
Geo Marketing
Migrating from a Reactive, Static and Constrained Model…
Pipeline: Ingest, Store, Analytics on a Data Lake (HDFS)
• Coding based
• Hard to change
• No real-time information
• Labor intensive
• Based on expensive ETL
• Inefficient
To Pro-Active, Self-Improving, Machine Learning Systems
Diagram labels: In-Memory Data Stream Pipeline, Real-Time Data, Expert System / Machine Learning, Data Lake (HDFS)
• Multiple Data Sources
• Real-Time Processing
• Store Everything
• Continuous Learning
• Continuous Improvement
• Continuous Adapting
Diagram labels: Sandboxes, Data Feeds, Historical Data, Stream Processing, Expert Systems, Machine Learning, Data Lake (HDFS), Business Value, Smart Decisions
Data Streaming Reference Architecture
Diagram labels: Data Feeds, Transactional Apps, Analytic Apps, Data Stream Pipeline, Real Time Data & Distributed Computing, Expert Systems & Machine Learning, Advanced Analytics, Data Lake
Data Streaming Reference Architecture
Diagram labels: Data Feeds, Transactional Apps, Analytic Apps, Data Lake
Integrate Apache Ignite with Arenadata DB
Arenadata Grid
Arenadata Grid Use Cases
Arenadata DB Architecture
• Flexible framework for processing large datasets
• Master Host and Standby Master Host: the Master coordinates work with the Segment Hosts
• Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
• High speed interconnect for continuous pipelining of data processing
Diagram labels: SQL, Master Host, Standby Master, Segment … Segment
Greenplum Core Development
• Zstandard support (will be added to stable in 6.0.0 due to naming convention)
• PXF development, a major bet for us: Ignite integration, pushdown feature, stable JDBC & Ignite release
• A few bugs and a lot of issues
Parallel Query Optimizer
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal 'SQL pushing' to segments
• Directly inserts 'motion' nodes for inter-segment communication
Physical execution plan from SQL or MapReduce (diagram nodes, top to bottom): Gather Motion 4:1 (Slice 3); Sort; HashAggregate; HashJoin; Redistribute Motion 4:4 (Slice 1); HashJoin; Hash; HashJoin; Seq Scan on line item; Hash; Seq Scan on customer; Hash; Seq Scan on orders; Broadcast Motion 4:4 (Slice 2); Seq Scan on motion
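The 'motion' nodes in the plan move rows between segments. A minimal Python sketch of what a Redistribute Motion and a Gather Motion do, assuming rows are routed by hashing a key modulo the number of segments. The function names and the CRC32 hash are illustrative assumptions, not Greenplum's actual implementation.

```python
import zlib
from collections import defaultdict


def target_segment(key, num_segments):
    """Pick the segment that owns a key (illustrative hash, not Greenplum's)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def redistribute(rows_by_segment, key_fn, num_segments):
    """Simulate a Redistribute Motion N:N: every segment sends each row
    to the segment that owns the row's key."""
    out = defaultdict(list)
    for rows in rows_by_segment.values():
        for row in rows:
            out[target_segment(key_fn(row), num_segments)].append(row)
    return dict(out)


def gather(rows_by_segment):
    """Simulate a Gather Motion N:1: collect rows from all segments in one place."""
    return [row for seg in sorted(rows_by_segment) for row in rows_by_segment[seg]]


# Orders start out spread over 4 segments without regard to customer id.
orders = {
    0: [{"order_id": 1, "cust_id": 10}, {"order_id": 2, "cust_id": 11}],
    1: [{"order_id": 3, "cust_id": 10}],
    2: [{"order_id": 4, "cust_id": 12}],
    3: [{"order_id": 5, "cust_id": 11}],
}

# After redistribution, all orders of one customer sit on the same segment,
# so a join on cust_id can run locally on each segment.
by_customer = redistribute(orders, lambda r: r["cust_id"], 4)
```

Co-locating rows by join key is what lets each segment run its part of a hash join without talking to the others until the final gather.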
MADlib: Toolkit for Advanced Big Data Analytics
• Better Parallelism
  – Algorithms designed to leverage MPP or Hadoop architecture
• Better Scalability
  – Algorithms scale as your data set scales
  – No data movement
• Better Predictive Accuracy
  – Using all data, not a sample, may improve accuracy
• Open Source
  – Available for customization and optimization by user
MADlib In-Database Functions
Predictive Modeling Library
Generalized Linear Models:
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber white, clustered, marginal effects)
Machine Learning Algorithms:
• ARIMA
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Matrix Factorization:
• Singular Value Decomposition (SVD)
Linear Systems:
• Sparse and Dense Solvers
Descriptive Statistics
Sketch-based Estimators:
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations, Sparse Vectors, Random Sampling, Probability Functions
Polymorphic Table Storage
• Provide the choice of processing model for any table or any individual partition
  – Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
  – Four table types: heap, row-oriented AO, column-oriented, external
  – Block compression: Gzip (levels 1-9), Zstd
  – Columnar compression: RLE
Diagram (single table): Historical data (years) on slow HDD; Actual data (months) on regular HDD; Now data (hours) on SSD
Platform eXtension Framework (PXF)
• An advanced version of Greenplum external tables
• Supports connectors for HDFS, HBase, Hive, JDBC and Ignite (Arenadata DB)
• Provides an extensible framework API to enable custom connectors
PXF Profiles
• HDFS Files
• Ignite
• JDBC
• Avro
• HBase
• Hive
  – Text based
  – SequenceFile
  – RCFile
  – ORCFile

CREATE EXTERNAL TABLE pxf_sales_part(
    item_name TEXT,
    item_type TEXT,
    supplier_key INTEGER,
    item_price DOUBLE PRECISION,
    delivery_state TEXT,
    delivery_city TEXT
)
LOCATION ('pxf://grid_host?Profile=Ignite&IGNITE_CACHE=test&BUFFER_SIZE=10000');
PXF Profiles
<profile>
    <name>Ignite</name>
    <plugins>
        <fragmenter>IgniteFragmenter</fragmenter>
        <accessor>IgniteAccessor</accessor>
        <resolver>IgniteResolver</resolver>
        <analyzer>IgniteAnalyzer</analyzer>
    </plugins>
</profile>
PXF Classes
• Fragmenter – returns a list of source data fragments and their locations
• Accessor – accesses a given list of fragments, reads them and returns records
• Resolver – deserializes each record according to a given schema or technique
• Analyzer – returns statistics about the source data
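The four roles can be sketched as a small read pipeline. Real PXF plugins are Java classes. This Python sketch only mirrors the division of labor described on the slide, and every name in it (SOURCE, Fragment, scan, the method names) is a hypothetical stand-in rather than the PXF API.

```python
from dataclasses import dataclass

# Hypothetical "external system": two nodes, each holding raw CSV lines.
SOURCE = {
    "node1": ["1,apple", "2,pear"],
    "node2": ["3,plum"],
}


@dataclass
class Fragment:
    location: str  # node that holds the fragment
    index: int     # position of the fragment within the source


class Fragmenter:
    def fragments(self):
        """Return the list of source fragments and their locations."""
        return [Fragment(loc, i) for i, loc in enumerate(sorted(SOURCE))]


class Accessor:
    def read(self, fragments):
        """Read the given fragments and yield raw records."""
        for frag in fragments:
            yield from SOURCE[frag.location]


class Resolver:
    def __init__(self, schema):
        self.schema = schema  # list of (column name, type constructor)

    def resolve(self, raw):
        """Deserialize one raw record according to the schema."""
        values = raw.split(",")
        return {name: cast(v) for (name, cast), v in zip(self.schema, values)}


class Analyzer:
    def stats(self):
        """Return simple statistics about the source data."""
        return {"fragments": len(SOURCE), "rows": sum(len(v) for v in SOURCE.values())}


def scan(schema):
    """Run the pipeline: list fragments, read them, deserialize each record."""
    frags = Fragmenter().fragments()
    resolver = Resolver(schema)
    return [resolver.resolve(raw) for raw in Accessor().read(frags)]


rows = scan([("id", int), ("name", str)])
```

In a real deployment each segment would read only the fragments assigned to it, which is how the external scan stays parallel.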
PXF Pushdown Feature
Single table partitioned by Date:
partition by Date (
    partition1: Date >= 01-01-2018
    partition2: Date < 01-01-2018 and Date >= 01-01-2015
    partition3: Date < 01-01-2015
)
Storage tiers with sample rows (Date, User_id, Message):
• Grid external table (latency: milliseconds; cost per GB: $$$)
    21-01-2018   16     <message>
    01-11-2018   500    <message>
    15-05-2018   2042   <message>
• Regular ADB table (latency: seconds; cost per GB: $$)
    17-09-2017   15     <message>
    15-06-2016   55     <message>
    24-12-2015   3510   <message>
• Hadoop external table (latency: tens of seconds; cost per GB: $)
    01-01-2012   19     <message>
    26-04-2013   42     <message>
    23-05-2010   17     <message>
Example filters:
• … where Date > 20-01-2018
• … where Date < 18-09-2017
• … where Date > 16-06-2017 AND User_id < 400
Diagram annotations: Partition filter; Pushdown filter (executed in external system)
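The interplay of the partition filter and the pushdown filter can be sketched in Python. This is an illustration under stated assumptions, not Arenadata DB code: the partition bounds and sample rows come from the slide, while PARTITIONS, ROWS, prune and query are hypothetical names, and "rows transferred" simply counts rows leaving an external tier.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

# (name, is_external, inclusive lower bound, exclusive upper bound); None = unbounded.
PARTITIONS = [
    ("partition1", True, date(2018, 1, 1), None),               # Grid external table
    ("partition2", False, date(2015, 1, 1), date(2018, 1, 1)),  # Regular ADB table
    ("partition3", True, None, date(2015, 1, 1)),               # Hadoop external table
]

# Sample rows from the slide as (Date, User_id).
ROWS = {
    "partition1": [(date(2018, 1, 21), 16), (date(2018, 11, 1), 500), (date(2018, 5, 15), 2042)],
    "partition2": [(date(2017, 9, 17), 15), (date(2016, 6, 15), 55), (date(2015, 12, 24), 3510)],
    "partition3": [(date(2012, 1, 1), 19), (date(2013, 4, 26), 42), (date(2010, 5, 23), 17)],
}


def prune(after=None, before=None):
    """Partition filter: names of partitions that can hold rows with after < Date < before."""
    first = date.min if after is None else after + ONE_DAY
    last = date.max if before is None else before - ONE_DAY
    selected = []
    for name, _external, lo, hi in PARTITIONS:
        part_first = date.min if lo is None else lo
        part_last = date.max if hi is None else hi - ONE_DAY
        if max(first, part_first) <= min(last, part_last):
            selected.append(name)
    return selected


def query(after=None, before=None, max_user_id=None, pushdown=True):
    """Return (matching rows, rows transferred from external tiers)."""
    def match(row):
        d, uid = row
        return ((after is None or d > after)
                and (before is None or d < before)
                and (max_user_id is None or uid < max_user_id))

    external = {name: ext for name, ext, _lo, _hi in PARTITIONS}
    result, transferred = [], 0
    for name in prune(after, before):
        rows = ROWS[name]
        if external[name]:
            # With pushdown the external system applies the filter itself,
            # so only matching rows cross the wire.
            fetched = [r for r in rows if match(r)] if pushdown else list(rows)
            transferred += len(fetched)
        else:
            fetched = list(rows)
        result.extend(r for r in fetched if match(r))
    return result, transferred


# … where Date > 16-06-2017 AND User_id < 400
with_pushdown = query(after=date(2017, 6, 16), max_user_id=400, pushdown=True)
without_pushdown = query(after=date(2017, 6, 16), max_user_id=400, pushdown=False)
```

Both calls return the same rows. The difference is how many rows leave the external tier, which is the cost that pushdown is meant to reduce.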
PXF Pushdown Feature
Using the power of in-memory computing with MPP
Test Bench
Arenadata Unified Data Platform
• Ignite: Ignite1, Ignite2 (internal affinity functions)
• PXF interaction
• Greenplum: Seg1, Seg2, Master
• Hadoop: Datanode1, Datanode2, Namenode
Creating Table in MPP
Creating External Table for Apache Ignite & Load Data
Creating External Table in Hive & Load Data
Exchange Partitions with External Tables
Target Table
Execution Plan
prt2: Greenplum Heap Partition
prt1: Ignite Cache Partition
Thank you! Questions?