Apache Ignite as MPP Accelerator
Alexander Ermakov, CTO

Agenda
• About us
• Why does a traditional DWH need an in-memory grid?
• Real-Time Analytics for Telco Cases
• Integrating Apache Ignite with Arenadata DB
• Using the power of in-memory computing with MPP (Example)

Who are we?
• Arenadata unites a keen team of developers and engineers building an enterprise data platform.
• We contribute to open source projects:
  – Greenplum
  – Apache PXF
  – Apache Bigtop
• Members of ODPi (Linux Foundation) since 2015

ODPi Compliant Platforms

Arenadata Enterprise Data Platform Platform Extension Framework

Arenadata - Open Source store.arenadata.io

Our Partners

Why does a DWH need an in-memory grid?

New Generation of Business Cases
• Reading smart meters every 15 minutes is 3000x more data intensive
• Facebook uploads 250 million photos each day
• Cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
• Oil rigs generate 25000 data points per second
Data sources: Mobile Sensors, Video Surveillance, Social Media, Smart Grids, Medical Imaging, Oil Exploration, Stock Market, Gene Sequencing

Data Value Chain
[Diagram: time axis from ms, seconds, hours, weeks, months, year to years+]

Data Warehouse
Pipeline stages: Sources, Transport, Transform, Store, Analyze
[Diagram labels: DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, CDC, OLTP, SP, Table]

Data Lake
Pipeline stages: Sources, Transport, Transform, Store, Analyze
[Diagram labels: Queue, DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, CDC, OLTP, SP, Table, Hadoop, SQL on Hadoop, HDFS]

Lambda Architecture
Pipeline stages: Sources, Transport, Transform, Store, Analyze
[Diagram labels: Real Time App, Batch App, STG, Queue, DWH, ELT & DQ, Batch, API, ES, Data Mart, BI, DDS, ODS, CDC, Hadoop, SQL on Hadoop, HDFS]

Kappa Architecture
Pipeline stages: Sources, Transport, Transform, Store, Analyze
[Diagram labels: Real Time App, Batch App, STG, Queue, BI]

Real Time Analytics for Telco Cases

Customer Retention / Connection Breakdowns

Geo Marketing

Migrating from a Reactive, Static and Constrained Model…
Pipeline: Ingest, Store (Data Lake, HDFS), Analytics
• Coding based
• Hard to change
• No real-time information
• Labor intensive
• Based on expensive ETL
• Inefficient

…to Pro-Active, Self-Improving, Machine Learning Systems
Pipeline: In-Memory Data Stream Pipeline, Real-Time Data, Expert System / Machine Learning, Data Lake (HDFS)
• Continuous Learning
• Continuous Improvement
• Continuous Adapting
• Multiple Data Sources
• Real-Time Processing
• Store Everything

[Diagram labels: Sandboxes; Data Feeds; Historical Data; Stream Processing; Expert Systems; Data Lake; Machine Learning; HDFS; Business Value; Smart Decisions]

Data Streaming Reference Architecture
[Diagram labels: Data Feeds; Transactional Apps; Analytic Apps; Data Stream Pipeline; Real Time Data & Distributed Computing; Expert Systems & Machine Learning; Advanced Analytics; Data Lake]

Integrate Apache Ignite with Arenadata DB

Arenadata Grid

Arenadata Grid Use Cases

Arenadata DB Architecture
• Flexible framework for processing large datasets
• Master Host and Standby Master Host
  – Master coordinates work with Segment Hosts
• Segment Host with one or more Segment Instances
  – Segment Instances process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
• High speed interconnect for continuous pipelining of data processing
[Diagram: SQL enters the Master Host (with a Standby Master), which connects through the interconnect to the Segment hosts]
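The shared-nothing idea above can be sketched in a few lines of Python. This is only an illustration, not Arenadata DB or Greenplum code: the segment count, the `crc32` hash and every function name are assumptions made for the example. Rows are hash-distributed by a key, each "segment" aggregates its own rows independently, and the "master" only merges the partial results.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical number of segment instances


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Place each (user_id, amount) row on exactly one segment."""
    segments = [[] for _ in range(num_segments)]
    for user_id, amount in rows:
        segments[segment_for(user_id, num_segments)].append((user_id, amount))
    return segments


def segment_partial_sum(segment_rows):
    """Work done by one segment on its local rows only."""
    totals = defaultdict(int)
    for user_id, amount in segment_rows:
        totals[user_id] += amount
    return dict(totals)


def master_gather(partials):
    """Master merges per-segment results. Keys never collide because
    all rows of one user_id live on the same segment."""
    result = {}
    for partial in partials:
        result.update(partial)
    return result


rows = [("u1", 10), ("u2", 5), ("u1", 7), ("u3", 1), ("u2", 2)]
segments = distribute(rows)
totals = master_gather(segment_partial_sum(s) for s in segments)
print(totals)
```

Because the grouping key equals the distribution key, no data has to move between segments before aggregation; that is the case the sketch models.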

Greenplum Core Development
• Zstandard support (will be added to stable at 6.0.0 due to naming convention)
• PXF development, on which we bet heavily: Ignite integration, pushdown feature, stable JDBC and Ignite releases
• A few bugs and a lot of issues

Parallel Query Optimizer
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal 'SQL pushing' to segments
• Directly inserts 'motion' nodes for inter-segment communication
[Diagram: physical execution plan from SQL or MapReduce. Nodes, in the order shown: Gather Motion 4:1 (Slice 3); Sort; HashAggregate; HashJoin; Redistribute Motion 4:4 (Slice 1); Hash; HashJoin; HashJoin; Seq Scan on line item; Hash; Seq Scan on customer; Hash; Seq Scan on orders; Broadcast Motion 4:4 (Slice 2); Seq Scan on motion]
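To make the 'motion' nodes concrete, here is a toy Python model of the three motion types named in the plan. It is a sketch under assumptions, not the optimizer's implementation: the four-segment layout, the hash function and all helper names are invented for illustration. A Redistribute Motion re-hashes rows on the join key, a Broadcast Motion copies a table to every segment, and a Gather Motion collects the result on the master.

```python
import zlib

NUM_SEGMENTS = 4  # mirrors the "4:4" and "4:1" motions in the plan


def seg(key, n=NUM_SEGMENTS):
    """Stable hash of a key to a segment number."""
    return zlib.crc32(str(key).encode("utf-8")) % n


def redistribute(segments, key_index, n=NUM_SEGMENTS):
    """Redistribute Motion n:n: re-hash every row on the given column."""
    out = [[] for _ in range(n)]
    for rows in segments:
        for row in rows:
            out[seg(row[key_index], n)].append(row)
    return out


def broadcast(segments, n=NUM_SEGMENTS):
    """Broadcast Motion n:n: every segment receives a full copy."""
    all_rows = [row for rows in segments for row in rows]
    return [list(all_rows) for _ in range(n)]


def gather(segments):
    """Gather Motion n:1: the master collects rows from all segments."""
    return [row for rows in segments for row in rows]


def local_hash_join(left, right, left_key, right_key):
    """Hash join executed inside one segment on its local rows."""
    table = {}
    for row in right:
        table.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in table.get(l[left_key], [])]


orders = [(1, "c1"), (2, "c2"), (3, "c1"), (4, "c3")]      # (order_id, customer_id)
customers = [("c1", "Ann"), ("c2", "Bob"), ("c3", "Eve")]  # (customer_id, name)

orders_by_id = redistribute([orders], 0)        # stored distributed by order_id
customers_by_id = redistribute([customers], 0)  # stored distributed by customer_id

# Option 1: move orders so that matching keys meet on the same segment.
orders_by_customer = redistribute(orders_by_id, 1)
joined_redistribute = gather(
    local_hash_join(o, c, 1, 0)
    for o, c in zip(orders_by_customer, customers_by_id)
)

# Option 2: leave orders in place and copy the small table everywhere.
joined_broadcast = gather(
    local_hash_join(o, c, 1, 0)
    for o, c in zip(orders_by_id, broadcast(customers_by_id))
)

print(sorted(joined_redistribute))
```

Both options return the same rows; a cost-based planner would choose between them based on how much data each motion has to ship.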

MADlib: Toolkit for Advanced Big Data Analytics
• Better Parallelism
  – Algorithms designed to leverage MPP or Hadoop architecture
• Better Scalability
  – Algorithms scale as your data set scales
  – No data movement
• Better Predictive Accuracy
  – Using all data, not a sample, may improve accuracy
• Open Source
  – Available for customization and optimization by user

MADlib In-Database Functions

Descriptive Statistics
• Sketch-based Estimators
  – CountMin (Cormode-Muthukrishnan)
  – FM (Flajolet-Martin)
  – MFV (Most Frequent Values)
• Correlation
• Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions

Predictive Modeling Library
• Generalized Linear Models
  – Linear Regression
  – Logistic Regression
  – Multinomial Logistic Regression
  – Cox Proportional Hazards Regression
  – Elastic Net Regularization
  – Sandwich Estimators (Huber white, clustered, marginal effects)
• Machine Learning Algorithms
  – ARIMA
  – Principal Component Analysis (PCA)
  – Association Rules (Affinity Analysis, Market Basket)
  – Topic Modeling (Parallel LDA)
  – Decision Trees
  – Ensemble Learners (Random Forests)
  – Support Vector Machines
  – Conditional Random Field (CRF)
  – Clustering (K-means)
  – Cross Validation
• Matrix Factorization
  – Singular Value Decomposition (SVD)
• Linear Systems
  – Sparse and Dense Solvers

Polymorphic Table Storage
• Provide the choice of processing model for any table or any individual partition
  – Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
  – Four table types: heap, row-oriented AO, column-oriented, external
  – Block compression: Gzip (levels 1-9), Zstd
  – Columnar compression: RLE
[Diagram: a single table spanning three tiers: Historical data (years) on slow HDD; Actual data (months) on regular HDD; Now data (hours) on SSD]

Platform eXtension Framework (PXF)
• An advanced version of Greenplum external tables
• Supports connectors for HDFS, HBase, Hive, JDBC and Ignite (Arenadata DB)
• Provides an extensible framework API to enable custom connectors

PXF Profiles
• HDFS Files
• Ignite
• JDBC
• Avro
• HBase
• Hive
  – Text based
  – SequenceFile
  – RCFile
  – ORCFile

CREATE EXTERNAL TABLE pxf_sales_part(
    item_name TEXT,
    item_type TEXT,
    supplier_key INTEGER,
    item_price DOUBLE PRECISION,
    delivery_state TEXT,
    delivery_city TEXT
)
LOCATION ('pxf://grid_host?Profile=Ignite&IGNITE_CACHE=test&BUFFER_SIZE=10000');
-- The slide omits the FORMAT clause; PXF external tables normally also need one,
-- for example FORMAT 'CUSTOM' (formatter='pxfwritable_import').

PXF Profiles

<profile>
    <name>Ignite</name>
    <plugins>
        <fragmenter>IgniteFragmenter</fragmenter>
        <accessor>IgniteAccessor</accessor>
        <resolver>IgniteResolver</resolver>
        <analyzer>IgniteAnalyzer</analyzer>
    </plugins>
</profile>

PXF Classes
• Fragmenter – returns a list of source data fragments and their locations
• Accessor – reads a given list of fragments and returns records
• Resolver – deserializes each record according to a given schema or technique
• Analyzer – returns statistics about the source data
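The division of labour between the four plugin roles can be sketched as follows. Real PXF plugins are Java classes; the Python classes below (`ToyFragmenter`, `ToyAccessor`, `ToyResolver`, `ToyAnalyzer`) are hypothetical stand-ins that only mirror the roles described above, with an in-memory list of comma-separated lines standing in for the external source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    location: str  # hypothetical node name holding this slice
    start: int     # first line index (inclusive)
    end: int       # last line index (exclusive)


RAW_LINES = ["bolt,0.25", "nut,0.10", "gear,4.50", "belt,7.00"]


class ToyFragmenter:
    """Splits the source into fragments and reports where each one lives."""

    def __init__(self, lines, fragment_size):
        self.lines = lines
        self.fragment_size = fragment_size

    def get_fragments(self):
        fragments = []
        for start in range(0, len(self.lines), self.fragment_size):
            end = min(start + self.fragment_size, len(self.lines))
            fragments.append(Fragment(f"node{len(fragments) + 1}", start, end))
        return fragments


class ToyAccessor:
    """Reads one fragment and yields its raw records."""

    def __init__(self, lines):
        self.lines = lines

    def read(self, fragment):
        yield from self.lines[fragment.start:fragment.end]


class ToyResolver:
    """Deserializes a raw record into typed fields using a schema."""

    def __init__(self, schema):
        self.schema = schema  # list of (column_name, cast_function)

    def resolve(self, raw):
        values = raw.split(",")
        return {name: cast(v) for (name, cast), v in zip(self.schema, values)}


class ToyAnalyzer:
    """Returns statistics about the source for the query planner."""

    def __init__(self, lines):
        self.lines = lines

    def statistics(self):
        return {"row_count": len(self.lines)}


def scan(fragmenter, accessor, resolver):
    """Full read path: fragment list, then raw records, then typed rows."""
    return [
        resolver.resolve(raw)
        for fragment in fragmenter.get_fragments()
        for raw in accessor.read(fragment)
    ]


SCHEMA = [("item_name", str), ("item_price", float)]
fragments = ToyFragmenter(RAW_LINES, 2).get_fragments()
rows = scan(ToyFragmenter(RAW_LINES, 2), ToyAccessor(RAW_LINES), ToyResolver(SCHEMA))
stats = ToyAnalyzer(RAW_LINES).statistics()
print(rows)
```

In an MPP setting the fragment list is what allows different segments to read different parts of the source in parallel; the sketch reads them sequentially for simplicity.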

PXF Pushdown Feature

Date        User_id  Message    Storage tier
21-01-2018  16       <message>  Grid external table
01-11-2018  500      <message>  Grid external table
15-05-2018  2042     <message>  Grid external table
17-09-2017  15       <message>  Regular ADB table
15-06-2016  55       <message>  Regular ADB table
24-12-2015  3510     <message>  Regular ADB table
01-01-2012  19       <message>  Hadoop external table
26-04-2013  42       <message>  Hadoop external table
23-05-2010  17       <message>  Hadoop external table

Storage tier           Latency           Cost per GB
Grid external table    milliseconds      $$$
Regular ADB table      seconds           $$
Hadoop external table  tens of seconds   $

partition by Date (
    partition1: Date >= 01-01-2018
    partition2: Date < 01-01-2018 and Date >= 01-01-2015
    partition3: Date < 01-01-2015
)

Example predicates:
… where Date > 20-01-2018
… where Date < 18-09-2017
… where Date > 16-06-2017 AND User_id < 400

[Diagram annotations: Partition filter; Pushdown filter (executed in external system)]
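The interplay of partition filtering and pushdown on this slide can be sketched in Python. This is an illustrative model, not the Greenplum planner or PXF code: the partition bounds are taken from the slide, while the assignment of partitions to storage tiers (partition1 to the grid, partition2 to a regular table, partition3 to Hadoop), the `plan` function and its output format are assumptions. Partitions whose date range cannot match the predicate are skipped; for the remaining external partitions the filter is handed to the external system, and the regular table is filtered locally.

```python
from datetime import date

# (name, storage tier, lower bound inclusive, upper bound exclusive)
PARTITIONS = [
    ("partition1", "Grid external table", date(2018, 1, 1), None),
    ("partition2", "Regular ADB table", date(2015, 1, 1), date(2018, 1, 1)),
    ("partition3", "Hadoop external table", None, date(2015, 1, 1)),
]
EXTERNAL_TIERS = {"Grid external table", "Hadoop external table"}


def plan(date_after=None, date_before=None):
    """Return the partitions a query must touch for the predicate
    date_after < Date < date_before, and where the filter runs."""
    steps = []
    for name, tier, low, high in PARTITIONS:
        # Partition filter: skip partitions that end at or before the lower limit.
        if date_after is not None and high is not None and high <= date_after:
            continue
        # Partition filter: skip partitions that start at or after the upper limit.
        if date_before is not None and low is not None and low >= date_before:
            continue
        # Pushdown: external tiers evaluate the filter themselves.
        action = "push filter down" if tier in EXTERNAL_TIERS else "filter locally"
        steps.append((name, tier, action))
    return steps


# The three predicates from the slide (dates are DD-MM-YYYY there).
q1 = plan(date_after=date(2018, 1, 20))   # where Date > 20-01-2018
q2 = plan(date_before=date(2017, 9, 18))  # where Date < 18-09-2017
q3 = plan(date_after=date(2017, 6, 16))   # where Date > 16-06-2017 AND User_id < 400

for label, steps in (("q1", q1), ("q2", q2), ("q3", q3)):
    print(label, steps)
```

The `User_id < 400` condition of the third predicate does not affect partition selection, since the table is partitioned by Date only; in this model it would simply travel with the pushed-down filter.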

PXF Pushdown Feature

Using the power of in-memory computing with MPP

Test Bench
[Diagram: Arenadata Unified Data Platform. Ignite1 and Ignite2 (internal affinity functions); PXF interaction; Greenplum Master, Greenplum Seg1, Greenplum Seg2; Hadoop Namenode, Hadoop Datanode1, Hadoop Datanode2]

Creating Table in MPP

Creating External Table for Apache Ignite & Load Data

Creating External Table in Hive & Load Data

Exchange Partitions with External Tables

Target Table

Execution Plan
• prt1: Ignite Cache Partition
• prt2: Greenplum Heap Partition

Thank you! Questions?
