Turbocharge your MySQL analytics with ElasticSearch
Guillaume Lefranc
Data & Infrastructure Architect, Productsup GmbH
Percona Live Europe 2017
About the Speaker
Guillaume Lefranc
● Data Architect at Productsup
● Replication Manager for MySQL and MariaDB - Lead Architect
● DBA Manager at MariaDB Corporation
● Infrastructure Consultant at Coinigy
● DB Architect at dailymotion.com
Takeaways
In this presentation, we will cover:
● How Elasticsearch works as a document and column store
● Its strengths and weaknesses when it comes to analytics
● How to sync data with MySQL
● How to build aggregations
Case study
Disclaimer: this case study is about medium data, not big data.
What is medium data? Answer: from 100GB to a few TBs.
Types of data:
● User activity (clicks)
● Market data
● Trips
Use Cases
Business cases:
● A ride-sharing app
  Example dataset: NYC Taxi Data (6 months: 78 million trips)
● Cryptocurrency market data
  200 million documents per month - courtesy of coinigy.com
MySQL and Medium Data
● Medium data can scale well in MySQL (SELECT … WHERE id = ?)
● … but not with analytics
● Not every case can be covered by an index
● Aggregations can be slow, especially when they require table scans
Elasticsearch - What?
● “You know, for Search” -> Inverted Index
● Document Store
● REST API
○ POST /market_data -d '{ "market": "BTC/USD", "value": "3418.03953" }'
● JSON Native
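The flip side of the POST above is a search request against the same REST API. A minimal sketch, assuming the market_data index from the example (the term query is illustrative, not from the original deck):

GET /market_data/_search
{
  "query": {
    "term": { "market": "BTC/USD" }
  }
}

Because every field is backed by the inverted index, this exact-match lookup is fast with no manual index definition.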
Elasticsearch - Anatomy of a Document
{
  "_index" : "raw-2016-07",
  "_type" : "market_raw",
  "_id" : "108051174765",
  "_score" : 7.455347,
  "_source" : {
    "quantity" : 0.64130859,
    "time_local" : "2016-07-12 06:45:34",
    "type" : "SELL",
    "market" : "USD/BTC",
    "total" : 414.52263332,
    "@timestamp" : "2016-07-12T06:45:34.000Z",
    "price" : 646.37,
    "exchange" : "BITS",
    "id" : 108051174765,
    "tradeid" : "11649948"
  }
}
Elasticsearch - Why?
● Distributed
● Fault Tolerant
● Scales Horizontally
Elasticsearch - Column Store
● All fields are indexed by default
● A query is (almost) always an index scan
● Doc values
○ Field values serialized on disk
○ Not stored in the JVM heap: relies on the OS cache
○ Compression
○ By default: all numerics, geo_points, dates, IPs and not_analyzed strings (keywords)
● Not a general-purpose column-store replacement
● Not exactly fast as a document DB
Bibliography:
● https://www.elastic.co/blog/elasticsearch-as-a-column-store
● https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html
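Since doc values are on by default for the types listed above, tuning mostly means switching them off. A hedged sketch, reusing field names from the market_raw document earlier (disabling doc_values on tradeid is an assumption about the workload, valid only if you never sort or aggregate on it):

PUT /raw-2016-07
{
  "mappings": {
    "market_raw": {
      "properties": {
        "market":  { "type": "keyword" },
        "price":   { "type": "double" },
        "tradeid": { "type": "keyword", "doc_values": false }
      }
    }
  }
}

Dropping doc values on a field saves disk and cache space but removes it from the column store entirely.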
Alternatives (Open Source)
● MariaDB ColumnStore
● Yandex ClickHouse
● Apache Spark
Bibliography:
https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/
MySQL Analytics: Performance
Fast if the data is covered by an index; very slow if a full scan is needed.
Example:

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
WHERE driver_id=102
GROUP BY date
ORDER BY pickup_datetime DESC;
-> returns in milliseconds

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
GROUP BY date
ORDER BY pickup_datetime DESC;
-> 4 minutes
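For comparison, a sketch of the second (slow) query as an Elasticsearch aggregation. It assumes the taxi-* indices built by the Logstash pipeline on the following slides, where pickup_datetime is parsed into @timestamp:

GET /taxi-*/_search
{
  "size": 0,
  "aggs": {
    "daily_earnings": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day"
      },
      "aggs": {
        "earnings": { "sum": { "field": "total_amount" } }
      }
    }
  }
}

Both fields live in doc values, so this runs as a column scan rather than a row-by-row table scan; adding a term filter on driver_id gives the per-driver variant.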
Importing data: Logstash
● Logstash is an open source data collection engine with real-time pipelining capabilities
● The L in ELK Stack
● ETL for ElasticSearch
● Pipeline model (input -> filter -> output)
Importing data - Input

input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
    jdbc_user => "root"
    jdbc_password => "admin"
    jdbc_fetch_size => 100000
    statement => "SELECT id, driver_id, passenger_id, pickup_datetime, dropoff_datetime,
      CONCAT(pickup_latitude, ',', pickup_longitude) AS pickup_location,
      CONCAT(dropoff_latitude, ',', dropoff_longitude) AS dropoff_location,
      payment_type, total_amount FROM trips WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => id
    #schedule => "*/5 * * * *"
  }
}
Importing data - Input
● Dealing with large result sets
  jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
  jdbc_fetch_size => 10000
● SQL last value
  use_column_value => true
  tracking_column => id
  statement => "SELECT … WHERE id > :sql_last_value"
● Scheduler
  schedule => "*/5 * * * *"
Reference: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
Importing Data - Filters
Filters (input transformation):

filter {
  mutate {
    convert => [ "pickup_datetime", "string" ]
  }
  date {
    match => [ "pickup_datetime", "ISO8601" ]
  }
}

● The date field is used for partitioning; if there is no suitable field, Elasticsearch will use the current date and time
Importing Data - Output

output {
  elasticsearch {
    hosts => [ "es1:9200" ]
    user => "elastic"
    password => "elasticpassword"
    index => "taxi-%{+YYYY-MM}"
    document_type => "trips"
    document_id => "%{id}"
  }
}
Importing Data - Output
● Document partitioning
  index => "taxi-%{+YYYY-MM}"
● Matching ID with MySQL
  document_id => "%{id}"
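Reusing the MySQL primary key as the Elasticsearch _id makes the pipeline idempotent: re-importing a row overwrites the same document instead of creating a duplicate. It also allows direct lookup by MySQL id; a sketch, with a hypothetical trip id and assuming a July 2016 pickup date:

GET /taxi-2016-07/trips/12345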
Schema Design - Indexes
● Document Partitioning
● ElasticSearch Types
● Number of Indices
● Number of Shards
● Replication Factor
Schema Design - Indexes

PUT /taxi
{
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    }
  }
}

PUT _template/template1
{
  "template": "taxi*",
  "settings": {
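The template body is cut off on the slide. A plausible completion, assuming it simply applies the same shard and replica settings to every monthly taxi-* index:

PUT _template/template1
{
  "template": "taxi*",
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

Templates matter here because the Logstash output creates a new taxi-YYYY-MM index every month, and each one should pick up the same settings automatically.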
Schema Design - Mapping
● Indexing with the optimal type
● Avoiding full-text search indexing (aka text or “analyzed”)
● Type overview
○ keyword
○ long
○ byte
○ date
○ geo_point
○ scaled_float
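A sketch of a trips mapping using these types. Field names follow the Logstash SELECT earlier; the specific type choices (e.g. scaled_float for dollar amounts) are assumptions, not the original deck's mapping:

PUT /taxi-2016-07
{
  "mappings": {
    "trips": {
      "properties": {
        "driver_id":        { "type": "long" },
        "passenger_id":     { "type": "long" },
        "pickup_datetime":  { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
        "pickup_location":  { "type": "geo_point" },
        "dropoff_location": { "type": "geo_point" },
        "payment_type":     { "type": "keyword" },
        "total_amount":     { "type": "scaled_float", "scaling_factor": 100 }
      }
    }
  }
}

keyword avoids the analyzed full-text machinery for payment_type, geo_point accepts the "lat,lon" strings produced by the CONCAT in the input statement, and scaled_float stores amounts as compressed integers in doc values.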