turbocharge your mysql analytics with elasticsearch
play

Turbocharge your MySQL analytics with ElasticSearch Guillaume - PowerPoint PPT Presentation

Turbocharge your MySQL analytics with ElasticSearch Guillaume Lefranc Data & Infrastructure Architect, Productsup GmbH Percona Live Europe 2017 About the Speaker Guillaume Lefranc Data Architect at Productsup Replication


  1. Turbocharge your MySQL analytics with ElasticSearch Guillaume Lefranc Data & Infrastructure Architect, Productsup GmbH Percona Live Europe 2017

  2. About the Speaker Guillaume Lefranc • Data Architect at Productsup • Replication Manager for MySQL and MariaDB - Lead Architect • DBA Manager at MariaDB Corporation • Infrastructure Consultant at Coinigy • DB Architect at dailymotion.com

  3. Takeaways In this presentation, we will speak about: ● How Elasticsearch works as a document and column store ● What are its strengths and weaknesses when it comes to analytics ● How to sync data with MySQL ● How to build aggregations

  4. Case study Disclaimer: This case study is about medium data, not big data

  5. Case study Disclaimer: This case study is about medium data, not big data What is medium data?

  6. Case study Disclaimer: This case study is about medium data, not big data What is medium data? Answer: from 100GB to a few TBs

  7. Case study Disclaimer: This case study is about medium data, not big data What is medium data? Answer: from 100GB to a few TBs Types of data: ● User activity (Clicks)

  8. Case study Disclaimer: This case study is about medium data, not big data What is medium data? Answer: from 100GB to a few TBs Types of data: ● User activity (Clicks) ● Market Data

  9. Case study Disclaimer: This case study is about medium data, not big data What is medium data? Answer: from 100GB to a few TBs Types of data: ● User activity (Clicks) ● Market Data ● Trips

  10. Use Cases Business cases: ● A ride-sharing app Example dataset: NYC Taxi Data (6 months: 78 million trips)

  11. Use Cases Business cases: ● A ride-sharing app Example dataset: NYC Taxi Data (6 months: 78 million trips) ● Cryptocurrency market data 200 million documents per month - Courtesy of coinigy.com

  12. MySQL and Medium Data ● Medium data can scale well in MySQL (SELECT … WHERE id = ?)

  13. MySQL and Medium Data ● Medium data can scale well in MySQL (SELECT … WHERE id = ?) ● … not with Analytics

  14. MySQL and Medium Data ● Medium data can scale well in MySQL (SELECT … WHERE id = ?) ● … not with Analytics ● Every case cannot be covered by an index

  15. MySQL and Medium Data ● Medium data can scale well in MySQL (SELECT … WHERE id = ?) ● … not with Analytics ● Every case cannot be covered by an index ● Aggregations can be slow, especially if doing table scans

  16. Elasticsearch - What? ● “You know, for Search” -> Inverted Index

  17. Elasticsearch - What? ● “You know, for Search” -> Inverted Index ● Document Store

  18. Elasticsearch - What? ● “You know, for Search” -> Inverted Index ● Document Store ● REST API ○ POST /market_data -d ‘{ "market": "BTC/USD", value: “3418.03953” }

  19. Elasticsearch - What? ● “You know, for Search” -> Inverted Index ● Document Store ● REST API ○ POST /market_data -d ‘{ "market": "BTC/USD", value: “3418.03953” } ● JSON Native

  20. Elasticsearch - Anatomy of a Document { "_index" : "raw-2016-07", "_type" : "market_raw", "_id" : "108051174765", "_score" : 7.455347, "_source" : { "quantity" : 0.64130859, "time_local" : "2016-07-12 06:45:34", "type" : "SELL", "market" : "USD/BTC", "total" : 414.52263332, "@timestamp" : "2016-07-12T06:45:34.000Z", "price" : 646.37, "exchange" : "BITS", "id" : 108051174765, "tradeid" : "11649948" }

  21. Elasticsearch - Why? ● Distributed

  22. Elasticsearch - Why? ● Distributed ● Fault Tolerant

  23. Elasticsearch - Why? ● Distributed ● Fault Tolerant ● Scales Horizontally

  24. Elasticsearch - Why? ● Distributed ● Fault Tolerant ● Scales Horizontally

  25. Elasticsearch - Column Store ● All fields are indexed by default

  26. Elasticsearch - Column Store ● All fields are indexed by default ● Query is (almost) always an index scan

  27. Elasticsearch - Column Store ● All fields are indexed by default ● Query is (almost) always an index scan ● Doc values ○ Field values serialized on disk ○ Not stored in the JVM Heap: OS cache reliant ○ Compression ○ By default: all numerics, geo_points, dates, IPs and not_analyzed strings (keywords)

  28. Elasticsearch - Column Store ● All fields are indexed by default ● Query is (almost) always an index scan ● Doc values ○ Field values serialized on disk ○ Not stored in the JVM Heap: OS cache reliant ○ Compression ○ By default: all numerics, geo_points, dates, IPs and not_analyzed strings (keywords) ● Not a general purpose column-store replacement

  29. Elasticsearch - Column Store ● All fields are indexed by default ● Query is (almost) always an index scan ● Doc values ○ Field values serialized on disk ○ Not stored in the JVM Heap: OS cache reliant ○ Compression ○ By default: all numerics, geo_points, dates, IPs and not_analyzed strings (keywords) ● Not a general purpose column-store replacement ● Not exactly fast as a document DB Bibliography: ● https://www.elastic.co/blog/elasticsearch-as-a-column-store ● https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

  30. Alternatives (Open Source) ● MariaDB Columnstore ● Yandex Clickhouse ● Apache Spark Bibliography: https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs -clickhouse-vs-apache-spark/

  31. MySQL Analytics: Performance Fast if data is in the index Very slow if linear scans have to be used Example: SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings FROM trips WHERE driver_id=102 GROUP BY date ORDER BY pickup_datetime DESC; -> returns in milliseconds SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings FROM trips GROUP BY date ORDER BY pickup_datetime DESC; -> 4 minutes

  32. Importing data: Logstash ● Logstash is an open source data collection engine with real-time pipelining capabilities. ● The L in ELK Stack ● ETL for ElasticSearch ● Pipeline model (input -> filter -> output)

  33. Importing data - Input input { jdbc { jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar" jdbc_driver_class => "com.mysql.jdbc.Driver" jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true" jdbc_user => "root" jdbc_password => "admin" jdbc_fetch_size => 100000 statement => "SELECT id, driver_id, passenger_id, pickup_datetime, dropoff_datetime, CONCAT(pickup_latitude, ',', pickup_longitude) AS pickup_location, CONCAT(dropoff_latitude,',', dropoff_longitude) AS dropoff_location, payment_type, total_amount from trips WHERE id > :sql_last_value" use_column_value => true tracking_column => id #schedule => "*/5 * * * *" }

  34. Importing data - Input ● Dealing with large result sets jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform? useCursorFetch =true" jdbc_fetch_size => 10000 ● SQL Last Value use_column_value => true tracking_column => id statement => "SELECT … WHERE id > :sql_last_value" ● Scheduler schedule => "*/5 * * * *" Reference: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html

  35. Importing Data - Filters Filters (Input Transformation) filter { mutate { convert => [ "pickup_datetime", "string" ] } date { match => [ "pickup_datetime", "ISO8601" ] } } ● Date field is used by partitioning, if there is no suitable field, Elasticsearch will use the current date and time

  36. Importing Data - Output output { elasticsearch { hosts => [ "es1:9200" ] user => "elastic" password => "elasticpassword" index => "taxi-%{+YYYY-MM}" document_type => "trips" document_id => "%{id}" } }

  37. Importing Data - Output ● Document partitioning index => "taxi-%{+YYYY-MM}" ● Matching ID with MySQL document_id => "%{id}"

  38. Schema Design - Indexes ● Document Partitioning ● ElasticSearch Types ● Number of Indices ● Number of Shards ● Replication Factor

  39. Schema Design - Indexes PUT /taxi { "settings" : { "index" : { "number_of_shards" : 5, "number_of_replicas" : 1 } } } PUT _template/template1 { “template”: “taxi*”, “settings”: {

  40. Schema Design - Mapping ● Indexing with the optimal type ● Avoiding Full Text Search indexing (aka text or “analyzed”) ● Type overview ○ keyword ○ long ○ byte ○ date ○ geo_point ○ scaled_float

Recommend


More recommend