Turbocharge your MySQL analytics with ElasticSearch
Guillaume Lefranc
Data & Infrastructure Architect, Productsup GmbH
Percona Live Europe 2017
About the Speaker
Guillaume Lefranc
● Data Architect at Productsup
● Replication Manager for MySQL and MariaDB - Lead Architect
● DBA Manager at MariaDB Corporation
● Infrastructure Consultant at Coinigy
● DB Architect at dailymotion.com
Takeaways
In this presentation, we will cover:
● How Elasticsearch works as a document and column store
● Its strengths and weaknesses when it comes to analytics
● How to sync data with MySQL
● How to build aggregations
Case study
Disclaimer: this case study is about medium data, not big data.
What is medium data? Answer: from 100GB to a few TBs.
Types of data:
● User activity (clicks)
● Market data
● Trips
Use Cases
Business cases:
● A ride-sharing app
  Example dataset: NYC Taxi Data (6 months: 78 million trips)
● Cryptocurrency market data
  200 million documents per month - courtesy of coinigy.com
MySQL and Medium Data
● Medium data can scale well in MySQL (SELECT … WHERE id = ?)
● … but not with analytics
● Not every case can be covered by an index
● Aggregations can be slow, especially when they require table scans
Elasticsearch - What?
● “You know, for Search” -> Inverted Index
● Document Store
● REST API
○ POST /market_data -d '{ "market": "BTC/USD", "value": "3418.03953" }'
● JSON Native
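The flip side of the POST above is a search request against the same REST API. A minimal sketch, assuming the market_data index from the example (the term query is illustrative, not from the original deck):

GET /market_data/_search
{
  "query": {
    "term": { "market": "BTC/USD" }
  }
}

Because every field is backed by the inverted index, this exact-match lookup is fast with no manual index definition.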
Elasticsearch - Anatomy of a Document
{
  "_index" : "raw-2016-07",
  "_type" : "market_raw",
  "_id" : "108051174765",
  "_score" : 7.455347,
  "_source" : {
    "quantity" : 0.64130859,
    "time_local" : "2016-07-12 06:45:34",
    "type" : "SELL",
    "market" : "USD/BTC",
    "total" : 414.52263332,
    "@timestamp" : "2016-07-12T06:45:34.000Z",
    "price" : 646.37,
    "exchange" : "BITS",
    "id" : 108051174765,
    "tradeid" : "11649948"
  }
}
Elasticsearch - Why?
● Distributed
● Fault Tolerant
● Scales Horizontally
Elasticsearch - Column Store
● All fields are indexed by default
● A query is (almost) always an index scan
● Doc values
○ Field values serialized on disk
○ Not stored in the JVM heap: relies on the OS cache
○ Compression
○ By default: all numerics, geo_points, dates, IPs and not_analyzed strings (keywords)
● Not a general-purpose column-store replacement
● Not exactly fast as a document DB
Bibliography:
● https://www.elastic.co/blog/elasticsearch-as-a-column-store
● https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html
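Since doc values are on by default for the types listed above, tuning mostly means switching them off. A hedged sketch, reusing field names from the market_raw document earlier (disabling doc_values on tradeid is an assumption about the workload, valid only if you never sort or aggregate on it):

PUT /raw-2016-07
{
  "mappings": {
    "market_raw": {
      "properties": {
        "market":  { "type": "keyword" },
        "price":   { "type": "double" },
        "tradeid": { "type": "keyword", "doc_values": false }
      }
    }
  }
}

Dropping doc values on a field saves disk and cache space but removes it from the column store entirely.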
Alternatives (Open Source)
● MariaDB ColumnStore
● Yandex ClickHouse
● Apache Spark
Bibliography:
https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/
MySQL Analytics: Performance
Fast if the data is covered by an index; very slow if a full scan is needed.
Example:

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
WHERE driver_id=102
GROUP BY date
ORDER BY pickup_datetime DESC;
-> returns in milliseconds

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
GROUP BY date
ORDER BY pickup_datetime DESC;
-> 4 minutes
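For comparison, a sketch of the second (slow) query as an Elasticsearch aggregation. It assumes the taxi-* indices built by the Logstash pipeline on the following slides, where pickup_datetime is parsed into @timestamp:

GET /taxi-*/_search
{
  "size": 0,
  "aggs": {
    "daily_earnings": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day"
      },
      "aggs": {
        "earnings": { "sum": { "field": "total_amount" } }
      }
    }
  }
}

Both fields live in doc values, so this runs as a column scan rather than a row-by-row table scan; adding a term filter on driver_id gives the per-driver variant.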
Importing data: Logstash
● Logstash is an open source data collection engine with real-time pipelining capabilities
● The L in ELK Stack
● ETL for ElasticSearch
● Pipeline model (input -> filter -> output)
Importing data - Input

input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
    jdbc_user => "root"
    jdbc_password => "admin"
    jdbc_fetch_size => 100000
    statement => "SELECT id, driver_id, passenger_id, pickup_datetime, dropoff_datetime,
      CONCAT(pickup_latitude, ',', pickup_longitude) AS pickup_location,
      CONCAT(dropoff_latitude, ',', dropoff_longitude) AS dropoff_location,
      payment_type, total_amount FROM trips WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => id
    #schedule => "*/5 * * * *"
  }
}
Importing data - Input
● Dealing with large result sets
  jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
  jdbc_fetch_size => 10000
● SQL last value
  use_column_value => true
  tracking_column => id
  statement => "SELECT … WHERE id > :sql_last_value"
● Scheduler
  schedule => "*/5 * * * *"
Reference: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
Importing Data - Filters
Filters (input transformation):

filter {
  mutate {
    convert => [ "pickup_datetime", "string" ]
  }
  date {
    match => [ "pickup_datetime", "ISO8601" ]
  }
}

● The date field is used for partitioning; if there is no suitable field, Elasticsearch will use the current date and time
Importing Data - Output

output {
  elasticsearch {
    hosts => [ "es1:9200" ]
    user => "elastic"
    password => "elasticpassword"
    index => "taxi-%{+YYYY-MM}"
    document_type => "trips"
    document_id => "%{id}"
  }
}
Importing Data - Output
● Document partitioning
  index => "taxi-%{+YYYY-MM}"
● Matching ID with MySQL
  document_id => "%{id}"
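Reusing the MySQL primary key as the Elasticsearch _id makes the pipeline idempotent: re-importing a row overwrites the same document instead of creating a duplicate. It also allows direct lookup by MySQL id; a sketch, with a hypothetical trip id and assuming a July 2016 pickup date:

GET /taxi-2016-07/trips/12345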
Schema Design - Indexes
● Document Partitioning
● ElasticSearch Types
● Number of Indices
● Number of Shards
● Replication Factor
Schema Design - Indexes

PUT /taxi
{
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    }
  }
}

PUT _template/template1
{
  "template": "taxi*",
  "settings": {
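The template body is cut off on the slide. A plausible completion, assuming it simply applies the same shard and replica settings to every monthly taxi-* index:

PUT _template/template1
{
  "template": "taxi*",
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

Templates matter here because the Logstash output creates a new taxi-YYYY-MM index every month, and each one should pick up the same settings automatically.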
Schema Design - Mapping
● Indexing with the optimal type
● Avoiding full-text search indexing (aka text or “analyzed”)
● Type overview
○ keyword
○ long
○ byte
○ date
○ geo_point
○ scaled_float
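A sketch of a trips mapping using these types. Field names follow the Logstash SELECT earlier; the specific type choices (e.g. scaled_float for dollar amounts) are assumptions, not the original deck's mapping:

PUT /taxi-2016-07
{
  "mappings": {
    "trips": {
      "properties": {
        "driver_id":        { "type": "long" },
        "passenger_id":     { "type": "long" },
        "pickup_datetime":  { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
        "pickup_location":  { "type": "geo_point" },
        "dropoff_location": { "type": "geo_point" },
        "payment_type":     { "type": "keyword" },
        "total_amount":     { "type": "scaled_float", "scaling_factor": 100 }
      }
    }
  }
}

keyword avoids the analyzed full-text machinery for payment_type, geo_point accepts the "lat,lon" strings produced by the CONCAT in the input statement, and scaled_float stores amounts as compressed integers in doc values.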