Large-scale data processing pipelines at trivago: a use case. 2016-11-15, Sevilla, Spain. Clemens Valiente
Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Originally a mathematician, studied at Uni Erlangen. At trivago for 5 years. Email: clemens.valiente@trivago.com, de.linkedin.com/in/clemensvaliente
Data-driven PR and External Communication: Price information collected from the various booking websites and shown to our visitors also gives us a thorough overview of trends and the development of hotel prices. This knowledge is then used by our Content Marketing & Communication department (CMC) to write stories and articles.
The past: Data pipeline 2010 – 2015
Java Software Engineering → Business Intelligence → CMC
The past: Data pipeline 2010 – 2015 – Facts & Figures

Price dimensions:
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions:
- Only single-night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- "Insert ignore": the first price per key wins

Size of data:
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015, on average around 100 million prices per day were written to BI
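The "insert ignore" rule above (the first price per key wins) can be sketched in a few lines; the key fields are assumptions for illustration, since the talk does not show the actual schema:

```python
# Sketch of the "insert ignore" deduplication: the first price seen per key
# wins, later prices for the same key are dropped. The key fields
# (hotel_id, website, arrival_date, collection_day) are assumptions.

def insert_ignore(store, hotel_id, website, arrival_date, collection_day, price):
    """Insert a price only if no price exists yet for this key."""
    key = (hotel_id, website, arrival_date, collection_day)
    if key not in store:
        store[key] = price
        return True   # price accepted
    return False      # a price already existed; the new one is ignored

prices = {}
insert_ignore(prices, 42, "booking_site_a", "2015-03-01", "2015-01-15", 99.0)
insert_ignore(prices, 42, "booking_site_a", "2015-03-01", "2015-01-15", 89.0)
# The first price (99.0) is kept
```

This mirrors the semantics of a MySQL-style INSERT IGNORE against a unique key, which is what makes the old pipeline lossy by design.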
Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future-proof)
• Reliable and resilient
• Low performance impact on the Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
• More prices: length of stay, room type, breakfast info, room category, domain
• With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
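A record in the richer log described above might look roughly like this; all field names are illustrative assumptions, since the talk only lists the dimensions, not the schema:

```python
from dataclasses import dataclass

# Illustrative sketch of the richer price record the refactored pipeline
# should log. Field names are assumptions; the talk lists only the
# dimensions (length of stay, room type, ...) and the price breakdown.

@dataclass
class PriceRecord:
    hotel_id: int
    website: str
    arrival_date: str
    length_of_stay: int        # no longer restricted to single-night stays
    room_type: str
    breakfast_included: bool
    room_category: str
    domain: str                # no longer only European visitors
    net_price: float
    gross_price: float
    city_tax: float
    resort_fee: float
    affiliate_fee: float
    vat: float
```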
Present data pipeline 2016 – ingestion: San Francisco, Düsseldorf, Hong Kong
Present data pipeline 2016 – processing: Camus (Kafka → HDFS); results delivered to CMC
Present data pipeline 2016 – facts & figures

Cluster specifications:
- 51 machines
- 1.7 PB disc space, 60% used
- 3.6 TB memory in YARN
- 1440 VCores (24-32 cores per machine)

Data size (price log):
- 2.6 trillion messages collected so far
- 7 billion messages/day
- 160 TB of data

Data processing:
- Camus: 30 mappers writing data in 10-minute intervals
- First aggregation/filtering stage in Hive runs in 30 minutes with 5 days of CPU time spent
- Impala queries across >100 GB of result tables usually done within a few seconds
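The headline figures above imply a sustained ingest rate and an average message size that are easy to back out; a quick sanity check, assuming load is spread evenly over the day:

```python
# Back-of-the-envelope check of the figures above (assumes even load).
messages_per_day = 7_000_000_000
seconds_per_day = 24 * 60 * 60

msgs_per_second = messages_per_day / seconds_per_day
print(f"~{msgs_per_second:,.0f} messages/second")   # ~81,019 messages/second

# Average message size implied by 160 TB over 2.6 trillion messages:
total_bytes = 160e12
total_messages = 2.6e12
print(f"~{total_bytes / total_messages:.0f} bytes/message")   # ~62 bytes/message
```

So the cluster sustains on the order of 80k messages/second of rather small messages, which is well within Kafka's comfort zone.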
Present data pipeline 2016 – results after one and a half years in production
• Very reliable, barely any downtime or service interruptions
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• CMC team is very happy:
• Faster results
• Better quality of results due to more data
• More detailed results
• => Shorter research phase, more and better stories
• => Fewer requests & less workload for BI
Present data pipeline 2016 – use cases & status quo

Uses for price information:
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage:
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.

Status quo:
- Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Future data pipeline 2016/2017
- Message format: CSV → Protobuf / Avro
- Ingestion: Kafka Connect or Gobblin (replacing Camus)
- Stream processing: Kafka Streams, Streaming SQL
- Kylin / HBase
- Results delivered to CMC
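The CSV → Protobuf/Avro move is mostly about compactness and schemas. As an illustration of the size difference, here is a sketch using Python's struct as a stand-in for a real schema-based binary format; the three-field layout is an assumption, not the actual message:

```python
import struct

# Why CSV → binary (Protobuf / Avro): a schema-based binary encoding is more
# compact than textual CSV and carries typed fields. struct is used here as a
# stand-in for a real format; the field layout is an assumption.

hotel_id, price_cents, length_of_stay = 1234567, 9900, 3

csv_msg = f"{hotel_id},{price_cents},{length_of_stay}\n".encode()
binary_msg = struct.pack("<IIH", hotel_id, price_cents, length_of_stay)

print(len(csv_msg), len(binary_msg))   # e.g. 15 vs 10 bytes here
```

Real Protobuf/Avro adds varint/zig-zag encodings and schema evolution on top, so the gap on wide records like the price log is typically larger still.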
Future data pipeline 2016/2017 – Kafka Streams local state, queried by CMC
* https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
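The idea from the linked post is that each stream processor keeps a queryable local key-value store, updated as records flow through, instead of writing aggregates out to an external database. A minimal, broker-free sketch of that pattern (record shape and names are assumptions):

```python
# Minimal sketch of the "queryable local state" pattern from the linked post:
# a stream processor maintains a local key-value store updated per record,
# and serves interactive queries directly from it. No real Kafka involved;
# the record shape and names are assumptions for illustration.

class PriceAggregator:
    def __init__(self):
        self.local_state = {}   # hotel_id -> (count, running mean price)

    def process(self, record):
        hotel_id, price = record
        count, mean = self.local_state.get(hotel_id, (0, 0.0))
        count += 1
        mean += (price - mean) / count   # incremental mean update
        self.local_state[hotel_id] = (count, mean)

    def query(self, hotel_id):
        """Interactive query against local state (e.g. served to CMC)."""
        return self.local_state.get(hotel_id)

agg = PriceAggregator()
for rec in [(1, 100.0), (1, 120.0), (2, 80.0)]:
    agg.process(rec)
print(agg.query(1))   # (2, 110.0)
```

In Kafka Streams proper, the store would be a fault-tolerant state store backed by a changelog topic, exposed through the interactive-queries API.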
Key challenges and learnings

Mastering Hadoop:
- Finding your log files
- Interpreting error messages correctly
- Understanding settings and how to use them to solve problems
- Store data in wide, denormalised Hive tables in Parquet format with nested data types

Using Hadoop:
- Offer easy Hadoop access to users (Impala / Hive JDBC with visualisation tools)
- Educate users on how to write good code; strict guidelines and code review
- Deployment process: Jenkins deploys a git repository with Oozie definitions and Hive scripts to HDFS