Large-scale data processing pipelines at trivago: a use case. 2016-11-15, Sevilla, Spain. Clemens Valiente
Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Originally a mathematician, studied at Uni Erlangen. At trivago for 5 years. Email: clemens.valiente@trivago.com, de.linkedin.com/in/clemensvaliente
Data-driven PR and External Communication: Price information collected from the various booking websites and shown to our visitors also gives us a thorough overview of trends and the development of hotel prices. This knowledge is then used by our Content Marketing & Communication department (CMC) to write stories and articles.
The past: Data pipeline 2010 – 2015
Java Software Engineering → Business Intelligence → CMC
The past: Data pipeline 2010 – 2015 – Facts & Figures

Price dimensions:
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions:
- Only single-night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- "Insert ignore": the first price per key wins

Size of data:
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015, on average around 100 million prices per day were written to BI
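The "insert ignore" rule above (the first price per key wins) can be sketched in a few lines; the key fields are assumptions for illustration, since the talk does not show the actual schema:

```python
# Sketch of the "insert ignore" deduplication: the first price seen per key
# wins, later prices for the same key are dropped. The key fields
# (hotel_id, website, arrival_date, collection_day) are assumptions.

def insert_ignore(store, hotel_id, website, arrival_date, collection_day, price):
    """Insert a price only if no price exists yet for this key."""
    key = (hotel_id, website, arrival_date, collection_day)
    if key not in store:
        store[key] = price
        return True   # price accepted
    return False      # a price already existed; the new one is ignored

prices = {}
insert_ignore(prices, 42, "booking_site_a", "2015-03-01", "2015-01-15", 99.0)
insert_ignore(prices, 42, "booking_site_a", "2015-03-01", "2015-01-15", 89.0)
# The first price (99.0) is kept
```

This mirrors the semantics of a MySQL-style INSERT IGNORE against a unique key, which is what makes the old pipeline lossy by design.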
Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future-proof)
• Reliable and resilient
• Low performance impact on the Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
• More prices: length of stay, room type, breakfast info, room category, domain
• With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
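A record in the richer log described above might look roughly like this; all field names are illustrative assumptions, since the talk only lists the dimensions, not the schema:

```python
from dataclasses import dataclass

# Illustrative sketch of the richer price record the refactored pipeline
# should log. Field names are assumptions; the talk lists only the
# dimensions (length of stay, room type, ...) and the price breakdown.

@dataclass
class PriceRecord:
    hotel_id: int
    website: str
    arrival_date: str
    length_of_stay: int        # no longer restricted to single-night stays
    room_type: str
    breakfast_included: bool
    room_category: str
    domain: str                # no longer only European visitors
    net_price: float
    gross_price: float
    city_tax: float
    resort_fee: float
    affiliate_fee: float
    vat: float
```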
Present data pipeline 2016 – ingestion: San Francisco, Düsseldorf, Hong Kong
Present data pipeline 2016 – processing: Camus (Kafka → HDFS); results delivered to CMC
Present data pipeline 2016 – facts & figures

Cluster specifications:
- 51 machines
- 1.7 PB disc space, 60% used
- 3.6 TB memory in YARN
- 1440 VCores (24-32 cores per machine)

Data size (price log):
- 2.6 trillion messages collected so far
- 7 billion messages/day
- 160 TB of data

Data processing:
- Camus: 30 mappers writing data in 10-minute intervals
- First aggregation/filtering stage in Hive runs in 30 minutes with 5 days of CPU time spent
- Impala queries across >100 GB of result tables usually done within a few seconds
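The headline figures above imply a sustained ingest rate and an average message size that are easy to back out; a quick sanity check, assuming load is spread evenly over the day:

```python
# Back-of-the-envelope check of the figures above (assumes even load).
messages_per_day = 7_000_000_000
seconds_per_day = 24 * 60 * 60

msgs_per_second = messages_per_day / seconds_per_day
print(f"~{msgs_per_second:,.0f} messages/second")   # ~81,019 messages/second

# Average message size implied by 160 TB over 2.6 trillion messages:
total_bytes = 160e12
total_messages = 2.6e12
print(f"~{total_bytes / total_messages:.0f} bytes/message")   # ~62 bytes/message
```

So the cluster sustains on the order of 80k messages/second of rather small messages, which is well within Kafka's comfort zone.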
Present data pipeline 2016 – results after one and a half years in production
• Very reliable, barely any downtime or service interruptions
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• CMC team is very happy:
• Faster results
• Better quality of results due to more data
• More detailed results
• => Shorter research phase, more and better stories
• => Fewer requests & less workload for BI
Present data pipeline 2016 – use cases & status quo

Uses for price information:
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage:
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.

Status quo:
- Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Future data pipeline 2016/2017
- Message format: CSV → Protobuf / Avro
- Ingestion: Kafka Connect or Gobblin (replacing Camus)
- Stream processing: Kafka Streams, Streaming SQL
- Kylin / HBase
- Results delivered to CMC
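The CSV → Protobuf/Avro move is mostly about compactness and schemas. As an illustration of the size difference, here is a sketch using Python's struct as a stand-in for a real schema-based binary format; the three-field layout is an assumption, not the actual message:

```python
import struct

# Why CSV → binary (Protobuf / Avro): a schema-based binary encoding is more
# compact than textual CSV and carries typed fields. struct is used here as a
# stand-in for a real format; the field layout is an assumption.

hotel_id, price_cents, length_of_stay = 1234567, 9900, 3

csv_msg = f"{hotel_id},{price_cents},{length_of_stay}\n".encode()
binary_msg = struct.pack("<IIH", hotel_id, price_cents, length_of_stay)

print(len(csv_msg), len(binary_msg))   # e.g. 15 vs 10 bytes here
```

Real Protobuf/Avro adds varint/zig-zag encodings and schema evolution on top, so the gap on wide records like the price log is typically larger still.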
Future data pipeline 2016/2017 – Kafka Streams local state, queried by CMC
* https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
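The idea from the linked post is that each stream processor keeps a queryable local key-value store, updated as records flow through, instead of writing aggregates out to an external database. A minimal, broker-free sketch of that pattern (record shape and names are assumptions):

```python
# Minimal sketch of the "queryable local state" pattern from the linked post:
# a stream processor maintains a local key-value store updated per record,
# and serves interactive queries directly from it. No real Kafka involved;
# the record shape and names are assumptions for illustration.

class PriceAggregator:
    def __init__(self):
        self.local_state = {}   # hotel_id -> (count, running mean price)

    def process(self, record):
        hotel_id, price = record
        count, mean = self.local_state.get(hotel_id, (0, 0.0))
        count += 1
        mean += (price - mean) / count   # incremental mean update
        self.local_state[hotel_id] = (count, mean)

    def query(self, hotel_id):
        """Interactive query against local state (e.g. served to CMC)."""
        return self.local_state.get(hotel_id)

agg = PriceAggregator()
for rec in [(1, 100.0), (1, 120.0), (2, 80.0)]:
    agg.process(rec)
print(agg.query(1))   # (2, 110.0)
```

In Kafka Streams proper, the store would be a fault-tolerant state store backed by a changelog topic, exposed through the interactive-queries API.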
Key challenges and learnings

Mastering Hadoop:
- Finding your log files
- Interpreting error messages correctly
- Understanding settings and how to use them to solve problems
- Store data in wide, denormalised Hive tables in Parquet format with nested data types

Using Hadoop:
- Offer easy Hadoop access to users (Impala / Hive JDBC with visualisation tools)
- Educate users on how to write good code; strict guidelines and code review
- Deployment process: Jenkins deploys a git repository with Oozie definitions and Hive scripts to HDFS