
Large scale data processing pipelines at trivago: a use case

Large scale data processing pipelines at trivago: a use case. 2016-11-15, Sevilla, Spain. Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Originally a mathematician, studied at Uni Erlangen, at trivago for 5 years.


  1. Large scale data processing pipelines at trivago: a use case 2016-11-15, Sevilla, Spain Clemens Valiente

  2. Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Originally a mathematician, studied at Uni Erlangen, at trivago for 5 years. Email: clemens.valiente@trivago.com, de.linkedin.com/in/clemensvaliente

  3. Data driven PR and External Communication: Price information collected from the various booking websites and shown to our visitors also gives us a thorough overview of trends and developments in hotel prices. Our Content Marketing & Communication Department (CMC) then uses this knowledge to write stories and articles.

  9. The past: Data pipeline 2010 – 2015. Diagram, built up over slides 6 to 9, with the stages: Java Software Engineering, Business Intelligence, CMC.

  12. The past: Data pipeline 2010 – 2015: Facts & Figures

      Price dimensions
      - Around one million hotels
      - 250 booking websites
      - Travellers search for up to 180 days in advance
      - Data collected over five years

      Restrictions
      - Only single night stays
      - Only prices from European visitors
      - Prices cached up to 30 minutes
      - One price per hotel, website and arrival date per day
      - "Insert ignore": The first price per key wins

      Size of data
      - We collected a total of 56 billion prices in those five years
      - Towards the end of this pipeline in early 2015, on average around 100 million prices per day were written to BI
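The "insert ignore" restriction can be illustrated with a small sketch. The slides do not name the database or the table layout, so the following uses SQLite's `INSERT OR IGNORE` as a stand-in, and the table and column names (`price`, `hotel_id`, `website_id`, `arrival_date`, `log_date`, `price_eur`) are hypothetical. It only shows the behaviour described above: with one row allowed per hotel, website, arrival date and day, later prices for the same key are silently dropped.

```python
import sqlite3

# In-memory database; schema and names are illustrative assumptions,
# not the layout used in the original pipeline.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price (
        hotel_id     INTEGER,
        website_id   INTEGER,
        arrival_date TEXT,
        log_date     TEXT,
        price_eur    REAL,
        PRIMARY KEY (hotel_id, website_id, arrival_date, log_date)
    )
    """
)

observations = [
    (1, 10, "2015-03-01", "2015-01-15", 99.0),  # first price for this key: stored
    (1, 10, "2015-03-01", "2015-01-15", 89.0),  # same key, later that day: ignored
    (1, 11, "2015-03-01", "2015-01-15", 95.0),  # different website: stored
]

# INSERT OR IGNORE skips rows that would violate the primary key,
# so the first price per key wins.
conn.executemany("INSERT OR IGNORE INTO price VALUES (?, ?, ?, ?, ?)", observations)

rows = conn.execute(
    "SELECT hotel_id, website_id, price_eur FROM price ORDER BY website_id"
).fetchall()
print(rows)  # [(1, 10, 99.0), (1, 11, 95.0)]
```

The cheaper price of 89.0 never reaches the table, which is one reason this scheme limits what can be analysed later.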

  18. Refactoring the pipeline: Requirements
      - Scales with an arbitrary amount of data (future proof)
      - Reliable and resilient
      - Low performance impact on Java backend
      - Long term storage of raw input data
      - Fast processing of filtered and aggregated data
      - Open source
      - We want to log everything:
        - more prices: length of stay, room type, breakfast info, room category, domain
        - with more information: net & gross price, city tax, resort fee, affiliate fee, VAT
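To make the "log everything" requirement more concrete, here is a minimal sketch of what one logged price record could look like. The class name `PriceLogEntry`, the field names, the types and the sample values are all assumptions for illustration; the slides list only the kinds of information to be logged, not the actual message layout.

```python
from dataclasses import dataclass, asdict
from decimal import Decimal


@dataclass(frozen=True)
class PriceLogEntry:
    """Hypothetical record holding the attributes named on the requirements slide."""
    hotel_id: int
    website_id: int
    arrival_date: str
    # "More prices": attributes that distinguish several prices per hotel
    length_of_stay: int
    room_type: str
    breakfast_included: bool
    room_category: str
    domain: str
    # "More information": price components
    net_price: Decimal
    gross_price: Decimal
    city_tax: Decimal
    resort_fee: Decimal
    affiliate_fee: Decimal
    vat: Decimal


# Sample values are made up purely to show the shape of a record.
entry = PriceLogEntry(
    hotel_id=1, website_id=10, arrival_date="2016-12-24",
    length_of_stay=3, room_type="double", breakfast_included=True,
    room_category="standard", domain="example.com",
    net_price=Decimal("250.00"), gross_price=Decimal("297.50"),
    city_tax=Decimal("6.00"), resort_fee=Decimal("0.00"),
    affiliate_fee=Decimal("0.00"), vat=Decimal("47.50"),
)

record = asdict(entry)  # plain dict, ready to be serialised in any message format
print(sorted(record))
```

Compared with the old key of hotel, website and arrival date, every additional attribute multiplies the number of distinct prices, which is why the first requirement is scaling with an arbitrary amount of data.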

  21. Present data pipeline 2016 – ingestion. Diagram, built up over slides 19 to 21, with the locations: San Francisco, Düsseldorf, Hong Kong.

  25. Present data pipeline 2016 – processing. Diagram, built up over slides 22 to 25, with the labels: Camus, CMC.

  28. Present data pipeline 2016 – facts & figures

      Cluster specifications
      - 51 machines
      - 1.7 PB disc space, 60% used
      - 3.6 TB memory in Yarn
      - 1440 VCores (24-32 Cores per machine)

      Data Size (price log)
      - 2.6 trillion messages collected so far
      - 7 billion messages/day
      - 160 TB of data

      Data processing
      - Camus: 30 mappers writing data in 10 minute intervals
      - First aggregation/filtering stage in Hive runs in 30 minutes with 5 days of CPU time spent
      - Impala Queries across >100 GB of result tables usually done within a few seconds
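The figures on this slide can be put into perspective with some back-of-the-envelope arithmetic. The sketch below uses only numbers from the slide; it assumes decimal terabytes, that the 160 TB covers all 2.6 trillion messages, and that the messages are spread evenly over the day, none of which the slides confirm.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the slide
messages_per_day = 7e9
total_messages = 2.6e12
total_bytes = 160e12           # assumption: 160 TB in decimal terabytes
machines = 51
vcores = 1440
hive_cpu_minutes = 5 * 24 * 60 # 5 days of CPU time
hive_wall_minutes = 30         # wall-clock runtime of the Hive stage

# Derived, approximate values
messages_per_second = messages_per_day / SECONDS_PER_DAY
bytes_per_message = total_bytes / total_messages
vcores_per_machine = vcores / machines
avg_busy_cores = hive_cpu_minutes / hive_wall_minutes

print(f"{messages_per_second:,.0f} messages/s on average")        # 81,019
print(f"{bytes_per_message:.1f} bytes stored per message")         # 61.5
print(f"{vcores_per_machine:.1f} VCores per machine")              # 28.2
print(f"{avg_busy_cores:.0f} cores busy on average in the Hive stage")  # 240
```

Under these assumptions, the first Hive stage keeps roughly 240 of the 1440 VCores busy for half an hour.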

  29. Present data pipeline 2016 – results after one and a half years in production
      - Very reliable, barely any downtime or service interruptions of the system
      - Java team is very happy: less load on their system
      - BI team is very happy: more data, more resources to process it
      - CMC team is very happy
        - Faster results
        - Better quality of results due to more data
        - More detailed results
      - => Shorter research phase, more and better stories
      - => Fewer requests and less workload for BI

  32. Present data pipeline 2016 – use cases & status quo

      Uses for price information
      - Monitoring price parity in hotel market
      - Anomaly and fraud detection
      - Price feed for online marketing
      - Display of price development and delivering price alerts to website visitors

      Other data sources and usage
      - Clicklog information from our website and mobile app
      - Used for marketing performance analysis, product tests, invoice generation etc

      Status quo
      - Our entire BI business logic runs on and through the kafka – hadoop pipeline
      - Almost all departments rely on data, insights and metrics delivered by hadoop
      - Most of the company could not do their job without hadoop data

  38. Future data pipeline 2016/2017. Diagram, built up over slides 33 to 39, with the planned changes:
      - Message format: CSV → Protobuf / Avro
      - Camus replaced by Kafka Connect or Gobblin
      - Kylin / Hbase
      - Stream processing: Kafka Streams, Streaming SQL
      - CMC

  40. Future data pipeline 2016/2017. Diagram labels: CMC, Streams local state. * https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/

  42. Key challenges and learnings

      Mastering hadoop
      - Finding your log files
      - Interpreting error messages correctly
      - Understanding settings and how to use them to solve problems
      - Store data in wide, denormalised Hive tables in parquet format with nested data types

      Using hadoop
      - Offer easy hadoop access to users (Impala / Hive JDBC with visualisation tools)
      - Educate users on how to write good code, with strict guidelines and code review
      - Deployment process: jenkins deploys git repository with oozie definitions and hive scripts to hdfs
