Building highly reliable data pipelines @ Datadog


  1. Building highly reliable data pipelines @ Datadog Quentin FRANCOIS, Team Lead, Data Engineering DataEng Barcelona '18 1


  7. Reliability is the probability that a system will produce correct outputs up to some given time t. Source: E.J. McCluskey & S. Mitra (2004), "Fault Tolerance", in Computer Science Handbook, 2nd ed., ed. A.B. Tucker, CRC Press. 7

  8. Highly reliable data pipelines 1. Architecture 8

  9. Highly reliable data pipelines 1. Architecture 2. Monitoring 9

  10. Highly reliable data pipelines 1. Architecture 2. Monitoring 3. Failure handling 10

  11. Historical metric queries Time series data point: metric system.load.1, timestamp 1526382440, value 0.92, tags host:i-xyz,env:dev,... 11

  12. Historical metric queries 1 point/second 12

  13. Historical metric queries 1 point/day 13

  14. Historical metric queries The Rollups pipeline reads high-resolution (1 pt/sec) data from AWS S3 and writes low-resolution data (1 pt/min, 1 pt/hour, 1 pt/day). • Runs once a day. • Dozens of TBs of input data. • Trillions of points processed. 14
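
A minimal PySpark sketch of what one rollup step could look like, purely to make the idea concrete: it buckets 1 pt/sec points into hourly averages. The column names (metric, tags, timestamp, value), the S3 paths, and the choice of average as the aggregation are assumptions for illustration; the talk does not show Datadog's actual schema or code.

```python
# Illustrative rollup step: aggregate 1 pt/sec points into 1 pt/hour averages.
# Schema, paths, and the avg() aggregation are assumed for this sketch.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rollups-sketch").getOrCreate()

points = spark.read.parquet("s3://example-bucket/raw-points/")  # hypothetical path

hourly = (
    points
    # Truncate epoch-second timestamps down to the start of their hour.
    .withColumn("hour_bucket", (F.col("timestamp") / 3600).cast("long") * 3600)
    .groupBy("metric", "tags", "hour_bucket")
    .agg(F.avg("value").alias("value"), F.count("*").alias("point_count"))
)

hourly.write.mode("overwrite").parquet("s3://example-bucket/rollups/1h/")
```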

  15. Highly reliable data pipelines 1. Architecture 2. Monitoring 3. Failure handling 15

  16.-17. Our big data platform architecture: USERS (Datadog Web, CLI) -> Luigi scheduler (with monitoring) -> Spark WORKERS on EMR CLUSTERS -> DATA in S3.
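
The scheduler box in this architecture is Luigi. As a rough illustration of how a pipeline step can be wired up so that completed work is tracked in S3 and a rerun only redoes missing pieces, here is a toy Luigi task; the bucket, paths, and parameters are invented for the example and are not Datadog's.

```python
# Toy Luigi task: each step declares its output in S3, so the scheduler only
# recomputes targets that are missing. Paths and parameters are made up.
import luigi
from luigi.contrib.s3 import S3Target


class HourlyRollup(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return S3Target(f"s3://example-bucket/rollups/1h/{self.date:%Y-%m-%d}/_SUCCESS")

    def run(self):
        # Submit the Spark job for this day here (e.g. via spark-submit on EMR),
        # then write a success marker so Luigi knows the step is complete.
        with self.output().open("w") as marker:
            marker.write("done")


if __name__ == "__main__":
    luigi.run()
```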

  18. Many ephemeral clusters • New cluster for every pipeline. • Dozens of clusters at a time. • Median lifetime of ~3 hours. 18

  19. Total isolation We know what is happening and why. 19

  20. Pick the best hardware for each job: c3 instances for CPU-bound jobs, r3 instances for memory-bound jobs. 20

  21. Scale up/down clusters • If we are behind. • Scale as we grow. • No more waiting on loaded clusters. 21

  22. Safer upgrades of EMR/Hadoop/Spark: try the new release (5.13) on a single cluster while the others stay on 5.12. 22

  23.-24. Spot-instance clusters: + ridiculous savings (up to 80% off the on-demand price); - nodes can die at any time.
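
Tying slides 18, 20 and 23-24 together: one plausible way to launch a short-lived, per-pipeline EMR cluster with spot workers, a workload-appropriate instance type, and a tag for monitoring is the boto3 EMR API, sketched below. Every name, count, price, and tag here is an illustrative assumption, not Datadog's actual configuration.

```python
# Sketch: launch an ephemeral EMR cluster for one pipeline run with boto3.
# Spot workers give the savings; the cluster terminates when the job ends.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster = emr.run_job_flow(
    Name="rollups-2018-05-22",          # one cluster per pipeline run
    ReleaseLabel="emr-5.12.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "r3.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            # Memory-bound job -> r3 workers; a CPU-bound job would use c3 instead.
            {"InstanceRole": "CORE", "InstanceType": "r3.2xlarge",
             "InstanceCount": 20, "Market": "SPOT", "BidPrice": "0.50"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # cluster dies when the pipeline ends
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Tags=[{"Key": "pipeline", "Value": "rollups"}],  # used to slice metrics later
)
print(cluster["JobFlowId"])
```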

  25. How can we build highly reliable data pipelines with instances killed randomly all the time? 25

  26. No long running jobs • The longer the job, the more work you lose on average. • The longer the job, the longer it takes to recover. 26

  27. No long running jobs: timeline comparing Pipeline A and Pipeline B over time (hours). 27

  28. No long running jobs: the same timeline with a job failure; the longer-running pipeline loses more work and takes longer to recover. 28

  29. Break down jobs into smaller pieces Vertically - persist intermediate data between transformations. Horizontally - partition the input data. 29

  30. Example: Rollups pipeline. Input: raw time series data. Output: aggregated time series data (custom file format). 30

  31. Example: Rollups pipeline. Input: raw time series data. Step 1: aggregate high-resolution data. Step 2: store the aggregated data in our custom file format. Output: aggregated time series data (custom file format). 31

  32. Example, vertical split: Input: raw time series data. Step 1: aggregate high-resolution data. Checkpoint: aggregated time series data (Parquet format). Step 2: store the aggregated data in our custom file format. Output: aggregated time series data (custom file format). 32

  33. Example, horizontal split: the raw time series input is partitioned into A, B, C, D. Step 1: aggregate high-resolution data. Checkpoint: aggregated time series data (Parquet format). Step 2: store the aggregated data in our custom file format. Output: aggregated time series data (custom file format). 33

  34. Example, horizontal split: each partition A, B, C, D of the raw time series data goes through step 1 (aggregate high-resolution data, checkpointed as Parquet) and step 2 (store the aggregated data in our custom file format), producing partitioned output A, B, C, D. 34
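
A compact sketch of how the vertical and horizontal splits might look in PySpark: step 1 writes a Parquet checkpoint partitioned by shard, and step 2 then processes each shard independently, so losing spot nodes only forfeits one small piece of work. The shard column, paths, and the use of Parquet for the final output (standing in for the custom file format) are assumptions for illustration.

```python
# Vertical split: persist the intermediate result of step 1 as a Parquet checkpoint.
# Horizontal split: run step 2 shard by shard so each piece can be retried alone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-sketch").getOrCreate()

# Assumed input schema: metric, shard, hour_bucket, value (illustrative only).
raw = spark.read.parquet("s3://example-bucket/raw-points/2018-05-22/")

# Step 1 (vertical split): aggregate, then checkpoint the intermediate data.
aggregated = raw.groupBy("metric", "shard", "hour_bucket").avg("value")
aggregated.write.mode("overwrite").partitionBy("shard") \
    .parquet("s3://example-bucket/checkpoint/2018-05-22/")

# Step 2 (horizontal split): each shard is converted independently; a failure
# only forces a rerun of that shard, not the whole day.
for shard in ["A", "B", "C", "D"]:
    part = spark.read.parquet(f"s3://example-bucket/checkpoint/2018-05-22/shard={shard}/")
    part.write.mode("overwrite") \
        .parquet(f"s3://example-bucket/output/2018-05-22/shard={shard}/")
```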

  35. Break down jobs into smaller pieces: a trade-off between performance and fault tolerance. 35

  36. Lessons • Many clusters for better isolation. • Break down jobs into smaller pieces. • Trade-off between performance and fault tolerance. 36

  37. Highly reliable data pipelines 1. Architecture 2. Monitoring 3. Failure handling 37

  38. Cluster tagging: #anomaly-detection, #rollups 38

  39.-42. Monitor cluster metrics (dashboard screenshots).

  43. Monitor work metrics More details: datadoghq.com/blog/monitoring-spark/ 43

  44.-46. Monitor work metrics (dashboard screenshots).

  47. Monitor data lag: lag of the 1 point/sec data and of the 1 point/hour data. 47
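
One concrete way to produce a metric like this, sketched with the datadog Python client (DogStatsD): compute the lag as now minus the newest timestamp present in the output, and tag it by resolution. The metric name, the tags, and how the newest output timestamp is obtained are assumptions for illustration.

```python
# Report data lag as a gauge: how far behind real time the pipeline's output is.
import time
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def report_data_lag(resolution: str, newest_output_timestamp: float) -> None:
    # Lag = wall-clock now minus the newest point the pipeline has produced.
    lag_seconds = time.time() - newest_output_timestamp
    statsd.gauge("rollups.data_lag_seconds", lag_seconds,
                 tags=[f"resolution:{resolution}"])

# e.g. after the hourly rollup finishes:
# report_data_lag("1h", newest_output_timestamp=1526382000)
```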

  48. Lessons • Measure, measure and measure! • Alert on meaningful and actionable metrics. • High-level dashboards. 48

  49. Highly reliable data pipelines 1. Architecture 2. Monitoring 3. Failure handling 49


  51. Data pipelines will break: hardware failures, increasing volume of data, upstream delays, bad code changes. 51

  52. Data pipelines will break 1. Recover fast. 2. Degrade gracefully. 52

  53. Recover fast • No long running job. • Switch from spot to on-demand clusters. • Increase cluster size. 53

  54. Recover fast: easy way to rerun jobs • Needed when jobs run but produce some bad data. • Not always trivial. 54

  55. Example: rerun the rollups pipeline s3://bucket/ 2018-01 2018-02 2018-03 2018-04 2018-05 55

  56. Example: rerun the rollups pipeline s3://bucket/2018-05/ as-of_2018-05-01 as-of_2018-05-02 ... as-of_2018-05-21 56

  57. Example: rerun the rollups pipeline s3://bucket/2018-05/ as-of_2018-05-01 as-of_2018-05-02 ... as-of_2018-05-21 Active location 57

  58.-60. Example: rerun the rollups pipeline s3://bucket/2018-05/ as-of_2018-05-01 as-of_2018-05-02 ... as-of_2018-05-21 Active location as-of_2018-05-22

  61. Example: rerun the rollups pipeline s3://bucket/2018-05/ as-of_2018-05-01 as-of_2018-05-02 ... as-of_2018-05-21 as-of_2018-05-22 Active location as-of_2018-05-22_run-2 61
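
How the "active location" is tracked is not shown in the talk; one simple way to implement the same pattern is a small pointer object in S3 that readers consult and that a successful (re)run flips once its output is complete. The sketch below assumes such a pointer file; the bucket, key names, and helper functions are hypothetical.

```python
# "as-of" rerun pattern: every run writes to a fresh dated location, readers
# follow a pointer to the active one, and a rerun just flips the pointer.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"

def publish_run(month: str, run_name: str) -> None:
    """Switch readers to a new run, e.g. 'as-of_2018-05-22_run-2'."""
    s3.put_object(Bucket=BUCKET,
                  Key=f"{month}/_ACTIVE",
                  Body=f"s3://{BUCKET}/{month}/{run_name}/".encode())

def active_location(month: str) -> str:
    """Return the S3 prefix readers should use for this month."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"{month}/_ACTIVE")
    return obj["Body"].read().decode()

# A rerun writes its output under a new suffix, then flips the pointer:
# publish_run("2018-05", "as-of_2018-05-22_run-2")
```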

  62. Degrade gracefully (partitions A, B, C, D) • Isolate issues to a limited number of customers. • Keep functionality operational at the cost of performance/accuracy. 62

  63. Degrade gracefully: skip corrupted files • Job failure caused by a small amount of corrupted input data. • Don't ignore real widespread issues. 63
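
A sketch of one way to implement this knob in Spark: the built-in spark.sql.files.ignoreCorruptFiles option skips unreadable files, and a coarse row-count check fails the job if too much data went missing, so a widespread issue is not silently ignored. The 1% threshold, the paths, and the expected count are illustrative assumptions, not Datadog's actual logic.

```python
# Skip corrupted input files, but refuse to continue if too much data is lost.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("skip-corrupted-sketch")
         .config("spark.sql.files.ignoreCorruptFiles", "true")
         .getOrCreate())

rows = spark.read.parquet("s3://example-bucket/raw-points/2018-05-22/")
row_count = rows.count()

expected_row_count = 1_000_000_000  # e.g. taken from the previous day's run
if row_count < expected_row_count * 0.99:
    # Losing more than ~1% of rows suggests widespread corruption, not a few
    # bad files: fail loudly instead of degrading silently.
    raise RuntimeError(
        f"Only {row_count} rows read (expected ~{expected_row_count}); "
        "corruption looks widespread, refusing to continue.")
```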

  64. Lessons • Think about potential issues ahead of time. • Have knobs ready to recover fast. • Have knobs ready to limit the customer-facing impact. 64

  65. Conclusion Building highly reliable data pipelines 65

  66. Conclusion Building highly reliable data pipelines • Know your time constraints. 66

  67. Conclusion Building highly reliable data pipelines • Know your time constraints. • Break down jobs into small survivable pieces. 67

  68. Conclusion Building highly reliable data pipelines • Know your time constraints. • Break down jobs into small survivable pieces. • Monitor cluster metrics, job metrics and data lags. 68

  69. Conclusion Building highly reliable data pipelines • Know your time constraints. • Break down jobs into small survivable pieces. • Monitor cluster metrics, job metrics and data lags. • Think about failures ahead of time and get prepared. 69

  70. Thanks! We’re hiring! qf@datadoghq.com https://jobs.datadoghq.com 70
