Building highly reliable data pipelines @ Datadog
Quentin Francois, Team Lead, Data Engineering
DataEng Barcelona '18
Reliability is the probability that a system will produce correct outputs up to some given time t.
Source: E.J. McCluskey & S. Mitra (2004), "Fault Tolerance", in Computer Science Handbook, 2nd ed., ed. A.B. Tucker, CRC Press.
Highly reliable data pipelines
1. Architecture
2. Monitoring
3. Failure handling
Historical metric queries
Time series data: each point of a metric has a metric name (system.load.1), a timestamp (1526382440), a value (0.92), and tags (host:i-xyz, env:dev, ...).
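A minimal sketch of that shape as a data structure; the field names are illustrative, not Datadog's actual storage schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MetricPoint:
    metric: str       # e.g. "system.load.1"
    timestamp: int    # Unix epoch seconds, e.g. 1526382440
    value: float      # e.g. 0.92
    tags: List[str]   # e.g. ["host:i-xyz", "env:dev"]

point = MetricPoint("system.load.1", 1526382440, 0.92, ["host:i-xyz", "env:dev"])
```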
Historical metric queries: the same metric charted at 1 point/second and at 1 point/day resolution (chart screenshots).
Historical metric queries
The rollups pipeline reads high-resolution (1 pt/sec) data from AWS S3 and writes low-resolution data (1 pt/min, 1 pt/hour, 1 pt/day).
• Runs once a day.
• Dozens of TBs of input data.
• Trillions of points processed.
(The aggregation idea is sketched below.)
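A hedged PySpark sketch of the rollup idea: bucket 1 pt/sec points into coarser intervals and aggregate. The talk doesn't show code or say whether the real pipeline is PySpark or Scala; the column names, paths, and the use of avg() are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rollups-sketch").getOrCreate()

# Hypothetical input location; "tags" is assumed to be a single string column.
points = spark.read.parquet("s3://example-bucket/raw-points/")

def rollup(df, interval_seconds):
    # Truncate each timestamp to the start of its interval, then aggregate.
    bucket = (F.col("timestamp") / interval_seconds).cast("long") * interval_seconds
    return (df.groupBy("metric", "tags", bucket.alias("bucket_ts"))
              .agg(F.avg("value").alias("value")))

rollup(points, 60).write.parquet("s3://example-bucket/rollups/1min/")     # 1 pt/min
rollup(points, 3600).write.parquet("s3://example-bucket/rollups/1hour/")  # 1 pt/hour
rollup(points, 86400).write.parquet("s3://example-bucket/rollups/1day/")  # 1 pt/day
```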
Highly reliable data pipelines, part 1: Architecture
Our big data platform architecture
Users (the Datadog web app and a CLI) submit pipelines to a scheduler (Luigi, with monitoring); Spark workers run the jobs on many EMR clusters; the data lives in S3.
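Since the scheduler is Luigi, a pipeline step might be declared roughly like the sketch below. This is only an illustration: the task name, application file, output marker, and parameters are all hypothetical.

```python
import luigi
from luigi.contrib.s3 import S3Target
from luigi.contrib.spark import SparkSubmitTask

class RollupsJob(SparkSubmitTask):
    """Submit the rollups Spark job for one day of data."""
    day = luigi.DateParameter()
    app = "rollups_job.py"   # hypothetical Spark application
    master = "yarn"          # runs on an EMR cluster

    def app_options(self):
        return ["--day", self.day.isoformat()]

    def output(self):
        # A marker the scheduler checks so completed days are not rerun.
        return S3Target(f"s3://example-bucket/rollups/{self.day}/_SUCCESS")
```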
Many ephemeral clusters
• New cluster for every pipeline.
• Dozens of clusters at a time.
• Median lifetime of ~3 hours.
Total isolation
We know what is happening and why.
Pick the best hardware for each job
• c3 instances for CPU-bound jobs.
• r3 instances for memory-bound jobs.
Scale up/down clusters
• If we are behind.
• Scale as we grow.
• No more waiting on loaded clusters.
Safer upgrades of EMR/Hadoop/Spark
Try a new release (e.g. EMR 5.13) on a single cluster while the other clusters stay on 5.12.
Spot-instance clusters
+ Ridiculous savings (up to 80% off the on-demand price).
- Nodes can die at any time.
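A hedged boto3 sketch of launching one of these ephemeral, job-specific EMR clusters: an instance type picked for the workload, workers bought on the spot market, and the cluster tagged with the pipeline name. All names, sizes, prices, and roles are illustrative, not Datadog's actual configuration.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="rollups-2018-05-22",                # one cluster per pipeline run
    ReleaseLabel="emr-5.12.0",                # upgrades roll out one cluster at a time
    Applications=[{"Name": "Spark"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Tags=[{"Key": "pipeline", "Value": "rollups"}],
    Instances={
        "KeepJobFlowAliveWhenNoSteps": False,  # the cluster dies with the pipeline
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "r3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "SPOT",   # up to ~80% cheaper
             "InstanceType": "r3.2xlarge", "InstanceCount": 20,
             "BidPrice": "0.50"},              # but nodes can be reclaimed at any time
        ],
    },
)
print(response["JobFlowId"])
```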
How can we build highly reliable data pipelines with instances killed randomly all the time?
No long running jobs
• The longer the job, the more work you lose on average.
• The longer the job, the longer it takes to recover.
No long running jobs
Timeline comparison of two 9-hour pipelines: after a job failure at hour 7, the pipeline that is broken into short jobs finishes around hour 10, while the pipeline that runs as one long job has to restart from scratch and finishes around hour 16.
Break down jobs into smaller pieces
• Vertically: persist intermediate data between transformations.
• Horizontally: partition the input data.
Example: the rollups pipeline
Input: raw time series data. Output: aggregated time series data in our custom file format.
1. Aggregate the high-resolution data.
2. Store the aggregated data in our custom file format.
Vertical split: checkpoint the aggregated data between steps 1 and 2 in Parquet format.
Horizontal split: partition the data into independent pieces (A, B, C, D) that go through both steps separately (see the sketch below).
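A hedged PySpark sketch of that split: step 1 writes the aggregated data to a Parquet checkpoint partitioned into shards, and step 2 converts each shard independently, so a dead spot node only costs one small piece of work. The paths, the hash-based sharding, and write_custom_format() are hypothetical stand-ins.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rollups-split-sketch").getOrCreate()

CHECKPOINT = "s3://example-bucket/rollups/checkpoint/"
NUM_SHARDS = 4   # A, B, C, D in the slides

def write_custom_format(df, shard):
    # Placeholder for the (non-public) custom file format writer.
    df.write.mode("overwrite").parquet(f"s3://example-bucket/rollups/output/shard={shard}/")

# Step 1 (vertical split): aggregate, then persist the intermediate result.
raw = spark.read.parquet("s3://example-bucket/raw-points/")
aggregated = (raw.groupBy("metric", "tags")
                 .agg(F.avg("value").alias("value"))
                 # Horizontal split: hash each metric into one of a few shards.
                 .withColumn("shard", F.abs(F.hash("metric")) % NUM_SHARDS))
aggregated.write.partitionBy("shard").mode("overwrite").parquet(CHECKPOINT)

# Step 2: convert each shard on its own; a failure only redoes that shard.
for shard in range(NUM_SHARDS):
    part = spark.read.parquet(CHECKPOINT).where(F.col("shard") == shard)
    write_custom_format(part, shard)
```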
Break down jobs into smaller pieces: a trade-off between performance and fault tolerance.
Lessons
• Many clusters for better isolation.
• Break down jobs into smaller pieces.
• Trade-off between performance and fault tolerance.
Highly reliable data pipelines, part 2: Monitoring
Cluster tagging: #anomaly-detection #rollups
Monitor cluster metrics (dashboard screenshots).
Monitor work metrics (dashboard screenshots). More details: datadoghq.com/blog/monitoring-spark/
Monitor data lag
Track the lag of the 1 point/sec data and of the 1 point/hour data (dashboard screenshot).
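A hedged sketch of how such a data-lag metric could be emitted: how far behind real time the newest rolled-up point is. The metric name, tags, and the way the latest processed timestamp is obtained are assumptions; dogstatsd is used simply because Datadog's Python client exposes it.

```python
import time
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def report_data_lag(latest_processed_ts: int, resolution: str) -> None:
    """Emit the lag (in seconds) between now and the newest rolled-up point."""
    lag_seconds = time.time() - latest_processed_ts
    statsd.gauge("rollups.data_lag", lag_seconds, tags=[f"resolution:{resolution}"])

# e.g. after a successful 1 pt/hour run:
report_data_lag(latest_processed_ts=1526382000, resolution="1h")
```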
Lessons
• Measure, measure and measure!
• Alert on meaningful and actionable metrics.
• High-level dashboards.
Highly reliable data pipelines, part 3: Failure handling
Data pipelines will break: hardware failures, increasing volume of data, upstream delays, bad code changes.
Data pipelines will break
1. Recover fast.
2. Degrade gracefully.
Recover fast
• No long running jobs.
• Switch from spot to on-demand clusters.
• Increase cluster size.
Recover fast: easy way to rerun jobs
• Needed when jobs run but produce some bad data.
• Not always trivial.
Example: rerun the rollups pipeline
s3://bucket/ has one prefix per month: 2018-01, 2018-02, 2018-03, 2018-04, 2018-05.
Inside s3://bucket/2018-05/ there is one directory per run: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, and one of them is marked as the active location.
To rerun, write the output to a new directory (as-of_2018-05-22) while readers keep using the current active location, then switch the active location to the new directory once it is complete. A further rerun of the same day writes as-of_2018-05-22_run-2 and the active location moves again (see the sketch below).
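A hedged sketch of that pattern: every run writes to a fresh "as-of" prefix and a small pointer object records which prefix is active, so a rerun never overwrites data that readers are using. The bucket, key names, and the pointer file are illustrative; the talk doesn't say how the active location is actually tracked.

```python
import datetime
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"
MONTH_PREFIX = "2018-05/"

def new_run_prefix(run: int = 1) -> str:
    day = datetime.date.today().isoformat()
    suffix = "" if run == 1 else f"_run-{run}"
    return f"{MONTH_PREFIX}as-of_{day}{suffix}/"

def activate(prefix: str) -> None:
    # Readers resolve the active location through this tiny pointer object.
    s3.put_object(Bucket=BUCKET, Key=f"{MONTH_PREFIX}_ACTIVE", Body=prefix.encode())

def active_prefix() -> str:
    obj = s3.get_object(Bucket=BUCKET, Key=f"{MONTH_PREFIX}_ACTIVE")
    return obj["Body"].read().decode()

# Rerun: write the corrected output under a new prefix, then flip the pointer.
rerun = new_run_prefix(run=2)   # e.g. "2018-05/as-of_2018-05-22_run-2/"
# ... the pipeline writes its output files under s3://example-bucket/<rerun> ...
activate(rerun)
```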
Degrade gracefully
• Isolate issues to a limited number of customers (e.g. to one of the partitions A, B, C, D).
• Keep the functionality operational at the cost of performance/accuracy.
Degrade gracefully: skip corrupted files
• For job failures caused by a limited amount of corrupted input data (one possible mechanism is sketched below).
• Don't ignore real widespread issues.
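A hedged sketch of one way to do this in Spark; the talk doesn't say which mechanism Datadog uses. spark.sql.files.ignoreCorruptFiles is a real Spark setting, while the row-count guard afterwards is an illustrative safeguard against silently ignoring a widespread problem.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("skip-corrupted-sketch")
         .config("spark.sql.files.ignoreCorruptFiles", "true")  # skip unreadable files
         .getOrCreate())

df = spark.read.parquet("s3://example-bucket/raw-points/")

# Don't ignore a real widespread issue: fail loudly if too much data is missing.
row_count = df.count()
EXPECTED_MIN_ROWS = 1_000_000_000   # hypothetical sanity threshold
if row_count < EXPECTED_MIN_ROWS:
    raise RuntimeError(f"only {row_count} rows read; input looks badly corrupted")
```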
Lessons
• Think about potential issues ahead of time.
• Have knobs ready to recover fast.
• Have knobs ready to limit the customer-facing impact.
Conclusion: building highly reliable data pipelines
• Know your time constraints.
• Break down jobs into small survivable pieces.
• Monitor cluster metrics, job metrics and data lags.
• Think about failures ahead of time and get prepared.
Thanks! We're hiring!
qf@datadoghq.com
https://jobs.datadoghq.com