  1. ETL pipeline to achieve reliability at scale. By Isabel López Andrade.

  2. Accounting at Smarkets

  3. Accounting at Smarkets. Reports are generated daily using transactions from the exchange. In 2013, the average number of daily transactions was under 190K; in 2018, the figure is over 8.8M.

  4. Original pipeline
     • Difficult to identify errors.
     • Regenerating reports required manual work and expert knowledge of the system.
     • System too slow and unable to scale: it took more than one day to run.
     • Costly storage.

  5. [Pipeline diagram: Exchange → Generate daily transaction files → Generate account statistics → Generate daily and monthly accounting reports, backed by persistent storage.]
     Requirements:
     • Fault tolerance and reliability.
     • Fast I/O, availability, durability, and cost efficiency.
     • Good processing performance.
     • Scalable.

  6. Fault tolerance and reliability
     Vulnerabilities:
     • Communication with the exchange may fail.
     • Hardware or software errors may happen while the job is running.
     Design solutions:
     • Store transactions per day.
     • Compute financial statistics per day.
     • Retrieve the last two days' worth of transactions.
     • Break the accounting job into modular Luigi tasks (see the sketch below).
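
To illustrate the per-day, modular design: a Luigi task can take its date as a parameter, so each day is an independent, re-runnable unit. This is a minimal sketch; the task, helper, and path names are invented, not the deck's actual code.

    import datetime

    import luigi


    def fetch_transactions_from_exchange(date: datetime.date) -> list:
        """Stand-in for the real exchange query (hypothetical)."""
        return [(date.isoformat(), 'account-1', '10.00')]


    class GenerateDailyTransactionFile(luigi.Task):
        # One task instance per day: a failed day can be re-run in
        # isolation, and Luigi skips dates whose output already exists.
        date = luigi.DateParameter()

        def output(self) -> luigi.Target:
            return luigi.LocalTarget(f'data/transactions/{self.date:%Y-%m-%d}.tsv')

        def run(self) -> None:
            with self.output().open('w') as file_:
                for row in fetch_transactions_from_exchange(self.date):
                    file_.write('\t'.join(row) + '\n')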

  7. import luigi
     import pandas as pd

     # AccountingTask, GenerateAccountingReport, get_target() and the
     # operate() context manager are project-specific helpers; operate('r')
     # yields a readable local path for the target's stored Parquet file.
     class GenerateHumanReadableAccountingReport(AccountingTask):
         def requires(self) -> luigi.Task:
             return GenerateAccountingReport()

         def run(self) -> None:
             with self.input().operate('r') as target_path:
                 df_accounting = pd.read_parquet(target_path)
             with self.output().open('w') as file_:
                 df_accounting.to_csv(file_, sep='\t', index=False)

         def output(self) -> luigi.Target:
             return self.get_target(path='data/reports/accounting-report.tsv')
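
The deck does not show how the tasks are launched; one way is Luigi's programmatic entry point (a sketch):

    import luigi

    # Run the task and everything it requires with an in-process scheduler.
    if __name__ == '__main__':
        luigi.build(
            [GenerateHumanReadableAccountingReport()],
            local_scheduler=True,
        )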

  8. Efficient storage: Parquet
     • Columnar storage: only read the columns needed for the task, minimising I/O.
     • Efficient compression and encoding.
     • Python support.
     [Diagram: row-based vs column-based layout.]
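
Column pruning is the main win of the columnar layout; for example, with pandas (file and column names are invented, and a Parquet engine such as pyarrow is assumed):

    import pandas as pd

    df = pd.DataFrame({
        'account_id': [1, 2],
        'amount': [10.0, -3.5],
        'description': ['bet', 'refund'],
    })
    df.to_parquet('transactions.parquet', index=False)

    # Columnar storage: read back only the columns the task needs;
    # the 'description' column is never touched on disk.
    df_slim = pd.read_parquet('transactions.parquet',
                              columns=['account_id', 'amount'])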

  9. Efficient storage: Amazon S3
     • High durability.
     • High availability.
     • Low maintenance.
     • Cost efficient.
     • Decoupling of processing and storage.
     • Python library: boto/boto3.
     • Web interface.
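
For instance, persisting and retrieving a report with boto3 (bucket and key names are placeholders):

    import boto3

    s3 = boto3.client('s3')

    # Persist a day's report to S3.
    s3.upload_file(
        'data/reports/accounting-report.tsv',
        'example-accounting-bucket',
        'reports/2018-01-01/accounting-report.tsv',
    )

    # Any other machine (e.g. an EMR node) can fetch the same object,
    # which is what decouples processing from storage.
    s3.download_file(
        'example-accounting-bucket',
        'reports/2018-01-01/accounting-report.tsv',
        '/tmp/accounting-report.tsv',
    )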

  10. Good performance
      Requirements:
      • Fast data processing.
      • Scalable.
      Solution: Spark
      • General-purpose data processing engine.
      • Massively parallel.
      • Builds its own execution plans.
      • Caches data in RAM.
      • Python support.

  11. Spark key concepts
      RDD:
      • Resilient: fault-tolerant.
      • Distributed: partitioned across multiple nodes.
      • Dataset: collection of data.
      DataFrames:
      • Data organised in columns, built on top of RDDs.
      • Better performance than RDDs.
      • User-friendly API.
      [Diagram: Create RDD → Transformation (recorded as lineage) → Action → Result; a small PySpark illustration follows.]
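
To make the two concepts concrete, a minimal PySpark sketch (assuming a local Spark installation; the data and column names are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master('local[*]').appName('demo').getOrCreate()
    sc = spark.sparkContext

    # RDD: transformations (filter) are lazy and only record lineage;
    # the action (sum) triggers the actual distributed computation.
    amounts = sc.parallelize([10.0, -3.5, 42.0, -1.0])
    total_credits = amounts.filter(lambda x: x > 0).sum()

    # DataFrame: the same idea with named columns; Spark builds an
    # optimised execution plan before running it.
    df = spark.createDataFrame([(1, 10.0), (2, -3.5)], ['account_id', 'amount'])
    df.filter(df.amount > 0).show()

    spark.stop()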

  12. Execution on Spark

  13. Spark job from Luigi

      import luigi
      import pyspark
      from luigi.contrib.spark import PySparkTask

      # AccountingTask, GenerateAccountingReport, SMARKETS_ACCOUNT_ID and
      # the read_parquet/write_parquet helpers are project-specific; the
      # helpers move Spark DataFrames in and out of the Luigi targets.
      class GenerateSmarketsAccountReport(PySparkTask, AccountingTask):
          def requires(self) -> luigi.Task:
              return GenerateAccountingReport()

          def main(self, sc: pyspark.SparkContext) -> None:
              spark = pyspark.sql.SparkSession(sc)
              sdf_per_account = read_parquet(spark, self.input())
              sdf_smarkets = sdf_per_account.filter(
                  sdf_per_account.account_id == SMARKETS_ACCOUNT_ID
              )
              write_parquet(sdf_smarkets, self.output())

          def output(self) -> luigi.Target:
              return self.get_target(
                  path='data/reports/accounting-report-smarkets.parquet'
              )
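
The read_parquet and write_parquet helpers are not shown in the deck; a plausible minimal sketch, assuming each target exposes a .path pointing at a Parquet location (local or s3://):

    import luigi
    from pyspark.sql import DataFrame, SparkSession


    def read_parquet(spark: SparkSession, target: luigi.Target) -> DataFrame:
        # Hypothetical: load the Parquet data behind a Luigi target.
        return spark.read.parquet(target.path)


    def write_parquet(sdf: DataFrame, target: luigi.Target) -> None:
        # Hypothetical counterpart for persisting the result.
        sdf.write.parquet(target.path, mode='overwrite')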

  14. Scalability: Spark cluster on Amazon EMR
      • Fast deployment.
      • Easy to use.
      • Flexible.
      • Seamless integration with S3 via EMRFS.
      • Ability to shut down the cluster when the job is done, without data loss.
      • Low cost.
      • Nice web interface.
      [Diagram: submitting a step to the cluster; a boto3 sketch of it follows.]
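
What "submit step" amounts to through boto3's EMR client (cluster sizing, roles, and paths are invented; in the deck this is driven from Luigi rather than called directly):

    import boto3

    emr = boto3.client('emr')

    # Launch a transient Spark cluster.
    cluster = emr.run_job_flow(
        Name='accounting-etl',
        ReleaseLabel='emr-5.13.0',
        Applications=[{'Name': 'Spark'}],
        Instances={
            'MasterInstanceType': 'm4.large',
            'SlaveInstanceType': 'm4.large',
            'InstanceCount': 3,
            'KeepJobFlowAliveWhenNoSteps': True,
        },
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )

    # Submit the Spark application as a step on the running cluster.
    emr.add_job_flow_steps(
        JobFlowId=cluster['JobFlowId'],
        Steps=[{
            'Name': 'run-accounting-task',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', 's3://example-bucket/application_script.py'],
            },
        }],
    )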

  15.-21. Spark on EMR [one architecture diagram, built up across seven slides]:
      • The master node runs the YARN Resource Manager; the slave nodes host YARN containers.
      • (1) The Spark client creates the Spark Context and Spark Driver.
      • (2) The driver requests resources from the YARN Resource Manager.
      • (3) The Resource Manager allocates YARN containers on the slave nodes.
      • (4) A Spark executor starts inside each container.
      • (5) The driver schedules tasks onto the executors.
      • (6) The executors run the tasks and report back to the driver.
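
From the application's point of view this is all hidden behind session setup; on an EMR node the YARN configuration is already in place (a sketch):

    from pyspark.sql import SparkSession

    # 'yarn' delegates resource negotiation (steps 2-4 above) to the
    # YARN Resource Manager; the driver then schedules tasks (5-6).
    spark = (
        SparkSession.builder
        .master('yarn')
        .appName('accounting-etl')
        .getOrCreate()
    )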

  22. [Architecture diagram: submitting the accounting job to EMR from Luigi.]
      Luigi task (an EmrSparkSubmitTask that can be run successfully): Create Spark cluster on EMR → Upload task pickle to S3 → Add step to EMR cluster to execute the application script → Poll for step status.
      Luigi Event Handler: on SUCCESS or FAILURE, if no tasks are pending, destroy the EMR cluster.
      Application script: Create SparkContext → Unpickle the Luigi task instance → Run the Luigi task's main().
      Luigi task main(): Run the accounting job → DataFrame and RDD operations → Store the result in S3.
      Overall flow: Create accounting container → Start Luigi Central Scheduler → Run the accounting wrapper task → Create Spark cluster on EMR → Submit Luigi tasks to the EMR cluster → Destroy the EMR cluster.
      The accounting container and the EMR cluster share/save files using S3.
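
The deck does not include EmrSparkSubmitTask's code; the "poll for step status" stage might look roughly like this with boto3 (the cluster and step IDs come from the submission calls above):

    import time

    import boto3

    emr = boto3.client('emr')


    def wait_for_step(cluster_id: str, step_id: str, poll_seconds: int = 30) -> str:
        # Poll EMR until the submitted step reaches a terminal state.
        while True:
            step = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
            state = step['Step']['Status']['State']
            if state in ('COMPLETED', 'FAILED', 'CANCELLED', 'INTERRUPTED'):
                return state
            time.sleep(poll_seconds)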

  23. Thanks!

  24. [Logo slide: the technology stack, including Parquet.]

  25. Submit Spark application to EMR from Luigi
      [Same architecture diagram as slide 22.]
      The Luigi Event Handler listens for SUCCESS, FAILURE, PROCESS_FAILURE, TIMEOUT, BROKEN_TASK, and DEPENDENCY_MISSING events; when no tasks are pending, it destroys the EMR cluster. A handler sketch follows.
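
Luigi exposes these as event-handler hooks on tasks; a minimal sketch of wiring two of them up (destroy_cluster_if_idle is a hypothetical helper):

    import luigi


    def destroy_cluster_if_idle() -> None:
        """Hypothetical: tear down the EMR cluster when nothing is pending."""


    @luigi.Task.event_handler(luigi.Event.SUCCESS)
    def on_success(task: luigi.Task) -> None:
        # Fired by the worker whenever a task completes successfully.
        destroy_cluster_if_idle()


    @luigi.Task.event_handler(luigi.Event.FAILURE)
    def on_failure(task: luigi.Task, exception: Exception) -> None:
        # FAILURE handlers also receive the exception raised by run().
        destroy_cluster_if_idle()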

  26. Shutdown EMR cluster
      [Same architecture diagram as slide 22.]
      • A task won't raise an event if one of its dependencies has failed.
      • In case of a dependency failure, we want to destroy the cluster if the only tasks left depend on the failing task.
      • Information about pending tasks and task dependencies is fetched from the Luigi Central Scheduler.
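
The central scheduler exposes that information over its HTTP API (the same one its visualiser uses); a rough sketch of the pending-task check, with the scheduler URL as a placeholder and the response handling an assumption:

    import json

    import requests

    SCHEDULER_URL = 'http://localhost:8082'  # placeholder


    def pending_task_ids() -> list:
        # Query the scheduler's task list, filtered to PENDING tasks;
        # this mirrors the call the Luigi web UI makes.
        response = requests.get(
            SCHEDULER_URL + '/api/task_list',
            params={'data': json.dumps({'status': 'PENDING',
                                        'upstream_status': ''})},
        )
        response.raise_for_status()
        return list(response.json()['response'].keys())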
