Open Source Summit 2017
Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi
Sadayuki Furuhashi
An open-source hacker. A founder of Treasure Data, Inc., located in Silicon Valley.
GitHub: @frsyuki
OSS projects I founded:
What's a Workflow Engine?
• Automates your manual operations:
• Load data → Clean up → Analyze → Build reports
• Get customer list → Generate HTML → Send email
• Monitor server status → Restart on abnormal behavior
• Backup database → Alert on failure
• Run tests → Package it → Deploy (Continuous Delivery)
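The first chain above might be sketched as a Digdag workflow like this (a minimal sketch; the script names and schedule are hypothetical, `sh>` and `daily>` are standard Digdag operators):

```yaml
# daily_report.dig — hypothetical "load → clean up → analyze → report" pipeline
timezone: UTC

schedule:
  daily>: 02:00:00               # run automatically every day at 02:00

+load_data:
  sh>: scripts/load_data.sh      # Load data

+clean_up:
  sh>: scripts/clean_up.sh       # Clean up

+analyze:
  sh>: scripts/analyze.sh        # Analyze

+build_report:
  sh>: scripts/build_report.sh   # Build reports
```

Each `+name:` block is a task; tasks run sequentially by default.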
Challenge: Multiple Clouds & Regions
Different APIs, different tools, many scripts, plus on-premises environments.
Challenge: Multiple DB technologies
Amazon Redshift, Amazon S3, Amazon EMR, and always a newcomer: "Hi! I'm a new technology!"
Challenge: Modern complex data analytics
Ingest → Enrich → Model → Load → Utilize
• Ingest: application logs, user attribute data, ad impressions, 3rd-party cookie data
• Enrich: removing bot access, geolocation from IP address, parsing User-Agent, JOIN user attributes to event logs
• Model: creating indexes, data partitioning, data compression, statistics collection
• Utilize: recommendation API, A/B testing, funnel analysis, realtime ad bidding, segmentation analysis, visualize using BI applications, machine learning
Traditional "false" solution: a shell script

#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive $query
done
./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh

> Poor error handling
> Write once, nobody reads
> No alerts on failure
> No alerts on too-long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
Solution: Multi-Cloud Workflow Engine
Solves:
> Poor error handling
> Write once, nobody reads
> No alerts on failure
> No alerts on too-long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
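As a hedged sketch, the shell script from the previous slide could be rewritten as a Digdag workflow; the script names mirror the slide, the mail recipient is hypothetical, and `_retry`, `_error`, and `_parallel` are the Digdag mechanisms for retries, failure alerts, and parallelism:

```yaml
# pipeline.dig — sketch of the bash pipeline with error handling added
_error:
  mail>: templates/failure.txt   # alert on failure (hypothetical template)
  subject: workflow failed
  to: [ops@example.com]

+run_mysql_query:
  sh>: ./run_mysql_query.sh
  _retry: 3                      # retry this task up to 3 times on error

+load_in_parallel:
  _parallel: true                # these two tasks run concurrently
  +load_facebook_data:
    sh>: ./load_facebook_data.sh
  +rsync_apache_logs:
    sh>: ./rsync_apache_logs.sh

+run_redshift_queries:
  sh>: ./run_redshift_queries.sh
```

Task state is persisted, so a failed run can be retried and resumed from the failed task instead of starting over.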
Example in our case
1. Dump data to BigQuery
2. Load all tables to Treasure Data
3. Run queries
4. Create reports on Tableau Server (on-premises)
5. Notify on Slack
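The five steps above might be sketched as a single workflow like the following (operator choices, script names, and file paths are assumptions; steps without a dedicated operator fall back to `sh>`, while `td_load>` and `td>` are standard Treasure Data operators):

```yaml
# example_pipeline.dig — hypothetical sketch of the 5-step pipeline
+dump_to_bigquery:
  sh>: scripts/dump_bigquery.sh      # 1. dump data to BigQuery

+load_to_td:
  td_load>: config/load_tables.yml   # 2. load all tables to Treasure Data

+run_queries:
  td>: queries/aggregate.sql         # 3. run queries
  create_table: daily_summary

+refresh_tableau:
  sh>: scripts/refresh_tableau.sh    # 4. refresh reports on Tableau Server

+notify:
  sh>: scripts/notify_slack.sh       # 5. notify on Slack
```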
Workflow constructs
Unite Engineering & Analytic Teams

+wait_for_arrival:
  s3_wait>: bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql

Powerful for engineers
> Comfortable for advanced users
Friendly for analysts
> Still straightforward for analysts to understand & leverage workflows

+ is a task, > is an operator, ${...} is a variable
Operator library

Standard libraries:
  redshift>: runs Amazon Redshift queries
  emr>: creates/shuts down a cluster & runs steps
  s3_wait>: waits until a file is put on S3
  pg>: runs PostgreSQL queries
  td>: runs Treasure Data queries
  td_for_each>: repeats a task for result rows
  mail>: sends an email

Open-source libraries:
  You can release & use open-source operator libraries.

_export:
  td:
    database: workflow_temp

+task1:
  td>: queries/open.sql
  create_table: daily_open

+task2:
  td>: queries/close.sql
  create_table: daily_close
Parallel execution
Tasks under the same group run in parallel if the _parallel option is set to true.

+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql
Loops & Parameters
Loop: generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
Parameter: a task can propagate parameters to the following tasks.

+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      email>: template.txt
      to: ${td.for_each.addr}
Grouping workflows...
Without grouping: a flat sequence of +task steps spread across Ingest → Enrich → Model → Load → Utilize.
Grouping workflows
With grouping: tasks nest under parent tasks such as +ingest, +enrich, +model (containing +learn and +basket_analysis), and +load, mirroring the Ingest → Enrich → Model → Load → Utilize stages.
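The grouped structure above might be written like this (group names follow the slide; the task bodies and script names are hypothetical):

```yaml
# grouped.dig — sketch of nested task groups; child tasks inherit grouping
+ingest:
  +task:
    sh>: scripts/ingest.sh

+enrich:
  +task:
    sh>: scripts/enrich.sh

+model:
  +learn:
    sh>: scripts/learn.sh
  +basket_analysis:
    sh>: scripts/basket_analysis.sh

+load:
  +task:
    sh>: scripts/load.sh
```

Grouping keeps large workflows readable: the top level shows the pipeline stages, and each group hides its internal steps.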
Pushing workflows to a server with a Docker image
> Develop on a laptop, push it to a server.
> Workflows run periodically on a server.
> Backfill
> Web editor & monitor
> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.

schedule:
  daily>: 01:30:00

timezone: Asia/Tokyo

_export:
  docker:
    image: my_image:latest

+task:
  sh>: ./run_in_docker
Demo
Digdag is production-ready
It's just like a web application:
Digdag client → Digdag server (API & visual UI, scheduler & executor) → PostgreSQL (all task state)
Digdag is production-ready
Stateless servers + replicated DB:
Digdag client → HTTP load balancer → Digdag servers (API & visual UI, scheduler & executor) → HA PostgreSQL (all task state)
Digdag is production-ready
Isolating API and execution for reliability:
Digdag client → HTTP load balancer → Digdag servers (API & visual UI) → HA PostgreSQL (all task state) ← Digdag servers (scheduler & executor)
Digdag at Treasure Data
850 active workflows
3,600 workflows run every day
28,000 tasks run every day
400,000 workflow executions in total
Digdag & Open Source
Learning from my OSS projects
• Make it pluggable!
• 700+ plugins in 6 years: input/output and filter
• 200+ plugins in 3 years: input/output, parser/formatter, decoder/encoder, filter, and executor
• 70+ implementations in 8 years
Digdag also has a plugin architecture
32 operators, 7 schedulers, 2 command executors, 1 error notification module
Visit my website! https://digdag.io Sadayuki Furuhashi