
Automating Workflows for Analytics Pipelines (Sadayuki Furuhashi)

Open Source Summit 2017. Automating Workflows for Analytics Pipelines. Sadayuki Furuhashi: an open-source hacker and a founder of Treasure Data, Inc., located in Silicon Valley. GitHub: @frsyuki


  1. Open Source Summit 2017
     Automating Workflows for Analytics Pipelines
     Sadayuki Furuhashi

  2. Sadayuki Furuhashi
     An open-source hacker. A founder of Treasure Data, Inc., located in Silicon Valley.
     GitHub: @frsyuki
     OSS projects I founded:

  3. What's a Workflow Engine?
     • Automates your manual operations.
     • Load data → Clean up → Analyze → Build reports
     • Get customer list → Generate HTML → Send email
     • Monitor server status → Restart on abnormal
     • Backup database → Alert on failure
     • Run test → Package it → Deploy (Continuous Delivery)

  4. Challenge: Multiple Clouds & Regions
     Different APIs, different tools, many scripts.
     [Diagram label: On-Premises]

  5. Challenge: Multiple DB technologies
     Amazon Redshift, Amazon S3, Amazon EMR

  6. Challenge: Multiple DB technologies
     Amazon Redshift, Amazon S3, Amazon EMR
     > Hi!
     > I'm a new technology!

  7. Challenge: Modern complex data analytics
     Ingest → Enrich → Model → Load → Utilize
     • Ingest: Application logs, User attribute data, Ad impressions, 3rd-party cookie data
     • Enrich: Removing bot access, Geo location from IP address, Parsing User-Agent, JOIN user attributes to event logs
     • Model: A/B Testing, Funnel analysis, Segmentation analysis, Machine learning
     • Load: Creating indexes, Data partitioning, Data compression, Statistics collection
     • Utilize: Recommendation API, Realtime ad bidding, Visualize using BI applications

  8. Traditional "false" solution

         #!/bin/bash
         ./run_mysql_query.sh
         ./load_facebook_data.sh
         ./rsync_apache_logs.sh
         ./start_emr_cluster.sh
         for query in emr/*.sql; do
           ./run_emr_hive $query
         done
         ./shutdown_emr_cluster.sh
         ./run_redshift_queries.sh
         ./call_finish_notification.sh

     > Poor error handling
     > Write once, Nobody reads
     > No alerts on failure
     > No alerts on too long run
     > No retrying on errors
     > No resuming
     > No parallel execution
     > No distributed execution
     > No log collection
     > No visualized monitoring
     > No modularization
     > No parameterization

  9. Solution: Multi-Cloud Workflow Engine
     Solves:
     > Poor error handling
     > Write once, Nobody reads
     > No alerts on failure
     > No alerts on too long run
     > No retrying on errors
     > No resuming
     > No parallel execution
     > No distributed execution
     > No log collection
     > No visualized monitoring
     > No modularization
     > No parameterization
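
Two of the gaps listed above, "No retrying on errors" and "No alerts on failure", can be illustrated with a small Python sketch. It is not Digdag's implementation: `run_step`, `run_pipeline`, and the `alert` callback are hypothetical names invented for this illustration, and the retry policy (a fixed number of retries with an optional fixed delay) is an assumption.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_step(func: Callable[[], None], retries: int = 2,
             delay_seconds: float = 0.0) -> int:
    """Run one step, retrying on any exception. Returns the attempts used."""
    attempt = 0
    while True:
        attempt += 1
        try:
            func()
            return attempt
        except Exception:
            if attempt > retries:
                raise  # retries exhausted: let the caller handle the failure
            time.sleep(delay_seconds)


def run_pipeline(steps: List[Step],
                 alert: Callable[[str, Exception], None],
                 retries: int = 2) -> List[str]:
    """Run steps in order; alert and stop at the first step that keeps failing.

    Returns the names of the steps that completed.
    """
    completed: List[str] = []
    for name, func in steps:
        try:
            run_step(func, retries=retries)
        except Exception as error:
            alert(name, error)  # hypothetical hook, e.g. send a chat message
            return completed
        completed.append(name)
    return completed
```

Unlike the shell script on the previous slide, a step that fails temporarily is retried, and a step that keeps failing triggers the alert hook instead of silently letting later steps run.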

  10. Example in our case
      1. Dump data to BigQuery
      2. Load all tables to Treasure Data
      3. Run queries
      4. Create reports on Tableau Server (on-premises)
      5. Notify on Slack

  11. Workflow constructs

  12. Unite Engineering & Analytic Teams
      Powerful for Engineers
      > Comfortable for advanced users
      Friendly for Analysts
      > Still straightforward for analysts to understand & leverage workflows

          +wait_for_arrival:
            s3_wait>: |
              bucket/www_${session_date}.csv

          +load_table:
            redshift>: scripts/copy.sql

  13. Unite Engineering & Analytic Teams
      Powerful for Engineers
      > Comfortable for advanced users
      Friendly for Analysts
      > Still straightforward for analysts to understand & leverage workflows

          +wait_for_arrival:
            s3_wait>: |
              bucket/www_${session_date}.csv

          +load_table:
            redshift>: scripts/copy.sql

      Legend: + is a task, > is an operator, ${...} is a variable
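
How a `${...}` variable turns into a concrete value can be sketched in a few lines of Python. This is a simplified illustration, not Digdag's evaluator: it only does plain name lookup, `render` is a hypothetical helper, and the date value used below is made up for the example.

```python
import re

# Matches ${name}, where name may contain dots such as td.for_each.addr.
_VARIABLE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def render(template: str, params: dict) -> str:
    """Replace every ${name} in template with params[name].

    Raises KeyError for an undefined variable so that a typo in a
    workflow parameter fails loudly instead of producing a wrong path.
    """
    def lookup(match: "re.Match") -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"undefined variable: {name}")
        return str(params[name])

    return _VARIABLE.sub(lookup, template)


if __name__ == "__main__":
    # Hypothetical session date, used only for illustration.
    print(render("bucket/www_${session_date}.csv",
                 {"session_date": "2017-09-11"}))
```

With a session date of 2017-09-11, the `s3_wait>` path from the slide would resolve to `bucket/www_2017-09-11.csv`.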

  14. Operator library
      Standard libraries
        redshift>: runs Amazon Redshift queries
        emr>: creates/shuts down a cluster & runs steps
        s3_wait>: waits until a file is put on S3
        pg>: runs PostgreSQL queries
        td>: runs Treasure Data queries
        td_for_each>: repeats task for result rows
        mail>: sends an email
      Open-source libraries
        You can release & use open-source operator libraries.

          _export:
            td:
              database: workflow_temp

          +task1:
            td>: queries/open.sql
            create_table: daily_open

          +task2:
            td>: queries/close.sql
            create_table: daily_close

  15. Parallel execution
      Tasks under the same group run in parallel if the _parallel option is set to true.

          +load_data:
            _parallel: true

            +load_users:
              redshift>: copy/users.sql

            +load_items:
              redshift>: copy/items.sql
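
The effect of the `_parallel` option can be approximated in Python with a thread pool. This is only a sketch of the idea under stated assumptions, not how Digdag schedules tasks: `run_group` is a hypothetical helper, and each child task is modeled as a plain callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_group(tasks: Dict[str, Callable[[], Any]],
              parallel: bool = False) -> Dict[str, Any]:
    """Run the child tasks of one group and collect their results by name.

    With parallel=False the children run one after another, in order.
    With parallel=True every child is submitted to a thread pool, so the
    children can overlap; the first exception from a child is re-raised.
    """
    if not parallel:
        return {name: func() for name, func in tasks.items()}

    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {name: pool.submit(func) for name, func in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Stand-ins for the two redshift> tasks on the slide.
    results = run_group(
        {
            "+load_users": lambda: "users loaded",
            "+load_items": lambda: "items loaded",
        },
        parallel=True,
    )
    print(results)
```

The group finishes only when all of its children have finished, which is what lets a following task safely depend on both loads.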

  16. Loops & Parameters
      Parameter
        A task can propagate parameters to following tasks.
      Loop
        Generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.

          +send_email_to_active_users:
            td_for_each>: list_active.sql
            _do:
              +send:
                email>: template.txt
                to: ${td.for_each.addr}
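
The loop construct above, one subtask generated per result row, can be sketched in Python as follows. The helper name `expand_for_each`, the generated subtask names (`+send_0`, `+send_1`, ...), and the sample addresses are assumptions made for this illustration; only the `${td.for_each.<column>}` placeholder form is taken from the slide.

```python
from typing import Dict, List, Tuple


def expand_for_each(rows: List[Dict[str, object]],
                    subtask_template: Dict[str, str]
                    ) -> List[Tuple[str, Dict[str, str]]]:
    """Generate one concrete subtask per row.

    Every ${td.for_each.<column>} placeholder in the template values is
    replaced by that row's value for <column>.
    """
    subtasks = []
    for index, row in enumerate(rows):
        subtask = {}
        for key, value in subtask_template.items():
            for column, cell in row.items():
                value = value.replace("${td.for_each.%s}" % column, str(cell))
            subtask[key] = value
        subtasks.append((f"+send_{index}", subtask))
    return subtasks


if __name__ == "__main__":
    # Hypothetical query result: one row per active user.
    rows = [{"addr": "a@example.com"}, {"addr": "b@example.com"}]
    template = {"email>": "template.txt", "to": "${td.for_each.addr}"}
    for name, subtask in expand_for_each(rows, template):
        print(name, subtask)
```

The same template is applied to every row, so the number of subtasks depends on the query result rather than on what was written in the workflow file.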

  17. Grouping workflows...
      [Diagram: twelve ungrouped +task nodes laid out across the stages Ingest, Enrich, Model, Load, Utilize]

  18. Grouping workflows
      [Diagram: the same tasks nested under named groups such as +ingest, +enrich, +model, +load, +basket_analysis, and +learn, laid out across the stages Ingest, Enrich, Model, Load, Utilize]

  19. Pushing workflows to a server with Docker image
      Digdag server
      > Develop on laptop, push it to a server.
      > Workflows run periodically on a server.
      > Backfill
      > Web editor & monitor
      Docker
      > Install scripts & dependencies in a Docker image, not on a server.
      > Workflows can run anywhere including developer's laptop.

          schedule:
            daily>: 01:30:00

          timezone: Asia/Tokyo

          _export:
            docker:
              image: my_image:latest

          +task:
            sh>: ./run_in_docker
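
What a daily 01:30:00 schedule in the Asia/Tokyo timezone means for the next run time can be worked out with Python's standard library. This is an illustrative sketch, not Digdag's scheduler: `next_daily_run` is a hypothetical helper, and Asia/Tokyo is modeled as a fixed UTC+9 offset, which is an assumption that holds because Japan does not observe daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: Asia/Tokyo modeled as a fixed UTC+9 offset (no DST in Japan).
JST = timezone(timedelta(hours=9), "Asia/Tokyo")


def next_daily_run(now: datetime, at: time = time(1, 30, 0),
                   tz: timezone = JST) -> datetime:
    """Return the first occurrence of the daily time `at` strictly after `now`.

    `now` must be timezone-aware. The result is expressed in `tz`.
    """
    local_now = now.astimezone(tz)
    candidate = datetime.combine(local_now.date(), at, tzinfo=tz)
    if candidate <= local_now:
        candidate += timedelta(days=1)  # today's slot has already passed
    return candidate


if __name__ == "__main__":
    # Hypothetical current time, used only for illustration.
    now = datetime(2017, 9, 11, 20, 0, tzinfo=timezone.utc)
    print(next_daily_run(now).isoformat())
```

Because the schedule is evaluated in the workflow's timezone, the same definition produces the same local run time regardless of where the server itself is located.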

  20. Demo

  21. Digdag is production-ready
      It's just like a web application.
      [Diagram components: Digdag client; Digdag server (API & Visual UI, scheduler & executor); PostgreSQL (all task state)]

  22. Digdag is production-ready
      Stateless servers + Replicated DB
      [Diagram components: Digdag client; HTTP Load Balancer; Digdag servers (API & Visual UI, scheduler & executor); HA PostgreSQL (all task state)]

  23. Digdag is production-ready
      Isolating API and execution for reliability
      [Diagram components: Digdag client; HTTP Load Balancer; Digdag servers (API); Digdag servers (scheduler & executor); HA PostgreSQL (all task state)]

  24. Digdag at Treasure Data
      • 850 active workflows
      • 3,600 workflows run every day
      • 28,000 tasks run every day
      • 400,000 workflow executions in total

  25. Digdag & Open Source

  26. Learning from my OSS projects
      • Make it pluggable!
        700+ plugins in 6 years: input/output and filter
        200+ plugins in 3 years: input/output, parser/formatter, decoder/encoder, filter, and executor
        70+ implementations in 8 years

  27. Digdag also has a plugin architecture
      • 32 operators
      • 7 schedulers
      • 2 command executors
      • 1 error notification module

  28. Visit my website! https://digdag.io Sadayuki Furuhashi
