  1. Scalable Data Ingestion Architecture Using Airflow and Spark
     Johannes Leppä, Data Engineer, Komodo Health
     johannes.leppa@komodohealth.com
     Data Council, San Francisco, CA, April 17, 2019

  2. Agenda ❖ Komodo Health ❖ Data Ingestion Challenges ❖ Data Ingestion System Architecture ❖ Lessons Learned and Future Developments ❖ Scaling Processes ❖ Conclusions

  3. Our Mission To reduce the global burden of disease through the most actionable healthcare map

  4. Komodo Health™: Our Map Links Activities of the Entire Healthcare System
     Patient-centric, AI-powered linkages across:
     ● Payers: 500+ payers, $20B payments
     ● Providers: 3.5M doctors / nurses
     ● Institutions: 450K hospitals / clinics
     ● Biopharma
     ● Clinical Trials: 100k+ clinical trials
     ● Scientific Publications: 20M publications

  5. Agenda ❖ Komodo Health ❖ Data Ingestion Challenges ❖ Data Ingestion System Architecture ❖ Lessons Learned and Future Developments ❖ Scaling Processes ❖ Conclusions

  6. Variation in data size and cadence
     [Diagram: five external sources (Source 1-5), public and proprietary]
     ● Size of data
       ○ From MBs to TBs
     ● Refresh cadences:
       ○ Daily
       ○ Weekly
       ○ Monthly
       ○ Quarterly
       ○ Bi-annual
       ○ One-off
         ■ Historical drop followed by incremental additions

  7. Variation in access to raw data
     [Diagram: Sources 1-5 flow from External to Landed, keeping their original format]
     ● Several interfaces for data extraction:
       ○ SFTP
       ○ AWS S3
       ○ API
       ○ Download
       ○ Hard drive

  8. Variation in file formats
     [Diagram: External → Landed (original format) → Raw (Parquet)]
     ● Original file formats
       ○ CSV
       ○ XML
       ○ SAS
       ○ Fixed-width
       ○ Parquet
     ● Various compression formats
     ● Encrypted data
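     The deck does not say how encrypted, compressed drops are decoded on the way from Landed to Raw. As a minimal sketch only, assuming PGP-encrypted, gzip-compressed CSV files pulled from S3 with boto3 and python-gnupg; the bucket names, object key, and keyring path are hypothetical:

```python
import gzip

import boto3   # AWS SDK for Python
import gnupg   # python-gnupg wrapper around the gpg binary

s3 = boto3.client("s3")
gpg = gnupg.GPG(gnupghome="/etc/ingestion/gnupg")  # hypothetical keyring location

# Pull an encrypted, compressed drop from the landed bucket (names are hypothetical).
s3.download_file("landed-bucket", "source_1/drop.csv.gz.pgp", "/tmp/drop.csv.gz.pgp")

# Decrypt with a private key already imported into the keyring.
with open("/tmp/drop.csv.gz.pgp", "rb") as encrypted:
    result = gpg.decrypt_file(encrypted, output="/tmp/drop.csv.gz")
assert result.ok, result.status

# Decompress to plain CSV, ready for the Spark job that writes Parquet.
with gzip.open("/tmp/drop.csv.gz", "rb") as src, open("/tmp/drop.csv", "wb") as dst:
    dst.write(src.read())
```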

  9. Covering several aspects of the healthcare system
     [Diagram: External → Landed (original format) → Raw (Parquet) → Transformed (Parquet)]
     ● Several datasets covering a single aspect of healthcare
       ○ Different schemas
       ○ Different conventions
     ● Need to transform to a common schema

  10. Security and privacy
     [Diagram: External → Landed → Raw → Transformed]
     ● Access control
     ● Data encryption
     ● Compliance requirements

  11. Prior to a centralized data ingestion system
     ● Eternal question: what is the priority?
       ○ Scalability, maintainability, robustness, reliability
       ○ Rapid development

  12. Prior to a centralized data ingestion system
     ● Eternal question: what is the priority?
       ○ Scalability, maintainability, robustness, reliability
       ○ Rapid development ← the startup choice
         ■ Provide value to customers and show progress to investors
         ■ React to changing requirements

  13. Prior to a centralized data ingestion system
     ● Eternal question: what is the priority?
       ○ Scalability, maintainability, robustness, reliability
       ○ Rapid development ← the startup choice
         ■ Provide value to customers and show progress to investors
         ■ React to changing requirements
     ● Consequences:
       ○ Specialized pipelines
       ○ Manual operations
       ○ Variation in technologies and how they are used
       ○ Less reusable code

  14. Why did we build a centralized ingestion system?
     ● The previous approach was hard to maintain
       ○ Overhead in onboarding engineers to processes
       ○ Accumulation of manual tasks
     ● Project to integrate a few new data sources
       ○ Daily increments
       ○ Similar data sources
       ○ Opportunity: build the system for these sources and migrate other sources later
     ● Pros of an in-house implementation
       ○ Flexibility
       ○ Integrates with our tech stack
         ■ Leverages previous experience

  15. Agenda ❖ Komodo Health ❖ Data Ingestion Challenges ❖ Data Ingestion System Architecture ❖ Lessons Learned and Future Developments ❖ Scaling Processes ❖ Conclusions

  16. Overview of the system infrastructure
     ● Airflow
       ○ Organize workflows
       ○ Automation
       ○ Alerting
     ● Spark
       ○ Distributed processing
     ● Kubernetes
       ○ Container management
     ● AWS
       ○ EC2: servers
       ○ S3: data storage

  17. Airflow: Schedule workflows
     [Diagram: Airflow pulling from SFTP, AWS S3, API, and download sources through the Landed (original format), Raw (Parquet), and Transformed (Parquet) stages]
     Pros:
     ● DAGs written in Python
     ● Hooks to integrate with sources
     ● Operators for common tasks
     ● Alerts on success/failure
     ● Monitoring
     ● Parallelized DAGs and tasks

  18. Airflow: Schedule workflows
     Pros:
     ● DAGs written in Python
     ● Hooks to integrate with sources
     ● Operators for common tasks
     ● Alerts on success/failure
     ● Monitoring
     ● Parallelized DAGs and tasks
     Cons:
     ● Had to customize hooks and operators
       ○ Handling credentials
       ○ Needing additional S3 metadata
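     For readers unfamiliar with Airflow's Python DSL, here is a minimal sketch of the kind of ingestion DAG the slides describe, written against the Airflow 1.10-era API the talk mentions. The DAG id, schedule, and the two task callables are hypothetical, not Komodo's actual code:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def land_from_sftp(**context):
    """Placeholder: pull the latest drop from the source's SFTP server to S3."""
    ...

def convert_to_parquet(**context):
    """Placeholder: trigger the Spark job that writes the raw Parquet layer."""
    ...

default_args = {
    "owner": "ingestion",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,  # the alerting the slides mention
}

with DAG(
    dag_id="source_1_daily_ingestion",   # hypothetical source
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",          # matches a daily refresh cadence
    default_args=default_args,
) as dag:
    land = PythonOperator(task_id="land_from_sftp", python_callable=land_from_sftp)
    raw = PythonOperator(task_id="convert_to_parquet", python_callable=convert_to_parquet)
    land >> raw   # land the drop before converting it to Parquet
```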

  19. Spark: Distributed processing
     [Diagram: External → Landed (original format) → Raw (Parquet) → Transformed (Parquet)]
     Pros:
     ● Reliable
     ● Python and Scala APIs
     Cons:
     ● Performance tuning can be tricky
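     As a concrete illustration of the Landed → Raw step, a minimal PySpark sketch that reads a landed CSV drop and writes the raw Parquet layer; the S3 paths and read options are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source_1_landed_to_raw").getOrCreate()

# Read the landed drop in its original format (hypothetical path and layout).
landed = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")  # fine for a sketch; production code would pin a schema
    .csv("s3a://landed-bucket/source_1/2019-04-17/")
)

# Write the raw layer as Parquet, the common storage format across sources.
landed.write.mode("overwrite").parquet("s3a://raw-bucket/source_1/2019-04-17/")
```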

  20. Kubernetes: Container management
     [Diagram: Spark Master, Airflow Scheduler, and Airflow WebUI pods running on labeled nodes]
     Pros:
     ● Environments isolated to namespaces
     ● Node selectors for resource allocation (see the sketch below)
       ○ Nodes labeled based on the Auto Scaling Groups their instances are tied to
     ● Self-healing of pods!
     Cons:
     ● Occasional stability issues
       ○ Networking issues
     ● Difficult to troubleshoot
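     To make the node-selector idea concrete, a minimal sketch using the official kubernetes Python client; the namespace, label key/value, and image are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# Schedule a Spark worker pod onto nodes carrying a specific label, i.e. onto
# instances from a particular Auto Scaling Group (names are hypothetical).
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="spark-worker-0", namespace="ingestion"),
    spec=client.V1PodSpec(
        node_selector={"scaling-group": "spark-workers"},
        containers=[
            client.V1Container(name="spark-worker", image="example/spark-worker:2.4.0"),
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="ingestion", body=pod)
```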

  21. So far so good
     ● Scheduled execution
     ● Parallelized tasks
     ● Scalable resources
     ● Alerting
     ● Monitoring
     ● Resilient infrastructure
     ● Isolated environments

  22. Agenda ❖ Komodo Health ❖ Data Ingestion Challenges ❖ Data Ingestion System Architecture ❖ Lessons Learned and Future Developments ❖ Scaling Processes ❖ Conclusions

  23. Infra limitation: Spark scaled manually
     [Diagram: Spark worker pods, one per node]
     ● Big spikes in resource usage
       ○ Wasteful to keep scaled up
       ○ Scaling down is tricky
     ● Currently run big workloads on a separate cluster
       ○ Manual operation :(

  24. Infra limitation: Spark scaled manually
     [Diagram: two Spark worker pods landing on the same node]
     ● Big spikes in resource usage
       ○ Wasteful to keep scaled up
       ○ Scaling down is tricky
     ● Currently run big workloads on a separate cluster
       ○ Manual operation :(
     ● Two Spark workers on the same node resulted in double counting of Spark resources

  25. Automatic scaling under development
     [Diagram: Spark executor pods scheduled across nodes]
     ● Big spikes in resource usage
       ○ Wasteful to keep scaled up
       ○ Scaling down is tricky
     ● Currently run big workloads on a separate cluster
       ○ Manual operation :(
     ● Future solution:
       ○ Run Spark directly on Kubernetes
         ■ Introduced in Spark 2.4.0 for client mode
       ○ K8s autoscaler to scale nodes
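     A minimal sketch of what client-mode Spark on Kubernetes (supported as of Spark 2.4.0) looks like from PySpark: the driver runs locally and executors are launched as pods, which the cluster autoscaler can then add or remove nodes for. The API server URL, image, and instance count are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ingestion-on-k8s")
    # Point the master at the Kubernetes API server (hypothetical endpoint).
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.namespace", "ingestion")
    .config("spark.kubernetes.container.image", "example/spark:2.4.0")
    .config("spark.executor.instances", "8")  # executors come up as pods
    .getOrCreate()
)
```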

  26. Infra limitation: Scheduler a single point of failure
     [Diagram: Spark drivers and a file transfer running as subprocesses inside the single Airflow Scheduler pod]
     ● Using the local executor
       ○ Tasks executed as subprocesses of the scheduler
       ○ Scale resources vertically
       ○ Self-healing on failures? It depends...

  27. Infra limitation: Scheduler a single point of failure
     ● Using the local executor
       ○ Tasks executed as subprocesses of the scheduler
       ○ Scale resources vertically
       ○ Self-healing on failures? It depends...
     ● Issues in self-healing:
       ○ Inconsistency in the Airflow database
       ○ Dependency on a lost local file
       ○ Pod evicted due to disk pressure

  28. Why are you using the local executor?
     ● It has served us well, so far
       ○ It was enough when we started
       ○ Did not want to add complexity

  29. Automatic scaling under development, again
     [Diagram: the Airflow scheduler, Spark drivers, and file transfers each running in their own pods across nodes]
     ● It has served us well, so far
       ○ It was enough when we started
       ○ Did not want to add complexity
     ● Future solution:
       ○ Kubernetes executor
         ■ Introduced in Airflow 1.10.0
       ○ K8s autoscaler to scale nodes
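     With the Kubernetes executor (enabled by setting executor = KubernetesExecutor in airflow.cfg), each task instance runs in its own pod instead of as a scheduler subprocess, and per-task resources can be requested from the DAG via executor_config. A minimal sketch against the Airflow 1.10 API; the DAG, task, and resource numbers are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def transfer_file(**context):
    """Placeholder for a file transfer that previously ran inside the scheduler pod."""
    ...

with DAG(
    dag_id="source_1_k8s_executor",   # hypothetical
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
) as dag:
    # With executor = KubernetesExecutor in airflow.cfg, this task gets its own pod;
    # the resource requests below are hypothetical.
    PythonOperator(
        task_id="transfer_file",
        python_callable=transfer_file,
        executor_config={
            "KubernetesExecutor": {"request_cpu": "1", "request_memory": "2Gi"},
        },
    )
```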

  30. Agenda ❖ Komodo Health ❖ Data Ingestion Challenges ❖ Data Ingestion System Architecture ❖ Lessons Learned and Future Developments ❖ Scaling Processes ❖ Conclusions

  31. Beyond infra: scaling the ingestion processes
     ● Our data ingestion priorities:
       ○ Speed of data delivery
       ○ Data quality
       ○ Security and privacy
     ● Bottleneck is engineering time spent on integrating new data sources
       ○ Tools to simplify processes

  32. Early and fast iterations
     [Diagram: a data profiling step between Landed (original format) and Raw (Parquet), feeding commonization into Transformed (Parquet)]
     ● Data profiling tool:
       ○ Recognize columns
       ○ Validate raw data
       ○ Communicate issues with the source
       ○ Flag compliance risks
     ● Simplifies commonization
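     The slides do not show the profiling tool itself; below is a minimal PySpark sketch of the kind of per-column profile such a tool might compute (null rates and distinct counts) to help recognize columns and validate a drop. The path is hypothetical, and this is not Komodo's actual tool:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profile_source_1").getOrCreate()

# Profile a raw Parquet drop (hypothetical path).
df = spark.read.parquet("s3a://raw-bucket/source_1/2019-04-17/")
total = df.count()

profile = []
for column in df.columns:
    stats = df.agg(
        F.count(F.col(column)).alias("non_null"),        # count() skips nulls
        F.countDistinct(F.col(column)).alias("distinct"),
    ).first()
    profile.append({
        "column": column,
        "null_fraction": 1 - stats["non_null"] / total,
        "distinct_values": stats["distinct"],
    })

# A column with few distinct values may be an enum worth mapping to the common
# schema; a high null fraction is an issue to raise with the source.
for row in profile:
    print(row)
```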
