
Building Scalable Real-Time Data Pipeline “Data Fridge” - Vicente Valls Rios



  1. Building Scalable Real-Time Data Pipeline “Data Fridge” Vicente Valls Rios Software Engineering Manager

  2. !!! 2 million orders delivered in one day (July 2019) www.deliveryhero.com

  3. Data Volume in Numbers: 10M+ order-related events per day, plus user clicks, logistics data, restaurant availability, menu items, and customer data

  4. House of Brands & Global Services: Logistics, Search, Recommendation, ...

  5. Challenges

  6. Diagram: Data Producers (...) and Data Consumers (Search, Recommendation, Logistics, ...)

  7. Diagram: Data Producers (...) -> Data Fridge -> Data Consumers (Search, Recommendation, Logistics, ...)

  8. Mission ● Unify the data structure across all entities ● Provide different data consumption types: ○ Near real-time ○ Low-latency ○ High-latency (analytics) ● Become a data producer for ML applications

  9. Architecture

  10. Architecture (diagram): any Data Producer -> Data Fridge -> any Data Consumer. Components inside Data Fridge: Ingestion API, Streaming (events subscription), Low-latency API, Long-term storage (Analytics)

  11. Ingestion API

  12. Ingestion API ● Verifies quality of data using complex validations ● Single entrypoint ● Batch import ● AWS Lambda for event processing ● IP whitelisting ● JWT authentication (diagram: AWS API Gateway -> AWS Lambda)
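The validation and batch-import bullets above can be sketched as a Lambda-style handler. This is a minimal sketch, not the presenter's implementation: the field names in `REQUIRED_FIELDS`, the `207` status for partially accepted batches, and the shape of the response body are all assumptions made for illustration. JWT checks and IP whitelisting are assumed to happen before the handler runs and are not shown.

```python
import json

# Hypothetical event schema; the slides do not list the real fields.
REQUIRED_FIELDS = {
    "event_type": str,
    "entity_id": str,
    "timestamp": str,
    "payload": dict,
}


def validate(event):
    """Return a list of problems; an empty list means the event is valid."""
    if not isinstance(event, dict):
        return ["event must be a JSON object"]
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return problems


def handler(event, context=None):
    """Accept a single event or a batch (JSON array) in the request body."""
    try:
        batch = json.loads(event.get("body") or "[]")
    except json.JSONDecodeError:
        return {"statusCode": 400,
                "body": json.dumps({"error": "body is not valid JSON"})}
    if isinstance(batch, dict):
        batch = [batch]  # treat a single event as a batch of one
    if not isinstance(batch, list):
        return {"statusCode": 400,
                "body": json.dumps({"error": "body must be an object or array"})}

    accepted, rejected = [], []
    for index, item in enumerate(batch):
        problems = validate(item)
        if problems:
            rejected.append({"index": index, "problems": problems})
        else:
            accepted.append(item)

    # A real pipeline would forward `accepted` to the stream here.
    status = 200 if not rejected else 207
    return {"statusCode": status,
            "body": json.dumps({"accepted": len(accepted), "rejected": rejected})}
```

Reporting rejected items by index lets a producer resend only the failed part of a batch.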

  13. Canary Deployment ● Custom solution ● 2 versions under the same alias ● Metrics monitoring (diagram: alias PROD sends 90% of traffic to v1 and 10% to v2)
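The 90/10 split and the metrics check can be illustrated with a small simulation. The slides only say the solution is custom, so everything here is an assumption: the `WeightedAlias` class, the `should_promote` rule, and its `tolerance` parameter are hypothetical names, and the code models the routing decision locally rather than calling any AWS API.

```python
import random


class WeightedAlias:
    """Local model of one alias that splits traffic between versions."""

    def __init__(self, name, weights, seed=None):
        if abs(sum(weights.values()) - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1.0")
        self.name = name
        self.weights = dict(weights)
        self._rng = random.Random(seed)

    def pick_version(self):
        """Choose a version for one request according to the weights."""
        versions = list(self.weights)
        chosen = self._rng.choices(
            versions, weights=[self.weights[v] for v in versions], k=1
        )
        return chosen[0]


def should_promote(canary_errors, canary_requests,
                   baseline_errors, baseline_requests, tolerance=0.01):
    """Promote only if the canary error rate is within `tolerance` of baseline."""
    if canary_requests == 0 or baseline_requests == 0:
        return False  # not enough data to decide
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests
    return canary_rate <= baseline_rate + tolerance


alias = WeightedAlias("PROD", {"v1": 0.9, "v2": 0.1}, seed=42)
counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[alias.pick_version()] += 1
```

After the loop, `counts` holds roughly a 9:1 split; a monitoring job could feed per-version error counts into `should_promote` before shifting all traffic to v2.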

  14. Streaming

  15. Streaming ● Kinesis stream preserves messages up to 7 days ● Ability to replay data ● Ability to scale up/down (diagram: any Data Producer -> Ingestion API -> AWS Kinesis -> Lambda -> AWS SNS -> any Data Consumer)

  16. SNS instead of Kinesis? ● We do not care about the order of events. ● SNS provides message filtering. ● Kinesis requires scaling up shards as the number of consumers grows. ● Scaling up Kinesis is harder and more expensive than scaling SNS. ● SNS->HTTP/SQS provides both PUSH and PULL data consumption.

  17. Streaming ● 2 types of subscriptions ● Filtering based on event attributes (diagram: AWS SNS -> HTTPS (PUSH) -> any Data Consumer; AWS SNS -> AWS SQS (PULL) -> any Data Consumer)
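Attribute-based filtering can be pictured as a per-subscription policy that each message's attributes are checked against. The sketch below is a simplified local model, not the SNS implementation: it supports only exact-value and `prefix` conditions, it represents message attributes as a flat dictionary, and the attribute names `event_type` and `country` are hypothetical.

```python
def _condition_matches(condition, value):
    """One condition is either an exact value or a {"prefix": "..."} object."""
    if isinstance(condition, dict) and "prefix" in condition:
        return isinstance(value, str) and value.startswith(condition["prefix"])
    return condition == value


def matches(policy, attributes):
    """A message matches when every policy key is satisfied.

    Keys are combined with AND; the conditions listed under one key
    are combined with OR.
    """
    for key, conditions in policy.items():
        if key not in attributes:
            return False
        if not any(_condition_matches(c, attributes[key]) for c in conditions):
            return False
    return True


# Hypothetical policy for one consumer's subscription.
policy = {
    "event_type": ["order_created", "order_cancelled"],
    "country": [{"prefix": "D"}],
}
```

With this policy, a message carrying `event_type=order_created` and `country=DE` would be delivered, while one with `country=SG` would not. Filtering at the subscription keeps consumers from receiving and discarding events they do not need.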

  18. Stream Aggregation (diagram): Order Event (AWS SNS) and Order Status Event (AWS SNS/SQS) -> AWS Lambda, with AWS DynamoDB -> Order + Order Status Event (AWS SNS)
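One way to read the aggregation diagram is as a keyed join: each incoming event is stored under its order ID, and a combined event is emitted once both halves are known. The sketch below follows that reading under stated assumptions: the in-memory dictionary stands in for the DynamoDB table, the `publish` callback stands in for the outgoing SNS topic, and the event fields `order_id`, `kind`, and `data` are hypothetical.

```python
class OrderAggregator:
    """Joins order events with order status events by order ID."""

    def __init__(self, publish):
        self._store = {}         # stand-in for a DynamoDB table
        self._publish = publish  # stand-in for publishing to an SNS topic

    def handle(self, event):
        """Store one event; emit a combined event when both kinds are present."""
        order_id = event["order_id"]
        record = self._store.setdefault(order_id, {})
        record[event["kind"]] = event["data"]  # kind: "order" or "order_status"
        if "order" in record and "order_status" in record:
            self._publish({
                "order_id": order_id,
                "order": record["order"],
                "status": record["order_status"],
            })


published = []
aggregator = OrderAggregator(published.append)

# Events may arrive in either order; nothing is emitted until both exist.
aggregator.handle({"order_id": "o1", "kind": "order_status", "data": "ACCEPTED"})
aggregator.handle({"order_id": "o1", "kind": "order", "data": {"items": 2}})
aggregator.handle({"order_id": "o1", "kind": "order_status", "data": "DELIVERED"})
```

Because state is kept per order ID, the join does not depend on arrival order, which fits a pipeline that does not rely on event ordering.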

  19. Low-latency API

  20. Low-latency API ● Dead Letter Queue ● On-demand scaling (diagram components: AWS API Gateway, AWS Lambda, AWS DynamoDB, AWS SQS DLQ)
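The dead letter queue bullet describes a common pattern: retry a failing message a bounded number of times, then park it for later inspection instead of losing it or blocking the stream. In AWS the retry and redrive behaviour is configured on the service rather than hand-written, so the following is only a local sketch of the logic; `process_with_dlq`, `max_attempts`, and the dead-letter record shape are hypothetical.

```python
def process_with_dlq(messages, handler, dead_letters, max_attempts=3):
    """Run `handler` on each message; move persistent failures to `dead_letters`.

    Returns the number of messages processed successfully.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    processed = 0
    for message in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(message)
                processed += 1
                break
            except Exception as error:  # keep the failure for the DLQ record
                last_error = error
        else:
            # The loop finished without `break`: every attempt failed.
            dead_letters.append({
                "message": message,
                "error": repr(last_error),
                "attempts": max_attempts,
            })
    return processed
```

Keeping the error and attempt count next to the original message makes it possible to replay dead-lettered events once the underlying problem is fixed.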

  21. Analytics

  22. Analytics ● BigQuery as OLAP db ● Data quality visualization ● BigQuery is a scalable DWH solution (diagram: AWS Lambda -> Google BigQuery, with AWS SQS DLQ)

  23. Any Producer -> Ingestion API -> Streaming -> Any Consumer. First use case: Near real-time. Second use case: Low-latency API. Third use case: Analytics.

  24. Tech Stack

  25. Challenges

  26. Challenges ● SLAs (latency, durability, etc.) for some cloud services ● Data quality & data documentation ● GDPR ● Automating new pipeline creation ● Automating SNS subscriptions / BigQuery access

  27. We Are Hiring! www.deliveryhero.com
