Building Scalable Real-Time Data Pipeline “Data Fridge”
Vicente Valls Rios, Software Engineering Manager
2 million orders delivered in one day (July 2019)
www.deliveryhero.com
Data Volume in Numbers
10M Order-Related Events Per Day
+ User clicks
+ Logistics data
+ Restaurant availability
+ Menu items
+ Customer data
House of Brands & Global Services
Logistics, Search, Recommendation, ...
Challenges
[Diagram: Data Producers and Data Consumers; labels: Search, Recommendation, Logistics, ...]
[Diagram: Data Fridge placed between Data Producers and Data Consumers; labels: Search, Recommendation, Logistics, ...]
Mission
● Unify the data structure across all entities
● Provide different data consumption types:
  ○ Near real-time
  ○ Low-latency
  ○ High-latency (analytics)
● Become a data producer for ML applications
Architecture
[Diagram: Data Fridge]
any Data Producer -> Ingestion API
Low-latency API -> any Data Consumer
Streaming -> events subscription -> any Data Consumer
Long-term storage -> any Analytics
Ingestion API
● Verifies data quality using complex validations
● Single entry point
● Batch import
● AWS Lambda for event processing
● IP whitelisting
● JWT authentication
[Diagram: AWS API Gateway -> AWS Lambda]
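The validation and batch import steps can be sketched as below. This is a minimal illustration, not the Data Fridge implementation: the event fields (`event_type`, `entity_id`, `payload`), the allowed event types, and the batch limit are all assumptions, since the slides do not give the schema.

```python
from typing import Any, Dict, List

# Hypothetical values: the slides do not list the real event types or batch limit.
ALLOWED_EVENT_TYPES = {"order", "order_status"}
MAX_BATCH_SIZE = 100


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the event is valid."""
    errors = []
    for field in ("event_type", "entity_id", "payload"):
        if field not in event:
            errors.append(f"missing field: {field}")
    if "event_type" in event:
        event_type = event["event_type"]
        if not isinstance(event_type, str) or event_type not in ALLOWED_EVENT_TYPES:
            errors.append(f"unknown event_type: {event_type!r}")
    if "payload" in event and not isinstance(event["payload"], dict):
        errors.append("payload must be an object")
    return errors


def validate_batch(events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Split a batch into accepted events and rejections keyed by batch index."""
    if len(events) > MAX_BATCH_SIZE:
        raise ValueError(f"batch exceeds {MAX_BATCH_SIZE} events")
    accepted, rejected = [], {}
    for index, event in enumerate(events):
        errors = validate_event(event)
        if errors:
            rejected[index] = errors
        else:
            accepted.append(event)
    return {"accepted": accepted, "rejected": rejected}
```

Reporting rejections per index lets a producer resend only the failed events of a batch instead of the whole batch.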
Canary Deployment
● Custom solution
● 2 versions under the same alias
● Metrics monitoring
[Diagram: alias PROD routes 90% to v1 and 10% to v2]
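The 90%/10% split can be illustrated with a small weighted-routing sketch. The slide describes a custom solution whose details are not given, so the functions below are hypothetical stand-ins: `pick_version` shows how an alias with two weighted versions distributes requests, and `should_roll_back` shows one simple way monitored metrics could drive a rollback decision (the tolerance value is an assumption).

```python
import random


def pick_version(weights, rng=random.random):
    """Pick a version label with probability proportional to its weight."""
    if not weights:
        raise ValueError("at least one version is required")
    total = sum(weights.values())
    threshold = rng() * total
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return version
    return version  # guards against floating-point rounding at the upper edge


def should_roll_back(stable_error_rate, canary_error_rate, tolerance=0.01):
    """Flag the canary when its error rate exceeds the stable one by more than tolerance."""
    return canary_error_rate > stable_error_rate + tolerance


# Weights taken from the slide: alias PROD sends 90% to v1 and 10% to v2.
PROD_ALIAS = {"v1": 0.9, "v2": 0.1}
```

Passing `rng` explicitly keeps the routing deterministic in tests, for example `pick_version(PROD_ALIAS, rng=lambda: 0.95)` selects the canary version.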
Streaming
[Diagram: any Data Producer -> Ingestion API -> AWS Kinesis -> Lambda -> AWS SNS -> any Data Consumer]
● A Kinesis stream preserves messages for up to 7 days
● Ability to replay data
● Ability to scale up/down
SNS instead of Kinesis?
● We do not care about the order of events.
● SNS offers message filtering.
● Kinesis requires scaling up shards as our consumers grow.
● Scaling up Kinesis is harder and more expensive than scaling SNS.
● SNS->HTTP/SQS delivery provides both PUSH and PULL data consumption.
Streaming
● 2 types of subscriptions
● Filtering based on event attributes
[Diagram: AWS SNS -> HTTPS (PUSH) -> any Data Consumer; AWS SNS -> AWS SQS (PULL) -> any Data Consumer]
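Attribute-based filtering can be sketched as follows. An SNS subscription filter policy maps attribute names to lists of accepted values, and a message is delivered only if every attribute named in the policy matches. The matcher below is a simplified local model that handles exact value matching only (real SNS policies support further operators); the subscription names and attribute names are hypothetical.

```python
def matches(filter_policy, message_attributes):
    """Return True if every attribute in the policy has an accepted value."""
    for name, allowed_values in filter_policy.items():
        if name not in message_attributes:
            return False
        if message_attributes[name] not in allowed_values:
            return False
    return True  # an empty policy accepts every message


def fan_out(message_attributes, subscriptions):
    """List the subscriptions that would receive a message with these attributes."""
    return [
        name
        for name, policy in subscriptions.items()
        if matches(policy, message_attributes)
    ]


# Hypothetical consumers and attributes, for illustration only.
SUBSCRIPTIONS = {
    "search": {"event_type": ["menu_item"]},
    "logistics": {"event_type": ["order", "order_status"], "country": ["DE"]},
    "analytics": {},  # no filter: receives everything
}
```

With these subscriptions, an event carrying `{"event_type": "order", "country": "DE"}` reaches `logistics` and `analytics` but not `search`, so each consumer only pays for the traffic it asked for.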
Stream Aggregation
[Diagram: Order Event (AWS SNS) and Order Status Event (AWS SNS) -> AWS Lambda with AWS DynamoDB -> Order + Order Status Event (AWS SNS/SQS)]
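The aggregation in the diagram amounts to joining two event streams on a shared key. The sketch below assumes events carry an `order_id`, a `type`, and a `data` field (none of which are specified in the slides) and uses an in-memory dictionary as a stand-in for the DynamoDB table that holds partial state between Lambda invocations.

```python
class OrderAggregator:
    """Joins order and order-status events that share the same order_id."""

    KNOWN_TYPES = {"order": "order", "order_status": "status"}

    def __init__(self):
        # Stand-in for a DynamoDB table keyed by order_id (assumption).
        self._store = {}

    def handle(self, event):
        """Store the event; return the combined event once both halves are present."""
        event_type = event["type"]
        if event_type not in self.KNOWN_TYPES:
            raise ValueError(f"unsupported event type: {event_type!r}")
        order_id = event["order_id"]
        record = self._store.setdefault(order_id, {})
        record[self.KNOWN_TYPES[event_type]] = event["data"]
        if "order" in record and "status" in record:
            # In the pipeline this combined event would be published downstream.
            return {
                "order_id": order_id,
                "order": record["order"],
                "status": record["status"],
            }
        return None
```

Because the partial state is keyed by `order_id`, the join works regardless of which event arrives first, which fits a pipeline that does not rely on event ordering.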
Low-latency API
● Dead Letter Queue
● On-demand scaling
[Diagram components: AWS API Gateway, AWS Lambda, AWS DynamoDB, AWS SQS DLQ]
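The dead letter queue pattern can be sketched locally as below. In AWS the retry and redrive behavior is configured on the service rather than coded by hand (for example, an SQS redrive policy with `maxReceiveCount`); this illustration only models the idea, and the function name, the list-based queue, and the default attempt count are assumptions.

```python
def process_with_dlq(messages, handler, dead_letter_queue, max_attempts=3):
    """Run handler on each message; park messages that keep failing in the DLQ.

    Returns the number of successfully processed messages.
    """
    processed = 0
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed += 1
                break
            except Exception as error:
                if attempt == max_attempts:
                    # Keep the failure reason alongside the message for later inspection.
                    dead_letter_queue.append(
                        {"message": message, "error": repr(error)}
                    )
    return processed
```

Parking failed messages instead of dropping them means a bad record does not block the rest of the stream and can be inspected or replayed later.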
Analytics
● BigQuery as OLAP database
● Data quality visualization
● BigQuery is a scalable DWH solution
[Diagram components: AWS Lambda, Google BigQuery, AWS SQS DLQ]
[Diagram: Any Producer -> Ingestion API -> Streaming -> Any Consumer]
First use case: Near real-time
Second use case: Low-latency API
Third use case: Analytics
Tech Stack
Challenges
● SLAs (latency, durability, etc.) for some cloud services
● Data quality & data documentation
● GDPR
● Automating new pipeline creation
● Automating SNS subscriptions / BigQuery access
We Are Hiring! www.deliveryhero.com