Delivering Intelligence from space
● Crop Forecasting
● Pipeline Monitoring
● Market Intelligence Gathering
● Plane & Ship Tracking
7 Distributed Data Engineering - Lessons
1. Metrics
2. Logging
3. Frameworks
4. Serverless ETL
8 1. Metrics were lacking
9 Before

user_id  num_downloads  num_uploads
4        10             1
7        6              3

10 Before (Server: +1 on num_downloads)

user_id  num_downloads  num_uploads
4        11 (+1)        1
7        6              3
12 downloads

date                 user_id  image_size
2017-10-01 14:40:32  4        1365
2017-10-02 11:01:11  4        650
2017-10-02 11:06:00  5        9001
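The shift from in-place counters to one row per download can be sketched in Python. This is a minimal illustration, assuming the events live in a plain list; the function names `downloads_per_user` and `downloads_on` are hypothetical and not from the talk.

```python
from collections import Counter
from datetime import date, datetime

# Append-only event rows, mirroring the `downloads` table on the slide.
downloads = [
    {"date": datetime(2017, 10, 1, 14, 40, 32), "user_id": 4, "image_size": 1365},
    {"date": datetime(2017, 10, 2, 11, 1, 11), "user_id": 4, "image_size": 650},
    {"date": datetime(2017, 10, 2, 11, 6, 0), "user_id": 5, "image_size": 9001},
]


def downloads_per_user(events):
    """Rebuild the old num_downloads counter from the raw events."""
    return Counter(e["user_id"] for e in events)


def downloads_on(events, day):
    """Answer a question the counter table cannot: which downloads happened on a day."""
    return [e for e in events if e["date"].date() == day]


per_user = downloads_per_user(downloads)
on_oct_2 = downloads_on(downloads, date(2017, 10, 2))
```

The counter can always be recomputed from the events, but the events cannot be recovered from the counter, which is the argument for storing the rows.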
14 Source of Truth: Database
● Migration headaches
● Manage connections
● Performance
15 Source of Truth: Database → Logging
{
  "message": "Downloaded img",
  "userId": "1234",
  "imgId": "1d3x5",
  "service": "download-server",
  "time": "1509385330"
}
17 Two Kinds of Logs

Server logs
● Debugging
● Support
[Wed Oct 11 14:32:12 2000] [info] [client 127.0.0.1] image 1d3x5 downloaded by userId 1234

Metric logs
● Dashboards
● Analytics
{
  "message": "Downloaded img",
  "userId": "1234",
  "imgId": "1d3x5",
  "service": "download-server",
  "time": "1509385330"
}
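A metric log differs from a server log in that it is machine-readable. A minimal Python sketch of emitting one such JSON line is shown here; the helper name `metric_log` and its `now` parameter are assumptions for illustration, and only the field names come from the slide.

```python
import json
import time


def metric_log(message, service, now=None, **fields):
    """Serialize one metric event as a single JSON line."""
    ts = int(time.time() if now is None else now)
    record = {"message": message, "service": service, "time": str(ts)}
    record.update(fields)  # event-specific fields such as userId, imgId
    return json.dumps(record)


line = metric_log(
    "Downloaded img",
    "download-server",
    now=1509385330,  # fixed timestamp so the example is reproducible
    userId="1234",
    imgId="1d3x5",
)
print(line)
```

Because every line parses as JSON with the same keys, dashboards and analytics queries can consume it without regex parsing.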
19 Centralize
22 Metric Collector

import observatory

obs = observatory.Tracker()
obs.track('search_made', {
    'query': event.query,
    'n_results': len(resp['data']),
    'user_id': user_item.id
})

REST API → Lambda (Enrich / Conform) → Redshift, S3
23 After
● Centralized metrics
● Log enrichment
● Persistent store
REST → Lambda → Redshift, S3
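The "Enrich / Conform" step behind the REST API might look roughly like the following Lambda-style handler. This is a sketch under stated assumptions: the schema in `REQUIRED_FIELDS`, the added fields, and the function names are hypothetical, and the write to S3 or Redshift is left as a comment rather than implemented.

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED_FIELDS = ("event", "properties")  # hypothetical schema, not from the talk


def enrich(raw, now=None):
    """Conform a raw tracked event to a fixed schema and add server-side fields."""
    missing = [f for f in REQUIRED_FIELDS if f not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    now = now or datetime.now(timezone.utc)
    return {
        "event_id": str(uuid.uuid4()),
        "event": raw["event"],
        "properties": raw["properties"],
        "received_at": now.isoformat(),
    }


def handler(event, context):
    """Entry point for an API Gateway proxy request carrying a JSON body."""
    try:
        record = enrich(json.loads(event["body"]))
    except (ValueError, KeyError, TypeError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    # A real collector would persist `record` to S3 / Redshift here.
    return {"statusCode": 200, "body": json.dumps(record)}
```

Doing the enrichment centrally means every client library only has to send the event name and its properties.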
24 2. Debugging is painful
25 Before
EC2 → Filesystem
Lambda → CloudWatch
EC2 → Filesystem
26 Centralize
30 EC2, Lambda, EC2 → CloudWatch (stream) → Consumer → ES, SAAS, S3
31 After
● Elasticsearch
● Search by UUID
32 But! What does the full flow of a request look like?
36 Correlation ID
● UUID
● Create for any external call
● CID passed everywhere
● In ES → filter by CID
External Request → Service A → Service B, Service C (CID attached to every call)
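Correlation-ID propagation can be sketched in a few lines of Python. This is an illustrative sketch only: the header name `X-Correlation-ID` and the three service functions are hypothetical stand-ins for real network calls.

```python
import uuid

CID_HEADER = "X-Correlation-ID"  # hypothetical header name


def ensure_cid(headers):
    """Return a copy of headers with a correlation ID, creating one if absent."""
    out = dict(headers)
    out.setdefault(CID_HEADER, str(uuid.uuid4()))
    return out


def log_line(service, message, headers):
    """Build a log record that always carries the correlation ID."""
    return {"service": service, "message": message, "cid": headers[CID_HEADER]}


def service_b(headers):
    return [log_line("service-b", "handled", headers)]


def service_c(headers):
    return [log_line("service-c", "handled", headers)]


def service_a(headers):
    """Entry point for external requests: create the CID, then pass it on."""
    headers = ensure_cid(headers)
    logs = [log_line("service-a", "received external request", headers)]
    logs += service_b(headers)  # same headers, so the same CID
    logs += service_c(headers)
    return logs


logs = service_a({})
```

With every record tagged this way, filtering the log store on one `cid` value returns the full path of a single request across services.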
37 3. Building services is slow
38 Before
● Online console
● Zip file deployment
● Doesn’t scale
40 Infrastructure as Code
● Serverless Framework
● Template -> service
● Rapid deployment
42 # serverless.yml
Microservice: Internet gateway → Lambda handler → S3, DynamoDB

service: users
provider:
  name: aws
  runtime: nodejs6.10
  stage: dev
functions:
  usersCreate:
    handler: users.create
    events:
      - http: post users/create
resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: usersTable
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S
45 # serverless.yml
● Service info as ENV vars
● Inject in logs

service: users
provider:
  name: aws
  runtime: nodejs6.10
  stage: dev
  environment:
    # expose the deployment stage to the functions as an ENV var
    STAGE: ${opt:stage, self:provider.stage}
functions:
  usersCreate:
    handler: users.create
    events:
      - http: post users/create
resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: usersTable
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S

import observatory

obs = observatory.Tracker()
obs.track('search_made', {
    'query': event.query,
    'n_results': len(resp['data']),
    'user_id': user_item.id
})
46 After
● Rapid dev
● Source controlled
● Log enrichment
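How service info set as ENV vars can end up in every log record is sketched below. The `STAGE` variable matches the serverless.yml slide; the `SERVICE` variable, the defaults, and both function names are assumptions made for illustration and are not the `observatory` API.

```python
import json
import os


def service_context(environ=None):
    """Collect service info that the deployment exposes as ENV vars."""
    environ = os.environ if environ is None else environ
    return {
        "stage": environ.get("STAGE", "dev"),
        "service": environ.get("SERVICE", "unknown"),  # hypothetical variable
    }


def enriched_log(message, environ=None, **fields):
    """Merge event fields with the service context into one JSON line."""
    record = {"message": message, **fields, **service_context(environ)}
    return json.dumps(record, sort_keys=True)


print(enriched_log("search_made", {"STAGE": "prod", "SERVICE": "users"}, n_results=3))
```

Application code only supplies the event fields; the stage and service name are attached automatically, so dev and prod traffic can be separated at query time.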
47 4. Server time == $$$
48 Before
● Bursty
● ETL servers idle
49 Transient Resources
● Pipeline:
  ○ Spin up EC2
  ○ Terminate
ETL: EC2, EC2 → DB
50 FAAS Resources
● Pipeline:
  ○ Discretize work
  ○ Lambda fleet
  ○ Inherently transient
Listener → Worker, Worker, Worker → DB
51 After
● Faster
● Cheaper
● Highly scalable ETL
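The listener/worker pattern (discretize the work, fan it out, load the results) can be sketched locally. In this sketch a thread pool stands in for the Lambda fleet, and the transform inside `worker` is a placeholder; none of the names come from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def discretize(items, chunk_size):
    """Split the work into independent chunks, one per worker invocation."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def worker(chunk):
    """Stand-in for one Lambda invocation: transform a chunk into rows to load."""
    return [{"id": x, "value": x * 2} for x in chunk]


def listener(items, chunk_size=3, max_workers=4):
    """Fan chunks out to workers, then gather the rows in order for the DB load."""
    chunks = discretize(items, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, chunks))
    return [row for rows in results for row in rows]


rows = listener(list(range(7)))
```

Because each chunk is independent, the number of concurrent workers can follow the size of the burst, and nothing stays running once the chunks are processed.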
52 Distributed Data Engineering - Lessons
1. Metrics
2. Logging
3. Frameworks
4. Serverless ETL
Skywatch