Delivering Intelligence from space Crop Forecasting Pipeline - - PowerPoint PPT Presentation



  • Delivering Intelligence from space

  • Crop Forecasting, Pipeline Monitoring, Market Intelligence Gathering, Plane & Ship Tracking

  • 3 Distributed Data Engineering - Lessons

  • 4 Distributed Data Engineering - Lessons 1. Metrics

  • 5 Distributed Data Engineering - Lessons 1. Metrics 2. Logging

  • 6 Distributed Data Engineering - Lessons 1. Metrics 2. Logging 3. Frameworks

  • 7 Distributed Data Engineering - Lessons 1. Metrics 2. Logging 3. Frameworks 4. Serverless ETL

  • 8 1. Metrics were lacking

  • 9 Before
      user_id   num_downloads   num_uploads
      4         10              1
      7         6               3

  • 10 Before (diagram label: Server)
      user_id   num_downloads   num_uploads
      4         11 (+1)         1
      7         6               3

  • 11 downloads
      date                  user_id   image_size
      2017-10-01 14:40:32   4         1365
      2017-10-02 11:01:11   4         650

  • 12 downloads
      date                  user_id   image_size
      2017-10-01 14:40:32   4         1365
      2017-10-02 11:01:11   4         650
      2017-10-02 11:06:00   5         9001
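
The contrast between the counter table (slides 9 and 10) and the append-only downloads table (slides 11 and 12) can be illustrated with a minimal Python sketch. The rows are copied from slide 12; the function names are hypothetical, and the unit of image_size is not stated in the slides, so it is summed as a plain number.

```python
from collections import Counter
from datetime import datetime

# Rows copied from slide 12: (date, user_id, image_size).
downloads = [
    (datetime(2017, 10, 1, 14, 40, 32), 4, 1365),
    (datetime(2017, 10, 2, 11, 1, 11), 4, 650),
    (datetime(2017, 10, 2, 11, 6, 0), 5, 9001),
]


def downloads_per_user(events):
    """Rebuild the old num_downloads counter from the raw events."""
    return dict(Counter(user_id for _, user_id, _ in events))


def size_per_day(events):
    """A question the counter table cannot answer: total image_size per day."""
    totals = Counter()
    for timestamp, _, image_size in events:
        totals[timestamp.date().isoformat()] += image_size
    return dict(totals)


print(downloads_per_user(downloads))
print(size_per_day(downloads))
```

The counter can always be recomputed from the events, but the events cannot be recovered from the counter, which is the motivation for storing one row per download.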

  • 13 Source of Truth Database

  • 14 Source of Truth Database ● Migration headaches ● Manage connections ● Performance

  • 15 Source of Truth Database Logging { "message": "Downloaded img", "userId": "1234", "imgId": "1d3x5", "service": "download-server", "time": "1509385330" }

  • 16 Two Kinds of Logs
      Server logs ● Debugging ● Support
        [Wed Oct 11 14:32:12 2000] [info] [client 127.0.0.1] image 1d3x5 downloaded by userId 1234

  • 17 Two Kinds of Logs
      Server logs ● Debugging ● Support
        [Wed Oct 11 14:32:12 2000] [info] [client 127.0.0.1] image 1d3x5 downloaded by userId 1234
      Metric logs ● Dashboards ● Analytics
        { "message": "Downloaded img", "userId": "1234", "imgId": "1d3x5", "service": "download-server", "time": "1509385330" }

  • 18 Metric logs ● Dashboards ● Analytics
        { "message": "Downloaded img", "userId": "1234", "imgId": "1d3x5", "service": "download-server", "time": "1509385330" }

  • 19 Centralize

  • 20 Metric Collector
      import observatory
      obs = observatory.Tracker()
      obs.track('search_made', {
          'query': event.query,
          'n_results': len(resp['data']),
          'user_id': user_item.id
      })

  • 21 Metric Collector (same tracking code as slide 20; diagram labels: REST API, Lambda, Enrich / Conform)

  • 22 Metric Collector (same tracking code as slide 20; diagram labels: REST API, Lambda, Redshift, S3)

  • 23 After ● Centralized metrics ● Log enrichment ● Persistent store REST Lambda Redshift S3
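
The slides do not show how observatory works internally, so the following is only a sketch of the enrich-and-emit idea. SketchTracker is a hypothetical stand-in, not the real observatory.Tracker, and it enriches locally, whereas the deck places the Enrich / Conform step in a Lambda behind a REST API. The output field names follow the metric log on slide 15.

```python
import json
import time


class SketchTracker:
    """Hypothetical stand-in for a metric collector client (assumption)."""

    def __init__(self, service, sink=print, clock=time.time):
        self.service = service  # name of the emitting service
        self.sink = sink        # where the JSON line goes (stdout by default)
        self.clock = clock      # injectable clock, handy for testing

    def track(self, event_name, properties):
        """Enrich the caller's properties and emit one JSON metric log line."""
        record = dict(properties)
        record["message"] = event_name
        record["service"] = self.service
        record["time"] = str(int(self.clock()))
        line = json.dumps(record, sort_keys=True)
        self.sink(line)
        return line


obs = SketchTracker("download-server")
obs.track("Downloaded img", {"userId": "1234", "imgId": "1d3x5"})
```

Keeping enrichment in one place means every service emits records with the same shape, which is what makes a single persistent store queryable.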

  • 24 2. Debugging is painful

  • 25 Before (diagram labels: Filesystem, EC2, CloudWatch, Lambda, EC2, Filesystem)

  • 26 Centralize

  • 27 Diagram labels: EC2, Stream, Lambda, EC2

  • 28 Diagram labels: EC2, Lambda, CloudWatch, EC2

  • 29 Diagram labels: EC2, Lambda, CloudWatch, Consumer, EC2

  • 30 Diagram labels: ES, EC2, SAAS, Lambda, CloudWatch, Consumer, S3, EC2

  • 31 After ● Elasticsearch ● Search by UUID

  • 32 But! What does the full flow of a request look like?

  • 33 Correlation ID UUID

  • 34 Correlation ID ● Create for any external call (diagram labels: External Request, CID, Service A)

  • 35 Correlation ID ● CID passed everywhere (diagram labels: External Request, CID, Service A, Service B, Service C)

  • 36 Correlation ID In ES → filter by CID
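
The correlation ID flow on slides 33 to 36 can be sketched in Python as follows. The header name X-Correlation-ID and the three service functions are assumptions made for illustration; the slides only say that a UUID is created for any external call and passed everywhere.

```python
import uuid

CID_HEADER = "X-Correlation-ID"  # hypothetical header name (assumption)


def ensure_cid(headers):
    """Reuse an incoming CID, or create one for a new external request."""
    out = dict(headers)
    if not out.get(CID_HEADER):
        out[CID_HEADER] = str(uuid.uuid4())
    return out


def log_record(service, message, headers):
    """Build a log record that always carries the CID."""
    return {"service": service, "message": message, "cid": headers[CID_HEADER]}


def service_b(headers):
    headers = ensure_cid(headers)
    return [log_record("service-b", "handled", headers)]


def service_c(headers):
    headers = ensure_cid(headers)
    return [log_record("service-c", "handled", headers)]


def service_a(headers):
    """Entry point: tags the external request, then calls B and C."""
    headers = ensure_cid(headers)
    logs = [log_record("service-a", "received external request", headers)]
    logs += service_b(headers)
    logs += service_c(headers)
    return logs


for record in service_a({}):
    print(record)
```

Because every record shares one cid value, filtering the log store on that value returns the full path of a single request across services.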

  • 37 3. Building services is slow

  • 38 Before ● Online console ● Zip file deployment ● Doesn’t scale

  • 39 Infrastructure as Code

  • 40 Infrastructure as Code ● Serverless Framework ● Template -> service ● Rapid deployment

  • 41
      # serverless.yml
      service: users
      provider:
        name: aws
        runtime: nodejs6.10
        stage: dev
      functions:
        usersCreate:
          handler: users.create
          events:
            - http: post users/create
      resources:
        Resources:
          usersTable:
            Type: AWS::DynamoDB::Table
            Properties:
              TableName: usersTable
              AttributeDefinitions:
                - AttributeName: email
                  AttributeType: S

  • 42 Same serverless.yml as slide 41, annotated with diagram labels: Microservice, Internet gateway, Lambda handler, S3, Dynamodb

  • 43 Same serverless.yml as slide 41, with one line added after `stage: dev`:
      STAGE: ${opt:stage, self:provider.stage}

  • 44 Same serverless.yml as slide 43 ● Service info as ENV vars

  • 45 Same serverless.yml as slide 43 ● Service info as ENV vars ● Inject in logs
      import observatory
      obs = observatory.Tracker()
      obs.track('search_made', {
          'query': event.query,
          'n_results': len(resp['data']),
          'user_id': user_item.id
      })

  • 46 After ● Rapid dev ● Source controlled ● Log enrichment
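
One way the "Service info as ENV vars" and "Inject in logs" bullets could fit together is sketched below. It assumes the STAGE variable from slide 43 is visible to the function as an environment variable; the helper names are hypothetical, and the fallback value "dev" simply mirrors `stage: dev` in the config.

```python
import os


def service_context(environ=None):
    """Read deployment info from environment variables (STAGE, per slide 43)."""
    env = os.environ if environ is None else environ
    return {"stage": env.get("STAGE", "dev")}


def enrich(record, environ=None):
    """Return a copy of a log record with the service context merged in."""
    return {**record, **service_context(environ)}


print(enrich({"message": "search_made", "n_results": 3}))
```

Since the stage comes from the deployment rather than from application code, the same handler produces correctly labeled logs in every environment without changes.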

  • 47 4. Server time == $$$

  • 48 Before ● Bursty ● ETL servers idle

  • 49 Transient Resources ● Pipeline: ○ Spin up EC2 ○ Terminate (diagram labels: ETL, EC2, EC2, DB)

  • 50 FAAS Resources ● Pipeline: ○ Discretize work ○ Lambda fleet ○ Inherently transient (diagram labels: Worker, Listener, Worker, DB, Worker)

  • 51 After ● Faster ● Cheaper ● Highly scalable ETL
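
The "Discretize work" and "Lambda fleet" idea from slide 50 can be approximated locally with a thread pool. This is a sketch under stated assumptions: each thread stands in for one Lambda invocation, the doubling transform is a placeholder, and all function names are hypothetical. No AWS calls are made.

```python
from concurrent.futures import ThreadPoolExecutor


def discretize(items, chunk_size):
    """Split the workload into independent chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def worker(chunk):
    """Stand-in for one Lambda invocation: transform a single chunk."""
    return [value * 2 for value in chunk]


def run_pipeline(items, chunk_size=3, max_workers=4):
    """Listener role: fan chunks out to workers, then collect results in order."""
    chunks = discretize(items, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, chunks))
    return [row for chunk in results for row in chunk]


print(run_pipeline(list(range(7))))
```

The key property is that each chunk can be processed without knowledge of the others, so the number of workers can follow the size of the burst instead of being provisioned in advance.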

  • 52 Distributed Data Engineering - Lessons 1. Metrics 2. Logging 3. Frameworks 4. Serverless ETL

  • Skywatch