

  1. QCON LONDON 2020 ML through Streaming at Sherin Thomas @doodlesmt

  2. Stopping a Phishing Attack

  3. Hello Alex, I’m Tracy calling from Lyft HQ. This month we’re awarding $200 to all 4.7+ star drivers. Congratulations! Hey Tracy, thanks! Np! And because we see that you’re in a ride, we’ll dispatch another driver so you can park at a safe location... ...Alright, your passenger will be taken care of by another driver.

  4. Before we can credit you the award, we just need to quickly verify your identity. We’ll now send you a verification text. Can you please tell us what those numbers are... 12345

  5. Fingerprinting Fraudulent Behaviour

  6. Sequence of User Actions: Request Ride -> ... -> Driver Contact -> Cancel Ride -> ... -> Something. Reference: Fingerprinting Fraudulent Behaviour

  7. Sequence of User Actions: Request Ride -> ... -> Driver Contact (Red Flag) -> Cancel Ride -> ... -> Something. Reference: Fingerprinting Fraudulent Behaviour

  8. Temporally ordered user action sequence

     SELECT user_id,
            TOP(2056, action) OVER (
              PARTITION BY user_id
              ORDER BY event_time
              RANGE INTERVAL '90' DAYS PRECEDING
            ) AS client_action_sequence
     FROM event_user_action

  9. Temporally ordered user action sequence

     SELECT user_id,
            TOP(2056, action) OVER (      -- last x events sorted by time
              PARTITION BY user_id
              ORDER BY event_time
              RANGE INTERVAL '90' DAYS PRECEDING
            ) AS client_action_sequence
     FROM event_user_action

  10. Temporally ordered user action sequence

      SELECT user_id,
             TOP(2056, action) OVER (
               PARTITION BY user_id
               ORDER BY event_time
               RANGE INTERVAL '90' DAYS PRECEDING  -- historic context is also important (large lookback)
             ) AS client_action_sequence
      FROM event_user_action

  11. Temporally ordered user action sequence

      SELECT user_id,
             TOP(2056, action) OVER (
               PARTITION BY user_id
               ORDER BY event_time                 -- event time processing
               RANGE INTERVAL '90' DAYS PRECEDING
             ) AS client_action_sequence
      FROM event_user_action
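The windowed TOP query above can be sketched in plain Python. This is a minimal illustration of the idea, not Lyft's implementation; the `ActionSequences` class, its names, and the assumption that events arrive roughly in event-time order are all mine:

```python
from collections import defaultdict, deque

LOOKBACK_SECONDS = 90 * 24 * 3600  # RANGE INTERVAL '90' DAYS PRECEDING
MAX_ACTIONS = 2056                 # TOP(2056, action)

class ActionSequences:
    """Keeps, per user, the most recent actions ordered by event time."""

    def __init__(self):
        # user_id -> deque of (event_time, action), oldest first
        self._events = defaultdict(deque)

    def add(self, user_id, event_time, action):
        q = self._events[user_id]
        q.append((event_time, action))
        # Evict events older than the lookback window, relative to the newest event.
        while q and q[0][0] < event_time - LOOKBACK_SECONDS:
            q.popleft()
        # Keep only the last MAX_ACTIONS events.
        while len(q) > MAX_ACTIONS:
            q.popleft()

    def sequence(self, user_id):
        """The user's action sequence, sorted by event time."""
        return [action for _, action in sorted(self._events[user_id])]
```

In a real streaming job this per-key state would live in Flink's keyed state, partitioned by `user_id` as in the query's `PARTITION BY`.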

  12. Make streaming features accessible for ML use cases

  13. Flink

  14. Apache Flink
      ● Low latency stateful operations on streaming data - on the order of milliseconds
      ● Event time processing - replayability, correctness
      ● Exactly once processing
      ● Failure recovery
      ● SQL API

  15. Event Ingestion Pipeline
      (diagram: events flow through Kinesis streams and streaming filters, and are also persisted via the ingestion pipeline to HDFS / S3 for offline/batch use)
      Example event: { “ride_req”, “user_id”: 123, “event_time”: t0 }

  16. Processing Time vs Event Time
      Processing time: system time when the event is processed -> determined by the processor
      Event time: logical time when the event occurred -> part of event metadata
      Credit: The Beam Model by Tyler Akidau and Frances Perry
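The distinction can be made concrete with a toy window-assignment function (illustrative only, not a Flink API): under event time, the window an event lands in depends only on the timestamp the event carries, so replaying the stream is deterministic; under processing time, it depends on the wall clock of whoever processes it.

```python
import time

def window_start(timestamp, window_seconds=120):
    """Align a timestamp to the start of its tumbling window."""
    return timestamp - (timestamp % window_seconds)

# An event carries its own logical event time in its metadata.
event = {"name": "ride_req", "user_id": 123, "event_time": 1_000_030}

# Event-time processing: window assignment depends only on the event itself,
# so replaying the stream yields the same result (replayability, correctness).
assert window_start(event["event_time"]) == 999_960

# Processing-time processing: window assignment depends on when the processor
# happens to see the event, so results change on replay.
processing_window = window_start(time.time())
```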

  17. Event Time vs Processing Time
      (diagram: Star Wars episodes Rogue One and I-IX ordered by episode number on the event-time axis vs release year on the processing-time axis: 1977, 1980, 1983, 1999, 2002, 2005, 2015, 2016, 2018, 2019)

  18. Example: integer sum over 2 min window Credit: The Beam Model by Tyler Akidau and Frances Perry

  19. Watermark
      (diagram: events timestamped 12:09, 12:08, 12:03, 12:05, 12:04, 12:01, 12:02 in flight, with the watermark advancing W = 12:02, W = 12:05, W = 12:10)

  20. Example: integer sum over 2 min window Credit: The Beam Model by Tyler Akidau and Frances Perry

  21. Example: integer sum over 2 min window Credit: The Beam Model by Tyler Akidau and Frances Perry
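The example above can be sketched as a minimal windowed operator (my own illustration, not Beam or Flink code): integers are summed into 2-minute event-time windows, and a window's sum is only emitted once the watermark passes the window's end.

```python
from collections import defaultdict

WINDOW = 120  # 2-minute tumbling windows, keyed by window start time

class WindowedSum:
    """Sums integers per 2-minute event-time window, firing on watermarks."""

    def __init__(self):
        self.sums = defaultdict(int)   # window_start -> running sum
        self.results = {}              # window_start -> final emitted sum

    def on_element(self, event_time, value):
        start = event_time - (event_time % WINDOW)
        self.sums[start] += value

    def on_watermark(self, watermark):
        # A window [start, start + WINDOW) is complete once the watermark
        # passes its end; data arriving after that would be late here.
        for start in sorted(self.sums):
            if start + WINDOW <= watermark:
                self.results[start] = self.sums.pop(start)
        return self.results
```

Out-of-order elements are handled naturally: they are added to the window their event time dictates, as long as the watermark has not yet fired that window.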

  22. Usability

  23. What Data Scientists care about
      1. Model Development
      2. Feature Engineering
      3. Data Quality
      4. Data Collection
      5. Scheduling, Execution, Compute Resources

  24. ML Workflow
      Data Input:  extract & transform, label data, external feature sets
      Data Prep:   data discovery, normalize and clean up data
      Modeling:    train models, evaluate and optimize
      Deployment:  deploy features, monitor & visualize performance, maintain

  25. ML Workflow
      Data Input:  extract & transform, label data, external feature sets
      Data Prep:   data discovery, normalize and clean up data
      Modeling:    train models, evaluate and optimize
      Deployment:  deploy features, monitor & visualize performance, maintain

  26. Dryft! - Self Service Streaming Framework
      User Plane:    Dryft UI, Data Discovery
      Control Plane: Query Analysis, Job Cluster
      Data Plane:    Kafka, DynamoDB, Druid, Hive, Elastic Search

  27. Declarative Job Definition

      Flink SQL:

        SELECT geohash,
               COUNT(*) AS total_events,
               TUMBLE_END(rowtime, INTERVAL '1' HOUR)
        FROM event_user_action
        GROUP BY geohash,
                 TUMBLE(rowtime, INTERVAL '1' HOUR)

      Job Config:

        {
          "retention": {},
          "lookback": {},
          "stream": { "kinesis": "user_activities" },
          "features": {
            "user_activity_per_geohash": {
              "type": "int",
              "version": 1,
              "description": "user activities per geohash"
            }
          }
        }

  28. Feature Fanout
      Sources: Kinesis, S3
      Sinks: User Apps, DynamoDB, Hive, Kinesis

  29. Eating our own dogfood

  30. Feature Fanout App - also uses Dryft

      Flink SQL:

        SELECT CONCAT_WS('_', feature_name, version, id),  -- this will be used in keyBy
               feature_data,
               CONCAT_WS('_', feature_name, version) AS feature_definition,
               occurred_at
        FROM features

      Job Config:

        {
          "stream": { "kinesis": "feature_stream" },
          "sink": {
            "feature_service_dynamodb": {
              "write_rate": 1000,
              "retry_count": 5
            }
          }
        }

  31. Deployment

  32. Previously...
      ● Ran on AWS EC2 using custom deployment
      ● Separate autoscaling groups for JobManager and TaskManagers
      ● Instance provisioning done during deployment
      ● Multiple jobs (60+) running on the same cluster

  33. Multi tenancy hell!!

  34. Kubernetes Based Deployment
      (diagram: three applications, each with its own JobManager and set of TaskManagers)
      Reference: Managing Flink on Kubernetes

  35. Flink-K8s-Operator
      (diagram: a Custom Resource Descriptor drives the Flink Operator, which manages a Flink cluster of one JobManager and several TaskManagers)
      Reference: Managing Flink on Kubernetes

  36. Custom Resource Descriptor
      ● Custom resource represents a Flink application
      ● Docker image contains all dependencies
      ● CRD modifications trigger an update (includes parallelism and other Flink configuration properties)

        apiVersion: flink.k8s.io/v1alpha
        kind: FlinkApplication
        metadata:
          name: flink-speeds-working-stats
          namespace: flink
        spec:
          image: '100.dkr.ecr.us-east-1.amazonaws.com/abc:xyz'
          flinkJob:
            jarName: name.jar
            parallelism: 10
          taskManagerConfig:
            resources:
              limits: { memory: 15Gi, cpu: 4 }
            replicas: num_task_managers
            taskSlots: NUM_SLOTS_PER_TASK_MANAGER
            envConfig: { ... }

  37. (diagram: Dryft Conf -> validate, compute resources, generate CRD -> Kubernetes CRD -> Flink Operator -> Flink cluster (JM + TMs))

  38. Flink on Kubernetes
      ● Separate Flink cluster for each application
      ● Resource allocation customized per job - at job creation time
      ● Scales to 100s of Flink applications
      ● Automatic application updates
      Reference: Managing Flink on Kubernetes - by Anand and Ketan

  39. Bootstrapping

  40. What is bootstrapping?

      SELECT passenger_id, COUNT(ride_id)
      FROM event_ride_completed
      GROUP BY passenger_id,
               HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '30' DAY)
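A hopping (sliding) window of size 30 days that slides every hour assigns each event to size/slide overlapping windows, which is why the query needs 30 days of history on day one. A small helper (my own illustration, not part of Flink) makes the assignment concrete:

```python
def hop_window_starts(event_time, size, slide):
    """All window start times whose [start, start + size) interval
    contains event_time, for a hopping window of the given size/slide."""
    # The latest window that can contain the event starts at or before it.
    last_start = event_time - (event_time % slide)
    starts = []
    start = last_start
    # Walk backwards one slide at a time while the window still covers the event.
    while start > event_time - size:
        starts.append(start)
        start -= slide
    return sorted(starts)
```

With size = 30 days and slide = 1 hour, every event belongs to 720 windows, so the first complete result is only available once 30 days of input have been seen.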

  41. Bootstrap with historic data
      (timeline: days -7 to -1 are historic data, the current time sits at day 0, days 1 to 6 are future data)
      Read historic data to ‘bootstrap’ the program with 30 days worth of data. Now your program returns results on day 1. But what if the source does not have all 30 days worth of data?

  42. Solution - Consume from two sources
      (diagram: a historic source (events < target time) and a real-time source (events >= target time) both feed the business logic and then the sink)
      Read historic data from a persistent store (AWS S3) and streaming data from Kafka/Kinesis.
      Reference: Bootstrapping state in Apache Flink - Hadoop Summit
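A minimal sketch of the two-source pattern (the dict-based event shape and the `target_time` cutoff variable are illustrative): the cutoff guarantees each event is consumed from exactly one of the two sources.

```python
def bootstrap_stream(historic_events, realtime_events, target_time):
    """Merge a historic source (e.g. S3) with a real-time source
    (e.g. Kafka/Kinesis): historic events strictly before target_time,
    real-time events at or after it, so no event is processed twice."""
    for event in historic_events:
        if event["event_time"] < target_time:
            yield event
    for event in realtime_events:
        if event["event_time"] >= target_time:
            yield event
```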

  43. Job starts

  44. Bootstrapping over

  45. Start Job - with a higher parallelism for fast bootstrapping.
      Detect Bootstrap Completion - the job sends a signal to the control plane once the watermark has progressed beyond the point where historic data is no longer needed.
      “Update” Job - the control plane cancels the job with a savepoint and restarts it from that savepoint with the same job graph but a much lower parallelism.

  46. Output volume spike during bootstrapping
      (graph: output volume spikes sharply during the bootstrapping phase before settling to steady state)

  47. Output volume spike during bootstrapping
      ● Features need to be fresh but eventually complete
      ● Smooth out data writes during bootstrap to match throughput
      ● Write features produced during bootstrapping separately
      (diagram: bootstrap output goes to a low-priority Kinesis stream, steady-state output to a high-priority Kinesis stream; both drain into an idempotent sink)
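One way to smooth bootstrap writes is a token-bucket throttle in front of the low-priority stream. This sketch is my own illustration: `write_rate` echoes the job config shown earlier, but the class, the in-memory stream stand-ins, and the retry contract are hypothetical.

```python
class ThrottledSink:
    """Routes steady-state records to a high-priority stream immediately and
    bootstrap records to a low-priority stream at a capped rate."""

    def __init__(self, write_rate):
        self.write_rate = write_rate      # max low-priority writes per second
        self.tokens = write_rate
        self.last_refill = 0.0
        self.high, self.low = [], []      # stand-ins for Kinesis streams

    def write(self, record, now, bootstrapping):
        if not bootstrapping:
            self.high.append(record)      # steady state: high-priority stream
            return True
        # Refill tokens based on elapsed time, capped at one second's budget.
        self.tokens = min(self.write_rate,
                          self.tokens + (now - self.last_refill) * self.write_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            self.low.append(record)       # bootstrap: low-priority stream
            return True
        return False                      # caller retries later; sink is idempotent
```

Because the sink is idempotent, rejected bootstrap writes can simply be retried without producing duplicates.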

  48. What about skew between historic and real-time data?

  49. Skew
      (diagram: watermark skew between the lagging historic source and the Kinesis source)

  50. Solution: Source synchronization
      (diagram: each consumer reports a per-partition watermark into shared state; the resulting global watermark is read back by every consumer to decide how far ahead of partitions 1-4 it may read)
      FLINK-10887, FLINK-10921, FLIP-27
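The synchronization idea can be sketched as: take the global watermark to be the minimum over the per-partition watermarks held in shared state, then pause any partition that has raced too far ahead of it until the slower partitions catch up. The function name, the partition labels, and the `max_ahead` threshold are illustrative, not from the linked Flink issues.

```python
def partitions_to_pause(partition_watermarks, max_ahead):
    """Event-time alignment across source partitions.

    partition_watermarks: dict of partition name -> current watermark.
    Returns the partitions whose watermark is more than max_ahead past the
    global watermark (the minimum across all partitions)."""
    global_watermark = min(partition_watermarks.values())
    return sorted(p for p, wm in partition_watermarks.items()
                  if wm > global_watermark + max_ahead)
```

This bounds the amount of in-flight state a windowed job must buffer when one source (say, the S3 historic read) lags far behind another (the live Kinesis stream).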

  51. Now...

  52. Now...
      ● 120+ features
      ● Features available in DynamoDB (real-time point lookup), Hive (offline analysis), Druid (real-time analysis) and more...
      ● Time to write, test and deploy a feature is < 1/2 day
      ● p99 latency < 5 seconds
      ● Coming up - Python support!

  53. Thank you! Sherin Thomas @doodlesmt

