

  1. Real Time Recommendations using Spark Streaming Elliot Chow

  2. Why? - React more quickly to changes in interest - Time-of-day effects - Real-world events

  3. Feedback Loop (diagram): UI → Data Systems → Stream Processing → Recommendation Systems → back to the UI

  4. Trends Data - What people browse: impressions - What people watch: plays

  5. Trends Data - Impressions - Appearance of a video in the viewport

  6. Trends Data - Plays - Member plays a video

  7. Why Spark Streaming? - Existing Spark infrastructure - Experience with Spark - Batch and Streaming

  8. Components

  9. Design (pipeline diagram): Plays → Consume → Filter and Impressions → Consume → Filter, feeding Join → Aggregate → Transform → Cassandra and S3


  11. Join Key “Request Id” - a unique identifier of the source of a play or impression
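
A minimal sketch of keying both streams by request id ahead of the join; the `Play` and `Impression` case classes and their fields are illustrative, not from the deck:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event types; field names are assumptions.
case class Play(requestId: String, videoId: Long)
case class Impression(requestId: String, videoId: Long)

// The request id ties a play back to the impression(s) that led to it,
// so both streams are keyed by it before joining.
def keyByRequestId(
    plays: DStream[Play],
    impressions: DStream[Impression]
): (DStream[(String, Play)], DStream[(String, Impression)]) =
  (plays.map(p => (p.requestId, p)),
   impressions.map(i => (i.requestId, i)))
```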

  12. Design (pipeline diagram): Plays → Consume → Filter and Impressions → Consume → Filter, feeding Join → Aggregate → Transform → Cassandra and S3

  13. Output

      Video             Epoch              Plays   Impressions
      Stranger Things   1 (00:00-00:30)    4       5
      Stranger Things   1 (00:00-00:30)    3       6
      House Of Cards    2 (00:30-01:00)    8       10
      Marseille         2 (00:30-01:00)    3       3

  14. Output - Instead of raw counts, output sets of request ids - Count = cardinality of the set of request ids - Idempotent counting
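
The idempotence argument can be shown in plain Scala (the request ids here are made up): re-adding the same request ids, for example after a batch is recomputed, does not inflate the count, because set union deduplicates.

```scala
// Store request ids, not counts; the count is the set's cardinality.
val batch = Set("r1", "r2", "r3")
val batchReplayed = Set("r1", "r2", "r3") // same data, reprocessed

val merged = batch ++ batchReplayed
assert(merged.size == 3) // unchanged by the replay: idempotent counting
```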

  15. Design (pipeline diagram): Plays → Consume → Filter and Impressions → Consume → Filter, feeding Join → Aggregate → Transform → Cassandra and S3

  16. Programming with Spark Streaming

  17. Streaming Joins

  18. Streaming Joins - Timing challenges - Time to browse and select a video - Batched logging from the client application - Delays in data sources

  19. Streaming Joins - Attempt I - Window both plays and impressions by epoch duration - Join the two windows together - Slide by epoch duration (timeline diagram of plays and impressions)

  20. Streaming Joins - Attempt I - Easy to implement - Tight coupling with processing time - Does not mesh well with absolute time windows - Failure can mean loss of all data for the entire window (diagram: a processing-time window from 00:15 to 00:45 straddles Epoch 1, 00:00-00:30, and Epoch 2, 00:30-01:00)
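
Attempt I can be sketched roughly as follows; the function shape and the 30-minute epoch are assumptions, not from the deck:

```scala
import scala.reflect.ClassTag
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// Window both keyed streams by the epoch duration and join
// window-to-window, sliding by the epoch duration, so output is
// produced once per epoch.
def attemptOne[P: ClassTag, I: ClassTag](
    plays: DStream[(String, P)],
    impressions: DStream[(String, I)]
): DStream[(String, (P, I))] = {
  val epoch = Minutes(30)
  plays.window(epoch, epoch).join(impressions.window(epoch, epoch))
}
```

Because both window and slide equal the epoch, a failure during the window loses everything accumulated since the window opened, which is the drawback the slide calls out.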

  21. Streaming Joins - Attempt II - Join using mapWithState - Join key is the mapWithState key - State is the plays and impressions sharing the same join key - Use timeouts to expire unjoined data
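
A rough sketch of the Attempt II join; the event encoding, state shape, and the 45-minute timeout are assumptions. The join key (request id) is the `mapWithState` key, the state holds the plays and impressions seen so far, and a timeout expires unjoined data.

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Per-request-id state: the plays and impressions seen so far.
case class PerRequest(plays: List[String], impressions: List[String])

sealed trait Event { def id: String }
case class PlayEvent(id: String) extends Event
case class ImpressionEvent(id: String) extends Event

val joinSpec: StateSpec[String, Event, PerRequest, PerRequest] =
  StateSpec.function {
    (requestId: String, event: Option[Event], state: State[PerRequest]) =>
      val cur = state.getOption.getOrElse(PerRequest(Nil, Nil))
      val next = event match {
        case Some(PlayEvent(id))       => cur.copy(plays = id :: cur.plays)
        case Some(ImpressionEvent(id)) => cur.copy(impressions = id :: cur.impressions)
        case None                      => cur // timeout tick for an idle key
      }
      if (!state.isTimingOut()) state.update(next)
      next
  }.timeout(Minutes(45)) // expire unjoined data
```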

  22. Streaming Joins - Attempt II - Incoming: (R1, I1); MapWithStateRDD state: empty

  23. Streaming Joins - Attempt II - State: R1 => { I1 }

  24. Streaming Joins - Attempt II - Incoming: (R2, I8)

  25. Streaming Joins - Attempt II - State: R1 => { I1 }, R2 => { I8 }

  26. Streaming Joins - Attempt II - Incoming: (R1, P1)

  27. Streaming Joins - Attempt II - State: R1 => { I1, P1 }, R2 => { I8 } - play P1 joins impression I1

  28. Streaming Joins - Attempt II - Incoming: (R3, I5)

  29. Streaming Joins - Attempt II - State: R1 => { I1, P1 }, R2 => { I8 }, R3 => { I5 }

  30. Streaming Joins - Attempt II - Incoming: (R1, I6); R2 has expired via timeout

  31. Streaming Joins - Attempt II - State: R1 => { I1, P1, I6 }, R3 => { I5 } - impression I6 joins play P1

  32. Streaming Joins - Attempt II - State: R1 => { I1, P1, I6 }, R3 => { I5 }, ...

  33. Streaming Joins - Attempt II - Make progress every batch - Too much “uninteresting” data - High memory usage - Large checkpoints

  34. Streaming Joins - An Observation (timeline diagram of plays and impressions)


  36. Streaming Joins - An Observation - Join incoming batch of plays to windowed impressions, and vice versa

  37. Streaming Joins - An Observation - Slide by batch interval...

  38. Streaming Joins - An Observation - Slide by batch interval again...

  39. Streaming Joins - Attempt III - Counts are updated every batch - Uses Spark’s windowing - No checkpoints
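
The observation above suggests a sketch like this for Attempt III (the function shape and the 30-minute window are assumptions): each incoming batch of plays joins a window of impressions, and vice versa, so counts update every batch without `mapWithState` checkpoints.

```scala
import scala.reflect.ClassTag
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// Join each batch of plays against windowed impressions, and each
// batch of impressions against windowed plays; the window slides by
// the batch interval (the default slide). Duplicates from the two
// directions are absorbed downstream by the set-based idempotent
// counting.
def attemptThree[P: ClassTag, I: ClassTag](
    plays: DStream[(String, P)],
    impressions: DStream[(String, I)]
): DStream[(String, (P, I))] = {
  val w = Minutes(30)
  plays.join(impressions.window(w))
    .union(impressions.join(plays.window(w)).mapValues(_.swap))
}
```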

  40. mapWithState

  41. mapWithState - Can be used for more than sessionization

  42. mapWithState - Can be used for more than sessionization - Be aware of cache evictions - Lots of state may need to be recomputed

  43. mapWithState

      val input : DStream[(VideoId, RequestId)] = // ...

      val spec : StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...

      val output : DStream[(VideoId, Set[RequestId])] = {
        input.
          mapWithState(spec)
      }

  44. mapWithState

      val input : DStream[(VideoId, RequestId)] = // ...

      val spec : StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...

      val output : DStream[(VideoId, Set[RequestId])] = {
        input.
          mapWithState(spec).
          groupByKey.
          mapValues(_.maxBy(_.size))
      }

  45. mapWithState

      val input : DStream[(VideoId, RequestId)] = // ...

      val spec : StateSpec[VideoId, Iterable[RequestId], Set[RequestId], (VideoId, Set[RequestId])] = // ...

      val output : DStream[(VideoId, Set[RequestId])] = {
        input.
          groupByKey.
          mapWithState(spec)
      }

  46. mapWithState

      val input : DStream[(VideoId, RequestId)] = // ...

      val spec : StateSpec[VideoId, RequestId, Set[RequestId], Unit] = // ...

      val output : DStream[(VideoId, Set[RequestId])] = {
        input.
          mapWithState(spec).
          stateSnapshots
      }
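
One way the spec in the last variant might be filled in; this is a sketch, not the deck's implementation. The state accumulates request ids per video, the mapped value is `Unit`, and the `(VideoId, Set[RequestId])` pairs come from `stateSnapshots`:

```scala
import org.apache.spark.streaming.{State, StateSpec}

type VideoId = Long
type RequestId = String

// Fold each incoming request id into the per-video set; emit nothing
// per record, and read the full state via stateSnapshots instead.
val spec: StateSpec[VideoId, RequestId, Set[RequestId], Unit] =
  StateSpec.function {
    (_: VideoId, requestId: Option[RequestId], state: State[Set[RequestId]]) =>
      val ids = state.getOption.getOrElse(Set.empty[RequestId])
      requestId.foreach(id => state.update(ids + id))
  }
```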

  47. Productionizing Spark Streaming

  48. Metrics - Monitoring system health - Aid in diagnosis of issues - Needs to be performant and accurate

  49. Metrics - Option I - Use “traditional” stream processing metrics - Events/second, bytes/second, … - Batching can make numbers hard to interpret - Susceptible to recomputation

  50. Metrics - Option II - Spark Accumulators - Used internally by Spark - Susceptible to recomputation - Unclear when to report the metric - Can make use of SparkListener & StreamingListener

  51. Metrics - Option III - Explicit counts on RDDs - Counts will be accurate - Additional latency - Use caching to prevent duplicate work
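
Option III as a sketch: count each batch explicitly, caching the RDD so the count and the real output do not redo the upstream work twice. The `record` and `writeOut` callbacks are hypothetical stand-ins for a metrics client and a sink.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Explicit per-batch counts: accurate, at the cost of extra latency.
def countedOutput[T](stream: DStream[T],
                     record: Long => Unit,     // hypothetical metrics hook
                     writeOut: RDD[T] => Unit  // hypothetical sink
                    ): Unit =
  stream.foreachRDD { rdd =>
    rdd.cache()           // so count() and writeOut() share one computation
    record(rdd.count())
    writeOut(rdd)
    rdd.unpersist()
  }
```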

  52. Metrics - Processing time < Batch interval - Time the different parts of the job - Spark is lazy - may require forcing evaluation - Use Spark UI metrics

  53. Error Handling - What exceptions cause the streaming job to crash?

  54. Error Handling - What exceptions cause the streaming job to crash? - Most seem to be caught to keep the job running - Exception handling is application-specific - Stop-gap: track the elapsed time since the batch started
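
The stop-gap on this slide could be implemented with a `StreamingListener`; this is a sketch under that assumption, not the deck's code. An external monitor can alert or restart the job when the elapsed time grows too large.

```scala
import org.apache.spark.streaming.scheduler.{
  StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

// Track how long the current batch has been running.
class BatchWatchdog extends StreamingListener {
  @volatile private var batchStartedAt: Option[Long] = None

  override def onBatchStarted(b: StreamingListenerBatchStarted): Unit =
    batchStartedAt = Some(System.currentTimeMillis())

  override def onBatchCompleted(b: StreamingListenerBatchCompleted): Unit =
    batchStartedAt = None

  // 0 when no batch is in flight; otherwise time since the batch began.
  def elapsedMillis: Long =
    batchStartedAt.map(System.currentTimeMillis() - _).getOrElse(0L)
}
```

Register it with `ssc.addStreamingListener(new BatchWatchdog)` and poll `elapsedMillis` from the monitoring side.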

  55. Future Work

  56. Future Work - Red/Black deployment with zero data-loss - Auto-scaling - Improved back pressure per topic - Updating broadcast variables

  60. Questions? We’re hiring! elliot@netflix.com
