

  1. STREAM PROCESSING @ UBER DANNY YUAN @ UBER

  2. What is Uber

  3. Transportation at your fingertips

  4. Stream Data Allows Us To Feel The Pulse Of Cities

  5. Marketplace Health

  6. What’s Going on Now

  7. What’s Happened?

  8. Status Tracking

  9. A Little Background

  10. Uber’s Platform Is a Distributed State Machine Rider States

  11. Uber’s Platform Is a Distributed State Machine Rider States Driver States

  12. Applications can’t do everything

  13. Instead, Applications Emit Events

  14. Events Should Be Available In Seconds

  15. Events Should Rarely Get Lost

  16. Events Should Be Cheap And Scalable

  17. Where are the challenges?

  18. Many Dimensions Dozens of fields per event

  19. Granular Data

  20. Granular Data

  21. Granular Data Over 10,000 hexagons in the city

  22. Granular Data 7 vehicle types

  23. Granular Data 1440 minutes in a day

  24. Granular Data 13 driver states

  25. Granular Data 300 cities

  26. Granular Data 1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
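
The arithmetic on this slide can be checked directly; the dimension cardinalities are the ones given on the preceding slides.

```python
# Cardinality of each dimension, as stated on the slides.
dimensions = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

combinations = 1
for size in dimensions.values():
    combinations *= size

print(combinations)  # 393120000000 — roughly 393 billion
```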

  27. Unknown Query Patterns Any combination of dimensions

  28. Variety of Aggregations - Heatmap - Top N - Histogram - count(), avg(), sum(), percent(), geo

  29. Different Geo Aggregation

  30. Large Data Volume • Hundreds of thousands of events per second, or billions of events per day 
 • At least dozens of fields in each event

  31. Tight Schedule

  32. Key: Generalization

  33. Data Type • Dimensional Temporal Spatial Data

      Dimension      Value
      state          driver_arrived
      vehicle type   uberX
      timestamp      13244323342
      latitude       12.23
      longitude      30.00

  34. Data Query • OLAP on single-table temporal-spatial data

      SELECT <agg functions>, <dimensions>
      FROM <data_source>
      WHERE <boolean filter>
      GROUP BY <dimensions>
      HAVING <boolean filter>
      ORDER BY <sorting criteria>
      LIMIT <n>
      DO <post aggregation>

  35. Finding the Right Storage System

  36. 
 Minimum Requirements • OLAP with geospatial and time series support 
 • Support large amount of data 
 • Sub-second response time 
 • Query of raw data

  37. It can’t be a KV store

  38. Challenges to KV Store Pre-computing all keys is O(2^n) for both space and time
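
Why pre-computing keys is exponential: to answer an arbitrary filter from a key-value store, you would need one pre-aggregated key family per *subset* of dimensions, and a set of n dimensions has 2^n subsets. A minimal illustration:

```python
from itertools import combinations

dimensions = ["city", "hexagon", "vehicle_type", "minute", "driver_state"]

# One pre-computed key family per subset of dimensions
# (including the empty set, i.e. the global rollup).
subsets = [c for r in range(len(dimensions) + 1)
           for c in combinations(dimensions, r)]

print(len(subsets))  # 32 = 2^5 key families for just 5 dimensions
```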


  39. It can’t be a relational database

  40. 
 Challenges to Relational DB • Managing multiple indices is painful 
 • Scanning is not fast enough

  41. 
 A System That Supports • Fast scan 
 • Arbitrary boolean queries 
 • Raw data 
 • Wide range of aggregations

  42. Elasticsearch

  43. Highly Efficient Inverted-Index For Boolean Query

  44. Built-in Distributed Query

  45. Fast Scan with Flexible Aggregations
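
As an illustration (not from the talk), an Elasticsearch request for this kind of workload combines a boolean filter with nested aggregations. The field names below are hypothetical, and the exact aggregation syntax varies by Elasticsearch version:

```python
# Hypothetical request body: count driver events per state per hour
# in one city, filtered by vehicle type, over the last day.
query = {
    "size": 0,  # aggregations only, skip returning raw hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"city": "san_francisco"}},
                {"term": {"vehicle_type": "uberX"}},
                {"range": {"timestamp": {"gte": "now-1d"}}},
            ]
        }
    },
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "timestamp", "interval": "1h"},
            "aggs": {
                "per_state": {"terms": {"field": "driver_state"}}
            },
        }
    },
}
```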

  46. Storage

  47. Are We Done?

  48. Transformation e.g. (Lat, Long) -> (zipcode, hexagon)
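
A toy sketch of such a transformation, assuming a simple fixed-size square grid rather than Uber's real hexagon index; the cell size and cell-id scheme are made up for illustration:

```python
def to_cell(lat: float, lng: float, cell_deg: float = 1.0) -> str:
    """Snap a (lat, lng) point to a coarse grid cell id.

    Real systems use a hexagonal index and much finer cells;
    a 1-degree square grid keeps the illustration short.
    """
    row = int(lat // cell_deg)
    col = int(lng // cell_deg)
    return f"{row}:{col}"

print(to_cell(12.23, 30.00))  # "12:30"
```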

  49. Dynamic Pricing

  50. Trend Prediction

  51. Supply and Demand Distribution

  52. Technically Speaking: Clustering & Pr(D, S, E)

  53. New Use Cases —> New Requirements

  54. Pre-aggregation

  55. Joining Multiple Streams

  56. Sessionization

  57. Multi-Staged Processing

  58. State Management

  59. Apache Samza

  60. Why Apache Samza?

  61. DAG on Kafka

  62. Excellent Integration with Kafka

  63. Excellent Integration with Kafka

  64. Built-in Checkpointing

  65. Built-in State Management

  66. Processing Storage

  67. What If Storage Is Down?

  68. What If Processing Takes Long?

  69. Processing Storage

  70. Are We Done?

  71. Post Processing

  72. Results Transformation and Smoothing

  73. Scale of Post Processing 10,000 hexagons in a city

  74. Scale of Post Processing 331 neighboring hexagons to look at

  75. Scale of Post Processing 331 x 10,000 = 3.31 Million Hexagons to Process for a Single Query
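
The 331 figure is consistent with a centered hexagonal neighborhood: the number of hexagons within k rings of a center cell is 1 + 3k(k+1), which equals 331 at k = 10. A quick check:

```python
def hexagons_within(rings: int) -> int:
    # 1 center cell plus 6*r cells in ring r, summed over r = 1..rings:
    # 1 + 6 * (1 + 2 + ... + rings) = 1 + 3 * rings * (rings + 1)
    return 1 + 3 * rings * (rings + 1)

print(hexagons_within(10))           # 331 hexagons to look at per cell
print(hexagons_within(10) * 10_000)  # 3310000 cells for a single query
```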

  76. Scale of Post Processing 99%-ile Processing Time: 70ms

  77. Post Processing • Each processor is a pure function 
 • Processors can be composed by combinators

  78. Post Processing • Highly parallelized execution 
 • Pipelining

  79. Post Processing • Each processor is a pure function 
 • Processors can be composed by combinators 
 • Highly parallelized execution

  80. Practical Considerations

  81. Data Discovery

  82. Elasticsearch Query Can Be Complex

  83. /driverAcceptanceRate?
        geo_dist(10, [37, 22])&
        time_range(2015-02-04,2015-03-06)&
        aggregate(timeseries(7d))&
        eq(msg.driverId,1)

  84. Elasticsearch Query Can Be Optimized • Pipelining 
 • Validation 
 • Throttling

  85. Time in seconds

  86. Elasticsearch Can Be Replaced

  87. Processing Storage Query

  88. There’s one more thing

  89. There are always patterns in streams

  90. There is always need for quick exploration

  91. How many drivers cancel a request 10 times in a row within a 5-minute window?

  92. Which riders request a pickup from 100 miles apart within a half hour window?
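
The first question above can be sketched as a windowed pattern over an event stream. Timestamps are in seconds and the event schema is made up for illustration:

```python
from collections import deque

def cancels_in_a_row(events, n=10, window_s=300):
    """Yield driver ids that cancel n consecutive requests
    within a window_s-second window.

    `events` is an iterable of (timestamp, driver_id, action)
    tuples, ordered by timestamp.
    """
    recent = {}  # driver_id -> deque of consecutive cancel timestamps
    for ts, driver, action in events:
        if action != "cancel":
            recent.pop(driver, None)  # any other action breaks the streak
            continue
        q = recent.setdefault(driver, deque())
        q.append(ts)
        while q and ts - q[0] > window_s:
            q.popleft()  # drop cancels that fell out of the window
        if len(q) >= n:
            yield driver
            q.clear()

# Ten cancels in 90 seconds from one driver trips the pattern.
stream = [(i * 10, "d1", "cancel") for i in range(10)]
print(list(cancels_in_a_row(stream)))  # ['d1']
```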
