
Scaling Data Infrastructure @ Spotify



  1. Scaling Data Infrastructure @ Spotify matti@spotify.com kalvans@spotify.com

  2. Mārtiņš Kalvāns Matti Pehrs kalvans@spotify.com matti@spotify.com

  3. Agenda 1. Data at Spotify 2. Summer of 2015 3. Challenges & Victory ○ Datamon ○ Styx ○ GABO

  4. Spotify big-data context ● Over 100 million monthly active users ● Over 30 million songs ● Over 2 billion playlists ● Active in 60 markets

  5. Data is at the heart of Spotify ● In 2007: Monthly Royalty Report, Weekly Billboard, ... ● In 2016: Monthly Royalty Report, Daily reports to partners, AB-Testing, Discover Weekly, Daily Mix, ...

  6. Our growth in Data ● Users: +50 TB/day, +100M users ● Developers: +60 TB/day, +10k M/R jobs

  7. Autonomy & Dependencies Team B Team A Team C Hadoop

  8. Autonomy & Dependencies

  9. Autonomy & Dependencies

  10. Autonomy & Dependencies

  11. Summer of Incidents

  12. Summer of Incidents ● A string of incidents

  13. Summer of Incidents ● A string of incidents ● War-room

  14. Summer of Incidents ● A string of incidents ● War-room ● Hadoop on its knees

  15. Summer of Incidents ● A string of incidents ● War-room ● Hadoop on its knees ● Event Delivery catch-up

  16. Summer of Incidents ● A string of incidents ● War-room ● Hadoop on its knees ● Event Delivery catch-up ● Reprocessing of data

  17. Summer of Incidents ● A string of incidents ● War-room ● Hadoop on its knees ● Event Delivery catch-up ● Reprocessing of data ● Hard to debug data issues

  18. Challenges and the path to victory...

  19. Challenges and the path to victory... 1. Early Warning Datamon - Data monitoring

  20. Challenges and the path to victory... 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control

  21. Challenges and the path to victory... 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery

  22. Challenges and the path to victory... 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery

  23. Early Warning - Datamon

  24. Early Warning - Datamon ● Unified view ○ Alignment between teams ● Ownership ○ Clear ownership of data ● SLA ○ Alert on late data

  25. Early Warning - Datamon ● Define terminology ● Provide metadata language ● Implement a Datamon service
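
The "alert on late data" idea from the Datamon slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not describe Datamon's metadata language or service API, so `DatasetSla`, `is_late`, and `late_alerts` are hypothetical names, and the SLA is modelled simply as a maximum delay after the end of a partition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DatasetSla:
    """Hypothetical metadata record: who owns a dataset and when it is due."""
    name: str
    owner: str
    # Maximum allowed delay between the end of a partition and its arrival.
    max_delay: timedelta


def is_late(sla, partition_end, now, arrived):
    """Return True if the partition has not arrived and its deadline has passed."""
    deadline = partition_end + sla.max_delay
    return not arrived and now > deadline


def late_alerts(slas, partition_end, now, arrived_names):
    """Build one alert message per late dataset, addressed to its owner."""
    return [
        f"{s.name} is late; notify {s.owner}"
        for s in slas
        if is_late(s, partition_end, now, s.name in arrived_names)
    ]
```

Attaching an owner to every SLA record is what makes the alert actionable: a late dataset always maps to one team to notify.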

  26. Challenges and the path to victory... 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery

  27. Debuggability & Control - Styx ● Execution control ○ Self service for data users ● Execution information ○ Expose debug information ● Execution isolation ○ Docker for data jobs (Image: the river Styx)

  28. Debuggability & Control - Styx ● Execution control ○ Centralized execution API

  29. Debuggability & Control - Styx ● Execution control ○ Centralized execution API ○ Backfilling and reprocessing

  30. Debuggability & Control - Styx ● Execution control ● Execution information ○ Timeline

  31. Debuggability & Control - Styx ● Execution control ● Execution information ○ Timeline ○ Google Cloud Logging

  32. Debuggability & Control - Styx ● Execution control ● Execution information ● Execution isolation ○ Docker
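
Backfilling and reprocessing, as listed under execution control, usually come down to re-running a job once per missed time partition. The Python sketch below shows that enumeration only. It assumes daily partitions and a made-up request shape; `backfill_partitions`, `backfill_requests`, and the dictionary keys are hypothetical and are not taken from the Styx API.

```python
from datetime import date, timedelta


def backfill_partitions(start, end):
    """Return the daily partitions in [start, end], oldest first."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def backfill_requests(workflow_id, start, end):
    """Build one hypothetical execution request per partition to re-run."""
    return [
        {"workflow": workflow_id, "partition": day.isoformat()}
        for day in backfill_partitions(start, end)
    ]
```

Emitting one request per partition keeps each re-run independently visible and retryable, which fits the slides' emphasis on a timeline of executions.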

  33. Challenges and the path to victory... 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery

  34. Automate Capacity - GABO/Event Delivery ● Complex and manual config

  35. Automate Capacity - GABO/Event Delivery ● Complex and manual config ● Pubsub & Dataflow streaming

  36. Automate Capacity - GABO/Event Delivery ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale

  37. Automate Capacity - GABO/Event Delivery ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale ● Dataflow streaming

  38. Automate Capacity - GABO/Event Delivery ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale ● Dataflow streaming :-( ● 2 micro services + 1 Map/Reduce job

  39. Automate Capacity - GABO/Event Delivery ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale ● Dataflow streaming :-( ● 2 micro services + 1 Map/Reduce job ● Autoscaling & The Stuffer

  40. GABO - WIP ● Handles at least 10x our load ● Darkloading ● Autoscale everything ● Self service
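
"Autoscale everything" can be illustrated with a simple capacity calculation. The slides do not explain how GABO or "The Stuffer" decide when to scale, so the Python sketch below is a generic proportional rule under assumed inputs: `desired_workers`, its parameters, and the target-utilization headroom are all hypothetical.

```python
import math


def desired_workers(events_per_second, per_worker_capacity,
                    target_utilization=0.7, min_workers=1, max_workers=1000):
    """Return a worker count that keeps each worker below the target utilization.

    All parameters are illustrative assumptions, not values from the talk.
    """
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Effective throughput one worker should carry, leaving headroom for spikes.
    usable = per_worker_capacity * target_utilization
    needed = math.ceil(events_per_second / usable)
    # Clamp so the pool never scales to zero or grows without bound.
    return max(min_workers, min(max_workers, needed))
```

Deriving the worker count from measured load, rather than from a hand-maintained configuration, is one way to avoid the capacity-model failures the summary slide warns about.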

  41. Summary ● Make sure you have the right tools to deal with data incidents ○ Make sure you have time to implement the tools you need ● Remember that your capacity model can fail at larger scale ○ Keep track of your scale and automate, automate, automate...

  42. Thank you! kalvans@spotify.com matti@spotify.com Want to join the band? http://spoti.fi/jobs
