
Monarch: Google's planet-scale streaming monitoring infrastructure



  1. Monarch: Google’s planet-scale streaming monitoring infrastructure.

  2. Background; Architecture and Data Model; Queries; Using Monarch; Monarch Platform; Lessons Learned re: Scaling

  3. Monitoring at Google Ref: https://www.google.com/about/datacenters/inside/locations/index.html

  4. Monitoring at Google: global span; huge volume; many kinds (hardware/networking, OS, infrastructure services, big user-facing services, smaller services); constant change. Ref: https://www.google.com/about/datacenters/inside/locations/index.html

  5. Essentials of Monarch Scaling: maintain good hygiene; scale horizontally; reduce dimensions early.

  6. Background; Architecture and Data Model; Queries; Using Monarch; Monarch Platform; Lessons Learned re: Scaling

  7. Global Extent Ref: https://www.google.com/about/datacenters/inside/locations/index.html

  8. Monarch Zone: Monitor Locally. [Diagram: Target, Streamz Library, Configuration, Evaluator, Zone Mixer, Ingest Router, Leaves, Recovery Logs, Assigner, Repository]

  9. Monarch Zone: Ingestion, Retention and Queries. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Ingest Router, Leaves, Recovery Logs, Assigner, Repository]

  10. Monarch Zone: Ingestion. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Ingest Router, Leaves, Recovery Logs, Assigner, Repository]

  11. Metrics
      Metric: /http/server/response_latencies (Distribution, cumulative)
      Fields: Path (string), Status_code_class (int64)
      Example values (Path, Status_code_class): (/requestz, 200), (/requestz, 500), (/inspectz, 200), (/statusz, 200), ...

  12. Target Schema
      Schema: BorgTask
      Fields: user (string), job (string), cell (string), task_num (int)
      Example values: jones, server, ip, 32

  13. Monarch Zone: Ingestion. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Ingestion Router, Leaves, Recovery Logs, Assigner, Repository]

  14. Monarch Zone: Retention. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Ingestion Router, Leaves, Recovery Logs, Assigner, Repository]

  15. Streams
      Stream-identifier: BorgTask (jones, server, ip, 32) plus /http/server/response_latencies (/inspectz, 200)
      History: timestamp and value pairs (..., 1:21, 1:20, 1:19)
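The stream structure on this slide (an identifier built from target and metric field values, plus a history of timestamped values) could be modeled roughly as below. This is a minimal sketch, not Monarch's implementation: the class names `StreamId`, `Stream`, and `StreamStore` are hypothetical, and the sample values 7, 9, and 12 are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StreamId:
    """Hypothetical stream identifier: target fields plus metric fields."""
    target_schema: str
    target_fields: tuple
    metric: str
    metric_fields: tuple


@dataclass
class Stream:
    """One stream: its identifier and a history of (timestamp, value) pairs."""
    stream_id: StreamId
    history: list = field(default_factory=list)


class StreamStore:
    """Toy in-memory store that keeps one history per stream identifier."""

    def __init__(self):
        self._streams = {}

    def write(self, stream_id, timestamp, value):
        # Create the stream on first write, then append to its history.
        stream = self._streams.setdefault(stream_id, Stream(stream_id))
        stream.history.append((timestamp, value))

    def read(self, stream_id):
        return list(self._streams[stream_id].history)

    def __len__(self):
        return len(self._streams)


store = StreamStore()
sid = StreamId("BorgTask", ("jones", "server", "ip", 32),
               "/http/server/response_latencies", ("/inspectz", 200))
for ts, value in [("1:19", 7), ("1:20", 9), ("1:21", 12)]:  # made-up values
    store.write(sid, ts, value)
```

Because `StreamId` is a frozen dataclass, two identifiers with equal field values hash alike, so repeated writes for the same target and metric fields land in one history.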

  16. The Data Model for Queries
      Table: BorgTask :: /rpc/server/server_latencies
      Stream-id columns: user, job, cell, task_num, status_code_class, path
      Time series column: server_latencies
      Example rows (stream-id values | time series):
        jones, server, ip, 0, DB, Alloc | 10:52-1:21, 10:42-01:21, ...
        ...
        jones, server, ip, 876, DB, Query | 10:52-1:21, 10:42-01:21, ...
        jones, server, ip, 877, DB, Undo | 10:52-1:21, 10:42-01:21, ...
        ...
        emons, client, qr, 33, Help, Ask | 07:33-4:49, 07:38-4:49, ...

  17. Monarch Zone: Retention. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Ingestion Router, Leaves, Recovery Logs, Assigner, Repository]

  18. Monarch Zone: Query. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Ingestion Router, Leaves, Recovery Logs, Assigner, Repository]

  19. Monarch Zone: Evaluation and Notification. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Sample Server, Ingestion Router, Leaves, Recovery Logs, Assigner, Repository]

  20. Monarch Zone. [Diagram: Target, Streamz Library, Configuration, Notification, Evaluator, Query, Zone Mixer, Sample Server, Ingestion Router, Leaves, Recovery Logs, Assigner, Repository]

  21. Ref: https://www.google.com/about/datacenters/inside/locations/index.html

  22. Local > Global View. [Diagram: Leaves, Evaluator, Config Server, Root Mixer]

  23. Global Monarch. [Diagram: Leaves (global zone), Notification, Evaluator, Configuration, Config Server, Query, Root Mixer, Zones, Zone Mixers]

  24. Global Monarch. [Diagram: Leaves (global zone), Notification, Evaluator, Configuration, Config Server, Query, Root Mixer, Zones, Zone Mixers]

  25. Global Monarch. [Diagram: Leaves (global zone), Notification, Evaluator, Configuration, Config Server, Query, Root Mixer, Zones, Zone Mixers]

  26. Integrated Monarch. [Diagram: Global Monarch and Monarch Zones, each with its leaves]

  27. Background; Architecture and Data Model; Queries; Using Monarch; Monarch Platform; Lessons Learned re: Scaling

  28. Query
      Query(
        Fetch(Raw('BorgTask', '/http/server/response_latency'),
              {'user': 'gmail', 'status_code_class': 200})
        | Window(Delta('5m'))
        | GroupBy([job, cell], Sum())
        | Point(Percentile(95)),
        '1h', '5m')
      Also: Join, PickTopStreams, MapStreamId, Union; general expressions; a large set of aggregation functions.
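To make the Fetch, Window, GroupBy, Point pipeline on this slide concrete, here is a toy re-implementation over in-memory data. It is a sketch under stated assumptions, not Monarch's API: the function names, the bucket bounds, and all sample counts are hypothetical, distributions are modeled as lists of cumulative bucket counts, and the percentile is reported coarsely as the upper bound of the bucket it falls in.

```python
# Toy version of Fetch | Window(Delta) | GroupBy(Sum) | Point(Percentile).
# Every name and number here is hypothetical; this is not Monarch's API.

BOUNDS_MS = [10, 50, 100, 500]  # upper bound of each latency bucket

STREAMS = [
    # "fields" identify the stream; "points" map minute -> cumulative bucket counts
    {"fields": {"user": "gmail", "job": "server", "cell": "ip",
                "task_num": 0, "status_code_class": 200},
     "points": {0: [0, 0, 0, 0], 5: [80, 15, 4, 1]}},
    {"fields": {"user": "gmail", "job": "server", "cell": "ip",
                "task_num": 1, "status_code_class": 200},
     "points": {0: [0, 0, 0, 0], 5: [60, 30, 8, 2]}},
    {"fields": {"user": "gmail", "job": "server", "cell": "ip",
                "task_num": 0, "status_code_class": 500},
     "points": {0: [0, 0, 0, 0], 5: [0, 0, 0, 9]}},
]


def fetch(streams, predicate):
    """Keep streams whose fields match every key/value in the predicate."""
    return [s for s in streams
            if all(s["fields"].get(k) == v for k, v in predicate.items())]


def window_delta(stream, start, end):
    """Change in each cumulative bucket count between two timestamps."""
    first, last = stream["points"][start], stream["points"][end]
    return [b - a for a, b in zip(first, last)]


def group_by_sum(rows, keys):
    """Sum histograms bucket-wise for rows that share the same key fields."""
    groups = {}
    for fields, hist in rows:
        key = tuple(fields[k] for k in keys)
        acc = groups.get(key, [0] * len(hist))
        groups[key] = [x + y for x, y in zip(acc, hist)]
    return groups


def percentile(hist, bounds, pct):
    """Upper bound of the bucket that contains the pct-th percentile."""
    threshold = sum(hist) * pct / 100
    running = 0
    for count, bound in zip(hist, bounds):
        running += count
        if running >= threshold:
            return bound
    return bounds[-1]


fetched = fetch(STREAMS, {"user": "gmail", "status_code_class": 200})
deltas = [(s["fields"], window_delta(s, 0, 5)) for s in fetched]
grouped = group_by_sum(deltas, ["job", "cell"])
result = {key: percentile(hist, BOUNDS_MS, 95) for key, hist in grouped.items()}
print(result)  # {('server', 'ip'): 100}
```

The two status-200 streams sum to bucket counts [140, 45, 12, 3]; 95% of 200 requests is 190, which is first reached in the third bucket, hence the 100 ms result.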

  29. The Life of a Query. [Diagram: Query and Response passing through Root Mixer, Zone Mixer, Repo and Leaf; operations shown: Fetch, Window, GroupBy, Point]

  30. The Life of a Query. [Diagram: Query and Response passing through Root Mixer, Zone Mixer, Repo and Leaf; Fetch, Window, GroupBy and Point each shown at two levels]

  31. The Life of a Query. [Diagram: Query and Response passing through Root Mixer, Zone Mixer, Repo and Leaf; Fetch shown four times, Window and GroupBy three times, Point twice]

  32. The Life of a Query. [Diagram: Query and Response passing through Root Mixer, Zone Mixer, Repo and Leaf; Fetch and GroupBy shown twice, Window and Point once]
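The "Life of a Query" slides show the same operations appearing at several levels of the hierarchy. One way to picture that is partial aggregation: each leaf sums its own rows per group, each zone mixer merges its leaves, and the root mixer merges the zones. The sketch below assumes exactly that split, which the transcript does not spell out; the zone names, function names, and values are hypothetical.

```python
from collections import Counter


def leaf_partial(rows, keys):
    """Leaf level: sum values per group over the rows this leaf holds."""
    partial = Counter()
    for fields, value in rows:
        partial[tuple(fields[k] for k in keys)] += value
    return partial


def merge(partials):
    """Mixer level: add up partial sums that arrive from below."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total


# Hypothetical layout: zone name -> list of leaves -> rows held by each leaf.
ZONES = {
    "zone-a": [
        [({"job": "server", "cell": "ip"}, 3), ({"job": "server", "cell": "ip"}, 4)],
        [({"job": "server", "cell": "ip"}, 5), ({"job": "client", "cell": "qr"}, 2)],
    ],
    "zone-b": [
        [({"job": "client", "cell": "qr"}, 6)],
    ],
}

keys = ["job", "cell"]
zone_results = [merge(leaf_partial(rows, keys) for rows in leaves)
                for leaves in ZONES.values()]   # one merge per zone mixer
root_result = merge(zone_results)               # final merge at the root mixer
print(dict(root_result))
```

Only per-group sums travel upward, so the amount of data a mixer handles depends on the number of groups rather than the number of raw streams below it.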

  33. Background; Architecture and Data Model; Queries; Using Monarch; Monarch Platform; Lessons Learned re: Scaling

  34. Panopticon

  35. Using Panopticon: retention policy

  36. Using Panopticon: retention policy; query

  37. Using Panopticon: retention policy; query; configure alert

  38. Using Panopticon: retention policy; query; configure alert; set up consoles

  39. Background; Architecture and Data Model; Queries; Using Monarch; Monarch Platform; Lessons Learned re: Scaling

  40. Monarch as Platform: a custom console service; Python-based configuration libraries that encode best practices; really automatic monitoring; cross-company monitoring; SLA definition and alerting; automated monitoring of rollouts; ...

  41. Google Stackdriver: Monarch is the backend for Google Stackdriver. It monitors cloud customers and the Google services used by those customers. A good deal of important development was needed to do this: encryption at rest; carefully controlled and audited access; different ways of naming things and a different data model.

  42. Background; Architecture and Data Model; Queries; Using Monarch; Monarch Platform; Lessons Learned re: Scaling

  43. Lessons Learned re: Scaling: maintain good hygiene; scale horizontally -- only -- and it’s hard!; reduce dimensions early.

  44. Lessons Learned - Good Hygiene: Concurrency: don’t make long tails longer. Periodically assess all components. Always be deprecating. Study outliers carefully!

  45. Lessons Learned - Scaling Horizontally: It’s hard, but it’s the only way. Increase the number of leaves and zones. Watch out for: centralized services that become bottlenecks; non-constant per-backend costs; query fan-out.
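One reason query fan-out deserves attention can be shown with simple arithmetic. The slides give no formula; the sketch below assumes each leaf independently responds slowly with probability p, so a query that must wait for n leaves sees at least one slow response with probability 1 - (1 - p)^n. The function name and the 1% figure are illustrative assumptions.

```python
def p_any_slow(fan_out, p_slow_leaf):
    """Chance that at least one of fan_out independent leaves is slow."""
    return 1 - (1 - p_slow_leaf) ** fan_out


# Assumed: each leaf is slow 1% of the time, independently of the others.
for fan_out in (1, 10, 100):
    print(fan_out, round(p_any_slow(fan_out, 0.01), 3))
```

Under these assumptions a query touching 100 leaves waits on a slow one roughly 63% of the time, even though each leaf is slow only 1% of the time.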

  46. Lessons Learned - Reduce Dimensions Early: Aggregate data as it arrives. Configuration and data multiplexing are important. Users must be able to see “through” the aggregation.
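"Aggregate data as it arrives" can be pictured as dropping a high-cardinality field at write time, so that only the reduced series is stored. The sketch below is an assumption-laden illustration, not Monarch's ingestion path: the class name `IngestAggregator`, the choice to drop `task_num`, and the sample values are all hypothetical.

```python
from collections import defaultdict


class IngestAggregator:
    """Sums incoming deltas under a reduced key built from keep_fields only."""

    def __init__(self, keep_fields):
        self.keep_fields = tuple(keep_fields)
        self.totals = defaultdict(int)

    def add(self, fields, delta):
        # Fields not listed in keep_fields (here task_num) are discarded on arrival.
        key = tuple(fields[name] for name in self.keep_fields)
        self.totals[key] += delta


agg = IngestAggregator(["job", "cell"])
agg.add({"job": "server", "cell": "ip", "task_num": 0}, 3)
agg.add({"job": "server", "cell": "ip", "task_num": 876}, 4)
agg.add({"job": "client", "cell": "qr", "task_num": 33}, 2)
print(dict(agg.totals))
```

Three per-task inputs collapse into two stored series here. The per-task detail is gone after this step, which is why the slide stresses that users still need a way to see through the aggregation.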

  47. Lessons Learned - See through aggregation

  48. Lessons Learned - See through aggregation

  49. Lessons Learned - See through aggregation

  50. Lessons Learned re: Scaling: maintain good hygiene; scale horizontally -- only -- and it’s hard!; reduce dimensions early. This is a sampling of lessons we’ve learned; there are many more.

  51. Thank You
