
Accept Partial Failures, Minimize Service Loss, by Daxin Wang (Baidu)



  1. Accept Partial Failures, Minimize Service Loss Daxin Wang

  2. Baidu SRE Diversified products Promote Experienced team

  3. Too complicated to recover rapidly: network & infrastructure, operation mistakes, software bugs

  4. Basic model of reducing service loss in an incident
     $\text{service\_lost} = \int_{0}^{T} \text{lost}(t)\,dt$
     [Figure: requests failed (axis 0 to 90) over time T1 to T12; annotations: "incident happened", "recovery operation"; curve: root cause recovery]

  5. Root cause recovery vs. partial recovery
     $\text{service\_lost} = \int_{0}^{T} \text{lost}(t)\,dt$
     [Figure: requests failed (axis 0 to 90) over time T1 to T12; annotations: "detected", "partial recovery", "root cause identified", "complete recovery"; curves: root cause recovery, partial recovery]
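
The comparison on this slide can be illustrated with a small numeric sketch. It approximates the integral of lost(t) by a sum over fixed-width intervals. The per-interval failure counts below are hypothetical values chosen for illustration, not data read from the slide's chart, and `service_loss` is a made-up helper name.

```python
def service_loss(failed_per_interval, interval_width=1.0):
    """Approximate the integral of lost(t) dt by a sum over fixed-width intervals."""
    return sum(failed_per_interval) * interval_width


# Hypothetical failed-request counts for intervals T1..T12 (assumed, not from the slides).
# Root cause recovery: failures stay high until the root cause is found and fixed.
root_cause_recovery = [0, 0, 90, 90, 90, 90, 90, 90, 90, 90, 0, 0]
# Partial recovery: a quick mitigation cuts failures early, the full fix lands later.
partial_recovery = [0, 0, 90, 90, 30, 30, 30, 30, 30, 30, 0, 0]

loss_root = service_loss(root_cause_recovery)
loss_partial = service_loss(partial_recovery)
print(loss_root, loss_partial)
```

Under these assumed numbers both incidents last equally long, but the partial recovery halves the total loss because the area under the curve shrinks as soon as the mitigation takes effect.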

  6. Basic principles
     • Deployment isolation – Limit failures to one cell, shift user queries rapidly
     • Module isolation – Make the non-essential modules detachable
     • User traffic isolation – Drop some of the queries to save the important ones

  7. Deployment isolation

  8. Deployment isolation

  9. Deployment isolation – Global single points: ZooKeeper, third-party services

  10. Deployment isolation – Service across cells

  11. Deployment isolation – Capacity redundancy: real-time capacity measurement, periodic stress tests
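
One way to read "capacity redundancy" is as a check that traffic can be shifted away from any single failed cell. The sketch below assumes that interpretation; the cell names, the numbers, and the function `survives_cell_loss` are hypothetical, and the capacity figures stand in for whatever real-time measurement or stress test would provide.

```python
def survives_cell_loss(capacity, load):
    """Return True if, for every single-cell failure, the remaining cells
    have enough measured capacity to absorb the total current load."""
    total_load = sum(load.values())
    for failed in capacity:
        remaining = sum(c for name, c in capacity.items() if name != failed)
        if remaining < total_load:
            return False
    return True


# Hypothetical measured capacity and current load per cell (queries per second).
capacity = {"cell-a": 100, "cell-b": 100, "cell-c": 100}
ok_load = {"cell-a": 60, "cell-b": 60, "cell-c": 60}
too_much = {"cell-a": 70, "cell-b": 70, "cell-c": 70}

print(survives_cell_loss(capacity, ok_load))
print(survives_cell_loss(capacity, too_much))
```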

  12. Deployment isolation – Reduce change risks
      • This applies not only to deployments, but also to operations
      • Do not change all cells at the same time, especially in automation!
      • Check system status after every stage of a change, manually if necessary
      • Pay attention to the different operation entry points; set global "locks"
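
The rules on this slide can be sketched as a staged change loop. This is a minimal illustration under stated assumptions: `staged_change`, `apply_change`, and `is_healthy` are hypothetical names, and the in-process `threading.Lock` only stands in for a real global lock, which would have to live in a store shared by every operation entry point.

```python
import threading

# Hypothetical stand-in for a global change lock shared by all operation entry points.
GLOBAL_CHANGE_LOCK = threading.Lock()


def staged_change(cells, apply_change, is_healthy):
    """Apply a change one cell at a time and stop at the first unhealthy cell.

    Returns (changed_cells, failed_cell); failed_cell is None if every stage passed.
    """
    if not GLOBAL_CHANGE_LOCK.acquire(blocking=False):
        raise RuntimeError("another change is already in progress")
    changed = []
    try:
        for cell in cells:
            apply_change(cell)
            changed.append(cell)
            # Check system status after every stage before touching the next cell.
            if not is_healthy(cell):
                return changed, cell
        return changed, None
    finally:
        GLOBAL_CHANGE_LOCK.release()


# Usage with made-up cells: the check fails on cell-b, so cell-c is never changed.
versions = {}
changed, failed = staged_change(
    ["cell-a", "cell-b", "cell-c"],
    apply_change=lambda cell: versions.update({cell: "v2"}),
    is_healthy=lambda cell: cell != "cell-b",
)
print(changed, failed)
```

Stopping at the first failed stage keeps the damage inside the cells already changed, which leaves the untouched cells available to take over traffic.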

  13. Module isolation
      • No service is immune to crashing
      • Losing some detail is much better than a total outage
      • Make every non-essential module detachable, even automatically
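
A detachable module can be sketched as follows. The function and module names (`render_page`, `core_search`, `weather_card`, `ads`) are hypothetical; the point is only that an essential call may fail the request, while a non-essential one is skipped, either manually through a disabled set or automatically when it raises.

```python
def render_page(query, core_search, optional_modules, disabled=frozenset()):
    """Build a response from the essential result plus whichever optional modules succeed."""
    page = {"results": core_search(query)}  # essential: a failure here propagates
    for name, module in optional_modules.items():
        if name in disabled:
            continue  # manually detached by an operator
        try:
            page[name] = module(query)
        except Exception:
            # Automatically detached for this request: lose the detail, keep the page.
            continue
    return page


def broken_weather_card(query):
    raise TimeoutError("weather backend unavailable")


page = render_page(
    "beijing",
    core_search=lambda q: [f"result for {q}"],
    optional_modules={"weather_card": broken_weather_card, "ads": lambda q: ["ad-1"]},
    disabled={"ads"},
)
print(page)
```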

  14. Module isolation – External dependencies: CDN, DNS, HttpDNS

  15. User traffic isolation
      • When capacity is insufficient, sacrifice part of the requests to save the more important part
      • Which part?
        – Real user > Crawler
        – Paid user > Free user
        – Popular request > Long-tail request
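
The three orderings above can be sketched as a simple admission function. The slide gives each ordering separately; combining them lexicographically (crawler status first, then paid status, then popularity) is an assumption made here for illustration, as are the field names and the function `admit`.

```python
def admit(requests, capacity):
    """Keep the highest-priority requests that fit in capacity and shed the rest."""

    def priority(req):
        # False sorts before True, so real users come before crawlers,
        # paid before free, and popular before long-tail requests.
        return (req["is_crawler"], not req["is_paid"], not req["is_popular"])

    ranked = sorted(requests, key=priority)  # stable: arrival order kept among equals
    return ranked[:capacity], ranked[capacity:]


# Hypothetical requests; only two can be served.
requests = [
    {"id": 1, "is_crawler": True, "is_paid": False, "is_popular": True},
    {"id": 2, "is_crawler": False, "is_paid": False, "is_popular": False},
    {"id": 3, "is_crawler": False, "is_paid": True, "is_popular": False},
    {"id": 4, "is_crawler": False, "is_paid": False, "is_popular": True},
]
kept, dropped = admit(requests, capacity=2)
print([r["id"] for r in kept], [r["id"] for r in dropped])
```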

  16. User traffic isolation – Distinguish in real time
      • Be prepared to drop traffic at any time
      • Crawlers may disguise their requests as human
      • Machine learning attempt

  17. Conclusion

  18. Daxin Wang (王达心): Enjoy the fight against incidents
