Accept Partial Failures, Minimize Service Loss
Daxin Wang
Baidu SRE Diversified products Promote Experienced team
Too complicated to recover rapidly
• Network & infrastructure
• Operation mistake
• Software bug
Basic model for reducing service loss in an incident
$\text{service\_lost} = \int_{0}^{T} f(t)\,dt$, where $f(t)$ is the number of requests failing at time $t$
[Figure: requests failed (0 to 90) over time T1 to T12, annotated "incident happened" and "recovery operation"; series: root cause recovery]
Root cause recovery vs. partial recovery
$\text{service\_lost} = \int_{0}^{T} f(t)\,dt$, where $f(t)$ is the number of requests failing at time $t$
[Figure: requests failed (0 to 90) over time T1 to T12, annotated "detected", "partial recovery", "root cause completely identified" and "recovery"; series: root cause recovery, partial recovery]
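The comparison on this slide can be sketched numerically by replacing the integral with a sum over the intervals T1 to T12. The two series below are hypothetical values chosen only to resemble the shape of the chart; they are not data from the talk, and `service_loss` is an illustrative helper name.

```python
def service_loss(failed_per_interval, interval_length=1):
    """Approximate the integral of failed requests over time with a sum."""
    return sum(failed_per_interval) * interval_length


# Hypothetical failed-request counts for intervals T1..T12 (not real data).
# Root cause recovery: failures stay high until the root cause is found and fixed.
root_cause_recovery = [0, 0, 90, 90, 90, 90, 90, 90, 90, 60, 30, 0]
# Partial recovery: failures drop soon after detection, before the root cause is known.
partial_recovery = [0, 0, 90, 90, 60, 30, 30, 30, 30, 30, 0, 0]

loss_root = service_loss(root_cause_recovery)
loss_partial = service_loss(partial_recovery)
print(f"root cause recovery loss: {loss_root}")
print(f"partial recovery loss:    {loss_partial}")
```

With these assumed numbers the area under the partial recovery curve is clearly smaller, which is the argument the slide makes graphically.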
Basic principles
• Deployment isolation
  - Limit failures to one cell, shift user queries rapidly
• Module isolation
  - Make the non-essential modules detachable
• User traffic isolation
  - Drop some of the queries to save the important ones
Deployment isolation
Deployment isolation: global single points
• ZooKeeper
• Third-party services
Deployment isolation: services across cells
Deployment isolation: capacity redundancy
• Real-time capacity measurement
• Periodic stress tests
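One way to read "capacity redundancy" is: if any single cell fails, the remaining cells must be able to absorb the whole load. The check below is a minimal sketch under that assumption. The cell names, the QPS figures and the function name are hypothetical; in practice the capacity numbers would come from the stress tests and the load numbers from real-time measurement.

```python
def survives_single_cell_failure(capacity_qps, load_qps):
    """Return True if the total load still fits after losing any one cell."""
    total_load = sum(load_qps.values())
    for failed_cell in capacity_qps:
        remaining_capacity = sum(
            cap for cell, cap in capacity_qps.items() if cell != failed_cell
        )
        if remaining_capacity < total_load:
            return False
    return True


# Hypothetical per-cell capacity (from stress tests) and current load.
capacity = {"cell-a": 1000, "cell-b": 1000, "cell-c": 1000}
normal_load = {"cell-a": 600, "cell-b": 600, "cell-c": 600}
peak_load = {"cell-a": 700, "cell-b": 700, "cell-c": 700}

print(survives_single_cell_failure(capacity, normal_load))
print(survives_single_cell_failure(capacity, peak_load))
```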
Deployment isolation: reduce change risks
• Applies not only to deployments, but to every operation
• Do not change all cells at the same time, especially in automation!
• Check system status after every stage of a change, manually if necessary
• Pay attention to the different operation entry points; set global "locks"
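The rules above can be sketched as a staged rollout: one cell at a time, a health check after every stage, and a lock so that two changes never run concurrently. This is an illustration only. The in-process `threading.Lock` stands in for a real global lock shared by all operation entry points, and `apply_change`, `is_healthy` and `rollback` are hypothetical callbacks supplied by the caller.

```python
import threading

# Stand-in for a global change lock; a real one would be shared across tools.
_change_lock = threading.Lock()


def staged_rollout(cells, apply_change, is_healthy, rollback):
    """Change cells one by one; stop and roll back on the first unhealthy stage.

    Returns (succeeded, cells_that_were_changed).
    """
    if not _change_lock.acquire(blocking=False):
        raise RuntimeError("another change is already in progress")
    try:
        changed = []
        for cell in cells:
            apply_change(cell)
            changed.append(cell)
            if not is_healthy(cell):
                # Undo in reverse order; later cells are never touched.
                for done in reversed(changed):
                    rollback(done)
                return False, changed
        return True, changed
    finally:
        _change_lock.release()


applied, rolled_back = [], []
ok, touched = staged_rollout(
    ["cell-a", "cell-b", "cell-c"],
    apply_change=applied.append,
    is_healthy=lambda cell: cell != "cell-b",  # simulate a failure in cell-b
    rollback=rolled_back.append,
)
print(ok, applied, rolled_back)
```

In this simulated run the change stops at cell-b, so cell-c keeps serving on the old version while the first two cells are rolled back.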
Module isolation
• Every service will crash at some point
• Losing some detail is much better than a total outage
• Make every non-essential module detachable, even automatically
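A minimal sketch of a detachable module, assuming a page built from one essential part plus optional extras. All names here (`render_page`, `core_search`, the module names) are hypothetical. A module can be detached manually through `disabled`, and a module that raises is skipped automatically for that request.

```python
def render_page(query, core_search, optional_modules, disabled=frozenset()):
    """Build a page from the essential result plus any optional module that works."""
    page = {"results": core_search(query)}  # essential: failures here propagate
    for name, module in optional_modules.items():
        if name in disabled:
            continue  # manually detached by an operator
        try:
            page[name] = module(query)
        except Exception:
            # Automatic detachment: lose this detail, keep the page.
            continue
    return page


def broken_recommendations(query):
    raise TimeoutError("recommendation backend unavailable")


page = render_page(
    "sre",
    core_search=lambda q: [f"result for {q}"],
    optional_modules={
        "recommendations": broken_recommendations,
        "weather": lambda q: "sunny",
        "ads": lambda q: ["ad-1"],
    },
    disabled={"ads"},
)
print(page)
```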
Module isolation: external dependencies
• CDN
• DNS
• HttpDNS
User traffic isolation
• When capacity is insufficient, sacrifice some requests to save the more important ones
• Which ones?
  - Real user > Crawler
  - Paid user > Free user
  - Popular request > Long-tail request
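The priority ordering above can be sketched as a simple load shedder: rank requests by class and keep only as many as the capacity allows. The class labels, the numeric priorities and `shed_load` are assumptions for illustration, and the popular vs. long-tail distinction is left out to keep the sketch short.

```python
# Lower number = more important. Ordering follows the slide:
# paid user > free user, and real users > crawlers.
PRIORITY = {"paid": 0, "free": 1, "crawler": 2}


def shed_load(requests, capacity):
    """Keep the `capacity` most important requests; return (kept, dropped)."""
    # sorted() is stable, so arrival order is preserved within a class.
    ranked = sorted(requests, key=lambda r: PRIORITY[r["kind"]])
    return ranked[:capacity], ranked[capacity:]


incoming = [
    {"id": 1, "kind": "crawler"},
    {"id": 2, "kind": "free"},
    {"id": 3, "kind": "paid"},
    {"id": 4, "kind": "crawler"},
]
kept, dropped = shed_load(incoming, capacity=2)
print("kept:", [r["id"] for r in kept])
print("dropped:", [r["id"] for r in dropped])
```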
User traffic isolation: distinguish traffic in real time
• Be prepared to drop traffic at any time
• Crawlers may disguise their requests as human traffic
• An attempt with machine learning
Conclusion
Daxin Wang
Enjoy the fight against incidents