Building SRE Inside the Enterprise with Istio. Hybrid Specialist: Shawn Ho, shawnho@google.com
1 What is SRE?
Product Lifecycle: Concept → Business → Development → Operations → Market. Agile solves the gap between Business and Development; DevOps solves the gap between Development and Operations.
Dev & Ops KPIs Aren't Aligned: Developers are measured on agility; Operators are measured on stability.
What is the relationship between DevOps and SRE? "class SRE implements DevOps"
● DevOps is more of an abstract concept: guidelines and disciplines for breaking down the silos between development and operations.
● SRE is Google's concrete, realized practice of DevOps.
Class SRE = Real Person. [Diagram: SRE gives Developers a Self-Service Platform made up of CI/CD, Monitoring, and Automation.]
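#1. Decisions based on data: every decision is grounded in data.
#2. Be user centric: even if every monitoring metric looks normal, the system is unstable if customers feel it is unstable.
#3. Blameless culture & shared responsibility: breaking down departmental silos starts with sharing responsibility across teams (Developers, Operators, Leaders). A system failure is not only the operators' responsibility; code quality, technical debt, and other factors are all possible causes.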
2 How to Implement SRE with Istio/Anthos?
Istio in 2 minutes
[Architecture diagram. Data flow: external and internal apps reach Service A and Service B through an Ingress Gateway and leave through an Egress Gateway. Each service sits behind a sidecar proxy (HTTP, gRPC, TCP) with local authz. JWT + TLS provide perimeter security, and mTLS secures traffic between proxies. Control and metrics flow: the Istio control plane (Pilot, Citadel, Galley) covers routing policies, policy enforcement and reporting through logging and monitoring plugins, and cert issuance with secure naming backed by a cert authority. The control plane API runs on the K8S API Server.]
What does SRE implement on the platform?
● Culture: toil management, blamelessness, shared responsibility
● Metrics & monitoring: SLO, dashboard, analytics
● Capacity planning: forecasting, demand-driven, performance
● Change management: release process, consulting design, automations
● Emergency response: on-call, incident analysis, postmortems
Monitoring and Incident Management
● Understand system architecture: understand the system architecture and the deployed topology.
● System monitoring: monitor the system by gathering blackbox & whitebox metrics. SLIs & SLOs are extracted from the metrics and logs, and the information is visualized through dashboards.
● Log handling: manage planned events (release, maintenance).
● Incident handling: create an incident ticket and roll back the change to resolve the incident. Investigate the root cause with logging, monitoring metrics, and debugging.
● Postmortem: review the incident in retrospect and prepare a plan to prevent recurrence.
What to Monitor?
● SLI (service level indicator): a well-defined measure of "good enough". Used to specify the SLO/SLA.
● SLO (service level objective): a top-line target for the fraction of good interactions. Specifies goals (SLI + Target).
● SLA (service level agreement): consequences. SLA = (SLO + margin) + consequences = SLI + Target + consequences.
● Error budget: product management & SRE define an availability target. 100% - availability target is a "budget of unreliability" (the error budget).
SLO = SLI + Target. Example: "99% of REST API calls will complete in less than 100 ms every week." Here the SLI is the fraction of REST API calls completing in under 100 ms, and the Target is 99% every week.
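The example SLO above can be checked mechanically. The following is a minimal sketch, assuming one week of per-request latencies is available as a plain list; the function names `sli_fraction` and `meets_slo` are hypothetical and not part of Istio or Anthos.

```python
def sli_fraction(latencies_ms, threshold_ms=100.0):
    """SLI: fraction of calls that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, target=0.99, threshold_ms=100.0):
    """SLO = SLI + Target: the measured SLI must reach the target."""
    return sli_fraction(latencies_ms, threshold_ms) >= target


# Hypothetical week of traffic: 995 fast calls and 5 slow calls.
week = [20.0] * 995 + [250.0] * 5
print(sli_fraction(week))
print(meets_slo(week))
```

In practice the latencies would come from proxy or application metrics rather than an in-memory list; the arithmetic stays the same.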
Error Budget (Availability)
Allowed unavailability window per availability SLO, with the error budget remaining (%) at a 1% error rate:

Availability SLO   Per year       Per quarter     Per 30 days     Budget remaining at 1% error rate (%)
90%                36.5 days      9 days          3 days          90
95%                18.25 days     4.5 days        1.5 days        80
99%                3.65 days      21.6 hours      7.2 hours       0
99.5%              1.83 days      10.8 hours      3.6 hours       -100
99.9%              8.76 hours     2.16 hours      43.2 minutes    -900
99.95%             4.38 hours     1.08 hours      21.6 minutes    -1900
99.99%             52.6 minutes   12.96 minutes   4.32 minutes    -9900
99.999%            5.26 minutes   1.30 minutes    25.9 seconds    -99900
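The figures in this table follow from simple arithmetic: the error budget is 100% minus the SLO, and the allowed downtime is that fraction of the window. A minimal sketch, with hypothetical function names, and with the last column read as the percentage of budget left when the observed error rate is 1% (an interpretation that matches the numbers shown):

```python
def error_budget(slo):
    """Error budget as a fraction: 1 - SLO (for example 0.001 for 99.9%)."""
    return 1.0 - slo


def allowed_downtime_minutes(slo, window_days):
    """Minutes of full unavailability the budget allows in the window."""
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining_pct(slo, error_rate):
    """Share of the budget left (%) at a sustained error rate; negative means overspent."""
    return (1.0 - error_rate / error_budget(slo)) * 100.0


print(round(allowed_downtime_minutes(0.999, 30), 1))  # 43.2
print(round(budget_remaining_pct(0.999, 0.01)))       # -900
```

Floating-point rounding is why the results are passed through `round` before printing.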
Demo with Anthos: Monitoring+Incident Mgmt ● Topology ● SLO/SLI Metrics ● Blackbox/Whitebox ● Log Viewer ● Tracing/Tracing Report
Demo with Anthos: Monitoring+Incident Mgmt [Screenshots: Topology, Blackbox, Whitebox]
Demo with Anthos: Monitoring+Incident Mgmt [Screenshots: Logging, Tracing]
Error Budget Burn Down Rate
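The slide itself carries only a chart, so the following is a sketch under stated assumptions rather than the speaker's formula. It uses a common convention in which the burn rate is the observed error rate divided by the error budget, so a burn rate of 1 spends the budget exactly over the SLO window. The function names are hypothetical.

```python
def burn_rate(error_rate, slo):
    """Assumed convention: observed error rate divided by the error budget (1 - SLO)."""
    return error_rate / (1.0 - slo)


def days_until_exhausted(error_rate, slo, window_days=30):
    """Days until the window's budget is gone if the current error rate persists."""
    rate = burn_rate(error_rate, slo)
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never spent
    return window_days / rate


# Hypothetical service: 99% SLO, currently failing 2% of requests.
print(round(burn_rate(0.02, 0.99), 2))
print(round(days_until_exhausted(0.02, 0.99), 1))
```

An alert on burn rate fires while budget is still left, which is what makes it useful for acting before the SLO is missed.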
Demo with Anthos: Proactively Reduce Error Budget Burn
● Alert setting
● Canary deployment
● Cross-region deployment
[Diagram, step 1: Clients → Cloud Load Balancing → weight 90 to the Kubernetes Engine cluster in Taiwan-1, weight 10 to the Kubernetes Engine cluster in Singapore.]
[Diagram, step 2: Clients → Cloud Load Balancing → weight 50 to the Kubernetes Engine cluster in Taiwan-1, weight 50 to the Kubernetes Engine cluster in Singapore.]
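The 90/10 and 50/50 splits in these two slides are weighted routing. The sketch below simulates that behavior in plain Python to show how the weights translate into request counts. It is not the Cloud Load Balancing or Istio API; the backend names and the `simulate` helper are hypothetical.

```python
import random


def pick_backend(weights, rng):
    """Choose one backend with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


def simulate(weights, requests, seed=0):
    """Route `requests` simulated calls and count how many land on each backend."""
    rng = random.Random(seed)
    counts = {name: 0 for name in weights}
    for _ in range(requests):
        counts[pick_backend(weights, rng)] += 1
    return counts


# Canary step: 90/10 between the two clusters, then an even 50/50 split.
print(simulate({"taiwan-1": 90, "singapore": 10}, 10_000))
print(simulate({"taiwan-1": 50, "singapore": 50}, 10_000))
```

Starting the new region at a small weight limits how much error budget a bad rollout can burn before the alert fires.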
Capacity planning
● Plan for organic growth: increased product adoption and usage by customers.
● Determine inorganic growth: sudden jumps in demand due to feature launches, marketing campaigns, etc.
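The two kinds of growth can be combined into one rough forecast. The sketch below is a simplified model, assuming organic growth compounds monthly and each launch or campaign adds a fixed bump to peak demand; the function name, parameters, and numbers are all hypothetical.

```python
def forecast_demand(current_qps, monthly_growth, months, launch_bumps=None):
    """Forecast peak demand in queries per second.

    Organic growth compounds monthly at `monthly_growth`; inorganic growth
    is modeled as fixed one-off bumps from launches or campaigns.
    """
    organic = current_qps * (1.0 + monthly_growth) ** months
    inorganic = sum(launch_bumps or [])
    return organic + inorganic


# Hypothetical service: 1000 QPS today, 10% monthly growth, two launches planned.
print(round(forecast_demand(1000, 0.10, 2, launch_bumps=[500, 250])))
```

A real forecast would be fitted to historical traffic; the point here is only that the two terms are estimated separately and then added.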
Change Management
Roughly 70% of outages are due to changes in a live system.
[Diagram labels: Clients, Service, Continuous Deployment, Kubernetes Configuration, Cloud Source Repositories, Anthos Hub, NAT, Kubernetes Engine clusters on GCP, on-premises Kubernetes cluster (On-Prem1), Multiple Instances.]
Demo with Anthos: The Power of GitOps
Summary + Call to Action
● SRE has 3 key principles:
  ○ Decisions based on data (meaningful monitoring)
  ○ Be user centric (blackbox testing)
  ○ Blameless culture & shared responsibility (share the responsibility, work together)
● Kubernetes is a perfect platform to implement SRE
  ○ SLI + SLO + error budget
  ○ Watch the budget burn rate
  ○ Establish CI + CD with GitOps
● Pick a system and build your SRE practices
Cover images used with permission. These books can be found on shop.oreilly.com.