OpenStack Operations Quick Ramp-up and Survival Guide

Joshua Guan, Operations Engineer, IBM Bluemix Private Cloud, @joshuakwan • Fan He, Architect, IBM Bluemix Private Cloud, @fancyhe


  1. OpenStack Operations Quick Ramp-up and Survival Guide • Joshua Guan, Operations Engineer, IBM Bluemix Private Cloud, @joshuakwan • Fan He, Architect, IBM Bluemix Private Cloud, @fancyhe

  2. Joshua Guan, Operations Lead, IBM Bluemix Private Cloud • Fan He, Cloud Architect, IBM Bluemix Private Cloud

  3. A Little Bit of Background … • Bluemix Private Cloud is IBM’s private cloud as a service, based on OpenStack • Bluemix Private Cloud landed in China to support IBM’s Cloud business there • We were building an OpenStack Operations Team from scratch

  4. Agenda • Define an OpenStack Operations Team • Operating Model • Processes • Tooling • Teaming • Tooling Integration • Cliché: OpenStack upgrade, HA, Live Migration

  5. Operating OpenStack is like … You thought you would work like this And, Welcome to the real world

  6. Define an OpenStack Operations Team • Operating Model: How the cloud services are offered; What is the SLA; Collaboration with Business Partners, Data Centers and backend teams, etc. • Processes: Operation Tiers; Escalation Levels; Incident Management; Change Management; Shifts; Onboard & Offboard • Tooling: Monitoring; Collaboration; Cloud Management; Knowledge Base; Security; Customer Support • Teaming: Roles and Responsibilities; Shift Model • …

  7. Operating Model • Customers use the Support Entry Points and consume the OpenStack Service Offering • The offering complies with the Service Level Agreement • Support Entry Points route to OpenStack Operations, which operates the offering • OpenStack Operations collaborates with and escalates to the Business Partner, Data Center, and Development Team

  8. Processes • Areas: Operation Tiers, Security, Escalation Flows, Shifts, Change Management, Incident Management • Operation Tiers define Roles and Responsibilities • Tier 1, Support: first line of defense • Tier 2, Operations: deploy, upgrade, admin • Tier 3, OpenStack Engineering: build the product • Tier 3, Network Engineering: undercloud networks

  9. Processes: Escalation Flows • How tickets/alerts/incidents go between different tiers • Customer → Tier 1 → Tier 2 → Tier 3 (three Tier 3 boxes in the diagram)

  10. Processes: Incident Management (Definition: Example) • Priority Level: P0, P1, P2 • Incident Definition: OpenStack node failure, Data center network interruption • Management Activities: RFO, Outage Track • Response time: Immediate, 15min, 1hr • Update interval: Every 30min • Communication method: Customer ticket, email, statuspage.io • Escalation to leadership: 1hr
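The timing targets on this slide can be written down as a small policy table. The following is a minimal Python sketch, not the team's actual tooling: the `Incident` class and the `response_deadline` and `due_actions` helpers are hypothetical names, and pairing P0/P1/P2 with "Immediate, 15min, 1hr" in that order is an assumption, as is applying the 1hr leadership escalation to every priority.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumption: priorities and response times on the slide pair up in order.
RESPONSE_TIME = {
    "P0": timedelta(0),            # "Immediate"
    "P1": timedelta(minutes=15),   # "15min"
    "P2": timedelta(hours=1),      # "1hr"
}
UPDATE_INTERVAL = timedelta(minutes=30)      # "Every 30min"
LEADERSHIP_ESCALATION = timedelta(hours=1)   # "Escalation to leadership: 1hr"


@dataclass
class Incident:
    """Hypothetical record for one open incident."""
    priority: str
    opened_at: datetime
    last_update_at: datetime


def response_deadline(incident: Incident) -> datetime:
    """Latest time by which the first response is expected."""
    return incident.opened_at + RESPONSE_TIME[incident.priority]


def due_actions(incident: Incident, now: datetime) -> List[str]:
    """Return the follow-up actions that are due at `now`."""
    actions = []
    if now - incident.last_update_at >= UPDATE_INTERVAL:
        actions.append("post customer update")
    if now - incident.opened_at >= LEADERSHIP_ESCALATION:
        actions.append("escalate to leadership")
    return actions
```

Keeping the targets in one table means the same numbers can drive alert routing, on-call reminders, and customer status updates instead of living only in a runbook.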

  11. Processes: Change Management • Different types of changes • How the change will be rolled out • When the change will be rolled out • Review and approval • Customer communication

  12. Processes: Shifts • Timeline with three rows: at-work, on-call primary, on-call secondary

  13. Processes: Security • Security Compliance Activities • Health Check • Patch Reporting • Vulnerability Scanning • Continuous Business Need

  14. Tooling: Monitoring • Monitoring • Alerting • Log Aggregation • Dashboard

  15. Tooling: Collaboration • Chat • File Sharing • Project Kanban • Shift Management

  16. Tooling: Cloud Management • CMDB • Asset Management • Change Management • Incident Management

  17. Tooling: Knowledge Base • Internal Wiki/Runbooks • Product Documents for Customers

  18. Tooling: Security • Access Management • Security Compliance Management • Health Checking • Patch Reporting • Vulnerability Scanning

  19. Tooling: Customer Support • Ticketing System • Customer Chat • Customer Satisfaction • Cloud Level Maintenance Communication • Site Level Maintenance Communication

  20. Teaming • Service Level Agreement • Service Availability • Shift Model

  21. Teaming • 24x7 Availability • Spread the pain • Eliminate interruptions as far as possible • Timeline diagram: operators on shift (at-work) handle triage, backed by SME On-call 1, 2, and 3, each with an on-call primary and an on-call secondary

  22. Tooling Integration • A lot of screens to watch • A lot of systems to work on • A lot of interruptions • Use your tools to “kill” them

  23. Tooling Integration As a good start: Kill “context switch” – work on a single platform

  24. Tooling Integration As a good start: Kill “context switch” – work on a single platform

  25. Tooling Integration What’s next: Kill “all interruptions” – workflow automation across platforms

  26. Cliché – Where BOOOOOM Happens • Implementations & Operations: Change management • The Practices of Upgrade • The Story of HA • The Myth of Live Migration

  27. Change management • “Infrastructure as Code” • Incoming change requests • Customer initiated requirements • Internal enhancements roll out • Compliance • Change planning for Consistency • Priorities • Dependencies

  28. OpenStack Upgrade • Prerequisites: deployment automation • Consistency – cloud configurations in CMDB • Idempotency – code to run OpenStack upgrade • Upgrade process design • Upgrade orchestration • Repeatable success & minimum disruption • Reference: Upgrading OpenStack: A Best Practices Guide
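The idempotency point can be illustrated with a toy upgrade step that checks the current state before acting, so that re-running it after a partial failure is harmless. This is a minimal Python sketch under stated assumptions: the dictionary of node states stands in for the cloud configuration held in the CMDB, and `ensure_version` and `upgrade` are hypothetical helpers, not part of any OpenStack tool.

```python
from typing import Dict, List


def ensure_version(node_state: Dict[str, str], service: str, target: str) -> bool:
    """Bring `service` on one node to `target`; return True if a change was made."""
    if node_state.get(service) == target:
        return False  # already converged, so a re-run is a no-op
    node_state[service] = target  # stand-in for the real upgrade action
    return True


def upgrade(cloud: Dict[str, Dict[str, str]], service: str, target: str) -> List[str]:
    """Apply the step to every node and report which nodes changed."""
    return [
        name
        for name, state in cloud.items()
        if ensure_version(state, service, target)
    ]


# Hypothetical inventory: one node was already upgraded by an earlier, interrupted run.
cloud = {
    "controller-1": {"nova": "newton"},
    "controller-2": {"nova": "mitaka"},
    "compute-1": {"nova": "mitaka"},
}
first_run = upgrade(cloud, "nova", "newton")
second_run = upgrade(cloud, "nova", "newton")
```

The first run touches only the nodes that still differ from the target, and the second run changes nothing, which is the property that makes an upgrade repeatable.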

  29. Let’s talk about High Availability…. • Architecture decisions for HA • Eliminate SPOF; Non-disruptive upgrade; Load Balancing; … • Inherent availability = MTTF / (MTTF + MTTR) • HA’s “dark side” for cloud operations • Recovery with HA resetting • Complexity’s impact on recovery time • Mitigation plan • Built-in monitoring for HA mechanism • Recovery automation
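The availability formula on this slide is easy to evaluate numerically, which helps show why recovery time matters as much as failure rate. The following is a small Python sketch; the function names and the sample MTTF/MTTR figures are illustrative assumptions, not measurements from the talk.

```python
def inherent_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Inherent availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24


# Illustrative numbers only: the same failure rate with a fast and a slow recovery.
fast_recovery = inherent_availability(mttf_hours=999.0, mttr_hours=1.0)
slow_recovery = inherent_availability(mttf_hours=999.0, mttr_hours=4.0)
```

With MTTF held constant, a longer MTTR lowers availability, which is the slide's concern about HA complexity lengthening recovery.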

  30. Live Migration? • Does “nova live-migrate” work? • Manage customer expectations • Abuse prevention • Limited appropriate scenarios • Automation with caution • Integration with pre & post-verification routine • Reference: Live Migration is a Perk, not a Panacea • @kiwik http://kiwik.github.io/openstack/2015/05/23/Nova-Live-Migration-Workflow/
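One way to read "integration with pre & post-verification routine" is to wrap the migration call so that it only runs when a pre-check passes and is always followed by a post-check. The following is a minimal Python sketch: `migrate_with_checks` is a hypothetical helper, and the three callables are placeholders that a real deployment would supply (for example, a function that invokes the live migration and functions that inspect host capacity or instance health).

```python
from typing import Callable


def migrate_with_checks(
    server_id: str,
    pre_check: Callable[[str], bool],
    migrate: Callable[[str], None],
    post_check: Callable[[str], bool],
) -> str:
    """Run one migration guarded by verification before and after."""
    if not pre_check(server_id):
        return "skipped"        # not an appropriate scenario, so do nothing
    migrate(server_id)          # placeholder for the actual live migration call
    if not post_check(server_id):
        return "verify-failed"  # hand over to an operator instead of retrying blindly
    return "migrated"
```

Returning an explicit status rather than raising lets an automation loop stop on the first "verify-failed" result, in line with the slide's advice to automate with caution.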

  31. The Open Cloud: Delivering Solutions with Choice • October 26th, CCIB Room 116 • 11:25 Kickoff with Todd Moore, IBM Vice President, Open Technology • Enterprise Perspectives: 11:30 OpenStack for Beginners (Shamail Tahir, Tyler Britten); 12:15 The Open Cloud: A Platform of Possibilities (Jesse Proudman, Azmir Mohamed); 2:15 Don’t Just Take Our Word for It: Use Cases from Materna & AT&T (Armin von Dolenga, Materna; Jacob Caspi, AT&T) • Microservices on the Open Cloud: 3:05 Part 1 - Designing Effective Microservices (Manuel Silveyra); 3:55 Part 2 - Deploying Infrastructure Foundations (Shaun Murikami, Andrew Bodine); 5:05 Part 3 - Delivering Application Microservices (Daniel Krook); 5:55 Part 4 - Directing Deployments with DevOps (Megan Kostick, Michael Brewer, Manuel Silveyra) • 4:30 Join Brad Topol and the Interop Challenge Vendors for refreshments

  32. Thank You
