
Achieving Five Nines of VNF Reliability in Telco-grade OpenStack

Achieving Five Nines of VNF Reliability in Telco-grade OpenStack Cloud. Panel Discussion, 1:50 PM – 2:30 PM on Wednesday, April 27, 2016. Panelists: Kandan Kathirvel, AT&T; Eoin Walsh, Intel; Rimma Iontel, Red Hat Inc.; and Fausto Marzi, Ericsson.


  1. Achieving Five Nines of VNF Reliability in Telco-grade OpenStack Cloud
     Panel Discussion, 1:50 PM – 2:30 PM on Wednesday, April 27, 2016
     Panelists: Kandan Kathirvel, AT&T; Eoin Walsh, Intel; Rimma Iontel, Red Hat Inc.; and Fausto Marzi, Ericsson
     Moderated by Haseeb Akhtar, Ericsson

  2. Converting PNF to VNF without cloud awareness is not optimal
     Comparison by layer (Physical Network Function vs. Virtual Network Function):
     • Application. PNF: purpose-built software, operationally perfected over decades. VNF: new and evolving, requires significant innovation.
     • Orchestration. PNF: mostly manual or contained software. VNF: common automation, rapidly evolving.
     • Software-defined network. PNF: physical connection, no SDN. VNF: early adoption and rapid evolution.
     • Host OS & virtualization. PNF: not virtualized (most cases), vendor-provided OS. VNF: cloud provided (common).
     • Compute. PNF: purpose built and dedicated for field of use. VNF: commodity hardware (any vendor).
     • Network fabric. PNF: purpose built and dedicated (most cases). VNF: multi-tenant (common framework).
     • Datacenter / Central Office. PNF: same. VNF: same.

  3. VNF availability depends on Cloud & VNF resiliency
     A single instance of an OpenStack region is about 99.9% available (8.76 hours of unplanned downtime per year).
     [Diagram: five VNF deployment options, ordered from high risk of application outages to low risk of application outages (cloud-aware applications)]
     • Option 1: VNF on a single VM (single DC)
     • Option 2: VNF HA in a region (single DC)
     • Option 3: VNF HA in 2 regions at the same DC (single DC)
     • Option 4: VNF HA across 2 DCs (Geo Location 1 / DC1 and Geo Location 2 / DC2)
     • Option 5: VNF HA across 4 DCs (Geo Location 1 / DC1 through Geo Location 4 / DC4)
     VNF availability scale: 99.9% (8.76 hrs down/year), 99.99% (52.56 mins down/year), 99.999% (5.26 mins down/year)
     Other labels on the diagram: Most of VNFs, Some VNFs, Few, Few, Optimal, Optimal, current state
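The downtime figures on this slide follow directly from the availability percentages. The sketch below reproduces that arithmetic for a 365-day year. The second function is an addition, not something from the slides: it estimates the combined availability of several redundant copies under the strong assumption that their failures are independent, which real regions and data centers rarely satisfy. The function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected unplanned downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant instances.

    Assumption (not from the slides): failures are independent and the
    service is up as long as at least one copy is up.
    """
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for a in (0.999, 0.9999, 0.99999):
        hours = downtime_hours_per_year(a)
        print(f"{a:.3%}: {hours:.2f} h/year = {hours * 60:.2f} min/year")
```

Because shared dependencies (power, network, control plane, operator error) correlate failures, the independence estimate should be read as an upper bound rather than a prediction.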

  4. Proposed OpenStack Enhancements
     • Hitless upgrades – reduce overall platform downtime
     • Policy-driven live/offline migration, inclusive of SR-IOV, CPU pinning and huge pages support
     • Multi-location awareness & workload placement
     • Resiliency/stability testing framework in OpenStack Rally – measure and report
     • Auto-healing framework for OpenStack controllers
     VNF evolution
     • Support HA both locally and globally
     • Leverage OpenStack/cloud platform resiliency features, e.g. anti-affinity to place VMs on different servers
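The anti-affinity idea mentioned above can be illustrated with a small in-memory sketch: every VM of an HA group must land on a different server, so one server failure takes out at most one member. This is not Nova's implementation; the function name and the data shapes are hypothetical.

```python
from typing import Dict, List


def place_anti_affinity(vms: List[str], hosts: List[str]) -> Dict[str, str]:
    """Assign each VM of one group to a distinct host, or fail.

    Hypothetical helper for illustration only: a real scheduler would also
    check capacity, NUMA topology, SR-IOV availability, and so on.
    """
    if len(set(hosts)) != len(hosts):
        raise ValueError("host names must be unique")
    if len(vms) > len(hosts):
        raise ValueError("anti-affinity cannot be satisfied: more VMs than hosts")
    # zip pairs the i-th VM with the i-th host, so no host is used twice.
    return dict(zip(vms, hosts))


if __name__ == "__main__":
    print(place_anti_affinity(["vnfc-a", "vnfc-b"], ["server1", "server2", "server3"]))
```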

  5. NFV Ready Architecture
     [Architecture diagram; labeled components:]
     • OSS/BSS
     • Services: VNFs composed of VNFCs; Service Catalog; Descriptor Repositories
     • Service Orchestration; Network Orchestration; VNF Manager
     • Virtual Resource Monitoring & Reporting; Service Assurance; Security; Analytics
     • Virtual resources: vCompute, vStorage, vNetwork
     • Enhanced Platform Awareness; Virtualization; Open APIs
     • Platform Resource Monitoring & Reporting
     • VIM; SDN Controller
     • Physical resources: Compute, Storage, Network

  6. Proposed OpenStack Enhancements
     • Hitless upgrades – reduce overall platform downtime
     • Policy-driven live/offline migration, inclusive of SR-IOV, CPU pinning and huge pages support
     • Multi-location awareness & workload placement
     • Resiliency/stability testing framework in OpenStack Rally – measure and report
     • Auto-healing framework for OpenStack controllers
     • Automated provisioning and monitoring (Ceilometer, Heat and Ironic)
     • Intelligent workload placement (Nova scheduler)
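To make "intelligent workload placement" concrete, here is a minimal filter-then-weigh sketch: hosts that cannot satisfy the request are filtered out, and the remaining host with the most free vCPUs wins. It only mirrors the general shape of a filter scheduler and is not Nova's code. The `Host` and `Request` types, their fields, and the weighing rule are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    free_vcpus: int
    free_hugepages_mb: int
    sriov: bool  # True if the host exposes SR-IOV capable NICs


@dataclass
class Request:
    vcpus: int
    hugepages_mb: int = 0
    needs_sriov: bool = False


def passes(host: Host, req: Request) -> bool:
    """Filter step: can this host satisfy every hard requirement?"""
    return (
        host.free_vcpus >= req.vcpus
        and host.free_hugepages_mb >= req.hugepages_mb
        and (host.sriov or not req.needs_sriov)
    )


def select_host(hosts: List[Host], req: Request) -> Optional[Host]:
    """Weigh step: among passing hosts, prefer the one with most free vCPUs."""
    candidates = [h for h in hosts if passes(h, req)]
    return max(candidates, key=lambda h: h.free_vcpus, default=None)


if __name__ == "__main__":
    hosts = [
        Host("compute1", 8, 0, False),
        Host("compute2", 4, 2048, True),
        Host("compute3", 16, 4096, True),
    ]
    print(select_host(hosts, Request(vcpus=4, hugepages_mb=1024, needs_sriov=True)))
```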

  7. NFV Ready Architecture (diagram repeated from slide 5)

  8. Proposed OpenStack Enhancements
     • Hitless upgrades – reduce overall platform downtime
     • Policy-driven live/offline migration, inclusive of SR-IOV, CPU pinning and huge pages support
     • Multi-location awareness & workload placement
     • Resiliency/stability testing framework in OpenStack Rally – measure and report
     • Auto-healing framework for OpenStack controllers
     • Automated provisioning and monitoring (Ceilometer, Heat and Ironic)
     • Intelligent workload placement (Nova scheduler)
     • Tools to measure, monitor and report end-to-end platform SLA
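One simple way to report an end-to-end SLA is to compute measured availability from recorded outage intervals over an observation window. The sketch below assumes the outages are already collected as (start, end) pairs and do not overlap. Where the outage records come from, and the function name, are assumptions and not part of the slides.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def measured_availability(
    window_start: datetime,
    window_end: datetime,
    outages: List[Tuple[datetime, datetime]],
) -> float:
    """Fraction of the window during which the service was up.

    Assumes outage intervals do not overlap each other; intervals that
    extend past the window are clipped to it.
    """
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 1.0 - down / window


if __name__ == "__main__":
    t0 = datetime(2016, 4, 1)
    t1 = t0 + timedelta(days=30)
    outage = (t0 + timedelta(days=3), t0 + timedelta(days=3, minutes=43.2))
    print(f"{measured_availability(t0, t1, [outage]):.5f}")
```

Comparing this measured value against the 99.9% / 99.99% / 99.999% targets shows how much of the yearly downtime budget a given period has consumed.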

  9. Compute Node HA – Local Disaster
     Prerequisite: compute nodes with shared storage
     Risks: node fencing
     Workflow:
     • Disaster is detected
     • Compute node evacuation
     • Users connect to the new service
     [Diagram: Compute Node 1 through Compute Node n, each running VMs and an HA agent; Control Node 1 through Control Node 3, each running HA (Corosync + Pacemaker); database; active HA controller]
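The local recovery sequence can be sketched as a small in-memory simulation: detect the failed node, fence it, then move its VMs to healthy nodes. The slide names fencing, but the extracted text does not make its position in the sequence clear. The sketch places it before evacuation, on the assumption that with shared storage a node that is still partly alive must not keep writing to disks its VMs are about to be restarted from. The function, the round-robin target choice, and the data shapes are hypothetical and do not represent Pacemaker or Nova APIs.

```python
from typing import Dict, List


def recover_compute_node(
    placement: Dict[str, str], failed: str, healthy: List[str]
) -> List[str]:
    """Move every VM off `failed` onto `healthy` nodes; return the step log.

    `placement` maps VM name -> compute node and is updated in place.
    Illustration only: real fencing and evacuation are side-effecting
    operations against hardware and the cloud API.
    """
    if failed in healthy:
        raise ValueError("the failed node cannot be an evacuation target")
    if not healthy:
        raise RuntimeError("no healthy compute node available for evacuation")
    log = [f"detected: {failed} is down", f"fenced: {failed}"]
    victims = sorted(vm for vm, node in placement.items() if node == failed)
    for i, vm in enumerate(victims):
        target = healthy[i % len(healthy)]  # simple round-robin spread
        placement[vm] = target
        log.append(f"evacuated: {vm} -> {target}")
    return log


if __name__ == "__main__":
    vms = {"vm1": "compute1", "vm2": "compute1", "vm3": "compute2"}
    for step in recover_compute_node(vms, "compute1", ["compute2", "compute3"]):
        print(step)
```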

  10. Compute Node HA – Global Operational Disaster
     Prerequisite: announce IPs with BGP or OSPF
     Risks: node and DC fencing; data replication
     Workflow:
     • Disaster is detected
     • Floating IPs retrieved from Nova
     • Compute node evacuation
     • On the other compute nodes the floating IPs are retrieved
     • Floating IP announced with BGP or OSPF
     • Users connect to the new service
     [Diagram: Internet; BGP AS XXXXX; DC1 (1.1.1.0/24) and DC2 (1.1.2.0/24), each running Freezer-dr-api; VM1 with FIP 100.100.1.15 evacuated from DC1 to DC2 and keeping the same floating IP]
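The cross-DC case adds one idea to the local one: the floating IP follows the VM, so the surviving data center has to start announcing it. The sketch below models that with three dictionaries and makes no attempt to reproduce the Freezer-dr-api, Nova, BGP or OSPF interfaces. All names are hypothetical, and the data replication risk noted on the slide is not modeled at all.

```python
from typing import Dict, List


def fail_over_site(
    vm_site: Dict[str, str],
    fip_of_vm: Dict[str, str],
    announced_from: Dict[str, str],
    failed_dc: str,
    standby_dc: str,
) -> List[str]:
    """Move VMs and their floating IPs from `failed_dc` to `standby_dc`.

    vm_site:        VM name -> data center currently hosting it
    fip_of_vm:      VM name -> floating IP (stands in for the lookup in Nova)
    announced_from: floating IP -> data center currently announcing it
    All three mappings are simplifications made for this example.
    """
    if failed_dc == standby_dc:
        raise ValueError("standby DC must differ from the failed DC")
    log = [f"detected: {failed_dc} is down", f"fenced: {failed_dc}"]
    for vm in sorted(v for v, dc in vm_site.items() if dc == failed_dc):
        fip = fip_of_vm[vm]
        vm_site[vm] = standby_dc        # evacuation to the other site
        announced_from[fip] = standby_dc  # route now advertised from standby
        log.append(f"evacuated: {vm} -> {standby_dc}, {fip} announced from {standby_dc}")
    return log


if __name__ == "__main__":
    sites = {"VM1": "DC1"}
    fips = {"VM1": "100.100.1.15"}
    routes = {"100.100.1.15": "DC1"}
    for step in fail_over_site(sites, fips, routes, "DC1", "DC2"):
        print(step)
```

Users keep connecting to the same address (100.100.1.15 in the slide's example); only the site that announces it changes.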

  11. Proposed OpenStack Enhancements
     • Hitless upgrades – reduce overall platform downtime
     • Policy-driven live/offline migration, inclusive of SR-IOV, CPU pinning and huge pages support
     • Multi-location awareness & workload placement
     • Resiliency/stability testing framework in OpenStack Rally – measure and report
     • Auto-healing framework for OpenStack controllers
     • Automated provisioning and monitoring (Ceilometer, Heat and Ironic)
     • Intelligent workload placement (Nova scheduler)
     • Global and local compute HA management

  12. Thanks! We need to build together.
