There are two ways to achieve always-on service: maximizing MTBF (mean time between failure) and minimizing MTTR (mean time to repair). Traditional network equipment has been designed with superb MTBF in mind: dual power supply, dual network cards, double correction memory, etc. Of course, MTTR was also considered, thanks to VRRP, health checks, proactive monitoring, etc.
The NFV dream of leveraging COTS instead of ATCA inevitably leads to a lower MTBF. Commoditization means cheaper and more generic (not telco-tailored) hardware, which impacts reliability. The current trend in the industry is not to improve MTBF via software extensions to the virtualization layer (with clustering for instance) as it wastes precious resources waiting for the failure to happen (i.e. with 1+1 replication). The trend is actually about MTTR, as it has already been proven with the cloud-native applications such as Netflix and their usage of periodic fire drills that deliberately force a hard shutdown ofactually destroys their cloud servers, network elements or whole availability zones (ChaosMonkey) with no service impact. TMForum is notoriously supporting their Zero-touch Operations and Management (ZOOM) initiative, very aligned with the DevOps movement, as both leverage automation as the foundation for transformation and business improvement. Even IBM has recognized the power of solving Operation problems using a Software Engineering approach, or as they say: “Recover first, Resolve next”.
With this in mind, a Carrier-Grade NFVI has to provide the means for these Next-Generation Operation teams (i.e. NetOps) to be able to use Software Automation to replace their operational runbooks, and avoid manual intervention (Zero-Touch). Failure will happen, and the teams may be prepared, but will they be notified in time? Will the failure be quickly detected by the monitoring tools? Do their operational tools enable the recovery of the service by isolating the faulty components and provisioning new ones, while the Operators can diagnose the root cause without service downtime?
At the NFVI layer, the open source community has improved all the components, including the Operating System, the user-space tools (SystemD), the service availability daemons (Pacemaker), the logging and event detection (standard logs processed by ElasticSearch, Kibana and Logstash) and enabled multiple APIs for native out-of-band management (IPMI, PXE, etc) from the orchestration tools in the VIM. In the OpenStack HA Reference Architecture, there is no single point of failure as all components (control nodes, message queues, databases, etc.) are fully redundant and provide scale-out services, leading to a fault-tolerant and self-healing OpenStack control plane. You can also find open-source solutions for monitoring and alerting, capacity planning, log-correlation and backup/recovery solutions, that can be combined to improve even further the platform’s uptime.
In the case that the VNFs cannot leverage this new operational reality, and need to rely on an HA-ready infrastructure, you can also enable and expose the Openstack’s internal HA mechanism, Pacemaker, to selected VMs. It allows those VNF’s to be managed in a fault-proof fashion, automating the recovery of the VM elsewhere in case of host failure, thanks to the tight integration with the storage layer, Ceph being one of the best suited for fast live migrations. This is an unique integration advantage of Openstack, with advanced options that allows sub-second detection and recovery times with ensured data integrity and consistency.