Manageability of a Telco Cloud

When it comes to Service Providers, platform components often have a longer lifecycle. One of the key requirements is the ability to perform upgrades in a supported fashion, together with a clear understanding of the software provider's release policy. The challenge with Open Source software is to find the proper support organization, with long enough support cycles (4-5 years at least), that thoroughly plans and tests every release with backwards compatibility in mind.

For this purpose, the installation tool needs to encompass the update and upgrade use-cases. One of the best management tools for the OpenStack lifecycle is the one that uses OpenStack itself (project TripleO). The tool is used in all stages from day 0 to day 2, and is being improved to extend the APIs that control installation, upgrade and rollback. It allows the controlled addition of compute or storage nodes, as well as updating and decommissioning them. Although the road to perfect manageability is long, we’re helping the Upstream projects (like OPNFV) to learn from Telco operators and continuously improving the tools for a seamless experience. At the same time, Service Providers can leverage additional management tools like Cloudforms and Satellite, for a scalable combo of policies, configuration management, auditing and general management of their NFV Infrastructure.

One of the most recent improvements has been the addition of bare metal tools and out-of-band management for OpenStack (Ironic and its growing list of supported management interfaces, like IPMI). This allows treating physical servers as another pool of resources that can be enabled or disabled on demand. The possibilities in terms of energy savings (green computing) may well justify the investment in an OpenStack-based NFVI, due to the elasticity that such a solution offers.

Operators should analyze the core components of their VNFs and the interaction with the virtualized infrastructure. Features like host-level live kernel upgrades with Kpatch may not work if the VNFs are not supported guests. This means ensuring the VNFs use the latest QEMU drivers, with a modern and supported kernel adapted to the recent x86 extensions (VT-x, VT-d). The drivers at the host level are equally important, as they may limit the operations available to update or modify settings of the platform.

Finally, one piece of advice when choosing the components of the NFVI control plane. Ideally, the control plane should offer a protection level equal to or better than that of the systems it controls. Most vendors have chosen a simple clustering technology (keepalived) that is popular in the Upstream OpenStack project, as it fits most Enterprise needs. At Red Hat, being experts in mission-critical environments, we chose Pacemaker (although keepalived is also supported), because its advanced quorum and fencing capabilities significantly increase the uptime of the control plane elements. Pacemaker is an open source project and it can be found here. The resulting architecture of an OpenStack control plane with Pacemaker underneath permits automatic fault monitoring and recovery of the foundational services that compose the NFVI layer.
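To illustrate why quorum matters for a clustered control plane, here is a minimal Python sketch of the majority rule. It is a simplification for illustration only: the function name is hypothetical, and the real Pacemaker/Corosync stack has more options (vote weights, tie-breakers, fencing) than this shows.

```python
def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """Simplified majority rule (an assumption for illustration):
    a partition may keep running services only if it can see
    strictly more than half of the cluster members."""
    return reachable_nodes > total_nodes // 2


# A 3-node control plane survives the loss of one controller...
print(has_quorum(total_nodes=3, reachable_nodes=2))  # True
# ...but an isolated single controller must stop, to avoid split-brain.
print(has_quorum(total_nodes=3, reachable_nodes=1))  # False
# An even split of a 4-node cluster leaves neither half with quorum.
print(has_quorum(total_nodes=4, reachable_nodes=2))  # False
```

This is also why control planes are usually deployed with an odd number of controllers: an even split can never produce a majority.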


Security in a Virtualized world

When it comes to security, multiple aspects need to be reviewed. Although it is very important to remember the general security requirements of CSPs, the most significant aspects to consider are those that can be affected by the virtualization of the network functions.

When CSPs do not use virtualization, every electronic component has a dedicated purpose, sometimes with specific implementations in silicon (ASICs) or dedicated execution environments for software running on x86 (bare metal computing).

When virtualizing network functions, we are not leveraging physical barriers, so the principle of security by isolation no longer applies. In the worst-case scenario, we can have the same CPU executing instructions for two VMs simultaneously: one participating in a DMZ (demilitarized zone) and another running an HLR (Home Location Register) with customer records.

In NFV, the hypervisor provides the separation between VM contexts. It is the job of the operating system to provide policies like SELinux and sVirt to ensure that the VMs behave properly. Then, it’s up to the scheduling system (OpenStack) to guarantee host isolation via Anti-Affinity rules and other policies. Those policies monitor not just the entry points (input-output) but also the operations VMs are allowed to perform, in terms of memory access and device management. When the Service Provider needs more global policies and automated enforcement and auditing, Red Hat recommends Cloudforms or Satellite as an additional management layer.
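The idea behind an Anti-Affinity rule can be sketched in a few lines of Python. This is only an illustration of the constraint being enforced, not the OpenStack scheduler's code: the function name, VM names and host names are all hypothetical.

```python
from typing import Dict, Set


def violates_anti_affinity(placement: Dict[str, str], group: Set[str]) -> bool:
    """Return True if two VMs of the same anti-affinity group share a host.

    `placement` maps VM name -> host name (hypothetical data model).
    """
    hosts = [placement[vm] for vm in group if vm in placement]
    # If any host appears twice, at least two group members are co-located.
    return len(hosts) != len(set(hosts))


placement = {"dmz-fw": "host-a", "hlr-db": "host-b", "web-1": "host-a"}
print(violates_anti_affinity(placement, {"dmz-fw", "hlr-db"}))  # False
print(violates_anti_affinity(placement, {"dmz-fw", "web-1"}))   # True
```

In the DMZ/HLR example above, placing both VMs in one anti-affinity group would keep them from ever sharing a physical CPU.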

The security concerns are so critical that customers demand to see the hypervisor source code as the ultimate measure to audit the quality of that isolation. Thanks to the work of Intel with other open source communities, KVM now uses the latest CPU technologies, including built-in security enhancements.

Some of the features that protect Linux and KVM include:

  • Process Isolation, supported by CPU hardware extensions
  • MAC isolation by default (SELinux and sVirt)
  • RBAC and Audit trail
  • Resource control (CGroups)
  • FIPS 140-2 validated cryptographic modules  
  • Disk encryption (LUKS) for data at rest
  • Strong key support for SSH (AES256) for data in motion
  • PKI authentication for operator management, SSL when using the HTTPS interface
  • SCAP audits supported (OpenSCAP), using the Security Guide developed with NSA, NIST and DISA

Detailed documentation  of the security mechanisms of Linux and KVM can be found here. 

Decades of work to make Linux suitable for the demands of an enterprise environment have created a vast ecosystem of tools, including default policies like PolicyKit and SELinux and auditing tools like OpenSCAP. This is the reason why you can find Linux and KVM on mission-critical systems in healthcare, government, finance and even the military.

Scalability of the NFVi

Most Telcos and Standards Organizations have chosen OpenStack as the NFVi because its design allows for almost unlimited scalability, in theory. The reality is that every added feature comes at the cost of management overhead, and currently, a typical OpenStack deployment with all the components enabled cannot scale linearly beyond a few hundred nodes.


It’s by fine-tuning the 1500+ configuration variables of the control plane that we can achieve real scalability of the compute, network and storage plane, ensuring a consistent behaviour under the most demanding situations. This requires a deep expertise in the field and mastery of all the technological pieces, like the operating system, database engines, networking stacks, etc.

The default parameters of an NFVi may not be fine-tuned to avoid bottlenecks and growth issues. You can find interesting experiences on pushing OpenStack scaling to handle 168,000 instances in this Ubuntu post and in this video from the Vancouver summit on the crazy things Rackspace had to do to achieve a really big scale deployment.

For more information on the techniques to segregate and scale your environment, visit the Openstack operators guidelines on scaling.


Performance for VNFs

VNFs require high performance and deterministic behaviour, particularly in packet processing, to reduce latency and jitter. Upstream communities like OpenStack, OPNFV, libvirt and others are adding features specific to the NFV use-case. These features give service providers and their vendors tight control over the performance and predictability of their NFV workloads, even during congestion or high-load situations.

Non-Uniform Memory Architecture (NUMA) aware scheduling allows applications and management systems to allocate workloads to processor cores considering their internal topology. This way, they can optimize for latency and throughput of memory and network interface accesses and avoid interference from other workloads. Combined with other tuning, we are, for example, able to reach >95% of bare-metal efficiency processing small (64-byte) packets in a DPDK-enabled application within a VM using SR-IOV. This represents between 10 and 100 times better performance than non-optimized private clouds.
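As a rough illustration of what "NUMA-aware" means, the Python sketch below places a workload entirely inside one NUMA node so that its cores and memory stay local to each other. It is a toy best-fit placement written for this post, not the OpenStack scheduler's algorithm; the `NumaNode` class and `pick_numa_node` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class NumaNode:
    node_id: int
    free_cores: int
    free_mem_mb: int


def pick_numa_node(nodes: Sequence[NumaNode], cores: int, mem_mb: int) -> Optional[int]:
    """Pick a single NUMA node that can hold the whole workload.

    Keeping cores and memory on one node avoids slower cross-node
    memory accesses. Returns None if no single node fits.
    """
    candidates = [n for n in nodes if n.free_cores >= cores and n.free_mem_mb >= mem_mb]
    if not candidates:
        return None
    # Best fit: leave as few idle cores as possible on the chosen node.
    best = min(candidates, key=lambda n: (n.free_cores - cores, n.node_id))
    return best.node_id


host = [NumaNode(0, free_cores=4, free_mem_mb=8192),
        NumaNode(1, free_cores=8, free_mem_mb=16384)]
print(pick_numa_node(host, cores=4, mem_mb=4096))   # 0
print(pick_numa_node(host, cores=6, mem_mb=4096))   # 1
print(pick_numa_node(host, cores=16, mem_mb=1024))  # None
```

A scheduler that ignores topology could split the 6-core workload across both nodes, which is exactly the kind of interference NUMA awareness is meant to avoid.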


The latest Linux kernel offers Real-Time KVM extensions that enable NFV workloads with the well-bounded, low-latency (<10μs) response times required for signal processing use cases like Cloud-based Radio Access Networks. It is also important to realize that KVM has recently improved its PCI and SR-IOV management (VFIO) to allow zero-overhead passthrough of devices to VMs.

Open vSwitch has also received improvements, and further extensions with Intel’s Data-Plane Development Kit (DPDK), which enables higher packet throughput between Virtualized Network Functions and network interface cards as well as between Network Functions. Figures as high as 213 million packets per second for 64-byte packets on an x86 COTS server, which represent the worst-case scenario, show that the technology has less than 1% performance overhead in the worst case, and an even smaller impact in most real-world cases.
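To put packet-per-second figures in perspective, the small Python calculation below converts between link speed and packet rate for minimum-size frames. It assumes standard Ethernet framing, where each 64-byte frame also occupies 20 extra bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap); the function names are made up for this example.

```python
# Per-frame wire overhead on Ethernet: 8 bytes preamble/SFD + 12 bytes inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def line_rate_pps(link_gbps: float, frame_bytes: int = 64) -> float:
    """Maximum packets per second a link can carry for a given frame size."""
    bits_per_frame = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_per_frame


def wire_gbps(pps: float, frame_bytes: int = 64) -> float:
    """Wire bandwidth consumed by a given packet rate."""
    return pps * (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8 / 1e9


print(f"{line_rate_pps(10) / 1e6:.2f} Mpps")   # 14.88 Mpps per 10GbE port
print(f"{wire_gbps(213e6):.1f} Gbps")          # 143.1 Gbps for 213 Mpps
```

Small packets are the worst case because the fixed per-packet cost dominates: a 10GbE port needs almost 15 million forwarding decisions per second at 64 bytes, far more than with large frames.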

Additional performance measures are available at the operating-system level, with tools like Tuned, which ships predefined profiles that change the way interrupts are handled or how the process scheduler prioritizes work.

Availability in NFV

There are two ways to achieve always-on service: maximizing MTBF (mean time between failures) and minimizing MTTR (mean time to repair). Traditional network equipment has been designed with superb MTBF in mind: dual power supply, dual network cards, double correction memory, etc. Of course, MTTR was also considered, thanks to VRRP, health checks, proactive monitoring, etc.

The NFV dream of leveraging COTS instead of ATCA inevitably leads to a lower MTBF. Commoditization means cheaper and more generic (not telco-tailored) hardware, which impacts reliability. The current trend in the industry is not to improve MTBF via software extensions to the virtualization layer (with clustering for instance), as that wastes precious resources waiting for the failure to happen (e.g. with 1+1 replication). The trend is actually about MTTR, as has already been proven by cloud-native applications such as Netflix and their usage of periodic fire drills that deliberately force a hard shutdown of their cloud servers, network elements or whole availability zones (ChaosMonkey) with no service impact. TMForum is notably supporting their Zero-touch Operations and Management (ZOOM) initiative, very aligned with the DevOps movement, as both leverage automation as the foundation for transformation and business improvement. Even IBM has recognized the power of solving Operation problems using a Software Engineering approach, or as they say: “Recover first, Resolve next”.
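The trade-off between MTBF and MTTR follows from the usual steady-state availability formula, availability = MTBF / (MTBF + MTTR). The Python snippet below compares two scenarios; the hour figures are invented purely for illustration and are not measurements of any real hardware.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical telco-grade box: fails rarely, but needs a 4-hour manual repair.
hardened = availability(mtbf_hours=100_000, mttr_hours=4)
# Hypothetical COTS server: fails 10x more often, but automation recovers in 36 s.
automated = availability(mtbf_hours=10_000, mttr_hours=0.01)

print(f"hardened hardware, manual repair: {hardened:.6f}")
print(f"COTS hardware, automated repair:  {automated:.6f}")
```

With these assumed numbers the cheaper hardware ends up with the higher availability, because cutting the repair time by a factor of 400 outweighs failing ten times more often.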

With this in mind, a Carrier-Grade NFVI has to provide the means for these Next-Generation Operation teams (i.e. NetOps) to be able to use Software Automation to replace their operational runbooks, and avoid manual intervention (Zero-Touch). Failure will happen, and the teams may be prepared, but will they be notified in time? Will the failure be quickly detected by the monitoring tools? Do their operational tools enable the recovery of the service by isolating the faulty components and provisioning new ones, while the Operators can diagnose the root cause without service downtime?

At the NFVI layer, the open source community has improved all the components, including the Operating System, the user-space tools (systemd), the service availability daemons (Pacemaker), the logging and event detection (standard logs processed by ElasticSearch, Kibana and Logstash) and enabled multiple APIs for native out-of-band management (IPMI, PXE, etc.) from the orchestration tools in the VIM. In the OpenStack HA Reference Architecture, there is no single point of failure, as all components (control nodes, message queues, databases, etc.) are fully redundant and provide scale-out services, leading to a fault-tolerant and self-healing OpenStack control plane. You can also find open-source solutions for monitoring and alerting, capacity planning, log correlation and backup/recovery, which can be combined to improve the platform’s uptime even further.

In the case that the VNFs cannot leverage this new operational reality, and need to rely on an HA-ready infrastructure, you can also enable and expose OpenStack’s internal HA mechanism, Pacemaker, to selected VMs. It allows those VNFs to be managed in a fault-proof fashion, automating the recovery of the VM elsewhere in case of host failure, thanks to the tight integration with the storage layer, Ceph being one of the best suited for fast live migrations. This is a unique integration advantage of OpenStack, with advanced options that allow sub-second detection and recovery times with ensured data integrity and consistency.

The 5 factors to consider in a carrier-grade NFV design

Turn on the faucet and you expect water to flow. Turn on the light switch and you expect the light to shine. Pick up the phone and you expect to be able to make or receive a call. Pick up a device and you expect connectivity. There is a consumer expectation of “always available.”

As the nature of services delivered over communications networks diversified, networks have become increasingly complex to handle the exploding volume of data along with high quality voice services. But consumer expectations of “always-on” have only increased.

Historically, communications service providers (CSPs) adopted the term carrier-grade to define network elements that were up to the rigor of deployment in an always-on communications network. While there is no hard definition, carrier-grade generally denotes five nines (99.999%) or six nines (99.9999%) of uptime availability, with five nines equating to a little over five minutes of downtime per year and six nines to about 32 seconds.
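The arithmetic behind the "nines" is easy to check. The short Python sketch below converts a number of nines into yearly downtime, assuming a 365-day year; the function name is just for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds, ignoring leap years


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines
    (e.g. nines=5 means 99.999% uptime)."""
    unavailability = 10 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


for n in (3, 4, 5, 6):
    print(f"{n} nines: {downtime_seconds_per_year(n):10.1f} seconds/year")
# 5 nines is about 315 seconds (roughly 5.3 minutes); 6 nines is about 31.5 seconds.
```

Each extra nine divides the yearly downtime budget by ten, which is why the last nines are the expensive ones.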

NFV offers CSPs a move to a cloud-based virtualized network that runs on commercial off-the-shelf (COTS) hardware. This provides a software-based move away from proprietary hardware appliances and the lack of flexibility that comes along with them. The change is so significant that we need to re-evaluate the factors that define a solution as carrier-grade.

In the following 5 posts,  I’ll analyze the main 5 factors to consider:

  1. Availability
  2. Performance
  3. Scalability
  4. Security
  5. Manageability



Comparing Openstack vs AWS to Tesla vs Edison

After reading this article I realized the current situation between AWS (public cloud) and Private cloud (IBM/Cisco/HP/Dell-EMC) is very similar to the Electricity landscape of the mid-1800s, which ended in the War of the Currents in the late 1800s (AC vs DC).

Let me explain my point in this analogy: before electricity was harnessed, manufacturing used natural power (water, wind and animals) to create movement that was transferred inside the factory, using belts and chains, to help process goods. When electricity appeared, an electrical generator was used as the power source and copper cables transferred electrical power to the machines, which increased productivity and the density of workers while reducing maintenance of the machinery.

So eventually every factory had its own coal-powered electrical generator (i.e. x86 servers nowadays), and line workers used their machines (i.e. computers) that required the energy that came via the copper wires (i.e. internet or VPN access).

Then the Power Grid was invented, and the industry wanted to shut down their small electrical generators and leverage the bigger power grid. Outsourcing servers to AWS (public cloud) is like connecting the factory to the power grid (public electricity), but there were no standards for that in the late 1800s, hence the Edison (DC) vs Tesla (AC) war.

In this case, Amazon is Edison, the greedy inventor, with lock-in in his veins (DC was patented). Tesla is OpenStack, a generous inventor with open APIs and specs to build your own special sauce on top of a good-enough standard (60Hz, 110V, but easily changeable to 50Hz, 220V).

We’re seeing the struggle of Industries outsourcing their power generators and connecting to the grid (i.e. the cloud). But there is the Monopoly way (Edison), and the Open way (Tesla).

Let OpenStack be the Tesla here: it will help the ‘Tech Giants’ to fight the good battle (Interconnected Private clouds using OpenStack) and provide better services to the Industry.

Marcos Garcia, an OpenStack enthusiast