Deploying OpenStack on just one hosted server

The RDO project recently revamped all of its tools to both release and install OpenStack. I was impressed by the changes, so I wanted to test them, and indeed, it all works great now. You can now easily run the full-blown TripleO installer on a single machine (with separate VMs for the controller and compute) thanks to the new TripleO-QuickStart tool. The main source of documentation is https://github.com/openstack/tripleo-quickstart – and a presentation is available here: https://goo.gl/LUuYSK

 

However, I use my laptop for work and it only has 16GB of RAM, which makes it hard to experiment with OpenStack there. Thanks to a no-setup-fees offer from OVH, I found a $49 CAD/month server with an Intel Xeon, SSD drives and 32GB of RAM, and it took me just 10 minutes to create an account and pay the first month by credit card.

Quick comments before showing you how to do it:

  • A hosted server has only 1 NIC: the public one. Be careful, it’s really exposed to the internet; secure it properly with firewall rules, SSH configuration and fail2ban (see the hardening sketch after this list)
  • A 16GB server can work too, but it’ll be very slow. 32G is faster 🙂
  • Stick to the documented path (i.e. CentOS 7)
  • It will take you at least 1 to 2 hours from start to finish
  • You’ll need SSH tunnels to connect to the overcloud, and a SOCKS proxy from your browser to see Horizon (I use the FoxyProxy extension for Firefox)
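
For reference, here is a minimal hardening sketch for the hosted server. Run it once your SSH key is copied in Step 3; it assumes firewalld is enabled in the default CentOS 7 zone and uses the EPEL repository for fail2ban, so adapt it to your own policy:

$ssh root@$VIRTHOST "firewall-cmd --permanent --remove-service=dhcpv6-client && firewall-cmd --reload" #leave only SSH exposed in the default zone
$ssh root@$VIRTHOST "yum install -y epel-release && yum install -y fail2ban"
$ssh root@$VIRTHOST "systemctl enable fail2ban && systemctl start fail2ban"
$ssh root@$VIRTHOST "sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config && systemctl restart sshd" #key-based logins only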

In the basic scenario, your hosted server will be running 3 VMs: the undercloud (12GB RAM, 4 vCPUs), plus a control and a compute node (8GB RAM, 1 vCPU each). Those VMs are only visible to the stack user; root cannot see them via virsh. You can create more VMs by passing parameters to the TripleO Ansible playbooks.

[root@ooo-quickstart ~]# su - stack
Last login: Mon May 16 02:27:21 CEST 2016 on pts/0
[stack@ooo-quickstart ~]$ virsh list --all
 Id Name State
----------------------------------------------------
 2 undercloud running
 7 compute_0 running
 8 control_0 running

Step 1: Set up the server

  1. Lease a server with 32GB of RAM
  2. Deploy CentOS 7.2, using a customized install to select the distribution kernel and allocate most of the disk to /home.
    1. Otherwise, OVH installs its own kernel, which lacks KVM support. Other hosting providers may do the same for security purposes.
  3. Once it boots, write down the IP address

Step 2: Download the script to your local Linux box

The script will create a virtualenv, where Ansible 2 will be downloaded and the SSH keys will be stored. It will also install some package dependencies. Read the script first! Don’t run “sudo bash” on anything you download without reading it first.

$git clone https://github.com/openstack/tripleo-quickstart 
$cd tripleo-quickstart
$sudo bash quickstart.sh --install-deps

NOTE: on my Fedora 23 laptop, I had to install an additional package:

$sudo dnf install redhat-rpm-config

Step 3: Configure your server 

You’ll receive the root password by email. We’ll only use it for the first login, to copy our SSH key to the server.

$export VIRTHOST=1.2.3.4 #put your own IP here
$ssh-copy-id root@$VIRTHOST
$ssh root@$VIRTHOST uname -a #ensure it's the distribution kernel. OVH kernel says 3.14.32-xxxx-grs-ipv6-64
$ssh root@$VIRTHOST yum groupinstall "Virtualization Host" -y

NOTE: Workarounds for not-yet-reported bugs (May 15th, 2016)

I haven’t reported these as bugs yet, but just in case: with CentOS 7.2 there are two issues that will break the installation at some point, and they need to be fixed on your hosted server (running CentOS) before the installation:

  1. #ERROR: qemu-kvm: -chardev pty,id=charserial0: Failed to create chardev\n
    1. I solved it via https://loginroot.com/qemu-kvm-chardev-ptyidcharserial0-failed-to-create-chardev/
    2. Basically, replace the devpts line of /etc/fstab with “devpts /dev/pts devpts gid=5,mode=620 0 0”
    3. Then run “mount -o remount /dev/pts”
  2. #ERROR: Node 141e60b7-19ea-43d1-b14e-fe07193cdf7d did not pass power credentials validation: SSH connection cannot be established: Failed to establish SSH connection to host 192.168.23.1 ; and DEBUG ironic.common.utils [req-8d62ae59-8832-4eec-82d6-c9139d7624a8 – – – – -] SSH connect failed: Incompatible ssh server (no acceptable macs) ssh_connect /usr/lib/python2.7/site
    1. I solved it via http://stackoverflow.com/questions/28399335/python-paramiko-incompatible-ssh-server
    2. Just edit /etc/ssh/sshd_config and add another MAC algorithm: MACs hmac-sha1
    3. Then restart the service with systemctl restart sshd
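
For convenience, here is roughly what those two workarounds look like when run over SSH from your laptop (a sketch that assumes the stock /etc/fstab and sshd_config from the CentOS 7.2 install; review both files before and after):

$ssh root@$VIRTHOST "sed -i 's|^devpts.*|devpts /dev/pts devpts gid=5,mode=620 0 0|' /etc/fstab" #fix the devpts mount options
$ssh root@$VIRTHOST "mount -o remount /dev/pts"
$ssh root@$VIRTHOST "echo 'MACs hmac-sha1' >> /etc/ssh/sshd_config" #a MACs line with a single algorithm restricts sshd to it; extend the list if other clients need more
$ssh root@$VIRTHOST "systemctl restart sshd"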

Step 4: Start the installation (from your laptop)

Inside the tripleo-quickstart folder, execute:

(your-laptop tripleo-quickstart)$ bash quickstart.sh $VIRTHOST

Alternatively, you can use other environment files (or even create your own), like this:

$ bash quickstart.sh --config playbooks/centosci/ha.yml  $VIRTHOST

Once finished, this message appears:

##################################
Virtual Environment Setup Complete
##################################

Access the undercloud by:

 ssh -F /home/marcos/.quickstart/ssh.config.ansible undercloud

There are scripts in the home directory to continue the deploy:

 undercloud-install.sh will run the undercloud install
 undercloud-post-install.sh will perform all pre-deploy steps
 overcloud-deploy.sh will deploy the overcloud
 overcloud-deploy-post.sh will do any post-deploy configuration
 overcloud-validate.sh will run post-deploy validation

Alternatively, you can ignore these scripts and follow the upstream docs:

First:

 openstack undercloud install
 source stackrc

Then continue with the instructions (limit content using dropdown on the left):

 http://ow.ly/Ze8nK

One by one, execute the following commands:

(your-laptop)$ ssh -F ~/.quickstart/ssh.config.ansible undercloud
#(now we're in the undercloud VM, SSH-jumped via $VIRTHOST)
[stack@undercloud ~]$ undercloud-install.sh
[stack@undercloud ~]$ undercloud-post-install.sh
[stack@undercloud ~]$ overcloud-deploy.sh
[stack@undercloud ~]$ overcloud-deploy-post.sh
[stack@undercloud ~]$ overcloud-validate.sh

Step 5: Connect to OpenStack

From the undercloud, recover the OpenStack credentials that the TripleO installer stored in ~/overcloudrc, and connect using the CLI as usual.

[stack@undercloud ~]$ . overcloudrc 
[stack@undercloud ~]$ keystone catalog

To use Horizon, the easiest way is to re-connect via SSH to the undercloud, enabling a SOCKS proxy.

(your-laptop)$ ssh -F ~/.quickstart/ssh.config.ansible undercloud -D 9090

Then, in Firefox, configure the FoxyProxy extension to use the SOCKS proxy on localhost:9090.

Now, find out the IP and credentials to connect to Horizon:

[stack@undercloud ~]$ cat overcloudrc 
export OS_AUTH_URL=http://10.0.0.4:5000/v2.0
export OS_USERNAME=admin
export OS_PASSWORD=qU8veJ3RVJZmEnvz9fzqubDbR
export OS_TENANT_NAME=admin
(...)

Finally, open the browser to http://10.0.0.4/
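
Before configuring the browser, you can verify that the tunnel works by sending a request through the SOCKS proxy (a quick sanity check; the IP is the OS_AUTH_URL host from overcloudrc):

(your-laptop)$ curl -I --socks5-hostname localhost:9090 http://10.0.0.4/ #expect an HTTP response from Horizon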

 

Next Steps

Now that you have a very simple OpenStack installation (1 controller, 1 compute), you can experiment with a production-like setup of 3 controllers and 2 computes by simply telling Ansible you want more of those profiles (see how to pass parameters to the TripleO Ansible playbooks). If you have enough space, I suggest you also enable Ceph.
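
As a sketch of how that looks with tripleo-quickstart (the bundled ha.yml is the authoritative reference for the exact variable names in your version): copy one of the shipped configs, adjust the node list, and pass it with --config:

(your-laptop tripleo-quickstart)$ cp playbooks/centosci/ha.yml my-ha.yml
#edit my-ha.yml: add or remove controller/compute entries and shrink their RAM so everything fits in 32GB
(your-laptop tripleo-quickstart)$ bash quickstart.sh --config my-ha.yml $VIRTHOST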

You can delete your OpenStack installation from the undercloud by simply running “heat stack-delete overcloud”. Then, from your laptop, re-execute tripleo-quickstart with the new variables and it will re-configure the undercloud accordingly, saving you a lot of time (all images have already been downloaded and configured).
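
Roughly, the tear-down and re-deploy cycle looks like this (a sketch; my-ha.yml is the hypothetical custom config from the previous example):

[stack@undercloud ~]$ source stackrc
[stack@undercloud ~]$ heat stack-delete overcloud #removes the overcloud; the undercloud and images stay
#then, back on your laptop:
(your-laptop tripleo-quickstart)$ bash quickstart.sh --config my-ha.yml $VIRTHOST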

Happy testing!


Manageability of a Telco Cloud

When it comes to Service Providers, their platform components often have a longer lifecycle. One of the key requirements is the ability to perform upgrades in a supported fashion, together with a clear understanding of how and when the software provider releases their software. The challenge with open source software is finding the proper support organization, with long enough support cycles (at least 4-5 years), that thoroughly plans and tests every release with backwards compatibility in mind.

To that end, the installation tool needs to encompass the update and upgrade use cases. One of the best management tools for the OpenStack lifecycle is the one that uses OpenStack itself (project TripleO). The tool is used in all stages from day 0 to day 2, and is being improved to extend the APIs that control installation, upgrade and rollback. It allows controlled addition of compute or storage nodes, updating them and decommissioning them. Although the road to perfect manageability is long, we’re helping upstream projects (like OPNFV) learn from Telco operators and continuously improving the tools for a seamless experience. At the same time, Service Providers can leverage additional management tools like CloudForms and Satellite, for a scalable combination of policies, configuration management, auditing and general management of their NFV infrastructure.

One of the most recent improvements has been the addition of bare metal tools and out-of-band management for OpenStack (Ironic and its growing list of supported devices, like IPMI). This allows treating physical servers as another pool of resources that can be enabled or disabled on demand. The possibilities in terms of energy savings (green computing) may well justify the investment in an OpenStack-based NFVI, due to the elasticity that such a solution offers.

Operators should analyze the core components of their VNFs and their interaction with the virtualized infrastructure. Features like host-level live kernel upgrades with Kpatch may not work if the VNFs are not supported guests. This means ensuring the VNFs use the latest QEMU drivers, with a modern and supported kernel adapted to the recent x86 extensions (VT-x, VT-d). The drivers at the host level are equally important, as they may limit the operations available to update or modify settings of the platform.

Finally, one piece of advice when choosing the components of the NFVI control plane: ideally, the control plane should offer a protection level equal to or better than that of the systems it controls. Most vendors have chosen a simple clustering technology (keepalived) that is popular in the upstream OpenStack project, as it fits most enterprise needs. At Red Hat, being experts in mission-critical environments, we chose Pacemaker (although keepalived is also supported), because its advanced quorum and fencing capabilities significantly increase the uptime of the control plane elements. Pacemaker is an open source project and it can be found at http://clusterlabs.org/. The resulting architecture of an OpenStack control plane with Pacemaker underneath permits automatic fault monitoring and recovery of the foundational services that compose the NFVI layer.

Security in a Virtualized world

When it comes to security, multiple aspects need to be reviewed. Although it is very important to remember the security requirements of CSPs, the significant aspects to consider are those affected by the virtualization of network functions.

When CSPs do not use virtualization, every electronic component has a dedicated purpose, sometimes with specific implementations in silicon (ASICs) or dedicated execution environments for software running on x86 (bare metal computing).

When virtualizing network functions, we no longer leverage physical barriers, so the principle of security by isolation does not apply anymore. In the worst-case scenario, we can have the same CPU executing instructions for two VMs simultaneously: one participating in a DMZ (demilitarized zone) and another running an HLR (Home Location Register) with customer records.

In NFV, the hypervisor provides the separation between VM contexts. It is the job of the operating system to provide policies like SELinux and sVirt to ensure that the VMs behave properly. Then, it’s up to the scheduling system (OpenStack) to guarantee host isolation via anti-affinity rules and other policies. Those policies monitor not just the entry points (input/output) but also the operations VMs are allowed to perform, in terms of memory access and device management. When the Service Provider needs more global policies and automated enforcement and auditing, Red Hat recommends CloudForms or Satellite as an additional management layer.
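
As an illustration, anti-affinity is typically requested at boot time through a server group (a minimal sketch with hypothetical group, image and instance names, using the nova CLI of that era):

$ nova server-group-create hlr-group anti-affinity #create a group with the anti-affinity policy
$ nova boot --flavor m1.large --image rhel-guest --hint group=<group-uuid> hlr-vm-1 #members of the group are scheduled to different hosts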

The security concerns are so critical that customers demand to see the hypervisor source code as the ultimate measure to audit the quality of that isolation. Thanks to the work of Intel with other open source communities, KVM now uses the latest CPU technologies, including built-in security enhancements.

Some of the features that protect Linux and KVM include:

  • Process Isolation, supported by CPU hardware extensions
  • MAC isolation by default (SELinux and sVirt)
  • RBAC and Audit trail
  • Resource control (CGroups)
  • FIPS 140-2 validated cryptographic modules  
  • Disk encryption (LUKS) for data at rest
  • Strong key support for SSH (AES256) for data in motion
  • PKI authentication for operator management, SSL when using the HTTPS interface
  • SCAP audits supported (OpenSCAP), using the Security Guide developed with NSA, NIST and DISA

Detailed documentation of the security mechanisms of Linux and KVM can be found here.

Leveraging the decades of work to make Linux suitable for the demands of an enterprise environment has created a vast ecosystem of tools, including default policies like PolicyKit and SELinux and auditing tools like OpenSCAP. This is why you can find Linux and KVM in mission-critical environments like healthcare, government, finance and even the military.
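
For example (a sketch; the exact profile ID and datastream path come from the scap-security-guide package and vary by version), an SCAP audit of a RHEL/CentOS 7 host can be as simple as:

$sudo yum install -y openscap-scanner scap-security-guide
$sudo oscap xccdf eval --profile xccdf_org.ssgproject.content_profile_standard --report /tmp/scap-report.html /usr/share/xml/scap/ssg/content/ssg-centos7-ds.xml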

Scalability of the NFVi

Most Telcos and Standards Organizations have chosen OpenStack as the NFVi because its design allows for almost unlimited scalability, in theory. The reality is that feature growth comes at the cost of management overhead, and currently a typical OpenStack deployment with all the components enabled cannot scale linearly beyond a few hundred nodes.


It’s by fine-tuning the 1500+ configuration variables of the control plane that we can achieve real scalability of the compute, network and storage planes, ensuring consistent behaviour under the most demanding situations. This requires deep expertise in the field and mastery of all the technological pieces, like the operating system, database engines, networking stacks, etc.

The default parameters of an NFVi may not be fine-tuned to avoid bottlenecks and growth issues. You can find interesting experiences on pushing OpenStack to handle 168,000 instances in this Ubuntu post, and in this video from the Vancouver summit on the crazy things Rackspace had to do to achieve a really large-scale deployment.

For more information on the techniques to segregate and scale your environment, visit the OpenStack operators’ guidelines on scaling.

 

Performance for VNFs

VNFs require high performance and deterministic behaviour, particularly in packet processing, to reduce latency and jitter. Upstream communities like OpenStack, OPNFV, libvirt and others are adding features specific to the NFV use case. These features give service providers and their vendors tight control over the performance and predictability of their NFV workloads, even during congestion or high-load situations.

Non-Uniform Memory Access (NUMA) aware scheduling allows applications and management systems to allocate workloads to processor cores considering their internal topology. This way, they can optimize for latency and throughput of memory and network interface accesses and avoid interference from other workloads. Combined with other tuning, we are, for example, able to reach >95% of bare-metal efficiency processing small (64-byte) packets in a DPDK-enabled application within a VM using SR-IOV. This represents between 10 and 100 times better performance than non-optimized private clouds.
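
As an illustration of how this is exposed to operators (a sketch with a hypothetical flavor name; hw:cpu_policy, hw:numa_nodes and hw:mem_page_size are the standard Nova extra specs for CPU pinning, NUMA topology and huge pages):

$ nova flavor-create m1.nfv auto 8192 40 8 #name, id, RAM (MB), disk (GB), vCPUs
$ nova flavor-key m1.nfv set hw:cpu_policy=dedicated hw:numa_nodes=1 hw:mem_page_size=1GB #CPU pinning, single NUMA node, 1GB huge pages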


The latest Linux kernel offers Real-Time KVM extensions that enable NFV workloads with the well-bounded, low-latency (<10μs) response times required for signal processing use cases like cloud-based Radio Access Networks. It is also important to realize that KVM has recently improved its PCI and SR-IOV management (VFIO) to allow zero-overhead passthrough of devices to VMs.

Open vSwitch has also received improvements, and further extensions with Intel’s Data Plane Development Kit (DPDK), which enables higher packet throughput between Virtualized Network Functions and network interface cards, as well as between Network Functions. Figures as high as 213 million packets per second for 64-byte packets (the worst-case scenario) on an x86 COTS server prove that the technology has less than 1% performance overhead in the worst cases, and an even smaller impact in most real-world cases. http://redhatstackblog.redhat.com/2015/08/19/scaling-nfv-to-213-million-packets-per-second-with-red-hat-enterprise-linux-openstack-and-dpdk/

Additional performance measures are available at the operating-system level, with tools like Tuned, which provides predefined profiles that change the way interrupts are handled or how the process scheduler prioritizes tasks.
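
For example (a sketch; the list of available profiles depends on the installed tuned-profiles packages), switching a compute host to a virtualization-oriented profile is a one-liner:

$ tuned-adm list #show the available profiles
$ tuned-adm profile virtual-host #optimize the host for running KVM guests
$ tuned-adm active #confirm which profile is applied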

Availability in NFV

There are two ways to achieve always-on service: maximizing MTBF (mean time between failures) and minimizing MTTR (mean time to repair). Traditional network equipment has been designed with superb MTBF in mind: dual power supplies, dual network cards, error-correcting memory, etc. Of course, MTTR was also considered, thanks to VRRP, health checks, proactive monitoring, etc.

The NFV dream of leveraging COTS instead of ATCA inevitably leads to a lower MTBF. Commoditization means cheaper and more generic (not telco-tailored) hardware, which impacts reliability. The current trend in the industry is not to improve MTBF via software extensions to the virtualization layer (with clustering, for instance), as that wastes precious resources waiting for the failure to happen (i.e. with 1+1 replication). The trend is actually about MTTR, as has already been proven by cloud-native applications such as Netflix and their use of periodic fire drills that deliberately destroy their cloud servers, network elements or whole availability zones (Chaos Monkey) with no service impact. TM Forum is notably promoting its Zero-touch Operations and Management (ZOOM) initiative, very much aligned with the DevOps movement, as both leverage automation as the foundation for transformation and business improvement. Even IBM has recognized the power of solving operational problems using a software engineering approach, or as they say: “Recover first, resolve next”.

With this in mind, a carrier-grade NFVI has to provide the means for these next-generation operations teams (i.e. NetOps) to use software automation to replace their operational runbooks and avoid manual intervention (zero-touch). Failures will happen, and the teams may be prepared, but will they be notified in time? Will the failure be quickly detected by the monitoring tools? Do their operational tools enable recovery of the service by isolating the faulty components and provisioning new ones, while the operators diagnose the root cause without service downtime?

At the NFVI layer, the open source community has improved all the components, including the operating system, the user-space tools (systemd), the service availability daemons (Pacemaker), the logging and event detection (standard logs processed by Elasticsearch, Kibana and Logstash), and has enabled multiple APIs for native out-of-band management (IPMI, PXE, etc.) from the orchestration tools in the VIM. In the OpenStack HA Reference Architecture there is no single point of failure, as all components (control nodes, message queues, databases, etc.) are fully redundant and provide scale-out services, leading to a fault-tolerant and self-healing OpenStack control plane. You can also find open source solutions for monitoring and alerting, capacity planning, log correlation and backup/recovery that can be combined to improve the platform’s uptime even further.
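
As a quick illustration (a sketch; the hostname is hypothetical and the exact resource names depend on the deployment), on a Pacemaker-managed OpenStack controller you can inspect the redundant control-plane services with the pcs CLI:

[root@controller-0 ~]# pcs status #cluster membership, quorum and overall resource state
[root@controller-0 ~]# pcs resource show #the OpenStack services managed as Pacemaker resources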

If the VNFs cannot leverage this new operational reality and need to rely on an HA-ready infrastructure, you can also enable and expose OpenStack’s internal HA mechanism, Pacemaker, to selected VMs. It allows those VNFs to be managed in a fault-proof fashion, automating the recovery of the VM elsewhere in case of host failure, thanks to the tight integration with the storage layer, Ceph being one of the best suited for fast live migrations. This is a unique integration advantage of OpenStack, with advanced options that allow sub-second detection and recovery times with ensured data integrity and consistency.

The 5 factors to consider in a carrier-grade NFV design

Turn on the faucet and you expect water to flow. Turn on the light switch and you expect the light to shine. Pick up the phone and you expect to be able to make or receive a call. Pick up a device and you expect connectivity. There is a consumer expectation of “always available.”

As the nature of services delivered over communications networks diversified, networks have become increasingly complex to handle the exploding volume of data along with high quality voice services. But consumer expectations of “always-on” have only increased.

Historically, communications service providers (CSPs) adopted the term carrier-grade to define network elements that were up to the rigor of deployment in an always-on communications network. While there is not a hard definition, carrier-grade generally denoted five nines (99.999%) or six nines (99.9999%) of uptime availability, with six nines equating to about 30 seconds of downtime per year.

NFV offers CSPs a move to a cloud-based virtualized network that runs on commercial off-the-shelf (COTS) hardware. This is a software-based move away from proprietary hardware appliances and the lack of flexibility that comes along with them. The change is so significant that we need to re-evaluate the factors that define a solution as carrier-grade.

In the following 5 posts, I’ll analyze the 5 main factors to consider:

  1. Availability
  2. Performance
  3. Scalability
  4. Security
  5. Manageability

Subscribe to the blog to receive the updates in your inbox


Boosting the NFV datapath with RHEL OpenStack Platform

Instead of explaining SR-IOV and DPDK myself, please visit Nir Yechiel’s article on http://redhatstackblog.redhat.com/ and you’ll understand the multiple combinations of technologies around fast processing of network traffic in virtualized environments… Enjoy!

The Network Way - Nir Yechiel's blog

A post I wrote for the Red Hat Stack blog, trying to clarify what we are doing with RHEL OpenStack Platform to accelerate the datapath for NFV applications.

Read the full post here: Boosting the NFV datapath with RHEL OpenStack Platform


Comparing OpenStack vs AWS to Tesla vs Edison

After reading this article, I realized that the current situation of AWS (public cloud) vs private cloud (IBM/Cisco/HP/Dell-EMC) is very similar to the electricity landscape of the mid-1800s, which ended in the War of the Currents of the late 1800s (AC vs DC).

Let me explain my point with this analogy: before electricity was discovered, manufacturing used natural power (water, air and animals) to create movement that was transferred inside the factory, using belts and chains, to help process goods. When electricity appeared, an electrical generator was used as the power source and copper cables transferred electrical power to the machines, which increased productivity and the density of workers while reducing maintenance of the machinery.

So eventually every factory had its own coal-powered electrical generator (i.e. x86 servers nowadays), and line workers used their machines (i.e. computers) that required the energy that came via the copper wires (i.e. internet or VPN access).

Then the power grid was invented, and industry wanted to shut down its small electrical generators and leverage the bigger power grid. Outsourcing servers to AWS (public cloud) is like connecting the factory to the power grid (public electricity), but there were no standards for that in the late 1800s, hence the Edison (DC) vs Tesla (AC) war.

In this analogy, Amazon is Edison, the greedy inventor, with lock-in in his veins (DC was patented). Tesla is OpenStack, a generous inventor with open APIs and specs to build your own special sauce on top of a good-enough standard (60Hz, 110V, but easily changeable to 50Hz, 220V).

We’re seeing industries struggle with outsourcing their power generators and connecting to the grid (i.e. the cloud). But there is the monopoly way (Edison) and the open way (Tesla).

Let OpenStack be the Tesla here; it will help the ‘Tech Giants’ fight the good battle (interconnected private clouds using OpenStack) and provide better services to the industry.

Marcos Garcia, an OpenStack enthusiast