Updating OpenStack TripleO Ceph nodes safely one at a time

Part of the process when updating Red Hat’s TripleO based OpenStack is to apply the package and container updates, via the update run step, to the nodes in each Role (like Controller, CephStorage and Compute, etc). This is done in-place, before the ceph-upgrade (ceph-ansible) step, the converge step and reboots.

openstack overcloud update run --nodes CephStorage

Rather than do an entire Role straight up, however, I always update one node of that type first. This lets me make sure there are no problems (and fix them if there are) before moving on to the whole Role.
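For example, to update just a single Ceph storage node first, you can pass its name instead of the Role (the node name here is just an example; check yours with openstack server list on the undercloud).

openstack overcloud update run --nodes overcloud-cephstorage-0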

I noticed recently when performing the update step on CephStorage role nodes that OSDs and OSD nodes were going down in the cluster. This was then causing my Ceph cluster to go into backfilling and recovering (norebalance was set).

We want all of these nodes to be done one at a time, as taking more than one node out at a time can potentially make the Ceph cluster stop serving data (all VMs will freeze) until recovery finishes and the minimum number of copies is available in the cluster. If all three copies of the data go offline at the same time, it’s not going to be able to recover.
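As a rough sketch of the kind of precautions involved (the exact flags and checks may differ from what’s in the full post), you can set flags so Ceph doesn’t start shuffling data while a node is out, and confirm the cluster is healthy before moving on to the next one.

sudo ceph osd set noout
sudo ceph osd set norebalance
sudo ceph -s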

Continue reading Updating OpenStack TripleO Ceph nodes safely one at a time

Using Ansible and dynamic inventory to manage OpenStack TripleO nodes

TripleO based OpenStack deployments use an OpenStack all-in-one node (undercloud) to automate the build and management of the actual cloud (overcloud) using native services such as Heat and Ironic. Roles are used to define services and configuration, which are then applied to specific nodes, for example Service, Compute and CephStorage.

Although the install is automated, sometimes you need to run adhoc tasks outside of the official update process. For example, you might want to make sure that all hosts are contactable, have a valid subscription (for Red Hat OpenStack Platform), restart containers, or maybe even apply custom changes or patches before an update. Also, during the update process when nodes are being rebooted, it can be useful to use an Ansible script to know when they’ve all come back, all services are running and all containers are healthy, before re-enabling them.

Inventory script

To make this easy, we can use the TripleO Ansible inventory script, which queries the undercloud to get a dynamic inventory of the overcloud nodes. When using the script as an inventory source with the ansible command, however, you cannot pass arguments to it. If you’re managing a single cluster and using the standard stack name of overcloud, then this is not a problem; you can just call the script directly.
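For example, something like this should work from the undercloud to check that all overcloud nodes are contactable (the script path below is where it is typically installed, but it may vary by release).

source ~/stackrc
ansible -i /usr/bin/tripleo-ansible-inventory all -m ping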

Continue reading Using Ansible and dynamic inventory to manage OpenStack TripleO nodes

Using network namespaces with veth to NAT guests with overlapping IPs

Sets of virtual machines are connected to separate virtual bridges (e.g. virbr0 and virbr1) and, as they are isolated, can use the same subnet range and set of IPs. However, NATing becomes a problem because the host won’t know which VM to return the traffic to.

To solve this problem, we can use network namespaces and some veth (virtual Ethernet) devices to connect up each private network we want to NAT.

Each veth device acts like a patch cable and is actually made up of two network devices, one for each end (e.g. peer1-a and peer1-b). By adding those interfaces between bridges and/or namespaces, you create a link between them.

The network namespace is only used for NAT and is where the veth IPs are set; the other end acts like a patch cable without an IP. The VMs are only connected into their respective bridge (e.g. virbr0) and can talk to the network namespace over the veth patch.

We will use two pairs for each network namespace.

  • One (e.g. represented by veth1 below) which connects the virtual machine’s private network (e.g. virbr0 on 10.0.0.0/24) into the network namespace (e.g. net-ns1) where it sets an IP and will be the private network router (e.g. 10.0.0.1).
  • Another (e.g. represented by veth2 below) which connects the upstream provider network (e.g. br0 on 192.168.0.0/24) into the same network namespace where it sets an IP (e.g. 192.168.0.100).
  • Repeat the process for other namespaces (e.g. represented by veth3 and veth4 below).
Configuration for multiple namespace NAT

By providing each private network with its own unique upstream routable IP and applying NAT rules inside each namespace separately, we can avoid any conflicts.
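As a rough sketch using the example names above (peer2-a and peer2-b simply extend the peer1 naming for the second pair, and the upstream gateway of 192.168.0.1 is an assumption), wiring up the first namespace might look something like this.

# First pair: virbr0 <-> net-ns1, the namespace end becomes the private network router
ip netns add net-ns1
ip link add peer1-a type veth peer name peer1-b
ip link set peer1-a master virbr0 up
ip link set peer1-b netns net-ns1
ip netns exec net-ns1 ip addr add 10.0.0.1/24 dev peer1-b
ip netns exec net-ns1 ip link set peer1-b up

# Second pair: br0 <-> net-ns1, the namespace end gets the routable IP
ip link add peer2-a type veth peer name peer2-b
ip link set peer2-a master br0 up
ip link set peer2-b netns net-ns1
ip netns exec net-ns1 ip addr add 192.168.0.100/24 dev peer2-b
ip netns exec net-ns1 ip link set peer2-b up

# Routing and NAT live inside the namespace only
ip netns exec net-ns1 ip route add default via 192.168.0.1
ip netns exec net-ns1 sysctl -w net.ipv4.ip_forward=1
ip netns exec net-ns1 iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o peer2-b -j MASQUERADE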

Continue reading Using network namespaces with veth to NAT guests with overlapping IPs

Using Ansible to define and manage KVM guests and networks with YAML inventories

I wanted a way to quickly spin different VMs up and down on my KVM dev box, to help with testing things like OpenStack, Swift, Ceph and Kubernetes. Some of my requirements were as follows:

  • Define everything in a markup language, like YAML
  • Manage VMs (define, stop, start, destroy and undefine) and apply settings as a group or individually
  • Support different settings for each VM, like disks, memory, CPU, etc
  • Support multiple drives and types, including Virtio, SCSI, SATA and NVMe
  • Create users and set root passwords
  • Manage networks (create, delete) and which VMs go on them
  • Mix and match Linux distros and releases
  • Use existing cloud images from distros
  • Manage access to the VMs including DNS/hosts resolution and SSH keys
  • Have a good set of defaults so it would work out of the box
  • Potentially support other architectures (like ppc64le or arm)

So I hacked together an Ansible role and example playbook. Setting guest states to running, shutdown, destroyed or undefined (to delete and clean up) is supported. It will also manage multiple libvirt networks and guests can have different specs as well as multiple disks of different types (SCSI, SATA, Virtio, NVMe). With Ansible’s --limit option, any individual guest, a hostgroup of guests, or even a mix can be managed.

Managing KVM guests with Ansible

Although Terraform with libvirt support is potentially a good solution, by using Ansible I can use that same inventory to further manage the guests and I’ve also been able to configure the KVM host itself. All that’s really needed is a Linux host capable of running KVM, some guest images and a basic inventory; Ansible will do the rest (on supported distros).
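For instance, managing an individual guest or a whole group with --limit might look something like this (the playbook and inventory file names are just examples, not the actual files from the repo).

ansible-playbook -i inventory.yml virt-infra.yml --limit centos-7-test
ansible-playbook -i inventory.yml virt-infra.yml --limit kvmhost,ceph_nodes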

Continue reading Using Ansible to define and manage KVM guests and networks with YAML inventories

Use swap on NVMe to run more dev KVM guests, for when you run out of RAM

I often spin up a bunch of VMs for different reasons when doing dev work and unfortunately, as awesome as my little mini-itx Ryzen 9 dev box is, it only has 32GB RAM. Kernel Samepage Merging (KSM) definitely helps, however when I have half a dozen or so VMs running and chewing up RAM, the Kernel’s Out Of Memory (OOM) killer will start executing them, like this.

[171242.719512] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/machine.slice/machine-qemu\x2d435\x2dtest\x2dvm\x2dcentos\x2d7\x2d00.scope,task=qemu-system-x86,pid=2785515,uid=107
[171242.719536] Out of memory: Killed process 2785515 (qemu-system-x86) total-vm:22450012kB, anon-rss:5177368kB, file-rss:0kB, shmem-rss:0kB
[171242.887700] oom_reaper: reaped process 2785515 (qemu-system-x86), now anon-rss:0kB, file-rss:68kB, shmem-rss:0kB

If I had more slots available (which I don’t) I could add more RAM, but that’s actually pretty expensive, plus I really like the little form factor. So, given it’s just dev work, a relatively cheap alternative is to buy an NVMe drive and add a swap file to it (or dedicate the whole drive). This is what I’ve done on my little dev box (actually I bought it with an NVMe drive so adding the swapfile came for free).
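Setting it up is simple enough; something like this creates and enables a 32GB swap file, assuming the NVMe drive is formatted and mounted at /mnt/nvme (the path and size are just examples).

sudo dd if=/dev/zero of=/mnt/nvme/swapfile bs=1M count=32768 status=progress
sudo chmod 600 /mnt/nvme/swapfile
sudo mkswap /mnt/nvme/swapfile
sudo swapon /mnt/nvme/swapfile
echo '/mnt/nvme/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab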

Continue reading Use swap on NVMe to run more dev KVM guests, for when you run out of RAM

Using pipefail with shell module in Ansible

If you’re using the shell module with Ansible and piping the output to another command, it might be a good idea to set pipefail. This way, if the first command fails, the whole task will fail.

For example, let’s say we’re running this silly task to look at the /tmp directory and then trim the characters “tmp” from the result.

ansible all -i "localhost," -m shell -a \
'ls -ld /tmp | tr -d tmp'

This will return something like this, with a successful return code.

localhost | CHANGED | rc=0 >>
drwxrwxrw. 26 roo roo 640 Se 28 19:08 /
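To make the task fail when the first command in the pipe does, something like this should do the trick (pipefail needs bash, so the shell module’s executable parameter is set; this is a sketch rather than the exact command from the full post).

ansible all -i "localhost," -m shell -a \
'set -o pipefail && ls -ld /tmp | tr -d tmp executable=/bin/bash'

Now, if the ls fails (say, because the path doesn’t exist), the task returns a non-zero return code instead of reporting success.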

Continue reading Using pipefail with shell module in Ansible

Setting up a monitoring host with Prometheus, InfluxDB and Grafana

Prometheus and InfluxDB are powerful time series database monitoring solutions, both of which are natively supported by the graphing tool Grafana.

Setting up these simple but powerful open source tools gives you a great base for monitoring and visualising your systems. We can use agents like node-exporter to publish metrics on remote hosts which Prometheus will scrape, and other tools like collectd which can send metrics to InfluxDB’s collectd listener (as per my post about OpenWRT).

Prometheus’ node exporter metrics in Grafana

I’m using CentOS 7 on a virtual machine, but this should be similar to other systems.
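As a quick sketch of how the pieces fit together (the host names here are placeholders and the ports are the defaults), each service exposes an endpoint you can check once it’s up.

curl -s http://monitored-host:9100/metrics | head    # node-exporter metrics for Prometheus to scrape
curl -s http://monitoring-host:9090/-/healthy        # Prometheus health endpoint
curl -s -o /dev/null -w '%{http_code}\n' http://monitoring-host:8086/ping   # InfluxDB ping (expect 204)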

Continue reading Setting up a monitoring host with Prometheus, InfluxDB and Grafana

Securing Linux with Ansible

The Ansible Hardening role from the OpenStack project is a great way to secure Linux boxes in a reliable, repeatable and customisable manner.

It was created by a former colleague of mine, Major Hayden, and while it was spun out of OpenStack, it can be applied generally to a number of the major Linux distros (including Fedora, RHEL, CentOS, Debian and SUSE).

The role is based on the Security Technical Implementation Guide (STIG) out of the United States for RHEL, which provides recommendations on how best to secure a host and the services it runs (category one for highly sensitive systems, two for medium and three for low). This is similar to the Information Security Manual (ISM) we have in Australia, although the STIG is more explicit.
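As a rough sketch (the repository URL and playbook name are assumptions, not taken from the post), pulling the role down and applying it to your hosts might look like this.

ansible-galaxy install git+https://opendev.org/openstack/ansible-hardening
ansible-playbook -i hosts harden.yml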

Continue reading Securing Linux with Ansible

Patches for OpenStack Ironic Python Agent to create Buildroot images with Make

Recently I wrote about creating an OpenStack Ironic deploy image with Buildroot. Doing this manually is good because it helps you understand how it’s pieced together; however, it is slightly more involved.

The Ironic Python Agent (IPA) repo has some imagebuild scripts which make building the CoreOS and TinyCore images pretty trivial. I now have some patches which add support for creating the Buildroot images, too.

The patches consist of a few scripts which wrap the manual build method and a Makefile to tie it all together. Only the install-deps.sh script requires root privileges, and only if it detects missing dependencies; all other Buildroot tasks are run as a non-privileged user. It’s one of the great things about the Buildroot method!
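With the patches applied, the flow is roughly as follows (the paths and targets here are illustrative assumptions, not the exact ones from the patches).

sudo ./imagebuild/buildroot/install-deps.sh   # the only step that needs root
make -C imagebuild/buildroot                  # everything else runs as a normal user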

Continue reading Patches for OpenStack Ironic Python Agent to create Buildroot images with Make

Creating an OpenStack Ironic deploy image with Buildroot

Edit: See this post on how to automate the builds using buildimage scripts.

Ironic is an OpenStack project which provisions bare metal machines (as opposed to virtual).

A tool called Ironic Python Agent (IPA) is used to control and provision these physical nodes, performing tasks such as wiping the machine and writing an image to disk. This is done by booting a custom Linux kernel and initramfs image which runs IPA and connects back to the Ironic Conductor.

The Ironic project supports a couple of different image builders, including CoreOS, TinyCore and others via Disk Image Builder.

These have their limitations, however. For example, they require root privileges to build and, with the exception of TinyCore, are all hundreds of megabytes in size. One of the downsides of TinyCore is limited hardware support and, although it’s not used in production, it is used in the OpenStack gating tests (where it’s booted in virtual machines with ~300MB RAM).
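As a very rough sketch of the manual Buildroot approach the post walks through (the mirror URL and config choices are assumptions), the build itself runs entirely unprivileged.

git clone https://github.com/buildroot/buildroot.git && cd buildroot
make menuconfig    # pick a kernel, a cpio (initramfs) rootfs and the packages IPA needs
make               # kernel and rootfs end up under output/images/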

Continue reading Creating an OpenStack Ironic deploy image with Buildroot