The Ops of DevOps Learning

October 6, 2020
By LimePoint Engineering
Automation

A recent change to some infrastructure code introduced a bug where iptables entries were being created in duplicate or triplicate under certain conditions. The fix was developed in a test environment with seemingly no new errors, reviewed in the usual manner, and slated for an emergency release. Within a few minutes of applying the change to pre-production environments, we started receiving reports from users that the applications were unavailable.
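To illustrate the class of bug (not the actual code involved), here is a minimal Python sketch of one way automation can avoid appending a firewall rule that already exists. The function name `missing_rules` and the sample rules are hypothetical, and the comparison is purely textual: it assumes rules are given in the form printed by `iptables -S`. That listing may not match a rule exactly as it was typed, so asking iptables itself with its `-C` (check) option is likely the safer test in real automation.

```python
import shlex


def missing_rules(existing, desired):
    """Return the desired rules that are not already present.

    Rules are compared after splitting into tokens, so differences in
    whitespace do not count. Repeats within `desired` are also skipped.
    """
    seen = {tuple(shlex.split(rule)) for rule in existing}
    to_add = []
    for rule in desired:
        key = tuple(shlex.split(rule))
        if key not in seen:
            seen.add(key)
            to_add.append(rule)
    return to_add


# Hypothetical rule specs, written in the style of `iptables -S` output.
existing = ["-A INPUT -p tcp --dport 22 -j ACCEPT"]
desired = [
    "-A INPUT -p tcp --dport 22 -j ACCEPT",      # already present
    "-A INPUT -p tcp   --dport 443 -j ACCEPT",   # new (extra spaces)
    "-A INPUT -p tcp --dport 443 -j ACCEPT",     # repeat of the line above
]

print(missing_rules(existing, desired))
```

Running the automation twice with this kind of check in place should add nothing the second time, which is the property the original bug violated.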

Each time we applied the automation change and pushed it to the servers, the VM would kernel panic and restart; once it had restarted, the problem seemingly went away. Our first question was clear: is it the automation or the OS?

In my experience, automating a task usually goes one of two ways:

Automated from scratch; or
Manual commands converted into automation/code.

Both are valid methods when approaching task automation. However, working the other way (that is, working out manual steps from automation) is very often like using Google Translate to turn a sentence into a different language and then converting it back again. Sure, it might get the point across, but don’t expect much else.

After about 4 or 5 hours of investigating and attempting to reproduce the kernel panic, we realised that the applications needed to be started to trigger the issue. We’d previously left the applications down to prevent users from connecting; this shouldn’t have been an issue (or so we thought), as the cause very clearly seemed to be OS or kernel related. Once the applications were started, we were able to rule out the automation as the root cause by reproducing the problem with manual commands, and we eventually raised a support request with the vendor for a kernel patch.

It’s easy to make a very clear distinction between application and system when investigating an issue. But if we expect our hardware and software layers to work together, shouldn’t we treat them together when troubleshooting issues? The answer is yes, but every Ops person out there needs to be humbly reminded of this from time to time.
