If your company wants to build confidence in your system’s ability to withstand turbulent conditions in production, then testing systems through chaos engineering is the way to go. By exposing weaknesses before they cause outages, chaos engineering lets teams keep developing and deploying rapidly without sacrificing reliability.
The Origins of Chaos Engineering
Chaos engineering was developed at Netflix by a team of software engineers for software quality assurance (QA) and testing. The concept was conceived a decade ago when the subscription streaming service moved from its own data centres to the public cloud. It was at this point that the team decided they needed to create a highly resilient service in their new cloud architecture. Netflix created Chaos Monkey, an automated testing tool that disables random servers during regular working hours, in 2010 and later released it as open-source software. John Ciancutti, the former VP of Product Engineering at Netflix, stated that Chaos Monkey arbitrarily killed services within the architecture to test the system’s ability to keep working despite failures and unexpected outages. Netflix continues to develop additional testing tools that build resilience by covering further failure states.
Chaos engineering is gaining popularity among large technology companies such as Facebook, Google, Microsoft and LinkedIn, who use it to find failures before an outage occurs. Two pertinent questions arise: how is this performed, and are the results worth the risk? Testers inject failures into a software system to see how it performs under pressure and stress. The philosophy is that breaking things on purpose, in carefully controlled environments, makes them more resilient. Through these experiments, teams found that a system’s weaknesses diminish once they are exposed and then fixed or handled correctly. An example is introducing latency or simulating a datacentre failure with the goal of unearthing a vulnerability.
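To make the latency example concrete, here is a minimal sketch of injecting artificial delay in front of a call. It is an illustration only, not how any particular chaos tool works: `with_injected_latency` and `fetch_profile` are hypothetical names, and the delay and probability values are arbitrary assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(
    call: Callable[[], T],
    delay_seconds: float,
    probability: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, first pausing for `delay_seconds` with the given probability."""
    if rng.random() < probability:
        sleep(delay_seconds)  # the injected fault: artificial latency
    return call()


def fetch_profile() -> dict:
    # Hypothetical stand-in for a real downstream request.
    return {"user": "demo", "status": "ok"}


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    start = time.monotonic()
    result = with_injected_latency(fetch_profile, 0.2, 0.5, rng)
    elapsed = time.monotonic() - start
    print(result, f"took {elapsed:.3f}s")
```

Passing the random generator and the sleep function in as parameters keeps the experiment repeatable and lets you rehearse it without actually waiting.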
What is Chaos Engineering?
Chaos theory studies how small changes in a system’s conditions can lead to large, hard-to-predict effects. Chaos engineering works on a similar principle by looking at how large-scale computer systems respond to particular events. One misfire in your data centre or even at the server level could have severe consequences and result in expensive downtime. Chaos engineering enables users to learn how a system responds to a stimulus. They can then compensate for shortcomings by making minor adjustments to their systems, mitigating future occurrences of the same stimulus. It is important to note that careful data collection is essential here, both to control the experiment’s parameters and to limit future impact.
Despite the moniker ‘chaos engineering’, it is, in fact, the opposite of chaos. Chaos engineering seeks to limit the chaos of outages by focusing on how to improve systems and make them stronger, which is why it is closely associated with reliability engineering. As users of chaos engineering become aware of the benefits, they are injecting intentional, measured, known failures into the system to limit the impact and find weak areas. The metrics gained provide an accurate, detailed picture of the system’s response to the stimuli.
Chaos engineering is spreading across many organisations and fields because it helps companies limit downtime and validate hypotheses about system behaviour. It assists organisations in two significant ways. Firstly, it helps teams understand the mechanics of multiple failure modes and, in turn, gain confidence when dealing with failures. Secondly, it assists in the testing of redundancy and compartmentalisation, which contributes to comprehensive vulnerability testing.
Applying Chaos Engineering
Hopefully, I have made a case for the validity of chaos engineering. Let’s take a look at how we can leverage it. Chaos engineering gives businesses a way to lower risk, reduce workload and foster customer confidence.
Stressing systems allows you to find weaknesses that you might miss.
These are the critical steps in implementing chaos engineering:
- Establish your system’s baseline: a measurable steady state that represents normal circumstances. The baseline of your system is the standard, expected behaviour (with no apparent flaws or vulnerabilities).
- Develop a hypothesis that this steady state will continue in both the control group and the challenge group. Use the information and datasets you already have to make an educated guess about the outcome. The hypothesis should be expressed in terms of measurable outputs so that the data can support or contradict it.
- Introduce stressors into your challenge group. Examples could be server crashes or hardware malfunctions. Doing this simulates real-world events and lets them enter the system, highlighting actual issues and problems that may occur when the system is in use.
- The last step is to try to disprove your hypothesis by looking at the behavioural differences between the control group and the challenge group once you have introduced the chaos. In this stage, you compare your results to the original hypothesis to see if the system performed as you expected. You then repeat the whole process, ideally through automation.
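The four steps above can be sketched as a small experiment loop. This is a simplified illustration under stated assumptions: traffic is simulated with random numbers rather than measured on a real system, the steady-state metric is an error rate, and every name here (`simulate_requests`, `run_experiment`, `tolerance`) is hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control_error_rate: float
    challenge_error_rate: float
    hypothesis_holds: bool


def simulate_requests(n: int, failure_probability: float, rng: random.Random) -> list[bool]:
    """Hypothetical stand-in for real traffic: True means the request succeeded."""
    return [rng.random() >= failure_probability for _ in range(n)]


def error_rate(outcomes: list[bool]) -> float:
    """Fraction of requests that failed."""
    return outcomes.count(False) / len(outcomes)


def run_experiment(
    n_requests: int,
    baseline_failure: float,
    injected_failure: float,
    tolerance: float,
    seed: int = 0,
) -> ExperimentResult:
    rng = random.Random(seed)
    # Step 1: measure the steady state on the control group.
    control = error_rate(simulate_requests(n_requests, baseline_failure, rng))
    # Step 3: introduce the stressor in the challenge group only.
    challenge = error_rate(
        simulate_requests(n_requests, baseline_failure + injected_failure, rng)
    )
    # Steps 2 and 4: the hypothesis is that the challenge group stays within
    # `tolerance` of the control group; the comparison tries to disprove it.
    holds = (challenge - control) <= tolerance
    return ExperimentResult(control, challenge, holds)


if __name__ == "__main__":
    print(run_experiment(n_requests=1000, baseline_failure=0.01,
                         injected_failure=0.05, tolerance=0.02))
```

In a real experiment the simulated requests would be replaced by metrics from your monitoring system, and a failed hypothesis would become an item on the list of things to fix.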
The Limitations of Chaos Engineering
Chaos engineering delivers many benefits. Apart from improving business risk mitigation, maintaining customer satisfaction, and reducing the workload for IT departments, it also allows for a reduced risk of revenue loss with minor maintenance costs. However, can injecting failure in systems and breaking them lead to reliability? Chaos engineering does not come without its limitations. Breaking stuff on purpose to uncover weaknesses and malfunctions may seem like an intelligent thing to do; however, it ends up wasting company time if you do not amend the flaws or system errors. You need to write down your observations and the areas that can be improved. After that, you should schedule a time to fix them.
It is crucial to realise that fault injection alone will not make your infrastructure stronger. Strength comes from people fixing the observed problems; fault injection is only one of many ways to gain confidence in system correctness. It is also essential to address known issues before introducing more chaos into brittle infrastructure. If faults are injected so subtly that the system masks them, they may go unnoticed, and the resulting calm can lead people in the company not to take the fault seriously.
Antifragility, the idea that systems thrive because of stressors, shocks, noise, mistakes, faults, or failures, is integral to chaos engineering. However, not all systems can withstand turbulent conditions for extended periods.
Culture Plays a Significant Role
Work culture is crucial to maintaining good relationships between employees and across the business. When work culture is weak, teams become isolated from each other. It can start small: an outage, for example, that leads someone to place blame on others. If this behaviour continues throughout the organisation, it has the potential to draw individuals further apart.
Chaos engineering can turn this attitude around.
However, this alone is not enough to change company culture; it is about how people use the tools given to them to achieve their goals. For example, breaking part of a service to prove it is unreliable does not foster a culture of trust. It is best to communicate the fault and then resolve the issue as a team.
If you have a dysfunctional team culture, it affects the whole organisation, and the company cannot thrive and grow. Your systems include the way that your employees interact with one another and solve problems. Behaviour and planning from your employees must translate into action if you want your company to thrive. Without care and quality checks, your whole system may fall apart.
One way to foster good cultural relationships within your company is to start at the beginning. Starting in development or staging allows employees to design chaos experiments together and use the tools to find the faults.
5 Tips for Chaos Engineering in Your Business
The word ‘chaos’ has connotations of disorder and confusion; this does not sit well with management teams and in turn, does not help your argument as to why the company should adopt and implement chaos engineering.
The critical element in getting people on board with the idea is to explain the benefits of using chaos engineering and to be open about its limitations.
Here are five tips on how to implement chaos engineering in your business:
1. Drop the term ‘chaos’
Do not be afraid to drop the term chaos. The word evokes negative connotations and obscures the positive attributes of the approach.
Instead of chaos engineering, start with ‘Limited scope, continuous, disaster recovery.’ This approach emphasises safety and is familiar to traditional IT managers and business leaders. ‘Limited scope’ connotes careful consideration and constraint. The word ‘continuous’ pays attention to the fact that it will be an ongoing process that could run in the background. ‘Disaster recovery’ is a common term used when dealing with critical circumstances within the system.
The term, as explained above, usually yields a more positive response than chaos engineering.
2. Focus on confidence and not on breaking things
A common misunderstanding born from the term chaos engineering is that it is only about breaking things. This misconception often causes people to miss the advantages of using chaos engineering. You need to focus on the results, namely greater confidence in the system, and not on the method of injecting failures into it.
3. Tell them there is a blast radius
A common fear people have about chaos engineering is that a tool like Chaos Monkey will destroy an array of things throughout the system. To ease this anxiety, explain that every experiment has a defined blast radius: you will target only specific sections of the system, limiting any harmful consequences that may come from chaos engineering.
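One way to show what a blast radius means in practice is a target selector that only ever touches a capped share of one service in one environment. The sketch below is a hypothetical illustration: the `Instance` fields, the `select_targets` function and the example fleet are all assumptions, not part of any real tool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    service: str
    environment: str


def select_targets(
    instances: list[Instance],
    service: str,
    environment: str,
    max_fraction: float,
    rng: random.Random,
) -> list[Instance]:
    """Pick at most `max_fraction` of the instances in one service and environment."""
    in_scope = [
        i for i in instances
        if i.service == service and i.environment == environment
    ]
    # Round down so the cap is never exceeded; a small scope may yield no targets.
    limit = int(len(in_scope) * max_fraction)
    return rng.sample(in_scope, limit)


if __name__ == "__main__":
    fleet = (
        [Instance(f"checkout-stg-{n}", "checkout", "staging") for n in range(10)]
        + [Instance(f"checkout-prd-{n}", "checkout", "production") for n in range(5)]
        + [Instance(f"search-stg-{n}", "search", "staging") for n in range(5)]
    )
    for target in select_targets(fleet, "checkout", "staging", 0.2, random.Random(1)):
        print("would inject fault into", target.name)
```

Everything outside the chosen service and environment is out of scope by construction, which is the reassurance management usually wants to hear.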
There is much more to chaos engineering than working with the infrastructure or technical aspects. Chaos engineering focuses on all aspects of the socio-technical system of software development. Included in this are people, processes, practices, automated experiments, and platforms. It improves infrastructure, platforms, applications, people, and practices.
4. The investment does not need to be significant
If you are worried about implementing chaos engineering because of costs, you need not fret. It is likely that you already perform chaos engineering in your company. You may be engaging in disaster recovery and already have teams looking at SLAs or system availability in production, meaning you may already be benefitting from chaos engineering.
5. Know the benefits and limitations
It is crucial not to over-promise the benefits of chaos engineering. It may be tempting (especially when adopting a new technique) to get over-excited about the potential benefits; however, remember that it has its downsides and limitations.
Chaos engineering brings with it excellent opportunities for software development and systems can be improved dramatically; however, try to stay neutral about everything, so you do not oversell it.
If you would like to know more, visit https://www.limepoint.com/platform-engineering to learn how chaos engineering can improve your business.