Chaos Testing Tutorial: A Comprehensive Guide


Chaos testing is an approach to testing a system's resilience by deliberately simulating failures in a given environment and identifying weaknesses before they cause unplanned downtime or a negative user experience. DevOps and IT teams use chaos engineering to set up monitoring tools and actively run chaos tests in a production environment. This lets teams see, under realistic conditions, how their software applications or services respond to different stress levels.

Most modern systems rely on cloud technologies that depend heavily on the Internet. The same goes for microservices. We have varying levels of control over the infrastructure and the Internet, but there are ways to avoid most failures as far as distributed systems are concerned.

However, we can't prevent software applications from getting more integrated and complex as technology advances. The more advanced or complex a product becomes, the more ways it can fail. If a system goes down, it can affect the entire customer base. One of the best ways to avoid this nightmare is to adopt chaos testing and use it to ensure high quality for your software, even in production.


What is Chaos Testing?

Before defining chaos testing, it's important to know what chaos engineering exactly is. Chaos engineering allows testers to determine an application's quality by expanding their skills beyond traditional testing methods.

It involves using unexpected and random failure conditions to identify system bottlenecks, vulnerabilities, and weaknesses. Chaotic testing is a modern-day DevOps practice that uses unexpected and random conditions, actions, and failures to determine the resilience of a software product or a system.

In this process, testers deliberately inject failures and faults into a system's infrastructure to test how the system responds. When done in a controlled manner, this method is effective for preparing for, practicing, minimizing, and preventing outages and downtime before they occur. In other words, it's a purposefully induced failure in a production system, carried out to see how the application copes.

Why Chaos Testing?

Chaotic testing is more stimulating and exciting for testers due to its practical nature. It helps testers expand their respective skill sets and add more value to building a higher-quality application.

The QA team can start by establishing the system's baseline or optimal state. After that, testers consider potential weaknesses and create test scenarios based on those weaknesses and their impact. The next step is test execution, with resources on standby to fix the production server in case a problem appears. For instance, if an issue occurs within the blast radius during a test, the team should divert the necessary resources to restoring the production server.

The key to a successful chaotic testing stint is seamless cooperation and coordination between the DevOps and QA testing teams. The DevOps team has the necessary restoration skills to bring the production server to normalcy. Testers can easily break back-end and hardware connections to determine the impact of the blow on the product.

Therefore, while testing in production, it's crucial to leverage both skill sets to facilitate optimal chaos testing, including development, implementation, and support. Another way to make the most of chaos testing is to run tests outside peak hours. This minimizes the performance impact on customers and protects the brand's reputation.

History of Chaos Testing

Chaotic testing has been around for more than a decade. The credit goes to the same tech giant that has given us ample entertainment for as long as we can remember. Yes, we are talking about Netflix! Its DevOps team moved to Amazon Web Services (AWS) for better infrastructure scalability. Even though the decision was intimidating, Netflix ended up converting from a monolithic to a microservice architecture by migrating to AWS.

Even though the platform had millions of users, they managed to minimize the impact of this transition on their customer base. One lesson Netflix learned is that there shouldn't ever be one point of failure that could lead to an unpredictable downtime. A few years before Netflix switched to AWS, they experienced a major failure which made them understand the importance of chaotic testing.


Difference between Chaos and Regular Testing

Chaotic testing differs from standard testing in numerous ways.

  • Chaos tests take into account the various touchpoints that are beyond the scope of testing, whereas normal testing only considers the ones that are within the scope of the testing.
  • Regular testing typically occurs during the project's build/compile phase, whereas chaotic testing occurs once the system is complete.
  • Unlike chaotic testing, regular testing does not usually include the testing of varying configurations, behaviors, outages, and other interruptions caused by a third-party entity.
  • In standard testing, a failure often leaves the system under test disabled until it is fixed and testing can resume. Chaos testing, on the other hand, deliberately introduces issues into the system to see how it reacts.
  • Regular testing uncovers bugs, and a blocker can cause a system hang. Chaotic testing has a predetermined abort plan that stops the experiment if the system does not react as expected.

Benefits of Chaos Testing

Chaos engineering emerged as a concept around 2010 and has since become one of the most widely adopted practices in DevOps. Network outages have cost even leading organizations dearly over the past decade. We all heard about the Amazon outage in 2018 that reportedly caused losses of up to $99 million. Even Facebook got into a similar situation in 2019.

It's no surprise that such mishaps have happened to almost all enterprises since the Internet began. These outages have significantly affected maintenance costs and overall company revenue. Lost money is only part of the damage, because an outage also tends to have a domino effect.

For example, no matter how big a company is, an outage can cost it employee confidence, customer loyalty, stock value, brand integrity, and more. In the worst-case scenario, stakeholders might take legal action that pushes the company further into debt.

When chaos testing came into the picture, a long-term solution emerged for maintaining system resilience. Organizations can recognize their system weaknesses and create a failover plan that provides a safety net in case of a catastrophic failure. On that note, let's look at some striking benefits of chaotic testing.

  • Business Boosting Capabilities
  • Businesses can build reliable and resilient systems with the help of chaos engineering and testing, which eventually increases user satisfaction. Such systems are less failure-prone than their counterparts built without chaos engineering, which boosts business demand.

  • Better Collaboration
  • Chaos experiments depend on cooperation and coordination between the DevOps and QA testing teams: testers break things in a controlled way, and DevOps restores the system. Planning and running experiments together strengthens collaboration between the teams.

  • Innovative Nature
  • Since chaotic testing helps identify structural and design-related flaws in software products, it promotes innovation. Understanding these flaws provides insight that contributes to improving both existing and new components.

  • Better Stakeholder and User Satisfaction
  • Once a team gains confidence in chaotic testing, running experiments closer to production becomes much easier. Since the production environment is the closest thing to the real system that users see, conducting experiments there offers accurate insight into the end-user experience. Of course, there are always bonus points for keeping stakeholders happy and satisfied.

  • Better Application Performance Monitoring
  • Since chaos testing is a highly holistic approach to testing methods and performance engineering, it improves application performance monitoring. When teams perform chaotic testing regularly, it instills confidence in various distributed systems and ensures the optimum performance of applications even during major unpredictable failures.

Challenges of Chaos Testing

Chaos testing has a lot of benefits, but one must carry it out carefully. The following are the top challenges.

  • Unnecessary damage
  • Chaotic testing poses the major concern of unnecessary damage. There is a risk that chaos engineering results in a real-world loss that exceeds the allowances of justifiable testing. To pinpoint the cause of failure, you need to control the blast radius. Organizations should avoid tests that exceed the blast radius to limit the cost of uncovering application vulnerabilities.

  • Insufficient observability
  • It can be challenging to differentiate between essential and non-critical dependencies without comprehensive observability. Organizations can find it challenging to pinpoint the root cause of an issue due to a lack of visibility, which can hamper remediation plans.

  • Unclear hypothesis
  • It is important first to understand how the system functions normally and how it will react to a chaos test. The results can be ambiguous, and the insights from the chaos test may be limited in the absence of a well-defined hypothesis and model. Thus, it is crucial to stress the need to perform a chaos test carefully.

What is a Chaos Monkey?

The team of engineers at Netflix decided to develop software test tools for validating the recoverability and resilience of the platform. The underlying idea, that you can intentionally break a system to ensure its resilience, is a long-standing engineering principle that various software development organizations have embraced for quite some time.

Chaos Monkey was designed to unpredictably disable production instances on AWS infrastructure. This exposed weaknesses that the Netflix engineers could then work on. It even prompted them to build better automatic recovery mechanisms.

As for the catchy name, the inspiration was the idea of unleashing a wild monkey with a weapon in the data center while continuing to serve customers without glitches. After learning the weak spots, engineers would combat problems by setting up automatic triggers. The idea that started with Chaos Monkey has now evolved into a wide array of chaos principles.
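
The core behavior described above, unpredictably disabling a running instance, can be illustrated with a small sketch. This is not Netflix's implementation: the `fleet` dictionary, the instance IDs, and the `pick_victim` and `terminate` helpers are hypothetical stand-ins for a real cloud API.

```python
import random


def pick_victim(instances, rng, probability=1.0):
    """Return one instance to terminate, or None if no termination is due."""
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(instances)


def terminate(fleet, instance_id):
    """Stand-in for a cloud API call: mark the instance as stopped."""
    fleet[instance_id] = "stopped"


# Hypothetical fleet; a real tool would query the cloud provider instead.
fleet = {"i-a": "running", "i-b": "running", "i-c": "running"}
rng = random.Random(7)  # seeded only so the sketch is repeatable

running = sorted(name for name, state in fleet.items() if state == "running")
victim = pick_victim(running, rng)
if victim is not None:
    terminate(fleet, victim)
print(fleet)
```

The `probability` parameter is one simple way to keep terminations unpredictable across scheduled runs: with a value below 1.0, some runs terminate nothing.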

What do we mean by an experiment in the context of chaotic testing? An experiment refers to a pre-planned and controlled injection of faults. Some of the most common and obvious fault injections include:

  • Disrupting entire availability zones or inducing a regional outage.
  • Randomly shutting down compute instances in a data center or an availability zone.
  • Injecting latency between services over a network, or for a specific portion of traffic, for a specified time frame.
  • Injecting a fault in the form of code inserted before particular instructions.
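
Two of the fault types listed above, latency injection and a fault inserted before particular instructions, can be sketched at the application level with a decorator. This is a minimal illustration, not a production tool: `inject_chaos`, `InjectedFault`, and `get_price` are hypothetical names introduced here.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def inject_chaos(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call may be delayed or may fail."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # latency injection
            if rng.random() < error_rate:
                # Fault injected before the wrapped code runs.
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_chaos(latency_s=0.01, error_rate=0.0)
def get_price(item):
    return {"apple": 3}.get(item, 0)


print(get_price("apple"))
```

Passing `sleep` and `rng` as parameters keeps the wrapper testable: a test can record the requested delays instead of actually waiting.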

Chaos Testing Principles

Let's check out the principles that form the base of chaotic testing.

  • Specify the System's Normal Behavior
  • Testers define the steady state as a measurable output, such as system latency, error rate, or throughput. It should indicate the system's normal, acceptable behavior rather than an unexpected one. In other words, you should treat the normal system state as its steady state.

  • Specify a Hypothesis
  • This principle defines the steady state hypothesis that equates to the expected experimental output. This hypothesis should do justice to the core chaos engineering objective; the injected events won't change the system to something other than the steady state.

  • Design and Run Experiments
  • This principle covers designing failure scenarios for the system or its infrastructure and running them in a controlled way. Testers should have either a clear recovery path or a back-out strategy.

  • Test Result Analysis
  • This principle involves verifying whether the hypothesis was correct. It also determines whether the system's steady state has changed, as shown by discrepancies in user experience or service continuity.
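
The four principles above can be sketched as a single experiment loop: measure the steady state, state a hypothesis as a threshold, inject a failure with a guaranteed rollback, and compare the result. Everything here is illustrative; the `Experiment` class, the toy `system`, and the 0.05 error-rate threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    measure: Callable[[], float]   # reads the steady-state metric (here: error rate)
    threshold: float               # hypothesis: the metric stays at or below this
    inject: Callable[[], None]     # introduces the failure
    rollback: Callable[[], None]   # back-out strategy, always executed


def run_experiment(exp: Experiment) -> dict:
    baseline = exp.measure()
    if baseline > exp.threshold:
        # The system is not in its steady state, so do not add more failure.
        return {"name": exp.name, "status": "skipped", "baseline": baseline}
    exp.inject()
    try:
        observed = exp.measure()
    finally:
        exp.rollback()
    held = observed <= exp.threshold
    return {
        "name": exp.name,
        "status": "hypothesis held" if held else "hypothesis rejected",
        "baseline": baseline,
        "observed": observed,
    }


# Toy system: the error rate stays low while at least two replicas are up.
system = {"replicas": 3}


def error_rate() -> float:
    return 0.01 if system["replicas"] >= 2 else 0.5


def kill_one_replica() -> None:
    system["replicas"] -= 1


def restore_replicas() -> None:
    system["replicas"] = 3


result = run_experiment(
    Experiment("kill one replica", error_rate, 0.05, kill_one_replica, restore_replicas)
)
print(result["status"])
```

Running the rollback in a `finally` block reflects the back-out requirement: the system is restored even if measuring the metric raises an error.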

Types of Experiments in Chaos Engineering

There are three major kinds of experiments with chaos engineering features. Let's take a look.

  • Automating Faults
  • Most teams rely on reliability engineering to fix faults found during a system's reliability checks. Automating this work helps QA teams assess which automated solutions work and which functions require backup components.

  • Injecting Failures
  • Injecting something that causes an unusual behavior in software is a must when it comes to chaos engineering. Such an experiment enables engineers to pinpoint vulnerable or weak software components and keep the software up and running in case of component malfunctions.

  • Dependency Testing
  • Assuming a happy-path scenario can backfire when hidden dependencies remain undiscovered after standard testing. That's why checking for latent dependencies between microservices, databases, and downstream services is essential. Such checks give engineers a clear idea of the obstacles behind production and post-production failures.
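
For dependency testing, it helps to know which services a single failure can reach before running the experiment. The sketch below walks a dependency map to find everything that directly or transitively depends on a failed component. The service names and the `depends_on` map are invented for illustration.

```python
def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in impacted:
                impacted.add(service)
                frontier.append(service)
    return impacted


# Hypothetical dependency map: each service lists what it calls.
depends_on = {
    "web": {"cart", "catalog"},
    "cart": {"db"},
    "catalog": {"search"},
    "search": set(),
    "db": set(),
}

print(sorted(impacted_by("db", depends_on)))
```

Comparing this predicted set with the services that actually degrade during an experiment is one way to surface latent dependencies: anything that fails but is not in the set points to a dependency the map does not record.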

Chaos Testing Pyramid

In the past few decades, the technology industry has seen dramatic changes in how systems are designed, built, and operated. The result is more complex systems, and large-scale distributed systems in particular are more prone to failure.

Chaos engineering aims to inform and educate the organization about unknown vulnerabilities and previously unanticipated outcomes of computer systems. These complex testing procedures must identify hidden problems before they cause an outage outside the organization's control.

A disaster recovery team can improve the system's fault tolerance and resiliency by addressing systematic weaknesses. As a result, chaotic testing takes place on a variety of levels. Here is how a typical chaos test pyramid looks.

  • Unit testing: Unit testing evaluates the specific behavior of a single component or unit of the software application. It is recommended to test each component as a standalone entity by removing all its dependencies. The chaos engineering team controls the behavior of those dependencies using mocks.
  • Integration testing: This testing focuses on the interactions and interrelationships between individual units or components. QA typically runs integration tests automatically after unit testing. In complex applications and systems, integration tests are important for determining the stable state or operational metrics.
  • System testing: It involves examining how a system reacts under the heavy stress of a worst-case failure scenario. In production environments, these tests are always conducted in real-world situations.
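
At the unit level of the pyramid, a mock can stand in for a dependency and be told to fail. The sketch below uses Python's standard `unittest.mock` to inject a timeout into a hypothetical `PaymentGateway`; the `checkout` function and its return values are assumptions made for this example.

```python
import unittest
from unittest import mock


class PaymentGateway:
    def charge(self, amount):
        raise NotImplementedError("real network call omitted in this sketch")


def checkout(gateway, amount):
    """Charge the customer, degrading gracefully if the gateway times out."""
    try:
        gateway.charge(amount)
        return "paid"
    except TimeoutError:
        return "retry-later"


class CheckoutChaosTest(unittest.TestCase):
    def test_gateway_timeout_is_handled(self):
        gateway = mock.Mock(spec=PaymentGateway)
        gateway.charge.side_effect = TimeoutError("injected timeout")
        self.assertEqual(checkout(gateway, 10), "retry-later")
        gateway.charge.assert_called_once_with(10)

    def test_happy_path(self):
        gateway = mock.Mock(spec=PaymentGateway)
        self.assertEqual(checkout(gateway, 10), "paid")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CheckoutChaosTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the failure is injected through the mock's `side_effect`, the unit under test runs unchanged and no real dependency is touched.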

How to perform Chaos Testing?

Before starting chaotic testing, it’s essential to determine whether chaos testing and engineering fit your organization and business. Chaos engineering effectively improves the integrity of large and complex systems, offering numerous benefits like quicker incident response times, less unplanned downtime, and more. However, chaotic testing may not be a good fit for smaller systems and desktop applications.

Chaos testing is considerably easier to conduct if you use a cloud-based system. The following steps will help you get started with chaotic testing.

  • Talk with stakeholders: Talking to stakeholders who might be impacted by a service disruption is crucial since you are dealing with production data. This can include internal users like analytics professionals who rely on current data or customer service experts who would have to deal with any service interruption.
  • Configure chaos test tools: You can either set up an internal chaos test tool or use the Simian Army suite, accessible under the Apache 2.0 license.
  • Select a chaos level: Test execution tools can help you choose different chaos levels, like low, medium, high, and extreme.
  • Test reports: Whenever a test report shows a failure, you need to create a feasible fix. This could be a simple fix, such as adding redundancy to the network. Alternatively, you may need to consider making a fundamental modification to your architecture.
  • Retest after deployment: If you're using an automated test schedule, you should have your fix deployed before the next test cycle.
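
To make the "chaos level" step above concrete, the sketch below maps each level to the fraction of instances that one run may disrupt. The fractions are hypothetical; real tools define their own levels and what they mean.

```python
import random

# Hypothetical mapping from chaos level to the fraction of instances disrupted.
CHAOS_LEVELS = {"low": 0.05, "medium": 0.15, "high": 0.30, "extreme": 0.60}


def plan_targets(instances, level, rng):
    """Pick the instances to disrupt for the given chaos level."""
    if level not in CHAOS_LEVELS:
        raise ValueError(f"unknown chaos level: {level!r}")
    if not instances:
        return []
    # Always disrupt at least one instance so the run is never a no-op.
    count = max(1, round(len(instances) * CHAOS_LEVELS[level]))
    return rng.sample(instances, count)


instances = [f"node-{i}" for i in range(20)]
rng = random.Random(0)  # seeded only so the sketch is repeatable
for level in CHAOS_LEVELS:
    print(level, len(plan_targets(instances, level, rng)))
```

Starting at the lowest level and raising it only after fixes have been retested keeps the number of affected instances small while confidence is still being built.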

Chaos Testing as a DevOps Practice

It's the responsibility of a DevOps tester to use tools such as Chaos Monkey to make the most of chaos engineering. A QA or development team member defines the scenario, executes tests, determines the results, and records them. They also minimize how much the experiments affect customers and the production system as a whole.

Moreover, if things get out of control, the tester conducting the chaos engineering experiments needs to know when it's time to shut it all down. Usually, an experienced QA professional is responsible for conducting chaos engineering. This individual also defines the testing scenarios, handles test execution, and tracks results and outcomes, ensuring that customers don't suffer a major impact.

As we already know, testing in a DevOps workflow tends to be highly automated out of necessity. Testers don't tamper with the build during software delivery, whether in unit testing or smoke testing. Most organizations nowadays are happy to deploy customer-facing solutions with the help of pragmatic DevOps practices. What counts, however, is how an application behaves in the production environment.

That's exactly when chaotic testing plays the most crucial role. After all, resilience takes on the highest priority during the deployment to the cloud, even though it's merely a portion of an entire testing regime.

Best Practices of Chaos Testing

Large organizations like Netflix have successfully shaped chaotic testing by spearheading its conception. Even though it happened out of sheer necessity, Netflix evolved and improved its internal chaos testing to manage conformity and latency. Not to mention, they also had to monitor a large number of metrics across complicated software. Other large organizations have since followed the same path.

Regardless of your level on the revenue spectrum, some best practices are sure to amp up your chaotic testing game. Let's take a look.

  • Training teams according to their behavior and reactions
  • Businesses must observe how their teams react and respond while fixing problems and running experiments. Communication is the key here. If teams or team members cannot easily facilitate communication, it will take longer to eliminate issues. That's where collaboration and communication training comes into the picture.

  • Follow a realistic approach
  • Starting with a concrete understanding of the usual system behavior will go a long way in diagnosing issues. Simulating realistic scenarios, such as injecting the bugs and failures that are most likely to occur, will also make chaotic testing a more robust process. For instance, if your business has faced latency as a recurring issue, injecting latency-inducing faults can be the way to go.

  • Small blast radius
  • Don't start with something big right away. Start small and limit the unknowns to smaller experiments so you can learn from them. Over time, once you gain some confidence, you can scale out these experiments as far as is appropriate. A good idea is to begin with a single microservice, container, or compute instance to minimize possible side effects.

  • Define and disrupt normal
  • Defining your control goes hand in hand with defining what's typical for properly implementing chaos engineering principles. Start by integrating proper monitoring services followed by defining thresholds. This will help you determine normal behavior and when things go beyond the norm.

    All professionals involved should be capable of maintaining a clear distinction between normalcy and abnormalities by closely choosing and monitoring metrics. Once you define the normal, go ahead and disrupt it in a controlled manner by putting chaos engineering principles into action. Make sure the disruption is realistic, including parameters such as dying servers and traffic spikes reflecting real-world scenarios.

  • Continuous confidence building
  • Considering the unknown nature of chaotic testing, team leaders must keep instilling confidence in the teams involved. Some ways to accomplish greater confidence in testing include using the right tools and having the right kind of experts around for damage control if things go wayward. Better clarity by taking notes and carefully tracking relevant metrics also increases the accuracy of the procedure.
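
The "define and disrupt normal" practice above relies on thresholds that say when behavior has left the norm. A minimal sketch of such a guard follows: it checks each metric sample against its threshold and stops at the first breach. The metric names, limits, and sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


def breaches(sample, thresholds):
    """Return the names of metrics in `sample` that exceed their threshold."""
    return [t.metric for t in thresholds if sample[t.metric] > t.max_value]


def run_with_guard(samples, thresholds):
    """Walk through metric samples and abort at the first breach."""
    for step, sample in enumerate(samples):
        broken = breaches(sample, thresholds)
        if broken:
            return {"aborted": True, "step": step, "breached": broken}
    return {"aborted": False, "step": len(samples), "breached": []}


# Hypothetical thresholds that define "normal" for this system.
thresholds = [Threshold("p99_latency_ms", 500), Threshold("error_rate", 0.02)]

# Hypothetical samples collected while a disruption is in progress.
samples = [
    {"p99_latency_ms": 220, "error_rate": 0.001},
    {"p99_latency_ms": 340, "error_rate": 0.004},
    {"p99_latency_ms": 910, "error_rate": 0.03},
]

print(run_with_guard(samples, thresholds))
```

In a real experiment the samples would come from the monitoring service, and an abort would trigger the rollback rather than just returning a result.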

Summing up

There isn’t a single system that is 100% safe from outage or failure. On average, almost all major cloud infrastructures suffer at least one or two outages per quarter. While we can't precisely control every aspect of these outages or failures, we can certainly have some control over the impact on partners, employees, customers, and business reputation. When an organization frequently exercises failures in a test lab, recognizing a system’s recovery path becomes much more manageable.

Of course, there's a limit to how far chaos testing can tackle resilience issues on its own. The rest depends on the testing tool you're using. Running chaos tests reduces the number of crashes or system failures that surface in production, as long as the QA or DevOps teams involved know what they're dealing with. With in-depth knowledge, proper tools, collaboration, and pragmatic cooperation between teams and team members, there's no reason why organizations can't leverage the power of chaotic testing.

Frequently Asked Questions (FAQs)

What is chaos testing?

Chaos testing is a modern DevOps technique that employs unexpected and random conditions, actions, and failures to evaluate the resilience of software products or systems.

Why do we need chaos testing?

We need chaos testing because it deliberately causes failures in a production system on a continuous but random basis. The discipline is intended to evaluate the resilience of systems and their environment and to determine the MTTR (mean time to repair).
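
As a small worked example of the MTTR mentioned above: it is the total repair time divided by the number of incidents. The sketch below computes it from detection and restoration timestamps; the incident times are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """incidents: list of (detected_at, restored_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents recorded during chaos experiments.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 10)),  # 10 minutes
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 20)),  # 20 minutes
]

print(mean_time_to_repair(incidents))
```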

What is chaos monkey testing?

Netflix engineers created Chaos Monkey as a test tool to evaluate the resiliency and recoverability of their services running on Amazon Web Services (AWS). By shutting down one or more virtual machines, the software simulates failures of service instances running within Auto Scaling Groups (ASGs).
