Chaos engineering is the discipline of testing distributed software and systems by deliberately introducing failures, studying how the system behaves, and making changes based on the outcome so that those failures do not affect end users. It is closely related to Site Reliability Engineering (SRE), which attempts to quantify the impact of the improbable.
In chaos engineering, practitioners intentionally inject failure into a system to assess its resilience. The practice involves forming hypotheses, running experiments, and comparing the outcome with a steady state.
An example of chaos engineering in a distributed system is randomly taking down services to observe the responses and the impact on users. An application depends on four layers to run: networking, storage, compute, and the application itself. Valid chaos experiments inject turbulent or faulty conditions into any random section of this stack, permitting controlled testing of failures at considerable scale.
Some Internet organizations were the pioneers of distributed, large-scale systems. The complexity of these systems necessitated a novel approach to testing failures. This led to the creation of chaos engineering.
In 2010, Netflix changed its focus from physical infrastructure to cloud infrastructure, offered by Amazon Web Services (AWS). The requirement was to ensure that the Netflix streaming experience would not be affected if AWS lost an instance. In response, the Netflix team developed a tool called Chaos Monkey.
In 2011, the Simian Army came into existence. It added further failure-injection modes to Chaos Monkey, enabling the testing of a full suite of failures and building resilience against them. The goal was a cloud architecture in which disparate individual components could fail without impacting the availability of the entire system.
In 2012, Netflix shared the source code of Chaos Monkey on GitHub. Netflix claimed it had invented the optimum defense against unexpected large-scale failures: running the tool frequently caused failures that forced engineers to build services with remarkable resilience.
In 2014, Netflix created a new role, Chaos Engineer. Kolton Andrus, the Gremlin co-founder, and his team announced a novel tool, Failure Injection Testing (FIT), which offered developers finer-grained control over the failure injection's 'blast radius.' By giving developers control over the scope of a failure, FIT helped them extract the insights of chaos engineering while mitigating its potential downside.
In 2016, Matthew Fornaciari and Kolton Andrus established Gremlin, the first managed chaos engineering solution. In late 2017, Gremlin became publicly available. In 2018, Gremlin launched Chaos Conf, the first large-scale conference dedicated to chaos engineering. In only two years, attendance grew roughly tenfold, drawing veterans from industries such as delivery, finance, retail, and software.
In 2020, AWS added chaos engineering to the reliability pillar of the AWS Well-Architected Framework (WAF). Toward the end of that year, AWS announced the Fault Injection Simulator (FIS), a fully managed service that runs chaos experiments natively on AWS services.
In 2021, Gremlin published the first 'State of Chaos Engineering' report. It covered the main advantages of chaos engineering, the expansion of the practice among organizations, and the frequency at which top-performing teams conducted chaos experiments.
Chaos engineering begins with analyzing the expected behavior of a software system. Here are the steps involved in implementing chaos experiments.
Chaos experiments render valuable insights. These are leveraged to reduce the frequency of high-severity incidents (SEVs), shorten the time to detect SEVs, enhance the system design, better understand system failure modes, reduce the on-call burden, and minimize incidents. All of these are technical advantages.
The organization can enhance its SEV management program, improve on-call training for engineering teams, keep engineers more engaged and happy, and prevent enormous losses in maintenance and revenue. These are the business advantages.
Fewer outages hamper users' daily activities, which means the organization's service is more durable and more available. These are the customer benefits.
Some other advantages are the following.
A chaos engineering team is typically part of a small DevOps team, often working with both pre-production and production software. Because chaos experiments have broad implications across various systems, they can affect groups and stakeholders at all levels of the organization.
Various stakeholders can participate in and contribute to a disruption involving hardware, networks, and cloud infrastructure, including network and infrastructure architects, risk specialists, cybersecurity teams, and even procurement officers.
The principles of chaos engineering are divided into four practices. The assumption is that the system is stable, and the task is to find the variance. The harder the steady state is to disrupt, the more robust the system.
You must know the characteristics of the normal, steady state. This is pivotal for detecting a regression or deviation. Based on what you are testing, you can select an apt metric as a good measure of normalcy, such as the response time or the completion of a user journey within a stipulated time. In an experiment, the steady state serves as the control group.
If a hypothesis is already known not to hold, there is little scope for testing. Chaos engineering is designed to run against steady, robust systems in order to detect faults such as infrastructure or application failures. Running chaos experiments against unsteady systems adds little value, because such systems are already known to be unstable and unreliable.
The experiment involves introducing variables into the system to observe how the system responds to them. Such experiments represent real-world scenarios that affect one or more of the application pillars: infrastructure, storage, networking, and computing. For example, a failure could be either a network interruption or a hardware failure.
Suppose the hypothesis is that the steady state holds. Differences between the experiment and control groups are disruptions, or variances, from the steady state; they contradict the hypothesis of stability. You can then focus on the design changes or fixes that will make the system more stable and robust.
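The four-practice loop can be sketched in a few lines of Python. This is a toy simulation, not a real chaos framework: `serve_request`, the 30% failure rate, and the 5% tolerance are all invented for illustration.

```python
import random

def steady_state_metric(responses):
    """Steady-state measure: fraction of requests served successfully."""
    return sum(1 for ok in responses if ok) / len(responses)

def serve_request(fault_active):
    """Simulated request: an injected fault makes roughly 30% of requests fail."""
    if fault_active and random.random() < 0.3:
        return False
    return True

def run_experiment(n_requests=1000, tolerance=0.05):
    # 1. Measure the steady state in the control group (no fault injected).
    control = steady_state_metric([serve_request(False) for _ in range(n_requests)])
    # 2. Hypothesis: the steady state holds even under failure.
    # 3. Inject the fault into the experiment group.
    experiment = steady_state_metric([serve_request(True) for _ in range(n_requests)])
    # 4. Compare: a deviation beyond the tolerance disproves the hypothesis.
    hypothesis_holds = (control - experiment) <= tolerance
    return control, experiment, hypothesis_holds
```

A disproved hypothesis here is not a failed test; it points at exactly the design work the paragraph above describes.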
At Sun Microsystems, the computer scientist L. Peter Deutsch and his colleagues drafted a list of eight fallacies of distributed computing, which are the following:
The preceding are false assumptions about distributed systems made by engineers and programmers. When applying chaos experiments to an issue, the preceding eight fallacies are a good starting point.
Chaos engineers regard these fallacies as core principles for understanding network and system problems. Their underlying theme is that networks and systems can never be 100% dependable or perfect. Because this fact is widely accepted, the concept of 'five nines' exists for highly available systems.
So, chaos engineers strive for less than 100% availability; the closest they can come to perfection is 99.999%. In distributed computing environments, these false assumptions are easy to make, and recognizing them helps you identify the random problems that arise in complicated distributed systems.
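The 'five nines' figure is easy to quantify: 99.999% availability leaves only about five minutes of downtime per year. A quick sketch of the arithmetic:

```python
# Downtime budget implied by an availability target.
def allowed_downtime_minutes_per_year(availability):
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - availability)

five_nines = allowed_downtime_minutes_per_year(0.99999)  # about 5.26 minutes/year
three_nines = allowed_downtime_minutes_per_year(0.999)   # about 525.6 minutes/year
```

The two-order-of-magnitude gap between three and five nines is why the last fractions of a percent are so expensive to earn.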
Netflix is regarded as a reputed pioneer of chaos experiments. The company was the first to use chaos engineering in a production environment. It designed test automation platforms, made them open source, and collectively termed them the 'Simian Army.'
The suite of the Simian Army included many tools, some of which are the following:
Its function is to disable one production system, creating an outage, and then observe how the remaining systems respond. The tool is designed to force failures into a system and check the system's responses.
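The idea can be illustrated with a small Python sketch. The `Instance` class and `unleash_monkey` function are hypothetical stand-ins for this article, not Netflix's actual implementation:

```python
import random

class Instance:
    """A stand-in for a production instance in a fleet."""
    def __init__(self, name):
        self.name = name
        self.running = True

    def terminate(self):
        self.running = False

def unleash_monkey(instances, rng=random):
    """Terminate one randomly chosen running instance; return its name."""
    running = [i for i in instances if i.running]
    if not running:
        return None
    victim = rng.choice(running)
    victim.terminate()
    return victim.name

# One round of the experiment: kill a random instance, then let monitoring
# observe how the survivors absorb the load.
fleet = [Instance(f"web-{n}") for n in range(5)]
killed = unleash_monkey(fleet)
survivors = [i.name for i in fleet if i.running]
```

The interesting data is not which instance died, but what the monitoring on the survivors shows afterward.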
Over time, many more chaos-inducing programs have been created to test the capabilities of the streaming service, so the Simian Army suite keeps expanding.
Following are some other chaos engineering tools.
The rationale for using the word ‘monkey’ is the following:
No one can predict when a monkey might enter a data center or what it would destroy. A data center is a collection of servers hosting all the critical functions behind online activities. Imagine a monkey entering this data center and behaving randomly: destroying devices, ripping cables, and wrecking everything within its reach.
So, IT managers face the challenge of designing the information systems they are accountable for to keep functioning despite the destruction such monkeys cause.
The basic idea of a chaos experiment is to intentionally break a system and gather data that can be leveraged to improve the system's resilience. This type of engineering is closely related to software testing and software quality assurance approaches, and it is especially suitable for sophisticated distributed systems and processes.
In such systems, it is hard to predict error-prone situations and resolve the resulting errors. The size and complexity of a distributed system give rise to random events: the larger and more complex the system, the more unpredictable its behavior.
To test a system and determine its weaknesses, turbulent conditions are purposely created in a distributed system. This chaos experiment results in the identification of the following problems.
In the current era, a rising number of organizations are moving to the cloud or the enterprise edge. The outcome of this movement is that the systems of these organizations are becoming complex and distributed. This outcome is also applicable to software development methodologies with an emphasis on continuous deliveries.
The rise in complexity of an organization’s infrastructure and processes within the infrastructure is augmenting the need for the organization to adopt chaos engineering.
Let us consider a distributed system that manages a finite number of transactions per second. Chaos testing is applied to determine how the software responds when it reaches the transaction limit: does the system crash, or does performance degrade?
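The transaction-limit scenario can be sketched as follows. `TransactionService` and its load-shedding behavior are invented for illustration; a real experiment would drive traffic at an actual service and observe whether it sheds load gracefully or falls over.

```python
class TransactionService:
    """Toy service with a hard transactions-per-second limit."""
    def __init__(self, max_tps=100):
        self.max_tps = max_tps
        self.processed = 0
        self.rejected = 0

    def submit(self, count):
        """Process up to max_tps transactions; shed the rest instead of crashing."""
        accepted = min(count, self.max_tps)
        self.processed += accepted
        self.rejected += count - accepted

def chaos_load_test(service, offered_tps):
    """Offer more load than the limit and record the observed behavior."""
    service.submit(offered_tps)
    return {"processed": service.processed, "rejected": service.rejected}
```

Here the desired observation is graceful degradation (rejections counted, nothing crashing) rather than an unbounded failure.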
Let us now consider a distributed system that experiences a single point of failure or a shortage of resources. Chaos experiments determine how the system responds in these two scenarios. If the system fails, developers make design modifications, and the chaos tests are then repeated to verify the expected outcome.
In 2015, a well-known real-world failure demonstrated the relevance of chaos engineering. Amazon's DynamoDB suffered an availability issue in one region, and more than 20 AWS services that depended on DynamoDB failed in that region.
The websites that leveraged these services, among them Netflix, were down for multiple hours. Of all these websites, Netflix was the least affected. The reason cited was that Netflix had used Chaos Kong, a chaos engineering tool that disables entire AWS availability zones, to prepare for exactly such a scenario.
These were the AWS data centers that were serving a specific geographical region. While using this tool, Netflix got hands-on experience in addressing regional outages. This incident strongly cemented the significance of using chaos experiments.
Testing does not generate new knowledge. The test engineer knows the properties of the system under consideration and writes test cases that make assertions about those known properties. After the test runs, each assertion evaluates to either true or false.
Chaos engineering is experimentation. It results in the generation of new knowledge. In the experiments, a hypothesis is proposed. Your confidence in the hypothesis gets augmented if the hypothesis is not contradicted. If the hypothesis is negated, you learn something new.
You then investigate why the hypothesis was incorrect. Thus, chaos experiments have two possible results: increased confidence, or the comprehension of new properties of your system. In a nutshell, this is about exploring the unknown.
Even an enormous quantity of testing cannot match the insights gained from experiments. Tests are written by humans who state assertions ahead of time; experimentation is a formal way of discovering novel properties. Once new system properties are discovered through experiments, you can translate them into tests.
If you encode newly discovered assumptions about a system into a new hypothesis, the result is a 'regression experiment,' which can be used to explore how the system changes over time. Chaos experimentation was born from complex system issues, which is why experimentation is favored over testing.
Frequently, chaos engineering is confused with anti-fragility and ‘breaking stuff in production.’
Nassim Taleb introduced the concept of antifragility. He coined the term 'antifragile' for systems that grow stronger when exposed to random stress, arguing that the term 'hormesis' did not sufficiently capture the ability of complex systems to adapt.
Some have remarked that chaos experiments are the software version of the process antifragility describes. However, the two terms imply different concepts. In antifragility, you add chaos to a system and hope that it does not succumb but instead responds by growing stronger. Chaos engineering, by contrast, alerts the team to the chaos already inherent in the system so that the team can make it more resilient.
In antifragility, the first step toward enhancing a system's robustness is to identify the weak regions and eliminate them. Resilience engineering proposes the opposite emphasis: studying what goes right in safety provides more information than studying what goes wrong.
Another step in antifragility is the addition of redundancy. This step stems from intuition, but redundancy can cause failures almost as easily as it can prevent them; resilience engineering records several instances where redundancy has contributed to safety failures.
Resilience engineering has decades of supporting research behind it, whereas antifragility is considered a theory outside peer review and academia. Both schools of thought deal with complex systems and chaos, which leads some people to consider them identical. However, chaos experimentation has a fundamental empirical grounding that the spirit of antifragility lacks. You must realize that the two are distinct.
Nowadays, some believe that 'chaos experiments' and 'breaking stuff in production' are synonyms. On closer investigation, a more accurate description of a chaos experiment is 'fixing stuff in production.'
Breaking stuff is relatively easy. The more challenging job is to diminish the blast radius, think critically about safety, decide whether fixing something is worthwhile, and conclude whether your investment in experimentation is essential. This is what differentiates chaos experiments from merely 'breaking stuff.'
It is essential to procure the following metrics before initiating chaos experiments.
The application metrics are breadcrumbs, context, stack traces, and events. The high-severity incident (SEV) metrics are MTBF (mean time between failures), MTTR (mean time to resolution), and MTTD (mean time to detection) for SEVs by service, the total number of SEVs per week by service, and the total number of incidents per week by SEV level.
The alerting and on-call metrics are the top 20 most frequent alerts per week for each service, noisy alerts by service per week (self-resolving), time to resolution for alerts per service, and total alert counts by service per week.
The infrastructure monitoring metrics are network (packet loss, latency, and DNS), state (clock time, processes, and shutdown), and resource (memory, disk, IO, and CPU).
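Given incident records, the SEV metrics above can be derived directly. A minimal sketch, assuming each incident record carries start, detection, and resolution times in minutes (the record fields are illustrative, not from any particular monitoring product):

```python
def sev_metrics(incidents):
    """incidents: list of dicts with 'start', 'detected', 'resolved' in minutes."""
    n = len(incidents)
    # MTTD: mean gap between an incident starting and being detected.
    mttd = sum(i["detected"] - i["start"] for i in incidents) / n
    # MTTR: mean gap between detection and resolution.
    mttr = sum(i["resolved"] - i["detected"] for i in incidents) / n
    # MTBF: mean gap between the starts of consecutive incidents.
    starts = sorted(i["start"] for i in incidents)
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    mtbf = sum(gaps) / len(gaps) if gaps else None
    return {"MTTD": mttd, "MTTR": mttr, "MTBF": mtbf}
```

Tracked per service per week, these three numbers are the baseline against which a chaos program's impact is judged.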
After you gather all of the preceding metrics, you can determine whether the chaos experiments have had a successful impact. You can also set goals for your teams and define the metrics of success.
With these metrics collected, you can answer some pertinent questions, such as the following.
Let us assume a shared MySQL database backed by a fleet of 100 MySQL hosts, with multiple shards per host. Region A holds the primary database host along with two replicas; Region B holds a pseudo primary and two pseudo replicas.
In this scenario, the sequence of the chaos experiments is as follows.
After you shut down one replica, measure the time to detect the shutdown, remove the replica, kick off a clone, complete the clone, and add the clone back to the cluster. Run this shutdown experiment at a steady frequency, ensuring that it never leaves the cluster with zero replicas at any moment.
You then draft a report of the mean time to recover after a replica shutdown. The last step is to break this average down by day and hour to determine the peak hours.
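The timing breakdown described above can be computed from monitoring timestamps. A sketch, assuming each phase timestamp is recorded in seconds since the replica was shut down (the phase names are this article's, not a real monitoring schema):

```python
def recovery_breakdown(timestamps):
    """timestamps: dict mapping phase -> seconds since the replica shutdown.

    Returns the duration of each phase plus the total recovery time.
    """
    order = ["detected", "removed", "clone_started", "clone_finished", "rejoined"]
    prev = 0.0
    phases = {}
    for phase in order:
        phases[phase] = timestamps[phase] - prev  # time spent in this phase
        prev = timestamps[phase]
    phases["total_recovery"] = timestamps["rejoined"]
    return phases
```

Averaging `total_recovery` across repeated runs gives the mean recovery time the report needs, and the per-phase durations show where that time is actually spent (typically in the clone).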
From the first experiment, you get results and data: if the cluster has only a single replica, you get an alert after five minutes. At this point, you do not yet know whether the alerting threshold needs adjusting to avoid incidents more effectively.
Now leverage this data to answer the questions posed in the second experiment. Use the weekly average of the mean time from observing a failure to adding a clone to understand the impact of this chain of activities. You can also judge whether a five-minute alerting threshold is appropriate for preventing SEVs.
However, you know that a pseudo primary and two replicas can handle the transactions. In this experiment, increase the replica count to four while shutting down two replicas. Then measure, over many months, the Monday-morning time needed to clone two new replicas from the existing primary, and compute the mean time for this process.
This experiment can identify unknown issues, for example, that the primary cannot bear the load of cloning and backups simultaneously, so you need to make better use of the replicas.
You also need to find out whether the pseudo region can fail over effectively. In this experiment, you have to shut down the primary and the two replicas, the entire cluster. In a real-life scenario, this failure would be unexpected; hence, you would not be prepared to handle it.
Such a shutdown would need some engineering work. Your task is to assign high priority to this engineering work to address such a failure scenario. After this engineering work, you can proceed with chaos experiments.
The implementation of chaos experiments is guided by three pillars, which are the following:
You can never attain 100% test coverage in software. Expanding coverage is time-consuming, and you can never account for every scenario. Instead, improve coverage by determining which tests have the maximum impact, meaning you test the scenarios with the gravest consequences.
Some examples are network failures, network saturation, and non-availability of storage.
Ensure the experiments are run frequently and repeatedly, including in the production environment.
Infrastructure, systems, and software are subject to constant modification, and their health or condition can change quickly. The optimum place to experiment is therefore the CI/CD pipeline, executed whenever a change is made: the potential impact of a modification is best measured as the change begins its confidence-building journey through the pipeline.
The production environment consists of real user activity, and the traffic load and traffic spikes are real. If you decide to run chaos experiments in production, you can thoroughly test the resilience and strength of the production system and procure all the essential insights.
You cannot hamper production in the name of science. It is therefore responsible practice to restrict the blast radius of chaos experiments. Concentrate on small experiments that yield the insights you need, and keep tests tightly scoped. An example is testing the network latency between two specific services.
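A latency test like this can be scoped to a single call path. The sketch below wraps one hypothetical service call with an artificial delay and measures the effect; in a real system, the fault would be injected at the network layer, but the blast radius idea is the same: only this one path is touched.

```python
import time

def call_service(payload):
    """Hypothetical downstream service call."""
    return {"echo": payload}

def with_injected_latency(func, delay_s=0.2):
    """Wrap a single service call with an artificial delay."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)  # the injected fault, scoped to this one call path
        return func(*args, **kwargs)
    return wrapper

# Inject 50 ms of latency into just this call and observe the effect.
slow_call = with_injected_latency(call_service, delay_s=0.05)
start = time.monotonic()
result = slow_call("ping")
elapsed = time.monotonic() - start
```

Nothing else in the system is affected, which is exactly what limiting the blast radius means in practice.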
The chaos engineering teams have to adhere to a disciplined method during their experiments and test the following:
Including chaos experiments in the current software development life cycle helps organizations improve the speed, flexibility, and resilience of their systems and operate distributed systems smoothly. It also enables the remediation of issues before they affect the system. Organizations are finding that executing chaos experiments is significant, and implementing them paves the way for better results in the future.
Netflix engineers used chaos experiments to evaluate different variables and components without affecting end users. One Netflix chaos experiment involved terminating production instances and corrupting data tables to verify that the system would not collapse when a specific service failed.
DevOps teams are increasingly turning to chaos engineering to build more resilient software and systems and to resolve issues before they cause incidents.