Introduction
The role of CI/CD pipelines
CI/CD pipelines automate the process of building, testing, and deploying software applications, enabling faster release cycles and more reliable updates by catching issues early and reducing manual errors [1]. By automating testing and deployment, CI/CD pipelines enable software teams to release updates more frequently without sacrificing quality, availability or security. This directly serves availability objectives by reducing deployment failures and downtime. It also serves productivity objectives by eliminating tedious manual processes, providing fast feedback to developers, and enabling them to deliver value continuously [2].
As chaos engineering gains traction, many organizations consider embedding chaos experiments within these CI/CD pipelines to facilitate continuous resilience testing. However, this conflates two very different intents: standard "tests" and exploratory "experiments" like those used in chaos engineering.
Putting chaos experiments directly into CI/CD pipelines often creates operational and cognitive challenges for organizations. Customers who have tried this approach end up with slow pipelines, frustrated engineers, and more questions than answers. By injecting failure into the very paths intended for smooth, reliable software release, chaos directly conflicts with the core objectives of CI/CD pipelines around maintaining availability and productivity. Uncontrolled chaos risks disrupting the velocity and reliability that CI/CD pipelines aim to provide.
This well-intentioned but obstructive approach stems from conflating "testing" and "experimentation" and not appreciating their distinct implications.
Tests vs experiments
There is terminological confusion between "tests" and "experiments" because of the overlap in language [3] - we "test" hypotheses in experiments and employ "testing" procedures as part of experimental design.
Tests assess attributes under defined conditions, while experiments manipulate variables to uncover causal relationships. So while testing occurs within experiments, their overarching goals differ - tests quantify attributes, while experiments explore causal relationships.
However, the distinction is nuanced. There are misconceptions that experiments cannot "fail" or only generate high-level data, unlike tests. But experiments can fail to generate sufficient insights, and both tests and experiments face the challenge of balancing realism with analytical rigor [3]. Rather than being a binary categorization, they overlap in needing to extract meaningful results while navigating operational relevance versus analytical precision.
While we cannot be language police dictating word choices, understanding these nuanced differences is key to understanding the complementary strengths of tests and experiments.
The implications of this confusion for chaos engineering
The determinism problem
Chaos engineering has adopted the concept of "experimentation" to validate system resilience. But due to this loaded language, we have come to refer to chaos "experiments" as chaos "tests", which is reasonable linguistic flexibility. However, it remains useful to distinguish their implications, primarily around determinism.
Every software system encounters problems, but a complex system amplifies the issues of determinism significantly. Concurrency exists within individual machines and across multiple nodes. Networks are unreliable, often delaying or reordering packets at critical moments. Disks are prone to data corruption. Additionally, various failure modes can occur: machine failures, power outages, data center disasters, and human errors that inadvertently worsen the situation. The most challenging aspect of these failure modes is their non-deterministic nature. A critical bug may only manifest under a specific sequence of events across multiple parts of the system. Even if a chaos experiment or a test detects it once, it may never catch it again (but customers will inevitably encounter it in production).
Using chaos "tests" falsely implies the deterministic outcomes you get from tests, including the tests used in CI/CD pipelines. Tests, whether unit, integration, or end-to-end, focus on validating specific, defined properties or behaviors of the system against expected requirements. They check that, given certain inputs, the system responds with the correct outputs per its specification.
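As a minimal sketch of that deterministic framing (the function and values are invented purely for illustration), a conventional unit test looks like this: the same input always yields the same verdict.

```python
# Hypothetical function under test and its unit test: given fixed inputs,
# the outcome is fully determined, so the pass/fail verdict never varies.
def apply_discount(price: float, percent: float) -> float:
    """Toy pricing function used only for illustration."""
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_returns_expected_price():
    # Deterministic check: identical inputs always produce identical outputs.
    assert apply_discount(100.0, 15) == 85.0
```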
Complex distributed systems exhibit emergent, non-deterministic behaviors that arise from complex component interactions. These emergent behaviors are difficult to predict by examining individual parts of the system alone. Chaos experiments intentionally explore these non-deterministic, emergent behaviors to uncover unknown weaknesses and failure modes stemming from the system's complexity.
Tests set expectations. Experiments reveal surprises.
This differentiation isn't to dictate word choice but to recognize how non-determinism shapes the context for embedding chaos engineering experiments in a CI/CD pipeline.
The observability problem
Tests and experiments also fundamentally differ in their need for observation. Tests validate deterministic functionality - they reliably pass or fail based on expected behavior coded into them. Beyond establishing those initial criteria, no ongoing monitoring is required to interpret test operation. The outcomes are predefined. Tests embedded in pipelines run automatically with no need for human attention. Their results are definitive pass/fail with no uncertainty.
Chaos experiments, in contrast, tend to surface non-deterministic insights. Because they explore complex systems under uncertain real-world fault conditions, the outcomes of chaos experiments are unknown by design. Each experiment reveals new potential failure modes or resilience gaps specific to that run. So while automation initializes experiments, there is an essential human element in oversight - chaos engineers must actively monitor the system's response to the induced faults because the emergent behavior of the stressed system, and therefore the outcomes, are largely unknown.
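As a rough sketch of that oversight (the metric source, names, and threshold are assumptions, not a specific tool's API), a chaos engineer might pair the automated fault injection with a steady-state guardrail like the one below, while still watching dashboards for the behaviors no pre-coded guardrail anticipates.

```python
import time

# Illustrative steady-state guardrail: poll a key metric while faults are active
# and halt the experiment if the system drifts outside acceptable bounds.
# get_p99_latency_ms() is a placeholder for a query against your observability stack.
P99_LATENCY_THRESHOLD_MS = 500


def get_p99_latency_ms() -> float:
    """Placeholder: fetch the current p99 latency from your metrics backend."""
    raise NotImplementedError


def within_steady_state(duration_s: int = 300, interval_s: int = 15) -> bool:
    """Return True if the metric stayed within bounds for the whole observation window."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        latency = get_p99_latency_ms()
        if latency > P99_LATENCY_THRESHOLD_MS:
            print(f"Stop condition hit: p99 latency {latency:.0f} ms - halting fault injection")
            return False
        time.sleep(interval_s)
    return True
```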
The mindset problem
Another key difference is mindset. Tests bring an assurance mindset - they demonstrate quality for stakeholders. Experiments take more of a research perspective - they emphasize learning. Tests produce yes/no answers to "does this work as expected?" Experiments reveal insights around "how does this break?" and "how resilient is the system?"
The key learnings arise from interpreting system reactions to turbulence, not just optimizing for pass/fail outcomes. Chaos experiments tap into domain expertise to make educated guesses on how systems might react under turbulent conditions. But uncertainty still reigns. So chaos engineers set up observability tools, introduce failures, gather results, and fine-tune their expectations for next time.
Unexpected outcomes spawn new failure scenarios to explore. And even expected results prompt repetitions with varied conditions to deepen understanding and generalization. This iterative loop of hypothesis refinement continues until system behaviors become more predictable. Only then do they crystallize into codified assertions for regression testing.
The problem with CI/CD pipelines and chaos experiments
Continuous Resilience Testing
The concept of "continuous resilience testing" seems appealing on the surface. The idea is to inject faults into systems as part of CI/CD pipelines to continuously validate resilience characteristics. If a code change degrades the system's ability to withstand turbulence, the pipeline would fail - protecting against resilience regressions.
This stems from the mindset of traditional testing, where you assert known behaviors against specific inputs. If those assertions pass, you can safely progress the change. Applying this thinking to resilience, it seems logical to create resilience "tests" that inject failures, then validate that the system responds properly. Run these tests continuously to catch any regressions before production.
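In code form, that reasoning tends to produce something like the following hypothetical pipeline step (the helpers are placeholders, not a real fault-injection API): inject a fault, assert a fixed expectation, and gate the deployment on the result.

```python
# Hypothetical "resilience test" expressing the deterministic-testing mindset:
# inject a fault, then assert a single pre-defined expectation about the response.
# The helpers below stand in for real fault-injection and observability tooling.
def inject_fault(target: str, fault: str, duration_s: int) -> None:
    """Placeholder: start a fault (e.g. blackhole traffic to a dependency)."""
    raise NotImplementedError


def remove_fault(target: str) -> None:
    """Placeholder: roll back the injected fault."""
    raise NotImplementedError


def checkout_error_rate(window_s: int) -> float:
    """Placeholder: measure the error rate observed during the fault window."""
    raise NotImplementedError


def test_checkout_survives_payments_db_blackhole():
    inject_fault(target="payments-db", fault="network-blackhole", duration_s=60)
    try:
        # Assumes the system's reaction is repeatable enough to gate a deployment on -
        # the assumption the rest of this section questions.
        assert checkout_error_rate(window_s=60) < 0.01
    finally:
        remove_fault(target="payments-db")
```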
However, this deterministic testing mindset breaks down when applied to the inherent non-determinism of complex, distributed systems, as Richard Cook describes in "How Complex Systems Fail" [4].
Complex systems contain mixtures of latent failures and run in degraded modes. Their operations are dynamic, with components failing and adapting continuously. As Cook states, "Catastrophe is always just around the corner" - the potential for failures is ever-present due to the system's nature.
Crucially, Cook highlights that catastrophic failures require combinations of failures - there is no single "root cause."
This has severe consequences: The same failure injected repeatedly can manifest vastly different outcomes due to changing conditions like network delays, load levels, caching effects and more.
And therefore, the moment you inject a failure into a system, you enter the realm of non-deterministic experimentation rather than predictable testing. The system's response becomes unpredictable and dependent on the current complex state.
This non-determinism poses challenges for continuous resilience "testing" using fault injection as part of CI/CD pipelines. Pipelines require reliable, reproducible test results to facilitate automated deployment decisions. But the inherent unpredictability of failure injection means its outcomes can never be truly deterministic. Instead, you get inconsistent "test" results that impair automation rather than enabling it.
While the goal of continuously testing resilience is desirable, directly embedding chaos experiments into CI/CD pipelines conflates deterministic testing with non-deterministic exploration of failure modes. The two have different requirements and implications which must be properly clarified.
The load and duration problem
Even if we set aside debates over "testing" and "experimentation" nomenclature, injecting chaos experiments directly into CI/CD pipelines is problematic due to the long runtimes involved. For meaningful insights, chaos experiments must run against a production-like environment under realistic operating conditions.
This means first deploying the system version to a non-production chaos environment and establishing real-world traffic volumes and steady-state operation. Only once the system has achieved a stable operating state with production-equivalent load can faults meaningfully be injected to observe true resilience characteristics.
Simply injecting faults into an idling system provides limited fidelity. Complex systems often exhibit different behaviors when operating under load versus at low utilization. Caching layers, queues, connection pooling and other structural elements can mask or amplify the impacts of failures depending on real-time activity levels.
As a result, comprehensive chaos experiments require an involved setup phase to simulate production operating conditions before the actual fault injection can start. For large-scale distributed systems, attaining steady-state can take 30-60 minutes or more as load generators ramp up traffic.
Then, once the system is operating stably under load, teams must execute a full battery of chaos experiments modeling different failure scenarios. Even a minimal suite may require 10-50+ discrete experiments of 5+ minutes each to properly explore potential failure domains.
So while a basic fault injection may nominally last 5 minutes, the total runtime can span multiple hours when incorporating:
Deployment to a non-prod environment
Baseline setup to stable operation under load
Repeated iterations across a comprehensive failure test suite
A reasonably comprehensive resilience assessment spanning just 10 minimal experiments would easily extend a standard build pipeline by 2-4 hours. More rigorous regimens assessing 30-50 complex failure scenarios, increasingly common for mature organizations, balloon total runtimes to days or even weeks. This overhead severely disrupts CI/CD pipeline velocity and developer productivity.
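As an illustrative back-of-the-envelope breakdown (all figures assumed): roughly 15 minutes to deploy the build into a chaos environment, 45 minutes for load generators to reach steady state, and 10 experiments at about 12 minutes each including fault injection, observation, and recovery between runs gives 15 + 45 + 10 × 12 = 180 minutes, or about 3 hours added to the pipeline for even a minimal suite.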
As customers' resilience requirements increase, their deployment cadence paradoxically slows down - the exact opposite of the agility CI/CD pipelines aim to provide.
The value problem
While the desire to continuously execute the same battery of chaos experiments is understandable, this approach eventually delivers diminishing returns. When experiments and environments remain static across iterations, teams rapidly exhaust the learning potential - key weaknesses get exposed early without revealing additional vulnerabilities. Yet costs continue accruing as repetitive experiments congest pipelines, inflate observability data volumes, and divert focus from designing novel failure scenarios. Worse still, continuous success on static chaos "tests" creates overconfidence stemming not from rigorous resilience validation, but from confirmation bias born of never exploring unknown failure modes outside a limited purview. This complacency leaves systems susceptible to turbulent conditions that do not match pre-coded assumptions. For meaningful signal, the cadence and content of chaos experiments must mirror the pace of change in the underlying infrastructure, configurations, and risk exposures. Otherwise, continuity promotes repetition without enrichment rather than enhanced resilience.
Moreover, organizations that initially attempt to embed chaos experiments in their CI/CD pipelines are left with difficult tradeoffs. With limited pipeline capacity, engineers must prioritize which failure scenarios to execute chaos experiments for. These prioritization decisions can become subjective if not guided by clear criteria, potentially leading to biases based on recent events or personal perspectives. Pipeline failures from disruptive experiments can also erode confidence in automation processes and distract engineers from routine deployments. Without defined policies around the scope and cadence of chaos experiments, they risk overloading pipelines and personnel. Rather than improving resilience, the outcome may instead be reduced productivity and misalignment across teams. These fundamental issues make chaos experiments poor candidates to embed in CI/CD pipelines.
Dedicated chaos experimentation pipeline and environment
The solution is splitting these flows - maintaining distinct CI/CD pipelines for delivering features while learning via separate chaos engineering pipelines.
Core CI/CD pipelines retain their fast, reliable release mentality - rapidly progressing changes through test and validation steps. The focus is minimizing friction for code deployments.
Chaos pipelines take an inverted approach - deliberately injecting turbulence into systems. Their goal is to evolve resilience by exploring failure modes, not to facilitate deployments.
A "chaos pipeline" refers to more than just orchestrating failure injection experiments. This chaos pipeline also handles deploying software changes into designated chaos environments. So code progresses through the chaos pipeline to land in a separated sandbox, then experiments run against that deployed version to validate resilience capabilities. This differs from a CI/CD pipeline which takes code to production level environments. So in essence, the chaos pipeline parallels the infrastructure deployment aspects of the CI/CD pipeline but directs it to isolated chaos test spaces rather than customer-impacting production. This ensures access to current software builds to experiment against without impacting operational stability or velocity of main deployment flows.
Splitting pipelines allows tailored objectives and cadence for each goal:
CI/CD Pipelines
Deploy code changes smoothly
Validate functionality and integration
Prioritize developer productivity
Chaos Pipelines
Inject faults safely to learn
Uncover risks and weaknesses
Prioritize resilience over velocity
Splitting pipelines isolates disruptive experiments from the flow of planned work and releases. However, learnings should make the jump between the two; new issues uncovered through chaos experiments should be turned into learnings that drive broader improvements. Insights may reveal needed features, uncover bugs, or prompt architectural changes. Some may be codified into new test cases and added to the CI/CD pipeline test suites. While not all chaos experiment insights will map neatly to functional tests, the ability to promote key learnings into regression test suites provides a feedback loop between the two pipelines. This allows development pipelines to leverage resilience insights without the upfront overhead.
Independent chaos pipelines free experimentation from deployment pressures. Failure scenarios evolve continuously without blocking deliveries or frustrating engineers. Separated pipelines allow independent testing and experimental mentalities - "never fail" vs "fail often" - to coexist and inform each other. Reliability grows incrementally without undermining velocity.
One inherent trade-off with having a separate chaos pipeline is the potential to release a version into production before resilience regressions are detected. However, this operational risk can be mitigated by maintaining a robust suite of deterministic resilience tests integrated into the core CI/CD pipeline. While transient resilience exposures cannot be fully eliminated, the velocity gains from decoupling deployment flows from chaos engineering experiments generally provide a net benefit when supported by thorough testing in the deployment pipeline.
Having a dedicated chaos pipeline also expands the possibilities for more comprehensive experiments by removing constraints around their duration. Without being time-boxed by CI/CD pipeline schedules, engineers gain flexibility to design and run experiments over extended periods - hours or even days if needed. This allows modeling complex, real-world scenarios like zonal and regional impairments that take extended time to run.
Additionally, separating chaos pipelines allows introducing random variations and unpredictable fault conditions into each experiment run, providing more chances to identify unknown issues. Rather than executing the same pre-defined tests repeatedly, randomized chaos experiments better emulate the unpredictable nature of real-world failures.
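As a sketch of what that randomization can look like (fault types, targets, and parameter ranges here are assumptions), each run can draw its conditions from a recorded seed so that surprising results remain reproducible.

```python
import random

# Illustrative randomized experiment definition: instead of replaying one fixed
# scenario, each run draws its fault, target, and timing from configured ranges.
FAULT_TYPES = ["network-latency", "packet-loss", "instance-termination", "cpu-stress"]
TARGETS = ["checkout-service", "inventory-service", "payments-db"]


def random_experiment(seed=None) -> dict:
    """Build a randomized experiment; recording the seed keeps a run reproducible."""
    rng = random.Random(seed)
    return {
        "fault": rng.choice(FAULT_TYPES),
        "target": rng.choice(TARGETS),
        "latency_ms": rng.randint(50, 2000),    # only applied for latency faults
        "duration_s": rng.randint(120, 900),
        "start_delay_s": rng.randint(0, 300),   # vary when the fault lands relative to load
    }


print(random_experiment(seed=42))
```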
This continuous chaos experimentation pipeline is the essence of comprehensive continuous resilience validation, providing confidence that your system can withstand a wide range of turbulent scenarios across different conditions over time.
While CI/CD pipelines are driven by product release cadences, chaos pipelines should operate on their own schedules aligned to resilience needs rather than deployment schedules. Determining the appropriate chaos experiment execution frequency depends on factors like the rate of changes in the application and supporting infrastructure. More frequent experiments are beneficial before major updates to better understand potential new failure domains. It also depends on the criticality and resilience maturity of the system. More aggressive weekly or bi-weekly experiment executions allow tighter verification for high-criticality systems and faster learning for those early in their resilience journey. Highly regulated industries may also dictate specific resilience audit frequencies that chaos pipelines should match.
Introducing determinism through mocks
There is still value in creating deterministic resilience tests, but they must be clearly differentiated from fault-injection based chaos experiments. One approach is to use mocking techniques to simulate failure scenarios in an isolated, controlled environment.
Chaos experiments can help narrow down particular issues, but then we need to remove the non-deterministic element using mocks. By mocking external dependencies like databases, message queues, and third-party APIs, you can reproduce specific failure states more reliably.
You can also add fault injection into the mock APIs to simulate various failure scenarios. These "resilience tests" can validate how your application handles different error cases gracefully - like retries, fallbacks, and circuit-breaking.
Since the tests run in deterministic conditions based on mocked behavior, you can embed them alongside regular functional tests in the CI/CD pipeline. Their pass/fail results directly map to whether the application handles failure scenarios correctly based on those mocked conditions.
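For instance, a minimal sketch using Python's unittest.mock (the client function, endpoint, and fallback value are invented for illustration) can simulate a dependency timeout and assert that the fallback path engages.

```python
from unittest import mock

import requests

# Hypothetical client under test: it calls an external pricing API and falls back
# to a cached default if the dependency fails. Names and values are illustrative.
FALLBACK_PRICE = 9.99


def get_price(sku: str) -> float:
    try:
        response = requests.get(f"https://pricing.internal/price/{sku}", timeout=1)
        response.raise_for_status()
        return response.json()["price"]
    except requests.RequestException:
        return FALLBACK_PRICE  # degrade gracefully instead of failing the request


def test_get_price_falls_back_when_dependency_times_out():
    # Deterministic "resilience test": the failure mode is mocked, so the result is
    # a repeatable pass/fail that can sit alongside functional tests in the pipeline.
    with mock.patch("requests.get", side_effect=requests.Timeout):
        assert get_price("sku-123") == FALLBACK_PRICE
```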
However, these mocked tests have inherent limitations compared to full chaos experiments against production-like systems:
They only validate resilience for the specific subset of mocked components, not true end-to-end resilience
They cannot properly simulate real-world complexities and emergent conditions from component interactions
Continuously expanding test scenarios can become onerous as systems grow
So while useful for scoped validation, mocked resilience tests are not a substitute for chaos experimentation against production-like systems. The determinism they provide is valuable but limited.
Conclusion
As systems grow more distributed and complex, resilience can no longer be an afterthought. Comprehensive resilience validation must become a core practice integrated into modern software delivery.
However, injecting uncontrolled chaos directly into deployment pipelines is untenable. It conflates fundamentally different activities - deterministic testing focused on predictable delivery, and non-deterministic chaos experimentation targeting failure exploration.
The solution is to decouple these flows through a dedicated chaos pipeline running in parallel to core CI/CD pipelines. Complementing the chaos pipeline with deterministic resilience testing using mocks then provides a mechanism to continuously validate known resilience issues in an automated fashion.
By properly segregating chaos from deployment, while establishing synergies between them, organizations can achieve both continuous delivery and comprehensive resilience assurance through continuous resilience validation practices.
References
[1] Humble, J., & Farley, D. (2010). Continuous delivery: reliable software releases through build, test, and deployment automation. Pearson Education.
[2] Fowler, M., & Foemmel, M. (2006). Continuous integration. ThoughtWorks. http://www.thoughtworks.com/ContinuousIntegration.pdf
[3] Kass, R. (2008). Tests and Experiments: Similarities and Differences.
[4] Cook, R. (2002). How Complex Systems Fail.