Chaos Engineering: Breaking Things on Purpose

Published on December 25, 2025

Unit and integration tests tell you that your code does what you expect when everything goes well. But in production, things fail: a server goes down, the network drops, a disk fills up, an external service takes 30 seconds to respond. The uncomfortable question is: does your system recover on its own or does it take everything down with it?

Chaos Engineering is the discipline of deliberately experimenting on systems (often in production, in a controlled way) to verify that the resilience you designed into the architecture actually works. Netflix has been doing it for years: after their migration to AWS, they needed to ensure that an instance or region failure wouldn’t take down streaming. They created Chaos Monkey (and later the whole “Simian Army”: Latency Monkey, Conformity Monkey, etc.) to randomly shut down servers and components in production. The idea: if your system is designed assuming anything can fail at any time, when it actually fails it’s no longer a surprise. In short: they inject chaos on purpose, observe how the platform responds, and fix what doesn’t hold up.

In this article I’ll introduce the philosophy of Chaos Engineering, why “normal” tests aren’t enough to cover these scenarios, and share a practical experience: a workflow orchestration system on AWS with dockerized microservices, PostgreSQL and Dragonfly as cache, where we decided to break things on purpose to validate that the system held up.

Why tests aren’t everything

Tests are essential. They cover business logic, contracts between services, regressions. But they have clear limits when it comes to validating resilience:

  • They usually don’t simulate real infrastructure failures: A test can mock a timeout, but it doesn’t simulate a container disappearing mid-transaction, a Redis node going down, or an AWS instance restarting without notice.
  • The test environment isn’t production: In tests you typically have a small DB, a single node, no real network, no latency or partitions. The failures you see in production (concurrency, connection limits, network timeouts) don’t always reproduce in CI.
  • They don’t test recovery orchestration: Having retries in a service is fine, but who detects the failure? Who reroutes traffic? How do the other microservices behave when one stops responding? That usually lives in the infrastructure and orchestration layer, not in a unit test.

Chaos Engineering doesn’t replace tests; it complements them. Tests validate that the system does the right thing; chaos validates that when something fails, the system keeps meeting expectations or recovers in an acceptable way.

Principles of Chaos Engineering

  • Define a steady state: Before injecting chaos, you define what “the system is fine” means (metrics, SLA, observable behavior).
  • Formulate hypotheses: For example: “If we kill a Dragonfly node, the system keeps serving traffic using the other node and we don’t lose critical data.”
  • Experiment in the real world: Experiments are run in environments that resemble production (staging) or, carefully, in production. Only then do you see real behavior under failure.
  • Automate and repeat: Experiments can be automated (scripts, pipelines) and repeated to detect resilience regressions.
  • Minimize impact: Start with small perturbations (one pod, one node) and scale up only if the system responds as expected.

The goal isn’t to “break everything” but to learn how the system behaves when something fails and use that learning to improve design, configuration, and procedures.
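The principles above can be condensed into a minimal experiment loop: verify steady state, inject one fault, wait, and check the hypothesis. This is a sketch, not a real tool; `check_steady_state`, `inject_fault`, and `restore` are hypothetical hooks you would wire to your own metrics and orchestrator APIs.

```python
import time

def run_experiment(check_steady_state, inject_fault, restore,
                   settle_seconds=30, poll=lambda s: time.sleep(s)):
    """Run one chaos experiment: verify steady state, inject a single
    fault, wait for the system to react, then re-check steady state."""
    # Steady state must hold before we touch anything; otherwise abort.
    if not check_steady_state():
        return "aborted: system not in steady state"
    try:
        inject_fault()          # e.g. kill one worker container
        poll(settle_seconds)    # give the system time to react
        # Hypothesis: the system recovered on its own.
        recovered = check_steady_state()
    finally:
        restore()               # always undo the perturbation
    return "hypothesis held" if recovered else "hypothesis failed"
```

Keeping the fault injection behind a single hook makes the "one type of failure at a time" principle explicit: each experiment run perturbs exactly one thing, and `restore` always runs even if injection itself blows up.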

Practical experience: workflow orchestrator on AWS

In a workflow orchestration project (pipelines of tasks that run in sequence or parallel, with dependencies between steps), the architecture looked like this:

  • Dockerized microservices on AWS (ECS/EKS), several services: API, workers, scheduler.
  • PostgreSQL as source of truth: workflow state, tasks, results.
  • Dragonfly as cache and queue: sessions, intermediate results, task queues for workers (Redis-compatible but with better throughput and memory use in our case).
  • AWS: Load balancers, multiple instances, private networks.

Everything was covered by tests: unit, integration with DB and Dragonfly in Docker, and contract tests between services. In staging, workflows ran fine. The question was: when something fails in production (a container, a Dragonfly node, an instance), does the system recover or does state get corrupted?

What we did

  1. We defined steady state

    • Running workflows must complete or be marked as failed in a consistent way.
    • No data loss in PostgreSQL.
    • Latency and errors below defined thresholds.
  2. Chaos scenarios (staging first)

    • Kill a container of a microservice (worker or API) at random under load.
    • Shut down a Dragonfly node (we had two in high-availability mode).
    • Restart an instance that hosted part of the services.
  3. Tools

    • Scripts that, via AWS CLI and orchestration APIs, killed tasks or instances in bounded time windows.
    • For Dragonfly, scripts that forced a node “down” and checked that the other one took over traffic.
  4. What we learned

    • Without chaos: Tests passed, but in the first experiment we saw that when a worker died, some tasks stayed “stuck” in the DB: its queue consumer (in Dragonfly) disappeared, and we hadn’t configured queue timeouts and retries properly.
    • Dragonfly: When we took down one node, the other took over, but there was a latency spike and some retries we hadn’t accounted for on the client; we adjusted timeouts and retry logic.
    • PostgreSQL: We didn’t lose data; well-scoped transactions and the fact that the DB was the source of truth saved us. What did fail was coordination: workers only picked up “orphan” tasks after a timeout we had set too high.
  5. How we fixed the problems (in prod)

    • Stuck tasks: We defined a visibility timeout on the queue (Dragonfly): if a worker took a task and didn’t confirm/complete it within X seconds, the task went back to the queue for another worker. So no message stayed “stuck” to a dead consumer.
    • Dragonfly (latency spike): On the client talking to Dragonfly we added retries with backoff and a higher connection timeout during failover; if a node went down, the client didn’t fail on the first try, it waited and reconnected to the node that was still up.
    • Orphan tasks and high timeout: We lowered the time after which a worker considers a task “abandoned” and another can claim it. Plus a periodic job (cron/scheduler) checks the DB for tasks in “in progress” whose worker hasn’t given a sign of life for more than N minutes and marks them as failed or re-queues them, depending on business policy.
    • Consistent failure marking: When a worker dies, the tasks it had assigned don’t stay in limbo: they either go back to the queue (idempotency allowed) or are marked as failed in PostgreSQL and notified. So state in DB and queue stay aligned and there are no “zombie” workflows.
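The visibility-timeout and orphan-task logic can be sketched as a pure reaper function. The task shape and names here are illustrative, not our actual schema: `in_progress` stands in for the “in progress” rows in PostgreSQL, and `requeue`/`mark_failed` for the real queue push and DB update.

```python
import time

# Illustrative task record: (task_id, claimed_at_epoch_seconds) for
# every task currently marked "in progress".
def find_abandoned(in_progress, visibility_timeout, now=None):
    """Return ids of tasks whose worker exceeded the visibility
    timeout and can be considered dead."""
    now = time.time() if now is None else now
    return [task_id for task_id, claimed_at in in_progress
            if now - claimed_at > visibility_timeout]

def reap(in_progress, visibility_timeout, requeue, mark_failed,
         idempotent=True, now=None):
    """Periodic job: re-queue abandoned tasks when they are safe to
    retry (idempotent), otherwise mark them failed in the DB."""
    for task_id in find_abandoned(in_progress, visibility_timeout, now):
        (requeue if idempotent else mark_failed)(task_id)
```

In production, `requeue` would push the task back onto the Dragonfly list and `mark_failed` would update PostgreSQL (and notify) inside a transaction, so DB state and queue state stay aligned.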

With those fixes deployed, we repeated the chaos experiments in staging and then in production in controlled windows. “Normal” tests hadn’t shown us these failures; chaos did, and the fixes gave us real recovery when something actually failed.
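The client-side retry behavior described above (don’t fail on the first try during a Dragonfly failover; back off and reconnect) can be sketched generically. The wrapper and its parameters are illustrative; in our case `op` was a call through a Redis-compatible client.

```python
import time

def with_retries(op, attempts=4, base_delay=0.2,
                 retryable=(ConnectionError, TimeoutError),
                 sleep=time.sleep):
    """Call op(); on a retryable error (e.g. a node going down
    mid-failover), wait with exponential backoff and try again
    instead of failing on the first attempt."""
    for attempt in range(attempts):
        try:
            return op()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * (2 ** attempt))  # 0.2s, 0.4s, 0.8s, ...
```

Injecting `sleep` keeps the backoff testable; limiting `retryable` to connection-class errors avoids retrying operations that failed for application reasons, where a retry would just repeat the mistake.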

Why Dragonfly, PostgreSQL, AWS and Docker

  • Dragonfly: We needed high throughput and efficient memory use; Dragonfly gave us Redis compatibility with fewer resources. A node going down was a real scenario we wanted to validate.
  • PostgreSQL: It was the source of truth. Chaos let us confirm that when other components failed, we didn’t corrupt data and recovery was based on the state stored there.
  • AWS and dockerized microservices: The failures we cared about were real infrastructure: instances, containers, network. That isn’t modeled well with mocks in tests; you need to touch the real system or something very close to it.

How to get started with Chaos Engineering

  • Start in staging, not production.
  • Define metrics and thresholds before injecting chaos.
  • One type of failure at a time (one service, one node, one disk).
  • Automate experiments so you can repeat them (e.g. on every release).
  • Document what you broke, what you observed, and what you changed (config, code, runbooks).

Tools you can explore: Chaos Monkey (Netflix), Litmus Chaos, Gremlin, or your own scripts on your orchestrator (Kubernetes, ECS, etc.) and on your services (Dragonfly, PostgreSQL, APIs).

My personal perspective

Tests give you confidence that the code does what it should when the world behaves. Chaos Engineering gives you confidence that when the world doesn’t (one less server, a cache node down, an unstable network), the system you designed in Series II (redundancy, retries, timeouts, sources of truth) actually responds as you expected.

In the orchestrator project, tests didn’t cover everything: infrastructure failures, a Dragonfly node going down, containers dying in the middle of a flow. Until we broke things on purpose, we didn’t see the weak spots in timeouts, retries, and coordination between workers and DB.

I recommend treating chaos as part of designing resilient systems: define what can fail, how the system should behave when it fails, and then validate it with controlled experiments. That way, when something really fails in production, it won’t be the first time your system has faced that scenario.