Controlled Chaos

How to embrace failure and use a little mayhem to make your system more resilient.

By Derrick Medina, Devtest Manager

It’s happened again. There’s an outage that’s impacting customers, and all hands are dialing into war rooms to sort out how to stop the bleeding. Meanwhile, project managers with unwavering deadlines are trying not to panic as they watch developer hours being eaten up by the emergency. While heads are down trying to find out exactly what went wrong, you can’t help but think about what could have been done to avoid this.

Sound familiar? It’s a disheartening scene that most of us have experienced. You know that all the free pizza in the world won’t make up for the toll these emergencies take on your teams. Or the effect they have on customers and your business.

Luckily, there’s another way. Choosing to adapt your culture into one that welcomes failure allows you to prepare for when your system fails outside of a controlled environment — whether it’s the tooling you build or the processes you put in place to triage problems quickly and accurately. All you have to do to get started is throw a wrench in the gears…

Make Friends With Failure

The fact is, your systems will fail. This isn’t really any fault of your own. Systems grow over time, becoming more complex. This is true for all kinds of systems, not just software. According to Gall’s Law, first introduced in 1975:

A complex system that works is invariably found to have evolved from a simple system that worked.

With every added feature, toggle, and commit, there are more unknowns being pushed into your environment. These unknowns make it increasingly difficult to understand how your services will react when the unexpected happens.

Instead of letting your systems fail, your goal should be to make them fail. Just not in production (at least not right away). Stop waiting for the next outage and padding your estimates to make up for time spent on-call. Make changes to introduce chaos earlier in the pipeline, and actively break your systems in ways that simulate real-world outages. Then watch, record, and learn.
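As a toy illustration of what "actively breaking things" can look like, here's a minimal sketch that injects random latency into a function call, so you can watch how callers, retries, and timeouts behave. The function name and failure mode are hypothetical; real chaos tooling would target actual services.

```python
import random
import time
from functools import wraps

def inject_latency(probability=0.5, delay_s=0.05):
    """Chaos decorator: randomly delays calls to simulate a slow dependency."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_s)  # simulated network/dependency slowness
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=0.5, delay_s=0.05)
def fetch_user_profile(user_id):
    # Hypothetical downstream call; in a real experiment this would hit a service.
    return {"id": user_id, "name": "test"}
```

Dialing `probability` and `delay_s` up over time is one way to increase intensity gradually rather than all at once.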

Start Small, Ask Questions


To get going, ask yourself (and others) a few questions about a recent failure:

  • Why did this happen?
  • How long did it take us to notice it, and why didn’t we notice it right away?
  • Was there anything leading up to this that could’ve been seen as a warning?
  • What was the last outage we had before this one, and is there a common thread between them?
  • How much did the outage cost, including war room calls and customer impacts?
  • Could this have been worse?

These questions will help you find holes in your system’s coverage and reporting. And if you’re having trouble answering them, it might be best to start by tightening up your monitoring so the answers are easier to see.

From here you can begin to measure more accurately and get a bearing on exactly where you stand. More importantly, you can use these metrics to help you get buy-in from team members and leaders alike.
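To make "measure more accurately" concrete, here's a small sketch that turns an incident record into the two numbers leadership tends to care about: time to detect and war-room cost. The record fields, timestamps, and hourly rate are all made-up placeholders.

```python
from datetime import datetime

HOURLY_ENG_COST = 150  # assumed fully-loaded hourly rate per engineer, in dollars

def summarize(incident):
    """Turn one incident record into the metrics worth tracking over time."""
    minutes_to_detect = (incident["detected"] - incident["started"]).total_seconds() / 60
    duration_hours = (incident["resolved"] - incident["started"]).total_seconds() / 3600
    war_room_cost = duration_hours * incident["engineers_on_call"] * HOURLY_ENG_COST
    return {"minutes_to_detect": minutes_to_detect, "war_room_cost": war_room_cost}

# Hypothetical outage: detected 40 minutes in, resolved after two hours.
outage = {
    "started": datetime(2023, 4, 1, 9, 0),
    "detected": datetime(2023, 4, 1, 9, 40),
    "resolved": datetime(2023, 4, 1, 11, 0),
    "engineers_on_call": 6,
}
print(summarize(outage))  # 40 minutes to detect, $1,800 in war-room time
```

Tracking these per incident gives you a baseline, and a before/after story when you pitch chaos experiments to leadership.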

And you really should get that buy-in from as many people as possible who share or manage the environment you’re planning to mess with. I’ve been in situations where a co-worker decided to kick off a chaos tool in an environment, went to lunch while it broke things, and didn’t tell anyone. I know you’ll be better than that!

Plan Your First Break

Use this information to start planning some experiments around whatever your group deems the scariest. Again, be sure to alert all of the people this could affect (testing groups, developers pushing into the environment, managers giving demos). Otherwise, you might have more than a handful of people you need to offer an apology and homemade cookies to the next day.

When planning your experiments, only do what you’re comfortable with. Before introducing failure, set bounds on how big or small you want the issue to be allowed to get. You definitely shouldn’t try to take down a critical feature in your environment if you stand to learn nothing about your system. You’re building up your system’s resilience, not proving it can go down!

Likewise, if you know an error will take out your environment with near 100% certainty, it may not be worth running that experiment. Drill down to the core of what you’re trying to see, and identify smaller tests that will make your inspection more accurate. You know your organization, you know the parts of the system you’re afraid of or don’t trust. Build your hypotheses around how outages in these areas might play out. Start there, and start small. You don’t need to start in production — aim to get there one day. But most importantly, start.
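One lightweight way to keep an experiment bounded is to write the hypothesis and the abort criteria down as data before anything runs. A sketch, with every field name invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    """A bounded experiment: what we break, what we expect, and when we stop."""
    name: str
    hypothesis: str
    blast_radius: str                  # e.g. "one cache node in staging"
    abort_if_error_rate_above: float   # hard stop: halt and reset past this
    max_duration_minutes: int

    def should_abort(self, observed_error_rate, elapsed_minutes):
        return (observed_error_rate > self.abort_if_error_rate_above
                or elapsed_minutes >= self.max_duration_minutes)

experiment = ChaosExperiment(
    name="cache-node-loss",
    hypothesis="Losing one cache node degrades latency but not availability",
    blast_radius="one cache node in staging",
    abort_if_error_rate_above=0.05,
    max_duration_minutes=30,
)
print(experiment.should_abort(observed_error_rate=0.02, elapsed_minutes=10))  # False
```

Agreeing on `should_abort` ahead of time means nobody has to make a judgment call mid-experiment about whether things have gone too far.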

Unleash a Little Mayhem

For your first Game Day, you’ll assign people into two groups: offense and defense. The goal of the offense team is to trigger the predetermined experiments, increasing intensity over time. This group is also in charge of resetting the environment to a stable state. So before you begin, make sure you have a plan for getting back to a place of safety and stability. Meanwhile, the defense team will try to use their day-to-day triage skills, along with the available tools, to locate and identify the root causes of the issues.
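Since the offense team owns resetting the environment, it helps to make the reset impossible to forget. One way to sketch that, assuming hypothetical `inject` and `reset` hooks that would call your real infrastructure tooling, is a context manager that guarantees cleanup even if the Game Day goes sideways:

```python
from contextlib import contextmanager

@contextmanager
def chaos_window(inject, reset):
    """Run a fault injection, guaranteeing the reset step runs even on failure."""
    inject()
    try:
        yield
    finally:
        reset()  # always restore a stable state, crash or no crash

# Stand-in faults: in practice these would call your infra tooling.
state = {"healthy": True}

with chaos_window(inject=lambda: state.update(healthy=False),
                  reset=lambda: state.update(healthy=True)):
    assert state["healthy"] is False  # the defense team hunts for the issue here

assert state["healthy"] is True  # environment restored once the window closes
```

The `finally` block is the whole point: the path back to safety runs no matter what happens inside the window.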

After the experiment is over (and you’ve reset the environment), bring both teams together and review what took place:

  • How was the issue found?
  • Did anyone find it using other means?
  • Did the experiment create the expected failure, or did something else happen?
  • How long did it take for the issue to be found?
  • Where can monitoring be tightened up, and how can we be alerted faster?

Running an event like this puts you in the fortunate spot of reviewing something you might not have experienced unless there was a real issue. You’ve purposely made your system fail based on a hypothesis, and you can inspect the details of how your system reacted. This is the start of the mindset shift. From here on, you should try to repeat this with every failure — experiment and outage alike — asking questions and digging deeper.

What’s Next?

No matter where you’re doing your research around chaos engineering, you’ll find that the common theme is not to ask if your systems will fail, but rather to ask when and how they will fail. The hope is that after you’ve run a couple of Game Days, your next production outage will be caught by your updated tooling or improved monitoring, which means your developers won’t have to report to the war rooms and can instead stay home and enjoy pizza with their families.


Below are some helpful tools you can use to plan and conduct your next chaos event. And if you have any questions, don’t hesitate to let us know.

  • Game Day Agenda and Planning — Going in blind is really not advised here. Aim to be as organized and detailed as possible. This will make your chaos event play out much more smoothly, allowing you to focus on improving your resiliency.
  • Incident State — This living document is the first place someone who is brought in to help with an issue should look. It calls out the point of contact, exit criteria, and tasks. As simple as it is in concept, this document will cut back on added panic and noise around the issue.
  • Post Mortem — This template is similar to what we’re all used to, only it’s less an airing of grievances and more geared toward a highly detailed series of documented events, including your incident reports.

Derrick Medina is an 8-year Kenzan veteran with an extensive board game collection. With professional experience in design, devtest, release management, and client solutions in digital transformation, he’s always looking for ways to improve processes.


