How to embrace failure and use a little mayhem to make your system more resilient.

By Derrick Medina, Devtest Manager

It’s happened again. There’s an outage that’s impacting customers, and all hands are dialing into war rooms to sort out how to stop the bleeding. Meanwhile, project managers with unwavering deadlines are trying not to panic as they watch developer hours being eaten up by the emergency. While heads are down trying to find out exactly what went wrong, you can’t help but think about what could have been done to avoid this.

Sound familiar? It’s a disheartening scene that most of us have experienced. You know that all the free pizza in the world won’t make up for the toll these emergencies take on your teams. Or the effect they have on customers and your business.

Luckily, there’s another way. Choosing to adapt your culture into one that welcomes failure allows you to prepare for when your system fails outside of a controlled environment — whether it’s the tooling you build or the processes you put in place to triage problems quickly and accurately. All you have to do to get started is throw a wrench in the gears…

Make Friends With Failure

Failure is something we typically work hard to avoid. We stack up on testing as early as possible in the pipeline and try our best to keep key features standing. But most of the time, we don’t plan for anything to actually fail. Which is how we end up dialing into war rooms and munching on late-night pizza.

The fact is, your systems will fail. This isn’t really any fault of your own. Systems grow over time, becoming more complex. This is true for all kinds of systems, not just software. According to Gall’s Law, first introduced in 1975:

A complex system that works is invariably found to have evolved from a simple system that worked.

With every added feature, toggle, and commit, there are more unknowns being pushed into your environment. These unknowns make it increasingly difficult to understand how your services will react when the unexpected happens.

Instead of letting your systems fail, your goal should be to make them fail. Just not in production (at least not right away). Stop waiting for the next outage and padding your estimates to make up for time spent on-call. Make changes to introduce chaos earlier in the pipeline, and actively break your systems in ways that simulate real-world outages. Then watch, record, and learn.

Start Small, Ask Questions

The key to introducing chaos is to start small. Failure and chaos are inherently scary. Just searching the web for “chaos” and looking at images can be a little off-putting. (Even if those images do look similar to some architectures!) So don’t go out tomorrow looking to change your whole company all at once. Show your successes and gradually guide the organization to a place where it’s more common to talk about how something will fail, not if something will fail.

Search for “chaos” and you might see something like this image.

To get going, ask yourself (and others) a few questions about a recent failure:

  • Why did this happen?
  • How long did it take us to notice it, and why didn’t we notice it right away?
  • Was there anything leading up to this that could’ve been seen as a warning?
  • What was the last outage we had before this one, and is there a common thread?
  • How much did the outage cost, including war room calls and customer impacts?
  • Could this have been worse?

These questions will help you to find holes in your system’s coverage and reporting. And if you’re having trouble answering these questions, then it might be good to start with tightening up your monitoring to make these things clearer and easier to see.

From here you can begin to measure more accurately and get a bearing on exactly where you stand. More importantly, you can use these metrics to help you get buy-in from team members and leaders alike.
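To put a rough number on the cost question above, a back-of-the-envelope calculation is often enough. The sketch below is only an illustration: the function name, its parameters, and the figures in the usage line are assumptions made up for the example, not data from a real incident.

```python
def outage_cost(responders: int, hours: float, hourly_rate: float,
                customer_impact: float = 0.0) -> float:
    """Rough outage cost: war room time plus an estimate of customer impact.

    All inputs are estimates you supply; nothing here is measured for you.
    """
    if min(responders, hours, hourly_rate, customer_impact) < 0:
        raise ValueError("inputs must be non-negative")
    war_room_cost = responders * hours * hourly_rate
    return war_room_cost + customer_impact


# Hypothetical example: 6 people in a war room for 3 hours at a blended
# rate of 100 per hour, plus 2,500 in estimated customer impact.
example_cost = outage_cost(responders=6, hours=3, hourly_rate=100, customer_impact=2500)
print(example_cost)
```

Even a crude figure like this gives leaders something concrete to weigh against the time a chaos experiment would take.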

And you really should get that buy-in from as many people as possible who share or manage the environment you’re planning to mess with. I’ve been in situations where a co-worker decided to kick off a chaos tool in an environment, go to lunch while it started breaking things, and not tell anyone. I know you’ll be better than that!

Plan Your First Break

Let’s assume that worked — you showed your reporting, talked about what you stand to gain, and you got a green light. Now comes the fun part. Sit down with your project managers, developers, technical architects, and devtest engineers. Ask them about the concerns they have around your systems. These “boots on the ground” gut checks are often extremely accurate.

Use this information to start planning some experiments around whatever your group deems the scariest. Again, be sure to alert all of the people this could affect (testing groups, developers pushing into the environment, managers giving demos). Otherwise, you might have more than a handful of people you need to offer an apology and homemade cookies to the next day.

When planning your experiments, only do what you’re comfortable with. You should set bounds for how big or small you want the issue to be before introducing failure. You definitely shouldn’t try to take down a critical feature in your environment if you stand to learn nothing about your system. You’re building up your system’s resilience, not proving it can go down!

Likewise, if you know an error will take out your environment with near 100% certainty, it may not be worth running that experiment. Drill down to the core of what you’re trying to see, and identify smaller tests that will make your inspection more accurate. You know your organization, and you know the parts of the system you’re afraid of or don’t trust. Build your hypotheses around how outages in these areas might play out. Start there, and start small. You don’t need to start in production, but aim to get there one day. Most importantly, start.
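One way to keep an experiment bounded is to write the plan down as data before anyone touches the environment. The following is a minimal sketch, assuming a hypothetical `ChaosExperiment` record; the field names, thresholds, and example values are made up for illustration and don’t come from any particular chaos tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosExperiment:
    """Hypothetical plan for one bounded experiment (illustrative fields only)."""

    name: str
    hypothesis: str              # what you expect to happen and hope to learn
    max_affected_percent: float  # upper bound on how much of the environment is touched
    abort_error_rate: float      # stop when the observed error rate exceeds this (0 to 1)
    rollback_step: str           # how to get back to a stable state

    def validate(self) -> None:
        """Raise ValueError when the plan is missing a bound or a way back."""
        if not self.hypothesis.strip():
            raise ValueError("no hypothesis: nothing to learn from this experiment")
        if not 0 < self.max_affected_percent <= 100:
            raise ValueError("max_affected_percent must be in (0, 100]")
        if not 0 < self.abort_error_rate <= 1:
            raise ValueError("abort_error_rate must be in (0, 1]")
        if not self.rollback_step.strip():
            raise ValueError("no rollback step: plan the way back first")

    def should_abort(self, observed_error_rate: float) -> bool:
        """True when the observed error rate has crossed the agreed bound."""
        return observed_error_rate > self.abort_error_rate


# Example values are assumptions, not recommendations.
plan = ChaosExperiment(
    name="cache-latency",
    hypothesis="Checkout falls back to the database when the cache is slow",
    max_affected_percent=10,
    abort_error_rate=0.05,
    rollback_step="Remove the injected latency and restart the cache client",
)
plan.validate()
```

Writing the hypothesis and the abort threshold down first makes it harder to run an experiment you stand to learn nothing from.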

Unleash a Little Mayhem

It’s time to unleash chaos! There are several methods you can use to find out how resilient your system is, such as a Game Day, Wheel of Mayhem, or Failure Friday. There are slight differences, but all involve a series of pre-planned experiments. If it’s your first time with chaos engineering, getting approval for and planning a Game Day is likely your best bet.

For your first Game Day, you’ll assign people into two groups: offense and defense. The goal of the offense team is to trigger the predetermined experiments, increasing intensity over time. This group is also in charge of resetting the environment to a stable state. So before you begin, make sure you have a plan for getting back to a place of safety and stability. Meanwhile, the defense team will try to use their day-to-day triage skills, along with the available tools, to locate and identify the root causes of the issues.
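To make the offense team’s job concrete, here is a small sketch of escalating an experiment and always resetting afterwards. It assumes the “experiment” is simply making a call fail some fraction of the time; `make_flaky`, `run_escalation`, and the failure rates are hypothetical names and values chosen for illustration, not part of any real chaos tool.

```python
import random
from typing import Callable, List, Sequence, Tuple


def make_flaky(func: Callable[[], object], failure_rate: float,
               rng: random.Random) -> Callable[[], object]:
    """Wrap func so it raises ConnectionError roughly failure_rate of the time."""
    def wrapper() -> object:
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func()
    return wrapper


def run_escalation(service: Callable[[], object], rates: Sequence[float],
                   reset: Callable[[], None], calls_per_step: int = 100,
                   seed: int = 0) -> List[Tuple[float, int]]:
    """Step through increasing failure rates, counting injected failures.

    reset() runs in a finally block, so the environment is restored even if
    a step blows up in a way nobody planned for.
    """
    rng = random.Random(seed)
    results: List[Tuple[float, int]] = []
    try:
        for rate in rates:
            flaky = make_flaky(service, rate, rng)
            failures = 0
            for _ in range(calls_per_step):
                try:
                    flaky()
                except ConnectionError:
                    failures += 1
            results.append((rate, failures))
    finally:
        reset()
    return results


# Hypothetical usage: a stand-in service and a reset that only logs.
steps = run_escalation(lambda: "ok", rates=[0.0, 0.1, 0.5],
                       reset=lambda: print("environment reset"))
print(steps)
```

The `finally` block is the important design choice here: getting back to a stable state shouldn’t depend on the experiment going as planned.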

After the experiment is over (and you’ve reset the environment), bring both teams together and review what took place:

  • How was the issue found?
  • Did anyone find it using other means?
  • Did the experiment create the expected failure, or did something else happen?
  • How long did it take for the issue to be found?
  • Where can monitoring be tightened up, and how can we be alerted faster?
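Some of these questions, like how long the issue took to find, are easier to answer if someone jots down timestamps during the event. The sketch below shows one way to turn those notes into numbers; the `ExperimentTimeline` record, the experiment names, and the times are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class ExperimentTimeline:
    """Timestamps noted during one Game Day experiment (illustrative record)."""

    injected_at: datetime  # when the offense team triggered the failure
    detected_at: datetime  # when the defense team first noticed it
    resolved_at: datetime  # when the environment was back to a stable state

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.injected_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.injected_at


def slowest_detection(timelines: Dict[str, ExperimentTimeline]) -> str:
    """Name of the experiment that took longest to notice."""
    return max(timelines, key=lambda name: timelines[name].time_to_detect)


# Made-up example times, not real measurements.
start = datetime(2024, 1, 1, 10, 0)
game_day = {
    "cache-latency": ExperimentTimeline(
        start, start + timedelta(minutes=4), start + timedelta(minutes=30)),
    "queue-backlog": ExperimentTimeline(
        start, start + timedelta(minutes=12), start + timedelta(minutes=25)),
}
print(slowest_detection(game_day))
```

The experiment that took longest to detect is usually a good place to start tightening up monitoring and alerting.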

Running an event like this puts you in the fortunate spot of reviewing something you might not have experienced unless there was a real issue. You’ve purposely made your system fail based on a hypothesis, and you can inspect the details of how your system reacted. This is the start of the mindset shift. From here on, you should try to repeat this with every failure (experiment and outage alike), asking questions and digging deeper.

What’s Next?

The first step to improving resilience is not just admitting you need to improve it. It’s changing how your organization views failures in the first place. When you begin to actively cause failures as a way to investigate your system and find ways to make it more resilient, life will start to improve for your teams.

No matter where you’re doing your research around chaos engineering, you’ll find that the common theme is not to ask if your systems will fail, but rather to ask when and how they will fail. The hope is that after you’ve run a couple of Game Days, your next production outage will be caught by your updated tooling or improved monitoring. That means your developers won’t have to report to the war rooms, and can instead stay home and enjoy pizza with their families.

Resources

There’s a lot of great material out there that can help you generate some excitement for chaos engineering at your organization. Companies like Google, Netflix, and Gremlin have outlined what’s worked for them, and how they structure and plan their events. While every company is different, you can review the artifacts they’ve created for inspiration as you figure out what works for your organization.

Below are some helpful tools you can use to plan and conduct your next chaos event. And if you have any questions, don’t hesitate to let us know.

  • Game Day Agenda and Planning (https://www.gremlin.com/gameday/) — Going in blind is really not advised here. Aim to be as organized and detailed as possible. This will make your chaos event play out much more smoothly, allowing you to focus on improving your resiliency.
  • Incident State (https://landing.google.com/sre/sre-book/chapters/incident-document/) — This living document is the first place someone who is brought in to help with an issue should look. It calls out the point of contact, exit criteria, and tasks. As simple as it is in concept, this document will cut back on added panic and noise around the issue.
  • Post Mortem (https://landing.google.com/sre/sre-book/chapters/postmortem/) — This template is similar to what we’re all used to, only it’s less an airing of grievances and more geared towards a highly-detailed series of documented events, including your incident reports.

Derrick Medina is an eight-year Kenzan veteran with an extensive board game collection. With professional experience in design, devtest, release management, and client solutions in digital transformation, he’s always looking for ways to improve processes.
