The What-If Machine Operator's Manual

Leveraging risk aversion at work

Dec 05, 2024

I am risk averse. I wrote another post a few weeks ago about speed-demon type risk takers and I am not one of them; I am not out there flying the fighter jet. If I were to stick with the vehicle metaphor then I’m more like an SUV, mom-style: Everyone’s seatbelts are on. We have agreed on a radio station ahead of time. Pit stops are scheduled and snacks are packed. There is some brief but sharp shushing whenever I execute a highway merge. Following distance is religiously maintained.

High levels of risk aversion can drive the risk takers around us bonkers. This is usually because we spend too much time saying “no.” Risk aversion can be a real gift, though, when applied thoughtfully. Enter the what-if machine:

The What-If Machine (WIM)

The what-if machine is the imaginative little part of you that identifies and makes sense of hypothetical risk. It asks questions like:

What if we run this urgent one-off data migration script mid-day?
- What if it breaks something?
  - What if we roll it back but the rollback script fails?
    - What if the rollback script fails and we don’t notice?
      - What if we don’t notice it for a week? A month?
    - What if the rollback script fails and we notice but we don’t know which database records were impacted?
  - What if…?
- What if the data migration script takes too long to finish running?…

We all have what-if machines but not all what-if machines are created equal. If you are particularly good at identifying risk, if you breathe contingency plans and find yourself glued to your platform’s monitoring dashboard(s) then congratulations, you are a proud owner of the WIM 3100.

The what-if machine’s purpose is not to make decisions; it shouldn’t output text like “no, let’s not do this, it’s too risky”. Its purpose is to explore all the risks inherent to a given what-if scenario and how we might navigate them so that the team can make an educated decision on how best to move forward.

It does this by traversing the what-if tree. The nodes are what-if scenarios and each one comes with a host of risks.

The WIM performs a special transformation on the nodes of the tree. Inputs are everything you know about the current what-if scenario under inspection. Outputs are a list of associated risks (the bad outcomes), probability of occurrence, severity of impact and a list of mitigations. Mitigations are things that make the bad outcomes less likely or less severe. This is where the magic happens. Mitigations are the secret sauce for pushing innovation forward through the risks that you have found.
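
For the code-minded, here is a minimal sketch of that tree and the per-node output, borrowing the 3rd party API scenario from the examples further down. The names `Risk`, `WhatIf` and `walk` are hypothetical, invented purely for illustration, and the low/medium/high labels are just one way to record probability and severity.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Risk:
    description: str   # the bad outcome
    probability: str   # "low", "medium" or "high" (an assumed scale)
    severity: str      # "low", "medium" or "high" (an assumed scale)
    mitigations: list[str] = field(default_factory=list)


@dataclass
class WhatIf:
    question: str
    risks: list[Risk] = field(default_factory=list)
    children: list["WhatIf"] = field(default_factory=list)


def walk(node: WhatIf, depth: int = 0) -> Iterator[tuple[int, WhatIf]]:
    """Depth-first traversal of the what-if tree, yielding (depth, node)."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


tree = WhatIf(
    "What if we add a new dependency on a 3rd party API from our product?",
    children=[
        WhatIf(
            "What if the 3rd party API goes down or becomes unbearably slow?",
            risks=[
                Risk(
                    description="The feature which depends on that API will be broken.",
                    probability="medium",
                    severity="medium",
                    mitigations=[
                        "Add automated monitoring and alerting.",
                        "Tweak the feature such that it fails gracefully.",
                    ],
                )
            ],
        ),
        # Risks for this node are left out of the sketch for brevity.
        WhatIf("What if the 3rd party API introduces a non-backward compatible change?"),
    ],
)

for depth, node in walk(tree):
    print("  " * depth + "- " + node.question)
    for risk in node.risks:
        print("  " * (depth + 1) + f"Risk ({risk.probability}/{risk.severity}): {risk.description}")
```

Note that nothing in the structure holds a "yes" or "no". The decision stays with the team.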

Examples of the WIM in action

Time to take a break from Fun with Metaphor and switch over to some concrete examples. Here are a couple of what-if scenarios that come up pretty often. For each one I’ve gone and filled in some risks and their mitigations. Can you think of a few more?

What if we roll out the new feature that you implemented solo?
- What if the on-call engineer gets paged during the night because the feature is broken and they don’t know what to do?
  - Risk: The feature will be broken until you log on.
  - Probability: High. New features tend to have a bumpy first few weeks and you were the sole engineer on this project. Oopsie.
  - Severity: High. This feature is considered to be critical path and we cannot afford for it to fail.
  - Mitigation: Instruct the team to page you directly in this scenario (ouch).
  - Mitigation: Write up an on-call playbook and link to it from your automated alerting as well as your central team wiki.
  - Mitigation: Host a team walkthrough of the feature.
  - Mitigation: Assign tickets associated with the feature to multiple folks on the team in order to increase the bus factor.
What if we add a new dependency on a 3rd party API from our product?
- What if the 3rd party API goes down or becomes unbearably slow?
  - Risk: The feature which depends on that API will be broken.
  - Probability: Medium. The 3rd party API’s listed downtime is very low and it hasn’t had any incidents in the past year. Its published p95 turnaround time for requests is favorable. After some experimentation, though, you notice that its p99 is pretty high and you plan to hit the API with a lot of requests.
  - Severity: Medium. The feature won’t work but it doesn’t represent critical functionality.
  - Mitigation: Add automated monitoring and alerting so that we catch this event asap.
  - Mitigation: Tweak the feature such that it fails gracefully. It shouldn’t crash the application.
  - Mitigation: Add some responsible retry logic against the API.
  - Mitigation: Leverage caching where appropriate.
  - Mitigation: Insist on a tight and detailed SLA in your contract with the 3rd party.
- What if the 3rd party API introduces a backward-incompatible change?
  - Risk: The feature which depends on that API will be broken.
  - Probability: Low. This API is versioned and has been very stable for the past year and change.
  - Severity: Medium. The feature won’t work but it doesn’t represent critical functionality.
  - Mitigation: Add automated monitoring and alerting so that we catch this scenario asap.
  - Mitigation: Target the most recent stable version of the API.
  - Mitigation: Subscribe to their news feed and release notes so that we learn about upcoming changes early on, giving us time to prepare.
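
Two of the mitigations above, "responsible retry logic" and "fails gracefully", fit in a few lines of code. What follows is a rough sketch rather than a recommendation: `call_with_retry`, its parameters and the `flaky` stand-in for the 3rd party API are all hypothetical, and the backoff numbers are arbitrary. Real code would catch narrower exceptions than `Exception`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fetch: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try fetch() up to `attempts` times, then return `fallback` instead of crashing."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:  # assumption: any error is retryable; narrow this in real code
            if attempt == attempts - 1:
                break
            # Exponential backoff plus jitter, so retries do not pile onto a struggling API.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return fallback


# A hypothetical stand-in for the 3rd party API: fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("3rd party API unavailable")
    return {"status": "ok"}


# sleep is stubbed out so the demo runs instantly.
result = call_with_retry(flaky, fallback={"status": "degraded"}, sleep=lambda _: None)
print(result, "after", calls["count"], "calls")
```

The `fallback` value is the graceful part: when the API stays down the feature shows something degraded rather than taking the application with it.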

Note that the WIM 3100 never told us not to release the exciting new feature or add the helpful 3rd party dependency. It told us what to worry about and prompted us to take action so that we could make big things happen.

Troubleshooting the what-if machine

Sometimes the WIM seems to break down or run amok. Maybe you just can’t get past a particular what-if. Maybe every risk you identify feels like a non-starter. The WIM is glitching and you must fix it. Here are a few troubleshooting strategies:

Noise canceling headphones

When the WIM gets too loud you may be tempted to just tune it out. This is my least favorite option because it ignores whatever the underlying problem is and because you no longer get any benefit from your WIM. Don’t ignore risk. Headphones may be helpful for a short time so that you can focus, but only if you pursue one of the following in parallel:

Calibration

Is your WIM outputting questionable things like “everything is too risky!” or “there’s no way to mitigate this”? Talk to your teammates live. Walk each other through the what-if tree. Perhaps you misunderstood the severity of a particular risk or missed some key detail related to the what-if scenario itself. A few chats putting things into perspective may quickly result in more helpful output. Do this even when your WIM appears to be operating well. Two WIMs are typically better than one.

Pruning the what-if tree

Don’t spend too much time chasing unlikely scenarios or risks. Yes, you may be able to dream up half a dozen risks associated with a given what-if scenario but perhaps only two of those risks are likely or severe. Focus your attention there and then move on to the rest of the tree.
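
If it helps to make pruning mechanical, here is one possible sketch. It assumes a crude score of probability times severity on a 1 to 3 scale and an arbitrary cutoff; neither is a standard, and `prune`, `LEVELS` and the threshold are hypothetical names and numbers. The three risks are the ones from the examples earlier in this post.

```python
# Assumed numeric scale for the low/medium/high labels.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def prune(risks: list[dict], threshold: int = 4) -> list[dict]:
    """Keep risks whose probability x severity score meets the threshold, highest first."""
    scored = [(LEVELS[r["probability"]] * LEVELS[r["severity"]], r) for r in risks]
    kept = [(score, r) for score, r in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]


risks = [
    {"name": "3rd party API ships a breaking change", "probability": "low", "severity": "medium"},
    {"name": "on-call engineer is paged and stuck", "probability": "high", "severity": "high"},
    {"name": "3rd party API goes down or slows", "probability": "medium", "severity": "medium"},
]

for risk in prune(risks):
    print(risk["name"])
```

A score like this is only a conversation starter. A rare but catastrophic risk can score low and still deserve a second look, which is one more reason to calibrate with teammates.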

Experimentation

Sometimes the WIM gets stuck because we don’t have enough information. Sometimes it spews questionable output because the inputs we gave it are wrong. Either way, we can often get the WIM back on track by delivering a firm and loving thump to the chassis via an experiment.

I was introduced to the value of experimentation by another WIM 3100 operator. Their team had been stuck trying to determine whether they were developing the right user experience for their product. “What if we design the wrong solution?” they asked. “What if no one uses this?” Instead of relying too heavily on user research sessions and prior analysis, they designed the smallest experiment they could and put it in front of a subset of users with all kinds of monitoring and data collection ready to go. The team’s what-if questions were answered several times faster than they otherwise would have been, allowing them to move forward with development.

Later on another member of my team took experimentation to a whole new level. I ran a small developer platform org at that point and this person wanted to know what developer workflow improvements were actually worthwhile. “What if this problem isn’t worth solving?” we worried. The problem in question was infrastructure setup. Instead of creating a whole new automated dev workflow they earmarked a certain amount of sprint bandwidth for manually performing infrastructure setup tasks on behalf of our client teams. They turned the team into a Mechanical Turk. Along the way they naturally automated much of the process for the sake of their own productivity. The process became so lightweight that they could have continued to provide this service on behalf of their clients more or less indefinitely.

Closing thoughts

Don’t apologize for risk aversion. The what-if machine can be an amazing asset at work. It allows you to see risks that are invisible to others so that you can team up to identify mitigations. All of this effort upfront helps the team move forward with their chosen path because they can see the hostile terrain that surrounds them. Used in this way the WIM speeds up innovation instead of slowing it down.

Nuts and Bolts
