Crawling out of the Incident pit of despair
What to do when your manager asks "how can I help?"
Incident Management is a big topic within SRE (Site Reliability Engineering). There are all kinds of great writing out there about relevant best practices and the processes that you should set up for your org. Maybe your team has even implemented a few of them. Perhaps you know all about deducing impact, assessing severity, and keeping your stakeholders informed.
Sometimes, though, you’re way far down the rabbit hole before you[1] even realize that you are, in fact, mid-incident. You saw a weird little wobble in that one graph on your monitoring dashboard and started pulling on threads. Before you know it, you’ve been heads down for over an hour. By now, of course, various alarms are blaring and you are getting pinged left and right, but you can’t type out a reply because your fingers are tied up playing a desperate game of whack-a-mole with massive log files that appear to have been breeding in the root partition since yesterday at 20:52 UTC[2].
This is the moment when your manager hits you with the dreaded “How can I help?” line.
What does it mean? It could be insincere; there are certainly some lousy managers out there. In my experience, though, the offer tends to be genuine. My manager (or sometimes, uh, my manager’s manager) legitimately wanted to help me resolve the incident sooner.
The good news is that you don’t actually need to be certain that the offer is sincere in order to treat it as such. And you should treat it as such, because chances are you really do need the help, if your stress level and the depth of that hole are anything to go by.
But then what do you say? How do you handle this wide-open offer of assistance when you are so deep into the context of your work that you can just barely form three-word sentences?
That is what today’s post is all about: understanding what that offer of help really means, identifying what you need, and then communicating those needs to your manager while continuing to put out the fire.
What your manager actually meant just now
Even when it is sincere, there’s usually more going on than just a blanket offer of assistance. Here are a few of the things you might find when reading between the lines:
“I suspect you are super stressed out right now and I want you to know that you are not alone. I really am here to help you.” These managers are sweethearts.
“What can I tell our stakeholders?” Your manager is probably getting yelled at by the same people who are burying you in pings.
“What is the impact?” They want to know how bad it is, not just so they can do damage control, but so that they can tell you whether to keep working at your current pace or slow down and regroup.
“What are you doing right now?” They do need to know that you are, in fact, working on the problem. They also need to know what you are doing to resolve it so that they can help you course correct if you are looking in the wrong place.
“Are you blocked? Do you know what you need to do next?” They want to know if you are stuck, desperately reading tea leaves, or randomly trying things in the hopes that you’ll get lucky.
“Who else do we need to pull in?” Have you paged the infra team? Did anyone get back to you? Did their scheduled on-caller pick up the page but they don’t know this stuff very well and you actually need Becky, please?
“What tasks can be parallelized so we can resolve this faster?” Will throwing more people at this thing actually help right now or just make your job that much harder?
Sometimes your manager will actually follow up their first message with some of the questions I listed above. This adds clarity, of course, but if you are like me, you may find it a little overwhelming all the same, and it may even feel a little accusatory regardless of how it was intended.
So instead of waiting for them to inadvertently smash your stress button to pieces, you can be prepared with your short list of requests, which we’ll talk about in just a bit.
But first! Give them the context they seek
It’s hard to stop and type out answers when your focus is split with an urgent task. Let your manager see what you are doing. Invite them to a video call with screen share if they aren’t physically standing next to you. A picture plus a few distracted mutterings is worth a thousand words here. If they are hands-on, then they may be able to pick up on what needs to be done and just start doing some of it on their own. If not, they will at least know without a doubt that you are working hard on this, and then hopefully pull in someone who can pair with you effectively.
The short list
You are basically going to ask your manager to handle much of the incident management stuff that you would have been doing if you’d realized that this was, in fact, an incident before you got sucked all the way in. If you have an actual incident management process on your team then this is fairly straightforward: ask your manager to kick things off and wear the incident commander hat short term while you are busy putting out the fire.
If you don’t have one of those, though, then this is a good place to start:
Run Interference. I’ve used the actual phrase “Can you please run interference for me” with multiple managers and they have all jumped right in. This mostly means fielding questions from stakeholders, providing proactive status updates, and handling escalations so that you can focus on putting out the fire.
Secure Resources. Sometimes “resources” means hardware but it usually means people. Ideally you can page a team’s on-call engineer and get someone who knows what they are doing. If not, though, go ahead and ask for the specific person you need and then be very direct about what they will do for you. You can always reflect on why the on-call engineer wasn’t that person later at the postmortem.
Admin and Various Cat Herding Tasks. This might include setting up a war room, kicking off a dedicated Slack/Teams channel, scheduling touch bases and maybe even cooking up a spreadsheet or two.
When you finally hit a slow stretch:
Take a deep breath, lean back in your chair, close your eyes for a moment and exhale.
Jot down what you did and what remains to be done, including any investigations that will allow you to determine the full extent of the impact (e.g. broken downstream data, a list of impacted clients, etc.).
Use the above to compose a quick status update. Make sure to include:
A description of the incident itself. What happened, why it’s a problem, and who/what is impacted.
Current status. Is the incident resolved for now or are we in the eye of the storm? Is the impacted system up? Down? Slow? Intermittently available?
Immediate next steps.
Take quick stock of the list you made in the first step to determine what needs to be done by you (maximum context tasks) and what can be picked up by someone else. This will allow you to distribute the rest of the incident work. Assign all tasks; there should be zero ambiguity here as to who is doing what.
Note down all open questions paying attention to what needs to be answered now vs later at the postmortem.
Determine with your manager whether you are now in a place to assume incident commander responsibilities or if they need to keep wearing that hat.
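If it helps to have a template for that status update, here is a minimal sketch in Python. Everything in it is hypothetical (the `Task` and `StatusUpdate` names, the field names, the output layout); it only mirrors the three parts listed above and flags any task that nobody owns yet, so that there is zero ambiguity about who is doing what.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    owner: str = ""  # empty means nobody has picked this up yet


@dataclass
class StatusUpdate:
    # All names here are illustrative, not part of any real incident tool.
    summary: str        # what happened, why it's a problem, who/what is impacted
    current_state: str  # e.g. "up", "down", "slow", "intermittent"
    next_steps: list[Task] = field(default_factory=list)

    def unassigned(self) -> list[Task]:
        """Return the tasks that still have no owner."""
        return [t for t in self.next_steps if not t.owner.strip()]

    def render(self) -> str:
        """Format the update as plain text, ready to paste into a channel."""
        lines = [
            f"Incident: {self.summary}",
            f"Current status: {self.current_state}",
            "Immediate next steps:",
        ]
        for task in self.next_steps:
            owner = task.owner.strip() or "UNASSIGNED"
            lines.append(f"- {task.description} (owner: {owner})")
        return "\n".join(lines)


if __name__ == "__main__":
    update = StatusUpdate(
        summary="Log files are filling the root partition; the API is failing intermittently.",
        current_state="intermittent",
        next_steps=[
            Task("Clear and rotate the oversized logs", owner="me"),
            Task("List impacted clients"),
        ],
    )
    print(update.render())
    for task in update.unassigned():
        print(f"Still needs an owner: {task.description}")
```

Anything that shows up as unassigned is a good candidate to hand to your manager or to whoever they pull in.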
After the incident
Provide a final update, then stand up, stretch, and walk away from your desk for a few minutes. Keep your phone on you and don’t go too far, though, just in case your incident rises from the dead. When you return, schedule the postmortem and include all key players from the incident. Prep the postmortem doc with a timeline of events. Have your list of questions from the incident ready to go for the postmortem itself.
Closing thoughts
Experienced incident responders may note that the process I explain here is far from ideal, and they’d be right. Ideally you have a strong, well-established incident management process which you faithfully execute every time. Ideally your manager does not have to ask you what you need because you have provided visibility into what you are doing throughout. Ideally you did not get stuck playing log file whack-a-mole for an hour without providing your stakeholders with an update.
But we are messy creatures who sometimes do messy things. It’s going to happen! Perhaps you are a junior engineer and this is your first incident. Or maybe you spent the last five years at a really mature organization with a well-defined incident management process, but this is your first startup and no one has ever heard the term “Time To Resolution” before. Or maybe you miscalculated, figured you could fix this thing in less time than it would take to report it, and are now well and truly stuck at the bottom of the hole.
It will be ok. Assume your manager is on your side, pull them in, and show them how to help you dig back out.
[1] And by “you” I mean “I”. Hopefully that’s always clear.
[2] Nothing changed. There were no app releases. All the symlinks are exactly where you left them. Config files have been untouched for the past month. Logs just go to root now and also they are very much larger than they used to be because it turns out that they really, really like it there. Also because you are the on-call engineer and they like you, too.