I was nervous before going on-call for the first time. It would have been nice to read some accounts of what those high-pressure pages are like.
I have been paged 54 times since 2016, but Thursday was my first page coordinated by our incident response team.
I’m in a small meeting that will last for 30 minutes.
A page arrives ten minutes in.
A Slack app ensures the whole team knows. My phone is demanding attention. I get an email, a mobile app notification, a text, and a phone call.1
I acknowledge the page so it doesn’t escalate to my manager.
Pages have short descriptions. This one says a Jenkins server will not load. Internet and VPN issues also manifest this way, so my first move is to load Jenkins in the browser. If it loads for me, the page can be resolved and we can focus on what is different about the user’s setup.
It is not loading for me either.
Next I jump over to Spinnaker, our deployment system, to check on the server’s health.
It’s reporting healthy.
I want to verify this healthy signal. I get the IP address from Spinnaker and attempt to load it in the browser. It responds, so clearly it's not a problem with this Jenkins server. I try two other Jenkins servers and they aren't loading either. A teammate mentions several email alerts have fired for our other services. Confidence is growing that this is DNS.
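The triage logic here (the server answers by IP, but its hostname won't load) can be sketched as a quick resolution check. This is just an illustration of the reasoning, not a tool we ran; the Jenkins hostname below is hypothetical:

```python
import socket

def resolves(hostname):
    """Return the resolved IP if DNS lookup succeeds, else None."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# If the server is reachable by IP but its hostname fails to resolve,
# the server itself is fine and the fault lies with DNS.
print(resolves("localhost"))                  # local resolver works
print(resolves("jenkins.internal.example"))   # hypothetical name; None if DNS is broken
```

Checking several hostnames this way also distinguishes a single bad record from a resolver-wide outage.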
DNS issues are rare for us. Prior DNS issues have been limited to Jenkins. I attribute this to us running in a region with few other services.
I already know the Slack support channel to use for DNS issues. I create a new thread, happy I have been able to skip the step of figuring out who to ask. The region seems relevant to me, so I include that along with a couple URLs I expect to resolve.
While triaging I’ve been paged two more times. The failure to load has been mentioned in a direct message and four other channels. My goal is to give the on-call some time to respond while sharing the current status with folks.
I link each mention to my support thread.
While I’ve been reaching out, engineers from other teams have added details to my support thread. It’s clear this is more than our one region.
I haven’t heard back, so I look up the paging command and page the service owner.
Shortly after, coordination moves to the main incident response channel.
Incident Response Channel
I’ve never been directly involved in an incident managed in the main channel before. I see someone else share exactly when they were impacted. I mention our first page was at 9:10am, which allows the operators to rule out a 9:22am event.
At this point I’m an interested viewer, watching for opportunities to share any context I have.
A deployment is rolled back. Jenkins is resolving everywhere again. I follow up on all the direct messages and threads I received to let everyone know.
1. My first manager recommended a PagerDuty configuration that would be impossible to miss.↩