Niagara's Big Red Button

November 08, 2023

Niagara is a massively parallel build system, jokingly referred to as denial-as-a-service.

It was the first system I worked on that needed a big red button.

What is a Big Red Button?

Google recently posted Lessons Learned from Twenty Years of Site Reliability Engineering.

Lesson #4 is to have a big red button.

A “Big Red Button” is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state to (ideally) shut down whatever’s happening.

The Stage

At the time, Git repositories were served by a single node [1].

Single Node Git Host

Traffic included users, Jenkins builds, and now a new massively parallel build system.

As Niagara’s workload ramped up, the source control system could be overwhelmed. This could mean anything from long response times to not responding at all.

We took this seriously because it blocked users from committing and builds from running. Each incident led to changes to the system:

  • Observability improved
  • Layers of throttling emerged
  • Large runs could be cancelled in batches
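One common form of throttling is a cap on how many builds may talk to the Git host at once. The sketch below is only an illustration of that idea, not how Niagara did it: `GitThrottle`, `slot`, and the limit values are hypothetical names and numbers.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, Optional


class GitThrottle:
    """Hypothetical concurrency cap for operations that hit the Git host."""

    def __init__(self, max_concurrent: int) -> None:
        # Each in-flight Git operation holds one slot.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self, timeout: Optional[float] = None) -> Iterator[None]:
        # Wait for a free slot; give up after `timeout` seconds if one is set.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no Git slot available")
        try:
            yield
        finally:
            # Always hand the slot back, even if the Git operation failed.
            self._slots.release()
```

A build would wrap its clone or fetch in `with throttle.slot(timeout=30): ...`, so a surge of builds queues up on the build side instead of piling onto the Git host.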

Solid improvements, but they still required someone who knew the system well.

Turning the system off needed to be easy.

Simple Shutdowns

By request, I created a button that could be used by anyone at the first sign of trouble.

Our big red button came in two flavors.

  1. Hold launching new containers - anything already running could complete
  2. Cancel everything running
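The two flavors can be sketched as a small state machine. This is a minimal illustration under my own assumptions, not Niagara's actual code: `Mode`, `Scheduler`, `press`, and the cancel callbacks are all hypothetical names.

```python
from enum import Enum
from typing import Callable, Dict


class Mode(Enum):
    RUNNING = "running"  # normal operation
    HOLD = "hold"        # flavor 1: stop launching, let running work finish
    CANCEL = "cancel"    # flavor 2: stop launching and cancel running work


class Scheduler:
    """Hypothetical stand-in for the component that launches containers."""

    def __init__(self) -> None:
        self.mode = Mode.RUNNING
        # container id -> callback that cancels that container
        self.running: Dict[str, Callable[[], None]] = {}

    def launch(self, container_id: str, cancel: Callable[[], None]) -> bool:
        # Refuse new launches whenever the button has been pressed.
        if self.mode is not Mode.RUNNING:
            return False
        self.running[container_id] = cancel
        return True

    def finish(self, container_id: str) -> None:
        # Called when a container completes on its own.
        self.running.pop(container_id, None)

    def press(self, mode: Mode) -> int:
        """Press the button; returns how many containers were cancelled."""
        self.mode = mode
        cancelled = 0
        if mode is Mode.CANCEL:
            for container_id, cancel in list(self.running.items()):
                cancel()
                del self.running[container_id]
                cancelled += 1
        return cancelled

    def release(self) -> None:
        # Resume normal launching once the trouble has passed.
        self.mode = Mode.RUNNING
```

Calling `press(Mode.HOLD)` leaves running containers alone and only blocks new launches, while `press(Mode.CANCEL)` does both. Keeping the two actions behind one entry point means whoever presses the button only has to choose a flavor.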

The simplest way to hand over control was to wire it up to a Jenkins job. The job would notify whenever it ran, so we could all be on the same page.

Internally we referred to the big red button as a panic button or Andon cord.

  1. This setup has since evolved to multiple nodes.


Written by @sghill, who works on build, automated change, and continuous integration systems.