Empowering Users with Build Agent History

November 01, 2023

Support channels are a wonderful place to discover ways to empower users.

The questions folks ask in support reveal the problems they face day-to-day.

What builds ran on this agent?

Jenkins, our build platform, already has build history for an agent out of the box. That sounds simple enough, so why isn’t this already self-service?

Jenkins is a distributed system, and our setup has short-lived components.

Agents are launched and terminated as they autoscale. Builds know what agent they ran on, but limited build history¹ yields incomplete data. Additionally, we have disabled the built-in agent history feature to address performance concerns².

These conditions meant a user could not answer this question without help.

Until now.

Since this has been a repeated support question, we prioritized and built out the feature.

Build Metadata Service

Boost, a build metadata service, resolves all three problems. Each build’s metadata is already recorded by Boost in a permanent datastore.

By adding an agent history endpoint, we enable anyone to look up builds by agent. This lookup works long after the build and the agent are gone.
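To make the idea concrete, here is a minimal sketch of a lookup keyed by agent. It is not Boost's actual implementation: the `BuildRecord` fields, the `AgentHistory` class, and the in-memory dictionary standing in for the permanent datastore are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BuildRecord:
    # Hypothetical fields; the real Boost schema is not described in this post.
    build_id: str
    agent: str
    started: datetime
    finished: datetime
    result: str  # e.g. "SUCCESS" or "FAILURE"


class AgentHistory:
    """In-memory stand-in for a permanent datastore indexed by agent name."""

    def __init__(self):
        self._by_agent = defaultdict(list)

    def record(self, build: BuildRecord) -> None:
        # Called once per build; the record outlives both build and agent.
        self._by_agent[build.agent].append(build)

    def builds_for(self, agent: str) -> list[BuildRecord]:
        # Newest first. Using .get avoids creating entries for unknown agents.
        builds = self._by_agent.get(agent, [])
        return sorted(builds, key=lambda b: b.started, reverse=True)
```

Because the index lives outside Jenkins, a call such as `history.builds_for("agent-a")` would keep working after the agent has been terminated by autoscaling.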

This has been most useful for users and other central teams who ask questions like:

Is there a problem with this agent?
Did this build fail because of a noisy neighbor?
I got an alert from this agent, what was running here?

Problem Detection

Agent history gives operators a quick way to tell if agents are failing all builds.

Our team provides general-purpose build agents, but some teams have custom requirements. Custom agents can connect to our Jenkins controllers to create more specialized build environments. These agents are often long-lived, so they can suffer from build environment pollution³.

Recovery from this situation is manual. An operator may want to take the agent offline until it can be refreshed.
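As a rough sketch of how an operator might use agent history for this check, the function below flags agents whose most recent builds all failed. The input shape (results per agent, oldest first), the `"FAILURE"` string, and the `min_builds` threshold are assumptions for illustration, not part of the service described here.

```python
def failing_agents(results_by_agent: dict[str, list[str]], min_builds: int = 3) -> list[str]:
    """Return agents whose last `min_builds` builds all failed.

    `results_by_agent` maps an agent name to its build results, oldest first.
    Agents with fewer than `min_builds` builds are not flagged.
    """
    flagged = []
    for agent, results in results_by_agent.items():
        recent = results[-min_builds:]
        if len(recent) == min_builds and all(r == "FAILURE" for r in recent):
            flagged.append(agent)
    return sorted(flagged)
```

An agent flagged this way is a candidate for being taken offline until it can be refreshed; the decision itself stays with the operator.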

Noisy Neighbor Investigation

Agent history gives users and support teams the ability to see what other builds were running concurrently.

Most of our agents can run four builds concurrently for efficiency. A build scheduled alongside resource-intensive builds can fail as a result, and these failures appear random.

Since we get back exactly what other builds were running, we can narrow the search for what caused a given build to fail.
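Finding the neighbors of a build comes down to an interval overlap test. The sketch below assumes each record carries start and finish timestamps; the `Build` tuple and `concurrent_builds` function are hypothetical names, not the service's API.

```python
from datetime import datetime
from typing import NamedTuple


class Build(NamedTuple):
    # Hypothetical record shape for builds that ran on a single agent.
    build_id: str
    started: datetime
    finished: datetime


def concurrent_builds(target: Build, agent_builds: list[Build]) -> list[Build]:
    """Return builds on the same agent whose run time overlapped `target`."""
    return [
        b
        for b in agent_builds
        if b.build_id != target.build_id
        # Two intervals overlap when each one starts before the other ends.
        and b.started < target.finished
        and target.started < b.finished
    ]
```

A build that starts at the exact moment the target finishes is not counted as a neighbor under this definition.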

General Investigation

Agent history gives users the ability to know what was running on a build agent in a given time window.

Suppose an odd request comes in from an instance ID that maps to a build agent, or an alert goes off and all we know so far is that it came from a particular agent. Knowing what was running at that moment is key to understanding what happened.
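For an alert tied to a single timestamp, the lookup reduces to asking which builds were running at that instant. This sketch assumes a build that has not finished yet has no finish time; the `Build` tuple and `running_at` function are illustrative names only.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Build(NamedTuple):
    # Hypothetical record shape, not the actual Boost schema.
    build_id: str
    started: datetime
    finished: Optional[datetime]  # None while the build is still running


def running_at(agent_builds: list[Build], when: datetime) -> list[str]:
    """Return IDs of builds that were running on the agent at `when`."""
    return [
        b.build_id
        for b in agent_builds
        if b.started <= when and (b.finished is None or when <= b.finished)
    ]
```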

Many parts of our Jenkins deployment are ephemeral. Builds have limited history to prevent filling up the disk. Agents come and go through autoscaling and regular updates.

By storing this metadata in a dedicated system we empower users to find more complete answers without waiting for our team.

¹ Build artifacts are stored on disk by default, which causes the disk to fill up unless we limit retained history.
² Jenkins performance sometimes suffered when searching for builds across large disks. The heavy disk use would exhaust our IOPS, impacting builds.
³ Build environment pollution is when the environment becomes unusable, usually due to the actions of a previous build.