Linux build queue backing up

Incident Report for CircleCI

Postmortem

CircleCI is a platform for continuous integration and continuous delivery. We take care of all the low-level details so that you have the simplest, fastest continuous integration and deployment possible.

We are sincerely sorry for the outage that prevented builds from running late Wednesday and early Thursday. We know you rely on us to deploy, and that downtime is painful for you and your customers. We take our responsibility to you very seriously, and we're sorry we let you down.

Here’s what happened, what we learned, and what actions we're taking to prevent this from happening again:

What We Saw

WEDNESDAY 20:17 UTC: Our operations team identified that our standard tools for managing increasing demand were not having the expected impact. Capacity was available in the system but not being used to run builds, resulting in a rising queue of builds. The issue was escalated to engineering for investigation.

WEDNESDAY 21:02 UTC: Queues continued to rise after initial evaluation and attempts to improve utilization. It became clear that overall system load was preventing builds from being dequeued, and we posted the incident on our status page. We then turned our attention to recent changes in the platform that could have introduced a significant increase in load.

Increasing DB Load

Our initial efforts focused on finding changes that could have caused the increased load we were seeing. There was no direct correlation with anything that had just been pushed, but early afternoon PDT on Wednesdays is our busiest time, so we looked back further for anything that might have changed recently and not mattered until such a busy time.

We identified a couple of changes that required additional work but had performed well in evaluation, and we rolled them back. Unfortunately, by that time the queued DB operations were backed up enough that the impact of the changes could not be isolated, and we were left to roll back anything suspicious to try to get things under control.

Killing Queued DB Operations

THURSDAY 00:11 UTC: Once the DB had so many queued operations that it showed no signs of catching up, we failed over to a different primary to kill off those operations and assess the load under the new conditions. Initially this created a bit of headroom, but queued operations returned relatively quickly, albeit at a lower level.

Only Half the Problem

THURSDAY 07:00 UTC: All builds that were marked by the system as runnable were being run in a timely manner. Unfortunately, at this point we thought our work was done. We published an update saying so and asked anyone still seeing issues to contact us. We received far more responses than we anticipated and dug back in.

THURSDAY 08:00 UTC: It turned out that, between some work we had done to get builds out of the way and some sustained load, we had builds that were blocked from moving into the runnable state, and that now had to be resolved.

Relying on our normal code to manage promotion from one queue to the next led to flooding the queue, so we needed to find a way to batch that task. We started just manually forcing builds through into the runnable state, then quickly built some tools to automate pushing the existing builds through in batches.
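The batching idea described above can be sketched roughly as follows. This is a minimal illustration and not the tooling we actually built: the function names, the `promote` and `queue_depth` callbacks, and the batch size and queue depth limits are all hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def promote_in_batches(
    build_ids: Iterable[str],
    promote: Callable[[List[str]], None],   # hypothetical: moves builds to runnable
    queue_depth: Callable[[], int],         # hypothetical: current runnable queue size
    batch_size: int = 50,                   # assumed value, for illustration only
    max_queue_depth: int = 200,             # assumed value, for illustration only
    wait: Callable[[], None] = lambda: None,
) -> int:
    """Promote builds one batch at a time, pausing while the runnable queue is deep."""
    promoted = 0
    for batch in batched(build_ids, batch_size):
        # Hold the next batch until the runnable queue has drained below the limit,
        # so promotion never floods the queue all at once.
        while queue_depth() >= max_queue_depth:
            wait()
        promote(batch)
        promoted += len(batch)
    return promoted
```

In practice `wait` would sleep for a short interval and `queue_depth` would query the scheduler; both are left as callbacks here so the sketch stays independent of any particular system.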

New Bottleneck

THURSDAY 09:00 UTC: We worked to bring up excess capacity to handle the backlog and push builds in as fast as we could. Then we learned that the throttles we had built into our build scheduling code to back off during failure were actually triggering as a result of some standard conditions, causing the scheduler to back off when we wanted it to be working double time. First, we had to add new metrics to debug the problem. Then we had to update the throttling code to improve the behavior. At that point we could get back to moving more builds into the runnable state.
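To illustrate the kind of throttle behavior described above, here is a minimal sketch of a backoff that counts every outcome as a metric but only increases its delay on real failures. It is not our actual scheduling code: the class, the outcome names (`db_timeout`, `dispatch_error`, `no_capacity`), and the delay values are hypothetical, since the specific conditions involved are not detailed here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SchedulerThrottle:
    base_delay: float = 1.0    # assumed starting delay, in seconds
    max_delay: float = 60.0    # assumed upper bound, in seconds
    # Hypothetical outcome names that count as genuine failures.
    failure_kinds: frozenset = frozenset({"db_timeout", "dispatch_error"})
    delay: float = 0.0
    metrics: Counter = field(default_factory=Counter)

    def record(self, outcome: str) -> float:
        """Record one scheduling outcome and return the delay before the next attempt."""
        # Count every outcome so the trigger for any backoff is visible in metrics.
        self.metrics[outcome] += 1
        if outcome in self.failure_kinds:
            # Exponential backoff, applied only to real failures.
            self.delay = min(self.max_delay, max(self.base_delay, self.delay * 2))
        elif outcome == "ok":
            self.delay = 0.0
        # Any other outcome (for example "no_capacity") is treated as a standard
        # condition: it is counted but leaves the delay unchanged.
        return self.delay
```

Separating "count it" from "back off because of it" is the design point: the per-outcome counters make it possible to see which condition is driving the throttle before changing how the throttle reacts.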

Too Quiet

THURSDAY 12:00 UTC: As the night wore on in California, we became a bit robotic in our attempts to get the build queue down and dropped the ball on updating our status for three hours. This shouldn't have happened and we're sorry for not keeping up with the flow of updates for those for whom these were working hours.

Finally

THURSDAY 14:20 UTC: The last of the leftover builds were run and we were down to processing new inbound traffic. We also finished clearing all the builds that we had manually removed from the runnable state during the outage but that were causing issues with new builds coming through.

What's happening at CircleCI

In July we had a similar issue, with increasing load on the DB turning into queued operations to the point of catastrophic failure. At that time we committed to increasing the reliability of that system. We've made significant strides in that direction, and we're obviously disappointed that we didn't stay ahead of the load.

Since that outage we've moved to a completely different work distribution model that uses a central scheduler and dispatcher to push builds across the fleet. This has reduced the load on the DB significantly.

We've also moved over 50 percent of the data that was in our primary database onto its own deployment. The data was orthogonal and didn't need to be co-located, but its heavy use was causing contention and fighting for shared resources. So we built out a new DB cluster and did an on-the-fly migration of that dataset.

Finally, we took some smaller parts of the DB with different behaviors and pulled them out separately. Each now has its own custom deployment tuned for its specific access patterns.

All of these changes mean that we've been able to continue growing the business at a pace for which we're extremely grateful. This week's issues occurred at a much higher usage and load than our last incident, but that doesn't change the fact that we're extremely disappointed that it happened at all.

Where We Missed

In our attempts to increase reliability over the last few months, we have been very focused on improving the architecture and infrastructure, which drew our attention away from building tools to deal with a situation like this one.

In this incident, we had the experience to start reducing load on the system right away to pull it back from the catastrophic failure mode, but we didn't have any tools on hand with which to do it. So we were building them on the fly and that led to a combination of lost time and manually edited builds with inconsistent state. In the immediate term, we'll be investing in tools that will allow us to get similar situations under control much faster so they don't spiral into hours without builds.

Getting back to the architecture, we've known for a while that our queue model needs to be rebuilt and we've been laying the foundation to get there with the items outlined above, but we got caught behind on capacity again in this situation. We're pushing ahead as quickly as we can to get that piece pulled out and take the final bit of strain off this database.

Conclusion

Again, we are very sorry for the downtime. We take every customer outage seriously and are putting all our energy and effort into keeping you up and running without service interruption.

If you have any questions or comments, please let us know.

Posted Oct 16, 2015 - 23:38 UTC

Resolved

We’re continuing to see builds running normally. We will be monitoring closely throughout the day. Please contact us in support if you’re seeing issues with your builds.

Posted Oct 15, 2015 - 15:50 UTC

Update

Builds with broken state from outage have been cleaned out and new builds should be processing without delay. Let us know in support if you're still seeing issues.

Posted Oct 15, 2015 - 14:18 UTC

Update

We've cleaned up the majority of remnant builds from the initial outage and are pushing all remaining builds through as fast as we can, regardless of plan. Will update again in ~30 mins.

Posted Oct 15, 2015 - 13:37 UTC

Update

We continue to work on getting backed up builds through and will be removing some old builds in the interest of getting new builds through in a more timely manner.

Posted Oct 15, 2015 - 12:07 UTC

Update

Seeing many customers with affected builds, adding more capacity to get those builds through as they get corrected.

Posted Oct 15, 2015 - 09:06 UTC

Update

We're still seeing customers with queued builds due to leftover bad state. Working through those now.

Posted Oct 15, 2015 - 08:06 UTC

Update

The site and builds are operating normally: builds are moving and the queue is empty. We’re focusing our efforts on extracting root cause, and will be continuing to do some (non-disruptive) maintenance to that end. We're suspending regular updates, and will update only with new information as we closely monitor the build queue throughout the night. We appreciate your patience, and we’ll have a post-mortem soon.

Posted Oct 15, 2015 - 06:55 UTC

Update

We've cleared the queue and things are moving. We're seeing another slowdown and are working on it right now. We'll update again in ~30 min.

Posted Oct 15, 2015 - 06:24 UTC

Update

We're continuing to work on clearing the queue while monitoring very closely. Will update again in ~45 min.

Posted Oct 15, 2015 - 05:09 UTC

Update

We've cleared a good portion of the queue and we're bringing on extra capacity to get through the rest. We are continuing to monitor closely and work on root cause analysis. We will update again in 30 min.

Posted Oct 15, 2015 - 04:38 UTC

Monitoring

We’re cautiously optimistic and we’re very slowly starting to let builds through to work on clearing the queue and monitoring closely. We will update again in 30 min.

Posted Oct 15, 2015 - 04:09 UTC

Update

We think we’ve narrowed it down to a couple of potential causes. We’re working on validating them. Will update again in ~45 min.

Posted Oct 15, 2015 - 03:26 UTC

Update

Our position is much the same. Each path we trace that doesn’t lead us to the source at least gives us direction as to what it’s not, which is ultimately helpful. No one wants this solved more than we do, and we’re continuing to do everything we can to get the service back up. Expect an update in ~45 min if not sooner.

Posted Oct 15, 2015 - 02:35 UTC

Update

We’re digging back through logs to continue to try to isolate the source of the load and are continuing to hold new builds. We’ll update as soon as we know more (~30 min).

Posted Oct 15, 2015 - 01:58 UTC

Update

We didn't get the impact we were hoping for so we’re working on isolating the source of new load. We’re holding off on running builds a little bit longer, but are continuing to devote all our resources to getting this moving asap. Expect another update in 30 min.

Posted Oct 15, 2015 - 01:24 UTC

Update

We’ve canceled all builds and aren’t currently running any. We are monitoring recovery before starting to push new builds. We know you rely on us, and we’ve still got all hands on deck working on getting this fixed asap. Will update again in 30 minutes.

Posted Oct 15, 2015 - 00:28 UTC

Update

Load is moving in the right direction but not pushing new builds through yet. Update in another 30.

Posted Oct 14, 2015 - 23:50 UTC

Update

Halting some processes to get load under control before starting back up. Next update in < 30 mins as we determine impact.

Posted Oct 14, 2015 - 23:17 UTC

Update

We've narrowed down to one system, but haven't identified source of additional load. Have pulled all resources to help accelerate getting build throughput back up. Update again in 30 mins.

Posted Oct 14, 2015 - 23:02 UTC

Update

The site is still down. We are still working on bringing everything up and restoring the site's availability.

Posted Oct 14, 2015 - 22:44 UTC

Update

The site is down. We are working on bringing in backup systems and restoring the site’s availability.

Posted Oct 14, 2015 - 22:10 UTC

Update

We are still looking into the possible root causes for the issue. We’ll update once we have more information.

Posted Oct 14, 2015 - 21:59 UTC

Update

Continuing to investigate. Narrowed to a couple of central systems but no cause identified yet. Update again within 1hr.

Posted Oct 14, 2015 - 21:38 UTC

Investigating

We're seeing a backlog of Linux builds and are currently investigating. Update within 30 mins.

Posted Oct 14, 2015 - 21:02 UTC