We are sincerely sorry for the outage that prevented builds from running late Wednesday and early Thursday. We know you rely on us to deploy, and that downtime is painful for you and your customers. We take our responsibility to you very seriously, and we're sorry we let you down.
Here’s what happened, what we learned, and what actions we're taking to prevent this from happening again:
What We Saw
WEDNESDAY 20:17 UTC: Our operations team identified that our standard tools for managing increasing demand were not having the expected impact. Capacity was available in the system but not being used to run builds, resulting in a rising queue of builds. The issue was escalated to engineering for investigation.
WEDNESDAY 21:02 UTC: Queues continued to rise after initial evaluation and attempts to improve utilization. It became clear that overall system load was preventing dequeueing of builds and the incident was identified on our status page. We turned attention to any recent changes in the platform that could have introduced a significant increase in load.
Increasing DB Load
Our initial efforts focused on changes that could have led to the increased load we were seeing. There was no direct correlation with anything that had just been pushed, but early afternoon PDT on Wednesdays is our busiest time, so we looked back further for anything that might have changed recently and not mattered until such a busy time.
We identified a couple of changes that required additional work but had performed well in evaluation, and we rolled them back. Unfortunately, by that time the queued DB operations were backed up enough that the impact of the changes could not be isolated, and we were left rolling back anything suspicious to try to get things under control.
Killing Queued DB Operations
THURSDAY 00:11 UTC: Once our DB had so many queued operations that it showed no signs of catching up, we failed over to a different primary to kill off those operations and assess the load under the new conditions. Initially this created a bit of headroom, but queued operations returned relatively quickly, albeit at a lower level.
Only Half the Problem
THURSDAY 07:00 UTC: All builds that were marked by the system as runnable were being run in a timely manner. Unfortunately, at this point we thought our work was done, so we published an update saying so and asked anyone still seeing issues to contact us. We got far more responses than we anticipated and dug back in.
THURSDAY 08:00 UTC: It turned out that, between some work we had done to get builds out of the way and some sustained load, we had builds that were blocked from moving into the runnable state, and that now had to be resolved.
Relying on our normal code to manage promotion from one queue to the next led to flooding the queue, so we needed to find a way to batch that task. We started just manually forcing builds through into the runnable state, then quickly built some tools to automate pushing the existing builds through in batches.
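To illustrate the batching idea, here is a minimal sketch in Python. It is not the tooling we built during the incident: the function names, the batch size, and the pause between batches are all hypothetical, and `promote` stands in for whatever call actually moves a single build into the runnable state.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def promote_in_batches(
    build_ids: Sequence[str],
    promote: Callable[[str], None],   # hypothetical: moves one build to runnable
    batch_size: int = 50,             # assumed value, tune to what the queue tolerates
    pause_seconds: float = 1.0,       # assumed value, gives the queue time to drain
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Promote builds a batch at a time instead of all at once.

    Returns the number of builds promoted.
    """
    promoted = 0
    batches = list(batched(build_ids, batch_size))
    for index, batch in enumerate(batches):
        for build_id in batch:
            promote(build_id)
            promoted += 1
        # Pause between batches, but not after the final one.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return promoted
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays. The point of the design is simply that the downstream queue only ever sees `batch_size` new builds at a time.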
THURSDAY 09:00 UTC: We worked to bring up excess capacity to handle the backlog and push builds in as fast as we could. Then we learned that the throttles we had built into our build scheduling code to back off during failure were actually triggering as a result of some standard conditions, causing them to back off when we wanted them to be working double time. First, we had to add new metrics to debug the problem. Then we had to update the throttling code to improve the behavior. At that point we could get back to moving more builds into the runnable state.
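The throttling problem can be sketched roughly as follows. This is an illustration under stated assumptions, not our scheduling code: the class name, the reason strings, and the delay values are hypothetical. It shows the two pieces described above: a per-reason counter that makes it visible why the throttle is firing, and a back-off that responds only to reasons explicitly classified as failures.

```python
from collections import Counter
from typing import Iterable


class SchedulerThrottle:
    """Exponential back-off that ignores expected, non-failure conditions."""

    def __init__(
        self,
        failure_reasons: Iterable[str],  # hypothetical labels for real failures
        base_delay: float = 1.0,         # assumed starting delay in seconds
        max_delay: float = 60.0,         # assumed cap in seconds
    ) -> None:
        self.failure_reasons = frozenset(failure_reasons)
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.consecutive_failures = 0
        # Metric: how often each reason was reported, failure or not.
        self.trigger_counts: Counter = Counter()

    def record(self, reason: str) -> None:
        """Record the outcome of one scheduling attempt."""
        self.trigger_counts[reason] += 1
        if reason in self.failure_reasons:
            self.consecutive_failures += 1
        else:
            # Standard conditions reset the back-off instead of extending it.
            self.consecutive_failures = 0

    def delay(self) -> float:
        """Seconds to wait before the next scheduling attempt."""
        if self.consecutive_failures == 0:
            return 0.0
        backoff = self.base_delay * 2 ** (self.consecutive_failures - 1)
        return min(self.max_delay, backoff)
```

With a counter like `trigger_counts` in place, a throttle that fires mostly on a routine reason shows up in the numbers rather than only as mysteriously slow scheduling.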
THURSDAY 12:00 UTC: As the night wore on in California, we became a bit robotic in our attempts to get the build queue down and dropped the ball on updating our status for three hours. This shouldn't have happened and we're sorry for not keeping up with the flow of updates for those for whom these were working hours.
THURSDAY 14:20 UTC: The last of the leftover builds were run and we were down to processing new inbound traffic. We also finished clearing all the builds that we had manually removed from the runnable state during the outage but that were causing issues with new builds coming through.
What's happening at CircleCI
In July we had a similar issue with increasing load on the DB turning into queued operations to the point of a catastrophic failure. At that time we committed to getting to work on increasing the reliability around that system. We've made significant strides in that direction and we're obviously disappointed that we didn't stay ahead.
Since that outage we've moved to a completely different work distribution model that uses a central scheduler and dispatcher to push builds across the fleet. This has reduced the load on the DB significantly.
We've also moved over 50 percent of the data that was in our primary database onto its own deployment. The data was orthogonal and didn't need to be co-located, but its heavy use was causing contention and fighting for shared resources. So we built out a new DB cluster and did an on-the-fly migration of that dataset.
Finally, we took some smaller parts of the DB with different behaviors and pulled them out separately. Each now has its own custom deployment tuned for its specific access patterns.
All of these changes mean that we've been able to continue growing the business at a pace for which we're extremely grateful. This week's issues occurred at a much higher usage and load than our last incident, but that doesn't change the fact that we're extremely disappointed that it happened at all.
Where We Missed
In our attempts to increase reliability over the last few months, we have been very focused on improving the architecture and infrastructure, which drew our attention away from building tools to deal with a situation like this one.
In this incident, we had the experience to start reducing load on the system right away to pull it back from the catastrophic failure mode, but we didn't have any tools on hand with which to do it. So we were building them on the fly and that led to a combination of lost time and manually edited builds with inconsistent state. In the immediate term, we'll be investing in tools that will allow us to get similar situations under control much faster so they don't spiral into hours without builds.
Getting back to the architecture, we've known for a while that our queue model needs to be rebuilt and we've been laying the foundation to get there with the items outlined above, but we got caught behind on capacity again in this situation. We're pushing ahead as quickly as we can to get that piece pulled out and take the final bit of strain off this database.
Again, we are very sorry for the downtime. We take every customer outage seriously and are putting all our energy and effort into keeping you up and running without service interruption.
If you have any questions or comments, please let us know.