CircleCI is a platform for continuous delivery. This means (among other things) we’re building serious distributed systems: hundreds of servers managing thousands of containers, coordinating between all the moving parts, and taking care of all the low-level details so that you have the simplest, fastest continuous integration and deployment possible.
Last Tuesday, we experienced a severe and lengthy outage, during which our build queue was at a complete standstill. The entire company scrambled into firefighting mode to get the queue unlocked and customer builds moving again. Here's what happened.
How we discovered the problem
Our first indication of a problem was that we stopped receiving push hooks from GitHub. Over the past few years, this has been an irregular but not unusual occurrence, and we know how to respond. We typically alert customers, make sure that we don't scale down our machines, and wait for hooks to resume.
After an hour or so, GitHub's hooks did come back, but at an intensity we had not experienced before. This was partly because we had so many more customers than the last time this occurred, and partly because it was the middle of the West Coast daytime, our busiest period of the day.
As soon as builds arrived, we saw the queue backing up, a normal occurrence because so many pushes arrive at once. So far, everything looked totally normal.
After a few minutes, we noticed that although our queue length was rising, our build capacity was not being fully utilized. Instead of dequeueing many builds per second, we were dequeueing on the order of one build per minute.
Our build queue is not a simple queue, but must take into account customer plans and container allocation in a complex platform. As such, it's built on top of our main database. As a result of recent growth we have been working on new build scheduling infrastructure, but it is not fully deployed.
When GitHub's hooks started up again, we saw a sustained surge with an arrival rate several multiples of our normal peak. The rapid insertion of new builds and the associated writes to the DB started to slow down other processing.
The degradation in DB performance was sharply non-linear, going from "everything is fine" to "fully unresponsive" within 2 minutes. Symptoms included a long list of queued builds and each query taking a massive amount of time to run, alongside many queries timing out.
We ended up with significant resource contention across everything accessing the same DB. The result was slow builds, slow build dequeueing, and all the bad charts going up and to the right.
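The abrupt jump from "fine" to "unresponsive" is what basic queueing theory predicts as load approaches capacity. The sketch below uses the textbook M/M/1 formula for mean time in system, 1 / (service rate - arrival rate). It is an illustration only, not a model of our database: the service rate is a hypothetical number, and real lock contention typically behaves worse than this idealized curve.

```python
def mean_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends waiting plus being served in an M/M/1 queue.

    Returns infinity when arrivals meet or exceed service capacity, because
    the queue then grows without bound.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 100.0  # hypothetical: operations the DB completes per second
    for utilization in (0.50, 0.90, 0.99, 0.999, 1.00):
        arrival_rate = utilization * service_rate
        seconds = mean_time_in_system(arrival_rate, service_rate)
        print(f"utilization {utilization:.3f}: mean time in system {seconds:.3f} s")
```

In this model, going from 50% to 90% utilization makes each operation five times slower, and the last few percent before saturation add orders of magnitude more, which is why a surge of a few multiples of peak can tip a healthy system over within minutes.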
Deleting the queue
Initially, we tried to salvage the queue and process all the builds in it. Once the queue locked up and the builds started aging, the value to customers of running those builds dropped massively, and getting the DB back became our focus. The obvious thing was to purge the queue.
Unfortunately, the complexity of the state in queued builds meant we couldn't just drop data, but had to rewrite parts of our unresponsive database. So we still needed some way to unlock it to be able to get back in control.
First we tried to stop new builds from joining the queue, and we tried it from an unusual place: the load balancer. Theoretically, if the hooks could not reach us, they couldn't join the queue. A quick attempt at this proved ill-advised: when we reduced capacity to throttle the hooks, the hooks naturally outnumbered our customer traffic by a wide margin, making it impossible for our customers to reach us and effectively shutting down our site.
Next we attempted to use parts of our new build scheduling infrastructure, which produces much less load on the system than the current one. But the database was too backed up for that to make a difference.
Extended failure mode
At this point, we were in extended failure mode: the original cause of the outage was no longer the fire to be fought. We were suffering a cascading effect, and that was now where we needed to put our focus.
We were sustaining thousands of queued operations and a fully locked DB, preventing us from doing anything. In an effort to at least flush the database operations, we triggered a stepdown and promoted a secondary. Unfortunately, this didn't work: the new primary was quickly overwhelmed by activity. Not only that, but the builds which failed as a result of the stepdown were re-enqueued, driving more DB operations and further growing the build queue.
It was abundantly clear that even when we got the DB back, we'd want to be operating on bigger hardware with more headroom, so we started the parallel process of building out an upgrade. Unfortunately, we knew the sync would take many hours, and we still needed to get the service running with what we had.
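The re-enqueueing feedback loop is easy to underestimate, so here is a minimal, deterministic sketch of it. Every number and name is hypothetical and the model is far simpler than our scheduler: each tick, new builds arrive, up to a fixed number are started, and a fraction of the started builds fail for infrastructure reasons.

```python
from typing import List


def simulate_queue(
    ticks: int,
    arrivals_per_tick: int,
    capacity_per_tick: int,
    failure_rate: float,
    requeue_failures: bool,
) -> List[int]:
    """Return the queue length at the end of each tick."""
    queue = 0
    history = []
    for _ in range(ticks):
        queue += arrivals_per_tick
        started = min(queue, capacity_per_tick)
        queue -= started
        # Builds that die because the infrastructure under them failed.
        failed = int(started * failure_rate)
        if requeue_failures:
            # Failed builds go straight back into the queue.
            queue += failed
        history.append(queue)
    return history


if __name__ == "__main__":
    print(simulate_queue(5, 100, 100, 0.5, requeue_failures=True))
    print(simulate_queue(5, 100, 100, 0.5, requeue_failures=False))
```

With arrivals exactly matching capacity, the queue stays empty when failed builds are dropped, but grows every tick when they are re-enqueued. Automatic retries are helpful when failures are rare and become extra load precisely when the system is least able to absorb it.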
Clearing the DB
The next step in reducing the load on the DB was to kill off as many builders as we could. Knowing that the odds of a good build were low, we turned off automatic re-enqueueing of builds on infrastructure failures and started killing builder machines. With nowhere for builds to be run, we started pulling slow query logs, which are generally empty during normal operations, and dealing with the worst offenders. These queries were mostly being run too often for what was required to operate the service, but some also had poor query plans that were exacerbated by the unusually high queue sizes.
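As a rough illustration of how the worst offenders can be picked out of a slow query log, the sketch below groups entries by query and ranks them by total time consumed, so that a moderately slow query run far too often ranks alongside a single very slow one. The `(fingerprint, duration_ms)` record format and the query names are assumptions made for this example, not the format of any particular database's log.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class QueryStats(NamedTuple):
    fingerprint: str  # hypothetical label identifying the query shape
    count: int        # how many times it appeared in the slow log
    total_ms: float   # total time spent across all appearances


def worst_offenders(
    entries: Iterable[Tuple[str, float]], top_n: int = 3
) -> List[QueryStats]:
    """Rank queries by total time spent, highest first."""
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)
    for fingerprint, duration_ms in entries:
        counts[fingerprint] += 1
        totals[fingerprint] += duration_ms
    stats = [QueryStats(f, counts[f], totals[f]) for f in totals]
    stats.sort(key=lambda s: s.total_ms, reverse=True)
    return stats[:top_n]


if __name__ == "__main__":
    sample = [
        ("builds.find_queued", 900.0),
        ("builds.find_queued", 1100.0),
        ("users.find_by_id", 150.0),
        ("plans.lookup", 4000.0),
    ]
    for stat in worst_offenders(sample, top_n=2):
        print(stat)
```

Looking at the count next to the total helps separate the two cases described above: a high count suggests the query is simply being run too often, while a high total with a low count points at a poor query plan.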
Sidenote: live patching and Lisp
CircleCI is written in Clojure, a dialect of Lisp. One of the major advantages of this type of language is that you can recompile code live at run-time. We typically use Immutable Architecture, in which we deploy pre-baked machine images to put out new code. This works well for keeping the system in a clean state, as part of a continuous delivery model. Unfortunately, when things are on fire, it doesn't allow us to move as quickly as we would like.
This is where Clojure's live patching comes in. By connecting directly to our production machines and attaching to the Clojure REPL, we can change code live and in production. We've built tooling over the last few years to automate this across the hundreds of machines we run at a particular time.
This has saved us a number of times in the past, turning many potential crises into mere blips. Here, it allowed us to disable queries, fix bugs, and otherwise swap out and disable undesirable code when needed.
Draining build queue
After some significant effort, we were able to get the DB to a state where we could operate on it. We had scripts written to clear the queues, which we felt would do no further damage to the DB. The first script killed the "usage queue", which is where builds sit when a customer doesn't have enough capacity on their plan. The second script killed the "run queue", which is where builds go when they are ready to run and awaiting system capacity. All builds pass through both queues, though in this case the run queue had filled very quickly and the usage queue as a result was overflowing.
It took over an hour for the first script to drain the usage queue, and an additional 30 minutes for the second to drain the run queue. During this time we disabled re-enqueueing of builds. From here on we had control over our DB and the queue, and were able to run builds again.
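For readers unfamiliar with the two queues, here is a minimal in-memory sketch of the idea: a build waits in the usage queue until its customer's plan has room, then waits in the run queue until the system has room. This is an illustration under simplifying assumptions, not our implementation, which lives in the database and tracks far more state. The class, field, and customer names are hypothetical, and build completion is left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple

Build = Tuple[str, str]  # (customer, build_id)


@dataclass
class TwoStageQueue:
    plan_limits: Dict[str, int]  # customer -> concurrent builds allowed by plan
    system_capacity: int         # builds the whole system can run at once
    usage_queue: Deque[Build] = field(default_factory=deque)
    run_queue: Deque[Build] = field(default_factory=deque)
    admitted: Dict[str, int] = field(default_factory=dict)
    running: List[str] = field(default_factory=list)

    def submit(self, customer: str, build_id: str) -> None:
        """Every build enters through the usage queue."""
        self.usage_queue.append((customer, build_id))

    def promote(self) -> None:
        """Move builds to the run queue while their customer's plan has room."""
        still_waiting: Deque[Build] = deque()
        while self.usage_queue:
            customer, build_id = self.usage_queue.popleft()
            used = self.admitted.get(customer, 0)
            if used < self.plan_limits.get(customer, 0):
                self.admitted[customer] = used + 1
                self.run_queue.append((customer, build_id))
            else:
                still_waiting.append((customer, build_id))
        self.usage_queue = still_waiting

    def dispatch(self) -> None:
        """Start builds from the run queue while the system has capacity."""
        while self.run_queue and len(self.running) < self.system_capacity:
            _, build_id = self.run_queue.popleft()
            self.running.append(build_id)


if __name__ == "__main__":
    q = TwoStageQueue(plan_limits={"acme": 1, "globex": 2}, system_capacity=2)
    for customer, build_id in [("acme", "a1"), ("acme", "a2"),
                               ("globex", "g1"), ("globex", "g2")]:
        q.submit(customer, build_id)
    q.promote()
    q.dispatch()
    print(list(q.usage_queue), list(q.run_queue), q.running)
```

Even in this toy version, clearing the queues means touching more than one structure (both queues plus the per-customer counts), which hints at why purging the real thing required rewriting state rather than simply dropping rows.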
New hardware, finally
Once we had control over the DB and queue, we were able to initiate the switch to the new DB hardware. We quickly swapped over, and started monitoring as we ran more builds and scaled our capacity up. As we became confident in the integrity of the system we started to clean up our hacks: the code we patched went through code review and entered our repo, we cycled our fleet to be clean pre-baked images again, and we re-enabled the features and services we had disabled during the crisis.
It was an exhaustingly brutal day for everyone involved, but we made it through to the other side and have improved our infrastructure as a result. We've made a number of simple changes to our architecture in the last few days, which should prevent a recurrence of the same problem. We are also investing in a number of major improvements which will put our architecture and capacity far ahead of our growth and give us far more room to spare.
We're looking forward now with the intent of building the next pieces of Circle, to allow us to continue providing continuous delivery smoothly and seamlessly as we grow, and to reduce the likelihood of future outages like the one we saw last Tuesday. To everyone who dropped a kind word of encouragement via Twitter, email, or in-app while we dug through the trenches: those messages of support meant a lot, and we really appreciated them. Thank you.