Summary:
Starting on May 21, 2021, at 13:30, UTC CircleCI customers running Docker, Machine executor (which includes Windows, Mac, Arm, and GPU) and remote Docker jobs experienced delayed or blocked pipelines. This incident lasted until May 22, 2021, at 01:18 UTC. This report serves to inform our customers of what happened during the incident and what we are doing to ensure an outage of this nature does not happen again. We know how critical CircleCI is to our customers, and we want to assure our users that we take this seriously. We are continually working to scale the CircleCI platform in a manner that does not impact our customers. We appreciate the support and patience you showed throughout this incident and apologize for the disruption in service.
Contributing Event:
On May 20th, we upgraded RabbitMQ from 3.8.9 to 3.8.16 as part of routine upgrades. As usual, the team checked the changelog to ensure there were no breaking changes listed and noted the ones that were. The upgrade happened with no significant issues. On Monday, the 24th of May, our investigation determined that there was a change that had been described incorrectly in the changelog. There was a section describing a change to quorum queue behavior, which is not used by CircleCI applications:
Quorum queue consumers that consume messages and do not acknowledge them will now have a 15 minute acknowledgment timeout applied to them by default. Operators can increase the timeout if necessary.
Such consumers delay or even prevent on-disk data compaction, which can run a node out of disk space much earlier than anticipated.
GitHub issue: #2990
On reviewing the GitHub issue and the code involved, we learned the change affected all queue types, not just quorum queues. This upgrade, therefore, resulted in some messages hitting the acknowledgment timeout, and the RabbitMQ channel associated with the corresponding consumer to be closed. This caused the number of consumers on one queue, in particular, to drop over time, eventually hitting zero consumers. This queue is used to asynchronously clean up VMs in a specific region, and the loss of consumers meant we stopped deleting those VMs.
This was a long-running incident with numerous follow-on effects. Remediation falls into two broad categories: detecting this specific failure sooner and improving the resiliency of the VM service. Specifically:
(all times UTC)
On the 20th of May at 08:37, CircleCI engineers performed routine maintenance on our RabbitMQ cluster and upgraded to version 3.8.16 from 3.8.9. The upgrade was successful, and our systems continued to operate as normal.
At 02:31 on the 21st of May, our engineers were notified that one of our VM destroyer service queues had no consumers. This occurred during off-peak hours; as such, the relevant VM service Kubernetes pods were restarted. The appropriate team was notified of the restart and all systems returned to normal. At 03:10, the VM destroyer service once again began to drop consumers from its queue, however, at the time this went unnoticed. At 10:11, an alert was triggered informing us that one of the queues used to destroy VMs was backing up. Unfortunately, this alert was misconfigured and only set to notify once a threshold was reached and not if the queue continued to rise. It also did not cause one of the engineers to be paged.
The image below shows the VM destroy queue backing up beyond alerting levels.
At 13:00, CircleCI engineers received the first series of pages alerting them to excessive delays starting jobs. At this time, our engineers began to investigate the issue more deeply. The initial investigation, and alerts, indicated that CircleCI was unable to provision certain types and classes of VM instances. We saw that our system was hitting CPU quotas within GCP which prevented us from starting VMs. At 13:23, an incident was declared, and at 13:26, we opened a ticket with Google Cloud to increase these limits. As the investigation continued, we noticed an uptick in cryptomining-related platform abuse. Believing this caused the observed impact, we focused efforts on removing the jobs associated with mining. This did not visibly mitigate customer impact, and in the meantime, we continued to hit GCP quotas which led to further slowdown of the system and prevented recovery. Further investigation revealed that automated retries were the leading cause of our GCP quotas being hit.
At 14:13, we manually blocked further Linux machine jobs from entering the work queue entirely. Coupling between our Docker executor fleet and the VM service resulted in slowdowns to Docker jobs as well. At 14:37, GCP increased our quotas, but the system was unable to recover on its own. We began to formulate a plan to delete old VMs manually. As this manual deletion progressed, we investigated other ways to manually shed load. At 17:22 we chose to cancel some workflows in an effort to free up resources and “unwedge” the platform as we worked to return things to an operational state.
At 16:25, engineers determined that the queues which account for destroying old VMs had no consumers. In an effort to re-establish connectivity to the queues we reset the VM services deployment and the queues began to drain. This continued to work well until approximately 18:05, at which time our RDS-based databases had exhausted the allocated Burst Balance the result of which was a substantial decrease in IO and ultimately connections to the database began to time out and dropped. The decision was made to reboot the database instance in an attempt to reset the burst balance and force a reset of all connections.
The database reboot did not have the desired effect as connections and IO were quickly exhausted due to the volume of DB queries. It was determined that the best course forward was to increase the size of the RDS-based instance to add more memory and CPU, which would also increase and reset the burst balance.
At 18:20, our Infrastructure team scaled one of our RDS databases for the first time. As the database scaled, we had to turn down the VM provisioning service such that it did not flood the database with connections as it returned to service. As the database returned to service, we observed a rapid decrease in our job queues which was an encouraging sign. The increase in database resources coupled with some load-shedding had shown positive results. We continued to observe a drop in our queues until we had exhausted our GP2 credits which once again led to the RDS instance running out of resources. As the database began to recover our engineers took steps to block remote-docker jobs from running as these jobs were causing our vm-create queues to increase. This was done to shed some load and remove any back pressure on our VM creation queues as we tried to stabilize the system and resume normal operations. We observed the database recovering and our queued backlogs dropping until approximately 18:50 at which time we began to once again hit our GCP API quotas. At 19:00, we began to scale down some of our provisioning services in order to provide the database and downstream services room to process the backlog of transactions without adding any further pressure to the system. This did not have the expected impact and we continued to burn through our RDS burst balance - to give some context we used up approximately 30% of our balance inside of 15 minutes. At 19:15 the decision was made to scale the database as the increase in memory and CPU would give us additional resources to process transactions as well as increased burst protection. At 19:45 our burst balance and database resources had stabilized, however VM provisioning times were once again on the rise.
The graph below shows the provisioning times beginning to rise:
The system had become “wedged” and we had found ourselves in a vicious cycle of creating VMs and then not being able to remove them due to hitting our GCP quotas. In order to break out of this cycle, the team scaled down all the pods which scaled new VMs and destroyed old ones. This occurred at 21:00. In the time leading up to this the team had also been working on deploying a revised distribution system which did not provision remote-docker jobs. While remote-docker was not the cause of the issues, the jobs running exacerbated the issue. The revised distribution backend was deployed at 21:57 which blocked remote-docker jobs from entering our provisioning queues. At 22:21, our engineers scaled down our provisioning system as the removal of remote-docker jobs afforded the system some room to begin flushing queues.
At 22:36, the decision was made to slowly bring pieces of the system back online. Over the course of the next two hours, our engineers brought provisioning and distribution systems back online one component at a time, ensuring the stability of each before moving to the next. By 00:39 all pieces of the infrastructure had been brought back online and any code changes that were made to mitigate the outage were reversed. At 01:00, the incident was moved to a monitoring state and resolved at 01:18
At this time there was confidence that the system was in a stable state and everyone dispersed to get some much-needed rest. This incident involved folks from EMEA and NA and came close to requiring our folks in the APAC region. The leading cause of the outage was discovered on Monday, the 24th of May. A look at various graphs showed that consumers of our RabbitMQ queues were being disconnected after a 15 minute period. This did not impact all of our queues as most of them do not require a timeout of that length, however, the ones which do continually saw their consumers drop and not restart. This change was introduced in RabbitMQ 3.8.15. Our engineers did note this in the changelog, however, the RabbitMQ documentation notes that this change only impacts quorum queues which we do not use. We have since reviewed the RabbitMQ source code and found that this change impacts all queues. We have since asked RabbitMQ about this and they have followed up informing us that this timeout does in fact affect all queue types and that a larger timeout will be applied, along with further documentation about this setting, in the next release.
The incident status page can be found here: https://status.circleci.com/incidents/qxkj1832wf9l