On April 6, 2022, from approximately 9:00 UTC until approximately 16:20 UTC, large portions of our cloud offering were unavailable because one of our core services could not communicate with its underlying database. Customers in EMEA timezones, as well as East coast North American customers during their early morning, who pushed code during the downtime were unable to create pipelines and/or workflows. As a result, CI results were not available, and in many cases customers were unable to merge PRs until we had fully recovered. We apologize for this disruption, and we are taking steps to prevent a future occurrence.
As with many incidents in large distributed systems, there was no single smoking gun; instead, a few factors lined up and compounded into a cascading failure. The incident was triggered by a change deployed the evening before that erroneously increased traffic to an API endpoint. On its own this was not a problem, but as traffic grew throughout the following workday, an automatic vacuum process started on the database and pushed it past its IOPS limits, dramatically slowing it down. To resolve the incident, we redirected database capacity and isolated traffic patterns, which allowed us to identify the offending change as the primary contributing factor and revert it.
On April 5, 2022, at 8:41 UTC, we deployed a change to our front end to improve the user experience. This change was deployed incrementally and showed no immediate signs of any issues. On April 6, 2022, at 7:22 UTC, an auto-vacuum job was triggered on one of our core databases. We began to experience increased read/write latencies, which caused all API calls to builds-service to slow down and a few to start failing.
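For context on why this matters: an auto-vacuum in PostgreSQL (implied here by the auto-vacuum behavior and the RDS instance mentioned later) can generate substantial read/write load on its own. Below is a minimal sketch of the kind of check used to confirm that auto-vacuum workers are running and for how long; the connection string is a placeholder, and this is illustrative rather than our actual tooling.

```python
# Minimal sketch: list running auto-vacuum workers on a PostgreSQL instance.
# The DSN below is a placeholder; this is illustrative, not our actual tooling.
import psycopg2

AUTOVACUUM_WORKERS = """
    SELECT pid, now() - xact_start AS runtime, query
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%'
    ORDER BY runtime DESC;
"""

def list_autovacuum_workers(dsn: str):
    """Return (pid, runtime, query) for each auto-vacuum worker currently running."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(AUTOVACUUM_WORKERS)
            return cur.fetchall()

if __name__ == "__main__":
    # Hypothetical read-only DSN for the builds-service database.
    for pid, runtime, query in list_autovacuum_workers("postgresql://readonly@builds-db:5432/builds"):
        print(f"pid={pid} running for {runtime}: {query}")
```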
At 7:32 UTC, our alerting system notified us of unusually high load on one of our API endpoints. At 7:45 UTC, we received alerts that IOPS on the builds-service database were nearing their limit. We declared an incident at 8:13 UTC and assembled our Incident Response Team.
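The IOPS alert referenced here is the kind of threshold alarm that CloudWatch supports for RDS instances. The sketch below shows one way such an alarm could be defined with boto3; the instance identifier, threshold, and SNS topic are hypothetical placeholders, not our actual monitoring configuration.

```python
# Hypothetical sketch of a CloudWatch alarm on RDS write IOPS, defined with boto3.
# Instance identifier, threshold, and SNS topic are placeholders, not our real setup.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="builds-db-write-iops-near-limit",          # placeholder name
    AlarmDescription="WriteIOPS approaching the provisioned IOPS limit",
    Namespace="AWS/RDS",
    MetricName="WriteIOPS",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "builds-db"}],  # placeholder
    Statistic="Average",
    Period=300,                       # evaluate over 5-minute windows
    EvaluationPeriods=2,              # require two consecutive breaches before alerting
    Threshold=9000,                   # e.g. 90% of a hypothetical 10,000 provisioned IOPS
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],    # placeholder topic
)
```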
At the beginning of the incident, the impact on customers was negligible. The auto-vacuum job was terminated and the incident response team monitored the situation; approximately 45 minutes later, we observed a drastic increase in failures from the builds-service API as we hit the database’s IOPS limits. We called in additional expertise from our team to assist with the incident. At 9:35 UTC, we elevated the incident to ‘major’. At 9:41 UTC, we received our first support ticket from a customer impacted by this incident.
The incident response team continued to investigate the issue while terminating any long-running processes and queries on the database in an attempt to reduce IOPS. We observed disk errors on RDS and struggled to determine their correlation with the incident in progress. We then restarted the database and began to see some signs of recovery.
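Terminating long-running queries is a standard way to shed load on a PostgreSQL database. A minimal sketch of that remediation is shown below, assuming PostgreSQL and psycopg2; the DSN and the 15-minute cutoff are hypothetical.

```python
# Sketch: terminate queries that have run longer than a cutoff to shed database load.
# Assumes PostgreSQL; the DSN and the 15-minute cutoff are hypothetical.
import psycopg2

TERMINATE_LONG_RUNNING = """
    SELECT pg_terminate_backend(pid), pid, now() - query_start AS runtime, query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid <> pg_backend_pid()            -- never terminate our own session
      AND now() - query_start > %s::interval;
"""

def terminate_long_running(dsn: str, cutoff: str = "15 minutes"):
    """Terminate active backends whose current query has run longer than `cutoff`."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(TERMINATE_LONG_RUNNING, (cutoff,))
            return cur.fetchall()
```

Terminating backends rolls back their in-flight work, so in practice this is done selectively and under close observation.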
At 10:18 UTC, we scaled out the affected applications horizontally to handle the increased traffic; however, IOPS began climbing again. The team upsized the impacted database while redirecting a portion of the traffic to a replica database to alleviate pressure. At 12:50 UTC, we engaged AWS support for assistance.
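Redirecting a portion of traffic to a replica typically means routing read-only queries to the replica endpoint while writes continue to go to the primary. The sketch below illustrates that general pattern; the endpoints, table name, and read fraction are hypothetical, and this is not our service code.

```python
# Illustrative sketch of splitting read traffic between a primary and a read replica.
# Endpoints, table name, and the read fraction are hypothetical; this is not our service code.
import random
import psycopg2

PRIMARY_DSN = "postgresql://builds@builds-db-primary:5432/builds"   # placeholder
REPLICA_DSN = "postgresql://builds@builds-db-replica:5432/builds"   # placeholder

def connection_for(read_only: bool, replica_fraction: float = 0.5):
    """Send writes to the primary; route a fraction of read-only traffic to the replica."""
    if read_only and random.random() < replica_fraction:
        return psycopg2.connect(REPLICA_DSN)
    return psycopg2.connect(PRIMARY_DSN)

# Example: a read-only query that may be served by the replica.
with connection_for(read_only=True) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM pipelines WHERE created_at > now() - interval '1 hour';")
        print(cur.fetchone())
```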
At 15:20 UTC, we identified and rolled back the problematic change, drastically reducing traffic to builds-service. At that point, builds-service recovered, but during the incident we had built up a backlog of unprocessed pipelines. It took our systems until 16:00 UTC to work through that queue and fully restore service. Our incident response team continued to monitor the situation to ensure no further issues arose.
Like our customers, we strive to continuously deploy changes and improvements to our platform. Incidents like this significantly impact everyone’s ability to do their jobs.
We learned from this, and we aim to improve. We are actively working on steps to mitigate future outages.
For more information on our strategy to improve incident response and recovery times, here’s a blog post from our CTO, Rob Zuber: https://circleci.com/blog/an-update-on-circleci-reliability/