Delays in operations
Incident Report for CircleCI
Postmortem

Summary

On April 6, 2022, from approximately 9:00 UTC until approximately 16:20 UTC, large portions of our cloud offering were unavailable because one of our core services was unable to communicate with its underlying database. Customers in EMEA timezones, as well as customers on the East coast of North America during their early morning, were unable to create pipelines and/or workflows when they pushed code during our downtime. As a result, CI results were not available, and in many cases customers were unable to merge PRs until around 13:00 UTC, after which we fully recovered. We apologize for this disruption and are taking steps to prevent a future occurrence.

As with many incidents in larger distributed systems, there was no single smoking gun; instead, a few things had to line up and compound, leading to a cascading failure. The incident was triggered by a change deployed the previous evening that erroneously increased traffic to an API endpoint. On its own, this change was not the problem; however, as traffic increased throughout the following workday, an automatic vacuum process began on the database, which pushed us past the database's IOPS limits and caused it to slow down dramatically. To resolve the incident, we added database capacity and isolated traffic patterns, and once we identified the offending change as the primary contributing factor, we reverted it.
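
This interaction between load and maintenance work is observable directly on the database. As a hedged illustration only, assuming a PostgreSQL instance (autovacuum is a PostgreSQL mechanism, and the database in question runs on RDS), a script like the following can surface the tables autovacuum is likely to target and any vacuum workers currently consuming I/O; the connection details are placeholders, not our configuration:

    # Sketch: inspect autovacuum pressure on a PostgreSQL database.
    # Connection parameters below are illustrative placeholders.
    import psycopg2

    conn = psycopg2.connect(host="builds-db.example.internal", dbname="builds",
                            user="readonly", password="redacted")
    with conn, conn.cursor() as cur:
        # Tables with the most dead tuples are the likeliest autovacuum targets.
        cur.execute("""
            SELECT relname, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        for relname, dead, last in cur.fetchall():
            print(f"{relname}: {dead} dead tuples, last autovacuum {last}")

        # Vacuum workers that are currently running (and consuming IOPS).
        cur.execute("""
            SELECT pid, now() - query_start AS runtime, query
            FROM pg_stat_activity
            WHERE query ILIKE 'autovacuum:%'
        """)
        for pid, runtime, query in cur.fetchall():
            print(f"pid {pid} running for {runtime}: {query}")
    conn.close()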

What Happened

On April 5, 2022, at 8:41 UTC, we deployed a change to our front end to improve the user experience. The change was deployed incrementally and showed no immediate signs of issues. On April 6, 2022, at 7:22 UTC, an auto-vacuum job was triggered on one of our core databases. We started to experience increased read/write latencies, which caused all API calls to builds-service to slow down and a few to start failing.

At 7:32 UTC, our alerting system notified us of unusually high load on one of our API endpoints. At 7:45 UTC, we received alerts that IOPS on the builds-service database were nearing their limit. We declared an incident at 8:13 UTC and summoned our Incident Response Team.
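
For readers unfamiliar with this class of alert: for a database on RDS, IOPS alarms of this kind can be built on CloudWatch metrics. The snippet below is a hedged sketch of such an alarm, not a description of our actual tooling; the instance identifier, threshold, and SNS topic are placeholders:

    # Sketch: alarm when ReadIOPS on an RDS instance approaches its provisioned
    # limit. All names and numbers are illustrative placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="builds-db-read-iops-near-limit",
        Namespace="AWS/RDS",
        MetricName="ReadIOPS",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "builds-db"}],
        Statistic="Average",
        Period=60,                    # one-minute averages
        EvaluationPeriods=5,          # sustained for five minutes
        Threshold=9000,               # e.g. 90% of a 10,000 IOPS provision
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],
    )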

At the beginning of the incident, the impact on customers was negligible. The auto-vacuum job was terminated, and the incident response team monitored the situation; approximately 45 minutes later, we observed a drastic increase in failures in the builds-service API as we hit the database’s IOPS limits. We called in additional team members with relevant expertise to assist with the incident. At 9:35 UTC, we elevated the incident to ‘major’. At 9:41 UTC, we received our first support ticket from a customer impacted by this incident.

The incident response team continued to investigate the issue while terminating long-running processes and queries on the database in an attempt to reduce IOPS. We observed disk errors on RDS but struggled to determine how they related to the incident in progress. We then restarted the database and began to see some signs of recovery.
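
On a PostgreSQL database, this kind of triage is typically done through pg_stat_activity. A minimal sketch, with an illustrative cutoff rather than the one we actually used, and placeholder connection settings:

    # Sketch: find and terminate queries running longer than a cutoff.
    # The cutoff and connection settings are illustrative placeholders.
    import psycopg2

    CUTOFF = "5 minutes"  # assumed threshold for "long-running"

    conn = psycopg2.connect(host="builds-db.example.internal", dbname="builds",
                            user="admin", password="redacted")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT pid, now() - query_start AS runtime, query
            FROM pg_stat_activity
            WHERE state = 'active'
              AND now() - query_start > %s::interval
              AND pid <> pg_backend_pid()
            ORDER BY runtime DESC
        """, (CUTOFF,))
        for pid, runtime, query in cur.fetchall():
            print(f"terminating pid {pid} (running {runtime}): {query[:80]}")
            # pg_terminate_backend() ends the backend process serving that query.
            cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
    conn.close()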

At 10:18 UTC, we scaled the affected applications horizontally to handle the increased traffic; however, IOPS began climbing again. The team increased the size and capacity of the impacted database while redirecting a portion of the traffic to a replica database to alleviate pressure. At 12:50 UTC, we engaged AWS support for assistance.
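
To illustrate the read/write split this relies on, here is a hedged, application-level sketch of sending read-only queries to a replica so the primary keeps its IOPS for writes; the hostnames and credentials are placeholders, and this is not builds-service code:

    # Sketch: route read-only queries to a replica, writes to the primary.
    # Hostnames and credentials are illustrative placeholders.
    import psycopg2

    class RoutingDB:
        def __init__(self):
            self.primary = psycopg2.connect(host="builds-db-primary.example.internal",
                                            dbname="builds", user="app", password="redacted")
            self.replica = psycopg2.connect(host="builds-db-replica.example.internal",
                                            dbname="builds", user="app", password="redacted")

        def query(self, sql, params=None, readonly=False):
            # Reads go to the replica to take IOPS pressure off the primary.
            conn = self.replica if readonly else self.primary
            with conn, conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall() if cur.description else None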

At 15:20 UTC, we identified and rolled back the problematic change, drastically reducing traffic to builds-service. At that point, builds-service recovered, but during the incident we had built up a backlog of unprocessed pipelines. It took our systems until 16:00 UTC to work through that queue and fully restore service. Our incident response team continued to monitor the situation to ensure there were no further issues.

Future Prevention and Process Improvement

Like our customers, we strive to continuously deploy changes and improvements to our platform. Incidents like this significantly impact everyone’s ability to do their jobs. 

We learned from this, and we aim to improve. We are actively working on the following steps to mitigate future outages:

  • Improve database standards for our services in order to ensure our systems scale with increased load
  • Implement stricter rate limiting on API endpoints 
  • Implement query timeouts and circuit-breakers in builds-service’s database queries (see the sketch after this list)
  • Improve observability and alerting to catch performance regressions earlier
  • Revisit resource capacity planning to account for greater scale
  • Reduce data retention on our database to reduce database size
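
To make the query-timeout and circuit-breaker item concrete, here is a minimal sketch of one way to combine the two, assuming a PostgreSQL database accessed from Python; the thresholds, names, and structure are illustrative, not the actual builds-service implementation:

    # Sketch: a per-query statement timeout plus a simple circuit breaker.
    # Thresholds and names are illustrative, not builds-service internals.
    import time
    import psycopg2

    FAILURE_THRESHOLD = 5      # consecutive failures before the breaker opens
    RESET_AFTER_SECONDS = 30   # how long the breaker stays open

    class CircuitOpenError(Exception):
        """Raised while calls are rejected to shed load from the database."""

    class GuardedDB:
        def __init__(self, dsn, statement_timeout_ms=2000):
            # statement_timeout makes PostgreSQL cancel queries that run too long,
            # so a slow database fails requests fast instead of piling up work.
            self.conn = psycopg2.connect(
                dsn, options=f"-c statement_timeout={statement_timeout_ms}")
            self.failures = 0
            self.opened_at = None

        def query(self, sql, params=None):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < RESET_AFTER_SECONDS:
                    raise CircuitOpenError("database circuit breaker is open")
                self.opened_at = None  # half-open: let one trial call through

            try:
                with self.conn, self.conn.cursor() as cur:
                    cur.execute(sql, params)
                    rows = cur.fetchall() if cur.description else None
            except psycopg2.Error:
                self.failures += 1
                if self.failures >= FAILURE_THRESHOLD:
                    self.opened_at = time.monotonic()  # open the breaker
                raise
            self.failures = 0
            return rows

While the breaker is open, a caller can fail fast or serve cached data instead of adding load to an already saturated database.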

For more information on our strategy to improve incident response and recovery times, here’s a blog post from our CTO, Rob Zuber: https://circleci.com/blog/an-update-on-circleci-reliability/

Posted Apr 18, 2022 - 18:00 UTC

Resolved
This incident is now resolved. Thank you for your patience.
Posted Apr 06, 2022 - 16:20 UTC
Monitoring
We have made some changes to mitigate the issue and we are currently monitoring the system.
Posted Apr 06, 2022 - 15:56 UTC
Update
We are continuing to work on a fix.
Posted Apr 06, 2022 - 15:49 UTC
Update
We have now deployed some mitigation to get more work through our system and continue to work on a longer-term fix.
Posted Apr 06, 2022 - 15:04 UTC
Update
We are still working to get builds running as normal.
Posted Apr 06, 2022 - 14:47 UTC
Update
We are applying changes to ensure that builds are running as usual. Thank you for bearing with us.
Posted Apr 06, 2022 - 14:17 UTC
Update
We are still looking into the root cause but builds are going through from our backlog now. We apologize for the disruption.
Posted Apr 06, 2022 - 13:28 UTC
Update
We are investigating multiple possible causes, including database changes and code changes.
Posted Apr 06, 2022 - 12:50 UTC
Update
We are investigating database read/write delays and are taking measures to mitigate the delays.
Posted Apr 06, 2022 - 12:25 UTC
Update
We are looking at database-level data to continue to determine the cause of the incident. Thank you for your patience while we continue to investigate.
Posted Apr 06, 2022 - 12:01 UTC
Update
We are looking into the cause of the incident. Thank you for your patience while we investigate.
Posted Apr 06, 2022 - 11:34 UTC
Update
We are continuing to investigate this issue.
Posted Apr 06, 2022 - 11:11 UTC
Update
We are continuing to investigate this issue.
Posted Apr 06, 2022 - 10:50 UTC
Update
We are continuing to investigate this issue.
Posted Apr 06, 2022 - 10:18 UTC
Update
We are continuing to investigate this issue.
Posted Apr 06, 2022 - 09:55 UTC
Update
We are still investigating and working at identifying the root cause.

The issue is now preventing builds from running and UI pages from properly loading.

Please accept our apologies for the disruption.
Posted Apr 06, 2022 - 09:34 UTC
Update
We are continuing to investigate this issue.
Posted Apr 06, 2022 - 09:07 UTC
Investigating
We are currently investigating degraded performance causing slow operations.
Posted Apr 06, 2022 - 08:34 UTC
This incident affected: Docker Jobs, Machine Jobs, macOS Jobs, Windows Jobs, Pipelines & Workflows, and CircleCI UI.