Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected multiple areas of our application.
The underlying issues behind each incident are largely independent, and we have detailed each one separately on our status page; the full report is available here. All six incidents have been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
On October 24th at 18:33 UTC, we began receiving error responses from GCP indicating that the zones in which we run machine executors and remote Docker instances were out of capacity. At the same time, we began receiving rate-limit error responses from the GCP API.
Focusing first on the rate-limit errors, we reduced the number of worker processes making calls to the GCP API. This resolved the rate-limit errors, but we continued to receive errors related to zone capacity.
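Reducing the number of concurrent callers is one half of staying under an API rate limit; the other half is having each caller back off and retry when the limit is hit. The sketch below illustrates the retry side with truncated exponential backoff and jitter. It is a minimal illustration, not CircleCI's actual worker code; `RateLimitError` and `call_with_backoff` are hypothetical names standing in for whatever error type and call site the real system uses.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a rate-limit (HTTP 429) response from a cloud API."""

def call_with_backoff(api_call, max_retries=5, base_delay=1.0):
    """Invoke api_call, retrying on rate-limit errors with exponential
    backoff plus jitter so many workers don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return api_call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # Wait base_delay * 2^attempt seconds, plus random jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Jitter matters here: without it, a fleet of workers that were all rate-limited at the same moment would retry at the same moment too, re-triggering the limit.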
At 19:10 UTC we initiated a failover to zones in our alternate GCP region and delays began recovering. By 19:16 UTC delays had returned to normal levels and the incident was resolved.
Who was affected?
Customers using machine executors and remote Docker experienced delays. A subset of these tasks (approximately 30%) was marked as failed and required a manual re-run.
What did we do to resolve the issue?
We resolved the issue by failing over to our alternate GCP region.
What are we doing to make sure this doesn’t happen again?
In addition to the actions taken in response to the related incident on October 23rd, we have automated the process of failing over to our alternate zones.
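An automated zone failover of this kind can be as simple as walking a preference-ordered list of zones and moving on whenever one reports it is out of capacity. The sketch below shows that pattern. The zone list and the `provision` callable are illustrative assumptions, not CircleCI's actual configuration or provisioning API.

```python
class ZoneCapacityError(Exception):
    """Stand-in for a 'zone out of capacity' error from the cloud provider."""

# Hypothetical preference order: primary-region zones first, then zones
# in the alternate region used for failover.
ZONES = ["us-east1-b", "us-east1-c", "us-central1-a", "us-central1-b"]

def provision_with_failover(provision, zones=ZONES):
    """Try to provision an instance in each zone in order, falling
    through to the next zone (and eventually the alternate region)
    when a zone has no capacity."""
    last_error = None
    for zone in zones:
        try:
            return provision(zone)
        except ZoneCapacityError as err:
            last_error = err  # remember the failure and try the next zone
    raise last_error  # every zone was out of capacity
```

Automating this removes the human step that took the October 24th incident from detection at 18:33 UTC to failover at 19:10 UTC.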