Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected multiple areas of our application.
The underlying issues behind each incident are largely independent, and we have detailed each one separately on our status page; the full report is available here. Each of the 6 incidents has been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
What happened?
On October 23rd at 20:18 UTC, we began receiving error responses from GCP indicating that the zones in which we run machine executors and remote Docker instances were out of capacity. As a result, we were unable to create new VMs for customer jobs.
We were delayed in correctly diagnosing this root cause by a few tangential, but ultimately unrelated, issues: our services did not handle zone capacity errors correctly, our “destroy-vm” worker had an error-handling bug, and our retry behavior responded too aggressively to rate limiting by GCP.
At 21:03 UTC, we initiated a failover to zones in our alternate GCP region, and delays began to recover.
_VM creation errors_
Who was affected?
Over the course of this incident, jobs using machine executors and remote Docker experienced delays. A subset of these jobs (approximately 10%) were marked as failed and required a manual re-run.
What did we do to resolve the issue?
We resolved the issue by failing over to our alternate GCP region.
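For illustration only, here is a minimal sketch in Go of what this kind of failover can look like. The zone names are hypothetical and `createVM` is a stub rather than our real provisioning code: when a zone reports an out-of-capacity error, VM creation moves on to the next zone, falling back to zones in an alternate region once the primary region is exhausted.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errZoneCapacity stands in for GCP's zone-capacity error response; the
// real service inspects the error code returned by the Compute Engine API.
var errZoneCapacity = errors.New("zone is out of capacity")

// createVM is a stand-in for the call that provisions a machine-executor
// or remote Docker VM in a specific zone. Here it simulates the incident:
// every zone in the (hypothetical) primary region is out of capacity.
func createVM(zone string) error {
	if strings.HasPrefix(zone, "us-east1") {
		return errZoneCapacity
	}
	return nil
}

// createWithFailover tries each zone in the primary region, then falls
// back to zones in the alternate region. Capacity errors mean "try the
// next zone"; any other error is surfaced immediately.
func createWithFailover(primary, alternate []string) (string, error) {
	for _, zone := range append(primary, alternate...) {
		err := createVM(zone)
		if err == nil {
			return zone, nil
		}
		if !errors.Is(err, errZoneCapacity) {
			return "", err
		}
	}
	return "", errors.New("all zones out of capacity")
}

func main() {
	// Hypothetical zone lists, not our actual configuration.
	primary := []string{"us-east1-b", "us-east1-c"}
	alternate := []string{"us-central1-a", "us-central1-b"}

	zone, err := createWithFailover(primary, alternate)
	if err != nil {
		fmt.Println("could not provision VM:", err)
		return
	}
	fmt.Println("VM provisioned in", zone) // prints "VM provisioned in us-central1-a"
}
```

The important property is that capacity errors are treated as retryable in a different zone, while unrelated failures surface immediately instead of being masked by retries.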
What are we doing to make sure this doesn’t happen again?
We have deployed a number of fixes to address the tangential issues identified during this incident: our services now correctly handle zone capacity errors, and the error handling in our “destroy-vm” worker has been fixed. We have also adjusted our retry backoff times to respond more conservatively to rate limiting by GCP.
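As a rough sketch of what a more conservative retry policy can look like (the base delay, cap, and attempt count below are hypothetical values rather than the ones we ship, and `callGCP` is a stub), an exponential backoff with jitter waits progressively longer after each rate-limited response instead of hammering the API:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errRateLimited stands in for a rate-limiting response from GCP.
var errRateLimited = errors.New("rate limited by GCP")

// callGCP is a stub for whichever GCP API call is being retried.
func callGCP() error {
	return errRateLimited
}

// retryWithBackoff retries a rate-limited call, doubling the delay after
// each attempt (up to maxDelay) and adding random jitter so retries from
// many workers do not arrive in lockstep.
func retryWithBackoff(maxAttempts int, baseDelay, maxDelay time.Duration) error {
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := callGCP()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRateLimited) {
			return err // only rate-limiting errors are retried
		}
		sleep := delay + time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d rate limited; backing off %v\n", attempt, sleep)
		time.Sleep(sleep)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("still rate limited after %d attempts", maxAttempts)
}

func main() {
	// Hypothetical tuning: start at 500ms, cap at 10s, give up after 5 attempts.
	if err := retryWithBackoff(5, 500*time.Millisecond, 10*time.Second); err != nil {
		fmt.Println(err)
	}
}
```

The key change is that the delay grows with consecutive failures rather than staying fixed, so sustained rate limiting results in progressively less load on the API.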
Additionally, we have added instrumentation to the affected services to improve observability, particularly around error responses from GCP.
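As one example of what such instrumentation can look like (this assumes a Prometheus-style metrics setup and a hypothetical `gcp_api_errors_total` counter; it is not a description of our internal tooling), counting GCP error responses by error code makes a spike in capacity errors immediately visible on dashboards and alerts:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// gcpErrors counts error responses from GCP, labeled by error code, so a
// surge of capacity errors stands out from ordinary API failures.
var gcpErrors = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "gcp_api_errors_total",
	Help: "Error responses received from GCP, by error code.",
}, []string{"code"})

// recordGCPError is called wherever a GCP API call fails.
func recordGCPError(code string) {
	gcpErrors.WithLabelValues(code).Inc()
}

func main() {
	// Record an example error code of the kind seen during this incident.
	recordGCPError("ZONE_RESOURCE_POOL_EXHAUSTED")

	// Expose the metrics endpoint for scraping.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```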
Lastly, we are investigating the possibility of changing our default zones for machine executors and remote Docker instances to span regions rather than being colocated in a single region. We believe doing this will decrease our chances of hitting capacity in multiple zones simultaneously.
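To make that idea concrete, here is a small sketch (region and zone names are hypothetical) of building a default zone list that interleaves zones from several regions instead of listing one region’s zones first; with this ordering, a capacity problem confined to a single region no longer exhausts every early choice in the list:

```go
package main

import "fmt"

// interleaveZones builds a default zone list that alternates between
// regions, rather than listing every zone from a single region first.
func interleaveZones(regions map[string][]string, order []string) []string {
	var out []string
	for i := 0; ; i++ {
		added := false
		for _, region := range order {
			zones := regions[region]
			if i < len(zones) {
				out = append(out, zones[i])
				added = true
			}
		}
		if !added {
			return out
		}
	}
}

func main() {
	// Hypothetical regions and zones.
	regions := map[string][]string{
		"us-east1":    {"us-east1-b", "us-east1-c"},
		"us-central1": {"us-central1-a", "us-central1-b"},
	}
	fmt.Println(interleaveZones(regions, []string{"us-east1", "us-central1"}))
	// Output: [us-east1-b us-central1-a us-east1-c us-central1-b]
}
```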