Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected multiple areas of our application.
The underlying issues behind each incident are largely independent, and we have detailed each one separately on our status page; the full report is available here. Each of the 6 incidents has been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
What happened?
On October 23rd at 20:18 UTC, we began receiving error responses from GCP indicating that the zones in which we run machine executors and remote Docker instances were out of capacity. As a result, we were unable to create new VMs for customer jobs.
We were delayed in correctly diagnosing this root cause by a few tangential, but ultimately unrelated, issues: our services did not handle zone capacity errors correctly, our “destroy-vm” worker had an error-handling bug, and our retry behavior responded too aggressively to rate limiting by GCP.
At 21:03 UTC, we initiated a failover to zones in our alternate GCP region, and delays began to recover.
_VM creation errors_
Who was affected?
Over the course of this incident, jobs using machine executors and remote Docker experienced delays. A subset of these jobs (approximately 10%) were marked as failed and required a manual re-run.
What did we do to resolve the issue?
We resolved the issue by failing over to our alternate GCP region.
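For illustration only, here is a minimal sketch in Go of what this kind of failover can look like. The zone names are hypothetical and `createVM` is a stub rather than our real provisioning code: when a zone reports an out-of-capacity error, VM creation moves on to the next zone, falling back to zones in an alternate region once the primary region is exhausted.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errZoneCapacity stands in for GCP's zone-capacity error response; the
// real service inspects the error code returned by the Compute Engine API.
var errZoneCapacity = errors.New("zone is out of capacity")

// createVM is a stand-in for the call that provisions a machine-executor
// or remote Docker VM in a specific zone. Here it simulates the incident:
// every zone in the (hypothetical) primary region is out of capacity.
func createVM(zone string) error {
	if strings.HasPrefix(zone, "us-east1") {
		return errZoneCapacity
	}
	return nil
}

// createWithFailover tries each zone in the primary region, then falls
// back to zones in the alternate region. Capacity errors mean "try the
// next zone"; any other error is surfaced immediately.
func createWithFailover(primary, alternate []string) (string, error) {
	for _, zone := range append(primary, alternate...) {
		err := createVM(zone)
		if err == nil {
			return zone, nil
		}
		if !errors.Is(err, errZoneCapacity) {
			return "", err
		}
	}
	return "", errors.New("all zones out of capacity")
}

func main() {
	// Hypothetical zone lists, not our actual configuration.
	primary := []string{"us-east1-b", "us-east1-c"}
	alternate := []string{"us-central1-a", "us-central1-b"}

	zone, err := createWithFailover(primary, alternate)
	if err != nil {
		fmt.Println("could not provision VM:", err)
		return
	}
	fmt.Println("VM provisioned in", zone) // prints "VM provisioned in us-central1-a"
}
```

The important property is that capacity errors are treated as retryable in a different zone, while unrelated failures surface immediately instead of being masked by retries.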
What are we doing to make sure this doesn’t happen again?
We have deployed a number of fixes to address the tangential issues identified during this incident: our services now correctly handle zone capacity errors, and the error handling in our “destroy-vm” worker has been fixed. We have also adjusted our retry backoff times to respond more conservatively to rate limiting by GCP.
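As a rough sketch of what a more conservative retry policy can look like (the base delay, cap, and attempt count below are hypothetical values rather than the ones we ship, and `callGCP` is a stub), an exponential backoff with jitter waits progressively longer after each rate-limited response instead of hammering the API:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errRateLimited stands in for a rate-limiting response from GCP.
var errRateLimited = errors.New("rate limited by GCP")

// callGCP is a stub for whichever GCP API call is being retried.
func callGCP() error {
	return errRateLimited
}

// retryWithBackoff retries a rate-limited call, doubling the delay after
// each attempt (up to maxDelay) and adding random jitter so retries from
// many workers do not arrive in lockstep.
func retryWithBackoff(maxAttempts int, baseDelay, maxDelay time.Duration) error {
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := callGCP()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRateLimited) {
			return err // only rate-limiting errors are retried
		}
		sleep := delay + time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d rate limited; backing off %v\n", attempt, sleep)
		time.Sleep(sleep)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("still rate limited after %d attempts", maxAttempts)
}

func main() {
	// Hypothetical tuning: start at 500ms, cap at 10s, give up after 5 attempts.
	if err := retryWithBackoff(5, 500*time.Millisecond, 10*time.Second); err != nil {
		fmt.Println(err)
	}
}
```

The key change is that the delay grows with consecutive failures rather than staying fixed, so sustained rate limiting results in progressively less load on the API.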
Additionally, we have added instrumentation to the affected services to improve observability, particularly around error responses from GCP.
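As one example of what such instrumentation can look like (this assumes a Prometheus-style metrics setup and a hypothetical `gcp_api_errors_total` counter; it is not a description of our internal tooling), counting GCP error responses by error code makes a spike in capacity errors immediately visible on dashboards and alerts:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// gcpErrors counts error responses from GCP, labeled by error code, so a
// surge of capacity errors stands out from ordinary API failures.
var gcpErrors = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "gcp_api_errors_total",
	Help: "Error responses received from GCP, by error code.",
}, []string{"code"})

// recordGCPError is called wherever a GCP API call fails.
func recordGCPError(code string) {
	gcpErrors.WithLabelValues(code).Inc()
}

func main() {
	// Record an example error code of the kind seen during this incident.
	recordGCPError("ZONE_RESOURCE_POOL_EXHAUSTED")

	// Expose the metrics endpoint for scraping.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```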
Lastly, we are investigating the possibility of changing our default zones for machine executors and remote Docker instances to span regions rather than being colocated in a single region. We believe doing this will decrease our chances of hitting capacity in multiple zones simultaneously.
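To make that idea concrete, here is a small sketch (region and zone names are hypothetical) of building a default zone list that interleaves zones from several regions instead of listing one region’s zones first; with this ordering, a capacity problem confined to a single region no longer exhausts every early choice in the list:

```go
package main

import "fmt"

// interleaveZones builds a default zone list that alternates between
// regions, rather than listing every zone from a single region first.
func interleaveZones(regions map[string][]string, order []string) []string {
	var out []string
	for i := 0; ; i++ {
		added := false
		for _, region := range order {
			zones := regions[region]
			if i < len(zones) {
				out = append(out, zones[i])
				added = true
			}
		}
		if !added {
			return out
		}
	}
}

func main() {
	// Hypothetical regions and zones.
	regions := map[string][]string{
		"us-east1":    {"us-east1-b", "us-east1-c"},
		"us-central1": {"us-central1-a", "us-central1-b"},
	}
	fmt.Println(interleaveZones(regions, []string{"us-east1", "us-central1"}))
	// Output: [us-east1-b us-central1-a us-east1-c us-central1-b]
}
```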