Problems restoring Workspaces for some executors

Incident Report for CircleCI

Postmortem

Summary:

At 09:53 UTC on April 29th, a code change was deployed that caused all customer jobs running on Google Cloud Platform (GCP) with an attach_workspace step to fail. We reverted the change at 12:12 UTC, which resolved customer impact by 12:21 UTC. We thank our customers for their patience and understanding during this outage.

The original status page can be found here.

What Happened

All timestamps are UTC.

A code change was released at 09:53 that attempted to restore workspaces from Google Cloud Storage (GCS) for jobs running on GCP and to fall back to S3 in the case of any errors. Starting at 10:55, we began to receive support tickets from customers whose jobs were failing in the attach_workspace step. At 12:12 we reverted the pull request for the contributing code change and immediately observed attach_workspace errors declining.

We are currently implementing a change in our workspace service to write workspaces to two providers. At the time of the incident, the double-write was only activated for a handful of internal projects. The code change on the 29th attempted to download workspaces from GCS and fall back to an alternative provider if the download failed. Almost all calls to download a workspace returned a Not Found error, but a bug in the code caused a failed download attempt to be reported as successful, so the fallback was never triggered.

Therefore any job running on GCP that included an attach_workspace step would:

attempt to download the workspace from GCS;
receive a Not Found error;
erroneously report success due to the bug;
continue processing the step (without the required workspace present);
fail when something attempted to use the missing workspace.
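
The sequence above can be illustrated with a minimal Python sketch. This is not CircleCI's actual code: the DictStore class, the NotFound exception, and both restore functions are hypothetical names invented for illustration, and the in-memory stores only stand in for GCS and S3. The sketch contrasts a restore that swallows the primary error and reports success with one that falls back to the second provider on any error.

```python
class NotFound(Exception):
    """Raised by a store when the requested workspace does not exist."""


class DictStore:
    """Hypothetical in-memory stand-in for a storage provider (GCS or S3)."""

    def __init__(self, name, blobs):
        self.name = name
        self.blobs = dict(blobs)

    def download(self, key):
        if key not in self.blobs:
            raise NotFound(f"{key} not found in {self.name}")
        return self.blobs[key]


def restore_buggy(key, primary, fallback):
    """Illustrates the failure mode: the error is swallowed and success is reported."""
    try:
        data = primary.download(key)
    except Exception:
        data = None  # bug: the failed download is treated as a completed restore
    return {"ok": True, "data": data, "source": primary.name}


def restore_fixed(key, primary, fallback):
    """Falls back to the second provider whenever the primary download fails."""
    try:
        return {"ok": True, "data": primary.download(key), "source": primary.name}
    except Exception:
        # An error from the fallback propagates, so the step fails loudly.
        return {"ok": True, "data": fallback.download(key), "source": fallback.name}


gcs = DictStore("gcs", {})  # assumption: this workspace was never written to GCS
s3 = DictStore("s3", {"ws-123": b"workspace-archive"})

buggy = restore_buggy("ws-123", gcs, s3)
fixed = restore_fixed("ws-123", gcs, s3)
print(buggy["ok"], buggy["data"])
print(fixed["source"], fixed["data"])
```

In the buggy variant the caller sees a successful result with no workspace data, which matches the pattern of a job continuing past the step and failing later. In the corrected variant the same request is served from the fallback store.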

Future Prevention and Process Improvement:

The incident revealed several issues in our detection and response processes. We had monitoring for attach_workspace failures but no alerting; we have since added those alerts. We have updated automated testing to validate the expected behavior for this specific change. We have also fixed a glitch in our rollback script that prevented us from reverting the change faster.
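
As an illustration of the difference between monitoring a failure metric and alerting on it, here is a minimal sketch of a sliding-window failure-rate alert in Python. It does not describe the alerts that were actually added: the FailureRateAlert class, the window size, the threshold, and the minimum sample count are all hypothetical values chosen for the example.

```python
from collections import deque


class FailureRateAlert:
    """Fires when the failure ratio over the last `window` steps exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.2, min_samples=20):
        self.results = deque(maxlen=window)  # oldest outcomes drop off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, succeeded):
        """Record one step outcome and return True if the alert should fire."""
        self.results.append(bool(succeeded))
        if len(self.results) < self.min_samples:
            return False  # too little data to judge
        failures = self.results.count(False)
        return failures / len(self.results) > self.threshold


# Hypothetical traffic: a healthy period followed by a run of failures.
alert = FailureRateAlert(window=50, threshold=0.2, min_samples=10)
healthy = [alert.record(True) for _ in range(40)]
degraded = [alert.record(False) for _ in range(15)]
print(any(healthy), any(degraded))
```

The minimum sample count keeps a single early failure from paging anyone, while the bounded window lets the alert clear on its own once the failure rate drops.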

We once again thank our customers for their patience as we worked to resolve this issue.

Posted May 09, 2022 - 15:18 UTC

Resolved

Restoring workspaces has returned to normal operation. Thank you for your patience throughout.

Posted Apr 29, 2022 - 12:45 UTC

Monitoring

We have identified the cause and implemented a fix. We are monitoring ongoing performance.

Posted Apr 29, 2022 - 12:26 UTC

Identified

We are observing some failures when restoring Workspaces on some executors including Machine and Windows. We have identified the cause and are working to resolve the issue.

Posted Apr 29, 2022 - 12:16 UTC

This incident affected: Machine Jobs and Windows Jobs.