At 09:53 UTC on April 29th a code change was deployed that resulted in all customer jobs using an attach_workspace
step running on Google Cloud Platform (GCP) to fail. We rolled back the deployment immediately, which resolved customer impact by 12:21 UTC. We thank our customers for their patience and understanding during this outage.
The original status page can be found here.
All timestamps are UTC.
A code change was released at 09:53 which would attempt to restore workspaces from Google Cloud Storage (GCS) for jobs running on GCP, and fail back to S3 in the case of any errors. Starting at 10:55 we began to receive support tickets from customers experiencing job failures due to failures in the attach_workspace
step. At 12:12 we reverted the pull request for the contributing code change and immediately observed attach_workspace
errors declining.
We are currently implementing a change in our workspace service to write workspaces to two providers. At the time of the incident, the double-write was only activated for a handful of internal projects. The code change on the 29th attempted to download workspaces from GCP and fallback to an alternative provider if the download failed. A bug in the code caused a failed download attempt to report as successful, so the failover was never triggered and almost all calls to download a workspace reported a Not Found
error.
Therefore any job running on GCP that included an attach_workspace
step would:
Not Found
error;The incident revealed several issues in our detection and response processes. We have monitoring for attach_workspace
failures but no alerting and have since added those alerts. We have updated automated testing to validate the expected behavior for this specific change. And we have fixed a glitch in our rollback script that prevented us from reverting the change faster.
We once again thank our customers for their patience as we worked to resolve this issue.