Failure to download cache, artifacts and workspaces

Incident Report for CircleCI

Postmortem

Summary:

On April 20th, 2022, from 21:50 UTC to 23:27 UTC, CircleCI customers saw increased timeouts and errors when loading caches, artifacts, and workspaces, and users could not access the site at app.circleci.com or app.circleci.com/dashboards. The cause was a change to our distributed tracing infrastructure that unexpectedly impacted our API Gateway.

Thank you for your understanding and patience as we worked to resolve this issue.

The original status page can be found here.

What Happened

All timestamps are in UTC.

We were migrating our distributed tracing infrastructure from OpenCensus (which has been deprecated) to OpenTelemetry. At 21:50 the phased rollout was completed. At 21:52 our monitoring alerted us to an increase in API response latency.

We found that this was caused by a misconfiguration of OpenTelemetry that prevented our API Gateway from reporting tracing data. Because of a limitation in the gateway component responsible for tracing, it uses a different protocol and routing than our other services that report tracing data. This protocol and routing hadn’t been set up in OpenTelemetry.

The tracing component reports data asynchronously from request handling, but it did so with insufficient timeouts. As a result, too many resources were held while waiting for tracing submissions to fail, which eventually impacted request handling.
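To illustrate the failure mode described above, here is a minimal Python sketch of fire-and-forget trace submission that bounds both how long each submission may wait and how many may be pending at once. It is not the gateway's actual code: the `TraceReporter` class, the `send` callable, and the `timeout_s` and `max_in_flight` parameters are all hypothetical names chosen for this example.

```python
import asyncio


class TraceReporter:
    """Hypothetical reporter: submits spans in the background with bounded waits."""

    def __init__(self, send, timeout_s=0.5, max_in_flight=100):
        self._send = send                    # async callable that ships one span (assumed)
        self._timeout_s = timeout_s          # upper bound on a single submission
        self._max_in_flight = max_in_flight  # upper bound on pending submissions
        self._tasks = set()                  # keep references so tasks are not garbage collected
        self.in_flight = 0
        self.dropped = 0

    def report(self, span):
        """Schedule a submission without blocking the request handler."""
        if self.in_flight >= self._max_in_flight:
            self.dropped += 1                # shed tracing load instead of queueing forever
            return
        self.in_flight += 1
        task = asyncio.create_task(self._submit(span))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _submit(self, span):
        try:
            # Give up after timeout_s so a dead backend cannot hold resources.
            await asyncio.wait_for(self._send(span), timeout=self._timeout_s)
        except (asyncio.TimeoutError, OSError):
            self.dropped += 1
        finally:
            self.in_flight -= 1


async def unreachable_backend(span):
    """Simulates a tracing backend that never answers."""
    await asyncio.sleep(3600)


async def demo():
    reporter = TraceReporter(unreachable_backend, timeout_s=0.05, max_in_flight=2)
    for i in range(5):
        reporter.report({"id": i})
    await asyncio.sleep(0.2)                 # long enough for pending submissions to time out
    return reporter.dropped, reporter.in_flight


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design choice in this sketch is that tracing data is treated as expendable: when the backend is unreachable, spans are dropped quickly rather than allowed to accumulate alongside request handling.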

Some clients, particularly ones that we own and have short timeouts, gave up and reported errors.

The exhaustion of resources also meant that other gateway components such as authentication were affected and reported errors to clients.

At 22:29, the old routing was deleted and new routing was created for OpenTelemetry. We manually modified the gateway component to send traces to OpenTelemetry to alleviate pressure. Timeouts decreased, but because of an omitted configuration we were still seeing connection errors and were not yet sending traces.

At 22:55, this configuration was added to enable the appropriate receiver on OpenTelemetry, which resolved connection errors. We monitored and began seeing positive results from these changes. Due to portions of the gateway services being oversaturated for so long, their recovery was slower than desired.

At 23:04, we created a PR to make our manual change permanent and to force a deployment of the API Gateway, removing the saturated instances and instantiating new ones. By 23:21, new instances of the gateway began to serve traffic and systems were looking healthier.

At 23:27, we moved the incident status to monitoring and marked systems operational. We stayed in monitoring for 20 minutes to confirm all was well.

At 23:48, this incident was marked as resolved.

Future Prevention and Process Improvement:

We have set timeouts on all gateway components that use asynchronous network calls in order to prevent this from happening in the future. Whilst we can’t change the tracing protocol, we will be changing the routing so that it works similarly to the other services that report tracing data. In addition, we are improving tests and observability that would have been valuable during and after this incident.

Posted May 06, 2022 - 21:58 UTC

Resolved

The incident has been resolved. Thank you for your patience.

Posted Apr 20, 2022 - 23:48 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 20, 2022 - 23:27 UTC

Update

We are continuing to work on a fix for this issue.

Posted Apr 20, 2022 - 23:11 UTC

Identified

We are seeing failures in downloading cache and artifacts. We have identified the issue and are implementing a fix. Thank you for your patience.

Posted Apr 20, 2022 - 21:50 UTC

This incident affected: Machine Jobs, CircleCI UI, Artifacts, and Runner.