CircleCI UI unavailable, API still accessible
Incident Report for CircleCI
Postmortem

What Happened:

On March 11, 2021, at 15:35 EST, CircleCI experienced an outage lasting 1 hour and 54 minutes. The outage was due to an update made to a test Kubernetes cluster which inadvertently impacted CircleCI's primary production cluster.

The deployment to the test cluster removed Traefik and Calico resources from both the production and test clusters. The removal of Traefik led to the CircleCI UI and runner.circleci.com being unavailable. However, it is important to note that the API uses a separate ingress point and was unaffected.

Upon being alerted, CircleCI engineers began to investigate the cause of the issue. It was not immediately clear that a deployment to staging had impacted the production cluster, so DNS- and AWS-related red herrings were investigated first. A deeper investigation revealed the issues with Traefik, which was quickly restored. However, this did not address all of the issues. At this point, our engineers discovered that the link between Calico and Traefik was the root cause of the outage. Once this was determined, Calico was restored, and pods in the production cluster returned to servicing customer requests.

What we learned and our next steps:

There are several takeaways from this situation which our engineering teams are actively addressing:

  • Documentation: We need to improve our network-level documentation and runbooks to account for all the layers of complexity that exist within our k8s network(s).
  • Alerting: We have improved monitoring and alerting as they relate to our network infrastructure.
  • Guardrails: We are adding further guardrails to better isolate our production and staging clusters to ensure they don’t interfere with one another.
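
As an illustration only, and not a description of CircleCI's actual tooling, a guardrail of the kind mentioned in the last item could be sketched in Python as follows. Every name here (`Deployment`, `check_target`, `apply`, and the cluster labels) is hypothetical. The sketch assumes each deployment declares the cluster it is intended for and that the active cluster is known at apply time.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class ClusterMismatchError(Exception):
    """Raised when a deployment would run against an unintended cluster."""


@dataclass(frozen=True)
class Deployment:
    # Hypothetical record: a deployment name plus the cluster it is meant for.
    name: str
    intended_cluster: str  # e.g. "test" or "production" (assumed labels)


def check_target(deployment: Deployment, active_cluster: str) -> None:
    """Refuse to proceed unless the active cluster is the intended one."""
    if deployment.intended_cluster != active_cluster:
        raise ClusterMismatchError(
            f"{deployment.name} is intended for "
            f"{deployment.intended_cluster!r} but the active cluster is "
            f"{active_cluster!r}"
        )


def apply(
    deployment: Deployment,
    active_cluster: str,
    apply_fn: Callable[[Deployment], T],
) -> T:
    """Run apply_fn only after the cluster check has passed."""
    check_target(deployment, active_cluster)
    return apply_fn(deployment)
```

With a check like this in front of every change, a deployment declared for the test cluster would fail fast instead of modifying resources in production.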
Posted Mar 25, 2021 - 19:36 UTC

Resolved
We have applied a fix, and customers should begin to be able to access the CircleCI UI. Because this is DNS-related, some customers may need to clear the DNS cache of their web browser. Other customers may experience a slight delay before the update reaches their location.

If customers continue to experience issues accessing the CircleCI UI, please reach out to CircleCI support for additional assistance.

This issue is now resolved.
Posted Mar 11, 2021 - 22:33 UTC
Update
We are continuing to work to resolve the issue with accessing the CircleCI UI. All builds outside of CircleCI Runner are running as expected.
Posted Mar 11, 2021 - 22:03 UTC
Update
Additional investigation into the issue has uncovered that Runner jobs are not running as part of this incident. We are working on that aspect as well.
Posted Mar 11, 2021 - 21:34 UTC
Identified
We have identified what is causing the issue and are currently working on a resolution. As noted previously, the API is still accessible, and builds are still actively running.
Posted Mar 11, 2021 - 21:10 UTC
Investigating
We are currently investigating an issue with the CircleCI UI being unavailable. However, the API is still accessible, and builds are still running.
Posted Mar 11, 2021 - 20:45 UTC
This incident affected: CircleCI UI and Runner.