On March 11, 2021, at 15:35 EST, CircleCI experienced an outage lasting 1 hour and 54 minutes. The outage was due to an update made to a test Kubernetes cluster which inadvertently impacted CircleCI's primary production cluster.
The deployment to the test cluster removed Traefik and Calico resources to both the production and test clusters. The removal of Traefik led to the CircleCI UI and runner.circleci.com being unavailable. However, it is important to note that the API uses a separate ingress point and was unaffected.
Upon being alerted, CircleCI engineers began to investigate the cause of the issue. It was not immediately clear that a deployment to staging had impacted the production cluster, leading to DNS and AWS-related red herrings being investigated first. A deeper investigation revealed the issues with Traefik, which were quickly restored. However, this did not address all of the issues. At this point, our engineers discovered the link between Calico and Traefik was the root cause of the outage. Once this was determined, Calico was restored, and pods in the production cluster returned to servicing customer requests.
There are several takeaways from this situation which our engineering teams are actively addressing: