Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected multiple areas of our application.
The underlying issues for each incident are largely independent; we have detailed each incident separately on our status page, and the full report is available here. Each of the six incidents has been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
What happened?
On Oct 28th, 2019 at 13:36 UTC, Datadog Synthetics monitors alerted us to issues loading UI pages. The tests showed intermittent failures, and subsequent manual and automated testing showed a fractional but relatively consistent error rate. A set of Nginx instances responsible for proxying this traffic were found to be failing their Kubernetes readiness checks. The Nginx proxy pods were restarted and the issue was resolved at 14:30 UTC.
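For illustration, a quick manual check for this kind of fractional-but-consistent error level can be as simple as the Python sketch below. The endpoint URL, sample count, and threshold are hypothetical assumptions; this is not the monitoring or tooling we actually use.

```python
# Minimal sketch: probe an endpoint repeatedly and report the error rate.
# The URL, sample count, and threshold below are illustrative assumptions.
import requests

ENDPOINT = "https://example.com/dashboard"   # hypothetical UI page
SAMPLES = 200
ERROR_THRESHOLD = 0.01                       # flag anything above 1% errors

def measure_error_rate() -> float:
    errors = 0
    for _ in range(SAMPLES):
        try:
            resp = requests.get(ENDPOINT, timeout=5)
            if resp.status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
    return errors / SAMPLES

if __name__ == "__main__":
    rate = measure_error_rate()
    print(f"error rate: {rate:.2%}")
    if rate > ERROR_THRESHOLD:
        print("fractional but consistent failures detected")
```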
At 17:30 UTC, a deploy of a key service triggered a substantial increase in API errors, both internally and externally, causing a brief period of elevated external API errors from approximately 17:33 UTC to 18:01 UTC. We again discovered that a number of Nginx proxy instances were unhealthy. Subsequent investigation revealed that the underlying cause was the eviction of kube-proxy on a small number of Kubernetes workers. Without kube-proxy running, routing rules on the affected hosts were no longer updated in response to service deployments, so the Nginx proxy instances on those hosts could not reach the backing service after the deploy. We moved the affected Nginx proxy instances to healthy hosts and the API recovered at 18:00 UTC.
API Errors
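For readers unfamiliar with this failure mode: when kube-proxy is not running on a worker, that worker's routing rules stop tracking Service endpoint changes, so pods on it keep sending traffic to stale destinations. The sketch below shows one way to spot workers with no running kube-proxy using the official Kubernetes Python client; it assumes kube-proxy runs as a DaemonSet in kube-system with the conventional k8s-app=kube-proxy label, and it is not the tooling we used during the incident.

```python
# Sketch: find Kubernetes workers with no running kube-proxy pod.
# Assumes kube-proxy runs as a DaemonSet in kube-system with the
# conventional "k8s-app=kube-proxy" label; adjust for your cluster.
from kubernetes import client, config

def nodes_missing_kube_proxy() -> list[str]:
    config.load_kube_config()          # or config.load_incluster_config()
    v1 = client.CoreV1Api()

    all_nodes = {n.metadata.name for n in v1.list_node().items}
    proxy_pods = v1.list_namespaced_pod(
        "kube-system", label_selector="k8s-app=kube-proxy"
    ).items
    healthy_nodes = {
        p.spec.node_name for p in proxy_pods if p.status.phase == "Running"
    }
    return sorted(all_nodes - healthy_nodes)

if __name__ == "__main__":
    for node in nodes_missing_kube_proxy():
        # Any node listed here has stale routing rules: Services deployed
        # after the eviction will not be reachable from pods on it.
        print(f"kube-proxy missing on {node}")
```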
Internally, the request errors led to a backlog in a system called the run queue. However, even after the API errors were resolved, the backlog in this queue persisted. At 18:19 UTC we identified the cause to be excess load on the database instance storing this queue, which had reached 100% CPU utilization. A number of attempts were made to reduce load on this database by scaling down services and limiting the rate at which work was added to the run queue, but these were unsuccessful.
Run queue growing
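As an illustration of the second mitigation, limiting the rate at which work enters a queue is commonly done with a token bucket; the minimal Python sketch below shows the idea. The class name, rates, and the enqueue call are hypothetical and are not CircleCI's actual run queue code.

```python
# Minimal token-bucket sketch for throttling how fast work is admitted
# to a queue. Names and rates are hypothetical, purely for illustration.
import time

class EnqueueRateLimiter:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec       # tokens added per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = EnqueueRateLimiter(rate_per_sec=50, burst=100)

def enqueue_run(run) -> bool:
    # Only admit work when the limiter allows it; callers can retry later.
    if limiter.try_acquire():
        # run_queue.push(run)  # hypothetical enqueue call
        return True
    return False
```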
By 20:00 UTC we had identified the underlying cause to be a database query whose performance had degraded progressively as the run queue grew. At this time, we began work to develop and deploy a more performant query. At 20:53 UTC an updated query was applied, but no change in performance was observed. At 21:18 UTC a new database index was proposed and added, but it also failed to improve performance. Finally, at 21:40 UTC a new, more performant query was applied and throughput improved immediately.
Run queue clearing
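To make "progressively worse as the run queue grew" concrete, the sketch below uses SQLite to time a dequeue-style query at two queue depths, first without and then with a supporting index. The schema, query, and index are illustrative assumptions; they are not our production schema or the actual fix.

```python
# SQLite sketch: a dequeue-style query whose cost grows with queue depth,
# and the index that keeps it flat. Schema and query are illustrative only.
import sqlite3
import time

def populate(conn, rows: int):
    conn.execute("""CREATE TABLE run_queue (
        id INTEGER PRIMARY KEY,
        state TEXT NOT NULL,
        created_at INTEGER NOT NULL)""")
    conn.executemany(
        "INSERT INTO run_queue (state, created_at) VALUES (?, ?)",
        (("pending", i) for i in range(rows)),
    )

def time_dequeue(conn) -> float:
    start = time.perf_counter()
    for _ in range(200):
        conn.execute(
            "SELECT id FROM run_queue "
            "WHERE state = 'pending' ORDER BY created_at LIMIT 1"
        ).fetchone()
    return time.perf_counter() - start

for depth in (1_000, 100_000):
    conn = sqlite3.connect(":memory:")
    populate(conn, depth)
    before = time_dequeue(conn)       # must examine every pending row
    conn.execute(
        "CREATE INDEX idx_queue_pending ON run_queue (state, created_at)"
    )
    after = time_dequeue(conn)        # index reads only the row it needs
    print(f"depth={depth:>7}: no index {before:.3f}s, with index {after:.3f}s")
```

Without the index, the query has to examine every pending row to find the oldest one, so its cost tracks queue depth; with a query shape the index can satisfy, the cost stays roughly constant as the queue grows.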
At this point, a number of other systems were scaled up to help drain the backlog of delayed jobs. By 22:20 UTC the run queue had been cleared, but additional delays remained in downstream systems as the backlog was processed. By 23:09 UTC, delays for non-macOS jobs returned to normal levels. By 23:33 UTC, delays for macOS jobs also returned to normal levels and the incident was declared resolved.
Who was affected?
The first UI issue affected users accessing our UI, particularly pages that display user settings or onboarding data. The second UI issue affected all customers using CircleCI during this time period. Complex workflows were more likely to be severely affected because every serial job was delayed. macOS users experienced delays even after the run queue issue was fixed.
What did we do to resolve the issue?
We altered a poorly performing query and temporarily increased concurrency to flush delayed work through the system.
What are we doing to make sure this doesn’t happen again?
Following the first UI incident, continued investigation determined that some kube-proxy pods were being evicted from Kubernetes workers. We have updated the configuration of these pods so that they are much less likely to be evicted and, if evicted, will recover automatically.
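As a general illustration of this kind of hardening (not our exact change), the sketch below shows pod-spec settings that make a Kubernetes pod much harder to evict: a high priority class, and identical resource requests and limits, which place the pod in the Guaranteed QoS class. Running the pod from a DaemonSet means a replacement is scheduled automatically if it is ever removed. The manifest is written as a Python dict to stay consistent with the other sketches; the image tag and resource values are placeholders.

```python
# Illustrative pod-spec settings that make a pod hard to evict and quick to
# recover; this is a generic example, not CircleCI's actual configuration.
kube_proxy_pod_spec = {
    # A high priority class means almost everything else is evicted first.
    "priorityClassName": "system-node-critical",
    "containers": [{
        "name": "kube-proxy",
        "image": "k8s.gcr.io/kube-proxy:v1.16.0",   # version is illustrative
        # Identical requests and limits give the pod the Guaranteed QoS class,
        # the last tier considered for node-pressure eviction.
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits":   {"cpu": "100m", "memory": "128Mi"},
        },
    }],
}
# Running the pod from a DaemonSet (rather than as a bare pod) means a new
# copy is scheduled automatically if the existing one is ever removed.
```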
In response to the run queue delays, we have load-tested and optimized the affected query and improved our internal guidance for how to test and roll out changes to hot-path queries.