Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected multiple areas of our application.
The underlying issues for each incident are largely independent; we have detailed each incident separately on our status page, and the full report is available here. Each of the six incidents has been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
What happened?
On Oct 28th, 2019 at 13:36 UTC, Datadog Synthetics monitors alerted us to issues loading UI pages. The tests showed intermittent failures, and subsequent manual and automated testing showed a fractional but relatively consistent error rate. A set of Nginx instances responsible for proxying this traffic were found to be failing their Kubernetes readiness checks. The Nginx proxy pods were restarted and the issue was resolved at 14:30 UTC.
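For illustration, a quick manual check for this kind of fractional-but-consistent error level can be as simple as the Python sketch below. The endpoint URL, sample count, and threshold are hypothetical assumptions; this is not the monitoring or tooling we actually use.

```python
# Minimal sketch: probe an endpoint repeatedly and report the error rate.
# The URL, sample count, and threshold below are illustrative assumptions.
import requests

ENDPOINT = "https://example.com/dashboard"   # hypothetical UI page
SAMPLES = 200
ERROR_THRESHOLD = 0.01                       # flag anything above 1% errors

def measure_error_rate() -> float:
    errors = 0
    for _ in range(SAMPLES):
        try:
            resp = requests.get(ENDPOINT, timeout=5)
            if resp.status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
    return errors / SAMPLES

if __name__ == "__main__":
    rate = measure_error_rate()
    print(f"error rate: {rate:.2%}")
    if rate > ERROR_THRESHOLD:
        print("fractional but consistent failures detected")
```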
At 17:30 UTC, a deploy of a key service triggered a substantial increase in API errors, both internally and externally, causing a brief period of elevated external API errors from approximately 17:33 UTC to 18:01 UTC. We again discovered that a number of Nginx proxy instances were unhealthy. Subsequent investigation revealed that the underlying cause was the eviction of kube-proxy on a small number of Kubernetes workers. Without kube-proxy running, routing rules on the affected hosts were no longer updated in response to service deployments, so the Nginx proxy instances on those hosts could not reach the backing service after the deploy. We moved the affected Nginx proxy instances to healthy hosts and the API recovered at 18:00 UTC.
API Errors
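For readers unfamiliar with this failure mode: when kube-proxy is not running on a worker, that worker's routing rules stop tracking Service endpoint changes, so pods on it keep sending traffic to stale destinations. The sketch below shows one way to spot workers with no running kube-proxy using the official Kubernetes Python client; it assumes kube-proxy runs as a DaemonSet in kube-system with the conventional k8s-app=kube-proxy label, and it is not the tooling we used during the incident.

```python
# Sketch: find Kubernetes workers with no running kube-proxy pod.
# Assumes kube-proxy runs as a DaemonSet in kube-system with the
# conventional "k8s-app=kube-proxy" label; adjust for your cluster.
from kubernetes import client, config

def nodes_missing_kube_proxy() -> list[str]:
    config.load_kube_config()          # or config.load_incluster_config()
    v1 = client.CoreV1Api()

    all_nodes = {n.metadata.name for n in v1.list_node().items}
    proxy_pods = v1.list_namespaced_pod(
        "kube-system", label_selector="k8s-app=kube-proxy"
    ).items
    healthy_nodes = {
        p.spec.node_name for p in proxy_pods if p.status.phase == "Running"
    }
    return sorted(all_nodes - healthy_nodes)

if __name__ == "__main__":
    for node in nodes_missing_kube_proxy():
        # Any node listed here has stale routing rules: Services deployed
        # after the eviction will not be reachable from pods on it.
        print(f"kube-proxy missing on {node}")
```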
Internally, the request errors led to a backlog in a system called the run queue. However, even after the API errors were resolved, the backlog in this queue persisted. At 18:19 UTC we identified the cause to be excess load on the database instance storing this queue, which had reached 100% CPU utilization. A number of attempts were made to reduce load on this database by scaling down services and limiting the rate at which work was added to the run queue, but these were unsuccessful.
Run queue growing
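As an illustration of the second mitigation, limiting the rate at which work enters a queue is commonly done with a token bucket; the minimal Python sketch below shows the idea. The class name, rates, and the enqueue call are hypothetical and are not CircleCI's actual run queue code.

```python
# Minimal token-bucket sketch for throttling how fast work is admitted
# to a queue. Names and rates are hypothetical, purely for illustration.
import time

class EnqueueRateLimiter:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec       # tokens added per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = EnqueueRateLimiter(rate_per_sec=50, burst=100)

def enqueue_run(run) -> bool:
    # Only admit work when the limiter allows it; callers can retry later.
    if limiter.try_acquire():
        # run_queue.push(run)  # hypothetical enqueue call
        return True
    return False
```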
By 20:00 UTC we had identified the underlying cause to be a database query whose performance had degraded progressively as the run queue grew. At this time, we began work to develop and deploy a more performant query. At 20:53 UTC an updated query was applied, but no change in performance was observed. At 21:18 UTC a new database index was proposed and added, but it also failed to improve performance. Finally, at 21:40 UTC a new, more performant query was applied and throughput improved immediately.
Run queue clearing
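To make "progressively worse as the run queue grew" concrete, the sketch below uses SQLite to time a dequeue-style query at two queue depths, first without and then with a supporting index. The schema, query, and index are illustrative assumptions; they are not our production schema or the actual fix.

```python
# SQLite sketch: a dequeue-style query whose cost grows with queue depth,
# and the index that keeps it flat. Schema and query are illustrative only.
import sqlite3
import time

def populate(conn, rows: int):
    conn.execute("""CREATE TABLE run_queue (
        id INTEGER PRIMARY KEY,
        state TEXT NOT NULL,
        created_at INTEGER NOT NULL)""")
    conn.executemany(
        "INSERT INTO run_queue (state, created_at) VALUES (?, ?)",
        (("pending", i) for i in range(rows)),
    )

def time_dequeue(conn) -> float:
    start = time.perf_counter()
    for _ in range(200):
        conn.execute(
            "SELECT id FROM run_queue "
            "WHERE state = 'pending' ORDER BY created_at LIMIT 1"
        ).fetchone()
    return time.perf_counter() - start

for depth in (1_000, 100_000):
    conn = sqlite3.connect(":memory:")
    populate(conn, depth)
    before = time_dequeue(conn)       # must examine every pending row
    conn.execute(
        "CREATE INDEX idx_queue_pending ON run_queue (state, created_at)"
    )
    after = time_dequeue(conn)        # index reads only the row it needs
    print(f"depth={depth:>7}: no index {before:.3f}s, with index {after:.3f}s")
```

Without the index, the query has to examine every pending row to find the oldest one, so its cost tracks queue depth; with a query shape the index can satisfy, the cost stays roughly constant as the queue grows.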
At this point, a number of other systems were scaled up to help drain the backlog of delayed jobs. By 22:20 UTC the run queue had been cleared, but additional delays remained in downstream systems as the backlog was processed. By 23:09 UTC, delays for non-macOS jobs returned to normal levels. By 23:33 UTC, delays for macOS jobs also returned to normal levels and the incident was declared resolved.
Who was affected?
The first UI issue affected users accessing our UI, particularly pages that display user settings or onboarding data. The second UI issue affected all customers using CircleCI during this time period. Complex workflows were more likely to be severely affected because every serial job was delayed. macOS users experienced delays even after the run queue issue was fixed.
What did we do to resolve the issue?
We altered a poorly performing query and temporarily increased concurrency to flush delayed work through the system.
What are we doing to make sure this doesn’t happen again?
Following the first UI incident, continued investigation determined that some kube-proxy pods were being evicted from Kubernetes workers. We have updated the configuration of these pods so that they are much less likely to be evicted and, if evicted, will recover automatically.
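As a general illustration of this kind of hardening (not our exact change), the sketch below shows pod-spec settings that make a Kubernetes pod much harder to evict: a high priority class, and identical resource requests and limits, which place the pod in the Guaranteed QoS class. Running the pod from a DaemonSet means a replacement is scheduled automatically if it is ever removed. The manifest is written as a Python dict to stay consistent with the other sketches; the image tag and resource values are placeholders.

```python
# Illustrative pod-spec settings that make a pod hard to evict and quick to
# recover; this is a generic example, not CircleCI's actual configuration.
kube_proxy_pod_spec = {
    # A high priority class means almost everything else is evicted first.
    "priorityClassName": "system-node-critical",
    "containers": [{
        "name": "kube-proxy",
        "image": "k8s.gcr.io/kube-proxy:v1.16.0",   # version is illustrative
        # Identical requests and limits give the pod the Guaranteed QoS class,
        # the last tier considered for node-pressure eviction.
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits":   {"cpu": "100m", "memory": "128Mi"},
        },
    }],
}
# Running the pod from a DaemonSet (rather than as a bare pod) means a new
# copy is scheduled automatically if the existing one is ever removed.
```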
In response to the run queue delays, we have load-tested and optimized the affected query and improved our internal guidance for how to test and roll out changes to hot-path queries.