Circle 2.0 Workflows not executing
Incident Report for CircleCI
Postmortem

On May 17 between 16:35 and 23:47 UTC we experienced an outage that caused Workflows to be unavailable or degraded for the majority of customers.

We identified the root cause of the incident as our database service hitting its scale limits, which increased the wait time for queries. As a result, our backend was unable to keep up with the growing backlog, and customers experienced long waits to run Workflows. We addressed the resource limits immediately by increasing capacity for the database. This process took 2 hours and 20 minutes, and once it was complete we were able to restore normal operations. We scaled up our build job and Machine execution fleets to handle the higher volume, whereas our macOS fleet remained degraded for longer because of its limited capacity.

In the weeks since the incident, we have taken several measures to prevent this failure from happening again:

  • We monitor our database utilization more rigorously.
  • We have split up queries across multiple database instances to distribute the load better.
  • We have been improving the efficiency of our queries and adding the ability to track down long-running and resource-intensive queries.
  • We are also prioritizing long-term architectural improvements, such as making the User Interface resilient to backend outages, so that each component can remain available independently of the others.
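
To illustrate what tracking down long-running queries can look like, here is a minimal Python sketch. It is not CircleCI's implementation: the class name `SlowQueryLog`, the threshold value, and the query labels are all hypothetical, and a real system would more likely rely on the database's own slow-query logging.

```python
import time
from contextlib import contextmanager


class SlowQueryLog:
    """Hypothetical helper: records queries whose wall-clock time
    meets or exceeds a threshold."""

    def __init__(self, threshold_seconds, clock=time.monotonic):
        self.threshold_seconds = threshold_seconds
        self.clock = clock  # injectable so the timing can be tested
        self.slow = []      # list of (label, elapsed_seconds)

    @contextmanager
    def track(self, label):
        # Time whatever runs inside the `with` block.
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            if elapsed >= self.threshold_seconds:
                self.slow.append((label, elapsed))
```

A caller would wrap each database call, for example `with log.track("load_workflow"): ...`, and periodically inspect `log.slow` to find the queries worth optimizing.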
Posted Jul 13, 2018 - 16:27 UTC

Resolved
We have worked through all of the backlog from the incident. Thank you all for your patience during this incident.
Posted May 17, 2018 - 23:51 UTC
Update
We are continuing to reduce the macOS backlog and are monitoring. We will update in 20 minutes.
Posted May 17, 2018 - 23:37 UTC
Update
We are continuing to reduce the backlog of jobs. We will update again in 20 minutes.
Posted May 17, 2018 - 23:07 UTC
Monitoring
We are continuing to work through the backlog of jobs and are now monitoring. We will update again in 20 minutes.
Posted May 17, 2018 - 22:26 UTC
Update
We are continuing to process Workflows jobs and will provide an update in 20 minutes.
Posted May 17, 2018 - 22:16 UTC
Update
We are currently processing jobs in the Workflows queue and working on resuming normal service.
Posted May 17, 2018 - 21:54 UTC
Update
We are processing Workflows jobs again in small numbers and continuing to work on a fix to resume normal services.
Posted May 17, 2018 - 21:27 UTC
Update
We are continuing to work on implementing a fix and will provide an update in 20 minutes.
Posted May 17, 2018 - 21:15 UTC
Update
We are continuing to work on implementing a fix and will provide an update in 20 minutes.
Posted May 17, 2018 - 20:53 UTC
Update
We are still working on a fix for our databases and will provide another update in 20 minutes.
Posted May 17, 2018 - 20:29 UTC
Update
We are continuing to work on implementing a fix with our databases and will provide an additional update in 20 minutes.
Posted May 17, 2018 - 20:08 UTC
Update
We are continuing to work on implementing a fix with our databases and will provide an additional update in 20 minutes.
Posted May 17, 2018 - 19:47 UTC
Update
We are working to increase capacity on one of our databases. Until this work is complete, we will attempt to process queued Workflows at a degraded rate.
Posted May 17, 2018 - 19:25 UTC
Update
We are continuing to work on implementing the fix and will provide another update in 20 minutes.
Posted May 17, 2018 - 19:02 UTC
Update
We are continuing to work on implementing the fix and will provide another update in 20 minutes.
Posted May 17, 2018 - 18:41 UTC
Update
We are temporarily stopping the processing of Workflows jobs and will provide an update within 20 minutes.
Posted May 17, 2018 - 18:20 UTC
Identified
We have identified the cause of the issue and are currently working on implementing a fix.
Posted May 17, 2018 - 18:02 UTC
Update
We are continuing to investigate the cause of this issue with Workflows and will provide further updates in 20 minutes.
Posted May 17, 2018 - 17:45 UTC
Update
We are continuing to investigate the cause of this issue with Workflows and will provide further updates in 20 minutes.
Posted May 17, 2018 - 17:25 UTC
Update
We are continuing to investigate a problem with executing Workflows and will provide an update in 20 minutes.
Posted May 17, 2018 - 17:06 UTC
Investigating
We are investigating a problem with Circle 2.0 Workflow execution.
Posted May 17, 2018 - 16:44 UTC
This incident affected: Docker Jobs, macOS Jobs, and Pipelines & Workflows.