Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected multiple areas of our application.
The underlying issues for each incident are largely independent; we have detailed each incident separately on our status page, and the full report is available here. Each of the 6 incidents has been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
What happened?
On October 18th, 2019 at 00:05 UTC, we received a notification from AWS that a MongoDB replica set member was running on an EC2 instance that was suffering from hardware degradation and scheduled to be decommissioned. At 18:54 UTC the same day, we began receiving exceptions for connection timeouts to that MongoDB replica. Our site reliability engineering (SRE) team hid the replica and shut down the EC2 instance. Exceptions stopped occurring and service returned to normal. The instance was left shut down so the SRE team could schedule an appropriate time to migrate it to new hardware.
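For context, hiding a replica set member in MongoDB is a change to the replica set configuration: the member is marked hidden and given priority 0 so that clients stop routing reads to it while it continues to replicate. The sketch below shows roughly what such a reconfiguration looks like with the Python driver; the hostnames are hypothetical, and this is an illustration rather than the exact procedure our SRE team ran.

```python
from pymongo import MongoClient

# Connect to the replica set primary (hostname is hypothetical).
client = MongoClient("mongodb://rs-primary.internal.example:27017")

# Fetch the current replica set configuration.
config = client.admin.command("replSetGetConfig")["config"]

# Mark the degraded member as hidden. A hidden member must also have
# priority 0 so it can never be elected primary; it keeps replicating
# but no longer serves client reads.
for member in config["members"]:
    if member["host"] == "degraded-node.internal.example:27017":
        member["hidden"] = True
        member["priority"] = 0

# Bump the config version and apply the new configuration.
config["version"] += 1
client.admin.command("replSetReconfig", config)
```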
On October 22nd, 2019 at 09:09 UTC, the SRE team began migrating the EC2 instance to new hardware and adding the replica back to the MongoDB cluster. This process finished by 09:31 UTC, with the node successfully unhidden and handling connections. At 22:46 UTC, we received an alert that there was a sudden drop in the number of scheduled running workflows. Upon investigation, we noticed high CPU load on the EC2 instance our SRE team had returned to service earlier in the day. Acting quickly, the SRE team rehid the MongoDB replica on this instance. While most of our services responded quickly to this change and automatically disconnected from the hidden replica, a service providing our core API was slow to respond and remained connected. This service was restarted manually, and workflow performance returned to normal by 23:25 UTC. The degraded EC2 instance was kept out of service.
[Chart: Workflow processing rate]
AWS sent a notification on October 23rd, 2019 at 11:56 UTC stating that the hardware we had migrated to on the 22nd was also suffering from degradation. The EC2 instance and its EBS volumes were subsequently rebuilt from scratch, and the replica was unhidden and returned to service on October 23rd at 13:23 UTC.
Who was affected?
Some customers may have experienced problems scheduling or running workflows during this time period.
What did we do to resolve the issue?
Once the issue was discovered, the CircleCI SRE team hid the problematic MongoDB replica from the replica set. Some connections to the hidden replica remained open and the responsible service was restarted manually.
What are we doing to make sure this doesn’t happen again?
The CircleCI engineering team is adjusting client timeout values for MongoDB connections in order to eliminate the need to restart services manually when replica set members are hidden or otherwise fail.
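To illustrate the kind of change this involves, the sketch below shows client-side timeout settings with the MongoDB Python driver. The hosts and values are hypothetical; the idea is that bounded timeouts let a service abandon an unresponsive member and reconnect on its own instead of requiring a manual restart.

```python
from pymongo import MongoClient

# Hypothetical hosts and values, illustrating the client-side timeouts that
# bound how long a driver waits on an unresponsive replica set member.
client = MongoClient(
    "mongodb://node1.internal.example,node2.internal.example/?replicaSet=rs0",
    serverSelectionTimeoutMS=5000,  # give up choosing a server after 5 seconds
    connectTimeoutMS=5000,          # fail slow TCP connection attempts after 5 seconds
    socketTimeoutMS=10000,          # abort operations stalled for more than 10 seconds
)

# A quick connectivity check; with the timeouts above, this fails promptly
# if no suitable replica set member can be reached.
print(client.admin.command("ping"))
```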