Delays in Workflows Starting

Incident Report for CircleCI

Postmortem

November 18, 2020 CircleCI Outage

What happened

On Wednesday, November 18 at 07:00 UTC, our engineers were alerted that our workflow queues were not being processed. They observed various network errors and began to investigate. After approximately 15 minutes the issue resolved and workflow queues began to process again. At 07:25 UTC the incident was set to monitoring while our engineers continued to investigate. Our monitoring showed that a database backup job had begun at approximately the same time as the incident; during this period we also observed a large number of TCP retransmits. This led us to believe this was a database or network issue. Because the incident had already resolved, our engineers were no longer able to observe the system in a degraded state.

At 14:10 UTC we were once again alerted to an issue. As in the first incident, our work queues were not being processed. At approximately 14:45 UTC our engineers rolled the affected Kubernetes pods. This action was not seen as a long-term fix, but it quickly resolved the issue and allowed our systems to resume processing customer workflows. It is important to note that this action was taken after a discussion as to whether it was better to keep observing the incident or to mitigate the impact on our customers.

Between the 14:10 UTC incident and the next incident, which occurred at 17:00 UTC, our engineering teams continued their investigation into the cause of the outages. During this time the cause was identified and work began on a fix. It was narrowed down to an uneven distribution of load-balanced traffic amplifying an existing database lock contention issue. When the third incident began, our teams took immediate steps to mitigate the impact on our customers by proactively rolling the affected Kubernetes pods so that customer workflows could resume. Because we acted sooner during this incident, we recovered more quickly. The code change to address the load balancing issue was pushed to production at 17:42 UTC. Once it was deployed, our engineering teams observed the fix for several hours and marked the incident resolved at 20:34 UTC. The following day our engineering teams pushed further code fixes to production. These addressed the lock contention issue and implemented a more robust load balancing mechanism than the one shipped the evening before.
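To make "uneven distribution of load-balanced traffic" concrete, here is a minimal sketch of one way to quantify such skew. It is an illustration only and not the tooling used during the incident: the function name, the pod names, and the choice of max-over-mean as the metric are all assumptions.

```python
from collections import Counter

def imbalance_ratio(assignments, backends):
    """Return max load divided by mean load across backends.

    `assignments` lists the backend that handled each request
    (hypothetical data); `backends` lists every backend that
    could have received traffic, including idle ones.
    1.0 means perfectly even; larger values mean more skew.
    """
    counts = Counter(assignments)
    per_backend = [counts.get(b, 0) for b in backends]
    mean = sum(per_backend) / len(per_backend)
    if mean == 0:
        return 0.0  # no traffic at all, so nothing to compare
    return max(per_backend) / mean

# Hypothetical pod names, for illustration only.
pods = ["pod-a", "pod-b"]
even = imbalance_ratio(["pod-a", "pod-b", "pod-a", "pod-b"], pods)
skewed = imbalance_ratio(["pod-a", "pod-a", "pod-a", "pod-b"], pods)
```

With these sample inputs the even case gives a ratio of 1.0 and the skewed case gives 1.5. A backend that takes a disproportionate share of requests also issues a disproportionate share of database work, which is one way skew could amplify existing lock contention.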

What we learned and our next steps

Every incident at CircleCI is followed up with an Incident Review. The review process allows us to reflect on what happened, what we learned, and, when possible, to turn those learnings into actions.

Service-level objectives (SLOs): This incident highlighted a gap in our observability and the need for service-level indicators (SLIs) and SLOs that better inform us about our latency in starting workflows.
Standardization: As part of the incident follow-up, we are evaluating whether there is a need and an opportunity for our engineering teams to standardize some of our load balancing and connection pooling implementations, in addition to ensuring that existing setups align with our documented standards.
Process: This incident has highlighted that we can do a better job with our incident handling process, in particular how we hand off an incident between different time zones.
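
As an illustration of the SLO item above, a service-level indicator for workflow start latency could be the fraction of workflows that start within a threshold, compared against a target. The sketch below assumes a 30-second threshold and a 99% target; these numbers, the function names, and the sample data are hypothetical and are not CircleCI's actual definitions.

```python
def start_latency_sli(latencies_s, threshold_s):
    """Fraction of workflows whose start latency is within the threshold.

    `latencies_s` holds seconds between a workflow being queued and
    starting (hypothetical measurements). An empty window counts as
    fully compliant.
    """
    if not latencies_s:
        return 1.0
    good = sum(1 for latency in latencies_s if latency <= threshold_s)
    return good / len(latencies_s)

def meets_slo(latencies_s, threshold_s, target):
    """True when the SLI for this window is at or above the SLO target."""
    return start_latency_sli(latencies_s, threshold_s) >= target

# Hypothetical window: three fast starts and one delayed by 15 minutes.
window = [2.0, 5.0, 8.0, 900.0]
sli = start_latency_sli(window, threshold_s=30.0)
ok = meets_slo(window, threshold_s=30.0, target=0.99)
```

In this sample window the SLI is 0.75, so a 99% target is missed. Alerting on an indicator like this measures the customer-facing symptom (delayed workflow starts) directly, rather than inferring it from queue or network metrics.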

We value our customers tremendously and apologize for any disruption you may have experienced in your work. Thank you for your support.

Posted Nov 23, 2020 - 21:08 UTC

Resolved

After a period of monitoring we have seen no further issues, so we are moving this incident to resolved.

Posted Nov 18, 2020 - 20:34 UTC

Update

We are continuing to monitor to ensure no further issues arise.

Posted Nov 18, 2020 - 19:46 UTC

Update

Our systems have recovered. We will continue to monitor this for an extended period of time.

Posted Nov 18, 2020 - 17:55 UTC

Monitoring

Our systems are recovering and any delayed workflows will run as expected. We will continue to monitor the recovery.

Posted Nov 18, 2020 - 17:25 UTC

Investigating

We are currently investigating an issue causing delays in workflows starting.

Posted Nov 18, 2020 - 17:07 UTC

This incident affected: Pipelines & Workflows.