Increased latency in some workflows and jobs APIs
Incident Report for CircleCI
Postmortem

Summary

On August 1, 2024, from 00:07 UTC to 05:29 UTC, some customers experienced delays when retrieving workflow status, rerunning workflows, and canceling jobs and workflows. This was due to a code change deployed on July 31, 2024, at 12:03 UTC, which modified how workflow status is calculated. At 00:07 UTC on August 1, 2024, specific conditions triggered an unanticipated spike in computational demand, leading to customer impact. After investigating, we identified the issue at 04:55 UTC and reverted the faulty code. Customer impact was fully mitigated around 05:29 UTC. We appreciate our customers' patience and understanding as we worked to resolve this incident.

The original status page can be found here.

What Happened

On July 31, 2024, at 12:03 UTC, we deployed a change to the workflow status calculation. The change passed all of our tests verifying that the correct statuses were calculated. However, it was not known at the time that, under specific conditions, it could significantly increase the computational demand on our system.

Approximately 12 hours later, on August 1, 2024, at 00:07 UTC, these conditions were met, leading to a spike in CPU utilization across our service. As a result, the service scaled up the number of pods, which increased the number of connections used by one of our databases.

At 01:13 UTC, our database team was alerted to database connection saturation and began an initial investigation. By 02:30 UTC, after the affected service and the extent of customer impact had been identified, the responsible engineering team was paged to assist with the issue.

Between 02:42 UTC and 04:30 UTC, we attempted several mitigation steps to relieve database saturation and reduce CPU utilization, including rolling the pods and reducing the number of replicas. However, these steps did not have the desired effect. At that point, we decided to perform CPU profiling on some of the pods.

The profiler revealed an issue with the workflow status calculation function, pointing to one of the recently merged commits as the source of the problem. At 04:55 UTC, we reverted that change, and the customer impact ceased around 05:29 UTC.

Future Prevention and Process Improvement

We made adjustments to our Horizontal Pod Autoscaler (HPA) settings for the service, ensuring that scaling stays within the connection capacity of our database. This adjustment will further enhance the stability and reliability of the service.
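As a concrete illustration of what that alignment involves (the numbers below are hypothetical, not our production values), the HPA's maximum replica count can be capped so that the combined connection pools of all pods stay within the database's connection limit:

    # Hypothetical sizing check: cap HPA maxReplicas so that the combined
    # connection pools of all pods stay below the database connection limit.
    DB_MAX_CONNECTIONS = 500     # connection limit configured on the database
    CONNECTIONS_RESERVED = 50    # headroom for migrations, admin tooling, etc.
    POOL_SIZE_PER_POD = 10       # connections each pod's pool may open

    def max_safe_replicas(db_max: int, reserved: int, pool_per_pod: int) -> int:
        """Largest replica count whose combined pools fit within the database budget."""
        return (db_max - reserved) // pool_per_pod

    print(max_safe_replicas(DB_MAX_CONNECTIONS, CONNECTIONS_RESERVED, POOL_SIZE_PER_POD))  # 45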

We have thoroughly analyzed and addressed the underlying performance issue in the workflow status calculation, and we have identified the conditions that could lead to similar issues. To prevent this, we have implemented additional tests specifically designed to catch these scenarios. These tests ensure that any changes to the workflow status calculation are rigorously vetted before deployment, reducing the risk of recurrence.
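The sketch below shows the shape of one such test: a performance regression check that fails CI if the status calculation exceeds a CPU-time budget on a large synthetic workflow. The function, data shape, and budget are simplified stand-ins for illustration, not our production code.

    # Sketch of a performance regression test. calculate_workflow_status is a
    # naive stand-in here; the real test would exercise the production function.
    import time

    def calculate_workflow_status(workflow: dict) -> str:
        """Stand-in aggregation: a single pass over job statuses."""
        statuses = {job["status"] for job in workflow["jobs"]}
        if "failed" in statuses:
            return "failed"
        if "running" in statuses:
            return "running"
        return "success"

    def make_large_workflow(num_jobs: int) -> dict:
        """Build a large synthetic workflow to exercise the expensive case."""
        return {"jobs": [{"id": i, "status": "success"} for i in range(num_jobs)]}

    def test_status_calculation_stays_within_budget():
        workflow = make_large_workflow(num_jobs=10_000)
        started = time.perf_counter()
        status = calculate_workflow_status(workflow)
        elapsed = time.perf_counter() - started
        assert status == "success"
        # Fail the build if the calculation regresses past the agreed budget.
        assert elapsed < 0.5, f"status calculation took {elapsed:.2f}s for 10,000 jobs"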

Additionally, we have fine-tuned our monitoring systems to provide earlier notifications to our team in the event of increased latency in our read APIs. This adjustment will enable us to respond more quickly to potential issues, reducing the likelihood of significant customer impact.
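For illustration, an earlier-warning check of this kind amounts to paging when the tail latency of the read APIs crosses a threshold over a recent window. The threshold and sample sizes below are hypothetical, not our alerting configuration.

    # Illustrative latency check; threshold and window size are hypothetical.
    from statistics import quantiles

    P99_THRESHOLD_MS = 750   # page when read-API p99 latency exceeds this
    MIN_SAMPLES = 100        # avoid paging on a handful of slow requests

    def should_page(latencies_ms: list[float]) -> bool:
        """Return True when the p99 of the recent window breaches the threshold."""
        if len(latencies_ms) < MIN_SAMPLES:
            return False
        p99 = quantiles(latencies_ms, n=100)[98]   # 99th percentile
        return p99 > P99_THRESHOLD_MS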

Posted Aug 30, 2024 - 14:18 UTC

Resolved
The issue causing increased latency in some of our Workflows and Jobs APIs has now been fully resolved. We thank you for your patience while our engineers worked through this incident.
Posted Aug 01, 2024 - 05:40 UTC
Monitoring
We have implemented a fix for the increased latency affecting some of our Workflows and Jobs APIs and are currently observing signs of improvement. The latency has significantly decreased and continues to return toward normal levels. We will continue to monitor the situation closely.
Posted Aug 01, 2024 - 05:29 UTC
Update
Our engineers are continuing to investigate the issue. At this moment, the increased latency is limited to the "Cancel Jobs", "Get Workflow status", and "Rerun workflow from failed" APIs. We will provide further updates as more information becomes available.
Posted Aug 01, 2024 - 03:29 UTC
Investigating
We are currently investigating an issue causing increased latency to some of our Workflows and Jobs APIs. We will provide further updates as more information becomes available.
Posted Aug 01, 2024 - 03:09 UTC
This incident affected: Pipelines & Workflows.