All timestamps are in UTC
On December 2, 2025, at 16:20, we deployed a code change to workflow status computation that introduced a race condition in how workflows are terminated. This caused the following impacts:
Duplicated auto-rerun workflows (Dec 2, 16:20 - Dec 3, 14:20): Some customers using auto-rerun experienced duplicate workflows being created and duplicate email notifications.
API and UI degradation (Dec 3, 12:10 - 22:50): Some customers experienced issues with the Workflows API and errors in the UI when trying to view pipelines.
Job execution delays (Dec 3, 16:20 - 16:32): Several customers experienced delays in job execution, with some jobs failing to start or becoming stuck.
The incident was fully resolved on December 3 at 22:50.
We thank our customers for their patience and understanding as we worked to resolve this incident.
This incident was tracked as multiple separate issues and status pages:
On December 2, 2025, at 16:20, we deployed a code change that introduced a race condition in how workflows are terminated. When a workflow failed while several of its jobs were terminating concurrently, our system published duplicate workflow completion events. For some customers with auto-rerun enabled, these duplicate events triggered duplicate workflows to be created, causing cascading issues across our platform.
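To make the failure mode concrete, here is a minimal, hypothetical sketch in Python (the names and structure are invented for illustration; this is not our production code) of the check-then-act pattern that can publish a workflow completion event twice when two job terminations race, alongside the atomic transition that prevents it:

import threading

class Workflow:
    def __init__(self):
        self.status = "running"
        self._lock = threading.Lock()
        self.completion_events_published = 0  # stands in for publishing to an event bus

    def handle_job_terminated_racy(self):
        # Unsynchronized check-then-act: two concurrent job-termination handlers
        # can both observe status == "running" and both publish a completion event.
        if self.status == "running":
            self.status = "failed"
            self.completion_events_published += 1

    def handle_job_terminated_safe(self):
        # Making the status transition atomic guarantees exactly one completion
        # event, so downstream consumers see the failure only once.
        with self._lock:
            if self.status == "running":
                self.status = "failed"
                self.completion_events_published += 1

Each duplicate completion event looks downstream like a fresh workflow failure, which is why consumers such as auto-rerun and notifications were affected together.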
Initial Detection and Response (Dec 2, 21:40 - Dec 3, 14:20)
At 21:40 on December 2nd, we received the first customer report of excessive workflow executions. Our notifications team was also alerted to a high number of notifications being sent. We initially thought it was an isolated issue related to customer automation, so we flagged the affected project at 21:46 and saw recovery on the notifications side.
At 21:51, a different customer reported duplicate workflow failure emails. We suspected it might be related to a recent feature rollout, but investigation did not identify a root cause. We initially thought the elevated rate of messages from the first customer had somehow impacted other customers. By 22:19, no other customers had reported similar issues and metrics were back to normal, so we decided to continue the investigation the next day.
At 9:55 on December 3rd, we were alerted to elevated notification volume and resumed the investigation. We escalated to a full incident at 13:44. We determined that the notification issues appeared to be caused by auto-reruns, so we disabled auto-rerun system-wide at 13:56. We identified and rolled back the problematic code change at 14:15, then re-enabled auto-rerun at 14:20.
This resolved the primary issue of duplicate workflows and notifications being created.
38 projects were impacted by the auto-rerun issue, with a total of 33,135 duplicate workflows created. The issue also led to duplicate email notifications for 18 organizations.
However, this rollback did not immediately resolve all customer-facing issues due to downstream impacts. Because the issue manifested across multiple domains (Orchestration, Notifications, API/UI, and Execution), we initially investigated these as separate incidents, and it took time to correlate them back to a single root cause.
The excessive workflows caused cascading issues across our platform:
Notification Issues (Dec 2, 21:40 - Dec 3, 14:20)
18 organizations received duplicate email notifications for workflow failures.
API and UI Issues (Dec 3, 12:10 - 22:50)
At 11:54, our monitoring detected a service queue growing due to duplicate events. While investigating, we discovered the Pipelines page was struggling to load starting at 12:10. The issue was intermittent and was initially resolved at 15:11. However, at 15:30 API errors returned and we continued to investigate.
At 21:30 we identified that some pipelines had over 10,000 workflows, causing our APIs to slow down and fail. Affected customers experienced failed requests to the Workflows API and errors when trying to view pipelines. We deployed targeted fixes between 22:16 and 22:50 that filtered out problematic pipelines, restoring API performance.
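As a rough illustration of that class of mitigation (the names and limit below are invented, not our actual API code), the targeted fixes amounted to temporarily excluding the flooded pipelines and bounding how many workflows a single pipeline can contribute to a response:

PROBLEM_PIPELINES = {"pipeline-with-duplicates"}  # pipelines identified as flooded
WORKFLOW_FETCH_LIMIT = 500                        # assumed per-pipeline cap

def workflows_for_pipeline(pipeline_id, workflows_by_pipeline):
    # Skip pipelines known to contain thousands of duplicate workflows
    # so the rest of the API stays responsive.
    if pipeline_id in PROBLEM_PIPELINES:
        return []
    # Bound the result size so one oversized pipeline cannot slow every request.
    return workflows_by_pipeline.get(pipeline_id, [])[:WORKFLOW_FETCH_LIMIT]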
Job Execution Delays (Dec 3, 16:20 - 16:32)
At 16:18, our monitoring alerted us to issues with our job orchestration system. While investigating, we identified a drop in jobs being processed. At 16:20, our job execution infrastructure began experiencing memory pressure due to increased API traffic from the duplicate workflows. This memory pressure disrupted normal job processing and cancellation handling, and caused services to restart. Job processing recovered at 16:32. During this window, approximately 3,500 jobs were delayed, and 1,100 jobs (3% of tasks) failed to start. Additionally, 8,000 jobs became stuck in a canceling state. We then scaled our execution infrastructure to prevent further memory pressure.
We later identified that this execution impact was connected to the API issues described above, demonstrating the cascading nature of the incident across multiple systems.
All systems fully recovered by 22:50 UTC on December 3rd. On December 5th, we refunded credits consumed by duplicate workflows for all 38 impacted projects.
We're taking several actions to prevent similar incidents and improve our detection and response:
Preventing Duplicate Workflows
We're putting safeguards in place to prevent race conditions when workflows are terminated, and we're adding constraints to prevent duplicate auto-rerun workflows from being created in the first place.
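As a minimal sketch of the second safeguard (assuming a relational store; the table, columns, and attempt key below are illustrative, not our actual schema), a uniqueness constraint makes rerun creation idempotent even if duplicate completion events slip through:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE auto_reruns (
        source_workflow_id TEXT    NOT NULL,
        rerun_attempt      INTEGER NOT NULL,
        rerun_workflow_id  TEXT    NOT NULL,
        UNIQUE (source_workflow_id, rerun_attempt)
    )
    """
)

def create_auto_rerun(source_workflow_id, attempt, new_workflow_id):
    # Insert a rerun record; duplicate completion events for the same
    # (workflow, attempt) pair violate the UNIQUE constraint and are ignored.
    try:
        with conn:
            conn.execute(
                "INSERT INTO auto_reruns VALUES (?, ?, ?)",
                (source_workflow_id, attempt, new_workflow_id),
            )
        return True   # first event wins: trigger the rerun
    except sqlite3.IntegrityError:
        return False  # duplicate event: do not create a second rerun

create_auto_rerun("wf-123", 1, "wf-rerun-a")  # True: rerun created
create_auto_rerun("wf-123", 1, "wf-rerun-b")  # False: duplicate suppressed

With a constraint of this shape, even if two completion events arrive for the same failed workflow, only one auto-rerun is created.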
Improved Monitoring and Alerting
While we have monitors in place, we didn't get paged for some of the issues, which delayed our incident response. As a result, we've updated existing monitoring alerts to page on-call engineers for execution infrastructure API errors and service restarts, and we're adding new alerts for UI-related issues.
Better System Resilience
We will be updating the Pipelines page to handle large numbers of workflows more gracefully, and we are also working to better understand and improve how our systems behave under pressure.
Incident Response
One of the key challenges in this incident was the difficulty of correlating the separate incidents back to a single root cause. As a result, we're taking the following actions: