Jobs stuck in running state

Incident Report for CircleCI

Postmortem

Summary

All timestamps are in UTC

On December 2, 2025, at 16:20, we deployed a code change to workflow status computation that introduced a race condition in how workflows are terminated. This caused the following impacts:

Duplicated auto-rerun workflows (Dec 2, 16:20 - Dec 3, 14:20): Some customers using auto-rerun experienced duplicate workflows being created and duplicate email notifications.

API and UI degradation (Dec 3, 12:10 - 22:50): Some customers experienced issues with the Workflows API and errors in the UI when trying to view pipelines.

Job execution delays (Dec 3, 16:20 - 16:32): Several customers experienced delays in job execution, with some jobs failing to start or becoming stuck.

The incident was fully resolved on December 3 at 22:50.

We thank our customers for their patience and understanding as we worked to resolve this incident.

This incident was tracked as multiple separate issues, each with its own status page:

  • Pipelines page not loading (Link 1, Link 2)
  • Jobs stuck in running state (Link)
  • Jobs not starting (Link)

What Happened

On December 2, 2025, at 16:20, we deployed a code change that introduced a race condition in how workflows are terminated. When a workflow failed while several of its jobs were terminating concurrently, our system could publish duplicate workflow completion events. For some customers with auto-rerun enabled, these duplicate events triggered duplicate workflows to be created, causing cascading issues across our platform.
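
To make the failure mode concrete, the sketch below is a simplified, hypothetical illustration in Python (not CircleCI's actual code) of how a non-atomic "check whether the workflow is finished, then publish" step can emit two completion events when two jobs terminate at nearly the same time, and how serializing that decision closes the race:

    import threading

    class Workflow:
        """Illustrative model of a workflow whose jobs terminate concurrently."""

        def __init__(self, job_count):
            self.remaining_jobs = job_count
            self.completion_events = 0      # stand-in for publishing to an event bus
            self._lock = threading.Lock()

        def on_job_terminated_racy(self):
            # Race: two termination handlers may both decrement before either
            # checks, so both observe remaining_jobs == 0 and both "publish".
            self.remaining_jobs -= 1
            if self.remaining_jobs <= 0:
                self.completion_events += 1

        def on_job_terminated_safe(self):
            # Safeguard: decide atomically whether this handler caused the
            # transition to zero, so exactly one completion event is published.
            with self._lock:
                self.remaining_jobs -= 1
                completed_now = self.remaining_jobs == 0
            if completed_now:
                self.completion_events += 1

With auto-rerun subscribed to completion events, each extra event in the racy path translates into an extra rerun, which matches the duplicate workflows and notifications described in this report.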

Initial Detection and Response (Dec 2, 21:40 - Dec 3, 14:20)

At 21:40 on December 2nd, we received the first customer report of excessive workflow executions. Our notifications team was also alerted to a high number of notifications being sent. We initially thought it was an isolated issue related to customer automation, so we flagged the affected project at 21:46 and saw recovery on the notifications side.

At 21:51, a different customer reported duplicate workflow failure emails. We suspected a recent feature rollout, but the investigation did not identify a root cause, and we initially thought the elevated rate of messages from the first customer had somehow impacted other customers. By 22:19, no other customers had reported similar issues and metrics were back to normal, so we decided to continue the investigation the next day.

At 9:55 on December 3rd, we received alerts for elevated notification volume and resumed the investigation. We escalated to a full incident at 13:44. Once we determined that the notification issues appeared to be caused by auto-reruns, we disabled auto-rerun system-wide at 13:56. We identified and rolled back the problematic code change at 14:15, then re-enabled auto-rerun at 14:20.

This resolved the primary issue of duplicate workflows and notifications being created.

The auto-rerun issue impacted 38 projects and created a total of 33,135 duplicate workflows. It also led to duplicate email notifications for 18 organizations.

Because the issue manifested across multiple domains (Orchestration, Notifications, API/UI, and Execution), we initially investigated these as separate incidents, and it took time to correlate them back to a single root cause. In addition, the rollback did not immediately resolve all customer-facing issues, because of the downstream impacts described below.

Downstream Impact

The excessive workflows caused cascading issues across our platform:

Notification Issues (Dec 2, 21:40 - Dec 3, 14:20)
18 organizations received duplicate email notifications for workflow failures.

API and UI Issues (Dec 3, 12:10 - 22:50)
At 11:54, our monitoring detected a service queue growing due to the duplicate events. While investigating, we discovered that the Pipelines page had been struggling to load since 12:10. The issue was intermittent and was initially resolved at 15:11; however, API errors returned at 15:30 and we continued to investigate.

At 21:30 we identified that some pipelines had over 10,000 workflows, causing our APIs to slow down and fail. Affected customers experienced failed requests to the Workflows API and errors when trying to view pipelines. We deployed targeted fixes between 22:16 and 22:50 that filtered out problematic pipelines, restoring API performance.
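
As an illustration only, the sketch below shows the general shape of such a targeted fix: bound or exclude pathological pipelines in the listing path so that a single pipeline with tens of thousands of workflows cannot dominate the query. The threshold, function names, and the store data-access object are assumptions for the sketch, not CircleCI's implementation.

    WORKFLOW_COUNT_THRESHOLD = 10_000  # pipelines above this size were the problem case

    def list_pipelines_for_page(store, project_id, limit=20):
        """Return recent pipelines for the Pipelines page, skipping any that are
        too large to load safely; skipped IDs can be handled out of band."""
        pipelines = store.fetch_recent_pipelines(project_id, limit=limit)  # hypothetical call
        safe, skipped = [], []
        for pipeline in pipelines:
            workflow_count = store.count_workflows(pipeline.id)  # hypothetical cheap count
            if workflow_count > WORKFLOW_COUNT_THRESHOLD:
                skipped.append(pipeline.id)  # filtered out of the expensive rendering path
            else:
                safe.append(pipeline)
        return safe, skipped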

Job Execution Delays (Dec 3, 16:20 - 16:32)

At 16:18, our monitoring alerted us to issues with our job orchestration system. While investigating, we identified a drop in jobs being processed. At 16:20, our job execution infrastructure began experiencing memory pressure due to increased API traffic from the duplicate workflows. This memory pressure disrupted normal job processing and cancellation handling, and caused services to restart. Job processing recovered at 16:32. During this window, approximately 3,500 jobs were delayed, and 1,100 jobs (3% of tasks) failed to start. Additionally, 8,000 jobs became stuck in a canceling state. We then scaled our execution infrastructure to prevent further memory pressure.

We later identified that this execution impact was connected to the API issues described above, demonstrating the cascading nature of the incident across multiple systems.

Resolution

All systems fully recovered by 22:50 UTC on December 3rd. On December 5th, we refunded the credits consumed by duplicate workflows for all 38 impacted projects.

Future Prevention and Process Improvement

We're taking several actions to prevent similar incidents and improve our detection and response:

Preventing Duplicate Workflows

We're putting safeguards in place to prevent race conditions when workflows are terminated, and we're adding constraints to prevent duplicate auto-rerun workflows from being created in the first place.
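
As a sketch of what such a constraint can look like (the schema, names, and use of SQLite below are illustrative assumptions, not a description of CircleCI's data stores), rerun creation can be made idempotent with a uniqueness constraint so that a duplicated completion event cannot create a second auto-rerun:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE auto_reruns (
            source_workflow_id TEXT    NOT NULL,
            attempt            INTEGER NOT NULL,
            UNIQUE (source_workflow_id, attempt)  -- at most one rerun per attempt
        )
    """)

    def create_auto_rerun(source_workflow_id, attempt):
        """Record an auto-rerun; a duplicate completion event becomes a no-op."""
        cur = conn.execute(
            "INSERT OR IGNORE INTO auto_reruns (source_workflow_id, attempt) VALUES (?, ?)",
            (source_workflow_id, attempt),
        )
        return cur.rowcount == 1  # True only for the first event seen

    # Two deliveries of the same completion event: only the first creates a rerun.
    assert create_auto_rerun("workflow-123", 1) is True
    assert create_auto_rerun("workflow-123", 1) is False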

Improved Monitoring and Alerting

While we have monitors in place, we didn't get paged for some of the issues, which delayed our incident response. As a result, we've updated existing monitoring alerts to page on-call engineers for execution infrastructure API errors and service restarts, and we're adding new alerts for UI-related issues.

Better System Resilience

We will be updating the Pipelines page to handle large numbers of workflows more gracefully while also looking to understand and improve how our systems behave under pressure.

Incident Response

One of the key challenges in this incident was correlating the separate issues back to a single root cause. As a result, we're taking the following actions:

  • Establishing better communication protocols to quickly correlate related incidents across teams
  • Improving coordination when mitigations in one system may impact downstream services
Posted Dec 18, 2025 - 21:30 UTC

Resolved

Between 16:20 and 16:32 UTC, job triggering and workflow starts experienced disruptions across all resource classes due to memory pressure on our internal job distributor systems. We identified the issue and scaled our infrastructure to handle the load. Services returned to normal operation at 16:32 UTC.

What was impacted: Job triggering and workflow starts were disrupted for 12 minutes. Some workflows and jobs appeared stuck in a running state during this window.

Resolution: Our systems are now operating normally with additional capacity in place to prevent similar disruptions. If you had workflows or jobs that were stuck during this window, please manually rerun them.

The incident is now resolved, and we will be conducting a thorough review to understand what triggered the memory pressure and identify any additional preventive measures.
Posted Dec 03, 2025 - 18:12 UTC

Monitoring

As of 16:32 UTC, job triggering and workflow starts have returned to normal operation across all resource classes. The impact was limited to a 12-minute window between 16:20 and 16:32 UTC.

What's impacted: All new jobs and workflows are now starting normally.

What to expect: If you have workflows or jobs that were stuck during the 16:20-16:32 UTC window, please manually rerun them.

We are continuing to investigate the root cause of this disruption and will provide an update within 30 minutes or once our investigation is complete.
Posted Dec 03, 2025 - 17:27 UTC

Investigating

At 16:20 UTC, we began experiencing delays in job triggering and starts across all resource classes. Some workflows and jobs may appear stuck in a running state.

What’s impacted: Job triggering is experiencing delays or is stuck. This affects all resource classes and executors.

What to expect: If you have workflows that appear stuck and haven’t started, we recommend manually rerunning them.

We are actively investigating the root cause and working to restore normal processing speeds.

Next update: We will provide an update within 30 minutes or earlier with our progress.
Posted Dec 03, 2025 - 16:53 UTC
This incident affected: Pipelines & Workflows.