Jobs not starting

Incident Report for CircleCI

Postmortem

Summary

All timestamps are in UTC

On December 2, 2025, at 16:20, we deployed a code change to workflow status computation that introduced a race condition in how workflows are terminated. This caused the following impacts:

Duplicate auto-rerun workflows (Dec 2, 16:20 - Dec 3, 14:20): Some customers using auto-rerun saw duplicate workflows created and received duplicate email notifications.

API and UI degradation (Dec 3, 12:10 - 22:50): Some customers experienced issues with the Workflows API and errors in the UI when trying to view pipelines.

Job execution delays (Dec 3, 16:20 - 16:32): Several customers experienced delays in job execution, with some jobs failing to start or becoming stuck.

The incident was fully resolved on December 3 at 22:50.

We thank our customers for their patience and understanding as we worked to resolve this incident.

This incident was tracked as multiple separate issues across several status pages:

  • Pipelines page not loading (Link 1, Link 2)
  • Jobs stuck in running state (Link)
  • Jobs not starting (Link)

What Happened

On December 2, 2025, at 16:20, we deployed a code change that introduced a race condition in how workflows are terminated. When workflows failed with concurrent jobs terminating, our system published duplicate workflow completion events. For some customers with auto-rerun enabled, these duplicate events triggered duplicate workflows to be created, causing cascading issues across our platform.
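To make the failure mode concrete, the sketch below shows the kind of check-then-act race described above, together with one way to close it. It is a minimal, hypothetical Python illustration: the Workflow class, the handlers, and publish_workflow_completed are invented for this example and do not reflect CircleCI's actual code.

    import threading

    class Workflow:
        def __init__(self, job_states):
            self.job_states = job_states        # job_id -> "running" | "failed" | ...
            self.completed_published = False
            self.lock = threading.Lock()

    published_events = []

    def publish_workflow_completed(workflow_id):
        # Stand-in for the completion event that auto-rerun reacts to.
        published_events.append(("workflow-completed", workflow_id))

    def on_job_terminated_racy(wf, workflow_id, job_id):
        # Check-then-act race: two jobs terminating at nearly the same time can
        # both observe "no jobs still running" before either marks the workflow
        # complete, so the completion event is published twice.
        wf.job_states[job_id] = "failed"
        if all(s != "running" for s in wf.job_states.values()):
            if not wf.completed_published:
                wf.completed_published = True
                publish_workflow_completed(workflow_id)

    def on_job_terminated_safe(wf, workflow_id, job_id):
        # One possible safeguard: make the state transition atomic so exactly
        # one of the concurrently terminating jobs publishes the event.
        with wf.lock:
            wf.job_states[job_id] = "failed"
            wins = (all(s != "running" for s in wf.job_states.values())
                    and not wf.completed_published)
            if wins:
                wf.completed_published = True
        if wins:
            publish_workflow_completed(workflow_id)

    if __name__ == "__main__":
        wf = Workflow({"job-a": "running", "job-b": "running"})
        threads = [threading.Thread(target=on_job_terminated_safe, args=(wf, "wf-1", j))
                   for j in ("job-a", "job-b")]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(published_events)  # exactly one ("workflow-completed", "wf-1") entry

With a duplicate completion event, an auto-rerun consumer that treats every event as a trigger creates a second rerun workflow, which is how the duplicates multiplied downstream.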

Initial Detection and Response (Dec 2, 21:40 - Dec 3, 14:20)

At 21:40 on December 2nd, we received the first customer report of excessive workflow executions. Our notifications team was also alerted to a high number of notifications being sent. We initially thought it was an isolated issue related to customer automation, so we flagged the affected project at 21:46 and saw recovery on the notifications side.

At 21:51, we learned that a different customer had reported duplicate workflow failure emails. We suspected a recent feature rollout, but the investigation did not identify a root cause, and we initially assumed that the elevated message rate from the first customer had somehow affected other customers. By 22:19, no other customers had reported similar issues and metrics had returned to normal, so we decided to continue the investigation the next day.

At 9:55 on December 3rd, we were alerted to elevated notification volumes and resumed the investigation. We escalated to a full incident at 13:44. After learning that the notification issues appeared to be caused by auto-reruns, we disabled auto-rerun system-wide at 13:56. We identified and rolled back the problematic code change at 14:15, then re-enabled auto-rerun at 14:20.

This resolved the primary issue of duplicate workflows and notifications being created.

In total, 38 projects were impacted by the auto-rerun issue, with 33,135 duplicate workflows created. The issue also resulted in duplicate email notifications to 18 organizations.

Because the issue manifested across multiple domains (Orchestration, Notifications, API/UI, and Execution), we initially investigated it as several separate incidents, and it took time to correlate them back to a single root cause. The rollback also did not immediately resolve all customer-facing issues because of the downstream impacts described below.

Downstream Impact

The excessive workflows caused cascading issues across our platform:

Notification Issues (Dec 2, 21:40 - Dec 3, 14:20)
18 organizations received duplicate email notifications for workflow failures.

API and UI Issues (Dec 3, 12:10 - 22:50)
At 11:54, our monitoring detected a service queue growing due to the duplicate events. While investigating, we discovered that the Pipelines page had begun intermittently failing to load at 12:10. The issue was initially resolved at 15:11, but API errors returned at 15:30 and we continued to investigate.

At 21:30 we identified that some pipelines had over 10,000 workflows, causing our APIs to slow down and fail. Affected customers experienced failed requests to the Workflows API and errors when trying to view pipelines. We deployed targeted fixes between 22:16 and 22:50 that filtered out problematic pipelines, restoring API performance.
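As a rough illustration of this class of mitigation, the sketch below bounds how many workflows a single pipeline query will pull, so one pathological pipeline cannot dominate the read path. It is a hypothetical Python example: store.list_workflows and the limit value are assumptions for illustration, not CircleCI's actual API or the exact fix we deployed.

    # Cap on workflows fetched per pipeline; a real value would be tuned to the
    # UI and API needs rather than this illustrative number.
    MAX_WORKFLOWS_PER_PIPELINE = 1_000

    def workflows_for_pipeline(store, pipeline_id):
        # `store.list_workflows` is a hypothetical accessor that accepts a limit.
        # Fetch one extra row so the caller can tell the result was truncated.
        rows = store.list_workflows(pipeline_id, limit=MAX_WORKFLOWS_PER_PIPELINE + 1)
        truncated = len(rows) > MAX_WORKFLOWS_PER_PIPELINE
        return rows[:MAX_WORKFLOWS_PER_PIPELINE], truncated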

Job Execution Delays (Dec 3, 16:20 - 16:32)

At 16:18, our monitoring alerted us to issues with our job orchestration system. While investigating, we identified a drop in jobs being processed. At 16:20, our job execution infrastructure began experiencing memory pressure due to increased API traffic from the duplicate workflows. This memory pressure disrupted normal job processing and cancellation handling, and caused services to restart. Job processing recovered at 16:32. During this window, approximately 3,500 jobs were delayed, and 1,100 jobs (3% of tasks) failed to start. Additionally, 8,000 jobs became stuck in a canceling state. We then scaled our execution infrastructure to prevent further memory pressure.

We later identified that this execution impact was connected to the API issues described above, demonstrating the cascading nature of the incident across multiple systems.

Resolution

All systems fully recovered by 22:50 UTC on December 3rd. On December 5th, we refunded credits consumed by the duplicate workflows for all 38 impacted projects.

Future Prevention and Process Improvement

We're taking several actions to prevent similar incidents and improve our detection and response:

Preventing Duplicate Workflows

We're putting safeguards in place to prevent race conditions when workflows are terminated, and we're adding constraints so that duplicate auto-rerun workflows cannot be created in the first place.
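As a sketch of what such a constraint can look like, the hypothetical example below deduplicates auto-rerun scheduling on a (workflow, attempt) key, so a duplicate completion event cannot create a second rerun. The class and field names are invented for illustration; in practice the constraint would live in the datastore rather than in process memory.

    class AutoRerunScheduler:
        def __init__(self):
            # (workflow_id, attempt) pairs for which a rerun already exists.
            # In production this would be a uniqueness constraint in the
            # datastore, not an in-memory set.
            self._reruns_created = set()

        def handle_workflow_completed(self, workflow_id, attempt):
            key = (workflow_id, attempt)
            if key in self._reruns_created:
                # Duplicate completion event for the same attempt: drop it
                # instead of scheduling a second rerun.
                return None
            self._reruns_created.add(key)
            return {"rerun_of": workflow_id, "attempt": attempt + 1}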

Improved Monitoring and Alerting

While we have monitors in place, we didn't get paged for some of the issues, which delayed our incident response. As a result, we've updated existing monitoring alerts to page on-call engineers for execution infrastructure API errors and service restarts, and we're adding new alerts for UI-related issues.

Better System Resilience

We will update the Pipelines page to handle large numbers of workflows more gracefully, and we are working to better understand and improve how our systems behave under load.

Incident Response

One of the key challenges in this incident was recognizing that the separately tracked issues all stemmed from the same root cause. As a result, we're taking the following actions:

  • Establishing better communication protocols to quickly correlate related incidents across teams
  • Improving coordination when mitigations in one system may impact downstream services
Posted Dec 18, 2025 - 21:30 UTC

Resolved

We have resolved the issues affecting job triggering, workflow starts, and API queries. Our systems have been stabilized and are operating normally.

What was impacted: Job triggering, workflow starts, API queries, and pipeline page loading experienced disruptions for some customers. This affected all resource classes and executors.

Resolution: We implemented mitigation measures to address the high-volume workflow queries impacting our internal systems and increased system capacity. All new jobs and workflows are now starting normally, pipeline pages are loading, and API queries are functioning as expected.

What to expect: If you have jobs that became stuck during this incident, please rerun them. If you continue to experience issues after rerunning, please contact our support team. Some customers may still see jobs stuck in a canceling state; our engineers are aware and are working to mitigate the issue.

We will continue monitoring our systems and conducting a thorough review to identify additional preventive measures.
Posted Dec 03, 2025 - 23:48 UTC

Update

We have deployed changes to mitigate the high volume of workflow queries impacting our systems. Pipeline pages that were previously failing to load are now loading successfully, and we are seeing a significant reduction in API errors.

What's impacted: Some customers continue to experience jobs stuck in a not-running state from earlier in the incident. New job triggering and workflow starts are now functioning normally.

What's happening: We have implemented mitigation measures and increased system capacity. We are continuing to investigate the remaining stuck jobs for affected customers.

What to expect: If you experienced issues loading pipeline pages or querying workflow data via the API, these should now be resolved. New jobs and workflows should trigger normally. If you have jobs that appeared stuck earlier, please try rerunning them while we continue to investigate the jobs that remain stuck for a small number of customers. The data for those workflows should be available and queryable.

Next update: We will provide an update within 30 minutes. Thank you for your patience while our engineers work through this incident.
Posted Dec 03, 2025 - 22:59 UTC

Update

We are currently experiencing issues affecting job triggering and workflow starts across all resource classes. Jobs may appear stuck in a not-running state, and some customers may encounter 500 errors when making API calls to check job or workflow status.

What's impacted: Job triggering, workflow starts, and API queries for job and workflow status are experiencing disruptions. This affects all resource classes and executors. Some users may also experience issues loading the pipeline page.

What to expect: We are actively working to stabilize our systems and restore normal operations. We will provide updates as we make progress toward resolution.

We thank you for your patience while we work through these issues - we will update with our progress within 30 minutes or earlier.
Posted Dec 03, 2025 - 21:58 UTC

Investigating

We are currently investigating reports of jobs not starting. We apologize for the inconvenience.
Posted Dec 03, 2025 - 21:28 UTC
This incident affected: Pipelines & Workflows.