Delays in starting workflows

Incident Report for CircleCI

Postmortem

Summary

On April 3, 2025, from 22:08 UTC to 23:45 UTC, CircleCI customers experienced increased latency and some failures with starting and canceling workflows and jobs. During this time customers may have experienced delays and difficulty viewing workflows in the UI. We appreciate your patience and understanding as we worked to resolve this incident.

What Happened

(all times UTC)

At approximately 22:00 on April 3, we initiated an upgrade to the service responsible for workflows. We expected a short delay (< 90 seconds) during the database upgrade, during which calls from the workflows service to the database would be sent to a queue and retried over a 10-minute period. We expected the queues to grow slightly during and immediately after the upgrade. At 22:08, when the blue/green deployment was complete, we verified that queries were being served. At 22:17, we identified increased latency in the workflows service, as well as errors from jobs being dropped after exhausting their 10-minute retry window.
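The queue-and-retry behavior described above can be illustrated with a minimal sketch. It is not CircleCI's implementation: the function name, the `RetriesExhausted` exception, the 5-second pause between attempts, and the use of `ConnectionError` as the retryable failure are all assumptions made for illustration. Only the 10-minute budget comes from the report.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when the retry budget runs out before a call succeeds."""


def call_with_retry_budget(
    operation: Callable[[], T],
    budget_seconds: float = 600.0,  # the 10-minute window named in the report
    delay_seconds: float = 5.0,     # hypothetical pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` until it succeeds or the time budget is spent."""
    deadline = clock() + budget_seconds
    while True:
        try:
            return operation()
        except ConnectionError as exc:
            # Give up if the next attempt would start after the deadline.
            if clock() + delay_seconds > deadline:
                raise RetriesExhausted("retry budget exhausted") from exc
            sleep(delay_seconds)
```

With a budget like this, a database that stays slow for longer than the window causes calls to be dropped rather than delayed, which matches the errors seen at 22:17.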

At 22:29 additional engineers were engaged, and at 22:30 the team restarted the workflows pods to ensure they were all connecting to the correct database. At 22:35 a public incident was declared. At 22:41, the team observed that all queries on the new database were hitting disk, which indicated that the database statistics tables had not been updated. The team immediately upsized the database and disabled all non-business-critical operations on it. At 23:00, the workflows service was scaled down to a single pod to give the database capacity to recover while the statistics table was rebuilt.

At 23:10, the team observed that the workflows queue was backing up, as expected given the reduction in pods, but saw no improvement in database performance. At 23:19, the team decided to re-enable writes on the old database and reinstate its primary status to restore service to customers sooner. This work was completed at 23:29. The team continued to monitor the workflows queue. At 23:45, the team determined that the workflows queue was back to normal operating levels, and no further errors were observed.

Post-incident, the team continued to investigate. The root cause was that the analyze operation, which rebuilds the database’s statistics table used for indexes, had been executed too early in the procedure. A second major version upgrade within the same deployment then made those statistics stale.
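A check for this failure mode can be sketched as follows. The report does not name the database engine; this sketch assumes PostgreSQL, whose `pg_stat_user_tables` view exposes `last_analyze` and `last_autoanalyze` timestamps. The `TableStats` type, the `stale_tables` function, and the table names in the usage example are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, List, NamedTuple, Optional

# Assumed engine: PostgreSQL. The rows returned by this query would feed stale_tables().
STATS_QUERY = """
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
"""


class TableStats(NamedTuple):
    relname: str
    last_analyze: Optional[datetime]
    last_autoanalyze: Optional[datetime]


def stale_tables(
    rows: Iterable[TableStats], last_upgrade_finished: datetime
) -> List[str]:
    """Return tables whose newest analyze predates the last upgrade step."""
    stale = []
    for row in rows:
        timestamps = [
            t for t in (row.last_analyze, row.last_autoanalyze) if t is not None
        ]
        newest = max(timestamps, default=None)
        # Never analyzed, or analyzed before the final upgrade: treat as stale.
        if newest is None or newest < last_upgrade_finished:
            stale.append(row.relname)
    return stale
```

For example, with hypothetical tables and timestamps:

```python
upgrade_done = datetime(2025, 4, 3, 22, 8, tzinfo=timezone.utc)
rows = [
    TableStats("workflows", datetime(2025, 4, 3, 22, 2, tzinfo=timezone.utc), None),
    TableStats("jobs", None, None),
]
print(stale_tables(rows, upgrade_done))
```

Both tables are reported as stale: the first was analyzed before the final upgrade step finished, and the second was never analyzed.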

Future Prevention and Process Improvement

The blue/green database deployment procedures have been updated to run an analyze operation after every major version change. The team has also tested running the analyze command while a database is under load to confirm that it does not further degrade database performance. This finding has been noted for future remediation.
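The ordering rule in the updated procedure can be sketched as a small plan builder. This is an illustration only, not CircleCI's tooling: the function name, the step wording, and the version numbers in the usage example are hypothetical, since the report does not state which versions were involved.

```python
from typing import List


def build_upgrade_plan(versions: List[int]) -> List[str]:
    """Build ordered steps for a multi-hop major version upgrade.

    `versions` lists the major versions in order, starting with the
    current one. An analyze step follows every hop, so the statistics
    are never left stale by a later upgrade in the same deployment.
    """
    if len(versions) < 2:
        raise ValueError("need a current version and at least one target")
    steps = []
    for source, target in zip(versions, versions[1:]):
        if target <= source:
            raise ValueError("versions must be strictly increasing")
        steps.append(f"upgrade major version {source} -> {target}")
        steps.append(f"analyze on major version {target}")
    steps.append("cut over to upgraded database")
    return steps
```

Calling `build_upgrade_plan([1, 2, 3])` yields two upgrade steps, each immediately followed by an analyze step, and the cutover last. The incident corresponds to a plan in which the only analyze step ran before the second upgrade.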

Before any further migrations are run, the team will add automated tests and manual checkpoints throughout the process to identify and resolve issues before the blue/green cutover.

Posted Apr 16, 2025 - 13:35 UTC

Resolved

The issue impacting workflows and pipelines has now been resolved.
Posted Apr 04, 2025 - 00:11 UTC

Monitoring

Our engineers have implemented a fix for the issue impacting workflows and pipelines, and our systems are back within normal operating range. We will continue to monitor the situation. We thank you for your patience while we worked to resolve this issue.
Posted Apr 03, 2025 - 23:52 UTC

Update

We are continuing to work on the issue impacting workflows and pipelines and are starting to see our systems recover. Thank you for your patience while our engineers work to resolve this.
Posted Apr 03, 2025 - 23:41 UTC

Identified

We have identified the issue causing workflows and pipelines to be delayed or not start at all. Our engineers are working on a fix. We appreciate your patience and understanding as we actively work to resolve this disruption. We will keep you updated on our progress.
Posted Apr 03, 2025 - 23:03 UTC

Update

We are continuing to investigate this issue.
Posted Apr 03, 2025 - 22:35 UTC

Investigating

We are investigating delays in starting workflows.
Posted Apr 03, 2025 - 22:35 UTC
This incident affected: Pipelines & Workflows.