Test Results not available in UI and for test splitting

Incident Report for CircleCI

Postmortem

Summary:

From 19:15 UTC to 22:50 UTC on April 29th, test result processing failed. Test results were not available for customer jobs, and test splitting did not function.

The original status page can be found here.

What Happened

All timestamps are UTC.

We are currently migrating away from a legacy test results processing service. As part of this migration, we needed to understand the true sizes of test result blobs for customer workloads. We added instrumentation to gather this data in a change that deployed at 19:15 on April 29th. This change contained a bug that resulted in a null value being returned instead of a file.

Since this null appeared to be a valid response, no errors fired and no alerts triggered. The message system that reports test result processing also worked normally, as these retrievals registered as successes. We discovered the failure at 22:50 as a team member did a routine check of dashboards and saw no test result processing, and immediately rolled back the change.

When investigating the failure on Monday, the team discovered the full impact of the bug and we declared a retroactive incident.

Future Prevention and Process Improvement:

While we have monitoring for test result processing failures, we did not alert based on that monitoring. We’ll be adding those alerts, as well as looking into our post-deploy validation strategy. Additionally, as we continue with the migration away from the legacy service we are ensuring the new service has tests to cover this scenario.

Posted May 09, 2022 - 15:19 UTC

Resolved

A faulty change was made to test-results-service 19:15UTC causing test-results to not be processed. This affected the UI and the ability for test splitting to work correctly for subsequent jobs. The deployment was rolled back 22:50UTC.

Posted Apr 29, 2022 - 19:15 UTC