Workflows not running and jobs failing
Incident Report for CircleCI
Postmortem

Summary

On March 1, 2022, from approximately 14:50 UTC until approximately 16:40 UTC, jobs that accessed contexts were unable to run. To our customers, this looked like jobs appearing to start in the UI and then moving to “Failed” after a few minutes. This affected customers using GitHub on both our cloud and server platforms.

This incident was caused by a change in GitHub’s API: an endpoint we used to determine permissions was moved to a different path. To resolve the incident, we updated the path for this request. We apologize for this disruption and are taking steps to prevent a recurrence.

What Happened

On January 21, 2020, GitHub provided notice that a set of API endpoints was due to be migrated. We reviewed the changes and mistakenly concluded that they did not apply to us. On March 1, 2022, at 14:00 UTC, GitHub implemented a 12-hour brownout of these endpoints. During the brownout, all requests to the endpoint we relied on returned a 404 status code with explanation text in the body of the response.

Our code was not able to handle a response in this format and threw an error. This request was part of the logic used to determine whether a user could access a context. Because that information was unavailable, jobs accessing contexts were unable to start.
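For illustration only, the sketch below (not our production code) shows how this ambiguity can arise, assuming a team-membership lookup similar to the permission check involved; the endpoint path, headers, and body checks are assumptions made for the example.

```python
import requests

GITHUB_API = "https://api.github.com"

class UpstreamEndpointUnavailable(Exception):
    """The upstream API reports that the endpoint itself is deprecated or browned out."""

def user_is_team_member(token: str, team_id: int, username: str) -> bool:
    # Hypothetical permission check against GitHub's legacy, team-scoped path.
    resp = requests.get(
        f"{GITHUB_API}/teams/{team_id}/memberships/{username}",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github.v3+json",
        },
        timeout=10,
    )
    if resp.status_code == 200:
        return resp.json().get("state") == "active"
    if resp.status_code == 404:
        # A 404 normally means "not a member", but a browned-out endpoint also
        # returns 404 with explanation text in the body. Distinguishing the two
        # lets the caller surface a clear upstream error instead of an opaque failure.
        if "deprecat" in resp.text.lower() or "brownout" in resp.text.lower():
            raise UpstreamEndpointUnavailable(resp.text[:200])
        return False
    resp.raise_for_status()
    return False
```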

In the initial stages of investigating the issue, we saw that our jobs queue was growing quickly. We determined that an update to a vital service had been merged at around the same time the symptoms started, which led us to believe that the update was the root cause. At 15:30 UTC we reverted that change and saw the queue clear out, so we believed we had addressed the root cause and moved the incident status to Monitoring. When the queue began growing again and manual runs of jobs continued to fail, we moved the incident back to Investigating.

When we saw that requests to the upstream provider were failing, we investigated and determined at 16:04 UTC that we were making requests to one of the browned-out endpoints. We quickly updated the path and finished deploying the updated service at 16:35 UTC.
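The shape of that path change is sketched below. GitHub’s migration moved legacy team-scoped routes under an organization-scoped prefix; the membership endpoint shown is an example of that pattern, not necessarily the exact request we updated.

```python
GITHUB_API = "https://api.github.com"

# Before (legacy team-scoped path, subject to brownouts and eventual removal):
#   GET /teams/{team_id}/memberships/{username}
def legacy_membership_url(team_id: int, username: str) -> str:
    return f"{GITHUB_API}/teams/{team_id}/memberships/{username}"

# After (organization-scoped path):
#   GET /organizations/{org_id}/team/{team_id}/memberships/{username}
def migrated_membership_url(org_id: int, team_id: int, username: str) -> str:
    return f"{GITHUB_API}/organizations/{org_id}/team/{team_id}/memberships/{username}"
```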

Upon resolution of this incident, we took steps to ensure customers using the server version of our platform were not impacted. We prepared a patch and notified these customers with a plan for mitigation.

Future Prevention and Process Improvement

During the incident we realized that, although we had been notified several times about this change, we had not understood that it would affect us. We misread the announcement as applying only to the endpoints explicitly called out in the document rather than to all endpoints under the ‘teams’ path. This is not something that should have happened at a company whose mission includes managing change.

We have moved breaking-change notifications from our providers into an alerts channel monitored by our senior engineers. This ensures that change notifications like this one are investigated carefully when they arrive and that any mitigations are put in place ahead of the deadline.

We are also adding work to our roadmap to decouple our tight dependencies on upstream service providers, so that a similar issue would have a much smaller impact in the future.

Posted Mar 14, 2022 - 22:29 UTC

Resolved
This incident has been resolved.
Posted Mar 01, 2022 - 17:18 UTC
Update
We are continuing to monitor the fix that has been implemented.
Posted Mar 01, 2022 - 17:02 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 01, 2022 - 16:42 UTC
Update
We are continuing to investigate this issue, and our team is currently working on a fix.
Posted Mar 01, 2022 - 16:40 UTC
Update
We have now narrowed down the category of workflows/jobs that are affected and our team is continuing to investigate.

The workflows/jobs that are impacted are those that:
- Use contexts
- Are under GitHub organizations
Posted Mar 01, 2022 - 16:07 UTC
Investigating
We are still seeing categories of jobs failing, and we are currently trying to identify the exact type of jobs that are impacted.

We are also noticing that the ongoing issue affects access to specific pages of the UI. For example, the "Organization Settings --> Contexts" page.
Posted Mar 01, 2022 - 15:43 UTC
Update
We are continuing to monitor for any further issues.
Posted Mar 01, 2022 - 15:24 UTC
Monitoring
We are currently working to identify the cause of this issue in order to implement a solution.
Posted Mar 01, 2022 - 15:19 UTC
Investigating
We are investigating an issue preventing workflows from running and causing jobs to fail with no particular error.
Posted Mar 01, 2022 - 15:13 UTC
This incident affected: Docker Jobs, Machine Jobs, macOS Jobs, Windows Jobs, CircleCI UI, and CircleCI Insights.