Inaccessible Orbs
Incident Report for CircleCI
Postmortem

Incident Report: Sept 22, 2022 - Major - All Orbs Inaccessible

Summary:

On September 21, 2022, at 22:29 UTC, the orb index was inadvertently deleted when an incorrect command was issued against a production database. As a result, all orbs were unavailable and jobs referencing orbs failed for the duration of the incident. We were able to restore the majority of the contents of that database. However, the most recent backup at that time was 45 minutes old, and any changes made to orbs during that 45-minute window were lost. We want to apologize for any disruption this might have caused and to thank our customers for their patience and understanding as we worked to restore normal service.

What Happened

(all times UTC)

On September 21, 2022, for 76 minutes beginning at 22:29, any job that referenced an orb failed with an error stating that the referenced orb couldn’t be found. Additionally, orb pages in the registry returned a 404 Not Found error, and orbs were inaccessible in the CircleCI UI and CLI. Jobs that did not use orbs were unaffected.

This issue was traced back to an incorrectly entered command that was applied to every orb in the orbs database. In most circumstances, our teams use our internal administrative tools to remove an orb at a customer's request. In this instance, however, an error in that tooling required an engineer to perform a manual operation to remove the requested orb. During that process, an incorrectly entered command removed all orbs from the database rather than the specific orb requested.

An incident was declared at 22:48, moving to identified at 22:50. Once identified, the team set about restoring the orbs database to a known good state. On first pass, the team restored a database snapshot from earlier that same day. At this point, 58 minutes after the incident began, functionality was restored and jobs referencing orbs began completing as expected.

Following that, at 23:45, the team began retrieving orb changes that occurred between the snapshot time and the time of the deletion. They were able to restore all but the final 45 minutes of changes made before the deletion, and the incident was resolved at 00:16.

As a result, any changes made to orbs on September 21, 2022 between 21:44 and 22:29 were lost. No other customer data, aside from orbs, was affected.
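
For readers who want to check the timeline arithmetic, the intervals above can be recomputed from the timestamps in this report. This is only an illustrative sketch: the constant and function names are made up for this example, and 23:45 is taken as the point at which jobs began completing again, based on the 76-minute figure and the Monitoring update.

```python
from datetime import datetime, timezone


def utc(text):
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps as stated in this report (all UTC).
LAST_RECOVERABLE = utc("2022-09-21 21:44")  # start of the lost-changes window
DELETION = utc("2022-09-21 22:29")          # orb index deleted
DECLARED = utc("2022-09-21 22:48")          # incident declared
SERVICE_RESTORED = utc("2022-09-21 23:45")  # assumed from the 76-minute figure
RESOLVED = utc("2022-09-22 00:16")          # incident resolved


def incident_durations():
    """Return the key intervals of the incident, in minutes."""
    return {
        "data_loss_window": minutes_between(LAST_RECOVERABLE, DELETION),
        "time_to_declare": minutes_between(DELETION, DECLARED),
        "job_failures": minutes_between(DELETION, SERVICE_RESTORED),
        "time_to_resolve": minutes_between(DELETION, RESOLVED),
    }


if __name__ == "__main__":
    for name, minutes in incident_durations().items():
        print(f"{name}: {minutes} min")
```

The data-loss window and the job-failure duration computed this way match the 45 and 76 minutes stated above.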

Future Prevention and Process Improvement:

Ultimately, human errors are system problems that we can and will solve. In this case, an error in our internal tooling led to a manual procedure that lacked sufficient guardrails to prevent a mistake from occurring. Going forward, we have removed this specific process for deleting orbs through direct access to our production database and we are improving our internal administrative tooling to appropriately handle orb removal with abundant safeguards in place.
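
As an illustration of the kind of guardrail described above, the following is a minimal sketch of a delete that runs inside a transaction and is rolled back unless it affects exactly the expected number of rows. It is not CircleCI's actual tooling: the `orbs` table, its `name` column, the `guarded_delete` helper, and the use of SQLite are all assumptions made for the example.

```python
import sqlite3


class UnexpectedRowCount(Exception):
    """Raised when a delete touches a different number of rows than expected."""


def guarded_delete(conn, sql, params=(), expected_rows=1):
    """Run a DELETE, committing only if exactly `expected_rows` rows were removed.

    Hypothetical helper: sqlite3 opens a transaction implicitly before the
    DELETE, so a rollback restores every row the statement removed.
    """
    if not sql.lstrip().upper().startswith("DELETE"):
        raise ValueError("guarded_delete only accepts DELETE statements")
    try:
        cur = conn.execute(sql, params)
        if cur.rowcount != expected_rows:
            raise UnexpectedRowCount(
                f"expected {expected_rows} row(s), statement matched {cur.rowcount}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        conn.rollback()  # undo the delete before re-raising
        raise


def make_demo_db():
    """Build an in-memory database with three made-up orbs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orbs (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO orbs (name) VALUES (?)",
        [("acme/build",), ("acme/deploy",), ("acme/test",)],
    )
    conn.commit()
    return conn


def orb_names(conn):
    """Return all orb names in sorted order."""
    return [row[0] for row in conn.execute("SELECT name FROM orbs ORDER BY name")]


if __name__ == "__main__":
    conn = make_demo_db()
    # Intended operation: remove exactly one orb.
    guarded_delete(conn, "DELETE FROM orbs WHERE name = ?", ("acme/deploy",))
    print(orb_names(conn))
    # Over-broad statement: matches more than one row, so it is rolled back.
    try:
        guarded_delete(conn, "DELETE FROM orbs WHERE name LIKE ?", ("acme/%",))
    except UnexpectedRowCount as exc:
        print("refused:", exc)
    print(orb_names(conn))
```

A row-count check like this is only one possible safeguard. It catches a statement that matches far more rows than intended, but it does not replace purpose-built administrative tooling or access restrictions on the production database.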

Additionally, the time required to restore data from our backups was longer than we consider acceptable. We’re actively working to improve our backup and restore procedures and documentation to ensure that any downtime required to restore missing data in the future is minimized.

Posted Sep 29, 2022 - 21:57 UTC

Resolved
This incident has been resolved. If you encounter issues with any specific orb versions please contact support. We appreciate your patience and understanding.
Posted Sep 22, 2022 - 00:16 UTC
Monitoring
A fix has been implemented and we are monitoring the results. Jobs utilizing orbs should begin to run as usual again. We have restored the data up until 10 hours ago. This means that ~5 orbs have updates that haven't been restored yet. We are now starting to restore the data for those orbs.
Posted Sep 21, 2022 - 23:45 UTC
Update
As we continue working on a solution, please avoid publishing orbs. This will be a temporary ask for the duration of this incident.
Posted Sep 21, 2022 - 23:29 UTC
Update
We are continuing our work on deploying a fix. Customers utilizing orbs are still experiencing job failures. We deeply apologize for this inconvenience and thank you for your patience.
Posted Sep 21, 2022 - 23:18 UTC
Identified
The cause for this issue has been identified. Thank you for your patience as we are working on a solution.
Posted Sep 21, 2022 - 22:50 UTC
Update
We are continuing to investigate this issue.
Posted Sep 21, 2022 - 22:49 UTC
Investigating
Users are seeing jobs fail when utilizing orbs. We are investigating the cause for this disruption at this time.
Posted Sep 21, 2022 - 22:48 UTC
This incident affected: Docker Jobs, Machine Jobs, macOS Jobs, and Windows Jobs.