On September 21, 2022, at 22:29 UTC, the orb index was inadvertently deleted when an incorrect command was issued in a production database. As a result, all orbs were unavailable, and jobs referencing orbs failed, for the duration of the incident. We were able to restore the majority of the contents of that database; however, the most recent backup at that time was 45 minutes old, and any changes to orbs during that 45-minute window were lost. We want to apologize for any disruption this might have caused and to thank our customers for their patience and understanding as we worked to restore normal service.
(all times UTC)
On September 21, 2022, for 76 minutes beginning at 22:29, any job that referenced an orb failed with an error stating that the referenced orb couldn’t be found. Additionally, orb pages in the registry returned a 404 Not Found error, and orbs in the CircleCI UI and CLI were inaccessible. Jobs that did not use orbs were unaffected.
This issue was traced back to an incorrectly entered command that was issued against all orbs in the orbs database. In most circumstances, our teams use our internal administrative tools to remove orbs at a customer's request. In this instance, however, an error in that tooling required an engineer to perform a manual operation to remove the requested orb. During that process, an incorrectly entered command removed all orbs from the database rather than the specific orb requested.
An incident was declared at 22:48, moving to identified at 22:50. Once identified, the team set about restoring the orbs database to a known good state. On first pass, the team restored a database snapshot from earlier that same day. At this point, 58 minutes after the incident began, functionality was restored and jobs referencing orbs began completing as expected.
Following that, at 23:45, the team began retrieving orb changes that occurred between the snapshot time and the time of the deletion. They were able to restore all but the final 45 minutes of changes before the deletion, and the incident was resolved at 00:16.
As a result, any changes made to orbs on September 21, 2022, between 21:44 and 22:29 were lost. No other customer data, aside from orbs, was affected.
Ultimately, human errors are system problems that we can and will solve. In this case, an error in our internal tooling led to a manual procedure that lacked sufficient guardrails to prevent a mistake from occurring. We have removed this specific process for deleting orbs through direct access to our production database, and we are improving our internal administrative tooling to handle orb removal with appropriate safeguards in place.
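As an illustration only, one common form of guardrail for a targeted delete is to run it inside a transaction and commit only if the number of affected rows matches what the operator expected. The sketch below is not the actual tooling or schema involved in this incident: the `orbs` table, its `name` column, the sample orb names, and the `delete_orb` and `make_demo_db` helpers are all hypothetical, and SQLite stands in for whatever database is really used.

```python
import sqlite3


class UnexpectedRowCount(Exception):
    """Raised when a delete would affect a different number of rows than expected."""


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a hypothetical `orbs` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orbs (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO orbs (name) VALUES (?)",
        [("acme/deploy",), ("acme/test",), ("example/lint",)],  # made-up names
    )
    conn.commit()
    return conn


def delete_orb(conn: sqlite3.Connection, name: str, expected_rows: int = 1) -> int:
    """Delete the orb called `name`, committing only if exactly
    `expected_rows` rows were affected; otherwise roll back and raise."""
    if not name:
        # An empty filter is the kind of input that can turn a targeted
        # delete into a table-wide one, so reject it before touching data.
        raise ValueError("refusing to delete without an orb name")
    try:
        cur = conn.execute("DELETE FROM orbs WHERE name = ?", (name,))
        if cur.rowcount != expected_rows:
            raise UnexpectedRowCount(
                f"expected {expected_rows} row(s), statement affected {cur.rowcount}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        # Undo the uncommitted delete before surfacing the error.
        conn.rollback()
        raise


if __name__ == "__main__":
    db = make_demo_db()
    delete_orb(db, "acme/deploy")  # removes exactly one row and commits
    try:
        delete_orb(db, "missing/orb")  # matches no rows, so it is rolled back
    except UnexpectedRowCount as exc:
        print(f"blocked: {exc}")
```

The design choice here is that the check happens before the commit, so a statement that matches too many (or too few) rows leaves the data unchanged instead of relying on a backup to undo it.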
Additionally, the time required to restore data from our backups was longer than we consider acceptable. We’re actively working to improve our backup and restore procedures and documentation to ensure that any downtime required to restore missing data in the future is minimized.