On September 14, 2022, from approximately 07:45 to 17:05 UTC, customer pipelines were delayed due to high load on an internal database that is central to coordinating and orchestrating work on our platform. During this window, all customer pipelines experienced delays in starting. As part of the remediation effort, parts of the application UI were disabled between 08:21 and 15:34 UTC to reduce load on the impacted system.
The original status page can be found here: Pipelines not loading
We want to thank our customers for their patience and understanding as we worked to resolve this incident.
At approximately 07:40 UTC on September 14, there was a sudden increase in disk I/O and CPU utilization on two read-only replicas of a database that is part of the system responsible for coordinating and orchestrating work on the CircleCI platform. By 07:54 internal circuit breakers tripped and an incident was declared.
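For readers unfamiliar with the term, a circuit breaker stops sending calls to a dependency once it has failed repeatedly, then allows a trial call after a cool-down period. The following is a minimal Python sketch of that general pattern only. It is not the implementation used on our platform, and the class name, thresholds, and injectable clock are illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Illustrative count-based circuit breaker (hypothetical names and defaults)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock                          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None

    @property
    def is_open(self):
        """True while calls should be rejected."""
        if self.opened_at is None:
            return False
        # After the cool-down, let one trial call through ("half-open").
        return self.clock() - self.opened_at < self.reset_after

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            raise RuntimeError("circuit open: call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Trip (or re-trip after a failed trial call) and restart the cool-down.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch the protected function would be a database call; once the breaker opens, callers fail fast instead of adding more queries to an already overloaded replica.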
During the incident, engineers rolled the pods of the affected services several times in an attempt to terminate slow-running queries and free up connection pools. The team also enabled an “incident mode” flag that disables portions of the application user interface to help reduce the volume of calls into the database.
This did succeed in reducing the load temporarily, but the load would quickly spike again as the work queue backed up. After further analysis, the team determined that there were a number of factors contributing to the database load:
All of these factors contributed to pushing the affected database into a state where it could not keep up with the workload.
The system was restored to full functionality by scaling up the replicas to match the primary and reverting the change to the database library.
By 17:05 UTC, the remediation work was complete, all pipelines were processing normally, and the UI was re-enabled. Customer builds continued to flow, at reduced capacity, for the entire duration of the incident. In retrospect, we did not communicate this effectively while the incident was in progress.
Our database engineers have initiated a review and audit of our critical systems to evaluate and shift applicable workloads (like reporting and analytics) away from critical production systems. This work has been completed for the workflow system affected by this incident. For the remainder of our critical systems it is ongoing and is targeted for completion by the end of 2022.
The change to enable SQL comments has now been gated behind a configuration flag and is not enabled by default.
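To illustrate what gating this behavior behind a flag can look like, here is a minimal Python sketch. It is not the database library or configuration used on our platform: the environment variable name `SQL_COMMENTS_ENABLED`, the function names, and the comment format are all hypothetical. The point it shows is that queries are sent unmodified unless the flag is explicitly turned on.

```python
import os
from urllib.parse import quote


def sql_comments_enabled(env=None):
    """Return True only when the (hypothetical) flag is explicitly switched on."""
    env = os.environ if env is None else env
    value = env.get("SQL_COMMENTS_ENABLED", "false")
    return value.strip().lower() in {"1", "true", "yes"}


def annotate_query(sql, tags, enabled=False):
    """Append a trailing comment of key='value' tags, but only when enabled.

    Keys and values are percent-encoded so a tag cannot close the comment early.
    """
    if not enabled or not tags:
        return sql  # default path: the query text is left untouched
    parts = ",".join(
        f"{quote(str(key), safe='')}='{quote(str(value), safe='')}'"
        for key, value in sorted(tags.items())
    )
    return f"{sql} /*{parts}*/"


if __name__ == "__main__":
    query = "SELECT id FROM workflows WHERE status = 'queued'"
    print(annotate_query(query, {"service": "workflows"}, enabled=sql_comments_enabled()))
```

Defaulting to off means a deployment has to opt in to the extra query text, so the behavior can be rolled out gradually and switched off without reverting code.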
We are revising our incident process so that we communicate more effectively with our customers during incidents.