On Wednesday, November 18 at 07:00 UTC, our engineers were alerted that our workflow queues were not being processed. At the time of the incident, our engineers observed various network errors and began to investigate. After approximately 15 minutes the issue resolved itself and workflow queues began to process again. At 07:25 UTC the incident was set to monitoring while our engineers continued to investigate. Our monitoring showed that a database backup job had begun at approximately the same time as the incident; during this window we also observed a large number of TCP retransmits. This led us to believe the cause was a database or network issue. However, because the incident had resolved, our engineers were no longer able to observe the system in a degraded state.
At 14:10 UTC we were once again alerted to an issue. As in the earlier incident, our workflow queues were not being processed. At approximately 14:45 UTC our engineers rolled the affected Kubernetes pods. This was not seen as a long-term fix, but it quickly resolved the issue and allowed our systems to resume processing customer workflows. It is important to note that this action was taken only after a discussion of whether it was better to keep observing the incident in its degraded state or to mitigate the impact on our customers.
Between the 14:10 incident and the next incident, which began at 17:00 UTC, our engineering teams continued their investigation into the cause of the outages. During this time the issue was identified and work began to address it: an uneven distribution of load-balanced traffic was amplifying an existing database lock contention issue. When the third incident began, our teams took immediate steps to mitigate the impact felt by our customers, proactively rolling the affected Kubernetes pods and resuming customer workflows; because we acted immediately, we recovered more quickly than before. The code change to address the load-balancing issue was pushed to production at 17:42 UTC. Once it was deployed, our engineering teams observed the fix for several hours and marked the incident resolved at 20:34 UTC. The following day, our engineering teams pushed further code fixes to production that addressed the lock contention issue and implemented a more robust load-balancing mechanism than the one shipped the evening before.
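To make the failure mode concrete, here is a minimal, hypothetical sketch (not our actual system or code) of why skewed routing amplifies lock contention: requests that need the same database lock serialize, so total waiting time grows rapidly with the depth of any one backend's queue. The `total_wait` function and the queue-depth numbers below are illustrative assumptions only.

```python
# Hypothetical model: each request holds a shared lock for `hold_time`
# seconds, and requests routed to the same backend serialize on it.
def total_wait(queue_depths, hold_time=1.0):
    """Sum of time all requests spend waiting for the lock.

    The k-th request in a queue of depth `depth` waits for the
    (k) requests ahead of it, i.e. k * hold_time seconds.
    """
    return sum(hold_time * k
               for depth in queue_depths
               for k in range(depth))

even = [4, 4, 4, 4]      # 16 requests spread evenly over 4 backends
skewed = [13, 1, 1, 1]   # the same 16 requests, mostly on one backend

print(total_wait(even))    # 24.0 seconds of aggregate waiting
print(total_wait(skewed))  # 78.0 seconds for the same total load
```

The same total traffic produces far more aggregate lock waiting when it is concentrated on one backend, which is why evening out the load-balanced traffic reduced the contention even before the underlying locking issue itself was fixed.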
What we learned and our next steps
Every incident at CircleCI is followed up with an Incident Review. The review process allows us to reflect on what happened, what we learned, and, when possible, to turn those learnings into actions.
We value our customers tremendously and apologize for any disruption you may have experienced in your work. Thank you for your support.