Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected the following areas of our application:
The underlying issues for each incident are largely independent. We have detailed each incident separately on our status page, and the full report is available here. Each of the six incidents has been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
On Oct 29th, 2019 at 15:10 UTC, CircleCI engineers detected slow response times and an elevated error rate for an internal GraphQL service. At 15:20 UTC, instances of the GraphQL service began failing Kubernetes liveness checks, which led Kubernetes to restart these service instances. The restarts caused in-flight requests to be aborted, leading to a spike in the error rate for the service. This, in turn, led to errors and timeouts when loading parts of the CircleCI UI. The restarts also reduced the number of available instances of the GraphQL service, which put increased load on the remaining instances and compounded the underlying problem.
Initially, we suspected that the restarts were caused by misconfigured memory settings for the GraphQL service. These settings were corrected and the service was redeployed at 16:05 UTC, but the restarts continued and the error rate for the service remained elevated.
At approximately 16:35 UTC, load on the GraphQL service began decreasing and liveness checks began passing again. With liveness checks passing, the instance restarts ceased and error rates returned to normal levels. The 504 Gateway Timeout errors in the UI subsided shortly thereafter, but slow response times from the GraphQL service persisted.
Further investigation via thread dumps found that a majority of the threads in the GraphQL service were blocked in calls to resolve the address of another internal CircleCI service. Of these threads, only one was performing an address lookup with the system resolver. The remaining threads were instead waiting for a lock within Java's DNS caching code. We subsequently discovered that our practice of disabling this cache (by setting the cache’s TTL to 0 seconds) was inadvertently causing these address lookups to be executed sequentially rather than in parallel. This locking code is designed to prevent redundant lookup requests from being made while populating the cache, but in our case, it instead became a bottleneck on our request processing.
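The effect of that lock can be illustrated with a small simulation. This is a minimal sketch and not the JDK's actual caching code: the class `SerializedLookupDemo`, its methods, the host name `svc.internal`, and the returned address are all hypothetical, and `Thread.sleep` stands in for a slow call to the system resolver.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of lookups that all queue behind one shared lock.
final class SerializedLookupDemo {
    // Stand-in for the lock inside the DNS caching code (an assumption,
    // not the real JDK data structure).
    private static final ReentrantLock LOOKUP_LOCK = new ReentrantLock();

    // Simulates a resolver call that takes `millis` to answer.
    static String slowResolve(String host, long millis) throws InterruptedException {
        Thread.sleep(millis);
        return "10.0.0.1"; // placeholder address
    }

    // Only one thread at a time may perform the lookup; the rest wait.
    static String resolveBehindLock(String host, long millis) throws InterruptedException {
        LOOKUP_LOCK.lock();
        try {
            return slowResolve(host, millis);
        } finally {
            LOOKUP_LOCK.unlock();
        }
    }

    // Runs one lookup per thread and returns the total wall-clock time in ms.
    static long timeLookups(int threads, long millis, boolean serialized) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                tasks.add(() -> serialized
                        ? resolveBehindLock("svc.internal", millis)
                        : slowResolve("svc.internal", millis));
            }
            long start = System.nanoTime();
            for (Future<String> result : pool.invokeAll(tasks)) {
                result.get(); // surface any failure from the task
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (Exception e) {
            throw new IllegalStateException("lookup simulation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("parallel:   " + timeLookups(8, 50, false) + " ms");
        System.out.println("serialized: " + timeLookups(8, 50, true) + " ms");
    }
}
```

In this model the serialized run takes about the number of threads multiplied by the lookup latency, while the parallel run takes about one lookup latency. That is the shape of the bottleneck described above: request threads spend their time waiting for the lock rather than doing work.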
At 20:05 UTC, a change was deployed to the GraphQL service to re-enable Java DNS caching with a TTL of 5 seconds. Following this change, we observed an increase in throughput and a reduction in response times.
_Response time improvement (log scale; Oct 29th in grey, Oct 30th in blue)_
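For readers who want to apply a similar setting, one common way to control the JVM's positive DNS cache is the `networkaddress.cache.ttl` security property. The sketch below assumes that property is the mechanism in use, which the report does not state. The class `DnsCacheConfig` and the method `enableDnsCache` are hypothetical names for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical helper; the property name is the standard JDK networking property.
final class DnsCacheConfig {
    static final String TTL_PROPERTY = "networkaddress.cache.ttl";

    // Enables caching of successful lookups for the given number of seconds.
    // A value of 0 disables the cache, which is the setting that caused the
    // sequential lookups described above, so it is rejected here.
    static void enableDnsCache(int ttlSeconds) {
        if (ttlSeconds <= 0) {
            throw new IllegalArgumentException("TTL must be positive, got " + ttlSeconds);
        }
        Security.setProperty(TTL_PROPERTY, Integer.toString(ttlSeconds));
    }

    public static void main(String[] args) throws UnknownHostException {
        // Configure the cache early, before the application resolves any host.
        enableDnsCache(5);
        InetAddress address = InetAddress.getByName("localhost");
        System.out.println(address + " (ttl=" + Security.getProperty(TTL_PROPERTY) + "s)");
    }
}
```

The JVM typically reads this property once, when its address cache is first initialized, so the call belongs at the very start of the program. The same value can also be set in the JVM's `java.security` configuration file. A short TTL such as 5 seconds keeps address changes visible quickly while still letting most lookups be answered from the cache.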
Who was affected?
Customers using the CircleCI UI between 15:20 and 16:35 UTC may have experienced errors. In particular, customers viewing workflows and workflow maps, re-running a workflow from the beginning, approving jobs, or managing contexts and orbs may have experienced these errors. Customers using the UI in the hours before and after this time period may have experienced slowness and occasional timeouts.
What did we do to resolve the issue?
The most severe symptoms of this incident resolved without intervention as load on the service decreased. Later, we changed the Java DNS caching configuration of the internal GraphQL service, which improved its performance and, by extension, that of the CircleCI UI.
What are we doing to make sure this doesn’t happen again?
The Java DNS caching configuration change described above is being applied to all of CircleCI’s Java services. The memory configuration changes made to the GraphQL service during the incident have been made permanent and will improve the service’s reliability. Additional caching was also introduced in the GraphQL service to further improve performance.