Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected multiple areas of our application.
The underlying issues for each incident were largely independent. We have detailed each incident separately on our status page, and the full report is available here. All six incidents have been fully resolved, and no action is needed from any user at this time.
We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.
October 22nd - New onboarding & pipelines experiences inaccessible for some users
On October 22nd, 2019, at 20:14 UTC, we received synthetic testing alerts that our user onboarding and user settings pages were experiencing loading failures. Upon investigation, we found that users of our new user interface were seeing either white screens or pages that were not fully loading.
On October 23rd, 2019, AWS declared that there had been a DDoS (Distributed Denial of Service) attack on their Route 53 DNS service. The attack began on October 22nd, 2019, at 18:30 UTC, and was resolved by October 23rd, 2019, at 02:30 UTC. The resulting DNS disruption prevented our CDN provider from properly resolving the locations of our assets, causing the white screens and partial page loads.
Who was affected?
Users accessing the new CircleCI user interface or user onboarding between 20:14 UTC and 21:16 UTC on October 22nd may have been affected.
What did we do to resolve the issue?
Once we identified that our CDN provider was no longer able to serve our page assets, we pushed a change that allowed our web servers to serve those assets directly. After the change was deployed, our internal and synthetic testing confirmed that pages were loading properly again.
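The fallback described above can be sketched in a few lines. This is a hypothetical illustration, not CircleCI's actual implementation: the hostnames, probe path, and health-check mechanics are all assumptions. The idea is simply to select an asset base URL from the CDN when a lightweight probe succeeds, and from the origin web servers otherwise.

```python
import urllib.request

# Hypothetical hostnames and probe path -- illustrative only.
CDN_BASE = "https://cdn.example.com"
ORIGIN_BASE = "https://app.example.com"
PROBE_PATH = "/assets/health-probe.txt"


def cdn_healthy(timeout: float = 2.0) -> bool:
    """Return True if the CDN serves the probe asset with HTTP 200."""
    try:
        with urllib.request.urlopen(CDN_BASE + PROBE_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Network errors, DNS failures, and timeouts all count as unhealthy.
        return False


def asset_base_url(probe=cdn_healthy) -> str:
    """Serve assets from the CDN when healthy, else fall back to origin."""
    return CDN_BASE if probe() else ORIGIN_BASE
```

The probe is injectable so the selection logic can be exercised without live network calls; in practice the fallback would more likely live in a load balancer or CDN configuration than in application code.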
What are we doing to make sure this doesn’t happen again?
We will continue to monitor the health of our cloud provider services with both internal and synthetic testing to reduce our time to recovery in the event that a similar outage occurs.
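A synthetic check of the kind mentioned above can do more than verify an HTTP 200: the failure mode in this incident was a page that returned successfully but rendered blank because its assets never loaded. A minimal sketch, assuming a check that inspects the response body for asset references (the pattern and pass criteria are illustrative assumptions, not our actual monitoring rules):

```python
import re

# A page that returns 200 but references no scripts or stylesheets is
# likely a "white screen" -- the HTML shell loaded but its assets did not.
ASSET_PATTERN = re.compile(r'<script[^>]+src=|<link[^>]+href=')


def page_loads_fully(status: int, body: str) -> bool:
    """Pass only if the page returned 200 AND references at least one
    static asset, guarding against blank-but-successful responses."""
    return status == 200 and bool(ASSET_PATTERN.search(body))
```

Status and body are passed in explicitly so the rule can be tested offline; a real synthetic monitor would fetch the page (and ideally the referenced assets themselves) on a schedule and alert on failures.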