New Onboarding & Pipelines Experiences inaccessible for some users

Incident Report for CircleCI

Postmortem

Between October 22nd and October 29th, 2019, CircleCI experienced a number of incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected the following areas of our application:

Onboarding and pipelines experiences
Workflows
Machine and remote Docker jobs
CircleCI UI

The underlying issues for each incident are largely independent. We have detailed each incident separately on our status page, and the full report is available here. Each of the six incidents has been fully resolved, and no action is needed from any user at this time.

We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.

October 22nd - New onboarding & pipelines experiences inaccessible for some users

What happened?

On October 22nd, 2019, at 20:14 UTC we received synthetic testing alerts that our user onboarding and user settings pages were experiencing loading failures. Upon investigation, we found that users of our new user interface were receiving either white screens or pages that were not fully loading.

We identified the cause of this issue to be errors retrieving page assets, such as JavaScript and CSS, from our CDN provider. While diagnosing these errors, some of our staff found they could retrieve the underlying assets directly while others could not, depending on their DNS configuration. This led us to believe that the cause was an intermittent, external DNS resolution failure affecting our CDN’s ability to retrieve our static assets from S3. At 20:57 UTC, we pushed a patch that allowed our web servers to serve these page assets directly. After the patch was deployed, we confirmed that by 21:14 UTC the issue had been remediated and users were no longer experiencing problems with the website.
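The fallback described above can be illustrated with a minimal sketch. This is not the actual patch; the host name, the path, and the `cdn_healthy` callback are hypothetical and stand in for whatever signal decides that the CDN cannot serve assets.

```python
from typing import Callable

# Hypothetical locations, for illustration only.
CDN_BASE = "https://cdn.example.com/assets"  # assets fetched through the CDN
ORIGIN_BASE = "/assets"                      # assets served directly by the web servers


def asset_base(cdn_healthy: Callable[[], bool]) -> str:
    """Return the base URL that pages should use for static assets."""
    try:
        return CDN_BASE if cdn_healthy() else ORIGIN_BASE
    except Exception:
        # A health check that itself fails (for example on a DNS error)
        # is treated the same as an unhealthy CDN.
        return ORIGIN_BASE


def asset_url(name: str, cdn_healthy: Callable[[], bool]) -> str:
    """Build the full URL for one asset, such as 'app.js' or 'app.css'."""
    return f"{asset_base(cdn_healthy)}/{name.lstrip('/')}"
```

With a passing health check, `asset_url("app.js", ...)` points at the CDN. With a failing one, it points at the path served by the web servers, so pages keep loading while the CDN cannot resolve the asset origin.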

On October 23rd, 2019, AWS declared that there had been a DDoS (Distributed Denial of Service) attack on their Route 53 DNS service. The attack began on October 22nd, 2019, at 18:30 UTC, and was resolved by October 23rd, 2019, at 02:30 UTC. The resulting DNS failures prevented our CDN provider from properly resolving the locations of our assets, causing the white screens and partial page loads.

Who was affected?

Users of the new CircleCI user interface or the user onboarding experience who accessed the site between 20:14 UTC and 21:16 UTC on October 22nd, 2019, may have been affected.

What did we do to resolve the issue?

Once we identified that our CDN provider was no longer able to serve our page assets, we pushed a change that allowed our web servers to serve assets directly. After this change was deployed, our internal testing and synthetic testing confirmed that pages were loading properly again.

What are we doing to make sure this doesn’t happen again?

We will continue to monitor the health of our cloud provider services with both internal and synthetic testing to reduce our time to recovery in the event that a similar outage occurs.
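As an illustration of the kind of synthetic check mentioned above, the sketch below probes a list of asset URLs and reports the ones that fail to load. It is an assumption about how such a check could look, not a description of our monitoring. The function names are hypothetical, and the `fetch` parameter is injectable so the logic can be exercised without network access.

```python
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL and return its HTTP status code.

    urlopen raises on DNS failures, timeouts, and 4xx/5xx responses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def failing_assets(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> List[str]:
    """Return the URLs that did not load with a 2xx status."""
    failures = []
    for url in urls:
        try:
            ok = 200 <= fetch(url) < 300
        except Exception:
            # Any error while fetching counts as a failed load.
            ok = False
        if not ok:
            failures.append(url)
    return failures
```

A non-empty result from `failing_assets` would be the signal to raise an alert. Running the same probe from hosts with different DNS resolvers could help surface failures that depend on DNS configuration, as in this incident.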

Posted Nov 15, 2019 - 22:54 UTC

Resolved

This incident has been resolved. app.circleci.com, account.circleci.com, and onboarding.circleci.com are fully operational for all users.

Posted Oct 22, 2019 - 20:27 UTC

Monitoring

Our engineers have deployed a fix and are currently monitoring the results.
Any UI loading errors encountered while viewing app.circleci.com, account.circleci.com, and onboarding.circleci.com are resolving successfully. Some users may need to refresh their screens to see pages load.

Posted Oct 22, 2019 - 20:16 UTC

Identified

The issue has been identified and we are working on a fix.

Posted Oct 22, 2019 - 19:50 UTC