Build failures and delays due to upstream service disruption

Incident Report for CircleCI

Postmortem

Summary

On October 20, 2025, from 06:55 to 22:00 UTC, CircleCI customers experienced a variety of failures and delays across the system, including job failures, pipeline creation failures, periods of increased queueing, errors in the UI and APIs, and a period of unavailability for audit log streaming, caused by an AWS outage affecting multiple AWS services in the US-EAST-1 region. We thank our customers for their patience and understanding as we worked to resolve this incident.

The original status page for this incident can be found here.

What Happened

(All times UTC)

On October 20, 2025 at 06:55 the system that handles requests to create new pipelines (builds service) began to receive 503 Service Unavailable error responses when attempting to retrieve pipeline values from AWS S3, resulting in failures and delays across multiple executor types.
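
To illustrate the failure mode (this is not CircleCI's actual implementation; the bucket, key, and retry settings below are placeholders), a 503 from S3 surfaces to the calling service as an SDK error that must either be retried or propagated as a pipeline creation failure. A minimal sketch in Python with boto3:

    # Illustrative sketch only. Shows how an S3 503 (Service Unavailable) surfaces
    # through boto3 and why exhausted retries become pipeline creation failures upstream.
    import time
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def fetch_pipeline_values(bucket: str, key: str, attempts: int = 5) -> bytes:
        """Read a pipeline-values object, retrying 5xx responses with backoff."""
        for attempt in range(attempts):
            try:
                resp = s3.get_object(Bucket=bucket, Key=key)
                return resp["Body"].read()
            except ClientError as err:
                status = err.response["ResponseMetadata"]["HTTPStatusCode"]
                if status in (500, 503) and attempt < attempts - 1:
                    time.sleep(2 ** attempt)  # back off before retrying
                    continue
                raise  # out of retries: the caller records a pipeline creation failure

When the upstream error rate stays elevated longer than any reasonable retry budget, as it did here, these failures surface to customers as the pipeline creation errors described above.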

At 07:11, AWS declared an incident indicating that multiple AWS services were experiencing increased error rates and latencies in the US-EAST-1 region. During the same period, CircleCI customers began experiencing premature failures on running jobs across multiple executors. Audit log access and streaming were also unavailable at this time. CircleCI declared an incident at 07:29.

At 07:40, builds service started to encounter errors when attempting to read and write from S3. This resulted in errors loading the pipelines page in the CircleCI UI as well as errors creating pipelines. At the same time, the service responsible for starting workflows also encountered issues accessing S3, resulting in some jobs being dropped. Pipeline creation slowly began to recover at 08:24, and completely recovered by 09:14 along with workflow runs. During this time, Docker, Mac M1 and M4, Windows, and Android jobs began to run again. Linux, Remote Docker, and Mac M2 jobs continued to queue.

At 09:30, the system responsible for provisioning machines was unable to start new instances in AWS EC2 to meet the growing backlog of customer jobs. At 10:35, AWS acknowledged the increased error rates when launching EC2 instances. Docker jobs continued to run, but customers experienced high queue times as demand grew.
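
The EC2 impact appeared as API-level launch errors rather than slow boots, so each failed launch left the corresponding job in the queue. As a rough illustration only (the instance parameters and error codes below are assumptions about how such failures typically surface, not a description of our provisioning service):

    # Illustrative sketch only. Launch errors such as InsufficientInstanceCapacity
    # or RequestLimitExceeded leave the job queued until a later attempt succeeds.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def try_launch(ami_id: str, instance_type: str) -> str | None:
        """Attempt one instance launch; return the instance ID, or None if EC2 refuses."""
        try:
            resp = ec2.run_instances(
                ImageId=ami_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code in ("InsufficientInstanceCapacity", "RequestLimitExceeded", "InternalError"):
                return None  # leave the job queued and retry on the next provisioning pass
            raise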

By 12:20, the queue time for Docker jobs had dropped to 18 minutes from a previous high of 50 minutes. The queue of Docker jobs continued to shrink and had fully cleared by 14:00. At 14:30, CircleCI engineers made a change to prevent the infrastructure responsible for processing Docker jobs from scaling down and releasing instances, as EC2 was still experiencing provisioning failures.
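
For readers unfamiliar with the mechanism, one common way to hold on to existing capacity during an EC2 provisioning outage is to suspend the scale-in and replacement processes on an EC2 Auto Scaling group. The sketch below illustrates that general approach only; the group name is a placeholder and this is not a description of CircleCI's internal tooling:

    # Illustrative sketch only: suspend the processes that terminate or replace
    # instances so existing capacity is retained while EC2 launches are failing.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    autoscaling.suspend_processes(
        AutoScalingGroupName="docker-task-fleet",  # placeholder group name
        ScalingProcesses=["Terminate", "ReplaceUnhealthy"],
    )

    # After EC2 provisioning recovers, resume normal scaling with
    # autoscaling.resume_processes(...) using the same arguments.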

At 15:28, Mac M4 machines became temporarily unavailable because our connectivity to the Mac environment relies on AWS networking. As network stability improved, Mac workloads were able to resume.

At 17:10, machine jobs (Linux, Remote Docker, Mac M1 and M2) began starting again after AWS applied a fix to EC2.

At 17:30, Mac M4 machines began to successfully boot again after AWS applied fixes to the network load balancers. At the same time, CircleCI’s task queue hit its peak and began to decrease as AWS service health continued to improve.

The backlog of jobs that had built up during the incident continued to clear over the next few hours. By 22:00, all customer jobs in the backlog had been processed, and new incoming jobs were being processed at pre-incident rates.

We continued to monitor our systems, and the CircleCI incident was marked resolved at 22:15.

Future Prevention and Process Improvement

While outages of this scale are rare, we are strengthening both our resilience and the experience our customers have during major upstream disruptions.

During this incident, we evaluated options to shift workloads to backup compute providers and different AWS regions. Based on the errors we were observing across multiple AWS services as well as instability in several upstream dependencies also impacted by the outage, we did not have confidence that shifting workloads would improve the situation. There was a significant risk of introducing new failures without reducing the disruption customers were already experiencing. In addition, based on the duration and scope of similar AWS incidents in the past, it was unlikely that shifting workloads mid-incident would restore service faster than AWS could complete their recovery. As a result, jobs continued to queue and then ran as expected once AWS restored stability.

Additionally, in the event of an AWS incident anticipated to span multiple days or longer, we would initiate disaster recovery to a different region. However, in this particular case, we had information from AWS indicating that the service would be restored in a timeframe where waiting was the best approach for our customers.

In order to ensure that our customers have the information they need to make timely decisions, we are also investing in improving our status page communication during major incidents. These improvements include:

  • Clearer visibility into which CircleCI functionality is impacted at any point in time.
  • More frequent updates based on our own service health, not only upstream reports.
  • Clearer guidance on how jobs will behave during disruptions.

Our goal is to improve transparency, recovery options, and the predictability of the customer experience when the unexpected happens.

Posted Nov 05, 2025 - 15:01 UTC

Resolved

This incident has been resolved. All job types are now operating normally with standard queue times. Linux machine and Remote Docker capacity has been fully restored following the AWS incident recovery.

Thank you for your patience.
Posted Oct 20, 2025 - 22:15 UTC

Monitoring

Queue times for all job types have returned to normal as AWS continues to recover from their incident (https://health.aws.amazon.com/health/status).

All job types: Operating at normal capacity with standard queue times
Performance has been fully restored across Linux machine, Remote Docker, Docker, ARM, IP-Ranges Docker, and Mac jobs. Windows and Android jobs were unaffected throughout the incident.

We're continuing to monitor system performance closely to ensure stability. Thank you for your patience during this incident.
Posted Oct 20, 2025 - 21:42 UTC

Update

We're seeing meaningful improvement in Linux machine and Remote Docker job performance as AWS recovers from their incident (https://health.aws.amazon.com/health/status).

Current Status:
Linux machine and Remote Docker jobs: Average wait times have improved to approximately 10-15 minutes
Docker and Mac jobs: Operating normally, though ARM and IP-Ranges Docker jobs may still experience longer queue times
Windows and Android jobs: Unaffected

While we haven't returned to full capacity, the situation is steadily improving. We continue to work on recovery and are monitoring queue times closely.
Thank you for your patience during this incident. We'll continue to update you as conditions improve.
Posted Oct 20, 2025 - 21:08 UTC

Update

We continue to experience capacity limitations for Linux machine and Remote Docker jobs related to the AWS incident (https://health.aws.amazon.com/health/status). Docker, Mac, Windows, and Android jobs are all operating normally.

We're actively monitoring the situation and will continue to provide updates as the situation evolves.
We appreciate your patience as we work through these infrastructure constraints.
Posted Oct 20, 2025 - 20:06 UTC

Update

We continue to experience issues running Linux machine and Remote Docker jobs due to capacity issues at AWS following their recent incident (https://health.aws.amazon.com/health/status).

Docker and Mac job performance has recovered to normal levels, and Windows and Android jobs are still unaffected.
Posted Oct 20, 2025 - 18:31 UTC

Update

We are continuing to experience high levels of errors when attempting to launch instances in AWS (https://health.aws.amazon.com/health/status).
Due to this, we are currently unable to run Linux machine and Remote Docker jobs, and Docker is scaling slowly.
Additionally, AWS network instability is preventing us from booting our macOS M4 fleet.

Customers attempting to trigger Docker jobs will see queueing with slow progress.
Windows and Android jobs are unaffected.
Posted Oct 20, 2025 - 17:01 UTC

Update

We continue to experience delays in acquiring new instances from AWS (https://health.aws.amazon.com/health/status) and are actively monitoring recovery. In addition, we’re investigating issues affecting macOS M4 Pro jobs where a critical network service is intermittently failing, causing increased queue times. Our teams are working to mitigate the impact and will provide updates as we learn more.
Posted Oct 20, 2025 - 15:45 UTC

Update

We continue to see delays in getting instances from AWS (https://health.aws.amazon.com/health/status) and are actively monitoring the situation.
Posted Oct 20, 2025 - 14:40 UTC

Identified

We continue to see delays in acquiring instances from AWS to run jobs due to their ongoing incident (https://health.aws.amazon.com/health/status). This is causing delays in jobs starting across Docker and Linux.
Posted Oct 20, 2025 - 12:06 UTC

Update

Our upstream service provider has identified the root cause and is actively working on mitigation. We'll continue monitoring and provide updates as more information becomes available.
Posted Oct 20, 2025 - 09:10 UTC

Investigating

We are currently experiencing an issue with an upstream service provider that is causing builds to fail or experience delays in the queue. We will provide updates as more information becomes available.
Posted Oct 20, 2025 - 07:49 UTC
This incident affected: CircleCI Dependencies (AWS) and Docker Jobs, Machine Jobs, macOS Jobs, Windows Jobs, Pipelines & Workflows, Artifacts, Notifications & Status Updates.