On July 19, 2022, from 20:02 to 21:47 UTC, customers experienced delays of up to 10 minutes in job start time. We want to thank our customers for your patience and understanding as we worked to resolve this incident.
The original status page can be found here.
What happened (all times UTC)
At approximately 18:55, one of our database clusters began to show elevated read activity, with a corresponding increase in disk I/O and CPU utilization.
The activity continued to increase, and by 19:30 it began to impact customer job start times through increased request latency.
On-call engineers were alerted, and a public incident was declared at 20:02. At 20:12, the database connection pools became saturated and new connections began to fail, contributing to delays in job start times as our system was slow to process workloads during the event.
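The report doesn't describe the pool implementation, but the failure mode it names is common to any fixed-size connection pool: once every connection is checked out by slow or hung queries, new acquisitions block and eventually fail. A minimal sketch (the class, names, and sizes are illustrative, not our actual system):

```python
import queue

class ConnectionPool:
    """Minimal fixed-size pool: acquire() blocks until a connection is
    free or the timeout expires, then fails fast instead of queuing
    forever. Real pools (e.g. in a DB driver) behave similarly."""

    def __init__(self, size, acquire_timeout):
        self._pool = queue.Queue(maxsize=size)
        for i in range(size):
            self._pool.put(f"conn-{i}")  # stand-ins for real DB connections
        self._timeout = acquire_timeout

    def acquire(self):
        try:
            return self._pool.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("pool saturated: no free connections")

    def release(self, conn):
        self._pool.put(conn)


pool = ConnectionPool(size=2, acquire_timeout=0.1)
a = pool.acquire()
b = pool.acquire()      # every connection is now held by a slow query
try:
    pool.acquire()      # a third caller finds the pool exhausted
except TimeoutError as e:
    print(e)            # pool saturated: no free connections
pool.release(a)
pool.release(b)
```

This is why the saturation at 20:12 surfaced as connection failures for new requests rather than just slowness: callers past the pool size get nothing until an existing connection is returned.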
Through investigation, the team discovered an unusually high volume of requests generated from a specific API endpoint. Steps were taken to address the additional load, and at 21:15 we restarted services with “hung” database connections. This freed the connection pool and allowed the team to scale up the services responsible for processing job output to work through the backlog. This reduced the database pressure, and processing returned to normal by 21:25.
During this incident, most customers experienced job start delays of up to 10 minutes, along with delays in status updates and in the availability of job output. A smaller number of customers had “stuck” jobs that showed as “queued” or “started” with no progress; these had to be manually stopped and restarted.
Future Prevention and Process Improvement
We are prioritizing work to decouple job execution from this database. Once this is complete, a similar incident could delay the display of job output but should not impact job starts, overall execution, or job status reporting.
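The write-up doesn't say how the decoupling will be implemented; one common pattern is to have job execution hand output to a buffer or queue and let a separate consumer persist it, so a slow database delays display but never blocks the job itself. A hypothetical sketch of that pattern (all names are illustrative):

```python
import queue
import threading
import time

output_buffer = queue.Queue()  # stands in for a durable message queue


def run_job(job_id):
    """Job execution only enqueues output; it never touches the database,
    so a slow or saturated database cannot stall the job."""
    for line in ("started", "working", "done"):
        output_buffer.put((job_id, line))
    return "completed"


def db_writer(sink, stop):
    """Separate consumer persists output. If the database is slow, the
    backlog grows here, but job execution above is unaffected."""
    while not stop.is_set() or not output_buffer.empty():
        try:
            job_id, line = output_buffer.get(timeout=0.05)
        except queue.Empty:
            continue
        time.sleep(0.01)  # simulate a slow database write
        sink.append((job_id, line))


sink, stop = [], threading.Event()
writer = threading.Thread(target=db_writer, args=(sink, stop))
writer.start()
status = run_job("job-1")  # returns immediately, regardless of DB speed
stop.set()
writer.join()              # drain the remaining backlog
print(status, len(sink))   # completed 3
```

Under this design, an incident like the one above would show up as a growing output backlog rather than delayed job starts.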
Our teams are also analyzing and improving rate-limiting logic for specific inter-service APIs to prevent excessive request load.
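The post doesn't specify the rate-limiting algorithm; a token bucket is a common choice for this kind of inter-service limit, since it caps sustained request rate while allowing short bursts. A minimal sketch (rates and the fake clock are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket limiter: allows `rate` requests per second with
    bursts up to `capacity`; requests beyond that are rejected."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A fake clock makes the behavior deterministic for the example.
clock = [0.0]
bucket = TokenBucket(rate=1, capacity=2, now=lambda: clock[0])
print([bucket.allow() for _ in range(3)])  # [True, True, False]
clock[0] += 1.0                            # one second passes: one token refills
print(bucket.allow())                      # True
```

Placed in front of the endpoint that generated the excess load, a limiter like this would have shed the abnormal traffic before it reached the database.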