On July 19, 2022, from 20:02 to 21:47 UTC, customers experienced delays of up to 10 minutes in job start time. We want to thank our customers for your patience and understanding as we worked to resolve this incident.
The original status page can be found here.
What happened (all times UTC)
At approximately 18:55, one of our database clusters began to show elevated read activity, with a corresponding increase in disk I/O and CPU utilization.
The activity continued to increase, and by 19:30 it began to impact customer job start times through increased request latency.
On-call engineers were alerted, and a public incident was declared at 20:02. At 20:12, the database connection pools became saturated and new connections began to fail, contributing to delays in job start times as our system was slow to process workloads during the event.
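The report doesn't describe the pool implementation, but the failure mode it names is common to any fixed-size connection pool: once every connection is checked out by slow or hung queries, new acquisitions block and eventually fail. A minimal sketch (the class, names, and sizes are illustrative, not our actual system):

```python
import queue

class ConnectionPool:
    """Minimal fixed-size pool: acquire() blocks until a connection is
    free or the timeout expires, then fails fast instead of queuing
    forever. Real pools (e.g. in a DB driver) behave similarly."""

    def __init__(self, size, acquire_timeout):
        self._pool = queue.Queue(maxsize=size)
        for i in range(size):
            self._pool.put(f"conn-{i}")  # stand-ins for real DB connections
        self._timeout = acquire_timeout

    def acquire(self):
        try:
            return self._pool.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("pool saturated: no free connections")

    def release(self, conn):
        self._pool.put(conn)


pool = ConnectionPool(size=2, acquire_timeout=0.1)
a = pool.acquire()
b = pool.acquire()      # every connection is now held by a slow query
try:
    pool.acquire()      # a third caller finds the pool exhausted
except TimeoutError as e:
    print(e)            # pool saturated: no free connections
pool.release(a)
pool.release(b)
```

This is why the saturation at 20:12 surfaced as connection failures for new requests rather than just slowness: callers past the pool size get nothing until an existing connection is returned.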
Through investigation, the team discovered an unusually high volume of requests generated from a specific API endpoint. Steps were taken to address the additional load, and at 21:15 we restarted services with “hung” database connections. This freed the connection pool and allowed the team to scale up the services responsible for processing job output to work through the backlog. This reduced the database pressure, and processing returned to normal by 21:25.
During this incident, most customers experienced job start delays of up to 10 minutes, along with delays in status updates and in the availability of job output. A smaller number of customers had “stuck” jobs that showed as “queued” or “started” with no progress; these had to be manually stopped and restarted.
Future Prevention and Process Improvement
We are prioritizing work to decouple job execution from this database. Once this is complete, a similar incident could delay the display of job output but should not impact job starts, overall execution, or job status reporting.
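The write-up doesn't say how the decoupling will be implemented; one common pattern is to have job execution hand output to a buffer or queue and let a separate consumer persist it, so a slow database delays display but never blocks the job itself. A hypothetical sketch of that pattern (all names are illustrative):

```python
import queue
import threading
import time

output_buffer = queue.Queue()  # stands in for a durable message queue


def run_job(job_id):
    """Job execution only enqueues output; it never touches the database,
    so a slow or saturated database cannot stall the job."""
    for line in ("started", "working", "done"):
        output_buffer.put((job_id, line))
    return "completed"


def db_writer(sink, stop):
    """Separate consumer persists output. If the database is slow, the
    backlog grows here, but job execution above is unaffected."""
    while not stop.is_set() or not output_buffer.empty():
        try:
            job_id, line = output_buffer.get(timeout=0.05)
        except queue.Empty:
            continue
        time.sleep(0.01)  # simulate a slow database write
        sink.append((job_id, line))


sink, stop = [], threading.Event()
writer = threading.Thread(target=db_writer, args=(sink, stop))
writer.start()
status = run_job("job-1")  # returns immediately, regardless of DB speed
stop.set()
writer.join()              # drain the remaining backlog
print(status, len(sink))   # completed 3
```

Under this design, an incident like the one above would show up as a growing output backlog rather than delayed job starts.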
Our teams are also analyzing and improving rate-limiting logic for specific inter-service APIs to prevent excessive request load.
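The post doesn't specify the rate-limiting algorithm; a token bucket is a common choice for this kind of inter-service limit, since it caps sustained request rate while allowing short bursts. A minimal sketch (rates and the fake clock are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket limiter: allows `rate` requests per second with
    bursts up to `capacity`; requests beyond that are rejected."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A fake clock makes the behavior deterministic for the example.
clock = [0.0]
bucket = TokenBucket(rate=1, capacity=2, now=lambda: clock[0])
print([bucket.allow() for _ in range(3)])  # [True, True, False]
clock[0] += 1.0                            # one second passes: one token refills
print(bucket.allow())                      # True
```

Placed in front of the endpoint that generated the excess load, a limiter like this would have shed the abnormal traffic before it reached the database.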