On May 17, between 16:35 and 23:47 UTC, we experienced an outage that left Workflows unavailable or degraded for the majority of customers.
We identified the root cause as hitting scale limits on our database service, which increased the wait time to run queries. As a result, our backend could not keep up with the growing backlog, and customers experienced long waits to run Workflows. We addressed the resource limits immediately by increasing database capacity, a process that took 2 hours and 20 minutes; once it completed, we were able to restore normal operations. Our build job and Machine execution fleets were scaled up to handle the higher volume, while our macOS fleet remained degraded for longer due to limited capacity.
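The dynamic at the heart of this incident, a backlog that grows without bound whenever incoming work outpaces what the database can serve, and only drains once capacity exceeds the arrival rate again, can be sketched with a toy queue model. The numbers below are hypothetical and chosen only to illustrate the shape of the incident, not our actual traffic:

```python
def backlog_over_time(arrival_rate, service_rates):
    """Track queue depth per interval: the backlog grows whenever
    arrivals outpace what can be served, and shrinks (never below
    zero) once capacity exceeds the arrival rate again."""
    backlog = 0
    history = []
    for capacity in service_rates:
        backlog = max(0, backlog + arrival_rate - capacity)
        history.append(backlog)
    return history

# Hypothetical scenario: 100 queries/min arriving; effective capacity
# drops to 60/min at the scale limit, then rises to 150/min after
# the database is scaled up.
rates = [120] * 3 + [60] * 5 + [150] * 5
print(backlog_over_time(100, rates))
# → [0, 0, 0, 40, 80, 120, 160, 200, 150, 100, 50, 0, 0]
```

Note that recovery is not instant even after capacity is restored: the accumulated backlog still has to drain, which is why customers saw long waits for some time after the database scale-up completed.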
In the weeks since, we have taken several measures to prevent this failure from recurring: