Support alerted us that multiple customers' builds were being queued for long periods. SRE investigated and found that our Trusty fleet was unbalanced, which meant that some jobs were starved of resources even though our monitoring system showed plenty of capacity.
The root cause was determined to be an alert that triggered during a short period when our on-call coverage was unavailable and that remained triggered even after full on-call coverage was restored.
We are going to review our on-call handoff process to prevent this from happening again.