On June 30th, starting at 8:35 AM UTC, maintenance on our API token management system inadvertently increased the percentage of rate-limited API calls from a typical 1% to approximately 10%. Due to a gap in our monitoring, we were not alerted to this spike until customers began reporting issues. We declared an incident at 3:11 PM UTC and fully resolved the issue by 3:54 PM UTC. We're grateful for your patience and partnership as we addressed this.
The original status page can be found here.
On June 30th, at 8:35 AM UTC, we performed maintenance on our API token management system. This work involved changes to how token validation interacts with our rate-limiting system.
The change unintentionally led to stricter enforcement of API rate limits across all token-authenticated calls, including some that should not have been affected. As a result, the percentage of rate-limited API calls climbed from the normal baseline of around 1% to approximately 10%. The increase was not immediately visible due to a blind spot in our monitoring: while we track rate-limited calls, we had no alerting configured for sudden spikes in this metric.
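To illustrate the general class of bug, here is a minimal, hypothetical Python sketch: a token-validation change silently drops an exemption flag, and calls that previously bypassed the strict limit suddenly fall under it. All names here (`TokenInfo`, `rate_limit_exempt`, and so on) are illustrative only, not our actual implementation.

```python
from dataclasses import dataclass

@dataclass
class TokenInfo:
    token_id: str
    # Hypothetical field: some tokens bypass the strict per-token limit.
    rate_limit_exempt: bool = False

def validate_token_before(raw_token: str, store: dict) -> TokenInfo:
    # Pre-maintenance behavior: return the stored record, exemption intact.
    return store[raw_token]

def validate_token_after(raw_token: str, store: dict) -> TokenInfo:
    # Post-maintenance behavior: the record is rebuilt and the exemption
    # flag falls back to its default, so every call is now treated as
    # subject to the strict limit.
    stored = store[raw_token]
    return TokenInfo(token_id=stored.token_id)

def should_rate_limit(info: TokenInfo, calls_in_window: int, limit: int) -> bool:
    return not info.rate_limit_exempt and calls_in_window >= limit

# An exempt token is untouched before the change, but rate limited after it.
store = {"tok": TokenInfo("tok", rate_limit_exempt=True)}
assert not should_rate_limit(validate_token_before("tok", store), 100, 50)
assert should_rate_limit(validate_token_after("tok", store), 100, 50)
```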
We declared an incident at 3:11 PM UTC after receiving customer reports of an increase in the number of API calls being rate limited. On verifying those reports, we confirmed a measurable increase in our rate limiting of API calls.
From 3:29 to 3:43 PM UTC, we mitigated the issue by reviewing our rate limits and investigating the endpoints that had begun to be rate limited as a result of the token maintenance. Together, these steps brought the share of rate-limited calls back down toward the normal ratio.
We closed the incident at 3:54 PM UTC, once rate-limited API calls had fully returned to their normal ratio.
We've updated our monitoring to include alerting on sudden spikes in rate-limited API calls. This will ensure faster detection and response in the future.
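As an illustration of the kind of check this adds, here is a hedged Python sketch of a spike detector over the per-minute ratio of rate-limited calls. The window size, multiplier, and absolute floor are illustrative parameters, not our production values, and the actual alerting lives in our monitoring stack rather than in application code.

```python
import statistics
from collections import deque

def make_spike_detector(baseline_minutes: int = 60,
                        multiplier: float = 3.0,
                        absolute_floor: float = 0.03):
    """Flag sudden spikes in the rate-limited ratio against a trailing baseline."""
    history = deque(maxlen=baseline_minutes)

    def observe(ratio: float) -> bool:
        # ratio = rate-limited calls / total calls for the latest minute.
        # Alert only when the sample exceeds both a multiple of the trailing
        # baseline and an absolute floor, so a quiet baseline near zero
        # doesn't produce noisy alerts.
        spike = False
        if len(history) == history.maxlen:
            baseline = statistics.fmean(history)
            spike = ratio > max(multiplier * baseline, absolute_floor)
        history.append(ratio)
        return spike

    return observe

# Example: a jump from the ~1% baseline to ~10% triggers the alert.
detect = make_spike_detector()
for minute_ratio in [0.01] * 60 + [0.10]:
    if detect(minute_ratio):
        print(f"ALERT: rate-limited ratio spiked to {minute_ratio:.0%}")
```

Pairing the relative threshold with an absolute floor is the key design choice here: it catches a tenfold jump like this incident's without paging on small fluctuations around an already-low baseline.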