In contrast with the rest of our job execution fleet, we have a fixed capacity of hardware to run macOS jobs, so when that capacity is reached customers experience increased queuing and longer spin-up times for their jobs.
We had hoped to onboard the latest batch of machines before customers' jobs were affected, but supply chain issues delayed our delivery by several weeks, and so on April 20 we were running with insufficient capacity for our growing demand.
Starting at 20:10 UTC on April 20 we experienced increased demand for macOS jobs running on Mac Gen2 resources. When we were alerted to the additional queuing, the team worked to relieve the pressure on the system. We did this via four parallel tracks of action:
1. Optimizing the throughput of the existing fleet
- A pre-existing issue caused provisioning to slow down when capacity was low, which increased spin-up times for customer jobs. The team found a fix that restored spin-up times to normal levels even when capacity was low, letting us use our capacity more efficiently so that more jobs flowed through the system. Because we could enable the fix quickly, we saw some recovery on the first day and handled our peak usage a little better on the following days.
2. Adding capacity to the Gen2 VM pool from our large resource fleet
- We configured a number of the older machines in our existing fleet to run Gen2 jobs. This required deploying the Gen2 images to each host; the images are around 90 GB each, so the process takes several hours to complete. We chose the two most popular images, since those were seeing the highest load. These additional hosts were ready the next day, and we saw reduced queuing times as a result.
- This extra capacity has now been disabled because some newly provisioned Gen2 hosts (see #4) give us enough capacity to handle peak usage with room to spare. We are keeping these older machines configured in reserve so we can deploy them quickly if capacity issues recur.
3. Communicating to customers that they could move their jobs to the other macOS resource classes, large and medium, where we were not experiencing capacity issues
- Some of our customers moved to a different resource class, which reduced the load on our Gen2 resources and helped us handle our peak usage a little better (a minimal config sketch is shown below).
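For reference, switching resource class is a one-line change to the job definition in `.circleci/config.yml`. The snippet below is a minimal sketch: the job name, Xcode version, build command, and the exact resource class identifiers are illustrative assumptions rather than an authoritative list, so check the documentation for the values available to your project.

```yaml
# .circleci/config.yml (illustrative only; identifiers below are examples)
version: 2.1

jobs:
  build-and-test:
    macos:
      xcode: 13.4.1                          # example Xcode image
    # Previously on the constrained Gen2 class (identifier assumed):
    # resource_class: macos.x86.medium.gen2
    # Moved to a class that had spare capacity:
    resource_class: large
    steps:
      - checkout
      - run: xcodebuild -scheme MyApp test   # placeholder build step

workflows:
  main:
    jobs:
      - build-and-test
```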
4. Adding extra capacity to the Gen2 fleet
- Our data center provider had received a new batch of machines but had only just started configuring them to be added to the fleet. The configuration step involves changes to both the hardware and software of each host and requires hands-on access, which makes the process relatively slow. We were able to add 40% of that batch on Friday at 15:00 UTC and saw an immediate improvement in job queuing.
Now that these actions are complete, our Gen2 fleet has plenty of extra capacity at peak times and jobs are starting without delay.
We appreciate everyone’s patience during this time.