I'm trying to run a multi-node cluster (unmanaged spot) on SkyPilot. The head node is up, but provisioning of the worker node fails with a timeout error. I've attached the resources.yaml and provided the output from the sky launch command. Here's the relevant part of the output:
E 02-20 09:04:29 backend_utils.py:1354] Timed out: waited for more than 240 seconds for new workers to be provisioned, but no progress.
I'm not sure how to get past this to use the multi-node cluster. Can anyone help me troubleshoot this issue?
Kenady Inampudi
Asked on Feb 20, 2024
It looks like the version of SkyPilot you are using might be outdated. A recent fix for using machine images on GCP could resolve your issue. I recommend upgrading SkyPilot to the latest version to see if that fixes the problem. You can upgrade with the following command:
pip install -U skypilot-nightly
After upgrading, try launching your cluster again and check whether the worker nodes provision correctly. If you encounter any errors related to the provisioning model, ensure that your resources.yaml specifies use_spot: true correctly, as this is required for spot instances on GCP.
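For reference, here is a minimal sketch of how a two-node spot task file might look. The actual resources.yaml from the question is not shown in this thread, so the node count and every field other than use_spot: true are assumptions for illustration only.

```yaml
# Hypothetical resources.yaml; the asker's real file is not shown in this thread.
num_nodes: 2          # assumed: one head node plus one worker

resources:
  cloud: gcp          # the answer above refers to GCP
  use_spot: true      # request spot instances
```

If your file already looks like this and the timeout persists after upgrading, the cause is probably elsewhere, for example spot capacity in the chosen region.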