I was relaunching a stopped cluster using SkyPilot when my internet connection was interrupted. Now SkyPilot seems to be stuck in limbo, and the cluster status remains 'INIT' even after 20 minutes. I've tried to sky launch multiple times, but I keep encountering a sky.exceptions.FetchIPError. Here's the command I used and the error output:
$ sky launch -c spot-jupyter --use-spot jupyter.yaml
# ... error output ...
subprocess.CalledProcessError: Command 'ray get-head-ip '/var/folders/z1/yww43t6d0vl6qp45379h2n540000gn/T/tmpdu3ug4w9'' returned non-zero exit status 1.
# ... more error output ...
sky.exceptions.FetchIPError
How can I resolve this issue?
Chaskin Saroff
Asked on Dec 18, 2023
Zhanghao Wu from SkyPilot suggested that I try sky launch again on the cluster, as it often fixes the INIT status. However, each attempt resulted in the same FetchIPError. Zhanghao then asked whether I had checked the GCP console for preemption of the spot instance, which I had not. He also mentioned that a more robust GCP provisioner is being developed for SkyPilot, which should address many current issues, including the one I'm facing. Additionally, Zongheng Yang noted that support for stopping GCP spot instances has been added in a recent pull request that is under review.
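
The suggestions above (retry sky launch, then check for preemption) could be scripted roughly as follows. This is only a sketch, not an official SkyPilot workflow: the retry helper is a hypothetical convenience function, the attempt count and delay are arbitrary, and the commented commands assume the sky and gcloud CLIs are installed and authenticated for the right GCP project.

```shell
# Hypothetical helper (not part of SkyPilot):
#   retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  retry_attempts="$1"
  retry_delay="$2"
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $retry_i/$retry_attempts failed: $*" >&2
    # Wait before the next attempt, but not after the last one.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Example usage (left commented out; needs the sky and gcloud CLIs):
#   # Ask SkyPilot to re-query the cloud instead of trusting its cached status.
#   sky status --refresh spot-jupyter
#   # Retry the same launch command from the question a few times.
#   retry 3 60 sky launch -c spot-jupyter --use-spot jupyter.yaml
#   # List preemption events recorded by GCP for the current project.
#   gcloud compute operations list \
#     --filter="operationType=compute.instances.preempted"
```

If the gcloud listing shows a preemption event for the cluster's instance, that would be consistent with the cluster staying in INIT, and retrying the launch alone may not help until the instance is available again.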