skypilot-users

How to Resolve SkyPilot's INIT Status and FetchIPError After Internet Disconnection?

I was relaunching a stopped cluster with SkyPilot when my internet connection dropped. The cluster now appears stuck: its status is still 'INIT' after 20 minutes. I've rerun sky launch several times, but each attempt fails with sky.exceptions.FetchIPError. Here is the command I used and the error output:

$ sky launch -c spot-jupyter --use-spot jupyter.yaml
# ... error output ...
subprocess.CalledProcessError: Command 'ray get-head-ip '/var/folders/z1/yww43t6d0vl6qp45379h2n540000gn/T/tmpdu3ug4w9'' returned non-zero exit status 1.
# ... more error output ...
sky.exceptions.FetchIPError

How can I resolve this issue?


Chaskin Saroff

Asked on Dec 18, 2023

Zhanghao Wu from SkyPilot suggested running sky launch again on the cluster, since that often clears the INIT status. When I tried this, it failed with the same FetchIPError each time. Zhanghao then asked whether I had checked the GCP console for preemption, which I had not. He also mentioned that a more robust GCP provisioner is being developed for SkyPilot, which should address many current issues, including this one. Additionally, Zongheng Yang noted that support for stopping GCP spot instances has been added in a pull request currently under review.
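
For anyone hitting the same state, the checks discussed in this thread could be scripted roughly as follows. This is a sketch, not an official recovery procedure. The cluster name and YAML file come from the question. The `run` helper, the `DRY_RUN` variable, and the `recover_cluster` function are hypothetical names introduced here. Refreshing the status first and using the gcloud CLI in place of the GCP console are assumptions, not steps anyone in the thread recommended. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only. CLUSTER and jupyter.yaml are taken from the question above;
# DRY_RUN, run, and recover_cluster are hypothetical helpers for illustration.
CLUSTER="${CLUSTER:-spot-jupyter}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print each command without executing it

run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

recover_cluster() {
  # 1. Assumption: ask SkyPilot to re-query the cloud instead of relying on
  #    its cached cluster status.
  run sky status --refresh

  # 2. CLI alternative to checking the GCP console: list preemption events
  #    recorded for the current project.
  run gcloud compute operations list \
    --filter="operationType=compute.instances.preempted"

  # 3. Retry the launch, as suggested in the thread.
  run sky launch -c "$CLUSTER" --use-spot jupyter.yaml
}
```

Calling `recover_cluster` as-is prints the three commands. Setting `DRY_RUN=0` before the call executes them, which requires the `sky` and `gcloud` CLIs to be installed and authenticated.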
