I'm having trouble submitting a managed spot job on Azure for axolotl training with SkyPilot. After several attempts, I keep getting an error saying the submission failed, either because of unexpected submission errors or because the cluster was preempted during job submission. Provisioning appears to succeed, but the VM disappears from the Azure portal shortly after the 'head node is up' log entry. Running sky launch --use-spot task.yaml works fine, but the managed spot job does not. I suspect the cause is the networking setup that SkyPilot performs so the head node can connect to the controller. Here is the error message and some logs:
Failed to successfully submit the job to the launched cluster, due to unexpected submission errors or the cluster being preempted during job submission.
(omni-tune-testy, pid=2731) I 02-07 13:22:54 recovery_strategy.py:210] The cluster is preempted before the job is submitted.
(omni-tune-testy, pid=2731) I 02-07 13:22:54 recovery_strategy.py:351] Failed to successfully submit the job to the launched cluster, due to unexpected submission errors or the cluster being preempted during job submission.
Is there a specific networking setup required for the head node to connect to the controller after the initial provisioning? And how can I debug this issue?
Eko Julianto Salim
Asked on Feb 07, 2024
It seems the issue may not be related to preemption or the networking setup. Zhanghao Wu from SkyPilot suggests that the runtime (Ray) on the Azure cluster might take longer to initialize than on other clouds, which could cause the job to switch to the INIT state and trigger the error. A fix is being worked on and will be pushed soon. In the meantime, you can keep using sky launch --use-spot task.yaml as a workaround. It also helps to share the SkyPilot version and commit, along with the YAML file (with sensitive parts redacted), to assist in debugging.
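For the redaction step, a minimal sketch of how one might mask sensitive values in the task YAML before posting it. This is not part of SkyPilot; the function name redact_yaml_text and the SENSITIVE_HINTS list are hypothetical, and the approach is a simple line-based match on key names, so it only handles single-line "key: value" entries and should be followed by a manual check.

```python
import re

# Hypothetical list of key-name fragments treated as sensitive.
# Adjust this to match the keys actually present in your task.yaml.
SENSITIVE_HINTS = ("token", "secret", "password", "key")

# Matches a single-line "key: value" entry and captures indentation,
# key name, separator, and value. Keys with no inline value
# (e.g. "envs:") do not match and are left untouched.
_KEY_VALUE = re.compile(r"^(\s*)([A-Za-z0-9_.-]+)(\s*:\s*)(\S.*)$")


def redact_yaml_text(text: str) -> str:
    """Replace the value of any sensitive-looking key with <REDACTED>."""
    redacted_lines = []
    for line in text.splitlines():
        match = _KEY_VALUE.match(line)
        if match and any(h in match.group(2).lower() for h in SENSITIVE_HINTS):
            indent, key, sep, _value = match.groups()
            line = f"{indent}{key}{sep}<REDACTED>"
        redacted_lines.append(line)
    return "\n".join(redacted_lines)


if __name__ == "__main__":
    # Assumes the file is named task.yaml, as in the command above.
    with open("task.yaml", encoding="utf-8") as f:
        print(redact_yaml_text(f.read()))
```

Because the match is on key names only, secrets embedded in multi-line run or setup commands would not be caught, so it is still worth reading the output before sharing it.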