skypilot-users

Why is my job not running in the cluster even though it's submitted?

My job starts with echo 'Starting Job', but that line never runs. I can find provision.log, sync.log, and setup.log, but there is no output from the job itself. How can I troubleshoot this?

Georgehwp

Asked on Feb 16, 2024

  • Run sky queue <cluster-name> to list the cluster's jobs and their current status.
  • Check whether the cluster has enough free resources for the job. A job whose resource request cannot be met stays in the queue and does not run until resources become available.
  • To cancel every job on the cluster and clear the queue, run sky cancel <cluster-name> --all.
  • If upgrading SkyPilot doesn't resolve the issue, restart the cluster with sky stop <cluster-name> followed by sky start <cluster-name>.
  • If the problem persists, starting from scratch with a new cluster may be a viable solution.
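
The steps above can be sketched as a small shell helper. This is only an illustration: the cluster name my-cluster, the DRY_RUN variable, and the step and troubleshoot functions are hypothetical names, not part of SkyPilot. By default the helper only prints the commands, since cancelling jobs and stopping the cluster are disruptive; set DRY_RUN=0 to actually run them.

```shell
# Placeholder cluster name (an assumption); override with CLUSTER=<your-cluster>.
CLUSTER="${CLUSTER:-my-cluster}"

# Hypothetical helper: print each command, and execute it only when DRY_RUN=0.
step() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Escalating steps, in the order suggested in the answer.
troubleshoot() {
  step sky queue "$CLUSTER"          # 1. list jobs and their status
  step sky cancel "$CLUSTER" --all   # 2. cancel all jobs, clearing the queue
  step sky stop "$CLUSTER"           # 3. restart the cluster...
  step sky start "$CLUSTER"          #    ...and bring it back up
}
```

Calling troubleshoot with the default settings prints the four commands so they can be reviewed or run one at a time. In practice, stop after the first step that fixes the problem rather than running the whole sequence.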
Feb 16, 2024 (edited)