skypilot-users

Why is my spot job stuck at 'Starting' on AWS?

I am having a hard time running a spot job of any kind on AWS. It always appears to get stuck at 'Starting'. Has anyone else experienced this, or does anyone know what the issue might be?

Kishan Bhoopalam

Asked on Jan 12, 2024

There could be several reasons why your spot job is stuck at 'Starting' on AWS. Here are some possible causes and solutions:

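  1. Quota Limit: Check that you have sufficient Spot Instance quota in the region where you are trying to run the job. Spot quotas are tracked separately from On-Demand quotas, so having On-Demand capacity does not guarantee a spot launch will succeed. If the quota is too low, reach out to your AWS account manager to get additional quota approved.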

  2. Permission Issue: The error message you provided suggests a permission issue with creating the service-linked role for EC2 Spot Instances. Make sure your IAM policy grants the necessary permissions. The SkyPilot documentation describes the minimal policy setup required.

  3. Missing Permissions: Double-check that your IAM policy includes all the required permissions. Specifically, make sure the following actions and resources are included:

    • iam:GetRole, iam:PassRole, iam:CreateRole, iam:AttachRolePolicy for the role skypilot-v1
    • iam:GetInstanceProfile, iam:CreateInstanceProfile, iam:AddRoleToInstanceProfile for the instance profile skypilot-v1
    • iam:CreateServiceLinkedRole with the condition iam:AWSServiceName equal to spot.amazonaws.com

  4. Debugging: Run sky launch <spot job yaml> from your laptop to launch the same task directly. If that works but the managed spot job still hangs at 'Starting', the spot controller is likely missing permissions.
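For reference, here is a minimal sketch of the IAM statements listed under item 3, built as a Python dict and printed as JSON so you can compare it against your own policy. The account ID 123456789012 and the helper name build_spot_iam_statements are placeholders made up for illustration, and the grouping into three statements is just one way to lay it out. It covers only the IAM actions named above, not the EC2 permissions SkyPilot also needs, so treat the SkyPilot documentation as the source of truth.

```python
import json

# Name of the role and instance profile mentioned in the answer above.
SKYPILOT_NAME = "skypilot-v1"


def build_spot_iam_statements(account_id: str) -> dict:
    """Return an IAM policy document with only the IAM actions listed above.

    `account_id` is your 12-digit AWS account ID (a placeholder is used below).
    """
    role_arn = f"arn:aws:iam::{account_id}:role/{SKYPILOT_NAME}"
    profile_arn = f"arn:aws:iam::{account_id}:instance-profile/{SKYPILOT_NAME}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Actions on the skypilot-v1 role.
                "Effect": "Allow",
                "Action": [
                    "iam:GetRole",
                    "iam:PassRole",
                    "iam:CreateRole",
                    "iam:AttachRolePolicy",
                ],
                "Resource": role_arn,
            },
            {
                # Actions on the skypilot-v1 instance profile.
                "Effect": "Allow",
                "Action": [
                    "iam:GetInstanceProfile",
                    "iam:CreateInstanceProfile",
                    "iam:AddRoleToInstanceProfile",
                ],
                "Resource": profile_arn,
            },
            {
                # Lets EC2 Spot create its service-linked role on first use.
                "Effect": "Allow",
                "Action": "iam:CreateServiceLinkedRole",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"iam:AWSServiceName": "spot.amazonaws.com"}
                },
            },
        ],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID, not a real one.
    print(json.dumps(build_spot_iam_statements("123456789012"), indent=2))
```

If the third statement (or its condition) is missing from your policy, the first spot request in an account can fail because the service-linked role for EC2 Spot cannot be created.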

By checking these possible causes and solutions, you should be able to troubleshoot and resolve the issue with your spot job on AWS.

Jan 12, 2024