Hamza Tahir
Asked on Oct 26, 2023
The issue you're experiencing, where SkyPilot gets stuck in an infinite loop when reusing a cluster on AWS, is likely caused by a missing permission for the `ec2:DescribeInstances` action. The error occurs when the Ray cluster on the remote VM makes this call and AWS rejects it as an unauthorized operation. To fix this, make sure the AWS identity in use has the necessary permissions: check your IAM user's permissions and confirm that it is allowed `ec2:DescribeInstances`.

You can also try the latest `skypilot-nightly` version, which includes a new provisioner that may resolve the issue.

If you're using a different IAM user or profile on your client, you'll need to ensure that the VM has access to the correct AWS credentials. One option is to upload your local static AWS credentials to the VM and let the service on the VM use them. If you're using an AWS profile, you may need to set the `AWS_PROFILE` environment variable on the VM through the SkyPilot configuration. Please refer to the SkyPilot documentation for more details.
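A quick way to confirm whether the `ec2:DescribeInstances` permission is the problem is to ask EC2 directly with a dry run. The sketch below assumes boto3 is installed and that you run it with the same credentials the cluster uses (for example, on the head node). The helper `can_describe_instances` is a hypothetical name, not part of SkyPilot. It relies on EC2's `DryRun` flag, which answers with a `DryRunOperation` error when the call would be allowed and `UnauthorizedOperation` when it would not.

```python
from typing import Any


def can_describe_instances(ec2_client: Any) -> bool:
    """Return True if ec2:DescribeInstances is authorized, False if not.

    Hypothetical helper. With DryRun=True, EC2 never performs the call; it
    replies with a DryRunOperation error if the caller has the permission
    and an UnauthorizedOperation error if it does not.
    """
    try:
        ec2_client.describe_instances(DryRun=True)
    except Exception as exc:  # botocore.exceptions.ClientError in practice
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code")
        if code == "DryRunOperation":
            return True
        if code == "UnauthorizedOperation":
            return False
        raise  # some other failure: network, bad region, expired token, ...
    # A dry run is not expected to succeed; treat success as "allowed".
    return True


# Usage (requires boto3 and resolvable credentials, e.g. via AWS_PROFILE):
#   import boto3
#   print(can_describe_instances(boto3.client("ec2")))
```

Catching the generic `Exception` keeps the helper importable without botocore; in real code you would catch `botocore.exceptions.ClientError` instead.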
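To see which static keys a given profile would supply, and therefore what uploading "static credentials" to the VM actually involves, here is a minimal sketch using only the Python standard library. It assumes the usual INI layout of the AWS shared credentials file (`~/.aws/credentials`) and picks the profile roughly the way the AWS CLI and SDKs do: an explicit name, else `AWS_PROFILE`, else `default`. The function `static_keys_for_profile` is hypothetical; a profile that only references a role (for example via `role_arn` and `source_profile`) has no static keys and returns `None`.

```python
import configparser
import os
from typing import Dict, Optional


def static_keys_for_profile(credentials_text: str,
                            profile: Optional[str] = None) -> Optional[Dict[str, str]]:
    """Return the static key pair for `profile`, or None if it has none.

    Hypothetical helper. The profile defaults to $AWS_PROFILE, then "default".
    """
    profile = profile or os.environ.get("AWS_PROFILE") or "default"
    # interpolation=None so '%' characters in values are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return None
    section = parser[profile]
    if "aws_access_key_id" not in section or "aws_secret_access_key" not in section:
        # e.g. a role-based profile that stores no static keys
        return None
    keys = {
        "aws_access_key_id": section["aws_access_key_id"],
        "aws_secret_access_key": section["aws_secret_access_key"],
    }
    if "aws_session_token" in section:
        keys["aws_session_token"] = section["aws_session_token"]
    return keys
```

To inspect your local file, pass `Path("~/.aws/credentials").expanduser().read_text()` (from `pathlib`) as `credentials_text`. If the profile you rely on returns `None`, the VM cannot be given static keys for it, and setting `AWS_PROFILE` alone will not help unless the referenced role is also reachable from the VM.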