skypilot-users

How to Resolve SkyPilot OutOfMemoryError When Provisioning Multiple Nodes?

I'm trying to use SkyPilot to provision 75 nodes using a t2.small instance on AWS, but I'm encountering an OutOfMemoryError. Here's the error message I'm getting:

ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

I've checked the GitHub issue related to this, but the resolution isn't clear to me. Could the problem be with the spot controller? Below is the YAML config I'm using:

resources:
  cloud: aws
  instance_type: t2.small
num_nodes: 75
workdir: .
setup: |
  echo "Running setup."
run: |
  echo "Hello, SkyPilot!"
  conda env list

How can I resolve this issue?


Kishan Bhoopalam

Asked on Jan 24, 2024

A t2.small instance has only 2 GB of memory, which is not enough to run Ray, the distributed runtime that SkyPilot uses under the hood, especially when provisioning 75 nodes; Ray is fairly memory-intensive. To resolve this, use instances with more memory. For example, you can pass the --memory 4+ option to let SkyPilot pick instances with at least 4 GB of memory, such as c6i.large on AWS. Switching to a t2.medium (4 GB) also resolved the problem. A reasonable lower bound when running Ray is at least 4 GB of memory per node, since 2 GB has been shown to fail.
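
For example, here is a minimal sketch of the adjusted resources section (assuming the rest of your YAML stays the same, and using task.yaml purely as an illustrative file name):

resources:
  cloud: aws
  memory: 4+   # any instance type with at least 4 GB, e.g. c6i.large or t2.medium

Equivalently, you can override the memory requirement from the CLI without editing the YAML:

sky launch --memory 4+ task.yaml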

Jan 25, 2024