How to Resolve SkyPilot OutOfMemoryError When Provisioning Multiple Nodes?
I'm trying to use SkyPilot to provision 75 nodes using a t2.small instance on AWS, but I'm encountering an OutOfMemoryError. Here's the error message I'm getting:
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
I've checked the GitHub issue related to this, but the resolution isn't clear to me. Could the problem be with the spot controller? Below is the YAML config I'm using:
resources:
  cloud: aws
  instance_type: t2.small

workdir: .

setup: |
  echo "Running setup."

run: |
  echo "Hello, SkyPilot!"
  conda env list
How can I resolve this issue?
Kishan Bhoopalam
Asked on Jan 24, 2024
The t2.small instance, with only 2 GB of memory, is not sufficient for running Ray as part of the SkyPilot runtime, especially when provisioning 75 nodes; Ray is known to be memory-intensive. To resolve this, use a machine with more memory. For example, you can pass the --memory 4+ option to pick instances with at least 4 GB of memory, such as c6i.large on AWS. In this case, the problem was resolved after switching to a t2.medium instance (4 GB of memory). A good lower bound for memory per node when using Ray is at least 4 GB, since 2 GB has been shown to fail.
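As a minimal sketch of the adjusted config, assuming your SkyPilot version supports the memory field in resources (which accepts specs like 4+ for "at least 4 GB"), you could write something like:

resources:
  cloud: aws
  memory: 4+          # any instance type with at least 4 GB of memory
  # or pin a specific type instead:
  # instance_type: t2.medium

workdir: .

setup: |
  echo "Running setup."

run: |
  echo "Hello, SkyPilot!"
  conda env list

The --memory 4+ flag mentioned above achieves the same thing from the CLI without editing the YAML file.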