How to give Ray access to all cluster resources when using SkyPilot and Ray Tune for HPO?
I'm using SkyPilot and Ray Tune for hyperparameter optimization (HPO), and I'm encountering an issue where Ray is not recognizing all the resources on my cluster. Specifically, I have a cluster with 2 nodes, each with 4xV100 GPUs, but when I run ray.cluster_resources(), it only shows 1 node with 4xV100 GPUs. I'm following the example provided in the SkyPilot documentation and using ray.init() on the head node, but it doesn't seem to work. I'm also using Ray version 2.9.2 within a conda environment that I create and activate at setup time. Here's the output from ray.cluster_resources():
{
'CPU': 32.0,
'memory': 168859253351.0,
'node:172.31.18.251': 1.0,
'object_store_memory': 76653965721.0,
'GPU': 4.0,
'accelerator_type:V100': 1.0
}
How can I ensure that Ray has access to all the resources on my cluster?
Jason Krone
Asked on Feb 20, 2024
I was having trouble getting Ray to recognize all the resources on my cluster when using SkyPilot and Ray Tune for HPO. After discussing with Zongheng Yang from SkyPilot, it turned out that I needed to do two things: make sure the worker node had joined my custom Ray cluster, and use a compatible version of Ray. I had been using Ray version 2.9.2, which was causing compatibility issues. Downgrading to Ray version 2.8.1 and passing the correct Ray address (with my custom port) to ray.init(..) resolved the issue. Now, when I run the ray status command and ray.cluster_resources(), Ray correctly shows all the nodes and resources available in the cluster.
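For anyone hitting the same thing, here is a minimal sketch of the check I'd run on the head node after the worker has joined. The address "10.0.0.1:6380" and the helper names summarize_nodes and check_cluster are hypothetical placeholders, not anything from the SkyPilot docs; substitute the head node IP and the custom port your own Ray cluster was started with.

```python
def summarize_nodes(nodes):
    """Return (alive_node_count, total_gpus) for a ray.nodes()-style list of dicts."""
    alive = [n for n in nodes if n.get("Alive")]
    total_gpus = sum(n.get("Resources", {}).get("GPU", 0) for n in alive)
    return len(alive), total_gpus


def check_cluster(address, expected_nodes):
    """Connect to an already-running Ray cluster and verify all nodes have joined."""
    import ray  # imported here so summarize_nodes can be used without Ray installed

    # Passing an explicit "ip:port" address attaches to the existing cluster
    # at that address instead of starting a new local Ray instance.
    ray.init(address=address)
    alive, total_gpus = summarize_nodes(ray.nodes())
    print(f"alive nodes: {alive}, GPUs: {total_gpus}")
    print(ray.cluster_resources())
    if alive < expected_nodes:
        raise RuntimeError(
            f"only {alive} of {expected_nodes} nodes joined; "
            "check that the workers were started against the same address and port"
        )


# Example call (hypothetical address, replace with your head node IP and custom port):
# check_cluster("10.0.0.1:6380", expected_nodes=2)
```

If the check fails with only one node alive, the worker most likely never joined the custom cluster, which was the first half of my problem.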