I'm trying to scale my SkyPilot deployment to over 75 nodes, but I'm encountering an OSError: [Errno 24] Too many open files. Here's the error message I received:
E 01-19 02:58:48 provisioner.py:492] *** Failed setting up cluster. ***
... (truncated stack trace) ...
D 01-19 02:58:48 provisioner.py:493] OSError: [Errno 24] Too many open files
I've already scaled down from 100 nodes due to instance limits, but now this error is preventing me from scaling up. What does this error imply and how can I fix it?
Kishan Bhoopalam
Asked on Jan 19, 2024
The 'Too many open files' error points to a ulimit issue on your system: the limit on the number of open file descriptors per process is too low. To resolve this, raise the limit by opening a new shell and running ulimit -n 65535. Here's what I did:
ulimit -n 65535
# In the same shell, run sky launch
sky launch --num-nodes=75 --cpus 2+ --use-spot --down -c test --cloud aws
This workaround helps with the parallelism involved in launching many workers at once, which is when the open-file limit gets hit. The issue is also being tracked in the SkyPilot GitHub issues, and I've submitted a PR to update the documentation to help others who might face the same problem.
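If you drive the launch from a Python script rather than an interactive shell, the same limit can be inspected and raised with the standard library's resource module (Unix only). This is a minimal sketch, not part of SkyPilot: the helper name raise_nofile_soft_limit is hypothetical, and 65535 is simply the value from the ulimit command above. It can only raise the soft limit up to the existing hard limit.

```python
import resource


def raise_nofile_soft_limit(target=65535):
    """Raise the soft open-file limit toward `target`, capped at the hard limit.

    Hypothetical helper for illustration; returns the (soft, hard) pair
    in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # A soft limit of RLIM_INFINITY is already as high as it can be.
    if soft == resource.RLIM_INFINITY:
        return soft, hard

    # An unprivileged process cannot exceed the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some systems reject the value (e.g. a kernel-level cap);
            # keep the current limit in that case.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_nofile_soft_limit()
    print(f"open-file limit: soft={soft}, hard={hard}")
```

The new limit applies to the calling Python process and to any child processes it starts afterwards, much like running ulimit -n in the shell before sky launch. If the hard limit itself is below what you need, it has to be raised at the system level first.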