I'm following the 'SkyPilot in 1 minute' tutorial and encountered an error related to libcudnn_cnn_train.so.8 when running a training task. The error message indicates an undefined symbol and a RuntimeError related to the backward pass in PyTorch. How can I troubleshoot and resolve this issue?
Caleb Welton
Asked on Feb 01, 2024
The issue might be due to a mismatch between the CUDA Deep Neural Network library (cuDNN) installed on the cluster and the PyTorch version. To resolve this, try downgrading PyTorch to the last version before 2.2 by running:

pip install -U torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121

Additionally, if you're reusing an existing cluster and running sky exec, you need to perform the downgrade in the run section, because the setup section is ignored with exec. If you encounter further issues, you can SSH into the machine to interact with it directly and investigate the problem.
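As a rough sketch of the suggestion to downgrade in the run section, a SkyPilot task YAML might look like the fragment below. The training command (python train.py) is a hypothetical placeholder, not the tutorial's actual command; substitute whatever your task already runs.

```yaml
# Sketch only: pin PyTorch inside `run`, because `setup` is skipped
# when the task is submitted to an existing cluster with `sky exec`.
run: |
  # Downgrade to the last PyTorch release before 2.2 (CUDA 12.1 wheels).
  pip install -U torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
  # Hypothetical placeholder for the tutorial's training command.
  python train.py
```

The pip step then runs on every execution, which costs a little time on each submission. Once the cluster has the pinned versions, or if you launch a fresh cluster, the same install line can live in the setup section instead.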