Installation procedure fails on NHR Grete
Following the installation instructions succeeds, but the test script then fails with the error: libnccl.so.2: cannot open shared object file
The content of the output file is this:

    Submitting job with sbatch from directory: /scratch-grete/usr/gzfbklued/deep-learning-with-gpu-cores/code
    Home directory: /home/gzfbklued
    Working directory: /scratch-grete/usr/gzfbklued/deep-learning-with-gpu-cores/code
    Current node: ggpu136
    Python 3.8.19
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Fri_Jan__6_16:45:21_PST_2023
    Cuda compilation tools, release 12.0, V12.0.140
    Build cuda_12.0.r12.0/compiler.32267302_0
The full traceback from the error file can be found below:

    Traceback (most recent call last):
      File "/home/gzfbklued/.conda/envs/dl-gpu/lib/python3.8/runpy.py", line 185, in _run_module_as_main
        mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
      File "/home/gzfbklued/.conda/envs/dl-gpu/lib/python3.8/runpy.py", line 111, in _get_module_details
        __import__(pkg_name)
      File "/home/gzfbklued/.local/lib/python3.8/site-packages/torch/__init__.py", line 229, in <module>
        from torch._C import *  # noqa: F403
    ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

    Traceback (most recent call last):
      File "test.py", line 13, in <module>
        import torch
      File "/home/gzfbklued/.local/lib/python3.8/site-packages/torch/__init__.py", line 229, in <module>
        from torch._C import *  # noqa: F403
    ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
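
In case it helps with triage, here is a small diagnostic sketch that could be run on the compute node inside the same environment. It reports where `torch` would be imported from without importing it, and whether `libnccl.so.2` exists in a few likely directories. The traceback shows `torch` being loaded from `~/.local/lib/python3.8/site-packages` rather than from the `dl-gpu` conda environment, which may or may not be related. The helper names `candidate_dirs` and `find_shared_lib` are made up for this sketch, and the `torch/lib` and `nvidia/nccl/lib` subdirectories are assumptions about where a pip-installed NCCL usually ends up, not a complete list.

```python
import importlib.util
import os
import sys


def candidate_dirs():
    """Directories where libnccl.so.2 might live (assumed layout, not exhaustive)."""
    # Entries from LD_LIBRARY_PATH, which the dynamic loader also consults.
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    for entry in sys.path:
        if not entry:
            continue
        # Assumption: NCCL is either bundled with torch or shipped as a pip wheel.
        dirs.append(os.path.join(entry, "torch", "lib"))
        dirs.append(os.path.join(entry, "nvidia", "nccl", "lib"))
    return dirs


def find_shared_lib(name, search_dirs):
    """Return each existing file <dir>/<name>, in search order, without duplicates."""
    hits = []
    for d in search_dirs:
        path = os.path.join(d, name)
        if os.path.isfile(path) and path not in hits:
            hits.append(path)
    return hits


if __name__ == "__main__":
    print("python executable:", sys.executable)

    # find_spec locates the top-level package without executing torch/__init__.py.
    spec = importlib.util.find_spec("torch")
    print("torch would be imported from:", spec.origin if spec else "not found")

    hits = find_shared_lib("libnccl.so.2", candidate_dirs())
    if hits:
        for hit in hits:
            print("found:", hit)
    else:
        print("libnccl.so.2 not found in LD_LIBRARY_PATH or the assumed site-packages locations")
```

If the reported `torch` location is under `~/.local`, running the test again with the environment variable `PYTHONNOUSERSITE=1` set would show whether the user-site install is shadowing the conda environment.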