GPU & CPU singularity: Potential Pitfalls building/using images #189
gregorex333 asked this question in General (unanswered)
Replies: 1 comment
-
Thank you for sharing this useful information with all users, and excellent job building the GPU version of the container!
-
I had some issues running the main tvb-ukbb singularity image available at the time, so I tried to build my own images with both CPU-only and GPU support. This ran into some errors that were difficult to solve, so I am noting them here in case anyone runs into similar problems and is looking for potential solutions.
Singularity builder failing to mount
If building from a definition file cannot get started due to a failure to mount, the problem can be the compiler. In my case, the newest Go 1.22.0 compiler introduced a bug making it incompatible with my Ubuntu 20.04 system. Uninstalling Go and rebuilding Singularity with Go 1.21.7 from https://go.dev/dl/ solved the issue, so changing the compiler version can help.
TVB-pipeline script issues
Nibabel has at times introduced errors by deprecating functions, and the current pipeline scripts might not be adapted to those changes.
A) get_header() / get_affine() changed to .header / .affine
The get_* accessor functions were removed and replaced with the simpler .header and .affine properties. bb_file_manager.py and bb_mask_negatives_4D.py were examples of the old usage at the time. You can run a terminal command like grep -r "get_header" * to find similar calls if needed.
dim.append(epi_img.get_header()["dim"][4])
This line appeared twice in bb_file_manager.py, for example, and needed to be changed to:
dim.append(epi_img.header["dim"][4])
B) get_data() finally deprecated fully causing error
"nibabel.deprecator.ExpiredDeprecationError: get_data() is deprecated in favor of get_fdata(), which has a more predictable return type."
This change had been three years in the making, but it had finally taken effect.
This required editing the "tvb_createDTImasks.py" script in the diffusion pipeline around line 58, substituting get_data() with get_fdata() at two points:
Interface_roied_data = Interface_roied_img.get_fdata()
Interface_inDTI_data = Interface_inDTI_img.get_fdata()
This substitution will create connectivity, though I'm not yet sure whether its output is identical to that of the other possible substitution:
Interface_roied_data = np.asanyarray(Interface_roied_img.dataobj)
Interface_inDTI_data = np.asanyarray(Interface_inDTI_img.dataobj)
C) Local program installations into /home may interfere
Because singularity run mounts your home directory, the container may use other local installations instead of the ones inside the image. This can include FSL, or even the Python version installed inside FSL (my case)! This is important: if you see "torch module not found", it can be caused by a Python version above 3.7 being picked up, which is not compatible with pytorch < 2.0 and potentially other dependencies that the "init_vars" environment loads.
Adding "sudo" before singularity run can fix this problem if you hit it, or you can remove/rename the FSL folder to avoid the link to its Python if that Python is above 3.7.
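To check whether a host installation is leaking in, a quick stdlib-only diagnostic (my own snippet, not part of the pipeline) run inside the container shows which interpreter and which `python` binary actually resolve:

```python
# Print the interpreter actually executing this script, the `python`
# first on PATH, and its version, to spot a /home installation shadowing
# the one inside the image.
import shutil
import sys

print("running interpreter:", sys.executable)
print("python on PATH:", shutil.which("python"))
print("version: %d.%d" % sys.version_info[:2])
```

If the paths point into your home directory (e.g. an FSL folder) rather than into the image, you have found the culprit.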
*** Script/Python version issues on a custom image
If you are using a custom Singularity image like I was, though, the FSL inside the image at "/opt/soft/env" may itself hold a Python version > 3.7, courtesy of fslinstaller.py. In this case, you will need to either recreate the image without that Python OR (a somewhat hacky solution) change all calls to "python" in the scripts to "python3.7" so that dependencies like torch will load. This is a little tedious, but it works.
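The hacky fix above could be scripted roughly like this. This is a hedged sketch: the regex and the idea of rewriting scripts in place are my assumptions, and the directory name is hypothetical, so review each match before writing anything back.

```python
import pathlib
import re

def pin_python(text: str) -> str:
    # replace the bare word `python` (not already followed by a version
    # digit, so `python3` is left alone) with `python3.7`
    return re.sub(r"\bpython\b(?!\d)", "python3.7", text)

print(pin_python("python bb_file_manager.py"))  # python3.7 bb_file_manager.py
print(pin_python("python3 foo.py"))             # unchanged: python3 foo.py

# applying it across a (hypothetical) scripts directory:
# for script in pathlib.Path("bb_pipeline_tools").rglob("*.py"):
#     script.write_text(pin_python(script.read_text()))
```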
A) "probtrackx2_gpu: error while loading shared libraries: libcudart.so.10.2: cannot open shared object file: No such file or directory"
If you see this message, the CUDA toolkit or driver installed in your image differs too much from the one installed on your (local?) workstation. This happened to me with cudatoolkit 10.2 in the image and a 12.x toolkit and driver on the system, so I downgraded the system to 11.x.
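A small diagnostic for this situation (the helper name is mine, not part of the pipeline): try to load the exact runtime library the binary is asking for, which is what probtrackx2_gpu fails to do above.

```python
import ctypes

def cuda_runtime_available(soname: str) -> bool:
    """Return True if the given shared library can be dlopen'd."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False

print(cuda_runtime_available("libcudart.so.10.2"))
```

If this prints False on the host but the image was built against 10.2, the mismatch is confirmed.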
B) "terminate called without an active exception" in the diffusion pipeline using the GPU
If this occurs after bedpostx finishes in the "bb_diffusion_pipeline" log, check the "tvb_probtrackx" log to see if you find:
"...................Allocated GPU 0...................
Device memory available (MB): 1812 ---- Total device memory(MB): 1993
Memory required for allocating data (MB): 2491
Not enough Memory available on device. Exiting ..."
This means your VRAM is insufficient for the current code to process the data, and you need a GPU with more memory.
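Restating the numbers from the log above as the check the run effectively performs:

```python
# Figures copied from the tvb_probtrackx log above.
available_mb = 1812   # "Device memory available (MB)"
total_mb = 1993       # "Total device memory (MB)"
required_mb = 2491    # "Memory required for allocating data (MB)"

# The run can only proceed if this holds; here it does not, hence
# "Not enough Memory available on device. Exiting".
print(required_mb <= available_mb)  # False
```

Note that even the card's total memory (1993 MB) falls short of the 2491 MB required, so freeing VRAM would not have helped; only a larger GPU would.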
C) Incompatible dependencies in init_vars environment
If you want to create a GPU image, you may have to adjust the env_sing.yml environment file in the tvb-pipeline installer, which is called at the end of init_vars.
https://pytorch.org/get-started/previous-versions/
Most important is ensuring that GPU-enabled versions of pytorch, torchaudio and torchvision are requested, and that cpuonly is removed from the environment. I am attaching one potential environment, though the link above lists compatible CUDA/PyTorch combinations. Note that there is no pytorch-cuda for pytorch 1.8; not every version has a matching pytorch-cuda.
GPU Version 1 (what I used)
python=3.7
pytorch==1.13.1
torchvision==0.14.1
torchaudio==0.13.1
pytorch-cuda=11.7 / 11.6
(^ remove cudatoolkit from the conda environment)
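For reference, the "GPU Version 1" pins above could look like this as an env_sing.yml dependency block. This is a hedged sketch: the channel names and file layout are my assumptions (they match the official PyTorch 1.13.1 conda instructions), so merge it into your existing env_sing.yml rather than replacing the whole file.

```yaml
# Sketch of the GPU-enabled pins for env_sing.yml (GPU Version 1 above).
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.7
  - pytorch==1.13.1
  - torchvision==0.14.1
  - torchaudio==0.13.1
  - pytorch-cuda=11.7   # or 11.6
  # remove any `cpuonly` and `cudatoolkit` entries if present
```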
GPU Version 2 (useful for lower versions of cuda toolkit/drivers)
python=3.7
pytorch==1.12.1
torchvision==0.13.1
torchaudio==0.12.1
cudatoolkit=10.2 / 11.3 / 11.6
These are some potential problems and solutions. I am fairly confident my GPU image will work after I upgrade the GPU soon; if not, there will be another post here :)
Below is an example environment .yml file for a CUDA-enabled GPU image using CUDA 11.x. PyTorch >= 2.0 is not supported, as it requires Python >= 3.8, which may not allow all the dependencies to resolve together (gradunwarp had trouble finding a match for me, I think). PyTorch also has Python version restrictions, as I've mentioned.
env_sing.txt