STIR and optimising GPU support (i.e. parallelproj and NiftyPET) #1239
Replies: 3 comments · 3 replies (previews truncated):

- @casperdcl, just replacing the […]
- Note that being able to expose the GPU arrays directly to Python would presumably speed up […]
- btw it's […]
STIR currently supports CUDA-enabled projectors from NiftyPET (for the mMR only, due to hard-wiring) and parallelproj, but performance is suboptimal. We're thinking about how to speed this up. As I'm more familiar with parallelproj, we can start there; that STIR code was based on @rijobro's work on NiftyPET anyway.

The current STIR strategy is to call the GPU code for all projection data (and then sort things out on subsets afterwards). This is sub-optimal as well, but let's concentrate first on how to speed up the projection itself.
An example is the forward projection. The current strategy is:

1. Copy the image into a `std::vector` (to get contiguous memory) (`STIR/src/recon_buildblock/Parallelproj_projector/ForwardProjectorByBinParallelproj.cxx`, lines 148 to 149 in 4408419).
2. Transfer the `std::vector` data to all GPUs (same file, lines 169 to 170 in 4408419).
3. Forward project into a `ProjDataInMemory` (i.e. CPU). Note that parallelproj will do the copy of the projection data from GPU to host (same file, lines 181 to 189 in 4408419).
4. Copy the `ProjDataInMemory` to the `ProjData` object that is actually given by the user (which could be in memory, but could also sit on disk, as `set_viewgram` etc. are overloaded accordingly).

Some reasons for this are:

- `stir::Array` is currently not contiguously stored. This is being addressed in "update Array hierarchy and allocate nD arrays in a contiguous block by default" #1236, which means that step 1 could be avoided.
- The projection data can be too large to fit in GPU memory, so we rely on parallelproj to do this in chunks.

Similar steps (in reverse) happen in the back-projection.
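To make the data flow concrete, here is a minimal self-contained sketch of the four steps above. All the types and the `gpu_forward_project` call are hypothetical stand-ins for the STIR and parallelproj APIs; only the copy pattern is intended to match the description.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the parallelproj call: host-to-GPU transfer,
// projection (possibly in chunks) and GPU-to-host copy all happen inside.
void gpu_forward_project(const float* img, std::size_t img_size,
                         float* proj, std::size_t proj_size)
{
  (void)img; (void)img_size;
  for (std::size_t i = 0; i < proj_size; ++i) // dummy CPU placeholder
    proj[i] = 0.f;
}

// Hypothetical stand-in for a non-contiguous stir::Array-backed image.
struct Image {
  std::vector<float> data;
  float at(std::size_t i) const { return data[i]; } // element-wise access
  std::size_t size() const { return data.size(); }
};

// Hypothetical stand-in for the user's ProjData (could sit on disk).
struct UserProjData {
  std::vector<float> data;
  void set_element(std::size_t i, float v) { data[i] = v; } // cf. set_viewgram
  std::size_t size() const { return data.size(); }
};

void forward_project_sketch(const Image& image, UserProjData& output)
{
  // Step 1: gather the (non-contiguous) image into contiguous memory.
  std::vector<float> img_vec(image.size());
  for (std::size_t i = 0; i < img_vec.size(); ++i)
    img_vec[i] = image.at(i);

  // Steps 2 and 3: GPU transfer, projection, and copy back to host into a
  // ProjDataInMemory-like contiguous buffer.
  std::vector<float> proj_mem(output.size());
  gpu_forward_project(img_vec.data(), img_vec.size(),
                      proj_mem.data(), proj_mem.size());

  // Step 4: element-wise copy into the ProjData object given by the user.
  for (std::size_t i = 0; i < proj_mem.size(); ++i)
    output.set_element(i, proj_mem[i]);
}
```

Once #1236 makes `stir::Array` contiguous, step 1 reduces to passing the array's data pointer directly.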
Some points from a discussion held with @casperdcl, @markus-jehl, @evgueni-ovtchinnikov and others on 31 Aug 2023:

- Using `cudaMalloc` in a few places might avoid the explicit transfer (of images only?) between CPU and GPU (CUDA will take care of it). One option could be to use `cudaMalloc` for all `Array`s. This could be extended to numerical operations, either by using libraries or by using `#ifdef`s. An example of this is in https://github.com/AMYPAD/NumCu/blob/main/numcu/src/elemwise.cu. Sketches of both ideas follow below this list.
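As a rough illustration of the first idea: this presumably means CUDA managed (unified) memory, since plain `cudaMalloc` still needs explicit `cudaMemcpy` calls. A minimal sketch, assuming `cudaMallocManaged` is used for an `Array`'s data block; the same pointer is then valid on host and device and the runtime migrates pages on demand:

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in for a numerical operation on an Array's data block.
__global__ void scale(float* data, std::size_t n, float factor)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= factor;
}

int main()
{
  const std::size_t n = 1 << 20;
  float* data = nullptr;

  // One allocation usable from both host and device; no explicit copies.
  cudaMallocManaged(&data, n * sizeof(float));

  for (std::size_t i = 0; i < n; ++i) // written on the host
    data[i] = 1.f;

  const unsigned blocks = static_cast<unsigned>((n + 255) / 256);
  scale<<<blocks, 256>>>(data, n, 2.f);
  cudaDeviceSynchronize();            // required before reading on the host

  std::printf("data[0] = %g\n", data[0]); // read back on the host
  cudaFree(data);
  return 0;
}
```

And a sketch of the `#ifdef` approach to numerical operations, in the spirit of the NumCu `elemwise.cu` example (the `STIR_WITH_CUDA` flag is hypothetical):

```cpp
#include <cstddef>
#ifdef STIR_WITH_CUDA // hypothetical build flag
#include <cuda_runtime.h>

__global__ void add_kernel(const float* a, const float* b, float* c,
                           std::size_t n)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    c[i] = a[i] + b[i];
}
#endif

// Element-wise a + b -> c; dispatch is decided at compile time.
void add(const float* a, const float* b, float* c, std::size_t n)
{
#ifdef STIR_WITH_CUDA
  // Assumes the pointers are device-accessible (e.g. cudaMallocManaged).
  const unsigned blocks = static_cast<unsigned>((n + 255) / 256);
  add_kernel<<<blocks, 256>>>(a, b, c, n);
  cudaDeviceSynchronize();
#else
  for (std::size_t i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
#endif
}
```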
Some anticipated difficulties:

- `cudaMalloc` will probably fail when asking for more memory than is available on the GPU. I don't know if this is a problem when allocating multiple blocks (i.e. do they all need to fit on the GPU together?). A possible guard is sketched at the end of this post.
- We could avoid the intermediate `ProjDataInMemory`, but this needs a refactor of the `ForwardProjectorByBin` class.

Comments/suggestions/PRs welcome!
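Returning to the first anticipated difficulty: a minimal sketch of a guard one could put in front of large device allocations, using the CUDA runtime call `cudaMemGetInfo`. This does not by itself answer the multiple-blocks question, since all live (non-managed) allocations have to fit in device memory together; managed memory on recent GPUs can oversubscribe, at the cost of page migration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Attempt a device allocation only if it fits in the currently free
// device memory; the caller falls back to chunked processing otherwise.
float* try_device_alloc(std::size_t n_floats)
{
  std::size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);

  const std::size_t wanted = n_floats * sizeof(float);
  if (wanted > free_bytes) {
    std::fprintf(stderr, "need %zu bytes, only %zu of %zu free: chunk\n",
                 wanted, free_bytes, total_bytes);
    return nullptr;
  }

  float* ptr = nullptr;
  if (cudaMalloc(&ptr, wanted) != cudaSuccess)
    return nullptr; // can still fail, e.g. due to fragmentation
  return ptr;
}
```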