STIR and optimising GPU support (i.e. parallelproj and NiftyPET) #1239
Replies: 3 comments · 3 replies (previews truncated):

- @casperdcl, just replacing the […]
- Note that being able to expose the GPU arrays directly to Python would presumably speed up […]
- btw it's […]
STIR currently supports CUDA-enabled projectors from NiftyPET (for the mMR only, due to hard-wiring) and parallelproj, but performance is suboptimal. We're thinking about how to speed this up. As I'm more familiar with parallelproj, we can start there; that STIR code was based on @rijobro's work on NiftyPET anyway.

The current STIR strategy is to call the GPU code for all projection data (and then sort things out on subsets afterwards). This is sub-optimal as well, but let's concentrate first on how to speed up the projection itself.
An example is the forward projection. The current strategy is:

1. Copy the image into a `std::vector` (to get contiguous memory) (`STIR/src/recon_buildblock/Parallelproj_projector/ForwardProjectorByBinParallelproj.cxx`, lines 148 to 149 in 4408419).
2. Transfer the `std::vector` data to all GPUs (same file, lines 169 to 170 in 4408419).
3. Forward project into a `ProjDataInMemory` (i.e. CPU). Note that parallelproj will do the copy of the projection data from GPU to host (same file, lines 181 to 189 in 4408419).
4. Copy the `ProjDataInMemory` to the `ProjData` object that is actually given by the user (which could be in memory, but could also sit on disk, as `set_viewgram` etc. are overloaded accordingly).

Some reasons for this are:

- `stir::Array` is currently not contiguously stored. This is being addressed in "update Array hierarchy and allocate nD arrays in a contiguous block by default" #1236, which means that step 1 could be avoided.
- The projection data can be too large to fit in GPU memory, so we rely on parallelproj to do this in chunks.

Similar steps (in reverse) happen in the back-projection.
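To make the data flow concrete, here is a minimal self-contained sketch of the four steps above. All the types and the `gpu_forward_project` call are hypothetical stand-ins for the STIR and parallelproj APIs; only the copy pattern is intended to match the description.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the parallelproj call: host-to-GPU transfer,
// projection (possibly in chunks) and GPU-to-host copy all happen inside.
void gpu_forward_project(const float* img, std::size_t img_size,
                         float* proj, std::size_t proj_size)
{
  (void)img; (void)img_size;
  for (std::size_t i = 0; i < proj_size; ++i) // dummy CPU placeholder
    proj[i] = 0.f;
}

// Hypothetical stand-in for a non-contiguous stir::Array-backed image.
struct Image {
  std::vector<float> data;
  float at(std::size_t i) const { return data[i]; } // element-wise access
  std::size_t size() const { return data.size(); }
};

// Hypothetical stand-in for the user's ProjData (could sit on disk).
struct UserProjData {
  std::vector<float> data;
  void set_element(std::size_t i, float v) { data[i] = v; } // cf. set_viewgram
  std::size_t size() const { return data.size(); }
};

void forward_project_sketch(const Image& image, UserProjData& output)
{
  // Step 1: gather the (non-contiguous) image into contiguous memory.
  std::vector<float> img_vec(image.size());
  for (std::size_t i = 0; i < img_vec.size(); ++i)
    img_vec[i] = image.at(i);

  // Steps 2 and 3: GPU transfer, projection, and copy back to host into a
  // ProjDataInMemory-like contiguous buffer.
  std::vector<float> proj_mem(output.size());
  gpu_forward_project(img_vec.data(), img_vec.size(),
                      proj_mem.data(), proj_mem.size());

  // Step 4: element-wise copy into the ProjData object given by the user.
  for (std::size_t i = 0; i < proj_mem.size(); ++i)
    output.set_element(i, proj_mem[i]);
}
```

Once #1236 makes `stir::Array` contiguous, step 1 reduces to passing the array's data pointer directly.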
Some points from a discussion held with @casperdcl, @markus-jehl, @evgueni-ovtchinnikov and others on 31 Aug 2023:

- Using `cudaMalloc` in a few places might avoid the explicit transfer (of images only?) between CPU and GPU (CUDA will take care of it). One option could be to use `cudaMalloc` for all `Array`s. This could be extended to numerical operations, either by using libraries or by using `#ifdef`s. An example of this is in https://github.com/AMYPAD/NumCu/blob/main/numcu/src/elemwise.cu. Sketches of both ideas follow below this list.
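As a rough illustration of the first idea: this presumably means CUDA managed (unified) memory, since plain `cudaMalloc` still needs explicit `cudaMemcpy` calls. A minimal sketch, assuming `cudaMallocManaged` is used for an `Array`'s data block; the same pointer is then valid on host and device and the runtime migrates pages on demand:

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in for a numerical operation on an Array's data block.
__global__ void scale(float* data, std::size_t n, float factor)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= factor;
}

int main()
{
  const std::size_t n = 1 << 20;
  float* data = nullptr;

  // One allocation usable from both host and device; no explicit copies.
  cudaMallocManaged(&data, n * sizeof(float));

  for (std::size_t i = 0; i < n; ++i) // written on the host
    data[i] = 1.f;

  const unsigned blocks = static_cast<unsigned>((n + 255) / 256);
  scale<<<blocks, 256>>>(data, n, 2.f);
  cudaDeviceSynchronize();            // required before reading on the host

  std::printf("data[0] = %g\n", data[0]); // read back on the host
  cudaFree(data);
  return 0;
}
```

And a sketch of the `#ifdef` approach to numerical operations, in the spirit of the NumCu `elemwise.cu` example (the `STIR_WITH_CUDA` flag is hypothetical):

```cpp
#include <cstddef>
#ifdef STIR_WITH_CUDA // hypothetical build flag
#include <cuda_runtime.h>

__global__ void add_kernel(const float* a, const float* b, float* c,
                           std::size_t n)
{
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    c[i] = a[i] + b[i];
}
#endif

// Element-wise a + b -> c; dispatch is decided at compile time.
void add(const float* a, const float* b, float* c, std::size_t n)
{
#ifdef STIR_WITH_CUDA
  // Assumes the pointers are device-accessible (e.g. cudaMallocManaged).
  const unsigned blocks = static_cast<unsigned>((n + 255) / 256);
  add_kernel<<<blocks, 256>>>(a, b, c, n);
  cudaDeviceSynchronize();
#else
  for (std::size_t i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
#endif
}
```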
Some anticipated difficulties:

- `cudaMalloc` will probably fail when asking for more memory than is available on the GPU. I don't know if this is a problem when allocating multiple blocks (i.e. do they all need to fit on the GPU together?). A possible guard is sketched at the end of this post.
- We could avoid the intermediate `ProjDataInMemory`, but this needs a refactor of the `ForwardProjectorByBin` class.

Comments/suggestions/PRs welcome!
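Returning to the first anticipated difficulty: a minimal sketch of a guard one could put in front of large device allocations, using the CUDA runtime call `cudaMemGetInfo`. This does not by itself answer the multiple-blocks question, since all live (non-managed) allocations have to fit in device memory together; managed memory on recent GPUs can oversubscribe, at the cost of page migration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Attempt a device allocation only if it fits in the currently free
// device memory; the caller falls back to chunked processing otherwise.
float* try_device_alloc(std::size_t n_floats)
{
  std::size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);

  const std::size_t wanted = n_floats * sizeof(float);
  if (wanted > free_bytes) {
    std::fprintf(stderr, "need %zu bytes, only %zu of %zu free: chunk\n",
                 wanted, free_bytes, total_bytes);
    return nullptr;
  }

  float* ptr = nullptr;
  if (cudaMalloc(&ptr, wanted) != cudaSuccess)
    return nullptr; // can still fail, e.g. due to fragmentation
  return ptr;
}
```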