Optimisation of array/image/projdata algebra #1545

KrisThielemans (Maintainer) · 2024-11-22T12:21:55Z

Currently, we have simple loops for numerical operations, e.g.

STIR/src/include/stir/VectorWithOffset.inl

Lines 692 to 693 in 2eb11a9

    for (int i = v.get_min_index(); i <= v.get_max_index(); i++)
      num[i] += v.num[i];

A few Array operations were recently parallelised, e.g.

STIR/src/include/stir/Array.inl

Lines 335 to 341 in 2eb11a9

    #ifdef STIR_OPENMP
    #  if _OPENMP >= 201107
    #    pragma omp parallel for reduction(+ : acc)
    #  endif
    #endif
      for (int i = this->get_min_index(); i <= this->get_max_index(); i++)
        acc += this->num[i].sum();

There are multiple possible steps here:

- OpenMP parallelisation and/or use of SIMD for loops, see e.g. https://stackoverflow.com/questions/48711367/how-to-use-omp-parallel-for-and-omp-simd-together and https://stackoverflow.com/questions/61154047/pragma-omp-for-simd-does-not-generate-vector-instructions-in-gcc
- Use of an external library, e.g. via a BLAS interface
- Integration with CUDA via managed pointers etc., which will likely be based on https://github.com/AMYPAD/cuvec
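As a starting point for the first item, here is a minimal sketch of an element-wise `+=` using the combined `parallel for simd` construct (OpenMP 4.0 and later). It works on a plain `std::vector<float>` rather than STIR's `VectorWithOffset`, and `add_in_place` is a hypothetical name used only for illustration. Whether the compiler actually emits vector instructions depends on the compiler and flags, as the second Stack Overflow link discusses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an element-wise "+=" on a flat buffer.
// Not STIR code: the name and the use of std::vector are assumptions.
inline void add_in_place(std::vector<float>& out, const std::vector<float>& in)
{
  assert(out.size() == in.size());
  float* const o = out.data();
  const float* const v = in.data();
  const long n = static_cast<long>(out.size());
  // Without OpenMP (or with a version older than 4.0) the pragma is
  // skipped and this is the same simple serial loop as before.
#if defined(_OPENMP) && _OPENMP >= 201307
#  pragma omp parallel for simd
#endif
  for (long i = 0; i < n; ++i)
    o[i] += v[i];
}
```

For small arrays the cost of starting a parallel region can outweigh the gain, so a size threshold (or an `if` clause on the pragma) might be worth testing.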

I thought I'd create this Discussion to get some ideas/experiences together.

KrisThielemans (Maintainer, Author) · 2024-11-25T15:05:36Z

Note that the stir_timings utility is useful to test some of this, although it could do with extra tests. Check the source to see what is being timed. On my VM (running on my HP FireFly laptop), I get for instance the following for mMR data:

stir_timings  --skip-PMRT 1 --skip-priors 1 --skip-PP 1 --template-projdata sinospan11_f1g1d0b0.hs
Using 5 threads.
	copy_image                      	                  17.500	                  17.490
	copy_add_image                  	                  27.167	                  27.178
	copy_mult_image                 	                  28.000	                  27.959
	create_vector_of_size_projdata  	                  30.000	                1160.625
	copy_std_vector_of_size_projdata	                  28.333	                  28.138
	create_proj_data_in_mem_no_init 	                  86.667	                 662.662
	create_proj_data_in_mem_init    	                  86.667	                 228.744
	copy_proj_data_mem_to_mem       	                  28.333	                  29.436
	create_copy_proj_data_mem_to_mem	                 131.667	                 245.869
	create_copy_proj_data_mem_to_file	                 595.000	                1247.209
	create_copy_proj_data_file_to_mem	                 613.333	                 965.206
	create_copy_proj_data_file_to_file	                 403.333	                 886.731
	copy_add_proj_data_mem          	                 186.667	                 319.918
	copy_mult_proj_data_mem         	                 195.000	                 312.258

Timings are reported to stdout as:

name	timing_name	CPU_time_in_ms	wall-clock_time_in_ms

Some of the code is multi-threaded, in which case the wall-clock time will/should be lower than the CPU time (which sums time over all threads).
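To illustrate the difference between the two columns, here is a small sketch that measures both for an arbitrary callable. It is not how stir_timings is implemented; `Timings` and `time_call` are hypothetical names. It assumes a POSIX-like platform, where `std::clock` reports processor time summed over all threads of the process (some other platforms, notably MSVC, report elapsed time instead).

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper types, for illustration only.
struct Timings
{
  double cpu_ms;  // processor time used by the process
  double wall_ms; // elapsed real time
};

template <class F>
Timings time_call(F&& f)
{
  const std::clock_t c0 = std::clock();
  const auto w0 = std::chrono::steady_clock::now();
  f();
  const auto w1 = std::chrono::steady_clock::now();
  const std::clock_t c1 = std::clock();

  Timings t;
  t.cpu_ms = 1000.0 * static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
  t.wall_ms = std::chrono::duration<double, std::milli>(w1 - w0).count();
  assert(t.wall_ms >= 0.0); // steady_clock is monotonic
  return t;
}
```

With a well-parallelised callable one would expect `cpu_ms` to exceed `wall_ms`; if the callable mostly waits (I/O, page faults handled by the OS), the opposite can happen.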

Note, though, that the wall-clock time includes "system time" and "waiting time", for instance time spent on memory allocation. Presumably, that explains why the "create_proj_data_in_mem_no_init" wall-clock time is longer than the "init" one even though, in the current code, the two do exactly the same thing. My guess is that at the first call the OS spends time allocating the memory, while a "delete" doesn't return the memory to the OS but keeps it for the process, so that allocation is a lot faster in subsequent calls. (This could be tested by swapping those 2 tests around, and by reading up on memory allocation!)
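A quick way to probe this guess outside STIR would be something like the sketch below, which times "allocate, touch, free" for a large buffer. `alloc_touch_free_ms` is a hypothetical helper, not part of stir_timings, and the result will depend heavily on the allocator and OS; an optimising compiler is also allowed to remove allocations in some cases, so check that the numbers look plausible.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>

// Hypothetical helper: wall-clock milliseconds needed to allocate n floats,
// write to every element (so the pages are really mapped), and free them.
inline double alloc_touch_free_ms(std::size_t n)
{
  assert(n > 0);
  const auto t0 = std::chrono::steady_clock::now();
  {
    std::unique_ptr<float[]> p(new float[n]); // uninitialised memory
    for (std::size_t i = 0; i < n; ++i)
      p[i] = 1.0f;
    volatile float sink = p[n - 1]; // discourage removal of the loop
    (void)sink;
  } // memory is released here
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling it twice in a row with the same (large) `n` and comparing the two results would show whether the first allocation is the expensive one on a given system.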
