Release 0.5.0
PtyPy 0.5 release notes
We're excited to bring you a new release, with new engines, GPU accelerations and many smaller improvements.
Engine Updates
New abstraction layer for most engines, new engines.
- generalised projectional engine with derived engines DM, RAAR
- generalised stochastic engine with derived engines EPIE, SDR
Engines that are based on global projections now all derive from a generalized base engine that is able to express most common projection algorithms with 4 scalar parameters. DM and RAAR are two such derived classes. Similarly, algorithms based on a stochastic sequence of local projections (SDR, EPIE) now inherit from a common base engine.
GPU acceleration
- GPU-acceleration for all major engines DM, ML, EPIE, SDR, RAAR
- accelerated engines needs to be imported explicitly with
import ptypy ptypy.load_gpu_engines('cuda')
We accelerated three engines (projectional, stochastic and ML) using the PyCUDA
and Reikna
library and a whole collection of custom kernels.
All GPU engines leverage a "streaming" model which means that the primary locations of all objects are on the host (CPU) memory. Diffraction data arrays and all other arrys that scale linearly with the number of shifts/positons are segmented into blocks (of frames). The idea is that these blocks are moved on and off the device (GPU) during engine iteration if the GPU does not have enough memory to store all blocks. The number of frames per block can be adjusted with the new top-level [frames_per_block
] https://ptycho.github.io/ptypy/rst/parameters.html#ptycho.frames_per_block) parameter. This parameter has little influence for smaller problem size, but needs to be adjusted if your GPU has too little memory to fit even a single block.
Each engine iteration will cycle through all blocks, DM needs to even cycle once for each projection. We therefore recommend to make the block size small enough such that at least a couple of blocks fit on the GPU to hide the latency of data transfers. For best performance, we employ a mirror scheme such that each cycle reverses the block order and reduces the host to device copies (and vice versa) to the absolute minimum.
GPU engines work in parallel when each MPI rank takes one GPU. For sending data between ranks, PtyPy will perform a host copy first in most cases or use whatever the underlying MPI implementations does for CUDA-aware MPI (only tested for OpenMPI). Unfortunately, this mapping of one rank per GPU will leave CPU cores idle if there are more cores on the system than GPUs.
Within a node, PtyPy can use nccl (requires a CuPy install and setting PTYPY_USE_NCCL=1
) for passing data between ranks/GPUs.
Breaking changes
Ptyscan classes (experiment) need to be imported explicitly
Most derived PtyScan classes (all those in the /experiment
folder) now need to be imported explicitly. We took this step to separate the user space more clearly from the base package and to avoid dependency creep from user-introduced code. At the beginning of your script, you now need to import your module explicitly or use one of the helper functions.
import ptypy
ptypy.load_ptyscan_module(module='')
ptypy.load_all_ptyscan_modules()
Any PtyScan derived class in these modules that is decorated with the ptypy.experiment.register()
function will now be included in the parameter tree and selectable by name.
If you prefer the old way of importing ptypy "fully loaded", just use
import ptypy
ptypy.load_all()
which attempts to load all optional PtyScan classes and all engines.
Other updates
- Code for
utils.parallel.bcast_dict
andgather_dict
has been simplified and should be backwards compatible. - The
fourier_power_bound
that was previously calculated internally from thefourier_relax_factor
can now be set explicitly and we recommend that from now on. The recommended value for thefourier_power_bound
is 0.25 for Poisson statistics (see supplementary ofthis paper
) - Position correction now supports an alternate search scheme, i.e. along a fixed grid. This scheme is more accurate than a stochastic search and the overhead incurred for this brute force search is acceptable for GPU engines.
- We switched to a pip install within a conda environment as the main supported way of installation
Roadmap
- Automatic adjustment of the block sizes.
- Improve scaling behavior across multiple nodes and high frame counts.
- Better support for live processing (on a continuous detector data stream).
- More tests.
- Branch cleaning.
Contributors
Thanks to the efforts at the Diamond Light Source that made this update possible.
- Aaron Parsons
- Bjoern Enders
- Benedikt Daurer
- Joerg Lotze