v2.0.0
- `stage_dataset` has been reimplemented from scratch. The new engine is
  expected to be much faster in most cases.
- `__getitem__` on staged datasets used to never cache data when reading from
  unmodified datasets (before the first call to `__setitem__` or `resize()`) and
  used to cache the whole loaded area on modified datasets (where the user had
  previously changed a single point anywhere within the same staged version).
  This has now been changed to always use the libhdf5 cache. As this cache is
  very small by default, users on slow disk backends may observe a slowdown in
  read-update-write use cases that don't overwrite whole chunks, e.g.
  `ds[::2] += 1`. They should experiment with sizing the libhdf5 cache so that
  it's larger than the work area, e.g.:

  ```python
  import h5py
  from versioned_hdf5 import VersionedHDF5File

  # Allocate a 1 GiB chunk cache (rdcc_nbytes) so it can hold the work area
  with h5py.File(path, "r+", rdcc_nbytes=2**30, rdcc_nslots=100_000) as f:
      vf = VersionedHDF5File(f)
      with vf.stage_version("r123") as sv:
          sv["some_ds"][::2] += 1
  ```

  (This recommendation applies to plain h5py datasets too; a sketch follows
  the list.)
  Note that this change exclusively impacts `stage_dataset`; `current_version`,
  `get_version_by_name`, and `get_version_by_timestamp` are not impacted and
  continue not to cache anything regardless of libhdf5 cache size.
- Added support for Ellipsis (`...`) in indexing (see the example below).
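A minimal sketch of the new Ellipsis support, assuming an open `h5py.File`
`f` that contains a versioned 3-D dataset named "some_ds" (all names here
are illustrative, not part of the release):

```python
from versioned_hdf5 import VersionedHDF5File

# Illustrative only: `f` is an open h5py.File with a versioned 3-D
# dataset "some_ds"; "r124" is a made-up version name.
vf = VersionedHDF5File(f)
with vf.stage_version("r124") as sv:
    ds = sv["some_ds"]
    ds[..., 0] = 1.0    # equivalent to ds[:, :, 0] = 1.0
    plane = ds[0, ...]  # equivalent to ds[0, :, :]
```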
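As noted in the caching item above, the same cache-sizing recommendation
applies to plain h5py datasets as well; a minimal sketch, assuming a chunked
dataset "some_ds" in an ordinary (non-versioned) HDF5 file at `path`:

```python
import h5py

# Illustrative only: the same rdcc_* cache tuning on a plain h5py file.
with h5py.File(path, "r+", rdcc_nbytes=2**30, rdcc_nslots=100_000) as f:
    f["some_ds"][::2] += 1  # read-update-write that doesn't cover whole chunks
```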