Support chunk cache in versioned-hdf5 (PyInf#13103) #357
Comments
This question will be a good one for Guido - if you're okay with this, let's consult him once he becomes available. |
#386 completely disregards the virtual dataset and always reads from raw_data (whereas master reads from raw_data if there are any changes, and from the virtual dataset if there are none). Importantly, the same PR also disables the cache on read, which increases the reliance on the hdf5 C cache. With #386 we see that if you size the hdf5 C cache to be as large as the whole dataset, you get almost the same performance as caching with numpy (but, interestingly, you pay dearly to initialise the cache):

    import h5py
    import numpy as np
    from versioned_hdf5 import VersionedHDF5File

    with h5py.File(path, "w") as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version("r0") as sv:
            sv.create_dataset("value", data=np.random.random((20000, 20000)), chunks=(100, 100))

    # Default hdf5 C cache
    with h5py.File(path, "r+") as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version("r1") as sv:
            %time sv["value"][()]  # 1.38 s
            %time sv["value"][()]  # 1.18 s
        with vf.stage_version("r2") as sv:
            %time sv["value"][()]  # 1.3 s

    # hdf5 C cache as large as the whole dataset
    with h5py.File(path, "r+", rdcc_nbytes=3*2**30, rdcc_nslots=400_000) as f:  # 5% cache collision rate
        vf = VersionedHDF5File(f)
        with vf.stage_version("r3") as sv:
            %time sv["value"][()]  # 2.07 s
            %time sv["value"][()]  # 944 ms
        with vf.stage_version("r4") as sv:
            %time sv["value"][()]  # 1.09 s
            # Load everything into numpy cache
            sv["value"].dataset.staged_changes.load()
            %time sv["value"][()]  # 779 ms
|
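For reference, the cache sizing used in the benchmark above can be checked with a bit of arithmetic. This is only a sketch: the helper name `chunk_cache_sizing` is made up for illustration, and it assumes the dataset is float64 (8 bytes per element), which is what `np.random.random` returns.

```python
import math


def chunk_cache_sizing(shape, chunks, itemsize):
    """Return (n_chunks, chunk_nbytes, total_nbytes) for a chunked dataset.

    Hypothetical helper for illustration; not part of h5py or versioned-hdf5.
    """
    # Number of chunks along each axis, rounding up for partial edge chunks
    chunks_per_axis = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    n_chunks = math.prod(chunks_per_axis)
    # Size in bytes of one full chunk
    chunk_nbytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_nbytes, n_chunks * chunk_nbytes


n_chunks, chunk_nbytes, total_nbytes = chunk_cache_sizing(
    shape=(20000, 20000), chunks=(100, 100), itemsize=8  # assumes float64
)

# Values passed to h5py.File in the benchmark
rdcc_nbytes = 3 * 2**30
rdcc_nslots = 400_000

print(n_chunks)                     # 40000
print(chunk_nbytes)                 # 80000
print(total_nbytes)                 # 3200000000
print(total_nbytes <= rdcc_nbytes)  # True
print(rdcc_nslots // n_chunks)      # 10
```

So `rdcc_nbytes=3*2**30` is just large enough to hold all 40,000 chunks, and `rdcc_nslots=400_000` gives ten hash slots per chunk.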
Closed by #386 |
Reopening as |
This is probably more of an h5py or libhdf5 issue, but it mainly impacts versioned-hdf5, so I'm opening it here.

HDF5 maintains a chunk cache to satisfy repeated accesses to a chunk without having to go to the file; see the hdf5 docs.
If I understand correctly, chunk caching never applies to virtual datasets, so it is not used in versioned-hdf5? Here are some benchmarks that are much slower in versioned-hdf5 than without versioning:
Is this slowness due to the (missing) chunk cache? If yes, how can we add support for chunk caching for virtual datasets?
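To make the idea of a chunk cache concrete, here is a toy model in pure Python. It is a deliberately simplified sketch, not libhdf5's implementation: the class `ToyChunkCache`, the `chunk_id % nslots` hash, and the `load_from_file` stand-in are all assumptions made for illustration. It only shows why a repeated read can be served from memory, and why too few hash slots cause evictions. It says nothing about whether libhdf5 actually uses its cache for virtual datasets.

```python
class ToyChunkCache:
    """Toy slot-hashed chunk cache: each slot holds at most one chunk."""

    def __init__(self, nslots):
        self.nslots = nslots
        self.slots = {}  # slot index -> (chunk_id, data)
        self.hits = 0
        self.misses = 0

    def read(self, chunk_id, load):
        slot = chunk_id % self.nslots  # assumed hash, for illustration only
        cached = self.slots.get(slot)
        if cached is not None and cached[0] == chunk_id:
            self.hits += 1
            return cached[1]
        # Miss: go to the "file" and evict whatever occupied the slot
        self.misses += 1
        data = load(chunk_id)
        self.slots[slot] = (chunk_id, data)
        return data


def load_from_file(chunk_id):
    # Stand-in for an expensive disk read
    return f"chunk-{chunk_id}"


def read_twice(nslots, n_chunks=4):
    """Read every chunk twice in order and return (hits, misses)."""
    cache = ToyChunkCache(nslots)
    for _ in range(2):
        for chunk_id in range(n_chunks):
            cache.read(chunk_id, load_from_file)
    return cache.hits, cache.misses


print(read_twice(nslots=8))  # (4, 4): the second pass is served from the cache
print(read_twice(nslots=2))  # (0, 8): slot collisions evict chunks before reuse
```

With enough slots, the second full read never touches the file. With too few, sequential access evicts each chunk before it is read again, so the cache does not help.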