Support chunk cache in versioned-hdf5 (PyInf#13103) #357

Open
ArvidJB opened this issue Jul 23, 2024 · 4 comments
ArvidJB (Collaborator) commented on Jul 23, 2024

This is probably more of an h5py or libhdf5 issue, but it mainly impacts versioned-hdf5 so I'm opening it here.

HDF5 maintains a chunk cache to satisfy repeated accesses to a chunk without having to go to the file, see hdf5 docs.
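
As an aside, these cache parameters can also be attached to a single dataset through h5py's low-level API rather than file-wide. This is only a minimal sketch for context, assuming the same file and dataset names as the benchmarks below (the parameter values are illustrative):

>>> import h5py
>>> with h5py.File('/var/tmp/data.h5', 'r') as f:
...     # Dataset access property list (DAPL) holding the chunk cache settings
...     dapl = h5py.h5p.create(h5py.h5p.DATASET_ACCESS)
...     # (nslots, nbytes, w0): hash table slots, cache size in bytes, eviction weight
...     dapl.set_chunk_cache(10007, 10 * (2 ** 20), 0.75)
...     dsid = h5py.h5d.open(f.id, b'value', dapl)
...     value = h5py.Dataset(dsid)  # high-level wrapper around the cache-configured dataset
...     data = value[::10]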

If I understand it correctly, chunk caching never applies to virtual datasets, so it is not used in versioned-hdf5. Is that right? Here are some benchmarks that are much slower in versioned-hdf5 than without versioning:

>>> import h5py
>>> import numpy as np
>>> from versioned_hdf5 import VersionedHDF5File


>>> # pick chunks and h5py cache sizes
>>> chunks = (1000,)
>>> rdcc_nbytes = 10 * (2 ** 20) # 10 MiB, enough to hold ~1300 chunks of 8000 bytes each
>>> rdcc_nslots = 10007 # prime number, roughly 10x the number of chunks that fit in the cache


>>> # slice to read
>>> slc = slice(None, None, 10)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # disable chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=0) as f:
...     print('first access:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses:')
...     %timeit f['value'][slc]

first access:
CPU times: user 11.6 ms, sys: 26.8 ms, total: 38.3 ms
Wall time: 38.4 ms

subsequent accesses:
35.8 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     print('first access actually reads from file:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses will read from cache:')
...     %timeit f['value'][slc]

first access actually reads from file:
CPU times: user 5.87 ms, sys: 1.89 ms, total: 7.76 ms
Wall time: 7.72 ms

subsequent accesses will read from cache:
5.09 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     vf = VersionedHDF5File(f)
...     with vf.stage_version('r0') as sv:
...         sv.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     vf = VersionedHDF5File(f)
...     cv = vf[vf.current_version]
...     print('first access actually reads from file:')
...     %time cv['value'][slc]
...     print()
...     print('for versioned files no accesses will ever read from chunk cache:')
...     %timeit cv['value'][slc]

first access actually reads from file:
CPU times: user 374 ms, sys: 16.7 ms, total: 390 ms
Wall time: 391 ms

for versioned files no accesses will ever read from chunk cache:
492 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this slowness due to the (missing) chunk cache? If so, how can we add support for chunk caching for virtual datasets?

ArvidJB changed the title from "Support chunk cache in versioned-hdf5" to "Support chunk cache in versioned-hdf5 (PyInf#13103)" on Jul 23, 2024
peytondmurray (Collaborator) commented

This will be a good question for Guido - if you're okay with it, let's consult him once he becomes available.

crusaderky self-assigned this on Oct 22, 2024
crusaderky (Collaborator) commented

#386 completely disregards the virtual dataset, always reading from raw_data (whereas master reads from raw_data if there are any changes, and from the virtual dataset if there are none).

Importantly, the same PR also disables cache on read, which increases the reliance on the hdf5 C cache.
It is feasible, but not trivial, to add it back at a later stage.

With #386, if you size the hdf5 C cache to be as large as the whole dataset, you get almost the same performance as caching with numpy (though, interestingly, you pay dearly to initialise the cache):

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

path = "/var/tmp/data.h5"  # scratch file; any writable path works

with h5py.File(path, "w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r0") as sv:
        sv.create_dataset("value", data=np.random.random((20000, 20000)), chunks=(100, 100))

with h5py.File(path, "r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r1") as sv:
        %time sv["value"][()]  # 1.38 s
        %time sv["value"][()]  # 1.18s
    with vf.stage_version("r2") as sv:
        %time sv["value"][()]  # 1.3 s

with h5py.File(path, "r+", rdcc_nbytes=3*2**30, rdcc_nslots=400_000) as f:  # 5% cache collision rate
    vf = VersionedHDF5File(f)
    with vf.stage_version("r3") as sv:
        %time sv["value"][()]  # 2.07 s
        %time sv["value"][()]  # 944 ms
    with vf.stage_version("r4") as sv:
        %time sv["value"][()]  # 1.09 s
        # Load everything into numpy cache
        sv["value"].dataset.staged_changes.load()
        %time sv["value"][()]  # 779 ms

crusaderky (Collaborator) commented

Closed by #386

crusaderky (Collaborator) commented

Reopening as current_version, get_version_by_name, and get_version_by_timestamp are still impacted

crusaderky reopened this on Dec 5, 2024