Support chunk cache in versioned-hdf5 (PyInf#13103) #357

ArvidJB · 2024-07-23T20:55:15Z

This is probably more of an h5py or libhdf5 issue, but it mainly impacts versioned-hdf5 so I'm opening it here.

HDF5 maintains a chunk cache so that repeated accesses to a chunk can be served without going back to the file; see the HDF5 docs.
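For reference, h5py exposes the cache size through the `rdcc_nbytes` and `rdcc_nslots` arguments of `h5py.File`. Below is a minimal sketch of how one might pick values large enough to hold a whole dataset. The helper names and the `slots_per_chunk=10` default are assumptions made for illustration and are not part of h5py or versioned-hdf5.

```python
import math


def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for slot counts of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def suggest_chunk_cache(shape, chunks, itemsize, slots_per_chunk=10):
    """Hypothetical helper: cache settings that fit every chunk of a dataset.

    Returns (rdcc_nbytes, rdcc_nslots), where rdcc_nbytes is the total size
    of all chunks and rdcc_nslots is a prime of roughly
    slots_per_chunk * number_of_chunks.
    """
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_nbytes = math.prod(chunks) * itemsize
    return n_chunks * chunk_nbytes, next_prime(slots_per_chunk * n_chunks)


# 1,000,000 eight-byte integers stored in chunks of 1000 elements
rdcc_nbytes, rdcc_nslots = suggest_chunk_cache((1_000_000,), (1000,), itemsize=8)
print(rdcc_nbytes, rdcc_nslots)
```

The resulting values could then be passed as `h5py.File(..., rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots)`.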

If I understand correctly, chunk caching never applies to virtual datasets, so it is not used in versioned-hdf5? Here are some benchmarks that are much slower with versioned-hdf5 than without versioning:

>>> import h5py
>>> import numpy as np
>>> from versioned_hdf5 import VersionedHDF5File


>>> # pick chunks and h5py cache sizes
>>> chunks = (1000,)
>>> rdcc_nbytes = 10 * (2 ** 20) # 10 MiB, enough to fit all 1000 chunks of 8000 bytes (1000 int64 values) each
>>> rdcc_nslots = 10007 # prime number large enough to fit roughly 10x number of chunks


>>> # slice to read
>>> slc = slice(None, None, 10)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # disable chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=0) as f:
...     print('first access:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses:')
...     %timeit f['value'][slc]

first access:
CPU times: user 11.6 ms, sys: 26.8 ms, total: 38.3 ms
Wall time: 38.4 ms

subsequent accesses:
35.8 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     print('first access actually reads from file:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses will read from cache:')
...     %timeit f['value'][slc]

first access actually reads from file:
CPU times: user 5.87 ms, sys: 1.89 ms, total: 7.76 ms
Wall time: 7.72 ms

subsequent accesses will read from cache:
5.09 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     vf = VersionedHDF5File(f)
...     with vf.stage_version('r0') as sv:
...         sv.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     vf = VersionedHDF5File(f)
...     cv = vf[vf.current_version]
...     print('first access actually reads from file:')
...     %time cv['value'][slc]
...     print()
...     print('for versioned files no accesses will ever read from chunk cache:')
...     %timeit cv['value'][slc]

first access actually reads from file:
CPU times: user 374 ms, sys: 16.7 ms, total: 390 ms
Wall time: 391 ms

for versioned files no accesses will ever read from chunk cache:
492 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this slowness due to the (missing) chunk cache? If yes, how can we add support for chunk caching for virtual datasets?
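To make the access pattern in the benchmarks above concrete, here is a small pure-Python sketch (it does not touch HDF5, and the helper name is made up for illustration) that counts how many chunks the strided slice hits and how many elements it takes from each. Every chunk is touched, so without a cache each read has to fetch all of them.

```python
def chunks_touched(slc, length, chunk_size):
    """Return {chunk_index: number of selected elements in that chunk}
    for a 1-D dataset of the given length and chunk size."""
    counts = {}
    for i in range(*slc.indices(length)):
        idx = i // chunk_size
        counts[idx] = counts.get(idx, 0) + 1
    return counts


# Same slice, dataset length and chunk size as in the benchmarks
counts = chunks_touched(slice(None, None, 10), 1_000_000, 1000)
print(len(counts))           # number of distinct chunks touched
print(set(counts.values()))  # elements read per chunk
```

With these parameters the slice touches all 1000 chunks and uses only 100 of the 1000 elements in each.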


peytondmurray · 2024-07-26T19:11:40Z

This question will be a good one for Guido - if you're okay with this, let's consult him once he becomes available.

crusaderky · 2024-10-22T16:06:43Z

#386 completely disregards the virtual dataset, always reading from raw_data (whereas master reads from raw_data if there are any changes, and from the virtual dataset if there are none).

Importantly, the same PR also disables cache on read, which increases the reliance on the hdf5 C cache.
It is feasible, but not trivial, to add it back at a later stage.

With #386, if you size the hdf5 C cache to be as large as the whole dataset, you get almost the same performance as caching with numpy (but, interestingly, you pay dearly to initialise the cache):

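import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

path = "/var/tmp/data.h5"  # assumed location; the original snippet does not define path

with h5py.File(path, "w") as f: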
    vf = VersionedHDF5File(f)
    with vf.stage_version("r0") as sv:
        sv.create_dataset("value", data=np.random.random((20000, 20000)), chunks=(100, 100))

with h5py.File(path, "r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r1") as sv:
        %time sv["value"][()]  # 1.38 s
        %time sv["value"][()]  # 1.18 s
    with vf.stage_version("r2") as sv:
        %time sv["value"][()]  # 1.3 s

with h5py.File(path, "r+", rdcc_nbytes=3*2**30, rdcc_nslots=400_000) as f:  # 5% cache collision rate
    vf = VersionedHDF5File(f)
    with vf.stage_version("r3") as sv:
        %time sv["value"][()]  # 2.07 s
        %time sv["value"][()]  # 944 ms
    with vf.stage_version("r4") as sv:
        %time sv["value"][()]  # 1.09 s
        # Load everything into numpy cache
        sv["value"].dataset.staged_changes.load()
        %time sv["value"][()]  # 779 ms
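The cache parameters in the snippet above can be sanity-checked with some arithmetic. The sketch below assumes chunks hash uniformly at random into the cache slots, which is an idealisation and not necessarily how libhdf5 assigns slots, nor necessarily how the 5% figure was derived. Under that assumption, 40,000 chunks in 400,000 slots give an expected collision fraction of roughly 5%, and the 3 GiB cache is just large enough for the whole dataset.

```python
import math


def expected_collision_fraction(n_chunks: int, n_slots: int) -> float:
    """Expected fraction of chunks that land in an already occupied slot,
    assuming each chunk picks a slot uniformly at random (an idealisation)."""
    expected_occupied = n_slots * (1.0 - (1.0 - 1.0 / n_slots) ** n_chunks)
    return 1.0 - expected_occupied / n_chunks


shape, chunks = (20000, 20000), (100, 100)
itemsize = 8  # np.random.random returns float64

n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
dataset_nbytes = math.prod(shape) * itemsize
cache_nbytes = 3 * 2**30

print(n_chunks)                        # chunks in the dataset
print(dataset_nbytes <= cache_nbytes)  # does the whole dataset fit in the cache?
print(expected_collision_fraction(n_chunks, 400_000))
```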

crusaderky · 2024-12-05T10:15:56Z

Closed by #386

crusaderky · 2024-12-05T11:16:30Z

Reopening, as current_version, get_version_by_name, and get_version_by_timestamp are still impacted.

ArvidJB changed the title ~~Support chunk cache in versioned-hdf5~~ Support chunk cache in versioned-hdf5 (PyInf#13103) Jul 23, 2024

crusaderky self-assigned this Oct 22, 2024

crusaderky closed this as completed Dec 5, 2024

crusaderky reopened this Dec 5, 2024
