
Parquet Dataset cache not reliable #800

Closed

fjetter opened this issue Jan 24, 2024 · 7 comments · Fixed by #798
@fjetter
Member

fjetter commented Jan 24, 2024

The parquet dataset cache (see https://github.com/dask-contrib/dask-expr/blob/13af21d1be9e0a3393f4971a0d95382188f6f248/dask_expr/io/parquet.py#L54-L57) is currently keyed by a token that is deterministic given the user's input arguments but is not sensitive to any external state. In particular, this means there is no user-accessible way to invalidate the cache. Any mutation of the dataset, whether to the schema, the metadata, or just an append operation, would go unnoticed.

This problem is amplified by the fact that we additionally use a cached property for everything derived from this information, which makes it impossible to correctly reuse an existing instance once the dataset is mutated (this can cause errors, but also data loss).

Both of these caching layers would need to be removed to allow Expr objects to become singletons as proposed in #798.

This scenario is actually covered by our unit tests, see test_to_parquet. That test writes a dataset, reads it, and overwrites it again. The only reason it currently passes is that the cache is invalidated when the overwrite keyword is used, see here. However, this kind of cache invalidation will not work reliably in any multi-interpreter environment, let alone in a complex system where we are just ingesting a dataset.

The global cache should use a modified-at timestamp or something similar that allows invalidation. If that is not possible, we need at the very least a mechanism that allows us to ignore or invalidate the cache.
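
For illustration, here is a minimal sketch (not the actual dask-expr code) of how such a modified-at/etag fingerprint could be folded into the cache key. The names _dataset_info_cache, _dataset_fingerprint, and cached_dataset_info are hypothetical; only fsspec and dask.base.tokenize are real APIs:

```python
# Minimal sketch, not the actual dask-expr implementation: fold a
# "last modified" fingerprint of the dataset into the cache key so that
# schema changes, appends, or overwrites invalidate stale entries.
import fsspec
from dask.base import tokenize

_dataset_info_cache = {}  # hypothetical stand-in for the global parquet cache


def _dataset_fingerprint(path, storage_options=None):
    """Hash the paths plus modification times / etags of all files under ``path``."""
    fs, _, paths = fsspec.get_fs_token_paths(path, storage_options=storage_options)
    infos = fs.find(paths[0], detail=True)  # single listing: {path: info dict}
    # Which fields exist depends on the backend (local -> mtime, S3 -> ETag /
    # LastModified), so fall back to the file size if nothing better is there.
    stamps = [
        (p, i.get("mtime") or i.get("ETag") or i.get("LastModified") or i.get("size"))
        for p, i in sorted(infos.items())
    ]
    return tokenize(stamps)


def cached_dataset_info(path, collect, storage_options=None, **kwargs):
    """Return cached dataset info only while the on-disk fingerprint matches."""
    key = tokenize(path, kwargs, _dataset_fingerprint(path, storage_options))
    if key not in _dataset_info_cache:
        _dataset_info_cache[key] = collect(path, storage_options=storage_options, **kwargs)
    return _dataset_info_cache[key]
```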

@mrocklin
Member

mrocklin commented Jan 24, 2024 via email

@fjetter
Member Author

fjetter commented Jan 24, 2024

The tricky part will be the ordinary user API where a user just points to a directory. Without any metadata files, we'll need to walk over all files and collect the hashes/etags for every one of them. This is (kind of) what _collect_dataset_info already does, but it is also relatively expensive. If we have to do this to verify the cache, we could just as well not cache it at all.

@rjzamora
Member

I think it probably makes sense to add and document a mechanism to clear all file-system and data-source caching.

I think most dask users do not frequently mutate their datasets between read operations, so we probably don't want to add a lot of complexity to inspect the dataset all the time.
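
As a hedged sketch of what such a documented escape hatch could look like: clear_io_caches and _dataset_info_cache are made-up names, while fs.invalidate_cache() is the real fsspec call for dropping cached directory listings:

```python
# Illustrative sketch only, not an existing dask-expr API.
import fsspec

_dataset_info_cache = {}  # stand-in for dask-expr's global parquet dataset cache


def clear_io_caches(path, storage_options=None):
    """Forget cached information about the dataset at ``path``."""
    # Drop cached parquet dataset metadata (hypothetical global dict).
    _dataset_info_cache.clear()
    # Drop fsspec's cached directory listings for the relevant filesystem so
    # the next read_parquet call re-lists the files.
    fs, _, _ = fsspec.get_fs_token_paths(path, storage_options=storage_options)
    fs.invalidate_cache()
```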

@fjetter
Member Author

fjetter commented Jan 29, 2024

I'm currently looking a little closer into the S3FS implementation, and as it turns out, it is all but guaranteed to walk over the entire bucket even before we try to read any data. That should be fixed... but it also tells me that collecting etags/last-modified timestamps for all/most/some of the files wouldn't be catastrophic since we're already doing it...

@rjzamora
Member

rjzamora commented Jan 29, 2024

> but it also tells me that collecting etags/last-modified timestamps for all/most/some of the files wouldn't be catastrophic since we're already doing it...

Thanks for looking into this. It makes sense that this wouldn't be catastrophic as long as we don't need to gather new information at every optimization step. However, I'm worried that we would need to do exactly that to be completely robust.

I wonder how reasonable it would be to simply add an optional reset_cache kwarg to read_parquet (and other IO APIs) for the special case where someone is reading from and writing to the same dataset within the same workflow. We could then ignore/replace a matching key in Expr._instances (and other caches).
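
As a rough, purely illustrative sketch of that idea (the reset_cache kwarg does not exist, and the dicts below are stand-ins for Expr._instances and the dataset-info cache):

```python
# Rough sketch of the hypothetical ``reset_cache`` kwarg; the dicts below are
# stand-ins for Expr._instances and the global parquet dataset-info cache.
_instances = {}            # stand-in for Expr._instances
_dataset_info_cache = {}   # stand-in for the dataset-info cache


def read_parquet(path, reset_cache=False, **kwargs):
    if reset_cache:
        # Drop any cached expression/dataset state for this path so the new
        # expression re-inspects the files on disk.
        _instances.pop(path, None)
        _dataset_info_cache.pop(path, None)
    # ... construct and return the ReadParquet-backed collection as usual ...
```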

@fjetter
Member Author

fjetter commented Jan 30, 2024

If we implemented any checksum/hash/etag/"modified since" cache-invalidation logic, it would only be relevant on a user-facing read_parquet call. We would still cache that kind of information so that the optimizer can utilize the cache. We would create a unique token in the user-facing read_parquet call and just pass it as an argument to the expression.

I'm rather strongly against a reset_cache kwarg or any other manual cache invalidation, since I think this violates the principle of least surprise for user APIs. If I perform a read_parquet call, I expect it to represent the dataset at the time I called that API. From that perspective, a simple solution to get started would be a UUID generated in read_parquet and passed as an argument to ReadParquet(Expr), where it is used for the name/token. Eventually, this could be replaced with a checksum/etag token.
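
Roughly, as a minimal sketch (the class below is a toy stand-in, not the real dask-expr ReadParquet):

```python
# Minimal sketch of the UUID-per-call idea; toy stand-in for dask-expr's
# ReadParquet expression, not the real implementation.
import uuid
from dask.base import tokenize


class ReadParquet:
    def __init__(self, path, dataset_token, **kwargs):
        self.path = path
        self.kwargs = kwargs
        # Fixed at construction time, so the expression represents the dataset
        # as it was when the user called read_parquet.
        self.dataset_token = dataset_token

    @property
    def _name(self):
        # The dataset token participates in the name, so two read_parquet
        # calls on the same path never collapse onto a stale cached instance.
        return "readparquet-" + tokenize(self.path, self.kwargs, self.dataset_token)


def read_parquet(path, **kwargs):
    # Fresh UUID per user-facing call; could later be replaced with a
    # checksum/etag-based fingerprint of the dataset.
    return ReadParquet(path, dataset_token=uuid.uuid4().hex, **kwargs)
```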

@rjzamora
Member

> We would create a unique token in the user-facing read_parquet call and just pass it as an argument to the expression.

Okay, that makes sense to me. I was originally going to recommend the same thing. I started to get worried that scanning the file system outside the expression would be too inconsistent with the current programming model. However, I’m starting to come back around.
