Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overwrite image and audio cached assets when necessary #1981

Closed
lhoestq opened this issue Oct 13, 2023 · 8 comments
Closed

Overwrite image and audio cached assets when necessary #1981

lhoestq opened this issue Oct 13, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@lhoestq
Copy link
Member

lhoestq commented Oct 13, 2023

Currently we never overwrite image and audio cached assets, which can lead to outdated data in the viewer like here:
https://huggingface.co/datasets/ccmusic-database/instrument_timbre_eval/discussions/1

In particular we have overwrite=False here:

https://github.com/huggingface/datasets-server/blob/b84c8210ec023664c4780ab348b0232468afe116/libs/libapi/src/libapi/response.py#L36-L42

cc @AndreaFrancis @severo

@lhoestq lhoestq added the bug Something isn't working label Oct 13, 2023
@AndreaFrancis
Copy link
Contributor

It depends on the TTL for cached-assets (which is 1 day) so, the files will be refreshed after that time.
If we are forced to overwrite each time, it isn't worth having a CloudFront layer right?
If the dataset had a change in a lower interval than the TTL, I think we should recreate all related assets to each processing step:

  • first-rows should delete all the assets before processing (so that the obsolete files will be refreshed)
  • config-parquet should delete all the cached-assets before processing

@lhoestq
Copy link
Member Author

lhoestq commented Oct 13, 2023

I think first-rows is fine since it has overwrite=True:

https://github.com/huggingface/datasets-server/blob/b84c8210ec023664c4780ab348b0232468afe116/services/worker/src/worker/job_runners/split/first_rows_from_streaming.py#L283-L289

https://github.com/huggingface/datasets-server/blob/b84c8210ec023664c4780ab348b0232468afe116/services/worker/src/worker/job_runners/split/first_rows_from_parquet.py#L203-L209

And yes config-parquet-metadata (which is the job that provides /rows) can take care of removing the outdated cached assets, good idea

@AndreaFrancis
Copy link
Contributor

Related to #1823
Probably, if we recreate the dataset it won't work because https://github.com/huggingface/datasets-server/blob/main/services/admin/src/admin/routes/recreate_dataset.py#L61 is pointing still to the old cached-assets EFS but it should delete all files in S3.
Maybe I can add it as part of #1976

@AndreaFrancis
Copy link
Contributor

https://huggingface.co/datasets/ccmusic-database/instrument_timbre_eval is now serving the correct files. What I saw is that there is a time period for CloudFront https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html
By default, each file automatically expires after 24 hours

So, if we change/update an S3 file in a lower time interval, for assets and cached-assets, we will have to invalidate the cache or something like this https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html

@lhoestq
Copy link
Member Author

lhoestq commented Oct 16, 2023

If I understand correctly we can fix this issue by including the dataset revision hash in the cached asset path no ? It corresponds to "versioning file names" in the aws docs

@AndreaFrancis
Copy link
Contributor

If I understand correctly we can fix this issue by including the dataset revision hash in the cached asset path no ? It corresponds to "versioning file names" in the aws docs

Sound a great idea! I can work on it

@severo
Copy link
Collaborator

severo commented Oct 17, 2023

If I understand correctly we can fix this issue by including the dataset revision hash in the cached asset path no ? It corresponds to "versioning file names" in the aws docs

good idea

@AndreaFrancis
Copy link
Contributor

Fixed in #1988. However, we will still be having the issue with mixed cache that will be handled in #1823

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants