Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use placeholder revision in urls in cached responses #2966

Merged
merged 7 commits into from
Jul 15, 2024
Merged

Conversation

lhoestq
Copy link
Member

@lhoestq lhoestq commented Jul 3, 2024

Previously, we were using the dataset revision in the URLs of image/audio files of cached responses of /first-rows.
However when a dataset gets its README updated, we update the dataset_git_revision of the cache entries and the location of the image/audio files on S3 but we don't modify the revision in the URLs in the cached response.
This resulted in the Viewer not showing the images after modifying the readme of a dataset.

image

I fixed that for future datasets by not using the revision in the URLs anymore and use a placeholder that is replaced by dataset_git_revision when the cached response is accessed

Implementation details

I modified the URL Signer logic to also insert the revision in the URL and renamed it to a URL Preparator. It takes care of inserting the revision and signing the URLs.

close #2965

@lhoestq lhoestq marked this pull request as ready for review July 3, 2024 12:42
Comment on lines 47 to +49
object_key = storage_client.generate_object_key(
dataset=dataset,
revision=revision,
revision=DATASET_GIT_REVISION_PLACEHOLDER,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change is here

Comment on lines +64 to +66
return ImageSource(
src=storage_client.get_url(object_key, revision=revision), height=image.height, width=image.width
)
Copy link
Member Author

@lhoestq lhoestq Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few lines later, get_url applies the right revision if there is an URL preparator associated to the storage client.

In practice:

  • /rows, /search and /filter have an URL preparator that inserts the revision and signs the urls from the parquet/duckdb data
  • split-first-rows worker has no URL preparator: URLs are stored in the cached as unprepared (i.e. with the revision placeholder and unsigned)
  • /first-rows has an URL preparator that inserts the revision and signs the urls that come form the cache

@lhoestq lhoestq requested a review from AndreaFrancis July 3, 2024 13:22
libs/libcommon/src/libcommon/url_preparator.py Outdated Show resolved Hide resolved
libs/libcommon/src/libcommon/viewer_utils/asset.py Outdated Show resolved Hide resolved
@lhoestq lhoestq merged commit 57008ce into main Jul 15, 2024
25 checks passed
@lhoestq lhoestq deleted the set-revision-in-url branch July 15, 2024 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Viewer doesn't show images properly after a smart update
2 participants