
Cache ZIM metadata on library-gen? #209

Open
rgaudin opened this issue Jun 26, 2024 · 2 comments
Labels
question Further information is requested

Comments

@rgaudin
Member

rgaudin commented Jun 26, 2024

Currently, the library generator script, which is used for both library and dev-library (different source folders), spends most of its time reading metadata from ZIM files on the filesystem.

On library, this is ~6,800 files. This can be completed within ~6 minutes, but if the disk is busy (reminder: the server is using mechanical drives), it can take 3 hours.

This script is run every 30 minutes on library and every 10 minutes on dev-library.

While this will all be obsolete once the CMS takes over, a quick and easy improvement would be to cache this information and only read metadata for new files. It's actually already cached (in the previously written library XML), so it's just a matter of skipping/reusing data for existing entries.

The only drawback is that it won't update the metadata of a file that has been overwritten, but that's already a scenario we've excluded, and we could implement a simple file-flag that triggers a full re-read if present.
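As a rough illustration of the idea above (not the actual library generator code), the reuse-or-read logic could be sketched as follows. Every name here is hypothetical: `build_entries`, the `cached` mapping (assumed to have been parsed from the previously written library XML and keyed by file name), the `read_metadata` callable (standing in for whatever currently reads a ZIM file), and the `force_flag` path are assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

Metadata = Dict[str, str]


def build_entries(
    zim_paths: Iterable[Path],
    cached: Dict[str, Metadata],
    read_metadata: Callable[[Path], Metadata],
    force_flag: Path,
) -> Dict[str, Metadata]:
    """Return metadata per file name, touching the disk only for new files.

    `cached` is assumed to come from the previous library XML.
    If `force_flag` exists, the cache is ignored and every file is re-read.
    """
    full_reread = force_flag.exists()
    entries: Dict[str, Metadata] = {}
    for path in zim_paths:
        key = path.name
        if not full_reread and key in cached:
            # Known file: reuse what the previous run already wrote.
            entries[key] = cached[key]
        else:
            # New file (or forced re-read): the expensive ZIM read.
            entries[key] = read_metadata(path)
    if full_reread:
        # Assumption: the flag is one-shot, so it is cleared after use.
        force_flag.unlink(missing_ok=True)
    return entries
```

Files that disappeared from the folder are dropped naturally, since only paths currently present are iterated. Whether the flag should be one-shot or persistent is a design choice this sketch does not settle.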

@rgaudin rgaudin added the question Further information is requested label Jun 26, 2024
@kelson42
Contributor

I don't think we should do anything in the meantime (before the CMS is published), but otherwise I would recommend saving in the library a kind of publishing date (see this comment: kiwix/libkiwix#702 (comment)), which would be the same as the ZIM file's last-modified date. Based on the comparison, I would use the last library.xml as cache if the file has not been renewed.
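A minimal sketch of that comparison, assuming the publishing date is stored as an ISO 8601 UTC string truncated to the second. The storage format and the helper names `file_mtime_iso` and `can_reuse` are hypothetical; no such field is defined in the library XML at the time of this discussion.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def file_mtime_iso(path: Path) -> str:
    """Last-modified time of the file as an ISO 8601 UTC string (whole seconds)."""
    mtime = int(path.stat().st_mtime)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()


def can_reuse(path: Path, cached_published: Optional[str]) -> bool:
    """True if the cached entry's date matches the file's current mtime.

    `cached_published` is the (hypothetical) publishing date read from the
    previous library.xml, or None when the entry has no such date.
    """
    return cached_published is not None and cached_published == file_mtime_iso(path)
```

A `stat()` call is still needed per file, but it avoids opening the ZIM itself, and an overwritten file is picked up because its modification time no longer matches the stored date.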

@rgaudin
Member Author

rgaudin commented Jun 26, 2024

Yes. The problem is that the XML file is public, so we should not come up with anything ourselves and should wait for that libkiwix ticket first…

Let's keep that ticket open as an option until the CMS arrives or something else pressures us to do it.
