Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: serde + caching #28

Merged
merged 20 commits into from
Oct 11, 2024
Merged

feat: serde + caching #28

merged 20 commits into from
Oct 11, 2024

Conversation

cmdoret
Copy link
Member

@cmdoret cmdoret commented Oct 10, 2024

Context

Loading and filtering the ontology when instantiating a TermMatcher object
is by far the slowest operation in (py)fuzon.

Summary

This PR adds efficient (de)serialization support to fuzon::TermMatcher using postcard, a low overhead binary serializer.

Changes

  • Implement SerDe on TermMatcher
  • Implement caching in CLI to prevent re-downloading the same ontologies over and over again.
    • cache keys are built from the path, last-modified timestamp and ETag(url) / size(file) of ontologies
    • if there are multiple ontologies, their data are sorted+concatenated before hashing
  • Implement pyo3 bindings to TermMatcher::load/dump in pyfuzon
  • Expose load/dump functionality in python API

Notes

Basic tests with the uberon ontology showed the following:

  • The full ontology is 89MB
  • Once filtered (by the TermMatcher constructor), we are down to ~2MB
  • Creating a TermMatcher from scratch with locally stored UBERON took 2.765 (on NVMe).
  • WIth postcard, deserializing a TermMatcher takes 65ms, the object dump takes 2.1MB

Current limitations

  • There is currently no way to cleanup the cache (besides deleting the cache directory). Would it make sense to add it as a CLI flag?

@cmdoret cmdoret linked an issue Oct 10, 2024 that may be closed by this pull request
3 tasks
@cmdoret cmdoret self-assigned this Oct 11, 2024
@cmdoret cmdoret requested a review from marftn October 11, 2024 12:23
Copy link
Member

@marftn marftn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@cmdoret cmdoret merged commit de0fe52 into main Oct 11, 2024
11 checks passed
@cmdoret cmdoret deleted the feat/serde branch October 22, 2024 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add serde support to TermMatcher
2 participants