Local data persistence & caching #740
There are a couple places to do caching, and we might want several:
Internally, for datasets to call the code at QCFractal/qcportal/qcportal/dataset_models.py, line 655 in e8d9cba.
I think I also agree that this could be exploited for quickly downloading datasets. Some care always has to be taken with cache invalidation. There already is some checking that is done when fetching records, but we would need some functionality for merging/purging data that is changed on the server.

The existing (very prototype) code for this ("views") stores info in an sqlite database, with entry/specification being keys, and the record data itself being stored as zstandard-compressed blobs. For a general cache, where the user is also doing the compression, we might want to turn the compression level down, but zstd is very fast in both directions.

That being said, I was not aware of sqlitedict. Is this what you are using? https://github.com/RaRe-Technologies/sqlitedict. With that, it could be possible to basically remove the idea of a "view" and just make everything the cache instead.
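For concreteness, this is roughly what that kind of storage looks like (the table layout, key scheme, and compression level here are just illustrative, not the actual view code):

```python
import json
import sqlite3
import zstandard

# Record cache keyed by (entry, specification), with the record payload
# stored as a zstandard-compressed blob.
conn = sqlite3.connect("record_cache.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS records "
    "(entry TEXT, specification TEXT, data BLOB, PRIMARY KEY (entry, specification))"
)

compressor = zstandard.ZstdCompressor(level=3)  # lower level = faster writes
decompressor = zstandard.ZstdDecompressor()

def put_record(entry, specification, record_dict):
    blob = compressor.compress(json.dumps(record_dict).encode())
    conn.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
        (entry, specification, blob),
    )
    conn.commit()

def get_record(entry, specification):
    row = conn.execute(
        "SELECT data FROM records WHERE entry = ? AND specification = ?",
        (entry, specification),
    ).fetchone()
    if row is None:
        return None
    return json.loads(decompressor.decompress(row[0]))
```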
Yes, that is the package I had used for my persistent caching of the records I downloaded. It worked well and really simplified access, since I am not very experienced with sqlite. Storing the records ends up being really just as simple as:
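(A minimal sketch with sqlitedict; the file name and key scheme are just one option, and this assumes iterate_records yields (entry, specification, record).)

```python
from sqlitedict import SqliteDict

# Open (or create) a persistent, file-backed dict of records.
records = SqliteDict("qm9_records.sqlite", autocommit=True)

# ds is a dataset already obtained from the portal client; keying on
# (entry, specification) mirrors how the dataset itself is organized.
for entry_name, spec_name, record in ds.iterate_records():
    records[f"{entry_name}|{spec_name}"] = record  # sqlitedict pickles the value

records.close()
```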
I could imagine having something like:
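Something along these lines, where the cache_file keyword is purely hypothetical (it is not an existing argument) and the dataset type/name are just examples:

```python
from qcportal import PortalClient

client = PortalClient("ml.qcarchive.molssi.org")
ds = client.get_dataset("singlepoint", "QM9")  # dataset type/name are just examples

# "cache_file" is hypothetical: point the dataset at a user-chosen sqlite file,
# with no cap on the number of cached records.
for entry_name, spec_name, record in ds.iterate_records(cache_file="~/qcarchive/qm9_cache.sqlite"):
    ...
```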
Where iterate_records would check the specified database for a record before fetching from the server, and would store anything it fetched into the database (probably not too different from what is already in there, just allowing the user to set where to store it, without capping the number of records).
I have some very preliminary code with basic functionality if you would like to try it. It's in the … Basically:
Issues:
Excellent. I will check it out and report back.
I've been testing this out. So far so good. A few small notes.

I think this line (QCFractal/qcportal/qcportal/client.py, line 147 in 8dc5342) needs to be modified to make sure it uses the sanitized version of the address, i.e. the one that sticks https:// on the front if no scheme is provided. As it is now, if I were to pass, say, "ml.qcarchive.molssi.org", the server fingerprint ends up being "None_None".

In terms of speed compared to a local dictionary cache, I got the same performance on my machine. When I implemented the sqlitedict wrapping in my own code, I found that converting the keys in the sqlite database to a set substantially sped up lookups (sqlitedict emulates a dictionary-like interface, but not dictionary lookup performance). However, this might not be a huge issue worth worrying about, as it will still be faster than fetching fresh, and this time is pretty minimal for the larger datasets (taking 2 minutes rather than 30 seconds to search through the sqlitedict keys is not a big deal when it takes over an hour to get the records anyway).
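For reference, the key-set trick looks roughly like this (the helper and the fetch call are placeholders for what my wrapper actually does):

```python
from sqlitedict import SqliteDict

record_db = SqliteDict("records_cache.sqlite", autocommit=True)
cached_ids = set(record_db.keys())  # one pass over the keys; membership checks are then O(1)

def get_record(record_id, fetch_from_server):
    """Return a cached record if present, otherwise fetch it and add it to the cache."""
    key = str(record_id)
    if key in cached_ids:
        return record_db[key]
    record = fetch_from_server(record_id)  # placeholder for the actual portal call
    record_db[key] = record
    cached_ids.add(key)
    return record
```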
The big PR is up for testing: #802. Give it a shot and let me know what you think / how it works.
This issue is to sketch out some ideas and start a discussion related to retrieving and saving datasets. This follows from some prior discussion during working group meetings.
Local caching of records:
When accessing records from the archive, it would be very helpful to be able to store this data locally in a cache. It seems like this could come in two distinct flavors:
Automatic caching.
I'm looking at the current source code and there appears to be some framework already in place (but maybe not yet implemented?) that relies upon DBM for the automatic caching. If implemented, this would allow QCPortal to check the local cache to see if a given record from a specified server has already been retrieved, and if so, use the local version. This would certainly be very beneficial since it would mean that for many users, rerunning a python script or restarting a notebook kernel would not require re-downloading data. However, the actual performance will depend upon the amount of memory allocated to the cache and the size of a given dataset a user is working with.
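To illustrate the idea, a minimal sketch of that check-the-cache-first pattern (the cache file, key scheme, and exact client call are assumptions here, not the framework in the source):

```python
import dbm
import pickle

def get_record_cached(client, record_id, cache_path="qcportal_record_cache"):
    """Return a record from the local cache if present; otherwise fetch it and cache it."""
    key = str(record_id).encode()               # a real cache would also key on the server address
    with dbm.open(cache_path, "c") as cache:    # "c": open read/write, creating the db if needed
        if key in cache:
            return pickle.loads(cache[key])
        record = client.get_records(record_id)  # fetch a single record from the server
        cache[key] = pickle.dumps(record)
        return record
```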
User-defined caching.
This would provide the same basic functionality as the automatic caching, but would allow a user to define the location to store a database, where by default the cache does not have a maximum size limit. This would be beneficial to users that are working with, say, entire datasets. For example, when working with the QM9 dataset, I would like to download the records only once and be able to store them locally for easy access later; I don't want to have to worry about the dataset records being purged (due to downloading other data from QCArchive), or the dataset simply being larger than the default memory allocation. In my own work, I've implemented a simple wrapper around the calls to QCPortal where each record is saved into a sqlitedict database, and this has been very helpful, especially in cases where I lose connection to the server.
Ability to download entire datasets:
Some of the datasets in the older version included HDF5 files (that could be downloaded either via the portal or from Zenodo). This allowed an entire dataset to be downloaded very efficiently. As an example, it would take about 5 minutes to download QM9 in the HDF5 format (~160 MB when gzipped) for ~133K records; fetching these records one at a time (using the new code) took > 12 hours. Having a way to download an entire dataset in one file would be very helpful.
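For comparison, working with one of those single-file HDF5 downloads is roughly this simple (the group and field names below are only indicative of the general layout, not the exact schema):

```python
import h5py

# Iterate over a whole dataset from a single local HDF5 file instead of
# fetching ~133K records one at a time.
with h5py.File("qm9.hdf5", "r") as f:
    for name, group in f.items():
        geometry = group["geometry"][()]  # per-record arrays stored as HDF5 datasets
        energy = group["energy"][()]
```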