SQLite is back #155

sapetnioc · 2023-10-25T16:07:57Z

sapetnioc
Oct 25, 2023
Maintainer

We have had so many problems in BrainVISA when using SQLite database in parallel that we decided to stop using it for Capsul. So, I looked for a server-less solution to store/retrieve information in a single file using several processes or threads running in parallel. These tests were limited to key/value information but a solution offering more data structure as well a a kind of query language would be much appreciated. I looked at several technologies: HDF5, FUSE, lmdb, monetdbe and apsw. I finally wrote a simple test code whose goal is to write 40000 Python dictionaries of various size in the file using several parallel workers. Workers coordination (to avoid two workers writing the same value) is done using the "database" technology in order to assess that atomic read and write are possible. I had run the test using threads or processes and using both local file system or Neurospin's distributed file system.

It finally happens that the good old built-in sqlite3 Python module is a very good backend for this test if we take good care of transaction management. I think we are not using sqlite3 module the right way in BrainVISA and that explains the problems we have. I also believe that SQLite is still the first candidate for storing, modifying and querying data using file system. To check that SQLite is good when properly used, I will probably implement a new SQLite version of the Capsul v3 database as an alternative to Redis. If it is a viable alternative, we would be able to keep using SQLite wherever we need a file system based server-less database system.

denisri · 2023-10-30T09:31:29Z

denisri
Oct 30, 2023
Maintainer

Interesting - so there is actually a working locking/transaction mechanism in SQlite in a serverless network filesystem ?
You will have to show us how it should be used, and what we did not do well in BrainVisa/Axon...?
Otherwise it's a good news: sqlite is lightweight, portable and rather standard.

0 replies

ylep · 2023-11-15T14:59:27Z

ylep
Nov 15, 2023
Maintainer

Considering that we are using shared filesystems for virtually all our data storage internally, we should be extra careful to choose solutions that works over NFS. I am afraid that sqlite is not appropriate, and in fact, there is probably no file-based database engine that can be, considering that file locking is the only mechanism available to ensure the atomicity of transactions, and guarantee the integrity of the database under concurrent writes.

From the sqlite FAQ https://www.sqlite.org/faq.html#q5:

SQLite uses reader/writer locks to control access to the database. [...] use caution: this locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations. You should avoid putting SQLite database files on NFS if multiple processes might try to access the file at the same time.

It seems to me that a client/server solution is the only reliable way to support remote filesystems. Depending of the use-case, it may be a viable solution or we may need to rethink the design... I am open to ☕ discussions on that topic :-)

0 replies

sapetnioc · 2023-11-15T16:01:52Z

sapetnioc
Nov 15, 2023
Maintainer Author

I would like to have more details about what is meant by "fcntl() is broken". This 20 years old message gives interesting information : https://sqlite-users.sqlite.narkive.com/5UmR4gG1/sqlite-unix-file-locking-is-badly-broken

I am not sure that file locking over NFS is still broken. I red some comments saying that recent Systems (2003+) should have trusted NFS locks. But yet nothing more than sporadic comments.

I believe that it will not be so easy to find a reliable backend storing data in file system. In Capsul, we are using Redis but it is not a good alternative for huge databases since everything is put in memory (and can be sync in a single file). When using a server, if it is not a permanent server with already configured accounts and authorizations, we have to deal with security issues. Ideally, we would like to rely on the access rights of the file system but it is not easy to securely replicate that in a temporary server started by a user.

Therefore, I have in mind to use populse_db as an API between single file storage and multiprocessing Python code. To date there is only a SQLite backend but I already done a Postgresql one in earlier versions (Postgresql is not a good alternative to store several small databases in various directories, it can create a lot of files; more than direct strorage in json files). It is fairly easy to create a new backend thanks to the good test coverage. I plan to extend the API to make it even simpler to perform json-based storage/retreive without the need of any schema. It will be possible to deal with a kind of simple schema if storage or querying optimization is necessary. But querying will be very limited anyway.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQLite is back #155

{{title}}

Replies: 3 comments

{{title}}

{{title}}

{{title}}

Select a reply

SQLite is back #155

sapetnioc Oct 25, 2023 Maintainer

Replies: 3 comments

denisri Oct 30, 2023 Maintainer

ylep Nov 15, 2023 Maintainer

sapetnioc Nov 15, 2023 Maintainer Author

sapetnioc
Oct 25, 2023
Maintainer

denisri
Oct 30, 2023
Maintainer

ylep
Nov 15, 2023
Maintainer

sapetnioc
Nov 15, 2023
Maintainer Author