Continuation of #166.

I have decided on a different approach for storing raw message data with SQLite. Messages will be stored in bundles of, say, 100 messages per bundle, and the entire bundle will be compressed at once. This should lead to a massive reduction in redundancy, and the result could even be smaller than the current database format despite holding a lot more data.

This will be a completely separate database format, and possibly a separate app for now. Migrating between database formats might come later.

I will write separate comments for each part of the design.
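As a rough sanity check of the redundancy claim, the sketch below compares compressing each message individually against compressing a whole bundle at once. All field names and values here are made up for illustration; nothing about the real message shape or bundle size is implied.

```python
import json
import zlib

# Hypothetical batch of 100 structurally similar raw message objects.
messages = [
    {
        "id": str(1000 + i),
        "channel_id": "42",
        "author": {"id": "7", "username": "example"},
        "content": f"message number {i}",
        "timestamp": f"2023-01-01T00:00:{i % 60:02d}+00:00",
    }
    for i in range(100)
]

# Compressing each message on its own leaves the structure that is shared
# between messages uncompressed; compressing the bundle lets zlib exploit it.
individual = sum(len(zlib.compress(json.dumps(m).encode())) for m in messages)
bundled = len(zlib.compress(json.dumps(messages).encode()))
print(f"individual: {individual} bytes, bundled: {bundled} bytes")
```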
Storing Messages

The tracking endpoint will receive raw JSON messages, extract basic metadata, and push the messages into a queue. A background thread processes the queue, figures out which messages are new and which are edited, and periodically packs, compresses, and stores them in the database. The basic storage design comprises two tables: one that stores the compressed packs, and one that indexes individual messages and maps them to their pack.

The big question is how the packing should work. One option is to wait for either a certain number of messages or a certain amount of time, and then simply create a pack from those, but that could lead to fragmentation and bad data locality. The expectation is that messages in the same channel that were posted around the same time will be accessed together, so they should be in the same pack. A more advanced approach could search existing packs and insert new messages into them, but that would add a lot of complexity (especially if multiple instances were writing to the same database). The best balance might be to only allow new packs to be created while processing the queue, but have a "vacuum" process that could be initiated manually (or automatically) to find bad packs and optimize them. This would require a lock either over the entire database, or, if a pack was for example restricted to one channel, over all packs for that channel.

There also needs to be a way to tell whether the database already has the most recent version of a message. Storing the timestamp of the last edit and making it part of the primary key would allow keeping multiple versions of a message, which would also solve another long-standing issue.

To summarize, this is how I want the initial implementation to work (a code sketch follows the list):

- The tracking endpoint receives raw JSON messages, extracts basic metadata, and pushes them into a queue.
- A background thread processes the queue and determines which messages are new and which are edited versions.
- Once either a message count or a time threshold is reached, the queued messages are packed, compressed, and stored as a new pack.
- The last-edit timestamp is part of the message primary key, so multiple versions of a message can be kept.
- New packs are only created while processing the queue; existing packs are never modified during normal operation.
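A minimal sketch of how this could fit together, assuming a two-table layout and zlib compression. The table names, message fields ("id", "edited_at") and thresholds are all hypothetical, not the actual implementation:

```python
import json
import sqlite3
import zlib

SCHEMA = """
CREATE TABLE IF NOT EXISTS packs (
    pack_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    channel_id INTEGER NOT NULL,
    data       BLOB NOT NULL            -- zlib-compressed JSON array of raw messages
);
CREATE TABLE IF NOT EXISTS messages (
    message_id INTEGER NOT NULL,
    edited_at  INTEGER NOT NULL,        -- last-edit timestamp, part of the primary key
    pack_id    INTEGER NOT NULL REFERENCES packs(pack_id),
    PRIMARY KEY (message_id, edited_at) -- one row per version of a message
);
"""

MAX_MESSAGES = 100    # flush once the batch reaches this size...
MAX_AGE_SECONDS = 30  # ...or once the oldest queued message is this old

def open_db(path: str) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def flush_pack(db: sqlite3.Connection, channel_id: int, batch: list[dict]) -> None:
    """Compress one batch of raw JSON messages into a new pack and index it."""
    blob = zlib.compress(json.dumps(batch).encode("utf-8"))
    pack_id = db.execute(
        "INSERT INTO packs (channel_id, data) VALUES (?, ?)", (channel_id, blob)
    ).lastrowid
    # INSERT OR IGNORE: a (message_id, edited_at) pair that already exists is not
    # new, while an edit arrives with a newer edited_at and gets its own row.
    db.executemany(
        "INSERT OR IGNORE INTO messages (message_id, edited_at, pack_id) VALUES (?, ?, ?)",
        [(int(m["id"]), int(m.get("edited_at", 0)), pack_id) for m in batch],
    )
    db.commit()
```

Grouping queued messages per channel before calling the flush would give the per-channel data locality described above.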
Once this works, I will analyze efficiency and fragmentation. A vacuum process can be implemented later.
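For reference, a vacuum pass over a single channel could look roughly like this, continuing the hypothetical schema above. A real version would keep pack sizes bounded instead of merging everything into one pack, and would hold the per-channel lock mentioned earlier:

```python
import json
import sqlite3
import zlib

def vacuum_channel(db: sqlite3.Connection, channel_id: int) -> None:
    """Merge all packs of one channel into a single pack (per-channel lock assumed)."""
    rows = db.execute(
        "SELECT pack_id, data FROM packs WHERE channel_id = ? ORDER BY pack_id",
        (channel_id,),
    ).fetchall()
    if len(rows) < 2:
        return  # nothing to merge

    merged: list[dict] = []
    for _, blob in rows:
        merged.extend(json.loads(zlib.decompress(blob)))

    # Write the merged pack, repoint the message index, drop the old packs.
    new_pack = db.execute(
        "INSERT INTO packs (channel_id, data) VALUES (?, ?)",
        (channel_id, zlib.compress(json.dumps(merged).encode("utf-8"))),
    ).lastrowid
    old_ids = [r[0] for r in rows]
    placeholders = ",".join("?" * len(old_ids))
    db.execute(
        f"UPDATE messages SET pack_id = ? WHERE pack_id IN ({placeholders})",
        [new_pack, *old_ids],
    )
    db.execute(f"DELETE FROM packs WHERE pack_id IN ({placeholders})", old_ids)
    db.commit()
```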
Database Imports / Exports

Since the packing and compression will make it impossible to use SQL directly on the database file to parse the message data, I would like to support configurable exports. At minimum, there should be a way to import and export an uncompressed SQLite database, which would have a single table with the raw JSON data. Modern SQLite supports JSON operators, so that should be flexible enough for any kind of data manipulation and analysis (see the query sketch below). In the future, additional formats could be supported. Ideas:
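For example, assuming the export produces a single messages table with a json column holding each raw message (the table and column names are illustrative, and the $.author.username path assumes the raw message layout keeps author data nested), per-author message counts could be computed like this:

```python
import sqlite3

db = sqlite3.connect("export.sqlite")  # hypothetical export file

# Count messages per author using SQLite's built-in json_extract().
rows = db.execute(
    """
    SELECT json_extract(json, '$.author.username') AS author, COUNT(*) AS n
    FROM messages
    GROUP BY author
    ORDER BY n DESC
    """
).fetchall()
for author, count in rows:
    print(author, count)
```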
- Filters, for people who don't want to export the entire database, since the uncompressed exports will be much larger (a rough export sketch follows this list).
- Metadata about servers, channels, etc., which the database will need to store so the UI can display proper names and not just IDs.
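A channel filter during export could look roughly like the following, reusing the hypothetical pack schema from the storage sketch; the single-table export layout matches the query example above:

```python
import json
import sqlite3
import zlib

def export_channel(db: sqlite3.Connection, out_path: str, channel_id: int) -> None:
    """Decompress all packs of one channel into an uncompressed SQLite export."""
    out = sqlite3.connect(out_path)
    out.execute("CREATE TABLE IF NOT EXISTS messages (json TEXT NOT NULL)")
    for (blob,) in db.execute(
        "SELECT data FROM packs WHERE channel_id = ?", (channel_id,)
    ):
        out.executemany(
            "INSERT INTO messages (json) VALUES (?)",
            [(json.dumps(m),) for m in json.loads(zlib.decompress(blob))],
        )
    out.commit()
    out.close()
```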