Chunked mapping for storages #26

Open

kirillt opened this issue Sep 6, 2022 · 2 comments

@kirillt
Member

kirillt commented Sep 6, 2022

A storage is a subfolder of .ark, e.g. .ark/index or .ark/tags. It represents a mapping from ResourceId to some T.

For .ark/index, the T is Path. And for .ark/tags, the T is Set<String>. Each entry can be represented by a file .ark/<storage>/<resource_id> with a single line of content. This kind of storage should give us the least amount of read/write conflicts, but it is not very efficient for syncing and reading. Old entries could be batched into bigger multi-line files (chunks).

So, chunked storage would be a set of files like this:

.ark/<storage_name>/<batch_id1>
|-- <resource_id1> -> <value1>
|-- <resource_id2> -> <value2>

.ark/<storage_name>/<resource_id3>
|-- <value3>

.ark/<storage_name>/<batch_id2>
|-- <resource_id4> -> <value4>
|-- <resource_id5> -> <value5>
|-- <resource_id6> -> <value6>

.ark/<storage_name>/<resource_id7>
|-- <value7>
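A minimal sketch of how such a directory could be folded back into a single map, assuming a Rust implementation and the simple `<resource_id> -> <value>` line format shown above (the function name `load_chunked_storage` and the parsing details are hypothetical, not part of the existing code):

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical loader: reads every file (chunk) in a storage folder
/// and merges its entries into one map. A single-entry file is keyed
/// by its file name; a multi-entry chunk uses `<resource_id> -> <value>`
/// lines, as in the layout above.
fn load_chunked_storage(storage: &Path) -> io::Result<HashMap<String, String>> {
    let mut mapping = HashMap::new();

    for entry in fs::read_dir(storage)? {
        let path = entry?.path();
        let content = fs::read_to_string(&path)?;
        let lines: Vec<&str> = content.lines().collect();

        match lines.as_slice() {
            // A single line without a separator is a one-entry file:
            // the resource id is the file name, the line is the value.
            [single] if !single.contains("->") => {
                let id = path.file_name().unwrap().to_string_lossy().into_owned();
                mapping.insert(id, single.trim().to_string());
            }
            // Otherwise every line is a `<resource_id> -> <value>` pair.
            _ => {
                for line in lines {
                    if let Some((id, value)) = line.split_once("->") {
                        mapping.insert(id.trim().to_string(), value.trim().to_string());
                    }
                }
            }
        }
    }

    Ok(mapping)
}
```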
@kirillt
Member Author

kirillt commented Nov 4, 2022

It should be possible to finely tune each storage according to the expected size of its values. Keys are always expected to be ResourceId. Values could range from i8 for scores, to Set<String> for tags, to Map<String, String> for metadata.

It should be rational to keep scores and tags in a single file, one line per map entry. The only motivation to use chunked storage here is to reduce the amount of conflicts. The "line-per-entry" case can still be implemented using chunked storage with a setting like chunk_size = 10000.

For metadata, each entry should generate a separate file, which should be achievable using chunk_size = 1.

We could also use something in between for both. E.g. a tags storage with chunk_size = 100 would be split into multiple files which are less likely to conflict due to writes on different devices. Conversely, a metadata storage with chunk_size = 10 would give us 10 times fewer files with perhaps almost the same conflict frequency, but easier synchronization across devices (this needs to be verified, of course).
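A minimal sketch of how chunk_size could drive writing, in Rust; the function name `write_chunked_storage`, the sequential batch ids, and the `batch_<n>` file names are assumptions for illustration only:

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical writer: splits a mapping into files of at most
/// `chunk_size` entries. With `chunk_size = 1` every entry becomes
/// its own file; with a large `chunk_size` everything lands in one file.
fn write_chunked_storage(
    storage: &Path,
    mapping: &BTreeMap<String, String>,
    chunk_size: usize,
) -> io::Result<()> {
    fs::create_dir_all(storage)?;

    let entries: Vec<_> = mapping.iter().collect();

    for (batch_id, chunk) in entries.chunks(chunk_size).enumerate() {
        if let [(id, value)] = chunk {
            // Single-entry chunk: the resource id is the file name.
            fs::write(storage.join(id), format!("{value}\n"))?;
        } else {
            // Multi-entry chunk: one `<resource_id> -> <value>` pair per line;
            // batch ids are just sequential numbers here.
            let body: String = chunk
                .iter()
                .map(|(id, value)| format!("{id} -> {value}\n"))
                .collect();
            fs::write(storage.join(format!("batch_{batch_id}")), body)?;
        }
    }

    Ok(())
}
```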

@kirillt
Member Author

kirillt commented Nov 4, 2022

Modification timestamps of all chunks should be taken into account. Probably, each value should be tracked by the chunk it came from, in order to invalidate it when that chunk is updated from outside.
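A minimal sketch (Rust, hypothetical structure and names) of remembering which chunk each value was loaded from, together with the modification time observed at load, so that entries can be invalidated when their chunk changes on disk:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;
use std::time::SystemTime;

/// Hypothetical in-memory view of a chunked storage: every value remembers
/// the chunk file it was loaded from and the mtime observed at load time.
struct TrackedValue {
    value: String,
    chunk: PathBuf,
    loaded_at: SystemTime,
}

/// Returns the resource ids whose backing chunk has been modified on disk
/// since the value was loaded; those entries need to be re-read.
fn stale_entries(view: &HashMap<String, TrackedValue>) -> io::Result<Vec<String>> {
    let mut stale = Vec::new();
    for (id, tracked) in view {
        let mtime = fs::metadata(&tracked.chunk)?.modified()?;
        if mtime > tracked.loaded_at {
            stale.push(id.clone());
        }
    }
    Ok(stale)
}
```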
