Chunked mapping for storages #26

Open

kirillt opened this issue Sep 6, 2022 · 2 comments

@kirillt
Member

kirillt commented Sep 6, 2022

A storage is a subfolder of .ark, e.g. .ark/index or .ark/tags. It represents a mapping from ResourceId to some T.

For .ark/index, the T is Path. And for .ark/tags, the T is Set<String>. Each entry can be represented by a file .ark/<storage>/<resource_id> with a single line of content. This kind of storage should give us the least amount of read/write conflicts, but it is not very efficient for syncing and reading. Old entries could be batched into bigger multi-line files (chunks).

So, chunked storage would be a set of files like this:

.ark/<storage_name>/<batch_id1>
|-- <resource_id1> -> <value1>
|-- <resource_id2> -> <value2>

.ark/<storage_name>/<resource_id3>
|-- <value3>

.ark/<storage_name>/<batch_id2>
|-- <resource_id4> -> <value4>
|-- <resource_id5> -> <value5>
|-- <resource_id6> -> <value6>

.ark/<storage_name>/<resource_id7>
|-- <value7>
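A minimal sketch of how such a directory could be folded back into a single map, assuming a Rust implementation and the simple `<resource_id> -> <value>` line format shown above (the function name `load_chunked_storage` and the parsing details are hypothetical, not part of the existing code):

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical loader: reads every file (chunk) in a storage folder
/// and merges its entries into one map. A single-entry file is keyed
/// by its file name; a multi-entry chunk uses `<resource_id> -> <value>`
/// lines, as in the layout above.
fn load_chunked_storage(storage: &Path) -> io::Result<HashMap<String, String>> {
    let mut mapping = HashMap::new();

    for entry in fs::read_dir(storage)? {
        let path = entry?.path();
        let content = fs::read_to_string(&path)?;
        let lines: Vec<&str> = content.lines().collect();

        match lines.as_slice() {
            // A single line without a separator is a one-entry file:
            // the resource id is the file name, the line is the value.
            [single] if !single.contains("->") => {
                let id = path.file_name().unwrap().to_string_lossy().into_owned();
                mapping.insert(id, single.trim().to_string());
            }
            // Otherwise every line is a `<resource_id> -> <value>` pair.
            _ => {
                for line in lines {
                    if let Some((id, value)) = line.split_once("->") {
                        mapping.insert(id.trim().to_string(), value.trim().to_string());
                    }
                }
            }
        }
    }

    Ok(mapping)
}
```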
@kirillt
Member Author

kirillt commented Nov 4, 2022

It should be possible to finely tune each storage according to the expected size of its values. Keys are always expected to be ResourceId. Values could range from i8 for scores, to Set<String> for tags, to Map<String, String> for metadata.

It should be rational to keep scores and tags in a single file, one line per map entry. The only motivation to use chunked storage here is to reduce the amount of conflicts. The "line-per-entry" case can still be implemented using chunked storage with a setting like chunk_size = 10000.

For metadata, each entry should generate a separate file, which should be achievable using chunk_size = 1.

We could also use something in between for both. E.g. a tags storage with chunk_size = 100 would be split into multiple files which are less likely to conflict due to writes on different devices. Conversely, a metadata storage with chunk_size = 10 would give us 10 times fewer files with perhaps almost the same conflict frequency, but easier synchronization across devices (this needs to be verified, of course).
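A minimal sketch of how chunk_size could drive writing, in Rust; the function name `write_chunked_storage`, the sequential batch ids, and the `batch_<n>` file names are assumptions for illustration only:

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical writer: splits a mapping into files of at most
/// `chunk_size` entries. With `chunk_size = 1` every entry becomes
/// its own file; with a large `chunk_size` everything lands in one file.
fn write_chunked_storage(
    storage: &Path,
    mapping: &BTreeMap<String, String>,
    chunk_size: usize,
) -> io::Result<()> {
    fs::create_dir_all(storage)?;

    let entries: Vec<_> = mapping.iter().collect();

    for (batch_id, chunk) in entries.chunks(chunk_size).enumerate() {
        if let [(id, value)] = chunk {
            // Single-entry chunk: the resource id is the file name.
            fs::write(storage.join(id), format!("{value}\n"))?;
        } else {
            // Multi-entry chunk: one `<resource_id> -> <value>` pair per line;
            // batch ids are just sequential numbers here.
            let body: String = chunk
                .iter()
                .map(|(id, value)| format!("{id} -> {value}\n"))
                .collect();
            fs::write(storage.join(format!("batch_{batch_id}")), body)?;
        }
    }

    Ok(())
}
```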

@kirillt
Member Author

kirillt commented Nov 4, 2022

Modification timestamps of all chunks should be taken into account. Probably, each value should be tracked by the chunk it came from, in order to invalidate it when that chunk is updated from outside.
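A minimal sketch (Rust, hypothetical structure and names) of remembering which chunk each value was loaded from, together with the modification time observed at load, so that entries can be invalidated when their chunk changes on disk:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;
use std::time::SystemTime;

/// Hypothetical in-memory view of a chunked storage: every value remembers
/// the chunk file it was loaded from and the mtime observed at load time.
struct TrackedValue {
    value: String,
    chunk: PathBuf,
    loaded_at: SystemTime,
}

/// Returns the resource ids whose backing chunk has been modified on disk
/// since the value was loaded; those entries need to be re-read.
fn stale_entries(view: &HashMap<String, TrackedValue>) -> io::Result<Vec<String>> {
    let mut stale = Vec::new();
    for (id, tracked) in view {
        let mtime = fs::metadata(&tracked.chunk)?.modified()?;
        if mtime > tracked.loaded_at {
            stale.push(id.clone());
        }
    }
    Ok(stale)
}
```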
