The following is the outcome of a short discussion between @GavinMendelGleason and @matko regarding ways to speed up TerminusDB ingest.
Before anything is implemented though, we first need to get benchmarks back in order so we know what any improvement actually did for us.
Improvements in store
On-demand indexing
We currently build all indexes during layer build. This is often not actually required, and it certainly isn't required before we can say the data is safely stored. It'd be good to have some way to postpone index building, either until an index is first needed or until it's explicitly requested. This would require quite a refactor of our querying though, as some operations would be either impossible or prohibitively slow without indexes.
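As a rough illustration of the idea (not the actual store code), here is a minimal Rust sketch of an on-demand index: the layer stores its triples eagerly, but the subject index is only built lazily on first lookup. All names here are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical triple and layer types, purely for illustration.
struct Triple { subject: u64, predicate: u64, object: u64 }

struct Layer {
    triples: Vec<Triple>,
    // Built on first use instead of at layer-build time.
    subject_index: OnceLock<HashMap<u64, Vec<usize>>>,
}

impl Layer {
    fn new(triples: Vec<Triple>) -> Self {
        Layer { triples, subject_index: OnceLock::new() }
    }

    // The first call pays the cost of building the index; later calls reuse it.
    fn triples_for_subject(&self, s: u64) -> impl Iterator<Item = &Triple> {
        let index = self.subject_index.get_or_init(|| {
            let mut idx: HashMap<u64, Vec<usize>> = HashMap::new();
            for (i, t) in self.triples.iter().enumerate() {
                idx.entry(t.subject).or_default().push(i);
            }
            idx
        });
        index.get(&s).into_iter().flatten().map(move |&i| &self.triples[i])
    }
}

fn main() {
    let layer = Layer::new(vec![
        Triple { subject: 1, predicate: 10, object: 100 },
        Triple { subject: 1, predicate: 11, object: 101 },
        Triple { subject: 2, predicate: 10, object: 102 },
    ]);
    // Storing the layer could already be reported as done by this
    // point; the index only gets built here, on first query.
    assert_eq!(layer.triples_for_subject(1).count(), 2);
}
```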
Build in-memory and move to disk only when we decide to keep
Currently we build layers on disk, then we do a schema check on them, and then decide whether to keep them around or not. It'd be better if we only wrote layers to disk after we're sure we want to keep them.
The plan is to build everything in memory first and promote it to persistent storage only once we're sure we want to keep it. Even then, there may be cases where we figure out later that we don't want to keep it after all, but this will still be faster for the majority of cases.
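A minimal sketch of that promote-on-keep flow, assuming the layer can first be serialized to an in-memory buffer; the write-to-temp-then-rename step keeps the on-disk state consistent. The function names and the schema_check hook are illustrative, not the real store API.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Hypothetical sketch: the layer is built and checked entirely in
// memory; a rejected layer never touches disk.
fn commit_layer(
    dir: &Path,
    name: &str,
    layer_bytes: &[u8],
    schema_check: impl Fn(&[u8]) -> bool,
) -> std::io::Result<bool> {
    if !schema_check(layer_bytes) {
        // Schema check failed: discard without any disk I/O.
        return Ok(false);
    }
    let tmp = dir.join(format!("{name}.tmp"));
    let mut f = fs::File::create(&tmp)?;
    f.write_all(layer_bytes)?;
    f.sync_all()?; // ensure the bytes hit disk before the rename
    fs::rename(&tmp, dir.join(name))?; // atomic on POSIX filesystems
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    // Toy check: keep any non-empty layer.
    let kept = commit_layer(&dir, "layer.bin", b"triples...", |b| !b.is_empty())?;
    println!("layer kept: {kept}");
    Ok(())
}
```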
Label read lock
Whenever we open a descriptor, a label is read from disk to get the latest state of its corresponding metadata graph. While doing this, a read lock is acquired. This read lock may actually be unnecessary, as label files are so small that reading them is probably effectively atomic on most platforms, especially Linux.
Skipping the label read lock may speed things up, especially on an NFS cluster, but we need a way to properly measure this before trying it.
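As an illustration of what a lock-free label read could look like, the sketch below assumes writers replace label files via atomic rename and treats a malformed read as a race to be retried. This is a hypothetical shape, not the current implementation, and as noted above it should be measured before adoption.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical sketch: read a label without taking a read lock,
// relying on writers replacing the file via atomic rename. A torn
// read would show up as a malformed label, so we retry instead of
// locking.
fn read_label_unlocked(path: &Path) -> io::Result<String> {
    for _ in 0..3 {
        let contents = fs::read_to_string(path)?;
        let trimmed = contents.trim();
        // A label is a single short line; anything else suggests we
        // raced a writer, so try again.
        if !trimmed.is_empty() && !trimmed.contains('\n') {
            return Ok(trimmed.to_string());
        }
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "torn or empty label"))
}
```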
Improvements in schema checking
Do schema checking on JSON objects before inserting
If we knew for sure that all JSON objects were valid, our subsequent schema check could be much cheaper. We'd have to split our schema checking into two parts: one that runs on JSON objects and can tell whether a JSON object is valid according to our schema, and one that runs on the data layer after those objects have been inserted and performs all the referential integrity checks.
As an added bonus, this refactor would give us a component that could possibly be used to validate JSON objects in other databases as well.
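A minimal sketch of what the split could look like, using serde_json (an assumed dependency here) and a deliberately tiny stand-in for the schema; the real schema language is much richer, and the "@ref" marker is made up for this example.

```rust
use serde_json::Value;
use std::collections::HashSet;

// Hypothetical, heavily simplified schema: just a list of required fields.
struct Schema { required_fields: Vec<String> }

// Phase 1: cheap structural check, runs per object before insertion,
// with no database access.
fn structurally_valid(schema: &Schema, doc: &Value) -> bool {
    let Some(obj) = doc.as_object() else { return false };
    schema.required_fields.iter().all(|f| obj.contains_key(f))
}

// Phase 2: referential integrity, runs on the data layer after insert;
// here the layer is modeled as a set of known document ids.
fn references_valid(doc: &Value, known_ids: &HashSet<String>) -> bool {
    match doc {
        Value::Object(map) => map.iter().all(|(k, v)| {
            if k == "@ref" {
                v.as_str().map_or(false, |id| known_ids.contains(id))
            } else {
                references_valid(v, known_ids)
            }
        }),
        Value::Array(items) => items.iter().all(|v| references_valid(v, known_ids)),
        _ => true,
    }
}

fn main() {
    let schema = Schema { required_fields: vec!["@type".to_string()] };
    let doc = serde_json::json!({ "@type": "Person", "friend": { "@ref": "Person/jane" } });
    let known: HashSet<String> = ["Person/jane".to_string()].into();
    assert!(structurally_valid(&schema, &doc)); // before insert
    assert!(references_valid(&doc, &known));    // after insert
}
```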
Layer pinning
Layers currently leave memory when their atom is garbage collected by SWI-Prolog. If we keep the atom around, we can prevent the layer from being removed from memory. This would be beneficial for frequently used layers such as our schema graphs and the system database.
This can probably be done entirely in Prolog.
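The suggestion above is to do this in Prolog by keeping the atom alive; purely to illustrate the underlying idea in Rust terms, holding an extra reference-counted handle to a layer keeps it from being dropped. All names here are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical layer type, standing in for a loaded layer.
struct Layer { /* triples, indexes, ... */ }

// Pinning a layer means holding an extra Arc to it, so it stays in
// memory no matter what happens to its other users.
struct PinRegistry {
    pinned: Mutex<HashMap<String, Arc<Layer>>>,
}

impl PinRegistry {
    fn new() -> Self {
        PinRegistry { pinned: Mutex::new(HashMap::new()) }
    }

    // Pin frequently used layers, e.g. schema graphs or the system
    // database, until explicitly unpinned.
    fn pin(&self, name: &str, layer: Arc<Layer>) {
        self.pinned.lock().unwrap().insert(name.to_string(), layer);
    }

    fn unpin(&self, name: &str) {
        self.pinned.lock().unwrap().remove(name);
    }
}
```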
Transaction retrying without throwing away completed work
When a transaction fails, we currently redo the entire thing. There are, however, various circumstances in which a transaction fails but much of the work already done could be reused. This happens when one of the metadata graphs changed in a way that does not affect our commit, for example when another repository or branch changed, or when the metadata graph was optimized.
We should find a way to make transaction retrying detect these circumstances and do the minimum amount of work it needs to resume.
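To illustrate the shape this could take, here is a hypothetical retry loop with the conflict classification stubbed out: an unrelated metadata change only re-parents the already-built layer, while a real conflict triggers a full rebuild. None of these names come from the actual codebase.

```rust
// Hypothetical: a layer built by the transaction body, remembering
// which metadata head it was built against.
struct BuiltLayer { based_on: u64 /* , triples, indexes, ... */ }

enum CommitError {
    // The metadata graph advanced, but not in a way that affects this
    // commit (another repository or branch changed, or the graph was
    // optimized).
    Unrelated { new_head: u64 },
    // The data we read or wrote actually changed underneath us.
    Conflict,
}

fn commit_with_retry<F, R>(mut layer: BuiltLayer, try_commit: F, rebuild: R) -> u64
where
    F: Fn(&BuiltLayer) -> Result<u64, CommitError>,
    R: Fn() -> BuiltLayer,
{
    loop {
        match try_commit(&layer) {
            Ok(new_head) => return new_head,
            Err(CommitError::Unrelated { new_head }) => {
                // Reuse all built work: only re-point the parent.
                layer.based_on = new_head;
            }
            Err(CommitError::Conflict) => {
                // Only a real conflict pays for a full redo.
                layer = rebuild();
            }
        }
    }
}
```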