The following is the outcome of a short discussion between @GavinMendelGleason and @matko regarding ways to speed up TerminusDB ingest.
Before anything is implemented though, we first need to get benchmarks back in order so we know what any improvement actually did for us.
Improvements in store
On-demand indexing
We currently build all indexes during layer build. This is often not actually required, and it certainly isn't required before we can say the data is safely stored. It'd be good to have some way to postpone index building, either until an index is first needed or until it's explicitly requested. This would require quite a refactor of our querying though, as some operations would be either impossible or prohibitively slow without indexes.
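As a rough illustration of the idea (not the actual store code), here is a minimal Rust sketch of an on-demand index: the layer stores its triples eagerly, but the subject index is only built lazily on first lookup. All names here are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical triple and layer types, purely for illustration.
struct Triple { subject: u64, predicate: u64, object: u64 }

struct Layer {
    triples: Vec<Triple>,
    // Built on first use instead of at layer-build time.
    subject_index: OnceLock<HashMap<u64, Vec<usize>>>,
}

impl Layer {
    fn new(triples: Vec<Triple>) -> Self {
        Layer { triples, subject_index: OnceLock::new() }
    }

    // The first call pays the cost of building the index; later calls reuse it.
    fn triples_for_subject(&self, s: u64) -> impl Iterator<Item = &Triple> {
        let index = self.subject_index.get_or_init(|| {
            let mut idx: HashMap<u64, Vec<usize>> = HashMap::new();
            for (i, t) in self.triples.iter().enumerate() {
                idx.entry(t.subject).or_default().push(i);
            }
            idx
        });
        index.get(&s).into_iter().flatten().map(move |&i| &self.triples[i])
    }
}

fn main() {
    let layer = Layer::new(vec![
        Triple { subject: 1, predicate: 10, object: 100 },
        Triple { subject: 1, predicate: 11, object: 101 },
        Triple { subject: 2, predicate: 10, object: 102 },
    ]);
    // Storing the layer could already be reported as done by this
    // point; the index only gets built here, on first query.
    assert_eq!(layer.triples_for_subject(1).count(), 2);
}
```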
Build in-memory and move to disk only when we decide to keep
Currently we build layers on disk, then we do a schema check on them, and then decide whether to keep them around or not. It'd be better if we only wrote layers to disk after we're sure we want to keep them.
The plan is to build everything in memory first and promote it to persistent storage only once we're sure we want to keep it. Even then, there may be cases where we figure out later that we don't want to keep it after all, but this will still be faster for the majority of cases.
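A minimal sketch of that promote-on-keep flow, assuming the layer can first be serialized to an in-memory buffer; the write-to-temp-then-rename step keeps the on-disk state consistent. The function names and the schema_check hook are illustrative, not the real store API.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Hypothetical sketch: the layer is built and checked entirely in
// memory; a rejected layer never touches disk.
fn commit_layer(
    dir: &Path,
    name: &str,
    layer_bytes: &[u8],
    schema_check: impl Fn(&[u8]) -> bool,
) -> std::io::Result<bool> {
    if !schema_check(layer_bytes) {
        // Schema check failed: discard without any disk I/O.
        return Ok(false);
    }
    let tmp = dir.join(format!("{name}.tmp"));
    let mut f = fs::File::create(&tmp)?;
    f.write_all(layer_bytes)?;
    f.sync_all()?; // ensure the bytes hit disk before the rename
    fs::rename(&tmp, dir.join(name))?; // atomic on POSIX filesystems
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    // Toy check: keep any non-empty layer.
    let kept = commit_layer(&dir, "layer.bin", b"triples...", |b| !b.is_empty())?;
    println!("layer kept: {kept}");
    Ok(())
}
```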
Label read lock
Whenever we open a descriptor, a label is read from disk to get the latest state of its corresponding metadata graph. While doing this, a read lock is acquired. This read lock may actually be unnecessary, as label files are so small that reading them is probably effectively atomic on most platforms, especially Linux.
Skipping the label read lock may speed things up, especially on an NFS cluster, but we need a way to properly measure this before trying it.
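As an illustration of what a lock-free label read could look like, the sketch below assumes writers replace label files via atomic rename and treats a malformed read as a race to be retried. This is a hypothetical shape, not the current implementation, and as noted above it should be measured before adoption.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical sketch: read a label without taking a read lock,
// relying on writers replacing the file via atomic rename. A torn
// read would show up as a malformed label, so we retry instead of
// locking.
fn read_label_unlocked(path: &Path) -> io::Result<String> {
    for _ in 0..3 {
        let contents = fs::read_to_string(path)?;
        let trimmed = contents.trim();
        // A label is a single short line; anything else suggests we
        // raced a writer, so try again.
        if !trimmed.is_empty() && !trimmed.contains('\n') {
            return Ok(trimmed.to_string());
        }
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "torn or empty label"))
}
```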
Improvements in schema checking
Do schema checking on JSON objects before inserting
If we knew for sure that all JSON objects were valid, our subsequent schema check could be much cheaper. We'd have to split our schema checking into two parts: one that runs on JSON objects and can tell whether a JSON object is valid according to our schema, and one that runs on the data layer after those objects have been inserted and performs all the referential integrity checks.
As an added bonus, this refactor would give us a component that could possibly be used to validate JSON objects in other databases as well.
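A minimal sketch of what the split could look like, using serde_json (an assumed dependency here) and a deliberately tiny stand-in for the schema; the real schema language is much richer, and the "@ref" marker is made up for this example.

```rust
use serde_json::Value;
use std::collections::HashSet;

// Hypothetical, heavily simplified schema: just a list of required fields.
struct Schema { required_fields: Vec<String> }

// Phase 1: cheap structural check, runs per object before insertion,
// with no database access.
fn structurally_valid(schema: &Schema, doc: &Value) -> bool {
    let Some(obj) = doc.as_object() else { return false };
    schema.required_fields.iter().all(|f| obj.contains_key(f))
}

// Phase 2: referential integrity, runs on the data layer after insert;
// here the layer is modeled as a set of known document ids.
fn references_valid(doc: &Value, known_ids: &HashSet<String>) -> bool {
    match doc {
        Value::Object(map) => map.iter().all(|(k, v)| {
            if k == "@ref" {
                v.as_str().map_or(false, |id| known_ids.contains(id))
            } else {
                references_valid(v, known_ids)
            }
        }),
        Value::Array(items) => items.iter().all(|v| references_valid(v, known_ids)),
        _ => true,
    }
}

fn main() {
    let schema = Schema { required_fields: vec!["@type".to_string()] };
    let doc = serde_json::json!({ "@type": "Person", "friend": { "@ref": "Person/jane" } });
    let known: HashSet<String> = ["Person/jane".to_string()].into();
    assert!(structurally_valid(&schema, &doc)); // before insert
    assert!(references_valid(&doc, &known));    // after insert
}
```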
Layer pinning
Layers currently leave memory when their atom is garbage collected by SWI-Prolog. If we keep the atom around, we can prevent the layer from being removed from memory. This would be beneficial for frequently used layers such as our schema graphs and the system database.
This can probably be done entirely in Prolog.
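The suggestion above is to do this in Prolog by keeping the atom alive; purely to illustrate the underlying idea in Rust terms, holding an extra reference-counted handle to a layer keeps it from being dropped. All names here are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical layer type, standing in for a loaded layer.
struct Layer { /* triples, indexes, ... */ }

// Pinning a layer means holding an extra Arc to it, so it stays in
// memory no matter what happens to its other users.
struct PinRegistry {
    pinned: Mutex<HashMap<String, Arc<Layer>>>,
}

impl PinRegistry {
    fn new() -> Self {
        PinRegistry { pinned: Mutex::new(HashMap::new()) }
    }

    // Pin frequently used layers, e.g. schema graphs or the system
    // database, until explicitly unpinned.
    fn pin(&self, name: &str, layer: Arc<Layer>) {
        self.pinned.lock().unwrap().insert(name.to_string(), layer);
    }

    fn unpin(&self, name: &str) {
        self.pinned.lock().unwrap().remove(name);
    }
}
```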
Transaction retrying without throwing away completed work
When a transaction fails, we currently redo the entire thing. There are, however, various circumstances in which a transaction fails but much of the work already done could be reused. This happens when one of the metadata graphs changed in a way that does not affect our commit, for example when another repository or branch changed, or when the metadata graph was optimized.
We should find a way to make transaction retrying detect these circumstances and do the minimum amount of work it needs to resume.
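To illustrate the shape this could take, here is a hypothetical retry loop with the conflict classification stubbed out: an unrelated metadata change only re-parents the already-built layer, while a real conflict triggers a full rebuild. None of these names come from the actual codebase.

```rust
// Hypothetical: a layer built by the transaction body, remembering
// which metadata head it was built against.
struct BuiltLayer { based_on: u64 /* , triples, indexes, ... */ }

enum CommitError {
    // The metadata graph advanced, but not in a way that affects this
    // commit (another repository or branch changed, or the graph was
    // optimized).
    Unrelated { new_head: u64 },
    // The data we read or wrote actually changed underneath us.
    Conflict,
}

fn commit_with_retry<F, R>(mut layer: BuiltLayer, try_commit: F, rebuild: R) -> u64
where
    F: Fn(&BuiltLayer) -> Result<u64, CommitError>,
    R: Fn() -> BuiltLayer,
{
    loop {
        match try_commit(&layer) {
            Ok(new_head) => return new_head,
            Err(CommitError::Unrelated { new_head }) => {
                // Reuse all built work: only re-point the parent.
                layer.based_on = new_head;
            }
            Err(CommitError::Conflict) => {
                // Only a real conflict pays for a full redo.
                layer = rebuild();
            }
        }
    }
}
```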