Refactoring: get rid of `DatasetSummary` as a repository file, store these key properties in the database #983
Depends on #981, #342; partially depends on #978.
`DatasetSummary` represents a set of pre-computed, frequently requested properties about the state of a dataset. Before becoming a Node server, the original CLI application relied heavily on summaries to avoid redundant metadata chain scans.

Currently it is getting in the way of performance scaling with the wider use of databases and transactionality. Storing the same data in the database will not affect CLI behavior much, as the CLI will soon contain a workspace-scoped SQLite database as well, while for the Node server this opens a range of performance optimization opportunities.
In addition, `DatasetSummary` overlaps with the ideas around `DatasetEntry` in the `datasets` domain, and we don't need multiple mechanisms serving a similar purpose.

We should revise the design of the dataset summary stored in the ODF repo towards storing the important properties in the database and automatically updating them whenever a HEAD reference changes.
Below is the information about each property of `DatasetSummary` and where it is used:

`id` - represents the dataset identifier taken from the `Seed` block. It is highly redundant now, as we already have the id stored in the `DatasetEntry`. To be removed completely (take it from the already resolved dataset handle) from:
- `AppendDatasetMetadataBatchUseCase`
- `DatasetRepositoryLocalFs` (identifier resolution by name, should disappear after "Unify file structure in repos and workspaces" #342)

`dataset_kind` - represents whether the dataset is `Root` or `Derivative`, also taken from the `Seed` block. This is the most frequently accessed attribute in the core services. It could be stored in `DatasetEntry`, since it also never changes after dataset creation. It could also become a field in `ResolvedDataset`, since it would cost nothing to extract it there, and many contexts obtain a `ResolvedDataset` already. To refactor in:
- `ensure_expected_dataset_kind`
- `kind` (src/adapter/graphql/src/queries/datasets/dataset.rs)
- `ensure_valid_push_target`
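A minimal sketch of the `dataset_kind` idea above. The real `ResolvedDataset` and check functions live in the kamu core crates; the field and method names here are illustrative assumptions, not the actual API:

```rust
// Sketch only: names below are assumptions standing in for the real kamu types.

/// Never changes after dataset creation, so it is safe to cache in the database
/// (`DatasetEntry`) and to carry around in resolved handles.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DatasetKind {
    Root,
    Derivative,
}

/// Hypothetical shape of a `ResolvedDataset` that carries the kind,
/// so callers no longer need to read `DatasetSummary` to learn it.
pub struct ResolvedDataset {
    pub alias: String,
    pub kind: DatasetKind,
}

impl ResolvedDataset {
    /// A check like `ensure_expected_dataset_kind` becomes a cheap in-memory
    /// comparison instead of a summary file read.
    pub fn ensure_expected_kind(&self, expected: DatasetKind) -> Result<(), String> {
        if self.kind == expected {
            Ok(())
        } else {
            Err(format!(
                "Dataset '{}' is {:?}, expected {:?}",
                self.alias, self.kind, expected
            ))
        }
    }
}

fn main() {
    let ds = ResolvedDataset {
        alias: "acme/orders".to_string(),
        kind: DatasetKind::Root,
    };
    assert!(ds.ensure_expected_kind(DatasetKind::Root).is_ok());
    assert!(ds.ensure_expected_kind(DatasetKind::Derivative).is_err());
}
```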
`dependencies` - represents the list of upstream dataset ids, taken from `SetTransform` nodes. The same data is already stored in the dependency graph, so it should be utilized instead. To refactor in:
- `detect_remote_datasets`
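To illustrate the substitution, here is an in-memory stand-in for the dependency graph; `DependencyGraphService` and its method names are assumptions, not kamu's actual component:

```rust
// Sketch only: a toy dependency graph showing that upstream ids can come
// from the graph rather than from `DatasetSummary::dependencies`.

use std::collections::HashMap;

#[derive(Default)]
pub struct DependencyGraphService {
    upstream: HashMap<String, Vec<String>>,
}

impl DependencyGraphService {
    /// Populated when `SetTransform` blocks are indexed.
    pub fn set_upstream(&mut self, dataset_id: &str, deps: Vec<String>) {
        self.upstream.insert(dataset_id.to_string(), deps);
    }

    /// What a caller like `detect_remote_datasets` would query instead of
    /// reading the summary file.
    pub fn get_upstream_dependencies(&self, dataset_id: &str) -> Vec<String> {
        self.upstream.get(dataset_id).cloned().unwrap_or_default()
    }
}

fn main() {
    let mut graph = DependencyGraphService::default();
    graph.set_upstream("did:odf:derived", vec!["did:odf:root".to_string()]);
    assert_eq!(
        graph.get_upstream_dependencies("did:odf:derived"),
        vec!["did:odf:root".to_string()]
    );
    assert!(graph.get_upstream_dependencies("did:odf:unknown").is_empty());
}
```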
`num_records`, `data_size`, `checkpoints_size` - these are accumulated values that should be stored in a new database table and updated after `set_ref` changes. Currently the data is used in the `list` command and the GraphQL dataset API. These properties should be read from a new database-backed repository looking at the new table.
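The incremental update could look roughly like the sketch below. The table row shape, the repository name, and the block type are all assumptions; the real version would be backed by the database and hooked into `set_ref`:

```rust
// Sketch only: in-memory stand-in for the hypothetical `dataset_statistics`
// table, kept up to date by folding in blocks added since the previous HEAD
// instead of rescanning the whole metadata chain.

use std::collections::HashMap;

/// Row of the hypothetical `dataset_statistics` table.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct DatasetStatistics {
    pub num_records: u64,
    pub data_size: u64,
    pub checkpoints_size: u64,
}

/// Simplified stand-in for the data-bearing part of an `AddData` /
/// `ExecuteTransform` metadata block.
pub struct DataBlock {
    pub num_records: u64,
    pub data_size: u64,
    pub checkpoint_size: u64,
}

#[derive(Default)]
pub struct StatisticsRepository {
    rows: HashMap<String, DatasetStatistics>,
}

impl StatisticsRepository {
    /// Called after `set_ref`: accumulate only the newly appended blocks.
    pub fn apply_new_blocks<'a>(
        &mut self,
        dataset_id: &str,
        new_blocks: impl IntoIterator<Item = &'a DataBlock>,
    ) {
        let stats = self.rows.entry(dataset_id.to_string()).or_default();
        for b in new_blocks {
            stats.num_records += b.num_records;
            stats.data_size += b.data_size;
            stats.checkpoints_size += b.checkpoint_size;
        }
    }

    /// What the `list` command / GraphQL dataset API would read.
    pub fn get(&self, dataset_id: &str) -> DatasetStatistics {
        self.rows.get(dataset_id).copied().unwrap_or_default()
    }
}

fn main() {
    let mut repo = StatisticsRepository::default();
    let blocks = [
        DataBlock { num_records: 100, data_size: 4096, checkpoint_size: 128 },
        DataBlock { num_records: 50, data_size: 2048, checkpoint_size: 0 },
    ];
    repo.apply_new_blocks("did:odf:example", blocks.iter());

    let s = repo.get("did:odf:example");
    assert_eq!(s.num_records, 150);
    assert_eq!(s.data_size, 6144);
    assert_eq!(s.checkpoints_size, 128);
}
```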
`last_pulled` - represents the last time the dataset obtained an `AddData` or `ExecuteTransform` node. Used only in the `list` command. This might be expressed more simply if we had another quick link in the database to the last "data" node. Alternatively, it could be saved to the same table as `num_records` and `data_size`.

`last_block_hash` - represents the value of HEAD against which the summary structure was built, and is only used to properly update the summary itself incrementally. It becomes redundant after this refactoring.

After this refactoring is complete, remove all related definitions, such as the `DatasetSummary` struct, the `get_summary` function, etc. We would also likely need a workspace migration that cleans up the summary files in each dataset.
(Should we do something with S3 too?)