Refactoring: get rid of DatasetSummary as a repository file, store these key properties in the database #983

Open

zaychenko-sergei opened this issue Dec 10, 2024
Labels: enhancement, performance, rust
Depends on #981 and #342; partially depends on #978.

DatasetSummary represents a set of pre-computed, frequently requested properties about the state of a dataset.
Before the product grew into a Node server, the original CLI application relied heavily on summaries to avoid redundant metadata chain scans.
Now, however, it is getting in the way of performance scaling as databases and transactionality are applied more widely.
Storing the same data in the database will not affect CLI behavior much, since the CLI will soon contain a workspace-scoped SQLite database as well, while for the Node server this opens a range of performance optimization opportunities.

In addition, DatasetSummary overlaps with the ideas behind DatasetEntry in the datasets domain, and we don't need multiple mechanisms serving a similar purpose.

We should revise the design: instead of storing the dataset summary as a file in the ODF repo, store the important properties in the database and automatically update them whenever the HEAD reference changes.
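
For orientation, the summary structure roughly looks like this (reconstructed from the properties discussed below; the exact definition lives in the kamu/ODF crates and may differ):

```rust
use chrono::{DateTime, Utc};
use opendatafabric::{DatasetID, DatasetKind, Multihash};

// Approximate shape only, assembled from the property list in this issue.
pub struct DatasetSummary {
    pub id: DatasetID,                      // from the Seed block
    pub kind: DatasetKind,                  // Root or Derivative, from the Seed block
    pub dependencies: Vec<DatasetID>,       // from SetTransform nodes
    pub last_block_hash: Multihash,         // HEAD the summary was built against
    pub last_pulled: Option<DateTime<Utc>>, // last AddData / ExecuteTransform
    pub num_records: u64,
    pub data_size: u64,
    pub checkpoints_size: u64,
}
```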

Below is the information about each property of DatasetSummary and where it is used:


  1. id - represents the dataset identifier taken from the Seed block.
    It is highly redundant now, as the id is already stored in DatasetEntry.
    To be removed completely (take it from the already resolved dataset handle) from:

  2. dataset_kind - represents whether the dataset is Root or Derivative, also taken from the Seed block. This is the most frequently accessed attribute in the core services. It could be stored in DatasetEntry, since it also never changes after dataset creation. It could also become a field in ResolvedDataset, since it would cost nothing to extract it there, and many contexts obtain ResolvedDataset already (see the sketch after this list). To refactor in:
  • GraphQL flows API: ensure_expected_dataset_kind
  • GraphQL dataset API: kind (src/adapter/graphql/src/queries/datasets/dataset.rs)
  • Ingest command: ensure_valid_push_target
  • List command
  • Compaction service
  • Pull request planner
  • Verification service
  • Set watermark service
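
For the ResolvedDataset option, a minimal sketch of carrying the kind on it, assuming the field is populated from DatasetEntry at resolution time (the struct shape here is an assumption, not the current definition):

```rust
use std::sync::Arc;
use opendatafabric::{DatasetHandle, DatasetKind};

// Assumed shape, not the current definition. The kind is filled in from
// DatasetEntry when the dataset is resolved; it never changes after
// dataset creation, so caching it here is safe.
pub struct ResolvedDataset {
    handle: DatasetHandle,
    dataset: Arc<dyn Dataset>, // `Dataset` is the existing core trait
    kind: DatasetKind,
}

impl ResolvedDataset {
    pub fn kind(&self) -> DatasetKind {
        self.kind
    }
}
```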

  3. dependencies - represents the list of upstream dataset ids, taken from SetTransform nodes.
    The same data is already stored in the dependency graph, so that should be used instead (see the sketch after this list).
    To refactor in:
  • Verify command: detect_remote_datasets
  • Dependency graph indexer (here we must scan the original dataset, which is fine since it happens only once)
  • Provenance service
  • Pull request planner
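
For these consumers, reads would go to the graph instead of the summary; a sketch under the assumption of a service along these lines (the exact trait and method names may differ from the real dependency graph API):

```rust
use opendatafabric::DatasetID;

// Assumed API shape, not the actual service definition. The dependency
// graph already indexes SetTransform inputs, so consumers can ask it
// for upstream ids instead of reading summary.dependencies.
#[async_trait::async_trait]
pub trait DependencyGraphService: Send + Sync {
    async fn get_upstream_dependencies(&self, dataset_id: &DatasetID) -> Vec<DatasetID>;
}

// Example replacement for a `summary.dependencies` read:
async fn upstream_ids(graph: &dyn DependencyGraphService, id: &DatasetID) -> Vec<DatasetID> {
    graph.get_upstream_dependencies(id).await
}
```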

  4. num_records, data_size, checkpoints_size - these are accumulated values that should be stored in a new database table.
    They should be updated after set_ref changes.
    Currently this data is used in the list command and the GraphQL dataset API.
    These properties should be read through a new database-backed repository over the new table (a possible shape is sketched after item 5).

  5. last_pulled - represents the last time the dataset received an AddData or ExecuteTransform node.
    Used only in the list command. This could be expressed more simply if we had another quick link in the database pointing to the last "data" node. Alternatively, it could be saved to the same table as num_records and data_size.
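
A possible shape for the new table's repository, covering items 4 and 5 (trait, struct, and field names are all hypothetical; DatasetID and InternalError are the existing crate types):

```rust
use chrono::{DateTime, Utc};
use internal_error::InternalError;
use opendatafabric::DatasetID;

// One row per dataset, rewritten from the set_ref path whenever the
// HEAD reference advances.
#[derive(Clone, Copy, Debug, Default)]
pub struct DatasetStatistics {
    pub num_records: u64,
    pub data_size: u64,
    pub checkpoints_size: u64,
    // Covers item 5, if last_pulled ends up in the same table.
    pub last_pulled: Option<DateTime<Utc>>,
}

#[async_trait::async_trait]
pub trait DatasetStatisticsRepository: Send + Sync {
    async fn get_statistics(&self, dataset_id: &DatasetID)
        -> Result<DatasetStatistics, InternalError>;

    async fn set_statistics(&self, dataset_id: &DatasetID, statistics: DatasetStatistics)
        -> Result<(), InternalError>;
}
```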

  6. last_block_hash - represents the value of HEAD against which the summary structure was built; it is only used to incrementally update the summary itself. It becomes redundant after this refactoring.

After this refactoring is complete, remove all related definitions, such as the DatasetSummary struct, the get_summary function, etc.
We would also likely need a workspace migration that cleans up the summary files in each dataset.
(Should we do something with S3 too?)
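
For the local workspace, the migration could be as simple as deleting the legacy file from each dataset directory; a minimal sketch, assuming the layout below (an S3 repo would need a similar pass over object keys):

```rust
use std::{fs, io, path::Path};

// Hypothetical one-off migration step. The flat datasets directory and
// the `summary` file name are assumptions about the current repo layout.
fn remove_summary_files(datasets_dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(datasets_dir)? {
        let summary_path = entry?.path().join("summary");
        if summary_path.is_file() {
            fs::remove_file(&summary_path)?;
        }
    }
    Ok(())
}
```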
