Refactoring: get rid of DatasetSummary as a repository file, store these key properties in the database #983

Open

zaychenko-sergei opened this issue Dec 10, 2024
Labels: enhancement, performance, rust
Depends on #981 and #342; partially depends on #978.

DatasetSummary represents a set of pre-computed, frequently requested properties about the state of a dataset.
Before the product grew into a Node server, the original CLI application relied heavily on summaries to avoid redundant metadata chain scans.
Now, however, it is getting in the way of performance scaling as databases and transactionality are applied more widely.
Storing the same data in the database will not affect CLI behavior much, since the CLI will soon contain a workspace-scoped SQLite database as well, while for the Node server this opens a range of performance optimization opportunities.

In addition, DatasetSummary overlaps with the ideas behind DatasetEntry in the datasets domain, and we don't need multiple mechanisms serving a similar purpose.

We should revise the design: instead of storing the dataset summary as a file in the ODF repo, store the important properties in the database and automatically update them whenever the HEAD reference changes.
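
For orientation, the summary structure roughly looks like this (reconstructed from the properties discussed below; the exact definition lives in the kamu/ODF crates and may differ):

```rust
use chrono::{DateTime, Utc};
use opendatafabric::{DatasetID, DatasetKind, Multihash};

// Approximate shape only, assembled from the property list in this issue.
pub struct DatasetSummary {
    pub id: DatasetID,                      // from the Seed block
    pub kind: DatasetKind,                  // Root or Derivative, from the Seed block
    pub dependencies: Vec<DatasetID>,       // from SetTransform nodes
    pub last_block_hash: Multihash,         // HEAD the summary was built against
    pub last_pulled: Option<DateTime<Utc>>, // last AddData / ExecuteTransform
    pub num_records: u64,
    pub data_size: u64,
    pub checkpoints_size: u64,
}
```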

Below is the information about each property of DatasetSummary and where it is used:


  1. id - represents the dataset identifier taken from the Seed block.
    It is highly redundant now, as the id is already stored in DatasetEntry.
    To be removed completely (take it from the already resolved dataset handle) from:

  2. dataset_kind - represents whether the dataset is Root or Derivative, also taken from the Seed block. This is the most frequently accessed attribute in the core services. It could be stored in DatasetEntry, since it also never changes after dataset creation. It could also become a field in ResolvedDataset, since it would cost nothing to extract it there, and many contexts obtain ResolvedDataset already (see the sketch after this list). To refactor in:
  • GraphQL flows API: ensure_expected_dataset_kind
  • GraphQL dataset API: kind (src/adapter/graphql/src/queries/datasets/dataset.rs)
  • Ingest command: ensure_valid_push_target
  • List command
  • Compaction service
  • Pull request planner
  • Verification service
  • Set watermark service
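
For the ResolvedDataset option, a minimal sketch of carrying the kind on it, assuming the field is populated from DatasetEntry at resolution time (the struct shape here is an assumption, not the current definition):

```rust
use std::sync::Arc;
use opendatafabric::{DatasetHandle, DatasetKind};

// Assumed shape, not the current definition. The kind is filled in from
// DatasetEntry when the dataset is resolved; it never changes after
// dataset creation, so caching it here is safe.
pub struct ResolvedDataset {
    handle: DatasetHandle,
    dataset: Arc<dyn Dataset>, // `Dataset` is the existing core trait
    kind: DatasetKind,
}

impl ResolvedDataset {
    pub fn kind(&self) -> DatasetKind {
        self.kind
    }
}
```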

  3. dependencies - represents the list of upstream dataset ids, taken from SetTransform nodes.
    The same data is already stored in the dependency graph, so that should be used instead (see the sketch after this list).
    To refactor in:
  • Verify command: detect_remote_datasets
  • Dependency graph indexer (here we must scan the original dataset, which is fine since it happens only once)
  • Provenance service
  • Pull request planner
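
For these consumers, reads would go to the graph instead of the summary; a sketch under the assumption of a service along these lines (the exact trait and method names may differ from the real dependency graph API):

```rust
use opendatafabric::DatasetID;

// Assumed API shape, not the actual service definition. The dependency
// graph already indexes SetTransform inputs, so consumers can ask it
// for upstream ids instead of reading summary.dependencies.
#[async_trait::async_trait]
pub trait DependencyGraphService: Send + Sync {
    async fn get_upstream_dependencies(&self, dataset_id: &DatasetID) -> Vec<DatasetID>;
}

// Example replacement for a `summary.dependencies` read:
async fn upstream_ids(graph: &dyn DependencyGraphService, id: &DatasetID) -> Vec<DatasetID> {
    graph.get_upstream_dependencies(id).await
}
```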

  4. num_records, data_size, checkpoints_size - these are accumulated values that should be stored in a new database table.
    They should be updated after set_ref changes.
    Currently this data is used in the list command and the GraphQL dataset API.
    These properties should be read through a new database-backed repository over the new table (a possible shape is sketched after item 5).

  5. last_pulled - represents the last time the dataset received an AddData or ExecuteTransform node.
    Used only in the list command. This could be expressed more simply if we had another quick link in the database pointing to the last "data" node. Alternatively, it could be saved to the same table as num_records and data_size.
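
A possible shape for the new table's repository, covering items 4 and 5 (trait, struct, and field names are all hypothetical; DatasetID and InternalError are the existing crate types):

```rust
use chrono::{DateTime, Utc};
use internal_error::InternalError;
use opendatafabric::DatasetID;

// One row per dataset, rewritten from the set_ref path whenever the
// HEAD reference advances.
#[derive(Clone, Copy, Debug, Default)]
pub struct DatasetStatistics {
    pub num_records: u64,
    pub data_size: u64,
    pub checkpoints_size: u64,
    // Covers item 5, if last_pulled ends up in the same table.
    pub last_pulled: Option<DateTime<Utc>>,
}

#[async_trait::async_trait]
pub trait DatasetStatisticsRepository: Send + Sync {
    async fn get_statistics(&self, dataset_id: &DatasetID)
        -> Result<DatasetStatistics, InternalError>;

    async fn set_statistics(&self, dataset_id: &DatasetID, statistics: DatasetStatistics)
        -> Result<(), InternalError>;
}
```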

  6. last_block_hash - represents the value of HEAD against which the summary structure was built; it is only used to incrementally update the summary itself. It becomes redundant after this refactoring.

After this refactoring is complete, remove all related definitions, such as the DatasetSummary struct, the get_summary function, etc.
We would also likely need a workspace migration that cleans up the summary files in each dataset.
(Should we do something with S3 too?)
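
For the local workspace, the migration could be as simple as deleting the legacy file from each dataset directory; a minimal sketch, assuming the layout below (an S3 repo would need a similar pass over object keys):

```rust
use std::{fs, io, path::Path};

// Hypothetical one-off migration step. The flat datasets directory and
// the `summary` file name are assumptions about the current repo layout.
fn remove_summary_files(datasets_dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(datasets_dir)? {
        let summary_path = entry?.path().join("summary");
        if summary_path.is_file() {
            fs::remove_file(&summary_path)?;
        }
    }
    Ok(())
}
```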
