Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement consistency pruning in the builit-in pruner #20702

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

lxfind
Copy link
Contributor

@lxfind lxfind commented Dec 20, 2024

Description

This PR implements consistent pruning within the built-in pruner for concurrent pipelines.
Instead of relying on a separate pipeline to do pruning, this PR adds shared data between the processor and pruner for obj_info.
Note that this PR isn't finished yet, sending our early to make sure this is not in the wrong direction.
To implement this, I also need to make pruner take self so that it has access to shared data. The first checkpoint of indexer also needs special care to make sure for any pipeline that requires processed data to prune, we must start processing at the pruner watermark instead of commit watermark.

Test plan

How did you test the new or updated feature?


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

Copy link

vercel bot commented Dec 20, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 26, 2024 6:04am
3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Dec 26, 2024 6:04am
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Dec 26, 2024 6:04am
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Dec 26, 2024 6:04am

@lxfind lxfind temporarily deployed to sui-typescript-aws-kms-test-env December 20, 2024 01:28 — with GitHub Actions Inactive
@lxfind lxfind marked this pull request as draft December 20, 2024 01:28
@lxfind lxfind marked this pull request as ready for review December 20, 2024 01:28
@lxfind lxfind temporarily deployed to sui-typescript-aws-kms-test-env December 20, 2024 01:28 — with GitHub Actions Inactive
Comment on lines +22 to +20
use crate::consistent_pruning::{PruningInfo, PruningLookupTable};

#[derive(Default)]
pub(crate) struct ObjInfo {
pruning_lookup_table: Arc<PruningLookupTable>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might be missing something, but I don't see the consistent_pruning changes in this PR, or on main?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah forgot to include the file

Comment on lines +112 to +109
async fn prune(&self, from: u64, to: u64, conn: &mut db::Connection<'_>) -> Result<usize> {
use sui_indexer_alt_schema::schema::obj_info::dsl;

let to_prune = self.pruning_lookup_table.take(from, to)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems sensible enough for pruner to take self. The handler will know best the shape of the data it needs to do any consistent pruning, would likely be unwieldy otherwise

@lxfind lxfind force-pushed the indexer-alt-add-pruner-data-to-obj-info branch from 9196bde to e2b70f4 Compare December 20, 2024 02:45
@lxfind lxfind temporarily deployed to sui-typescript-aws-kms-test-env December 20, 2024 02:45 — with GitHub Actions Inactive
@lxfind lxfind force-pushed the indexer-alt-add-pruner-data-to-obj-info branch from e2b70f4 to cb2f77e Compare December 20, 2024 02:46
@lxfind lxfind temporarily deployed to sui-typescript-aws-kms-test-env December 20, 2024 02:46 — with GitHub Actions Inactive
@lxfind lxfind force-pushed the indexer-alt-add-pruner-data-to-obj-info branch from cb2f77e to 0e652b0 Compare December 20, 2024 02:47
@lxfind lxfind temporarily deployed to sui-typescript-aws-kms-test-env December 20, 2024 02:48 — with GitHub Actions Inactive
Copy link
Member

@amnn amnn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it overall! The only bit that I want to think about is how it interacts with resumption and the first checkpoint logic, but I think that is more to do with me being unsatisfied with how that bit of code works already -- I think this approach is pretty nice and clean.

@@ -27,6 +27,10 @@ pub trait Processor {
/// How much concurrency to use when processing checkpoint data.
const FANOUT: usize = 10;

/// Whether the pruner requires processed values in order to prune.
/// This will determine the first checkpoint to process when we start the pipeline.
const PRUNING_REQUIRES_PROCESSED_VALUES: bool = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of an unfortunate consequence of this approach, but I'm coming around to it, with some small tweaks:

  • This feature is unique to concurrent pipelines (as is pruning), so we should put it on the concurrent::Handler trait (although I see now above why you've put it on Processor based on where in the codebase you fetch the watermark).
  • I would name it for exactly what it controls (resuming from the pruner watermark), rather than what we might use it for (back-channeling values from the processor to the pruner). How about RESUME_FROM_PRUNER_HI or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me "resuming from the pruner watermark" is an implementation detail that would actually be difficult for pipeline builders to understand. What they care about is the fact that they want to use processed data to do pruning, and the need to resume ingestion from pruner hi is a side-effect.

@lxfind lxfind force-pushed the indexer-alt-add-pruner-data-to-obj-info branch from 0e652b0 to 52ea442 Compare December 26, 2024 06:00
@lxfind lxfind temporarily deployed to sui-typescript-aws-kms-test-env December 26, 2024 06:00 — with GitHub Actions Inactive
@lxfind lxfind changed the title [RFC] Implement consistency pruning in the builit-in pruner Implement consistency pruning in the builit-in pruner Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants