Add domain snap sync algorithm #3027

shamil-gadelshin · 2024-09-16T11:44:08Z

This PR introduces two important pieces of the domain snap sync implementation (#3026): the snap sync orchestrator and the domain snap sync algorithm. It continues the discussion of the algorithm implementation and highlights the current decisions. It still lacks the final integration on both the consensus and domain chain sides, proper configuration changes, and several security guarantees discussed previously.

The first commit introduces the SnapSyncOrchestrator, a synchronization manager that coordinates the processes on both the consensus and domain chains. Commits 2-4 modify the existing code and add a structure that is passed to the domain snap sync algorithm introduced in commit 5. The last commit (6) contains the updated Cargo.lock, placed separately to simplify the review.

Known future algorithm changes

change state block acquisition from downloading from remote peers to local derivation from consensus block
simple acquisition of the last confirmed domain block execution receipt must be replaced with the correct consensus protocol similar to segment headers acquisition
MMR data should be verified against the state (MMR roots)
introduce changes for MMR sync algorithm to work with "five segments lag" (see commit 3 and FINALIZATION_DEPTH_IN_SEGMENTS constant): modify MMR gadget to use archived blocks instead of finalized blocks.

Code contributor checklist:

I have read, understood and followed contributing guide

teor2345

Looks good, but I don’t understand the snap sync algorithm enough to fully review it.

teor2345

Thanks for the updates. I don't think I'm familiar enough with the algorithm to approve, but it looks reasonable to me.

nazar-pc · 2024-09-23T14:03:09Z

crates/subspace-service/src/domains/snap_sync_orchestrator.rs

+    }
+
+    /// Wait for the allowing signal for the consensus chain snap sync.
+    pub async fn consensus_snap_sync_unblocked(&self) {


SubspaceLink::block_importing_notification_stream()

shamil-gadelshin · Oct 3, 2024

AFAIR, we discussed offline the possibility of changing the current sync orchestrator algorithm from a blocking to a reactive approach by using SubspaceLink::block_importing_notification_stream() and its ability to acknowledge blocks. I tried that approach and deleted several of the initial synchronization points.

nazar-pc · 2024-10-03T21:38:41Z

Why not delete this one as well, then? My point was that ideally we wouldn't need this orchestrator at all.

shamil-gadelshin · Oct 4, 2024

Consensus chain snap sync is part of the more complex domain snap sync process. In this form, it must start only after we acquire the correct target block. I removed the other blocking orchestrator points after our conversation: for example, we no longer need to send signals from consensus snap sync. However, it's not clear to me how to remove the dependency completely. Let's wait until the full solution is merged and return to this; I'm open to a change here.

nazar-pc · 2024-10-04T13:13:26Z

Can you just have a mutex or oneshot channel or something similar that is passed down to subspace-service and prevents sync as such from starting? I don't think you need to block/unblock it many times anyway, just pause until something happens on the domain side, and that is not necessarily specific to snap sync either.

What else is this orchestrator needed for?
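As an illustration of the one-shot gate idea in this comment, here is a minimal std-only sketch. Everything in it is hypothetical: `SyncStartHandle`, `SyncStartGate`, `sync_start_gate` and `spawn_gated_sync` are invented names, and `std::sync::mpsc` with a thread stands in for whatever async oneshot channel and task the real service would use.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical handle kept by the domain side; using it lets sync start.
pub struct SyncStartHandle(Sender<()>);

/// Hypothetical gate passed down to the consensus sync code.
pub struct SyncStartGate(Receiver<()>);

/// Creates a connected handle/gate pair.
pub fn sync_start_gate() -> (SyncStartHandle, SyncStartGate) {
    let (sender, receiver) = mpsc::channel();
    (SyncStartHandle(sender), SyncStartGate(receiver))
}

impl SyncStartHandle {
    /// Consumes the handle, so the gate can be opened at most once.
    pub fn allow_sync(self) {
        // A send error only means the sync side has already gone away.
        let _ = self.0.send(());
    }
}

impl SyncStartGate {
    /// Blocks until sync is allowed; returns `false` if the handle was
    /// dropped without ever signalling.
    pub fn wait(self) -> bool {
        self.0.recv().is_ok()
    }
}

/// Stand-in for the sync task: waits on the gate, then reports what happened.
pub fn spawn_gated_sync(gate: SyncStartGate) -> JoinHandle<&'static str> {
    thread::spawn(move || {
        if gate.wait() {
            "sync started"
        } else {
            "sync cancelled"
        }
    })
}
```

In this sketch the domain side would call `allow_sync()` once it has whatever it needs, and the gate carries no snap-sync-specific state, which is the property the comment asks for.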

shamil-gadelshin · Oct 9, 2024

In the full version, the implementation will contain a target block provider that conceals the orchestrator:

pub trait SnapSyncTargetBlockProvider: Send + Sync {
    async fn target_block(&self) -> Option<BlockNumber>;
}

The default non-blocking implementation returns None, which is close to what you proposed. I tried to limit the scope of the PR, but it seems the whole solution will provide more context and will be easier to review despite its size.
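A rough sketch of how such a provider could look is given below. Only the trait signature comes from the comment above; `BlockNumber` as `u32`, `DefaultTargetBlockProvider`, `FixedTargetBlockProvider` and the tiny `block_on` helper are assumptions added so the example is self-contained, and the real code would run on the node's async runtime instead.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Assumption: a plain integer stands in for the real block number type.
pub type BlockNumber = u32;

#[allow(async_fn_in_trait)]
pub trait SnapSyncTargetBlockProvider: Send + Sync {
    async fn target_block(&self) -> Option<BlockNumber>;
}

/// Hypothetical default provider: never blocks and names no target block.
pub struct DefaultTargetBlockProvider;

impl SnapSyncTargetBlockProvider for DefaultTargetBlockProvider {
    async fn target_block(&self) -> Option<BlockNumber> {
        None
    }
}

/// Hypothetical provider that always returns a fixed target block.
pub struct FixedTargetBlockProvider(pub BlockNumber);

impl SnapSyncTargetBlockProvider for FixedTargetBlockProvider {
    async fn target_block(&self) -> Option<BlockNumber> {
        Some(self.0)
    }
}

/// Waker that does nothing; sufficient for the busy-polling helper below.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for the example only: polls the future until it is ready.
pub fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut context = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut context) {
            return output;
        }
        std::thread::yield_now();
    }
}
```

A blocking implementation would instead await a signal from the domain side inside `target_block`, so the caller does not need to know about the orchestrator.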

NingLin-P

Makes sense in general; I will take another look.

NingLin-P · 2024-09-23T22:11:52Z

domains/client/domain-operator/src/snap_sync.rs

+            network_request,
+        )?;
+
+        let last_block_from_sync_result = sync_engine.download_state().await;


The state may be too large to fit into memory and cause an OOM. This is not an immediate issue to fix, but we need to resolve it in the long term.

nazar-pc · Sep 23, 2024

There is an upstream issue for this: paritytech/polkadot-sdk#4

NingLin-P · 2024-09-23T22:31:12Z

domains/client/domain-operator/src/snap_sync.rs

+    if last_confirmed_block_receipt.domain_block_number == 0u32.into() {
+        return Err(sp_blockchain::Error::Application(
+            "Can't snap sync from genesis.".into(),
+        ));
+    }


This can happen if the domain is instantiated but has produced fewer than 14_400 blocks. In particular, when we enable the domain in mainnet phase 2, the operator node will have to use full sync and sync from genesis.

Is it possible to avoid this error, either by downloading the genesis state from other peers or by deriving the genesis state from the consensus chain (after consensus sync is finished), as the domain bootstrapper currently does?

shamil-gadelshin · Oct 3, 2024

AFAIK, it's currently not possible to insert the state after initialization. However, we could investigate a hybrid "consensus-snap and domain-full" mode for such a case. Further research is required.
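To make the hybrid idea concrete, the following sketch shows one way the decision could be expressed. It is not what this PR does (the quoted diff returns an error at block 0); `DomainSyncMode` and `choose_domain_sync_mode` are hypothetical names, and `u32` is assumed for the domain block number.

```rust
/// Hypothetical outcome of deciding how to sync the domain chain.
#[derive(Debug, PartialEq, Eq)]
pub enum DomainSyncMode {
    /// Snap sync the domain chain to the last confirmed domain block.
    Snap { target_block: u32 },
    /// Full sync of the domain chain from genesis, while the consensus
    /// chain may still be snap synced.
    FullFromGenesis,
}

/// Picks a mode from the block number in the last confirmed execution receipt.
pub fn choose_domain_sync_mode(last_confirmed_domain_block: u32) -> DomainSyncMode {
    if last_confirmed_domain_block == 0 {
        // Nothing has been confirmed beyond genesis, so there is no state to snap to.
        DomainSyncMode::FullFromGenesis
    } else {
        DomainSyncMode::Snap {
            target_block: last_confirmed_domain_block,
        }
    }
}
```

Whether such a fallback is feasible depends on the research mentioned above; the sketch only shows where the branch would sit.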

nazar-pc · 2024-10-03T21:40:48Z

domains/client/domain-operator/src/snap_sync.rs

+            // Skip last `FINALIZATION_DEPTH_IN_SEGMENTS` archived segments
+            .and_then(|max_segment_index| {
+                max_segment_index.checked_sub(FINALIZATION_DEPTH_IN_SEGMENTS)
+            })


We discussed that it is not actually necessary to download the older segment at FINALIZATION_DEPTH_IN_SEGMENTS, because that is equivalent to downloading the latest segment if the responder does a little bit of custom logic to compose the necessary data from blocks that are technically not yet "finalized" from Substrate's point of view. What happened to that?

shamil-gadelshin · Oct 4, 2024

This is my current task, and I will issue a separate PR, similar to the other MMR-sync updates.
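For context, the `checked_sub` step in the quoted diff can be reproduced in isolation as below. This is a sketch under assumptions: the constant is set to 5 based on the "five segments lag" wording in the PR description, `u64` stands in for the real segment index type, and `target_segment_index` is a hypothetical helper name.

```rust
/// Assumed value, taken from the "five segments lag" in the PR description.
const FINALIZATION_DEPTH_IN_SEGMENTS: u64 = 5;

/// Returns the segment index to sync to, skipping the last
/// `FINALIZATION_DEPTH_IN_SEGMENTS` archived segments.
///
/// Yields `None` when no segment is known or when fewer segments exist
/// than the skipped depth.
pub fn target_segment_index(max_segment_index: Option<u64>) -> Option<u64> {
    max_segment_index.and_then(|max_segment_index| {
        max_segment_index.checked_sub(FINALIZATION_DEPTH_IN_SEGMENTS)
    })
}
```

With this logic a chain whose newest segment index is below the depth has no valid target at all, which is one practical cost of the lag that the follow-up PR is meant to remove.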

shamil-gadelshin · 2024-10-10T14:55:50Z

Superseded by #3115

shamil-gadelshin added 6 commits September 16, 2024 15:02

Introduce snap sync orchestrator.

bd4829a

Introduce consensus sync params struct.

53d1cad

Export FINALIZATION_DEPTH_IN_SEGMENTS constant.

aea7712

Modify wait_for_import function.

c65ffa6

Add domain snap sync algorithm.

7919e62

Update Cargo.lock

0ec127d

shamil-gadelshin requested review from NingLin-P, nazar-pc and rg3l3dr as code owners September 16, 2024 11:44

teor2345 reviewed Sep 16, 2024

Cargo.lock (outdated, resolved)

domains/client/domain-operator/src/snap_sync.rs (outdated, resolved)

Base automatically changed from modify-mmr-sync to main September 17, 2024 09:24

shamil-gadelshin and others added 2 commits September 17, 2024 14:06

Fix Cargo.toml for domain-operator crate

8d4747d

Update domains/client/domain-operator/src/snap_sync.rs

38d885b

Co-authored-by: teor <[email protected]>

shamil-gadelshin requested a review from teor2345 September 17, 2024 10:10

teor2345 reviewed Sep 20, 2024

nazar-pc reviewed Sep 23, 2024

NingLin-P reviewed Sep 23, 2024

shamil-gadelshin added 4 commits October 3, 2024 19:52

Refactor peer discovery code for domain snap sync.

3b542b0

Refactor wait_for_block_import function

7fe7136

Merge branch 'main' into add-domain-snap-sync-algorithm

0f81e12

# Conflicts: # crates/subspace-service/src/sync_from_dsn.rs # domains/client/domain-operator/Cargo.toml

Fix merge changes.

06a038c

shamil-gadelshin requested review from nazar-pc, NingLin-P and teor2345 October 3, 2024 16:46

shamil-gadelshin marked this pull request as draft October 3, 2024 16:47

Refactor snap sync orchestrator.

8ee2d64

shamil-gadelshin marked this pull request as ready for review October 3, 2024 17:07

nazar-pc reviewed Oct 3, 2024

nazar-pc reviewed Oct 4, 2024

shamil-gadelshin mentioned this pull request Oct 10, 2024

Full domain chain snap sync. #3115

Closed

1 task

shamil-gadelshin closed this Oct 10, 2024
