feat(warp): added warp update to madara and other changes #393

Open: Trantorian1 wants to merge 79 commits into main.
Conversation

@Trantorian1 (Collaborator) commented Nov 21, 2024:

Pull Request type

  • Feature
  • Refactoring (no functional changes, no API changes)
  • Documentation content changes

What is the current behavior?

There is currently no way to migrate the Madara db across a breaking change. This PR fixes that, adding some quality-of-life features in the process.

What is the new behavior?

  • Added warp update, which allows Madara to migrate its database much faster than a re-sync from genesis. It works by running a local feeder gateway (FGW) node and a second node which syncs from it with higher-than-normal parallelization. This is documented in the README, where you can find more detailed instructions on how to test the feature.

  • Added CLI argument presets. This is quite primitive at the moment and only works by overriding other CLI args after they have been parsed. This sadly means the user cannot currently override options set by a preset (and the whole mechanism is still very manual). Imo it is still a good start towards making common argument configurations such as RPC or gateway (or just lengthy ones like warp sync) easier to use.

  • The db flush no longer occurs on a fixed 5s interval timer. Instead, maybe_flush has been renamed to flush and is called every n blocks in the l2 sync block storage process. The interval can be set through the CLI arg --flush-every-n-blocks and defaults to 1000 (see the sketch after this list).

  • Added the option to increase sync parallelism. This is used by warp sync and is available under the --sync-parallelism CLI flag, which defaults to 10 (meaning 10 blocks are fetched in parallel during l2 sync).
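For illustration, here is a minimal, self-contained sketch of the flush cadence described above. The `Backend::flush` name mirrors the renamed method, but the surrounding loop and the other names are assumptions, not the actual Madara code.

```rust
use anyhow::Result;

/// Stand-in for the Madara database backend; only the renamed `flush` matters here.
struct Backend;

impl Backend {
    fn flush(&self) -> Result<()> {
        // In Madara this would flush the RocksDB column families.
        Ok(())
    }
}

/// Flush once every `flush_every_n_blocks` blocks instead of on a 5s timer.
fn store_blocks(backend: &Backend, block_numbers: &[u64], flush_every_n_blocks: u64) -> Result<()> {
    let mut last_flushed = 0u64;
    for &block_n in block_numbers {
        // ... import and store the block here ...
        if block_n - last_flushed >= flush_every_n_blocks {
            backend.flush()?;
            last_flushed = block_n;
        }
    }
    Ok(())
}

fn main() -> Result<()> {
    let backend = Backend;
    // With the default of 1000, this flushes at blocks 1000, 2000, 3000 and 4000.
    store_blocks(&backend, &(0u64..5000).collect::<Vec<_>>(), 1000)
}
```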

Some important changes have also been made to the RPC admin endpoints:

  • Added a new admin endpoint, stop_node, which can be used to stop Madara remotely. This is used by warp update to stop the warp update sender once the receiver has finished synchronizing.

  • Added a new admin endpoint, ping, which returns the unix time at which the request was received. This can be used to check whether a node is active or to estimate network latency.

  • Added various new admin endpoints which allow controlling the capabilities of the node (for example stopping / restarting sync). This is quite basic at the moment and has only been implemented for sync and the user-facing RPC services. It is currently not possible to start services which were not available at startup. A minimal client sketch follows below.
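As a rough illustration, the new endpoints can be called with a jsonrpsee client. The exact method names (`madara_ping`, `madara_stopNode`) and the admin port used here are assumptions; check the actual admin RPC definitions before relying on them.

```rust
use jsonrpsee::core::client::ClientT;
use jsonrpsee::http_client::HttpClientBuilder;
use jsonrpsee::rpc_params;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Hypothetical admin RPC address.
    let client = HttpClientBuilder::default().build("http://127.0.0.1:9944")?;

    // `ping` returns the unix time at which the request was received,
    // which can double as a liveness check or a crude latency probe.
    let received_at: u64 = client.request("madara_ping", rpc_params![]).await?;
    println!("node is up, request received at unix time {received_at}");

    // `stop_node` shuts the node down remotely (warp update uses this to stop the sender).
    let _: () = client.request("madara_stopNode", rpc_params![]).await?;

    Ok(())
}
```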

Note

There are two main issues I have with the current service architecture:

  1. It is not possible to start a service later on in the execution of the node.
  2. Interruption with graceful shutdown can occur at any time during execution; beyond that, we have no hook to simply pause the execution of a service.

I am thinking about how best to approach this. Currently, I am favoring a monitor solution where ServiceGroup also keeps in memory the information necessary to start (and restart) a service. There also needs to be some mechanism for the ServiceGroup to cleanly pause the execution of a service, but I am not sure of the best way to go about this yet. A rough sketch of the direction is shown below.
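Purely as a sketch of the monitor idea (all names here are hypothetical and nothing below exists in the codebase yet): the group would keep a factory per service so it can be started or restarted on demand, plus a cancellation handle for stopping it.

```rust
use std::collections::HashMap;
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

// A factory kept in memory so a service can be (re)started later on.
type ServiceFactory = Box<dyn Fn(CancellationToken) -> JoinHandle<()> + Send>;

struct ServiceMonitor {
    factories: HashMap<&'static str, ServiceFactory>,
    running: HashMap<&'static str, (CancellationToken, JoinHandle<()>)>,
}

impl ServiceMonitor {
    /// Start (or restart) a registered service by name.
    fn start(&mut self, name: &'static str) {
        if let Some(factory) = self.factories.get(name) {
            let token = CancellationToken::new();
            let handle = factory(token.clone());
            self.running.insert(name, (token, handle));
        }
    }

    /// Stop a running service by cancelling its token and waiting for it to exit.
    async fn stop(&mut self, name: &'static str) {
        if let Some((token, handle)) = self.running.remove(name) {
            token.cancel();
            let _ = handle.await;
        }
    }
}
```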

On the RPC front:

  • Updated the RPC versioning middleware to accept RPC methods sent with jsonrpsee. Due to our versioning scheme, jsonrpsee sends requests of the form namespace_version_method, whereas our versioning middleware only accepted requests of the form namespace_method, with the version specified in the header. The middleware has been changed to allow both formats. This is only needed so that one node can make RPC calls to another; the alternative would have been to craft these requests manually. I think this is the better solution, as jsonrpsee provides type checking and function signatures for each of our methods, which should reduce errors. A parsing sketch is shown below.
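A hedged sketch of what accepting both formats could look like; the exact version pattern and the helper name are assumptions and the real middleware may differ.

```rust
use regex::Regex;

/// Accept both "namespace_method" (version supplied in a header) and
/// "namespace_vX_Y_Z_method" (version embedded in the name, as jsonrpsee sends it).
fn extract_version_and_method(method: &str) -> (Option<String>, String) {
    // Illustrative version pattern only; adjust to the real versioning scheme.
    let re = Regex::new(r"^(?P<ns>[a-z]+)_v(?P<ver>\d+_\d+_\d+)_(?P<m>\w+)$").unwrap();
    match re.captures(method) {
        // Embedded version: strip it out and hand the version back to the middleware.
        Some(caps) => (Some(caps["ver"].to_string()), format!("{}_{}", &caps["ns"], &caps["m"])),
        // No embedded version: fall back to the header-based path.
        None => (None, method.to_string()),
    }
}

fn main() {
    assert_eq!(
        extract_version_and_method("starknet_v0_7_1_getBlockWithTxs"),
        (Some("0_7_1".to_string()), "starknet_getBlockWithTxs".to_string())
    );
    assert_eq!(
        extract_version_and_method("starknet_getBlockWithTxs"),
        (None, "starknet_getBlockWithTxs".to_string())
    );
}
```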

Refactor

  • Finally, the various methods in the l2 sync section of Madara have been refactored to be more readable.

Does this introduce a breaking change?

No.

Trantorian1 and others added 30 commits (November 18, 2024, 12:01). Selected commit messages:

  • This should be handled by an external proxy anyways.
  • This currently only contains the `MadaraWriteRpc` methods but will be used to encapsulate any sensitive admin methods in the future.
@Trantorian1 marked this pull request as ready for review on November 26, 2024, 09:07.
Ok(will_flush)
self.db.flush_cfs_opt(&columns, &opts).context("Flushing database")?;

Ok(())
Member:

Not mandatory now but we should add metrics to follow the flush frequency and behavior

crates/client/eth/src/l1_gas_price.rs (comment resolved)
@antiyro (Member) commented Nov 27, 2024:

[Screenshot: 2024-11-27 at 10:02]

I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`

@Trantorian1 (Collaborator, Author):

> I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`

Will be checking that out

@jbcaron (Member) left a comment:

Why switch Admin methods to release?

You can simplify the fetch task handling by using the run_until_cancelled method provided by tokio_util::sync::CancellationToken. This will cleanly stop the tasks without the need for additional logic.
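For reference, a minimal sketch of that suggestion (the fetch logic here is a placeholder, not Madara's actual fetch task):

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

// Placeholder for the real block-fetching work.
async fn fetch_next_block() {
    tokio::time::sleep(Duration::from_millis(100)).await;
}

async fn fetch_task(token: CancellationToken) {
    loop {
        // `run_until_cancelled` returns None if the token is cancelled before the
        // future completes, so the loop exits cleanly with no extra select! logic.
        if token.run_until_cancelled(fetch_next_block()).await.is_none() {
            break;
        }
    }
}

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let handle = tokio::spawn(fetch_task(token.clone()));
    token.cancel();
    let _ = handle.await;
}
```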

@antiyro (Member) commented Nov 27, 2024:

[Screenshot: 2024-11-27 at 14:20]

Problems when sending SIGINT to the sync: sometimes it is not detected, or there is no graceful shutdown and the process segfaults.

@Trantorian1 (Collaborator, Author):

> Problems when sending SIGINT to the sync: sometimes it is not detected, or there is no graceful shutdown and the process segfaults.

This has been fixed

@cchudant (Member):

> I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`

Was this caught by the CI? If not, what was wrong here and why did the e2e tests pass? The idea of such a big problem not getting caught by the CI makes me worried.

@Trantorian1 (Collaborator, Author):

> I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`
>
> Was this caught by the CI? If not, what was wrong here and why did the e2e tests pass? The idea of such a big problem not getting caught by the CI makes me worried.

This was in fact not an infinite freeze, just tracing being disabled after #401.

@cchudant (Member) left a comment:

So, I'm not sure I understand the difference between what is checked during regular block import and what is checked when importing a block during warp updates.

Are the block hashes and so on checked in warp mode too?

README.md Outdated
## ✔ Supported Features
## 🔄 Migration

When migration to a newer version of Madara which introduces breaking changes,
Member:

When migrating to a newer version of Madara [that introduces breaking changes]
I would probably simply remove the latter part of the sentence, tbh; people are not expected to migrate to new versions that don't require upgrading the db?

@@ -176,11 +170,6 @@ impl BlockImporter {
validation: BlockValidationContext,
) -> Result<BlockImportResult, BlockImportError> {
let result = self.verify_apply.verify_apply(block, validation).await?;
// Flush step.
let force = self.always_force_flush;
Member:

aaah!! no, we really need to flush every block, always, in block_production mode

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

This no longer matters, as maybe_flush has been removed and we now always flush, so always_force_flush is no longer needed. I noticed that the only place in the code where we performed this check was in the block import, so we just do the check there, while other parts of the code call flush directly.

@@ -238,7 +239,13 @@ pub async fn handle_get_block_traces(
traces: Vec<TransactionTraceWithHash>,
}

let traces = v0_7_1_trace_block_transactions(&Starknet::new(backend, add_transaction_provider), block_id).await?;
// TODO: we should probably use the actual service context here instead of
// creating a new one!
Member:

We should probably not implement trace block like this at all, actually; I don't see why the gateway would depend on RPC.

@cchudant (Member) commented Nov 28, 2024:

this was a hack afaik

self.backend
.maybe_flush(true)
.map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
self.backend.flush().map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
Member:

Just a note here: flushing is the responsibility of the block import crate, and this flushing here is a hack for pending blocks. I'd like the flushing to be moved back into block import and removed from sync's l2.rs once again.

Member:

you can keep this hack here of course since it was already present

Self {
backend: Arc::clone(&self.backend),
add_transaction_provider: Arc::clone(&self.add_transaction_provider),
ctx: self.ctx.branch(),
Member:

Does that mean that cloning a Starknet instance will make it a parent to a newly created child? This feels weird; I wouldn't expect Clone to do things like that.

How about just wrapping ServiceContext in an Arc, so that cloning a Starknet actually shares the cancellation token instead of forking it?

@Trantorian1 (Collaborator, Author):

Hmm, branch only clones the CancellationToken though, which is itself behind an Arc, so I don't think this is necessary.

pub struct CancellationToken {
    inner: Arc<tree_node::TreeNode>,
}
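For context, a small self-contained example of the distinction being discussed here, using tokio_util's public CancellationToken API (cloning shares the inner Arc, while child_token creates a new child node):

```rust
use tokio_util::sync::CancellationToken;

fn main() {
    let parent = CancellationToken::new();

    // Cloning shares the same underlying node, so both handles observe the same state.
    let shared = parent.clone();

    // child_token() creates a child: cancelling the parent cancels the child,
    // but cancelling the child would not cancel the parent.
    let child = parent.child_token();

    parent.cancel();
    assert!(shared.is_cancelled());
    assert!(child.is_cancelled());
}
```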

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

Also, ServiceContext::branch does not create any child CancellationToken; that is handled by ServiceContext::branch_id.

Member:

Yes, I did not realize branch was a clone, sorry.

@Trantorian1 (Collaborator, Author):

No worries, I've updated it to clone to make this more obvious :)

};

if client.shutdown().await.is_err() {
tracing::error!("❗ Failed to shutdown warp update sender");
Member:

isn't this an unrecoverable error?

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

Yep, will make it one 👍

UpTo(u64),
}

async fn sync_blocks(
Member:

can you add some docs here?

Member:

I'm unsure where the parallel catch-up vs. polling phases are here.

Member:

I think it's SyncStatus that I don't understand.

@Trantorian1 (Collaborator, Author):

SyncStatus is just used to express whether we have reached the tip of the chain. This matters because we are actually synchronizing from two chains during warp updates:

  1. The warp update sender
  2. Whatever feeder gateway we have set with --gateway-url

The issue is that --n-blocks-to-sync will cause sync to finish early, so we need to know whether we have actually reached the tip of the chain; otherwise we do not display '🥳 The sync process has caught up with the tip of the chain'. A rough sketch of the idea is shown below.
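Purely illustrative sketch of that intent; the variant names below are assumptions, and only the "caught up with the tip" vs. "stopped early" distinction is taken from the explanation above.

```rust
/// Hypothetical shape of the status returned by the sync loop.
enum SyncStatus {
    /// Sync reached the tip of the chain it was fetching from.
    UpToTip,
    /// Sync stopped before the tip, e.g. because --n-blocks-to-sync was reached.
    StoppedEarly,
}

fn report(status: SyncStatus) {
    // Only celebrate when we actually caught up with the tip.
    if matches!(status, SyncStatus::UpToTip) {
        println!("🥳 The sync process has caught up with the tip of the chain");
    }
}

fn main() {
    report(SyncStatus::StoppedEarly);
    report(SyncStatus::UpToTip);
}
```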

@Trantorian1 (Collaborator, Author):

sync_n_blocks is actually only used during parallel fetching: this includes warp updates and normal parallel sync, and does not include post-sync catch up. So sync_n_blocks is not called for the polling phase.

let BlockImportResult { header, block_hash } = block_import.verify_apply(block, validation.clone()).await?;

if header.block_number - last_block_n >= flush_every_n_blocks as u64 {
Member:

Re: I really think this should not be here; flushing should definitely not be the concern of l2.rs.

@Trantorian1 (Collaborator, Author):

Where do you suggest we move this instead? This was already being called indirectly in the verify_apply method, if I remember correctly. The issue is that if we are running a full node with only the sync services enabled, we still need a point where new blocks are flushed. Also, I don't quite see how we can share this information across services in an elegant way. Imo it makes the most sense to call flush where information about the block store is directly available, be that sync, block_production, or the mempool.

Member:

I think it makes more sense to call it in block_import? This seems way too low-level to be a concern of l2.rs.

Member:

What do you mean by sharing info in this case?

.multiple(false)
)
)]
pub struct ArgsPresetParams {
Member:

What's this preset thing, and how does it relate to the older presets feature?

@Trantorian1 (Collaborator, Author):

So, basically I realized some configurations were starting to take up a lot of arguments. Initially, warp updates were just set up as a series of predefined arguments (this is still the case for --warp-update-sender), and it seemed really tedious and error-prone for the user to have to remember so many options. All a preset does is set various options in the RunCmd struct after parsing has occurred. This has its downsides, most notably that the user cannot override preset args this way, since presets are evaluated after user input. The advantage is that it streamlines more complex CLI setups with many options and flags (this mostly applies to --warp-update-sender). A rough sketch of the mechanism follows.
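A minimal sketch of the "override after parsing" idea described above, using clap. The field names and the preset values are hypothetical, not the actual RunCmd definition.

```rust
use clap::Parser;

#[derive(Parser)]
struct RunCmd {
    #[arg(long)]
    warp_update_sender: bool,
    #[arg(long, default_value_t = 10)]
    sync_parallelism: u8,
    #[arg(long, default_value_t = 1000)]
    flush_every_n_blocks: u32,
}

impl RunCmd {
    /// Applied after `RunCmd::parse()`, so user-provided values for these fields
    /// are silently overridden, which is the limitation mentioned above.
    fn apply_arg_preset(mut self) -> Self {
        if self.warp_update_sender {
            self.sync_parallelism = 20; // hypothetical preset value
            self.flush_every_n_blocks = 100; // hypothetical preset value
        }
        self
    }
}

fn main() {
    let cmd = RunCmd::parse().apply_arg_preset();
    println!("sync parallelism: {}", cmd.sync_parallelism);
}
```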

@Trantorian1 (Collaborator, Author):

Will be adding this in a doc comment

@@ -201,7 +197,7 @@ async fn main() -> anyhow::Result<()> {

app.start_and_drive_to_end().await?;

tracing::info!("Shutting down analytics");
tracing::info!("🔌 Shutting down analytics...");
Member:

Can you remove this log entirely? It's actually not useful, tbh.

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

> So, I'm not sure I understand the difference between what is checked during regular block import and what is checked when importing a block during warp updates.
>
> Are the block hashes and so on checked in warp mode too?

Yes, the block hashes and state root are checked. This is a bit of a shame, as there are clearly large performance gains to be had by performing a single state root computation, since we know we are synchronizing from a trusted source. This was not implemented as it felt like it was going out of scope (at that point I would be implementing warp sync as well). Still, we are nonetheless cutting down on network latency and increasing the number of blocks synchronized by maximizing CPU parallelism. Some initial benchmarks have shown a 33% improvement in sync time through this alone, so I think this is good enough for now.

In the future, we have some low-hanging fruit left that could make this a lot faster, mainly computing the state root in a single pass from genesis to full sync. But for now I think it is better to move on to other features, most importantly RPC v0.13.3, especially since this feature has already been split between this and another upcoming PR.

Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)
Projects: Status: In review
4 participants