feat(warp): added warp update to madara and other changes #393

Open: Trantorian1 wants to merge 79 commits into main.
Conversation

@Trantorian1 (Collaborator) commented Nov 21, 2024:

Pull Request type

  • Feature
  • Refactoring (no functional changes, no API changes)
  • Documentation content changes

What is the current behavior?

There is currently no way to migrate the Madara db across a breaking change. This PR fixes that, adding some quality-of-life features in the process.

What is the new behavior?

  • Added warp update, which allows Madara to migrate its database much faster than a re-sync from genesis. It works by running a local feeder gateway (FGW) node and a second node which syncs from it with higher-than-normal parallelization. This is documented in the README, where you can find more detailed instructions on how to test the feature.

  • Added CLI argument presets. This is quite primitive at the moment and only works by overriding other CLI args after they have been parsed. This sadly means the user cannot currently override options set by a preset (and the whole mechanism is still very manual). Imo it is still a good start towards making common argument configurations such as RPC or gateway (or just lengthy ones like warp sync) easier to use.

  • The db flush no longer occurs on a fixed 5s interval timer. Instead, maybe_flush has been renamed to flush and is called every n blocks in the l2 sync block storage process. The interval can be set through the CLI arg --flush-every-n-blocks and defaults to 1000 (see the sketch after this list).

  • Added the option to increase sync parallelism. This is used by warp sync and is available under the --sync-parallelism CLI flag, which defaults to 10 (meaning 10 blocks are fetched in parallel during l2 sync).
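For illustration, here is a minimal, self-contained sketch of the flush cadence described above. The `Backend::flush` name mirrors the renamed method, but the surrounding loop and the other names are assumptions, not the actual Madara code.

```rust
use anyhow::Result;

/// Stand-in for the Madara database backend; only the renamed `flush` matters here.
struct Backend;

impl Backend {
    fn flush(&self) -> Result<()> {
        // In Madara this would flush the RocksDB column families.
        Ok(())
    }
}

/// Flush once every `flush_every_n_blocks` blocks instead of on a 5s timer.
fn store_blocks(backend: &Backend, block_numbers: &[u64], flush_every_n_blocks: u64) -> Result<()> {
    let mut last_flushed = 0u64;
    for &block_n in block_numbers {
        // ... import and store the block here ...
        if block_n - last_flushed >= flush_every_n_blocks {
            backend.flush()?;
            last_flushed = block_n;
        }
    }
    Ok(())
}

fn main() -> Result<()> {
    let backend = Backend;
    // With the default of 1000, this flushes at blocks 1000, 2000, 3000 and 4000.
    store_blocks(&backend, &(0u64..5000).collect::<Vec<_>>(), 1000)
}
```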

Some important changes have also been made to the RPC admin endpoints:

  • Added a new admin endpoint, stop_node, which can be used to stop Madara remotely. This is used by warp update to stop the warp update sender once the receiver has finished synchronizing.

  • Added a new admin endpoint, ping, which returns the unix time at which the request was received. This can be used to check whether a node is active or to estimate network latency.

  • Added various new admin endpoints which allow controlling the capabilities of the node (for example stopping / restarting sync). This is quite basic at the moment and has only been implemented for sync and the user-facing RPC services. It is currently not possible to start services which were not available at startup. A minimal client sketch follows below.
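As a rough illustration, the new endpoints can be called with a jsonrpsee client. The exact method names (`madara_ping`, `madara_stopNode`) and the admin port used here are assumptions; check the actual admin RPC definitions before relying on them.

```rust
use jsonrpsee::core::client::ClientT;
use jsonrpsee::http_client::HttpClientBuilder;
use jsonrpsee::rpc_params;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Hypothetical admin RPC address.
    let client = HttpClientBuilder::default().build("http://127.0.0.1:9944")?;

    // `ping` returns the unix time at which the request was received,
    // which can double as a liveness check or a crude latency probe.
    let received_at: u64 = client.request("madara_ping", rpc_params![]).await?;
    println!("node is up, request received at unix time {received_at}");

    // `stop_node` shuts the node down remotely (warp update uses this to stop the sender).
    let _: () = client.request("madara_stopNode", rpc_params![]).await?;

    Ok(())
}
```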

Note

There are two main issues I have with the current service architecture:

  1. It is not possible to start a service later on in the execution of the node.
  2. Interruption with graceful shutdown can occur at any time during execution; beyond that, we have no hook to simply pause the execution of a service.

I am thinking about how best to approach this. Currently, I am favoring a monitor solution where ServiceGroup also keeps in memory the information necessary to start (and restart) a service. There also needs to be some mechanism for the ServiceGroup to cleanly pause the execution of a service, but I am not sure of the best way to go about this yet. A rough sketch of the direction is shown below.
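Purely as a sketch of the monitor idea (all names here are hypothetical and nothing below exists in the codebase yet): the group would keep a factory per service so it can be started or restarted on demand, plus a cancellation handle for stopping it.

```rust
use std::collections::HashMap;
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

// A factory kept in memory so a service can be (re)started later on.
type ServiceFactory = Box<dyn Fn(CancellationToken) -> JoinHandle<()> + Send>;

struct ServiceMonitor {
    factories: HashMap<&'static str, ServiceFactory>,
    running: HashMap<&'static str, (CancellationToken, JoinHandle<()>)>,
}

impl ServiceMonitor {
    /// Start (or restart) a registered service by name.
    fn start(&mut self, name: &'static str) {
        if let Some(factory) = self.factories.get(name) {
            let token = CancellationToken::new();
            let handle = factory(token.clone());
            self.running.insert(name, (token, handle));
        }
    }

    /// Stop a running service by cancelling its token and waiting for it to exit.
    async fn stop(&mut self, name: &'static str) {
        if let Some((token, handle)) = self.running.remove(name) {
            token.cancel();
            let _ = handle.await;
        }
    }
}
```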

On the RPC front:

  • Updated the RPC versioning middleware to accept RPC methods sent with jsonrpsee. Due to our versioning scheme, jsonrpsee sends requests of the form namespace_version_method, whereas our versioning middleware only accepted requests of the form namespace_method, with the version specified in the header. The middleware has been changed to allow both formats. This is only needed so that one node can make RPC calls to another; the alternative would have been to craft these requests manually. I think this is the better solution, as jsonrpsee provides type checking and function signatures for each of our methods, which should reduce errors. A parsing sketch is shown below.
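A hedged sketch of what accepting both formats could look like; the exact version pattern and the helper name are assumptions and the real middleware may differ.

```rust
use regex::Regex;

/// Accept both "namespace_method" (version supplied in a header) and
/// "namespace_vX_Y_Z_method" (version embedded in the name, as jsonrpsee sends it).
fn extract_version_and_method(method: &str) -> (Option<String>, String) {
    // Illustrative version pattern only; adjust to the real versioning scheme.
    let re = Regex::new(r"^(?P<ns>[a-z]+)_v(?P<ver>\d+_\d+_\d+)_(?P<m>\w+)$").unwrap();
    match re.captures(method) {
        // Embedded version: strip it out and hand the version back to the middleware.
        Some(caps) => (Some(caps["ver"].to_string()), format!("{}_{}", &caps["ns"], &caps["m"])),
        // No embedded version: fall back to the header-based path.
        None => (None, method.to_string()),
    }
}

fn main() {
    assert_eq!(
        extract_version_and_method("starknet_v0_7_1_getBlockWithTxs"),
        (Some("0_7_1".to_string()), "starknet_getBlockWithTxs".to_string())
    );
    assert_eq!(
        extract_version_and_method("starknet_getBlockWithTxs"),
        (None, "starknet_getBlockWithTxs".to_string())
    );
}
```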

Refactor

  • Finally, the various methods in the l2 sync section of Madara have been refactored to be more readable.

Does this introduce a breaking change?

No.

Trantorian1 and others added 30 commits (November 18, 2024, 12:01). Selected commit messages:

  • This should be handled by an external proxy anyways.
  • This currently only contains the `MadaraWriteRpc` methods but will be used to encapsulate any sensitive admin methods in the future.
@Trantorian1 marked this pull request as ready for review on November 26, 2024, 09:07.
Ok(will_flush)
self.db.flush_cfs_opt(&columns, &opts).context("Flushing database")?;

Ok(())
Member:

Not mandatory now but we should add metrics to follow the flush frequency and behavior

crates/client/eth/src/l1_gas_price.rs (comment resolved)
@antiyro (Member) commented Nov 27, 2024:

[Screenshot: 2024-11-27 at 10:02]

I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`

@Trantorian1 (Collaborator, Author):

> I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`

Will be checking that out

@jbcaron (Member) left a comment:

Why switch Admin methods to release?

You can simplify the fetch task handling by using the run_until_cancelled method provided by tokio_util::sync::CancellationToken. This will cleanly stop the tasks without the need for additional logic.
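For reference, a minimal sketch of that suggestion (the fetch logic here is a placeholder, not Madara's actual fetch task):

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

// Placeholder for the real block-fetching work.
async fn fetch_next_block() {
    tokio::time::sleep(Duration::from_millis(100)).await;
}

async fn fetch_task(token: CancellationToken) {
    loop {
        // `run_until_cancelled` returns None if the token is cancelled before the
        // future completes, so the loop exits cleanly with no extra select! logic.
        if token.run_until_cancelled(fetch_next_block()).await.is_none() {
            break;
        }
    }
}

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let handle = tokio::spawn(fetch_task(token.clone()));
    token.cancel();
    let _ = handle.await;
}
```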

@antiyro (Member) commented Nov 27, 2024:

[Screenshot: 2024-11-27 at 14:20]

Problems when sending SIGINT to the sync: sometimes it is not detected, or there is no graceful shutdown and the process segfaults.

@Trantorian1 (Collaborator, Author):

> Problems when sending SIGINT to the sync: sometimes it is not detected, or there is no graceful shutdown and the process segfaults.

This has been fixed

@cchudant (Member):

> I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`

Was this caught by the CI? If not, what was wrong here and why did the e2e tests pass? The idea of such a big problem not getting caught by the CI makes me worried.

@Trantorian1 (Collaborator, Author):

> I get infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`
>
> Was this caught by the CI? If not, what was wrong here and why did the e2e tests pass? The idea of such a big problem not getting caught by the CI makes me worried.

This was in fact not an infinite freeze, just tracing being disabled after #401.

@cchudant (Member) left a comment:

So, I'm not sure I understand the difference between what is checked during regular block import and what is checked when importing a block during warp updates.

Are the block hashes and so on checked in warp mode too?

README.md Outdated
## ✔ Supported Features
## 🔄 Migration

When migration to a newer version of Madara which introduces breaking changes,
Member:

When migrating to a newer version of Madara [that introduces breaking changes]
I would probably simply remove the latter part of the sentence, tbh; people are not expected to migrate to new versions that don't require upgrading the db?

@@ -176,11 +170,6 @@ impl BlockImporter {
validation: BlockValidationContext,
) -> Result<BlockImportResult, BlockImportError> {
let result = self.verify_apply.verify_apply(block, validation).await?;
// Flush step.
let force = self.always_force_flush;
Member:

aaah!! no, we really need to flush every block, always, in block_production mode

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

This no longer matters, as maybe_flush has been removed and we now always flush, so always_force_flush is no longer needed. I noticed that the only place in the code where we performed this check was in the block import, so we just do the check there, while other parts of the code call flush directly.

@@ -238,7 +239,13 @@ pub async fn handle_get_block_traces(
traces: Vec<TransactionTraceWithHash>,
}

let traces = v0_7_1_trace_block_transactions(&Starknet::new(backend, add_transaction_provider), block_id).await?;
// TODO: we should probably use the actual service context here instead of
// creating a new one!
Member:

We should probably not implement trace block like this at all, actually; I don't see why the gateway would depend on RPC.

@cchudant (Member) commented Nov 28, 2024:

this was a hack afaik

self.backend
.maybe_flush(true)
.map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
self.backend.flush().map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
Member:

Just a note here: flushing is the responsibility of the block import crate, and this flushing here is a hack for pending blocks. I'd like the flushing to be moved back into block import and removed from sync's l2.rs once again.

Member:

you can keep this hack here of course since it was already present

Self {
backend: Arc::clone(&self.backend),
add_transaction_provider: Arc::clone(&self.add_transaction_provider),
ctx: self.ctx.branch(),
Member:

Does that mean that cloning a Starknet instance will make it a parent to a newly created child? This feels weird; I wouldn't expect Clone to do things like that.

How about just wrapping ServiceContext in an Arc, so that cloning a Starknet actually shares the cancellation token instead of forking it?

@Trantorian1 (Collaborator, Author):

Hmm, branch only clones the CancellationToken though, which is itself behind an Arc, so I don't think this is necessary.

pub struct CancellationToken {
    inner: Arc<tree_node::TreeNode>,
}
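For context, a small self-contained example of the distinction being discussed here, using tokio_util's public CancellationToken API (cloning shares the inner Arc, while child_token creates a new child node):

```rust
use tokio_util::sync::CancellationToken;

fn main() {
    let parent = CancellationToken::new();

    // Cloning shares the same underlying node, so both handles observe the same state.
    let shared = parent.clone();

    // child_token() creates a child: cancelling the parent cancels the child,
    // but cancelling the child would not cancel the parent.
    let child = parent.child_token();

    parent.cancel();
    assert!(shared.is_cancelled());
    assert!(child.is_cancelled());
}
```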

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

Also, ServiceContext::branch does not create any child CancellationToken; that is handled by ServiceContext::branch_id.

Member:

Yes, I did not realize branch was a clone, sorry.

@Trantorian1 (Collaborator, Author):

No worries, I've updated it to clone to make this more obvious :)

};

if client.shutdown().await.is_err() {
tracing::error!("❗ Failed to shutdown warp update sender");
Member:

isn't this an unrecoverable error?

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

Yep, will make it one 👍

UpTo(u64),
}

async fn sync_blocks(
Member:

can you add some docs here?

Member:

I'm unsure where the parallel catch-up vs. polling phases are here.

Member:

I think it's SyncStatus that I don't understand.

@Trantorian1 (Collaborator, Author):

SyncStatus is just used to express whether we have reached the tip of the chain. This matters because we are actually synchronizing from two chains during warp updates:

  1. The warp update sender
  2. Whatever feeder gateway we have set with --gateway-url

The issue is that --n-blocks-to-sync will cause sync to finish early, so we need to know whether we have actually reached the tip of the chain; otherwise we do not display '🥳 The sync process has caught up with the tip of the chain'. A rough sketch of the idea is shown below.
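Purely illustrative sketch of that intent; the variant names below are assumptions, and only the "caught up with the tip" vs. "stopped early" distinction is taken from the explanation above.

```rust
/// Hypothetical shape of the status returned by the sync loop.
enum SyncStatus {
    /// Sync reached the tip of the chain it was fetching from.
    UpToTip,
    /// Sync stopped before the tip, e.g. because --n-blocks-to-sync was reached.
    StoppedEarly,
}

fn report(status: SyncStatus) {
    // Only celebrate when we actually caught up with the tip.
    if matches!(status, SyncStatus::UpToTip) {
        println!("🥳 The sync process has caught up with the tip of the chain");
    }
}

fn main() {
    report(SyncStatus::StoppedEarly);
    report(SyncStatus::UpToTip);
}
```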

@Trantorian1 (Collaborator, Author):

sync_n_blocks is actually only used during parallel fetching: this includes warp updates and normal parallel sync, and does not include post-sync catch up. So sync_n_blocks is not called for the polling phase.

let BlockImportResult { header, block_hash } = block_import.verify_apply(block, validation.clone()).await?;

if header.block_number - last_block_n >= flush_every_n_blocks as u64 {
Member:

Re: I really think this should not be here; flushing should definitely not be the concern of l2.rs.

@Trantorian1 (Collaborator, Author):

Where do you suggest we move this instead? This was already being called indirectly in the verify_apply method, if I remember correctly. The issue is that if we are running a full node with only the sync services enabled, we still need a point where new blocks are flushed. Also, I don't quite see how we can share this information across services in an elegant way. Imo it makes the most sense to call flush where information about the block store is directly available, be that sync, block_production, or the mempool.

Member:

I think it makes more sense to call it in block_import? This seems way too low-level to be a concern of l2.rs.

Member:

What do you mean by sharing info in this case?

.multiple(false)
)
)]
pub struct ArgsPresetParams {
Member:

What's this preset thing, and how does it relate to the older presets feature?

@Trantorian1 (Collaborator, Author):

So, basically I realized some configurations were starting to take up a lot of arguments. Initially, warp updates were just set up as a series of predefined arguments (this is still the case for --warp-update-sender), and it seemed really tedious and error-prone for the user to have to remember so many options. All a preset does is set various options in the RunCmd struct after parsing has occurred. This has its downsides, most notably that the user cannot override preset args this way, since presets are evaluated after user input. The advantage is that it streamlines more complex CLI setups with many options and flags (this mostly applies to --warp-update-sender). A rough sketch of the mechanism follows.
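A minimal sketch of the "override after parsing" idea described above, using clap. The field names and the preset values are hypothetical, not the actual RunCmd definition.

```rust
use clap::Parser;

#[derive(Parser)]
struct RunCmd {
    #[arg(long)]
    warp_update_sender: bool,
    #[arg(long, default_value_t = 10)]
    sync_parallelism: u8,
    #[arg(long, default_value_t = 1000)]
    flush_every_n_blocks: u32,
}

impl RunCmd {
    /// Applied after `RunCmd::parse()`, so user-provided values for these fields
    /// are silently overridden, which is the limitation mentioned above.
    fn apply_arg_preset(mut self) -> Self {
        if self.warp_update_sender {
            self.sync_parallelism = 20; // hypothetical preset value
            self.flush_every_n_blocks = 100; // hypothetical preset value
        }
        self
    }
}

fn main() {
    let cmd = RunCmd::parse().apply_arg_preset();
    println!("sync parallelism: {}", cmd.sync_parallelism);
}
```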

@Trantorian1 (Collaborator, Author):

Will be adding this in a doc comment

@@ -201,7 +197,7 @@ async fn main() -> anyhow::Result<()> {

app.start_and_drive_to_end().await?;

tracing::info!("Shutting down analytics");
tracing::info!("🔌 Shutting down analytics...");
Member:

Can you remove this log entirely? It's actually not useful, tbh.

@Trantorian1 (Collaborator, Author) commented Nov 29, 2024:

> So, I'm not sure I understand the difference between what is checked during regular block import and what is checked when importing a block during warp updates.
>
> Are the block hashes and so on checked in warp mode too?

Yes, the block hashes and state root are checked. This is a bit of a shame, as there are clearly large performance gains to be had by performing a single state root computation, since we know we are synchronizing from a trusted source. This was not implemented as it felt like it was going out of scope (at that point I would be implementing warp sync as well). Still, we are nonetheless cutting down on network latency and increasing the number of blocks synchronized by maximizing CPU parallelism. Some initial benchmarks have shown a 33% improvement in sync time through this alone, so I think this is good enough for now.

In the future, we have some low-hanging fruit left that could make this a lot faster, mainly computing the state root in a single pass from genesis to full sync. But for now I think it is better to move on to other features, most importantly RPC v0.13.3, especially since this feature has already been split between this and another upcoming PR.

Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)
Projects: Status: In review
4 participants