feat(warp): added warp update to madara and other changes #393
base: main
Conversation
This should be handled by an external proxy anyways.
This currently only contains the `MadaraWriteRpc` methods but will be used to encapsulate any sensitive admin methods in the future.
…edundant mp_block::BlockId
This currently only includes warp update sender fgw and admin rpc ports
```diff
-        Ok(will_flush)
+        self.db.flush_cfs_opt(&columns, &opts).context("Flushing database")?;
+
+        Ok(())
```
Not mandatory now but we should add metrics to follow the flush frequency and behavior
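A bare-bones illustration of what tracking flush frequency could look like until proper metrics are wired in (names here are illustrative, and Madara's actual metrics/analytics stack may differ):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Illustrative only: count flushes and log their duration so frequency and
// behavior can be observed. Not the actual Madara metrics integration.
static FLUSH_COUNT: AtomicU64 = AtomicU64::new(0);

fn flush_with_basic_metrics(flush: impl FnOnce() -> anyhow::Result<()>) -> anyhow::Result<()> {
    let started = Instant::now();
    flush()?;
    let total = FLUSH_COUNT.fetch_add(1, Ordering::Relaxed) + 1;
    tracing::debug!("db flush #{} took {:?}", total, started.elapsed());
    Ok(())
}
```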
Why switch Admin methods to release?
You can simplify the fetch task handling by using the run_until_cancelled method provided by tokio_util::sync::CancellationToken. This will cleanly stop the tasks without the need for additional logic.
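For reference, a minimal sketch of what this could look like (the task body and error type are illustrative, not the actual Madara fetch code); `run_until_cancelled` returns `None` once the token is cancelled, so the task stops cleanly without extra select logic:

```rust
use tokio_util::sync::CancellationToken;

async fn fetch_task(token: CancellationToken) -> anyhow::Result<()> {
    match token
        .run_until_cancelled(async {
            // ... fetch blocks in a loop here ...
            Ok::<(), anyhow::Error>(())
        })
        .await
    {
        Some(result) => result, // the fetch loop finished on its own
        None => Ok(()),         // cancellation requested: stop cleanly
    }
}
```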
Was this caught by the CI? If not, what was wrong here and why did the e2e tests pass? The idea of such a big problem not getting caught by the CI makes me worried.
This was in fact not an infinite freeze, but just tracing being disabled after #401.
So, I'm not sure I understand the difference between what's checked during regular block import and what's checked when importing a block during warp updates.
Are the block hashes and so on checked in warp mode too?
README.md (outdated)
```markdown
## ✔ Supported Features

## 🔄 Migration

When migration to a newer version of Madara which introduces breaking changes,
```
When migrating to a newer version of Madara [that introduces breaking changes]
I would probably just remove the latter part of the sentence, tbh; people are not expected to migrate to new versions that don't require upgrading the db?
```diff
@@ -176,11 +170,6 @@ impl BlockImporter {
         validation: BlockValidationContext,
     ) -> Result<BlockImportResult, BlockImportError> {
         let result = self.verify_apply.verify_apply(block, validation).await?;
-        // Flush step.
-        let force = self.always_force_flush;
```
aaah!! no, we really need to flush every block, always, in block_production mode
This has no importance anymore as `maybe_flush` has been removed and will always flush anyway, so `always_force_flush` is no longer needed. This came after I noticed that the only part of the code where we were performing this check was block import, so instead we just do the check there, while other parts of the code call `flush` directly.
```diff
@@ -238,7 +239,13 @@ pub async fn handle_get_block_traces(
     traces: Vec<TransactionTraceWithHash>,
 }

     let traces = v0_7_1_trace_block_transactions(&Starknet::new(backend, add_transaction_provider), block_id).await?;
     // TODO: we should probably use the actual service context here instead of
     // creating a new one!
```
We should probably not implement trace block like this at all, actually; I don't see why the gateway would depend on RPC.
this was a hack afaik
```diff
-        self.backend
-            .maybe_flush(true)
-            .map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
+        self.backend.flush().map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
```
So, just a note here: flushing is the responsibility of the block import crate, and this flushing here is a hack for pending blocks.
I'd like the flushing to be moved back into block import and removed from sync l2.rs once again.
you can keep this hack here of course since it was already present
crates/client/rpc/src/lib.rs (outdated)
```rust
Self {
    backend: Arc::clone(&self.backend),
    add_transaction_provider: Arc::clone(&self.add_transaction_provider),
    ctx: self.ctx.branch(),
```
Does that mean that cloning a Starknet instance will make it a parent to a newly created child?
This feels weird... I wouldn't expect Clone to do stuff like that.
How about just wrapping ServiceContext in an Arc, so that cloning a Starknet actually shares the cancellation token instead of forking it?
Hum, `branch` only clones the `CancellationToken` though, which is itself behind an `Arc`, so I don't think this is necessary.

```rust
pub struct CancellationToken {
    inner: Arc<tree_node::TreeNode>,
}
```
Also, `ServiceContext::branch` does not create any child `CancellationToken`; that is handled by `ServiceContext::branch_id`.
Yes, I did not realize `branch` was a clone, sorry.
No worries, I've updated it to `clone` to make this more obvious :)
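For reference, the crux here is that `CancellationToken::clone` shares the same underlying Arc-backed token rather than creating a child (that is what `child_token` is for). A minimal sketch, assuming only tokio and tokio-util as dependencies:

```rust
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();

    // Cloning shares the same Arc-backed state; it does not fork a child.
    let shared = token.clone();

    token.cancel();
    assert!(shared.is_cancelled()); // the clone observes the cancellation
}
```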
```rust
};

if client.shutdown().await.is_err() {
    tracing::error!("❗ Failed to shutdown warp update sender");
```
isn't this an unrecoverable error?
Yep, will make it one 👍
```rust
    UpTo(u64),
}

async fn sync_blocks(
```
can you add some docs here?
I'm unsure where the parallel catch-up vs polling phases are here.
I think it's `SyncStatus` that I don't understand.
`SyncStatus` is just used to express whether we have reached the tip of the chain. This is significant since we are actually synchronizing from two chains during warp updates:
- The warp update sender
- Whatever feeder gateway we have set with `--gateway-url`

The issue is that `--n-blocks-to-sync` will cause sync to finish early, so we need to be able to know whether we have reached the tip of the chain, or else we do not display '🥳 The sync process has caught up with the tip of the chain'.
`sync_n_blocks` is actually only used during parallel fetching: this includes warp updates and normal parallel sync, and does not include post-sync catch-up. So `sync_n_blocks` is not called for the polling phase.
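For what it's worth, a minimal sketch of the kind of state being described (the variant names here are illustrative, not necessarily the actual definition in the PR):

```rust
// Illustrative only: expresses whether the fetch phase reached the tip of
// the chain or stopped early (for example because of --n-blocks-to-sync).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncStatus {
    /// We caught up with the tip of the chain, so the
    /// "🥳 The sync process has caught up with the tip of the chain"
    /// message should be displayed.
    UpToDate,
    /// Sync finished before reaching the tip (a block limit was hit),
    /// so the message should not be displayed.
    StoppedEarly,
}
```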
```rust
let BlockImportResult { header, block_hash } = block_import.verify_apply(block, validation.clone()).await?;

if header.block_number - last_block_n >= flush_every_n_blocks as u64 {
```
Re: I really think this should not be here; flushing should definitely not be the concern of l2.rs.
Where do you suggest we move this instead? This was already called indirectly in the `verify_apply` method, if I remember correctly. The issue is that if we are running a full node with only sync services enabled, we still need to have a point where new blocks are flushed. Also, I don't quite see how we can share information across services in an elegant way. Imo it makes most sense to call `flush` where information about the block store is directly available, be that `sync`, `block_production`, or the `mempool`.
I think it makes more sense to call it in block_import? This seems way too low-level to be a concern of l2.rs.
What do you mean by sharing info in this case?
```rust
        .multiple(false)
    )
)]
pub struct ArgsPresetParams {
```
What's this preset thing, and how does it relate to the older presets feature?
So, basically I realized some configs were starting to take up a lot of arguments. Initially, warp updates were just set up as a series of predefined arguments (this is still the case for `--warp-update-sender`), and it seemed really tedious and error-prone for the user to have to remember so many options. All this does is set various options in the `RunCmd` struct after parsing has occurred. This has its downsides, most notably that the user cannot override args presets this way, as they are evaluated after user inputs. The advantage is that it streamlines more complex cli setups with many options and flags (this mostly applies to `--warp-update-sender`).
Will be adding this in a doc comment
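To make the mechanism concrete, here is a rough sketch of a post-parse preset; `RunCmd` exists in Madara, but the fields and the method shown below are hypothetical stand-ins, not the actual definitions:

```rust
// Hypothetical, stripped-down stand-in for Madara's RunCmd.
#[derive(Debug, Default)]
struct RunCmd {
    feeder_gateway_enable: bool,
    rpc_admin: bool,
    sync_parallelism: u8,
}

impl RunCmd {
    // Applied after clap parsing, so it overrides whatever the user passed
    // for these options (which is the downside mentioned above).
    fn apply_warp_update_sender_preset(mut self) -> Self {
        self.feeder_gateway_enable = true;
        self.rpc_admin = true;
        self.sync_parallelism = 10;
        self
    }
}

fn main() {
    let cmd = RunCmd::default().apply_warp_update_sender_preset();
    println!("{cmd:?}");
}
```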
```diff
@@ -201,7 +197,7 @@ async fn main() -> anyhow::Result<()> {

     app.start_and_drive_to_end().await?;

-    tracing::info!("Shutting down analytics");
+    tracing::info!("🔌 Shutting down analytics...");
```
Can you remove this log entirely? It's actually not useful, tbh.
Yes, the block hashes and state root are checked. This is a bit of a shame, as there clearly are lots of performance gains to be had in performing a single state root computation, since we know we are synchronizing from a trusted source. This was not implemented as it felt like it was going a bit out of scope (at that point I would be implementing warp sync as well). Still, we are nonetheless cutting down on network latency and increasing the number of blocks synchronized by maximizing CPU parallelism. Some initial benchmarks have shown a 33% improvement in sync time through this alone, so I think this is good enough for now. In the future, we have some low-hanging fruit left which could make this a lot faster, mainly computing the state root in a single go from genesis to full sync. But for now I think it'd be better to move on to other features, most importantly RPC v0.13.3, especially since this feature has already been divided between this and another upcoming PR.
Pull Request type
What is the current behavior?
There currently is no way to migrate the Madara db on a breaking change. This PR fixes that, adding some quality-of-life features in the process.
What is the new behavior?
- Added warp update, which allows Madara to migrate its database at speeds much faster than a re-sync from genesis. This works by running a local fgw node and a second node which syncs from it with higher-than-normal parallelization. This has been documented in the README, where you can find more detailed information on how to test this feature.
- Added cli parameter presets. This is quite primitive atm and only functions by overriding other cli args after they have been parsed. This sadly means the user cannot currently override options set in the presets (also this is all very manual). Imo it is still a good start, though, in making some common args configurations such as rpc and gateway (or just lengthy ones like warp sync) easier to use.
- Db flush no longer occurs on a fixed-interval timer of 5s. Instead, `maybe_flush` has been renamed to `flush` and is called every `n` blocks in the l2 sync block storage process. This can be set through the cli arg `--flush-every-n-blocks` and defaults to 1000.
- Added the option to increase sync parallelism. This is used by warp sync and is available under the `--sync-parallelism` cli flag, which defaults to 10 (meaning 10 blocks will be fetched in parallel during l2 sync).

Some important changes have also been made to the rpc admin endpoints:

- Added a new admin endpoint, `stop_node`, which can be used to stop Madara remotely. This is used by warp update to stop the warp update sender once the receiver has finished synchronizing.
- Added a new admin endpoint, `ping`, which just returns the unix time at which it was received. This can be used to check if a node is active or to calculate network latency (see the sketch after this list).
- Added various new admin endpoints which allow controlling the capabilities of the node (for example stopping / restarting sync). This is quite basic atm and has only been implemented for sync and user-facing rpc services. It is sadly not possible atm to start services which were not available on startup.
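As a rough illustration of the `ping` semantics described above (not the actual Madara handler), the endpoint essentially returns the unix time at which the request was handled:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative sketch only: returns the unix time (in seconds) at which the
// call was handled, which lets a caller check liveness or estimate latency.
fn ping() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the unix epoch")
        .as_secs()
}

fn main() {
    println!("pong: {}", ping());
}
```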
Note
There are currently two main issues I have with the current service architecture:

I am thinking of how to best approach this. Currently, I am favoring a monitor solution where `ServiceGroup` also keeps in memory the necessary information to start (and restart) a service. There also needs to be some mechanism for the `ServiceGroup` to be able to cleanly pause the execution of a service, but I am not sure of the best way to go about this yet.

On the rpc front: our jsonrpsee methods are of the form `namespace_version_method`, whereas our version middleware only accepts requests of the form `namespace_method`, with the version being specified in the header. This has been changed to allow for both formats. This is only useful for us to be able to make rpc calls from one node to the other, but the alternative would have been to do this manually. I think this is a better solution, as jsonrpsee provides type checking and function signatures for each of our methods, which should reduce errors.
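To illustrate the idea of accepting both formats (this is a sketch only; the real middleware, version strings, and method names in Madara may differ):

```rust
// Accept both "namespace_method" (version taken from a header) and
// "namespace_<version>_method" (version embedded in the method name), and
// normalize them to a single (version, "namespace_method") form.
const KNOWN_VERSIONS: &[&str] = &["v0_7_1", "v0_8_0"]; // hypothetical list

fn normalize_method(method: &str, header_version: &str) -> (String, String) {
    let (namespace, rest) = method.split_once('_').unwrap_or((method, ""));
    for &version in KNOWN_VERSIONS {
        if let Some(bare) = rest.strip_prefix(version).and_then(|r| r.strip_prefix('_')) {
            // e.g. "starknet_v0_7_1_getBlockWithTxHashes"
            return (version.to_string(), format!("{namespace}_{bare}"));
        }
    }
    // No embedded version: fall back to the header-provided one.
    (header_version.to_string(), method.to_string())
}

fn main() {
    assert_eq!(
        normalize_method("starknet_v0_7_1_getBlockWithTxHashes", "v0_8_0"),
        ("v0_7_1".to_string(), "starknet_getBlockWithTxHashes".to_string())
    );
    assert_eq!(
        normalize_method("starknet_getBlockWithTxHashes", "v0_8_0"),
        ("v0_8_0".to_string(), "starknet_getBlockWithTxHashes".to_string())
    );
}
```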
Does this introduce a breaking change?
No.