[Bug] Client Node Block Syncing Fails Due to Current/Next Round Comparison #4

fulltimemike · 2024-04-11T15:54:02Z

🐛 Bug Report

Sometimes, after a client node is restarted, the following error appears: "The next block (X) is invalid - Failed to speculate on transactions - Failed to post-ratify - Next round Y must be greater than current round Y". Once this error occurs, the client stops syncing, and further restarts do not fix it. To resume syncing, the client ledger must be modified: either the ledger is reset so the client can resync from genesis, or a snapshot is loaded into the client.
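For readers unfamiliar with the check behind this message, here is a minimal sketch of a strict round-monotonicity test that would produce the same wording. It is not the snarkVM source: the function name `check_next_round` and its signature are hypothetical, and only the error text and the round number 252154 come from this report.

```rust
/// Hypothetical helper, not taken from snarkVM: rejects a candidate block
/// whose round does not strictly exceed the round of the current ledger tip.
fn check_next_round(current_round: u64, next_round: u64) -> Result<(), String> {
    if next_round <= current_round {
        // Mirrors the wording of the error reported in this issue.
        return Err(format!(
            "Next round {} must be greater than current round {}",
            next_round, current_round
        ));
    }
    Ok(())
}

fn main() {
    // Equal rounds (the "Y ... Y" case from the report) are rejected.
    println!("{:?}", check_next_round(252_154, 252_154));
    // A strictly larger round is accepted.
    println!("{:?}", check_next_round(252_154, 252_155));
}
```

Under this assumption, a block whose round equals the round already stored as the ledger tip would be rejected on every attempt, which is consistent with restarts not clearing the error.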

I'm uncertain whether this bug is directly in snarkOS, or if it is a problem with snarkVM. The specific error is thrown here.

Logs directly before the bug is thrown.

Interestingly, in this example the blocks and rounds being logged and added to the ledger (block: 185032, round: 412383) are much further ahead than the block and round identified in the error (block: 111196, round: 252154). I'm not sure why the store appears to be adding earlier rounds and blocks when it has already passed that point.

Steps to Reproduce

Across multiple canary net client nodes, we have observed behavior where restarting the node causes syncing to fail. This bug is nondeterministic, but we have seen that restarting a client node enough times will cause the error to pop up. It may be necessary for the client to be actively syncing during restarts to cause this bug, but I can't be certain.

Expected Behavior

Restarting a client node should not cause the client to get stuck permanently when syncing.

Your Environment

The node runs on an EC2 Linux machine, using a fork of snarkOS with commits up to AleoNet@6aba25d.

Meshiest · 2024-04-22T16:24:45Z

Flat lines in this chart indicate the issue occurring

network topology:

10 validator devnet on AWS c6a.8xlarges
0 clients
no dedicated tx cannon

Reproduce with some automation that resets the ledger of the same 2 validators every 30 minutes.

As early as within the first 500 blocks we frequently run into this issue on either or both of the 2 reset validators after reaching tip.

logs in gdrive

notes:

we are running a wrapper around snarkos to make checkpoints but the core snarkos code is only modified with the canary patch
rebooting from this state usually results in a "missing block hash" corrupted ledger error
we were able to reproduce this by locally running 10 validators on the same machine

raychu86 · 2024-04-23T01:14:11Z

We have a tentative fix for this issue for validators - https://github.com/AleoHQ/snarkOS/pull/3232. The fix is currently undergoing burn-in testing and internal verification.

fulltimemike added the "bug" label (Something isn't working) on Apr 11, 2024.
