-
Notifications
You must be signed in to change notification settings - Fork 860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix possible incomplete txpool restore from dump file #7991
Fix possible incomplete txpool restore from dump file #7991
Conversation
LOG.debug("Cancelling in progress read operation"); | ||
isCancelled.set(true); | ||
try { | ||
readInProgress.get().get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not very clear in the code why this line is cancelling the read operation. Maybe it's worth it to put this call into a method with a descriptive name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readInProgress.get().get()
wait for the completable future to finish, introduced a method to describe its meaning
fb3556d
to
ec5f053
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you don't just keep the current dump if the restore is canceled?
The current dump is kept, but already restored txs are removed from it, to avoid duplicates. In detail when a restore is cancelled, a subset of txs is added to the pool, along with other txs that have arrived via api/p2p, so when the restore is cancelled because the txpool is disabled, the restored txs are removed from the dump file, that is not deleted, and then the txpool is saved to disk, and in this operation the already restored and the new txs are appended to the existing dump file without duplications. |
ethereum/eth/src/main/java/org/hyperledger/besu/ethereum/eth/transactions/TransactionPool.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a changelog entry
df85577
to
78867d7
Compare
78867d7
to
03e28e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't really comment on the meaningfulness of the changes, but LGTM, with some nits about var
usage
@@ -645,15 +646,16 @@ public CompletableFuture<Void> setDisabled() { | |||
isPoolEnabled.set(false); | |||
subscribeConnectId.ifPresent(ethContext.getEthPeers()::unsubscribeConnect); | |||
pendingTransactionsListenersProxy.unsubscribe(); | |||
final PendingTransactions pendingTransactionsToSave = pendingTransactions; | |||
final var saveOperation = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to see a concrete type here rather than var
. It's not entirely clear what type will be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, I always hope the variable name is talkative enough to avoid the verbosity
@@ -839,41 +864,77 @@ private void executeLoadFromDisk() { | |||
LOG.info("Loading transaction pool content from file {}", saveFile); | |||
try (final BufferedReader br = | |||
new BufferedReader(new FileReader(saveFile, StandardCharsets.US_ASCII))) { | |||
final IntSummaryStatistics stats = | |||
final var stats = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to see the concrete type here rather than var
as it's not clear what the type would be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
03e28e7
to
c04fd4a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
} catch (ExecutionException ee) { | ||
// nothing to do | ||
} | ||
} | ||
|
||
isCancelled.set(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be an atomic boolean still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess yes because it's still used elsewhere
Signed-off-by: Fabio Di Fabio <[email protected]>
c04fd4a
to
9f6a23b
Compare
Signed-off-by: Fabio Di Fabio <[email protected]>
PR description
it could happen, especially at startup, that the node switch from in sync to out of sync quickly, and if that happens during the txpool restore from the dump file, the restore could be incomplete.
This is an log example of this issue:
So at the beginning the node was restoring the txpool (1.),
but while this restore was still in progress, the node went out of sync and the txpool was disabled(2.),
the issue manifest itself at this moment, since with the txpool disabled, restore txs are just ignored but technically the restore was completed successfully and the dump file removed, with only a subset of the txs restored (3.),
last part of disabling the txpool, consist in saving its content to the dump file (4.),
so the dump file is created with the current content of the txpool, that only has the subset of txs loaded from point 3 (5.)
Then after the node get in sync again, the process restart, but this time the dump file contains only a subset of the original txs set (6., 7., 8.)
To fix this issue, when disabling the txpool, we make sure that any in progress restore operation is cancelled correctly, removing from the dump file only the txs that have been restored, and changing the save operation to append to an existing dump file, so that no save txs are lost in case of a partial restore.
Below the new log with the fix
Note that restore operation is correctly stopped when the node goes out of sync (3.),
the save operation appends to the partially restored dump file (4., 5.)
and when the node goes in sync again the restore operation read the full set of txs from the dump, and deleted the file at the end (8.)
Fixed Issue(s)
Thanks for sending a pull request! Have you done the following?
doc-change-required
label to this PR if updates are required.Locally, you can run these tests to catch failures early:
./gradlew build
./gradlew acceptanceTest
./gradlew integrationTest
./gradlew ethereum:referenceTests:referenceTests