Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible incomplete txpool restore from dump file #7991

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

fab-10
Copy link
Contributor

@fab-10 fab-10 commented Dec 5, 2024

PR description

it could happen, especially at startup, that the node switch from in sync to out of sync quickly, and if that happens during the txpool restore from the dump file, the restore could be incomplete.

This is an log example of this issue:

1. Loading transaction pool content from file /data/besu/txpool.dump
2. Node out of sync, disabling transaction handling
3. Added 88 transactions of 442 loaded from file /data/besu/txpool.dump
4. Saving 88 transactions to file /data/besu/txpool.dump
5. Saved 88 transactions to file /data/besu/txpool.dump
6. Node is in sync, enabling transaction handling
7. Loading transaction pool content from file /data/besu/txpool.dump
8. Added 87 transactions of 88 loaded from file /data/besu/txpool.dump

So at the beginning the node was restoring the txpool (1.),
but while this restore was still in progress, the node went out of sync and the txpool was disabled(2.),
the issue manifest itself at this moment, since with the txpool disabled, restore txs are just ignored but technically the restore was completed successfully and the dump file removed, with only a subset of the txs restored (3.),
last part of disabling the txpool, consist in saving its content to the dump file (4.),
so the dump file is created with the current content of the txpool, that only has the subset of txs loaded from point 3 (5.)
Then after the node get in sync again, the process restart, but this time the dump file contains only a subset of the original txs set (6., 7., 8.)

To fix this issue, when disabling the txpool, we make sure that any in progress restore operation is cancelled correctly, removing from the dump file only the txs that have been restored, and changing the save operation to append to an existing dump file, so that no save txs are lost in case of a partial restore.

Below the new log with the fix

1. Loading transaction pool content from file /data/besu/txpool.dump
2. Node out of sync, disabling transaction handling
3. Added 94 transactions of 94 loaded from file /data/besu/txpool.dump, before operation was cancelled
4. Appending 94 transactions to file /data/besu/txpool.dump
5. Appended 94 transactions to file /data/besu/txpool.dump
6. Node is in sync, enabling transaction handling
7. Loading transaction pool content from file /data/besu/txpool.dump
8. Added 248 transactions of 348 loaded from file /data/besu/txpool.dump, deleting file

Note that restore operation is correctly stopped when the node goes out of sync (3.),
the save operation appends to the partially restored dump file (4., 5.)
and when the node goes in sync again the restore operation read the full set of txs from the dump, and deleted the file at the end (8.)

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests

@fab-10 fab-10 self-assigned this Dec 5, 2024
@fab-10 fab-10 marked this pull request as ready for review December 5, 2024 13:36
LOG.debug("Cancelling in progress read operation");
isCancelled.set(true);
try {
readInProgress.get().get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not very clear in the code why this line is cancelling the read operation. Maybe it's worth it to put this call into a method with a descriptive name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readInProgress.get().get() wait for the completable future to finish, introduced a method to describe its meaning

@fab-10 fab-10 force-pushed the fix-restoring-txpool-from-dump branch from fb3556d to ec5f053 Compare December 5, 2024 14:41
Copy link
Contributor

@pinges pinges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you don't just keep the current dump if the restore is canceled?

@fab-10
Copy link
Contributor Author

fab-10 commented Dec 6, 2024

Is there a reason why you don't just keep the current dump if the restore is canceled?

The current dump is kept, but already restored txs are removed from it, to avoid duplicates.

In detail when a restore is cancelled, a subset of txs is added to the pool, along with other txs that have arrived via api/p2p, so when the restore is cancelled because the txpool is disabled, the restored txs are removed from the dump file, that is not deleted, and then the txpool is saved to disk, and in this operation the already restored and the new txs are appended to the existing dump file without duplications.

Copy link
Contributor

@macfarla macfarla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a changelog entry

@fab-10 fab-10 force-pushed the fix-restoring-txpool-from-dump branch from df85577 to 78867d7 Compare December 9, 2024 11:22
@fab-10 fab-10 force-pushed the fix-restoring-txpool-from-dump branch from 78867d7 to 03e28e7 Compare December 17, 2024 13:28
Copy link
Contributor

@Matilda-Clerke Matilda-Clerke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't really comment on the meaningfulness of the changes, but LGTM, with some nits about var usage

@@ -645,15 +646,16 @@ public CompletableFuture<Void> setDisabled() {
isPoolEnabled.set(false);
subscribeConnectId.ifPresent(ethContext.getEthPeers()::unsubscribeConnect);
pendingTransactionsListenersProxy.unsubscribe();
final PendingTransactions pendingTransactionsToSave = pendingTransactions;
final var saveOperation =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to see a concrete type here rather than var. It's not entirely clear what type will be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, I always hope the variable name is talkative enough to avoid the verbosity

@@ -839,41 +864,77 @@ private void executeLoadFromDisk() {
LOG.info("Loading transaction pool content from file {}", saveFile);
try (final BufferedReader br =
new BufferedReader(new FileReader(saveFile, StandardCharsets.US_ASCII))) {
final IntSummaryStatistics stats =
final var stats =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to see the concrete type here rather than var as it's not clear what the type would be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@macfarla macfarla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

} catch (ExecutionException ee) {
// nothing to do
}
}

isCancelled.set(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be an atomic boolean still?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess yes because it's still used elsewhere

@fab-10 fab-10 force-pushed the fix-restoring-txpool-from-dump branch from c04fd4a to 9f6a23b Compare December 19, 2024 10:55
@fab-10 fab-10 enabled auto-merge (squash) December 19, 2024 10:56
@fab-10 fab-10 merged commit ec79c17 into hyperledger:main Dec 19, 2024
43 checks passed
garyschulte pushed a commit that referenced this pull request Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants