[backport] [v23.2.x] rm_stm: fix fence_pid_epoch cleanup #17880 #18120

Closed
wants to merge 2 commits into from

Contributor

@bharathv bharathv commented Apr 27, 2024

fence_pid_epoch maps a producer id to its latest epoch. The current cleanup code does not do an epoch check before cleaning up the pid state. This can result in removing the state related to the latest epoch. Consider the following series of events:

[x, y] = pid[id=x, epoch=y]

[1, 0] begin_tx - fence_pid_epoch[1] = 0
[1, 1] begin_tx - fence_pid_epoch[1] = 1
evict [1, 0]
erase(fence_pid[1]) ==> removes the entry holding epoch 1, the latest epoch

This leaves the state inconsistent and stalls the transaction, because the partition cannot make progress until it verifies the epoch.
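
The sequence above can be replayed with a minimal model. This is only a sketch, not the rm_stm code: `Pid`, `begin_tx`, `evict_unchecked`, `evict_checked` and `replay` are hypothetical names, and the equality test in `evict_checked` is one possible form of the epoch check, assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pid:
    """Hypothetical stand-in for a producer identity: [id, epoch]."""
    id: int
    epoch: int


def begin_tx(fence_pid_epoch: dict, pid: Pid) -> None:
    # Record the latest epoch seen for this producer id.
    fence_pid_epoch[pid.id] = pid.epoch


def evict_unchecked(fence_pid_epoch: dict, pid: Pid) -> None:
    # Cleanup without an epoch check: erases by id alone, whatever
    # epoch is currently stored for that id.
    fence_pid_epoch.pop(pid.id, None)


def evict_checked(fence_pid_epoch: dict, pid: Pid) -> None:
    # Cleanup with an epoch check (assumed form): erase only when the
    # stored epoch is the one being evicted, so a newer epoch survives.
    if fence_pid_epoch.get(pid.id) == pid.epoch:
        del fence_pid_epoch[pid.id]


def replay(evict) -> dict:
    # Replays the events from the description with the given cleanup.
    fences = {}
    begin_tx(fences, Pid(1, 0))   # fence_pid_epoch[1] = 0
    begin_tx(fences, Pid(1, 1))   # fence_pid_epoch[1] = 1
    evict(fences, Pid(1, 0))      # evict [1, 0]
    return fences


print(replay(evict_unchecked))  # {}
print(replay(evict_checked))    # {1: 1}
```

With the unchecked cleanup the entry for the latest epoch disappears; with the check, evicting the stale [1, 0] leaves the entry for epoch 1 in place.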

This is a long-standing bug that was exposed by racy evictions.

Note: this code is going to be revamped soon, and the plan is to add a self-contained unit test fixture that supports transactions end-to-end, which should give better test coverage.

Fixes #17891

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

(cherry picked from commit 996e138)
@piyushredpanda piyushredpanda requested a review from ztlpn May 5, 2024 18:07
@piyushredpanda piyushredpanda added this to the v23.2.x-next milestone May 5, 2024
@bharathv bharathv requested a review from mmaslankaprv May 6, 2024 05:30
rpk = RpkTool(self.redpanda)
rpk.cluster_config_set("max_concurrent_producer_ids",
                       str(max_concurrent_pids))
sleep(5)

Should we wait for at least 10s to guarantee that eviction ran at least once? (Is the relevant config property abort_timed_out_transactions_interval_ms?)
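
One alternative to picking a longer fixed sleep is to poll for the eviction with a timeout. This is a different technique from what the diff does, and the sketch below is only an illustration: `wait_until` here is a local helper written for this example, and `eviction_observed` in the usage note is a hypothetical predicate that the test would have to supply.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.5):
    """Poll condition() until it returns True or timeout_sec elapses.

    Returns True if the condition held before the deadline, else False.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        # Back off between checks so the poll does not spin.
        time.sleep(backoff_sec)
```

In the test this could replace `sleep(5)` with something like `assert wait_until(eviction_observed, timeout_sec=30)`, so the wait ends as soon as an eviction is seen and is bounded by a timeout rather than tied to the eviction interval.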

@mmaslankaprv
Member

I think we can close this, as v23.2.x is no longer supported.

@piyushredpanda
Contributor

> I think we can close this, as v23.2.x is no longer supported.

v23.2.x is supported for about a month or so more.

@piyushredpanda
Contributor

Closing as v23.2.x goes end-of-support
