Transaction do_abort_tx request: not found, assuming already aborted. #24468
Comments
@bharathv replication team?
@vsarunas Based on my reading of the logs, everything is working as expected. There was an expired group transaction which triggered an internal abort of the transaction. Internal aborts (initiated by Redpanda, not the client) are treated specially compared to client-initiated aborts by bumping the epoch. This is done to fence all further client requests with the previous epoch, and the client has to init again to make progress. So I believe that resulted in a "FENCED" error when the client attempted to abort again.
What exactly did you restart? Client or the broker? I'd expect the client to init_transactions() again in this case and it should unblock itself.
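For context, a minimal sketch of what that recovery could look like from a librdkafka-based client (shown here with the confluent-kafka Python wrapper; the reporter's actual client code is not known from this thread, so names and config are illustrative):

```python
# Hedged sketch, not the reporter's code: classify transactional errors the way
# librdkafka expects. An abortable error is handled by aborting; a fatal error
# (e.g. the producer was fenced) requires a fresh instance plus init_transactions().
from confluent_kafka import Producer, KafkaException

def commit_or_recover(producer: Producer, conf: dict) -> Producer:
    try:
        producer.commit_transaction()
    except KafkaException as e:
        err = e.args[0]
        if err.txn_requires_abort():
            producer.abort_transaction()      # abort and keep the same instance
        elif err.fatal():
            producer = Producer(conf)         # fenced: old instance is unusable
            producer.init_transactions()      # re-init to move past the old epoch
        else:
            raise                             # retriable/unknown: let the caller retry
    return producer
```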
@bharathv, the application was restarted but could not complete a transaction. After a restart of the broker, the application started to work correctly. The issue reproduced again on another system for the same user, on the same version, v24.2.8. We have now upgraded to v24.3.1; same issue.
Yes, I did; this log line is not a problem.
Thanks. Do you have client logs (timestamps and error codes, pre and post RP restart) and rough pseudo code of what it does (exception handling)?
It seems I found the reason for this strange error, i.e. "This instance has been fenced by a newer instance", but still can't understand the root cause. In short, we are reading a consumer topic and performing a transaction for every received message (a rough sketch of this flow follows this comment). From the librdkafka logs:
Then, after 60 seconds (the transaction timeout), librdkafka times out the request and performs a re-connect:
That leads to "Local: This instance has been fenced by a newer instance". However, it seems that the main problem is around
and then just lots of heartbeats and attempts at expiration:
@bharathv could you let me know if any other information is required, please?
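For illustration, a rough sketch of the "one transaction per consumed message" flow described above, using the confluent-kafka Python wrapper around librdkafka (topic names, group id, and the 60 s timeout are assumptions, not taken from the reporter's configuration):

```python
# Illustrative sketch of a transaction-per-message loop; not the reporter's code.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-group",               # assumed group id
    "enable.auto.commit": False,
    "isolation.level": "read_committed",
})
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "example-tx-id",        # assumed transactional.id
    "transaction.timeout.ms": 60000,            # the 60 s timeout mentioned above
})

consumer.subscribe(["input-topic"])
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("output-topic", value=msg.value())   # message processing elided
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```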
qq:
were you able to track this request down in the broker logs? I don't think the snippet pasted above covers it; mind sharing a larger timespan of the broker logs (starting a few minutes before ^^ request reached the broker)? As for the "Fenced" error, you cannot continue using the same producer instance once the client hits the error; instead it should be something like..
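The snippet that followed "something like.." is not preserved in this thread; a hedged reconstruction of the advice (again with confluent-kafka, not the original code) would simply discard the fenced producer and start over with a new instance:

```python
# Not the original snippet: once a fatal "fenced" error is hit, the old producer
# object cannot be reused; create a new one with the same transactional.id and
# call init_transactions() again before retrying the work.
from confluent_kafka import Producer

def replace_fenced_producer(conf: dict) -> Producer:
    producer = Producer(conf)      # same transactional.id, fresh instance
    producer.init_transactions()   # re-acquires the producer epoch
    return producer
```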
That still wouldn't explain what is causing the timeout in the first place, but just FYI.
I think that the re-connection is performed automatically by librdkafka on this timeout and is just a consequence of the request reaching the timeout. Btw, after a restart of redpanda this issue stopped reproducing again. @bharathv would it be possible to upload the full logs somewhere?
@blindspotbounty please DM me the logs.
I looked at the logs (thanks for sharing); there was an (unexpected) exception when processing the request that kept the connection hanging, which eventually resulted in a client timeout. It is still unclear to me what threw it based on the available logging; I still need to stare at the code a bit more. How reproducible is this issue? I know you mentioned it's rare, but wondering if we can improve logging in some suspect places and you could try to reproduce again?
Thanks for looking. I am sorry, but unfortunately I don't have any particular scenario that would reproduce this issue.
You mean the seastar exception logger, right? That'd really help; I noticed it was on for the initial part of the log. It's too chatty but really helpful in this case.
Okay, if/when it reproduces I will try to enable this logger as well.
Just to be clear, it has to be turned on before (or by the time) the issue happens, so seastar can catch the exception and log it when it happens.
Yeah, for sure. When this starts reproducing, it keeps reproducing all the time until a restart. So I can enable all logs and collect them from the precise start of our application.
Ah ok.. if it is stuck in that state it makes sense, just set |
Version & Environment
Redpanda version (use rpk version): macOS, Linux VM running a Docker container with rpk 24.2.8.
What went wrong?
The rdkafka client was getting "Local: This instance has been fenced by a newer instance" rejections when trying to abort a transaction (2024-12-06T06:56:02.818413+01:00).
This section of the logs, from the same time as the client rejection, looks suspicious:
Transaction in CompleteAbort state:

Restart of rpk cleared up the state and allowed the application to work further.
Full logs