Potential inconsistency after 'leadership lost while committing log' #2893
Comments
Hi @stefan-wenger, I'll think through this and figure out what's happening. This seems like an artifact of a 2-server setup, which isn't really tolerant of even a single failure with Raft, but I need to work it through more carefully.
Hi, we encountered the same issue here.
@slackpad We are using a 3-node 0.7.1 cluster and found this inconsistency too. In our case, we use Consul as a distributed lock, and this inconsistency prevents us from acquiring further locks using the same key.
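For context, the distributed-lock pattern mentioned above is typically built on a session that acquires a KV key. A minimal sketch using the official Go client, `github.com/hashicorp/consul/api`, is shown below; the key name and setup are illustrative, not the reporter's actual configuration:

```go
// Minimal sketch of the lock pattern described above, using the official
// Go client (github.com/hashicorp/consul/api). Key name is illustrative.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// LockKey wraps a session-backed lock on a single KV key.
	lock, err := client.LockKey("locks/some-user-guid")
	if err != nil {
		log.Fatal(err)
	}

	// Lock blocks until the lock is held; the returned channel is closed
	// if the lock is later lost (e.g. the session is invalidated).
	lostCh, err := lock.Lock(nil)
	if err != nil {
		// An ambiguous server-side error here is exactly the situation
		// discussed in this issue: the lock state on the key may be unclear.
		log.Fatal(err)
	}
	defer lock.Unlock()

	doProtectedWork(lostCh)
}

func doProtectedWork(lost <-chan struct{}) {
	select {
	case <-lost:
		log.Println("lock lost; abandon the protected work")
	default:
		log.Println("lock held; perform the protected work")
	}
}
```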
@zhangxin511 thanks for the report. Can you consistently reproduce this? It's something we've never had a reproduction case for, and it's not clear if it's even still possible on recent versions of Consul (0.7.1 is well over a year old and a lot of changes behind!). If you can consistently reproduce it, please let us know the steps. If not, as much info as you can give about what you were doing might help - so far all we have to go on is these 3 reports and one set of logs from the last few years!
@banks Thank you for your response. Unfortunately, this is not reproducible. I am not sure whether this error always leads to an inconsistency, but based on my research in our application it looks like it always does. This error only happens at one of the most heavily trafficked endpoints of one micro-service, and it happens with extremely low probability. But since our key for the lock is "endpoint/{{user_guid}}", whenever this happens all following requests for the same user will time out until our next release restarts Consul. More specifically:
The error most likely happens at https://github.com/PlayFab/consuldotnet/blob/0.7.0.3/Consul/Client.cs#L1304, based on the stack trace below, but I think their code is handling things correctly:
Due to this and @banks' note about a very old version of Consul, I'm going to close this. Feel free to open a new issue if anyone is able to reproduce consistently, hopefully on a newer version of Consul 😄.
I can consistently reproduce this issue with rqlite. See https://ci.appveyor.com/project/otoolep/rqlite/builds/47444053. I encounter this error:
but my test results show an extra write, even though the write is indicated as failed. I'm running an up-to-date version of HashiCorp Raft. It's not Consul, but this issue is the top hit on Google. Is the HashiCorp team interested in digging into this?
@otoolep thanks for the note. I propose we continue discussing this over in this PR on the raft library, as this thread is pretty old and Consul-specific!
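For reference, the error string in the issue title comes from the raft library itself: `Apply` returns a future whose `Error()` can be `raft.ErrLeadershipLost`, which indicates that the entry was proposed but leadership was lost before the commit was confirmed, so the outcome is unknown rather than definitely failed. A minimal sketch (FSM and transport wiring omitted; the function name is illustrative):

```go
// Sketch of where "leadership lost while committing log" surfaces in
// hashicorp/raft: the future returned by Apply.
package example

import (
	"errors"
	"log"
	"time"

	"github.com/hashicorp/raft"
)

func apply(r *raft.Raft, cmd []byte) {
	future := r.Apply(cmd, 5*time.Second)
	if err := future.Error(); err != nil {
		if errors.Is(err, raft.ErrLeadershipLost) {
			// Leadership was lost before commit was confirmed; a later
			// leader may still commit the entry, so treat the outcome as
			// unknown rather than as a definite failure.
			log.Println("ambiguous outcome: entry may still be applied")
			return
		}
		log.Printf("apply failed: %v", err)
		return
	}
	log.Printf("applied, response: %v", future.Response())
}
```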
consul version for both Client and Server
Client: v0.7.5
Server: v0.7.5
consul info for both Client and Server
Client:
Server:
Operating system and Environment details
I use two CentOS 7.3 VMs named n1 and n2:
Description of the Issue (and unexpected/desired result)
We are developing a distributed system on top of consul for which data consistency is critical.
We run different tests (in this case with two Consul cluster members) where we write a lot of keys using local clients of our application and simulate different network problems. In the end we collect all the data and verify its consistency (unconfirmed writes, lost writes, etc.). We ran into a problem where we disconnected our two cluster nodes and got an HTTP 500 status response from the Consul KV API when we tried to write a key:
We assumed that this write had failed and sent an error message back to our client, but after reconnecting the cluster we noticed that the key had actually been written.
Our problem is that we do not know whether the write operation was successful when we receive an HTTP 500 status code from the Consul KV API. Is it safe to assume that the write was successful, or is it possible that the key is not in the database after the cluster is healthy again? We might also replace one of the nodes after it has failed, and I can imagine that the key could be lost after doing that.
I would like to know whether we have any guarantee about the state of a key if we receive an HTTP 500 status (with the message 'leadership lost while committing log') after sending a PUT request to the Consul KV API, and whether there is a way to avoid this situation.
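One client-side mitigation (a sketch, not something Consul itself guarantees) is to make writes conditional with check-and-set and to re-read the key after an ambiguous error to decide whether the write landed. The Go sketch below uses the official client; the key, value, and comparison heuristic are illustrative assumptions:

```go
// Hypothetical sketch: treating an ambiguous KV write error as "unknown
// outcome" and verifying via a check-and-set (CAS) write plus a follow-up read.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	key, value := "test/key-0042", []byte("payload") // illustrative names

	// Read the current ModifyIndex so the write can be made conditional.
	cur, _, err := kv.Get(key, nil)
	if err != nil {
		log.Fatal(err)
	}
	var index uint64
	if cur != nil {
		index = cur.ModifyIndex
	}

	// CAS write: only applies if the key is still at `index`.
	ok, _, err := kv.CAS(&api.KVPair{Key: key, Value: value, ModifyIndex: index}, nil)
	switch {
	case err != nil:
		// Ambiguous outcome (e.g. "leadership lost while committing log"):
		// the write may or may not have been applied. Read back and compare.
		after, _, getErr := kv.Get(key, nil)
		if getErr != nil {
			log.Fatalf("outcome unknown and verification failed: %v", getErr)
		}
		if after != nil && after.ModifyIndex > index && bytes.Equal(after.Value, value) {
			fmt.Println("write most likely applied despite the error")
		} else {
			fmt.Println("write most likely not applied; safe to retry with the same index")
		}
	case !ok:
		fmt.Println("CAS lost: another writer modified the key first")
	default:
		fmt.Println("write applied")
	}
}
```

Note that the read-back check is only a heuristic: if a concurrent writer can write the same value, the comparison cannot distinguish the two writes, so truly idempotent operations or unique per-write tokens are safer.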
Reproduction steps
Log Fragments or Link to gist
Log from our own application
Consul log from syslog