
Potential inconsistency after 'leadership lost while committing log' #2893

Closed
stefan-wenger opened this issue Apr 12, 2017 · 8 comments

@stefan-wenger

consul version for both Client and Server

Client: v0.7.5
Server: v0.7.5

consul info for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease = 
	revision = '21f2d5a
	version = 0.7.5
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 192.168.200.101:8300
	server = true
raft:
	applied_index = 31190
	commit_index = 31190
	fsm_pending = 0
	last_contact = 20.657551ms
	last_log_index = 31190
	last_log_term = 18
	last_snapshot_index = 24964
	last_snapshot_term = 18
	latest_configuration = [{Suffrage:Voter ID:192.168.200.101:8300 Address:192.168.200.101:8300} {Suffrage:Voter ID:192.168.200.102:8300 Address:192.168.200.102:8300}]
	latest_configuration_index = 610
	num_peers = 1
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 18
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 59
	max_procs = 2
	os = linux
	version = go1.7.5
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 3
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2
	members = 1
	query_queue = 0
	query_time = 1

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease = 
	revision = '21f2d5a
	version = 0.7.5
consul:
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 192.168.200.101:8300
	server = true
raft:
	applied_index = 31169
	commit_index = 31169
	fsm_pending = 0
	last_contact = 0
	last_log_index = 31169
	last_log_term = 18
	last_snapshot_index = 24964
	last_snapshot_term = 18
	latest_configuration = [{Suffrage:Voter ID:192.168.200.101:8300 Address:192.168.200.101:8300} {Suffrage:Voter ID:192.168.200.102:8300 Address:192.168.200.102:8300}]
	latest_configuration_index = 610
	num_peers = 1
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 18
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 65
	max_procs = 2
	os = linux
	version = go1.7.5
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 3
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

I use two CentOS 7.3 VMs named n1 and n2:

#uname -a
Linux n1 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
#rpm --query centos-release
centos-release-7-3.1611.el7.centos.x86_64

Description of the Issue (and unexpected/desired result)

We are developing a distributed system on top of Consul for which data consistency is critical.
We run different tests (in this case with 2 Consul cluster members) where we write a lot of keys using local clients of our application and simulate different network problems. In the end we collect all the data and verify its consistency (unconfirmed writes, lost writes, etc.). We ran into a problem where we disconnected our two cluster nodes and got an HTTP 500 response from the Consul KV API when we tried to write a key:

2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message [*3\r\n$3\r\nSET\r\n$20\r\nkey_n1_1491928360134\r\n$16\r\nn1_1491928360134\r\n]
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message as [LEADER]
2017-04-11T16:32:42.678Z [ClientHandler-92] DEBUG [ConsulKvStore] setKv failed: key [service/statestore/log/0/000000000000000018100000] value [[B@16f977c0]
com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='leadership lost while committing log')

We assumed that this write failed and sent an error message back to our client, but after reconnecting the cluster we noticed that the key had actually been written.

#curl -s localhost:8500/v1/kv/service/statestore/log/0/000000000000000018100000 | python -m json.tool
[
    {
        "CreateIndex": 383,
        "Flags": 0,
        "Key": "service/statestore/log/0/000000000000000018100000",
        "LockIndex": 0,
        "ModifyIndex": 383,
        "Value": "KjMNCiQzDQpTRVQNCiQyMA0Ka2V5X24xXzE0OTE5MjgzNjAxMzQNCiQxNg0KbjFfMTQ5MTkyODM2MDEzNA0K"
    }
]

Our problem is that we do not know whether the write operation was successful when we receive an HTTP 500 status code from the Consul KV API. Is it safe to assume that the write was successful, or is it possible that the key is not in the database after the cluster is healthy again? We might also replace one of the nodes after it failed, and I can imagine that the key could be lost after doing that.
I would like to know whether we have any guarantee about the state of a key if we receive an HTTP 500 status (with the message 'leadership lost while committing log') after sending a PUT request to the Consul KV API, and whether there is a way to avoid this situation.
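
For readers hitting the same error: the only safe interpretation is "outcome unknown". Below is a minimal sketch (not part of the original report) of one way to handle it, assuming the official Go client github.com/hashicorp/consul/api and unique keys as used in this test; the error-string match and retry bounds are illustrative.

// Minimal sketch, not the reporter's application code: treat
// "leadership lost while committing log" as an unknown outcome and
// confirm the key with a consistent read once a leader is available.
package kvwrite

import (
    "strings"
    "time"

    "github.com/hashicorp/consul/api"
)

// putAndConfirm returns (true, nil) only once the key is confirmed present.
func putAndConfirm(kv *api.KV, key string, value []byte) (bool, error) {
    _, err := kv.Put(&api.KVPair{Key: key, Value: value}, nil)
    if err == nil {
        return true, nil
    }
    // Ambiguous failure: the entry may or may not have been committed.
    if !strings.Contains(err.Error(), "leadership lost while committing log") {
        return false, err // some other, unambiguous error
    }
    // Poll with consistent reads until a leader answers authoritatively.
    for i := 0; i < 30; i++ {
        pair, _, gerr := kv.Get(key, &api.QueryOptions{RequireConsistent: true})
        if gerr == nil {
            // Present means the "failed" write actually committed.
            return pair != nil, nil
        }
        time.Sleep(2 * time.Second) // no leader yet; keep waiting
    }
    return false, err
}

This only works because each key is written once under a unique name; for keys that are overwritten concurrently, a check-and-set (the cas query parameter against the ModifyIndex) is the safer pattern.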

Reproduction steps

  • Create two VMs within the same network
  • Setup a two node consul cluster
  • Start sending PUT requests for unique keys on each node (using curl or something else; a minimal Go sketch of this loop follows the list)
  • Disconnect the network between the nodes
  • Check if one of the nodes reported an HTTP 500 status code with the message 'leadership lost while committing log' (otherwise restart the test)
  • Repair the network
  • Check whether the key that received the HTTP 500 response is in the Consul KV store (we assumed it would not be there, but at least in some cases it had been written).
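
A minimal sketch of the write loop from step 3 (an illustrative driver, not the original test harness), assuming plain net/http against the local agent on 127.0.0.1:8500; the key prefix and timing are arbitrary:

// Sketch of the PUT loop: write unique keys against the local agent and
// record any HTTP 500 responses so the affected keys can be re-checked
// after the network partition is repaired.
package repro

import (
    "bytes"
    "fmt"
    "net/http"
    "time"
)

// putLoop issues n PUTs and returns the keys whose outcome is ambiguous.
func putLoop(node string, n int) []string {
    var ambiguous []string
    client := &http.Client{Timeout: 10 * time.Second}
    for i := 0; i < n; i++ {
        key := fmt.Sprintf("repro/key_%s_%d", node, time.Now().UnixNano())
        body := bytes.NewBufferString(fmt.Sprintf("value_%s_%d", node, i))
        req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:8500/v1/kv/"+key, body)
        if err != nil {
            continue
        }
        resp, err := client.Do(req)
        if err != nil {
            continue // agent unreachable during the partition; keep trying
        }
        if resp.StatusCode == http.StatusInternalServerError {
            // Candidate for the inconsistency: re-check this key after repair.
            ambiguous = append(ambiguous, key)
        }
        resp.Body.Close()
        time.Sleep(100 * time.Millisecond)
    }
    return ambiguous
}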

Log Fragments or Link to gist

Log from our own application

2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message [*3\r\n$3\r\nSET\r\n$20\r\nkey_n1_1491928360134\r\n$16\r\nn1_1491928360134\r\n]
2017-04-11T16:32:40.167Z [ClientHandler-92] INFO [ConsulManager] processing message as [LEADER]
2017-04-11T16:32:42.678Z [ClientHandler-92] DEBUG [ConsulKvStore] setKv failed: key [service/statestore/log/0/000000000000000018100000] value [[B@16f977c0]
com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='leadership lost while committing log')

Consul log from syslog

Apr 11 16:32:42 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Failed to contact 192.168.200.102:8300 in 2.500219397s
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Failed to contact quorum of nodes, stepping down
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: Node at 192.168.200.101:8300 [Follower] entering Follower state (Leader: "")
Apr 11 16:32:42 default-centos-7 consul[7046]: consul: cluster leadership lost
Apr 11 16:32:42 default-centos-7 consul[7046]: raft: aborting pipeline replication to peer {Voter 192.168.200.102:8300 192.168.200.102:8300}
Apr 11 16:32:42 default-centos-7 consul[7046]: consul.kvs: Apply failed: leadership lost while committing log
Apr 11 16:32:42 default-centos-7 consul[7046]: http: Request PUT /v1/kv/service/statestore/log/0/000000000000000018100000, error: leadership lost while committing log from=127.0.0.1:50176
Apr 11 16:32:45 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:47 default-centos-7 consul[7046]: memberlist: Marking n2 as failed, suspect timeout reached (0 peer confirmations)
Apr 11 16:32:47 default-centos-7 consul[7046]: serf: EventMemberFailed: n2 192.168.200.102
Apr 11 16:32:47 default-centos-7 consul[7046]: consul: Removing LAN server n2 (Addr: tcp/192.168.200.102:8300) (DC: dc0)
Apr 11 16:32:49 default-centos-7 consul[7046]: serf: attempting reconnect to n2 192.168.200.102:8301
Apr 11 16:32:49 default-centos-7 consul[7046]: memberlist: Suspect n2 has failed, no acks received
Apr 11 16:32:50 default-centos-7 consul[7046]: agent: coordinate update error: No cluster leader
Apr 11 16:32:50 default-centos-7 consul[7046]: raft: Failed to heartbeat to 192.168.200.102:8300: read tcp 192.168.200.101:33154->192.168.200.102:8300: i/o timeout
Apr 11 16:32:51 default-centos-7 consul[7046]: raft: Heartbeat timeout from "" reached, starting election
Apr 11 16:32:51 default-centos-7 consul[7046]: raft: Node at 192.168.200.101:8300 [Candidate] entering Candidate state in term 3
Apr 11 16:32:57 default-centos-7 consul[7046]: raft: Election timeout reached, restarting election
@slackpad slackpad self-assigned this Apr 12, 2017
@slackpad
Contributor

Hi @stefan-wenger, I'll think through this and figure out what's happening. This seems like an artifact of a two-server setup, which isn't really tolerant of even a single failure with Raft, but I need to work it through more carefully.

@slackpad slackpad added the type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp label May 1, 2017
@slackpad slackpad removed their assignment May 1, 2017
@hehailong5

Hi,

We encountered the same issue here.
This is a three-node Consul cluster; we initially registered 40,000 services via the /catalog/register API and all succeeded. A little while later, we went on to register another 1,000 services via the same API, but this time they all failed with "leadership lost while committing log".

@zhangxin511

@slackpad We are using a 3-node 0.7.1 cluster and found this inconsistency too. In our case, we use Consul as a distributed lock, and this inconsistency prevents us from acquiring the lock again with the same key.
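
For context, here is a minimal Go sketch of the session-plus-acquire pattern behind Consul locks (not the reporter's .NET code; the key name and TTL are illustrative). It shows why a key left locked by a session whose write outcome was ambiguous can block every later acquire on the same key until that session is invalidated or destroyed:

// Sketch of Consul's lock primitive: a session is created, then the key is
// acquired with that session. If an earlier acquire "failed" with
// "leadership lost while committing log" but actually committed, the key
// stays bound to the old session and later acquires return false.
package lockdemo

import "github.com/hashicorp/consul/api"

func tryLock(client *api.Client, key string) (bool, string, error) {
    // A TTL-bound session ensures the lock is eventually released even if
    // the holder disappears; behavior "delete" removes the key on expiry.
    sessionID, _, err := client.Session().Create(&api.SessionEntry{
        TTL:      "15s",
        Behavior: api.SessionBehaviorDelete,
    }, nil)
    if err != nil {
        return false, "", err
    }
    acquired, _, err := client.KV().Acquire(&api.KVPair{
        Key:     key, // e.g. the per-user key scheme described later in this thread
        Session: sessionID,
    }, nil)
    return acquired, sessionID, err
}

If the ambiguous error happens on the acquire itself, reading the key and comparing its Session field against the session you created (or simply destroying that session) resolves the uncertainty before retrying.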

@banks
Member

banks commented Nov 8, 2018

@zhangxin511 thanks for the report. Can you consistently reproduce this? It's something we've never had a reproduction case for, and it's not clear if it's even still possible on recent versions of Consul (0.7.1 is well over a year old and a lot of changes behind!).

If you can consistently reproduce, please let us know the steps. If not, as much info as you can about what you were doing etc. might help - so far all we have to go on is these 3 reports and one set of logs from the last few years!

@zhangxin511

zhangxin511 commented Nov 8, 2018

@banks Thank you for your response. Unfortunately, this is not reproducible. I am not sure whether an inconsistency occurs every time this error happens, but based on my research in our application it looks like it always leads to some inconsistency. The error only happens on one of the heaviest-traffic endpoints of one microservice, and it happens extremely rarely. But since our key for the lock is "endpoint/{{user_guid}}", whenever this happens all subsequent requests for the same user time out until the next release, when we restart Consul.

More specifically:

  1. Our Consul cluster is 0.7.1 while the agent is 0.6.4, using the windows_amd64 distributions
  2. We are running an OWIN-hosted IIS Web API application on .NET 4.6.1, using the Consul .NET client 0.7.0.3: https://github.com/PlayFab/consuldotnet/releases/tag/0.7.0.3

Based on the stack trace below, the error most likely happens at https://github.com/PlayFab/consuldotnet/blob/0.7.0.3/Consul/Client.cs#L1304, but I think their code is handling things correctly:

Consul.ConsulRequestException: Unexpected response, status code InternalServerError: rpc error: rpc error: leadership lost while committing log
at Consul.PutRequest`2.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Consul.Lock.d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Consul.Lock.d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Consul.ConsulClient.d__51.MoveNext()

@pearkes
Contributor

pearkes commented Apr 4, 2019

Unfortunately, this is not reproducible.

Due to this and @banks' note about a very old version of Consul, I'm going to close this. Feel free to open a new issue if anyone is able to reproduce consistently, hopefully on a newer version of Consul 😄.

@pearkes pearkes closed this as completed Apr 4, 2019
@otoolep

otoolep commented Jul 1, 2023

I can consistently reproduce this issue with rqlite. See https://ci.appveyor.com/project/otoolep/rqlite/builds/47444053. I encounter this error:

[http] 2023/07/01 20:51:22 execute queue write failed for sequence number 1688244659440662796 on node 127.0.0.1:50720: leadership lost while committing log

but my test results show an extra write, even though the write is indicated as failed.

I'm running an up-to-date version of Hashicorp Raft. It's not Consul, but this issue is the top hit on Google. Are the Hashicorp team interested in digging into this?
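
For readers arriving from the raft library side, a minimal sketch of where the ambiguity lives in hashicorp/raft (generic library usage, not rqlite's or Consul's actual code): ErrLeadershipLost is returned when leadership is lost before the leader learns the commit outcome, but the entry may still be committed by the next leader, so callers have to treat it as "result unknown".

// Sketch of handling Apply results in hashicorp/raft: ErrLeadershipLost
// (and ErrNotLeader) do not guarantee the entry was dropped; the next
// leader may still commit and apply it.
package raftdemo

import (
    "errors"
    "time"

    "github.com/hashicorp/raft"
)

// ErrOutcomeUnknown tells the caller the write may or may not have applied.
var ErrOutcomeUnknown = errors.New("write outcome unknown: verify before retrying")

func apply(r *raft.Raft, cmd []byte) error {
    f := r.Apply(cmd, 10*time.Second)
    switch err := f.Error(); {
    case err == nil:
        return nil // committed and applied on this node while it was leader
    case errors.Is(err, raft.ErrLeadershipLost) || errors.Is(err, raft.ErrNotLeader):
        // Ambiguous: surface it as such instead of reporting a plain failure.
        return ErrOutcomeUnknown
    default:
        return err
    }
}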

@banks
Member

banks commented Jul 3, 2023

@otoolep thanks for the note.

I propose we continue discussing this over in this PR on the raft library as this thread is pretty old and Consul-specific!
