[FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/etc/nss/mtl-dev/mtl-dev" is empty #1039
If you start with the streaming "store" but no RAFT logs, the logs may be created empty, but the system will detect that there was a streaming store with no RAFT state. When you copied the raft_log_path content and the streaming store to a safe location, did you copy them back to the system that you restarted? In other words, you needed to restore both directories to the newly restarted node. I think that you did not do that properly.
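For illustration, a minimal sketch of what restoring both directories could look like, assuming a backup under /backup and the directory layout from this issue; the exact paths depend on the node's configured store dir and cluster log path:

```sh
# Stop the streaming server on the failed node first, then restore BOTH
# the streaming file store and the raft log path from the earlier copies
# (all paths here are illustrative).
cp -a /backup/store/.   /etc/nss/mtl-dev/store/
cp -a /backup/mtl-dev/. /etc/nss/mtl-dev/mtl-dev/
# Restart the node so streaming state and raft state are recovered together.
```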
I will try that. Is it not possible to recover a node if the files (store, raft logs) are corrupted or lost?
If the rest of the cluster is working fine, say 2 out of 3 nodes running with a new leader, you could technically remove the state from the failed node (that means both the raft logs and the streaming store) and restart that node; it will start from scratch and sync its state with the current leader.
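A sketch of that reset, with hypothetical paths (adjust to the node's actual store dir and cluster log path); the state is moved aside rather than deleted so a copy is kept:

```sh
# Stop the streaming server on the failed node, then move BOTH state
# directories out of the way (kept as a backup rather than deleted).
mv /etc/nss/mtl-dev/store   /var/tmp/stan-backup-store
mv /etc/nss/mtl-dev/mtl-dev /var/tmp/stan-backup-raft
# Restart the node; with no local state it rejoins the cluster
# and replays its state from the current leader.
```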
Here are the results of attempting to restart the failed node using the original logs and store (only the beginning of the Go stack trace was captured):
[root@sv26rdev01 mtl-dev]# su - nss
goroutine 1 [running]:
That means the raft logs are corrupted; in that case you have no choice but to start that node without any store (raft + streaming). It should ultimately be able to join the existing cluster and sync its state. If there were lots of messages, it may take some time.
OK, that seems to work. The first time it did not, but I suspect the cause was a missing directory in the log path, or a permission issue creating the log_path directory. I have one further question: if we were to lose TWO of the 3 nodes, can the 3rd node be started in non-clustered mode? Being able to recover the data is key to our plans to use nss, so any guidance is appreciated. -Tom
That's tricky. You should definitely save off the streaming stores (where the messages are), just in case the restarted node cannot fully recover before one of the 2 other servers fails. With a single server there is no quorum, so that server would stop servicing clients. Restarting in standalone mode to recover the data would work, but only for messages, not subscription state; subscriptions are stored in the raft log.
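If it ever came to that, a standalone restart over the saved message data might look roughly like this; -store, -dir, and -cid are standard nats-streaming-server options, but the path and cluster ID below are placeholders:

```sh
# Start a single, non-clustered streaming server on top of the saved file store.
# This recovers channels and messages only; subscription state lives in the
# raft log and is not recovered this way.
nats-streaming-server -store file -dir /backup/store -cid mtl-dev
```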
We do not currently use subscriptions, so bringing up a single surviving node in standalone mode would be our plan if 2 nodes were to fail. (We are considering the situation where 2 nodes are co-located at a single site that fails for some reason - we would always have a 3rd node located at a different site.) Question: is it possible (or advisable?) for nss to detect when its raft or message logs are corrupted on startup, and then pull them over from the surviving cluster nodes? Basically, eliminate the manual step of having to remove the directories and restart. Thank you for your assistance.
I would be a bit scared of automatically removing those files. That would not give the user a chance to make a copy.
I understand. I will update our procedures to remove (but save) both raft logs and file store before recovering a crashed node.
Hello. I understand that this issue is about a genuine restore scenario, but I am getting the same error on pod creation in a Kubernetes environment, for a pod mounting a PVC volume; it fails with this error on first start.
It is strange, because the PVC is freshly created and should not have any state to recover.
@snayakie maybe there was a PVC with the same name created before? Is the statefulset something like this? https://github.com/nats-io/k8s/blob/master/nats-streaming-server/simple-stan.yml
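One way to rule out a leftover claim or volume is to inspect the PVCs and PVs before the pods come up; the names below are placeholders:

```sh
# Check whether a claim with the same name already exists and which PV it is
# bound to, and whether that PV's reclaim policy could have preserved old data.
kubectl get pvc -n <namespace>
kubectl describe pvc <claim-name> -n <namespace>
kubectl get pv
```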
The fact that it says that the cluster log path "is empty"
indicates to me that the directory already existed. Otherwise you would have had something like:
@wallyqs Yes, it's like that example, but we're creating a new PVC; no PVC of that name existed prior. @kozlovic yes, I agree, I looked at the nats-streaming-server code a bit and can see the logic, but I'm unable to explain how it finds state in a PVC created just a few seconds before. And each of my 3 pods has a separate PVC mounted for itself. So I'm puzzled.
Puzzling indeed, but the error you describe would really mean that the streaming store was found while the raft state was not.
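A quick way to see what the server actually found on the volume is to list both directories from inside the pod; the mount paths here are placeholders and depend on the statefulset's volumeMounts and the server's -dir / -cluster_log_path settings:

```sh
# List the streaming file store and the raft log path on the mounted PVC.
kubectl exec <pod-name> -n <namespace> -- ls -la /data/stan/store /data/stan/log
```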
I am having the exact same problem: I remove the cluster and the PVCs, and then I create the cluster again and some nodes fail to boot:
Ran into this with two new clusters today and it started working after changing |
Hi, I have seen similar issues reported here, and I'm wondering if I am doing something wrong when trying to recover one node of a 3-node cluster.
We have a 3-node cluster, and one node had an issue (filesystem full in this case) that caused nats to stop. I moved the "store" and our log_path directories off to a safe location, and restarted. It appears to have recovered the streaming state, but not the cluster log path.
It DID create the directories, and even a raft.log file. But it still failed:
[12120] 2020/04/28 11:25:21.681533 [FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/etc/nss/mtl-dev/mtl-dev" is empty
Since this was a working config, I'm assuming that I'm not following a good procedure to restart a failed node. What am I missing?
-Tom
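For context, the "cluster log path" in the error is the directory the server uses for its raft state, which is separate from the message store. A clustered node is typically started with flags along these lines; the values shown are illustrative and not taken from the reporter's actual config:

```sh
# The streaming file store (messages) and the raft log path (cluster state)
# are two distinct directories; both must be present and consistent on restart.
nats-streaming-server -store file -dir /etc/nss/mtl-dev/store \
  -clustered -cluster_node_id mtl-dev-1 \
  -cluster_peers mtl-dev-2,mtl-dev-3 \
  -cluster_log_path /etc/nss/mtl-dev/mtl-dev
```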