
[FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/etc/nss/mtl-dev/mtl-dev" is empty #1039

Open
tharrold opened this issue Apr 28, 2020 · 17 comments

@tharrold

Hi, I have seen similar issues reported here, and I'm wondering if I am doing something wrong when trying to recover one node of a 3-node cluster.

We have a 3-node cluster, and one node had an issue (filesystem full, in this case) that caused nats to stop. I moved the "store" and our log_path directories off to a safe location and restarted. It appears to have recovered the streaming state, but not the cluster log path.

It DID create the directories, and even a raft.log file, but it still failed:

[12120] 2020/04/28 11:25:21.681533 [FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/etc/nss/mtl-dev/mtl-dev" is empty

Since this was a working config, I'm assuming that I'm not following a good procedure to restart a failed node. What am I missing?

-Tom

@kozlovic
Member

If you start with a streaming "store" but no RAFT logs, the RAFT directories may be created empty, but the system will detect that there is a streaming store with no RAFT state.

When you copied the raft_log_path content and the streaming store to a safe location, did you copy them back before restarting? In other words, you need both directories on the newly restarted node. I suspect that was not done properly.

@tharrold
Author

I will try that. Is it not possible to recover a node, if the files (store, raft logs) are corrupted or lost?

@kozlovic
Member

If the rest of the cluster is working fine, say 2 out of 3 nodes running with a new leader, technically you can remove the state from the failed node (that means both the raft and streaming stores) and restart that node; it will start from scratch and sync its state with the current leader.
What you can't do is start a given node with no raft state but with a streaming store.
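The reset procedure described above can be sketched as a small shell script. This is a hedged illustration run against throwaway directories created under mktemp; RAFT_DIR, STORE_DIR and BACKUP are invented stand-ins, not paths from this thread's config — substitute your own raft_log_path and file-store "dir" from nss.config.

```shell
# Sketch of resetting a failed node's state so it re-syncs from the leader.
DEMO=$(mktemp -d)
RAFT_DIR="$DEMO/raft/mtl-dev"   # stand-in for the cluster log path
STORE_DIR="$DEMO/store"         # stand-in for the streaming file store
mkdir -p "$RAFT_DIR" "$STORE_DIR"
touch "$RAFT_DIR/raft.log" "$STORE_DIR/clients.dat"   # fake recovered state

# Move BOTH directories aside. Removing only one of them is exactly what
# triggers the "streaming state was recovered but cluster log path ... is
# empty" failure.
BACKUP="$DEMO/backup"
mkdir -p "$BACKUP"
mv "$RAFT_DIR" "$BACKUP/raft"
mv "$STORE_DIR" "$BACKUP/store"

# Then restart the node; with no local state it starts from scratch and
# syncs from the current leader, e.g.:
#   /opt/nss-0.16-41775/nats-streaming-server -sc /etc/nss/mtl-dev/nss.config
```

Moving (rather than deleting) the directories keeps a copy you can fall back on if the re-sync goes wrong.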

@tharrold
Author

Here are the results of attempting to restart the failed node using the original logs and store:

[root@sv26rdev01 mtl-dev]# su - nss
Last login: Tue Apr 28 11:46:20 EDT 2020 on pts/0
[nss@sv26rdev01 ~]$ /opt/nss-0.16-41775/nats-streaming-server -sc /etc/nss/mtl-dev/nss.config
[3216] 2020/04/28 12:13:28.067816 [INF] STREAM: Starting nats-streaming-server[mtl-dev] version 0.16.0
[3216] 2020/04/28 12:13:28.067878 [INF] STREAM: ServerID: pyghNxsTfQZXA0jIp8jYJq
[3216] 2020/04/28 12:13:28.067885 [INF] STREAM: Go version: go1.11.13
[3216] 2020/04/28 12:13:28.067892 [INF] STREAM: Git commit: [27593aa]
[3216] 2020/04/28 12:13:28.068811 [INF] Starting nats-server version 2.0.4
[3216] 2020/04/28 12:13:28.068829 [INF] Git commit [c8ca58e]
[3216] 2020/04/28 12:13:28.069157 [INF] Starting http monitor on 0.0.0.0:44222
[3216] 2020/04/28 12:13:28.069398 [INF] Listening for client connections on 0.0.0.0:4222
[3216] 2020/04/28 12:13:28.069408 [INF] Server id is NAGJOTLAZ3ZKEXDA2BPHNDCW5AEIMKJLUVTN56UOBVNJK5Y7LRNHTPD5
[3216] 2020/04/28 12:13:28.069414 [INF] Server is ready
[3216] 2020/04/28 12:13:28.070762 [INF] Listening for route connections on 0.0.0.0:6222
[3216] 2020/04/28 12:13:28.072793 [INF] 140.188.62.250:6222 - rid:1 - Route connection created
[3216] 2020/04/28 12:13:28.072932 [INF] 140.188.62.251:6222 - rid:2 - Route connection created
[3216] 2020/04/28 12:13:28.096530 [INF] STREAM: Recovering the state...
[3216] 2020/04/28 12:13:28.100310 [INF] STREAM: Recovered 1 channel(s)
[3216] 2020/04/28 12:13:28.100339 [INF] STREAM: Cluster Node ID : a
[3216] 2020/04/28 12:13:28.100356 [INF] STREAM: Cluster Log Path: /etc/nss/mtl-dev/mtl-dev
[3216] 2020/04/28 12:13:28.101908 [INF] STREAM: Shutting down.
[3216] 2020/04/28 12:13:28.102148 [INF] Server Exiting..
[3216] 2020/04/28 12:13:28.102557 [INF] 140.188.62.250:6222 - rid:1 - Router connection closed
[3216] 2020/04/28 12:13:28.102612 [INF] 140.188.62.251:6222 - rid:2 - Router connection closed
[3216] 2020/04/28 12:13:28.102952 [INF] STREAM: Failed to start: log not found
panic: log not found [recovered]
panic: log not found

goroutine 1 [running]:
github.com/nats-io/nats-streaming-server/server.RunServerWithOpts.func1(0x9d58340, 0x9d35fa0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1596 +0x17a
panic(0x86513c0, 0x9ce31c8)
	/home/travis/.gimme/versions/go1.11.13.linux.amd64/src/runtime/panic.go:513 +0x16d
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.NewRaft(0x9d35a24, 0x880f860, 0x9ce64d0, 0x8812320, 0x9c60180, 0x8810b80, 0x9d0c0f0, 0x8810020, 0x9ce6130, 0x8813820, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/api.go:532 +0x1067
github.com/nats-io/nats-streaming-server/server.(*StanServer).createRaftNode(0x9d58340, 0x9ceb378, 0x7, 0x9cec8c0, 0x11, 0x9d3c050)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/clustering.go:359 +0x618
github.com/nats-io/nats-streaming-server/server.(*StanServer).createServerRaftNode(0x9d58340, 0x9ce4b01, 0x872dccc, 0x87003c8)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/clustering.go:136 +0x9d
github.com/nats-io/nats-streaming-server/server.(*StanServer).startRaftNode(0x9d58340, 0x8700301, 0x14, 0x9ce2190)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1963 +0x2e
github.com/nats-io/nats-streaming-server/server.(*StanServer).start(0x9d58340, 0x5, 0x0, 0x0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1906 +0xaa7
github.com/nats-io/nats-streaming-server/server.RunServerWithOpts(0x9d58000, 0x9d5a000, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1703 +0xb6f
github.com/nats-io/nats-streaming-server/server.Run(0x9d58000, 0x9d5a000, 0x86f96d4, 0xc, 0x0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/service.go:25 +0x29
main.main()
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/nats-streaming-server.go:162 +0x43

@kozlovic
Member

That means the raft logs are corrupted, so in that case you have no choice but to start that node without any store (raft+streaming). It should ultimately be able to join the existing cluster and sync its state. If there were lots of messages, it may take some time.

@tharrold
Author

OK, that seems to work. The first time it did not, but I suspect it was a missing directory in the log path, or a permission issue creating the log_path directory.

I have one further question: if we were to lose TWO of the 3 nodes, can the 3rd node be started in non-clustered mode? Being able to recover the data is key to our plans to use nss, so any guidance is appreciated.

-Tom

@kozlovic
Member

That's tricky. You should definitely save off the streaming stores (where the messages are), just in case the restarted node cannot fully recover before one of the 2 other servers fails.

With a single server, there is no quorum, so that server would stop servicing clients. Restarting in standalone mode to recover the data would work, but only for messages, not subscription state. The subscriptions are stored in the raft log.
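As a rough sketch of that last-resort standalone recovery, and assuming a streaming config shaped like the one the server's documentation describes (the exact block layout and the "dir" path below are assumptions, not copied from this thread), standalone mode amounts to keeping the same cluster ID and file store but dropping the cluster block:

```conf
# Hypothetical nss.config for last-resort standalone recovery (a sketch,
# not this thread's actual config). Keeping the same id and file store lets
# the surviving messages be recovered; removing the cluster block means no
# quorum and no raft log are required. Subscription state is NOT recovered.
streaming {
  id: "mtl-dev"                   # same cluster ID as before
  store: "file"
  dir: "/etc/nss/mtl-dev/store"   # illustrative path to the surviving file store

  # cluster {                     # clustering disabled for standalone mode
  #   node_id: "a"
  #   log_path: "/etc/nss/mtl-dev/mtl-dev"
  # }
}
```

Treat this as read-only salvage: once the other nodes are rebuilt, the node should go back to clustered mode rather than run standalone indefinitely.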

@tharrold
Author

We do not currently use subscriptions, so bringing up a single surviving node in standalone mode would be our plan if 2 nodes were to fail. (We are considering the situation where 2 nodes are co-located at a single site that fails for some reason; we would always have a 3rd node located at a different site.)

Question: is it possible (or advisable?) for nss to detect when its raft or message logs are corrupted on startup, and then pull them over from the surviving cluster nodes? Basically, eliminate the manual step of having to remove the directories and restart.

Thank you for your assistance.

@kozlovic
Member

I would be a bit scared of automatically removing those files. That would not give the user a chance to make a copy.

@tharrold
Author

I understand. I will update our procedures to remove (but save) both raft logs and file store before recovering a crashed node.

@snayakie

Hello. I understand that this issue is about a genuine restore scenario, but I am getting the same error on pod creation in a Kubernetes environment: a pod mounting a PVC volume fails with the same error on first start.

[1] 2020/04/19 22:49:13.299687 [INF] STREAM: Starting nats-streaming-server[m1-stancluster] version 0.17.0
[1] 2020/04/19 22:49:13.300243 [INF] STREAM: ServerID: W3XC2Uu43ORI0YXly072rx
[1] 2020/04/19 22:49:13.300249 [INF] STREAM: Go version: go1.13.6
[1] 2020/04/19 22:49:13.300259 [INF] STREAM: Git commit: [c8735a0]
[1] 2020/04/19 22:49:13.400136 [INF] STREAM: Recovering the state...
[1] 2020/04/19 22:49:13.407013 [INF] STREAM: Recovered 0 channel(s)
[1] 2020/04/19 22:49:13.407052 [INF] STREAM: Cluster Node ID : "m1-stancluster-3"
[1] 2020/04/19 22:49:13.407057 [INF] STREAM: Cluster Log Path: /pv/stan/raft/m1-stancluster-3
[1] 2020/04/19 22:49:13.422128 [INF] STREAM: Shutting down.
[1] 2020/04/19 22:49:13.424052 [FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/pv/stan/raft/m1-stancluster-3" is empty

It is strange, because the PVC was just freshly created; it should not have any state to recover.
I would appreciate it if you could explain why this may be happening.

@wallyqs
Member

wallyqs commented Apr 30, 2020

@snayakie maybe there was a PVC with the same name created before? Is the statefulset something like this? https://github.com/nats-io/k8s/blob/master/nats-streaming-server/simple-stan.yml

@kozlovic
Member

The fact that it says:

[1] 2020/04/19 22:49:13.407013 [INF] STREAM: Recovered 0 channel(s)

indicates to me that the directory already existed. Otherwise you would have had something like:

[1] 2020/04/30 16:20:42.080641 [INF] STREAM: Recovering the state...
[1] 2020/04/30 16:20:42.080795 [INF] STREAM: No recovered state

@snayakie

snayakie commented Apr 30, 2020

@snayakie maybe there was a PVC with the same name created before? Is the statefulset something like this? https://github.com/nats-io/k8s/blob/master/nats-streaming-server/simple-stan.yml

@wallyqs Yes, it's like that example, but we're creating a new PVC; no PVC of that name existed prior.

@kozlovic Yes, I agree. I looked at the nats-streaming-server code a bit and can see the logic, but I'm unable to explain how it finds state in a PVC created just a few seconds before. And each of my 3 pods has a separate PVC mounted for itself. So I'm puzzled.

@kozlovic
Member

Puzzling indeed, but the error you describe really means that the streaming store was found while the raft log was not.
Of course, once the streaming and raft stores are both created (even empty, meaning that no producer/consumer was ever started, so there is no channel, etc.), the server will not fail on restart. The issue is really when the server finds an existing streaming store while the raft store is brand new.
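A small preflight check can surface this mismatch before starting the server. This is not part of nats-streaming-server; it is a hedged sketch run against throwaway directories, and check_state plus all paths here are invented for illustration:

```shell
# Hypothetical preflight check: flag the exact combination the server aborts
# on, i.e. a non-empty streaming store next to an empty raft log path.
DEMO=$(mktemp -d)
STORE_DIR="$DEMO/store"   # illustrative streaming store path
RAFT_DIR="$DEMO/raft"     # illustrative raft log path
mkdir -p "$STORE_DIR" "$RAFT_DIR"
touch "$STORE_DIR/server.dat"   # simulate an existing streaming store

check_state() {
  # Store has content but raft dir is empty -> the fatal combination.
  if [ -n "$(ls -A "$1" 2>/dev/null)" ] && [ -z "$(ls -A "$2" 2>/dev/null)" ]; then
    echo inconsistent
  else
    echo ok
  fi
}

RESULT=$(check_state "$STORE_DIR" "$RAFT_DIR")    # -> inconsistent
touch "$RAFT_DIR/raft.log"                        # once raft state exists...
RESULT2=$(check_state "$STORE_DIR" "$RAFT_DIR")   # -> ok
echo "$RESULT / $RESULT2"
```

Running such a check in a Kubernetes init container or startup wrapper would turn the late FTL abort into an early, explicit failure, while still leaving the decision to wipe state to the operator.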

@mcuadros

I am having the exact same problem: I remove the cluster and the PVCs, then I create the cluster again, and some nodes fail to boot:

[1] 2020/05/16 10:20:30.152948 [INF] STREAM: Starting nats-streaming-server[nats-streaming-cluster] version 0.17.0
[1] 2020/05/16 10:20:30.153080 [INF] STREAM: ServerID: Fh3Nt1H1OMYViqwaQOZYNZ
[1] 2020/05/16 10:20:30.153084 [INF] STREAM: Go version: go1.13.7
[1] 2020/05/16 10:20:30.153088 [INF] STREAM: Git commit: [f4b7190]
[1] 2020/05/16 10:20:30.168437 [INF] STREAM: Recovering the state...
[1] 2020/05/16 10:20:30.169647 [INF] STREAM: Recovered 0 channel(s)
[1] 2020/05/16 10:20:30.169666 [INF] STREAM: Cluster Node ID : "nats-streaming-cluster-2"
[1] 2020/05/16 10:20:30.169671 [INF] STREAM: Cluster Log Path: /pv/stan/raft/nats-streaming-cluster-2
[1] 2020/05/16 10:20:30.172639 [INF] STREAM: raft: Initial configuration (index=0): []
[1] 2020/05/16 10:20:30.172718 [INF] STREAM: Shutting down.
[1] 2020/05/16 10:20:30.172800 [INF] STREAM: raft: Node at nats-streaming-cluster."nats-streaming-cluster-2".nats-streaming-cluster [Follower] entering Follower state (Leader: "")
[1] 2020/05/16 10:20:30.173057 [FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/pv/stan/raft/nats-streaming-cluster-2" is empty

@tpyo

tpyo commented Aug 12, 2020

Ran into this with two new clusters today and it started working after changing dir and log_path 🤔
