
[FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/etc/nss/mtl-dev/mtl-dev" is empty #1039

Open
tharrold opened this issue Apr 28, 2020 · 17 comments

@tharrold

Hi, I have seen similar issues reported here, and I'm wondering if I am doing something wrong when trying to recover one node of a 3-node cluster.

We have a 3-node cluster, and one node had an issue (filesystem full, in this case) that caused nats to stop. I moved the "store" and our log_path directories off to a safe location and restarted. It appears to have recovered the streaming state, but not the cluster log path.

It DID create the directories, and even a raft.log file, but it still failed:

[12120] 2020/04/28 11:25:21.681533 [FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/etc/nss/mtl-dev/mtl-dev" is empty

Since this was a working config, I'm assuming that I'm not following a good procedure to restart a failed node. What am I missing?

-Tom

@kozlovic
Member

If you start with a streaming "store" but no RAFT logs, the RAFT directories may be created empty, but the system will detect that there is a streaming store with no RAFT state.

When you copied the raft_log_path content and the streaming store to a safe location, did you copy them back before restarting? In other words, you need both directories on the newly restarted node. I suspect that was not done properly.

@tharrold
Author

I will try that. Is it not possible to recover a node, if the files (store, raft logs) are corrupted or lost?

@kozlovic
Member

If the rest of the cluster is working fine, say 2 out of 3 nodes running with a new leader, technically you can remove the state from the failed node (that means both the raft and streaming stores) and restart that node; it will start from scratch and sync its state with the current leader.
What you can't do is start a given node with no raft state but with a streaming store.
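The reset procedure described above can be sketched as a small shell script. This is a hedged illustration run against throwaway directories created under mktemp; RAFT_DIR, STORE_DIR and BACKUP are invented stand-ins, not paths from this thread's config — substitute your own raft_log_path and file-store "dir" from nss.config.

```shell
# Sketch of resetting a failed node's state so it re-syncs from the leader.
DEMO=$(mktemp -d)
RAFT_DIR="$DEMO/raft/mtl-dev"   # stand-in for the cluster log path
STORE_DIR="$DEMO/store"         # stand-in for the streaming file store
mkdir -p "$RAFT_DIR" "$STORE_DIR"
touch "$RAFT_DIR/raft.log" "$STORE_DIR/clients.dat"   # fake recovered state

# Move BOTH directories aside. Removing only one of them is exactly what
# triggers the "streaming state was recovered but cluster log path ... is
# empty" failure.
BACKUP="$DEMO/backup"
mkdir -p "$BACKUP"
mv "$RAFT_DIR" "$BACKUP/raft"
mv "$STORE_DIR" "$BACKUP/store"

# Then restart the node; with no local state it starts from scratch and
# syncs from the current leader, e.g.:
#   /opt/nss-0.16-41775/nats-streaming-server -sc /etc/nss/mtl-dev/nss.config
```

Moving (rather than deleting) the directories keeps a copy you can fall back on if the re-sync goes wrong.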

@tharrold
Author

Here are the results of attempting to restart the failed node using the original logs and store:

[root@sv26rdev01 mtl-dev]# su - nss
Last login: Tue Apr 28 11:46:20 EDT 2020 on pts/0
[nss@sv26rdev01 ~]$ /opt/nss-0.16-41775/nats-streaming-server -sc /etc/nss/mtl-dev/nss.config
[3216] 2020/04/28 12:13:28.067816 [INF] STREAM: Starting nats-streaming-server[mtl-dev] version 0.16.0
[3216] 2020/04/28 12:13:28.067878 [INF] STREAM: ServerID: pyghNxsTfQZXA0jIp8jYJq
[3216] 2020/04/28 12:13:28.067885 [INF] STREAM: Go version: go1.11.13
[3216] 2020/04/28 12:13:28.067892 [INF] STREAM: Git commit: [27593aa]
[3216] 2020/04/28 12:13:28.068811 [INF] Starting nats-server version 2.0.4
[3216] 2020/04/28 12:13:28.068829 [INF] Git commit [c8ca58e]
[3216] 2020/04/28 12:13:28.069157 [INF] Starting http monitor on 0.0.0.0:44222
[3216] 2020/04/28 12:13:28.069398 [INF] Listening for client connections on 0.0.0.0:4222
[3216] 2020/04/28 12:13:28.069408 [INF] Server id is NAGJOTLAZ3ZKEXDA2BPHNDCW5AEIMKJLUVTN56UOBVNJK5Y7LRNHTPD5
[3216] 2020/04/28 12:13:28.069414 [INF] Server is ready
[3216] 2020/04/28 12:13:28.070762 [INF] Listening for route connections on 0.0.0.0:6222
[3216] 2020/04/28 12:13:28.072793 [INF] 140.188.62.250:6222 - rid:1 - Route connection created
[3216] 2020/04/28 12:13:28.072932 [INF] 140.188.62.251:6222 - rid:2 - Route connection created
[3216] 2020/04/28 12:13:28.096530 [INF] STREAM: Recovering the state...
[3216] 2020/04/28 12:13:28.100310 [INF] STREAM: Recovered 1 channel(s)
[3216] 2020/04/28 12:13:28.100339 [INF] STREAM: Cluster Node ID : a
[3216] 2020/04/28 12:13:28.100356 [INF] STREAM: Cluster Log Path: /etc/nss/mtl-dev/mtl-dev
[3216] 2020/04/28 12:13:28.101908 [INF] STREAM: Shutting down.
[3216] 2020/04/28 12:13:28.102148 [INF] Server Exiting..
[3216] 2020/04/28 12:13:28.102557 [INF] 140.188.62.250:6222 - rid:1 - Router connection closed
[3216] 2020/04/28 12:13:28.102612 [INF] 140.188.62.251:6222 - rid:2 - Router connection closed
[3216] 2020/04/28 12:13:28.102952 [INF] STREAM: Failed to start: log not found
panic: log not found [recovered]
panic: log not found

goroutine 1 [running]:
github.com/nats-io/nats-streaming-server/server.RunServerWithOpts.func1(0x9d58340, 0x9d35fa0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1596 +0x17a
panic(0x86513c0, 0x9ce31c8)
	/home/travis/.gimme/versions/go1.11.13.linux.amd64/src/runtime/panic.go:513 +0x16d
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.NewRaft(0x9d35a24, 0x880f860, 0x9ce64d0, 0x8812320, 0x9c60180, 0x8810b80, 0x9d0c0f0, 0x8810020, 0x9ce6130, 0x8813820, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/api.go:532 +0x1067
github.com/nats-io/nats-streaming-server/server.(*StanServer).createRaftNode(0x9d58340, 0x9ceb378, 0x7, 0x9cec8c0, 0x11, 0x9d3c050)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/clustering.go:359 +0x618
github.com/nats-io/nats-streaming-server/server.(*StanServer).createServerRaftNode(0x9d58340, 0x9ce4b01, 0x872dccc, 0x87003c8)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/clustering.go:136 +0x9d
github.com/nats-io/nats-streaming-server/server.(*StanServer).startRaftNode(0x9d58340, 0x8700301, 0x14, 0x9ce2190)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1963 +0x2e
github.com/nats-io/nats-streaming-server/server.(*StanServer).start(0x9d58340, 0x5, 0x0, 0x0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1906 +0xaa7
github.com/nats-io/nats-streaming-server/server.RunServerWithOpts(0x9d58000, 0x9d5a000, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/server.go:1703 +0xb6f
github.com/nats-io/nats-streaming-server/server.Run(0x9d58000, 0x9d5a000, 0x86f96d4, 0xc, 0x0)
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/server/service.go:25 +0x29
main.main()
	/home/travis/gopath/src/github.com/nats-io/nats-streaming-server/nats-streaming-server.go:162 +0x43

@kozlovic
Member

That means the raft logs are corrupted, so in that case you have no choice but to start that node without any store (raft+streaming). It should ultimately be able to join the existing cluster and sync its state. If there were lots of messages, it may take some time.

@tharrold
Author

OK, that seems to work. The first time it did not, but I suspect it was a missing directory in the log path, or a permission issue creating the log_path directory.

I have one further question: if we were to lose TWO of the 3 nodes, can the 3rd node be started in non-clustered mode? Being able to recover the data is key to our plans to use nss, so any guidance is appreciated.

-Tom

@kozlovic
Member

That's tricky. You should definitely save off the streaming stores (where the messages are), just in case the restarted node cannot fully recover before one of the 2 other servers fails.

With a single server, there is no quorum, so that server would stop servicing clients. Restarting in standalone mode to recover the data would work, but only for messages, not subscription state. The subscriptions are stored in the raft log.
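As a rough sketch of that last-resort standalone recovery, and assuming a streaming config shaped like the one the server's documentation describes (the exact block layout and the "dir" path below are assumptions, not copied from this thread), standalone mode amounts to keeping the same cluster ID and file store but dropping the cluster block:

```conf
# Hypothetical nss.config for last-resort standalone recovery (a sketch,
# not this thread's actual config). Keeping the same id and file store lets
# the surviving messages be recovered; removing the cluster block means no
# quorum and no raft log are required. Subscription state is NOT recovered.
streaming {
  id: "mtl-dev"                   # same cluster ID as before
  store: "file"
  dir: "/etc/nss/mtl-dev/store"   # illustrative path to the surviving file store

  # cluster {                     # clustering disabled for standalone mode
  #   node_id: "a"
  #   log_path: "/etc/nss/mtl-dev/mtl-dev"
  # }
}
```

Treat this as read-only salvage: once the other nodes are rebuilt, the node should go back to clustered mode rather than run standalone indefinitely.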

@tharrold
Author

We do not currently use subscriptions, so bringing up a single surviving node in standalone mode would be our plan if 2 nodes were to fail. (We are considering the situation where 2 nodes are co-located at a single site that fails for some reason; we would always have a 3rd node located at a different site.)

Question: is it possible (or advisable?) for nss to detect when its raft or message logs are corrupted on startup, and then pull them over from the surviving cluster nodes? Basically, eliminate the manual step of having to remove the directories and restart.

Thank you for your assistance.

@kozlovic
Member

I would be a bit scared of automatically removing those files. That would not give the user a chance to make a copy.

@tharrold
Author

I understand. I will update our procedures to remove (but save) both raft logs and file store before recovering a crashed node.

@snayakie

Hello. I understand that this issue is about a genuine restore scenario, but I am getting the same error on pod creation in a Kubernetes environment: a pod mounting a PVC volume fails with the same error on first start.

[1] 2020/04/19 22:49:13.299687 [INF] STREAM: Starting nats-streaming-server[m1-stancluster] version 0.17.0
[1] 2020/04/19 22:49:13.300243 [INF] STREAM: ServerID: W3XC2Uu43ORI0YXly072rx
[1] 2020/04/19 22:49:13.300249 [INF] STREAM: Go version: go1.13.6
[1] 2020/04/19 22:49:13.300259 [INF] STREAM: Git commit: [c8735a0]
[1] 2020/04/19 22:49:13.400136 [INF] STREAM: Recovering the state...
[1] 2020/04/19 22:49:13.407013 [INF] STREAM: Recovered 0 channel(s)
[1] 2020/04/19 22:49:13.407052 [INF] STREAM: Cluster Node ID : "m1-stancluster-3"
[1] 2020/04/19 22:49:13.407057 [INF] STREAM: Cluster Log Path: /pv/stan/raft/m1-stancluster-3
[1] 2020/04/19 22:49:13.422128 [INF] STREAM: Shutting down.
[1] 2020/04/19 22:49:13.424052 [FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/pv/stan/raft/m1-stancluster-3" is empty

It is strange, because the PVC was just freshly created; it should not have any state to recover.
I would appreciate it if you could explain why this may be happening.

@wallyqs
Member

wallyqs commented Apr 30, 2020

@snayakie maybe there was a PVC with the same name created before? Is the statefulset something like this? https://github.com/nats-io/k8s/blob/master/nats-streaming-server/simple-stan.yml

@kozlovic
Member

The fact that it says:

[1] 2020/04/19 22:49:13.407013 [INF] STREAM: Recovered 0 channel(s)

indicates to me that the directory already existed. Otherwise you would have had something like:

[1] 2020/04/30 16:20:42.080641 [INF] STREAM: Recovering the state...
[1] 2020/04/30 16:20:42.080795 [INF] STREAM: No recovered state

@snayakie

snayakie commented Apr 30, 2020

@snayakie maybe there was a PVC with the same name created before? Is the statefulset something like this? https://github.com/nats-io/k8s/blob/master/nats-streaming-server/simple-stan.yml

@wallyqs Yes, it's like that example, but we're creating a new PVC; no PVC of that name existed prior.

@kozlovic Yes, I agree. I looked at the nats-streaming-server code a bit and can see the logic, but I'm unable to explain how it finds state in a PVC created just a few seconds before. And each of my 3 pods has a separate PVC mounted for itself. So I'm puzzled.

@kozlovic
Member

Puzzling indeed, but the error you describe really means that the streaming store was found while the raft log was not.
Of course, once the streaming and raft stores are both created (even empty, meaning that no producer/consumer was ever started, so there is no channel, etc.), the server will not fail on restart. The issue is really when the server finds an existing streaming store while the raft store is brand new.
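A small preflight check can surface this mismatch before starting the server. This is not part of nats-streaming-server; it is a hedged sketch run against throwaway directories, and check_state plus all paths here are invented for illustration:

```shell
# Hypothetical preflight check: flag the exact combination the server aborts
# on, i.e. a non-empty streaming store next to an empty raft log path.
DEMO=$(mktemp -d)
STORE_DIR="$DEMO/store"   # illustrative streaming store path
RAFT_DIR="$DEMO/raft"     # illustrative raft log path
mkdir -p "$STORE_DIR" "$RAFT_DIR"
touch "$STORE_DIR/server.dat"   # simulate an existing streaming store

check_state() {
  # Store has content but raft dir is empty -> the fatal combination.
  if [ -n "$(ls -A "$1" 2>/dev/null)" ] && [ -z "$(ls -A "$2" 2>/dev/null)" ]; then
    echo inconsistent
  else
    echo ok
  fi
}

RESULT=$(check_state "$STORE_DIR" "$RAFT_DIR")    # -> inconsistent
touch "$RAFT_DIR/raft.log"                        # once raft state exists...
RESULT2=$(check_state "$STORE_DIR" "$RAFT_DIR")   # -> ok
echo "$RESULT / $RESULT2"
```

Running such a check in a Kubernetes init container or startup wrapper would turn the late FTL abort into an early, explicit failure, while still leaving the decision to wipe state to the operator.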

@mcuadros

I am having the exact same problem: I remove the cluster and the PVCs, then I create the cluster again, and some nodes fail to boot:

[1] 2020/05/16 10:20:30.152948 [INF] STREAM: Starting nats-streaming-server[nats-streaming-cluster] version 0.17.0
[1] 2020/05/16 10:20:30.153080 [INF] STREAM: ServerID: Fh3Nt1H1OMYViqwaQOZYNZ
[1] 2020/05/16 10:20:30.153084 [INF] STREAM: Go version: go1.13.7
[1] 2020/05/16 10:20:30.153088 [INF] STREAM: Git commit: [f4b7190]
[1] 2020/05/16 10:20:30.168437 [INF] STREAM: Recovering the state...
[1] 2020/05/16 10:20:30.169647 [INF] STREAM: Recovered 0 channel(s)
[1] 2020/05/16 10:20:30.169666 [INF] STREAM: Cluster Node ID : "nats-streaming-cluster-2"
[1] 2020/05/16 10:20:30.169671 [INF] STREAM: Cluster Log Path: /pv/stan/raft/nats-streaming-cluster-2
[1] 2020/05/16 10:20:30.172639 [INF] STREAM: raft: Initial configuration (index=0): []
[1] 2020/05/16 10:20:30.172718 [INF] STREAM: Shutting down.
[1] 2020/05/16 10:20:30.172800 [INF] STREAM: raft: Node at nats-streaming-cluster."nats-streaming-cluster-2".nats-streaming-cluster [Follower] entering Follower state (Leader: "")
[1] 2020/05/16 10:20:30.173057 [FTL] STREAM: Failed to start: streaming state was recovered but cluster log path "/pv/stan/raft/nats-streaming-cluster-2" is empty

@tpyo

tpyo commented Aug 12, 2020

Ran into this with two new clusters today and it started working after changing dir and log_path 🤔
