Better Error Catching Due to Perf/Env Failures #23

Open
GaryChicago opened this issue Oct 28, 2018 · 5 comments
Labels: enhancement (New feature or request)

GaryChicago (Contributor) commented Oct 28, 2018

Most of the time, when node-stratum-pool hits a slight RPC lag from the daemons (hardware/software performance, network latency, etc.), the whole pool crashes or stops listening because the timeout is very short.

I couldn't imagine if Chrome crashed every time Facebook didn't load all the way for somebody. lol.

E.g. the DNS lookups node-stratum-pool makes to reach the daemons: if the primary DNS server times out, Linux by default waits 5 seconds (crazy long, I know) before trying the second DNS server. s-nomp gives a socket hang up almost instantly if the daemon doesn't respond. That part is fine, but the connection never times out on node-stratum's side, so it retries the socket indefinitely. DNS is used in scaled setups, and forcing IPs is something I'm against. For now I've worked around this by setting the DNS timeout to 500ms.
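For reference, one common way to shorten that 5-second failover is the glibc resolver's options line. This is a sketch only: timeout is in whole seconds, so getting down to the 500ms mentioned above likely needs a local caching resolver (dnsmasq, systemd-resolved, etc.), and the nameserver addresses here are placeholders.

```
# /etc/resolv.conf (sketch) -- fail over to the second nameserver faster.
# "timeout" is per query, in whole seconds; "attempts" caps total retries.
options timeout:1 attempts:2
nameserver 10.0.0.2
nameserver 10.0.0.3
```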

Another scenario, from jacko0088:

2) it hangs from 1 to 5 minutes, the daemon is just busy, it doesn't break
3) after this short period of time, it replies again and works as usual
WHAT'S HAPPENING UNDER THE HOOD:
1) the pool recognizes that the daemon is not replying anymore --> socket hang up
2) it drops the pool for that specific coin --> zeroclassic pool fork died (it drops the stratum, nothing inside the logs, all clear)
SOLUTION:
1) before dropping a pool fork, wait longer OR double check multiple times ---> IF SOCKET HANG UP MORE THAN 3 TIMES IN A ROW --> drop pool
got it?

My proposed solution is to force a reconnect to the daemon after N socket hangups (N either set by the end user or statically coded), plus a timeout_connect option.
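A minimal sketch of that idea, assuming nothing about node-stratum-pool's internals: the option names (timeout_connect, max_hangups), the retry delay, and the rpcWithRetry helper are all hypothetical, purely to illustrate "retry N times in a row, then give up".

```js
const http = require('http');

// Hypothetical helper: retry a daemon RPC call after a "socket hang up",
// and only treat the daemon as dead after N consecutive failures.
function rpcWithRetry(options, body, { timeout_connect = 5000, max_hangups = 3 } = {}) {
    return new Promise((resolve, reject) => {
        let hangups = 0;

        const attempt = () => {
            const req = http.request({ ...options, method: 'POST' }, (res) => {
                let data = '';
                res.on('data', (chunk) => (data += chunk));
                res.on('end', () => resolve(data));
            });

            // Abort a stalled connection instead of waiting indefinitely.
            req.setTimeout(timeout_connect, () => req.destroy(new Error('connect timeout')));

            req.on('error', (err) => {
                hangups += 1;
                if (hangups >= max_hangups) {
                    // Only now give up (drop the pool fork / force a reconnect).
                    return reject(err);
                }
                setTimeout(attempt, 1000); // brief back-off, then retry
            });

            req.write(body);
            req.end();
        };

        attempt();
    });
}
```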

GaryChicago added the enhancement label on Oct 28, 2018
egyptianbman (Member) commented Oct 28, 2018

This is due to a default timeout of 60 seconds defined here: https://github.com/s-nomp/node-stratum-pool/blob/master/lib/daemon.js#L55

It is actually a good thing that you are seeing these errors (compared to not seeing them). It means there is an issue with the node that needs to be resolved. Allowing overly long RPC calls to be hidden would mask a serious problem, leading to your pool finding fewer blocks, among other issues.

I don't think a 60 second timeout is unreasonable, especially when you're talking about mining -- every millisecond counts.

First, check the size of your wallet.dat file. If it's larger than 100MB, it's a good time to migrate to a new wallet. When wallet.dat gets too large, the whole node starts to bog down.
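If it helps, both checks can be scripted from the pool host. This is a rough sketch only: the wallet.dat path and the zen-cli binary name are placeholders for whatever daemon you actually run.

```js
const fs = require('fs');
const { execSync } = require('child_process');

// Placeholders -- point these at your own daemon's data dir and CLI.
const WALLET_PATH = process.env.HOME + '/.zen/wallet.dat';
const CLI = 'zen-cli';

// 1) wallet.dat size: past ~100 MB the whole node tends to bog down.
const sizeMb = fs.statSync(WALLET_PATH).size / (1024 * 1024);
console.log('wallet.dat size: ' + sizeMb.toFixed(1) + ' MB');

// 2) how long a getblocktemplate round trip actually takes right now.
const start = Date.now();
execSync(CLI + ' getblocktemplate', { stdio: 'ignore' });
console.log('getblocktemplate took ' + (Date.now() - start) + ' ms');
```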

GaryChicago (Contributor, Author) commented Oct 28, 2018

@egyptianbman It's fine that we're seeing the errors, that part is good. My issue is that the entire pool crashes after about 30 seconds of these spammed errors.

I'll post more about it if I get time to test. I can recreate this by killing the first DNS server configured in Linux. The expected behavior is for s-nomp to spit out errors about socket hangups and then retry. The current behavior is infinite socket hangups until the pool is restarted manually, and it also causes the pool forks to crash. I think network hangups (even if the daemon is just hung) could be handled a bit better, including some self-healing.

egyptianbman (Member) commented

The infinite retries could be caused by pm2 if you're using that. We definitely need more information about which calls are failing; depending on which call it is, the resolution could be completely different.

GaryChicago (Contributor, Author) commented

@egyptianbman I'll try it out without PM2 to see if it's an issue with that. Thanks for the opinion. 😄

egyptianbman (Member) commented

So I've run into this with my ZEN node. It seems that if one getblocktemplate call takes longer than 60 seconds, it goes into a constant try-and-fail loop. My suspicion is that these getblocktemplate calls are stacking on top of each other, so I'm working on some code to allow this call a longer timeout than usual. I'm testing right now to find the sweet-spot limit.
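A rough sketch of that kind of change, not the actual patch: give getblocktemplate its own, longer timeout and don't start a new call while one is still in flight. The METHOD_TIMEOUTS values and the rpcCall(method, params, timeoutMs, cb) helper are assumptions for illustration.

```js
// Hypothetical per-method timeouts: let getblocktemplate run longer than
// the 60-second default without relaxing the limit for every other call.
const METHOD_TIMEOUTS = {
    default: 60000,
    getblocktemplate: 180000
};

let gbtInFlight = false;

function pollBlockTemplate(rpcCall, onTemplate) {
    if (gbtInFlight) return; // don't stack calls on top of a slow daemon
    gbtInFlight = true;

    const timeout = METHOD_TIMEOUTS.getblocktemplate || METHOD_TIMEOUTS.default;
    rpcCall('getblocktemplate', [], timeout, (err, template) => {
        gbtInFlight = false;
        if (err) return console.error('getblocktemplate failed:', err.message);
        onTemplate(template);
    });
}
```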
