Better Error Catching Due to Perf/Env Failures #23

Open
GaryChicago opened this issue Oct 28, 2018 · 5 comments
Labels: enhancement (New feature or request)

GaryChicago (Contributor) commented Oct 28, 2018

Most of the time, when node-stratum-pool hits a slight RPC lag from the daemons (hardware/software performance, network latency, etc.), the whole pool crashes or stops listening because the timeout is very short.

I couldn't imagine if Chrome crashed every time Facebook didn't load all the way for somebody. lol.

E.g. the DNS lookups node-stratum-pool makes to reach the daemons: if the primary DNS server times out, Linux by default waits 5 seconds (crazy long, I know) before trying the second DNS server. s-nomp gives a socket hang up almost instantly if the daemon doesn't respond. That part is fine, but the connection never times out on node-stratum's side, so it retries the socket indefinitely. DNS is used in scaled setups, and forcing IPs is something I'm against. For now I've worked around this by setting the DNS timeout to 500ms.
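For reference, one common way to shorten that 5-second failover is the glibc resolver's options line. This is a sketch only: timeout is in whole seconds, so getting down to the 500ms mentioned above likely needs a local caching resolver (dnsmasq, systemd-resolved, etc.), and the nameserver addresses here are placeholders.

```
# /etc/resolv.conf (sketch) -- fail over to the second nameserver faster.
# "timeout" is per query, in whole seconds; "attempts" caps total retries.
options timeout:1 attempts:2
nameserver 10.0.0.2
nameserver 10.0.0.3
```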

Another scenario, from jacko0088:

2) it hangs from 1 to 5 minutes, the daemon is just busy, it doesn't break
3) after this short period of time, it replies again and works as usual
WHAT'S HAPPENING UNDER THE HOOD:
1) the pool recognizes that the daemon is not replying anymore --> socket hang up
2) it drops the pool for that specific coin --> zeroclassic pool fork died (it drops the stratum, nothing inside the logs, all clear)
SOLUTION:
1) before dropping a pool fork, wait longer OR double check multiple times ---> IF SOCKET HANG UP MORE THAN 3 TIMES IN A ROW --> drop pool
got it?

My proposed solution is to force a reconnect to the daemon after N socket hangups (N either set by the end user or statically coded), plus a timeout_connect option.
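A minimal sketch of that idea, assuming nothing about node-stratum-pool's internals: the option names (timeout_connect, max_hangups), the retry delay, and the rpcWithRetry helper are all hypothetical, purely to illustrate "retry N times in a row, then give up".

```js
const http = require('http');

// Hypothetical helper: retry a daemon RPC call after a "socket hang up",
// and only treat the daemon as dead after N consecutive failures.
function rpcWithRetry(options, body, { timeout_connect = 5000, max_hangups = 3 } = {}) {
    return new Promise((resolve, reject) => {
        let hangups = 0;

        const attempt = () => {
            const req = http.request({ ...options, method: 'POST' }, (res) => {
                let data = '';
                res.on('data', (chunk) => (data += chunk));
                res.on('end', () => resolve(data));
            });

            // Abort a stalled connection instead of waiting indefinitely.
            req.setTimeout(timeout_connect, () => req.destroy(new Error('connect timeout')));

            req.on('error', (err) => {
                hangups += 1;
                if (hangups >= max_hangups) {
                    // Only now give up (drop the pool fork / force a reconnect).
                    return reject(err);
                }
                setTimeout(attempt, 1000); // brief back-off, then retry
            });

            req.write(body);
            req.end();
        };

        attempt();
    });
}
```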

GaryChicago added the enhancement label on Oct 28, 2018
egyptianbman (Member) commented Oct 28, 2018

This is due to a default timeout of 60 seconds defined here: https://github.com/s-nomp/node-stratum-pool/blob/master/lib/daemon.js#L55

It is actually a good thing that you are seeing these errors (compared to not seeing them). It means there is an issue with the node that needs to be resolved. Allowing overly long RPC calls to be hidden would mask a serious problem, leading to your pool finding fewer blocks, among other issues.

I don't think a 60 second timeout is unreasonable, especially when you're talking about mining -- every millisecond counts.

First, check the size of your wallet.dat file. If it's larger than 100MB, it's a good time to migrate to a new wallet. When wallet.dat gets too large, the whole node starts to bog down.
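If it helps, both checks can be scripted from the pool host. This is a rough sketch only: the wallet.dat path and the zen-cli binary name are placeholders for whatever daemon you actually run.

```js
const fs = require('fs');
const { execSync } = require('child_process');

// Placeholders -- point these at your own daemon's data dir and CLI.
const WALLET_PATH = process.env.HOME + '/.zen/wallet.dat';
const CLI = 'zen-cli';

// 1) wallet.dat size: past ~100 MB the whole node tends to bog down.
const sizeMb = fs.statSync(WALLET_PATH).size / (1024 * 1024);
console.log('wallet.dat size: ' + sizeMb.toFixed(1) + ' MB');

// 2) how long a getblocktemplate round trip actually takes right now.
const start = Date.now();
execSync(CLI + ' getblocktemplate', { stdio: 'ignore' });
console.log('getblocktemplate took ' + (Date.now() - start) + ' ms');
```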

GaryChicago (Contributor, Author) commented Oct 28, 2018

@egyptianbman It's fine that we're seeing the errors, that part is good. My issue is that the entire pool crashes after about 30 seconds of these spammed errors.

I'll post more about it if I get time to test. I can recreate this by killing the first DNS server configured in Linux. The expected behavior is for s-nomp to spit out errors about socket hangups and then retry. The current behavior is infinite socket hangups until the pool is restarted manually, and it also causes the pool forks to crash. I think network hangups (even if the daemon is just hung) could be handled a bit better, including some self-healing.

egyptianbman (Member) commented

The infinite retries could be caused by pm2 if you're using that. We definitely need more information about which calls are failing; depending on which call it is, the resolution could be completely different.

GaryChicago (Contributor, Author) commented

@egyptianbman I'll try it out without PM2 to see if it's an issue with that. Thanks for the opinion. 😄

egyptianbman (Member) commented

So I've run into this with my ZEN node. It seems that if one getblocktemplate call takes longer than 60 seconds, it goes into a constant try-and-fail loop. My suspicion is that these getblocktemplate calls are stacking on top of each other, so I'm working on some code to allow this call a longer timeout than usual. I'm testing right now to find the sweet-spot limit.
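A rough sketch of that kind of change, not the actual patch: give getblocktemplate its own, longer timeout and don't start a new call while one is still in flight. The METHOD_TIMEOUTS values and the rpcCall(method, params, timeoutMs, cb) helper are assumptions for illustration.

```js
// Hypothetical per-method timeouts: let getblocktemplate run longer than
// the 60-second default without relaxing the limit for every other call.
const METHOD_TIMEOUTS = {
    default: 60000,
    getblocktemplate: 180000
};

let gbtInFlight = false;

function pollBlockTemplate(rpcCall, onTemplate) {
    if (gbtInFlight) return; // don't stack calls on top of a slow daemon
    gbtInFlight = true;

    const timeout = METHOD_TIMEOUTS.getblocktemplate || METHOD_TIMEOUTS.default;
    rpcCall('getblocktemplate', [], timeout, (err, template) => {
        gbtInFlight = false;
        if (err) return console.error('getblocktemplate failed:', err.message);
        onTemplate(template);
    });
}
```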
