-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding protections to handle timeouts and dead servers. #53
base: master
Are you sure you want to change the base?
adding protections to handle timeouts and dead servers. #53
Conversation
Thanks for using You mention running into timeout prone scenarios. Is it possible to write some tests to simulate those scenarios, each in isolation, so that we can be confident these changes address them specifically? Also, it looks like there are many places where you add timeouts or change the default timeouts to be more aggressive. We've found the existing defaults to work sufficiently, and I'd prefer not to change them without good reason, but certainly not opposed to making more things configurable. I'd love to hear more about your set up and why you're running into some of this. |
Yep. I noticed some of those defaults changing and will revert. I can also bring more configuration options to the top for the ones that weren't currently exposed. I am running on linux and have trouble with the test suite. I will take a further look into ways to test these scenarios. My use-case is for log pulling from hundreds of servers that are scaling up and down with volume. Sometimes they are removed in a way that acts like a timeout situation, or they could just get into one due to low memory. the issue I was seeing was that new servers were not getting discovered for potentially days after they came up. And exceptions were getting thrown in discovery that exited the discovery loop so that it wasn't attempted any longer.
reworking to expose the following configurations would be a good improvement as well. Let me know if you agree to the names and I can update the pull. |
specs passing now
I was able to figure out the specs. adding some now. |
…cluster gem that I will also put a pull in for.
I pushed the nsq-cluster gem as well pull. I ran the specs against that and added the new ones. let me know about the new variables and if that makes sense and I will update that as well. |
Sorry for the complete radio silence - I did merge your nsq-cluster change in and tested against your branch but it seems like there's still issues with the specs. |
I can check. We have run this patch in prod for many years and it resolved all the lockups we had in the past. |
I have ran into many timeout prone scenarios and this resolves them
if a lookupd times out then set reasonable break and move on.
if an nsqd or server times out due to being out of memory or shutting down then timeout quickly and move on.
this fixes case where nsqd would timeout and hang the discovery process so no new servers were identified. even if that nsqd was unreachable or came back.