Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding protections to handle timeouts and dead servers. #53

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

pbrumm
Copy link

@pbrumm pbrumm commented Apr 1, 2019

I have ran into many timeout prone scenarios and this resolves them

if a lookupd times out then set reasonable break and move on.
if an nsqd or server times out due to being out of memory or shutting down then timeout quickly and move on.

this fixes case where nsqd would timeout and hang the discovery process so no new servers were identified. even if that nsqd was unreachable or came back.

@bschwartz
Copy link
Member

Thanks for using nsq-ruby and for the PR, Pete!

You mention running into timeout prone scenarios. Is it possible to write some tests to simulate those scenarios, each in isolation, so that we can be confident these changes address them specifically?

Also, it looks like there are many places where you add timeouts or change the default timeouts to be more aggressive. We've found the existing defaults to work sufficiently, and I'd prefer not to change them without good reason, but certainly not opposed to making more things configurable.

I'd love to hear more about your set up and why you're running into some of this.

@pbrumm
Copy link
Author

pbrumm commented Apr 2, 2019

Yep. I noticed some of those defaults changing and will revert. I can also bring more configuration options to the top for the ones that weren't currently exposed.

I am running on linux and have trouble with the test suite. I will take a further look into ways to test these scenarios.

My use-case is for log pulling from hundreds of servers that are scaling up and down with volume. Sometimes they are removed in a way that acts like a timeout situation, or they could just get into one due to low memory.

the issue I was seeing was that new servers were not getting discovered for potentially days after they came up. And exceptions were getting thrown in discovery that exited the discovery loop so that it wasn't attempted any longer.

  1. if the discovery loop tries connecting to a server that is blocked. (ctrl-z to test) then the discovery loop will never complete. So I added non-blocking reads and a reasonable timeout of 10 seconds to receive any data from the connection.

  2. lookupd is more unlikely to be unresponsive but I added protections there for cases where there is bad network connections.

  3. for the retries I split out the lookupd scenario from the directly configured nsqd situation. I feel if lookupd is providing the server lists it needs to trust that source and give up quicker on nsqds. so it retries fewer times to break out faster. If 100 servers are removed then each would go through the retry logic individually in their own threads for potentially 8 hours before giving up.

  4. To handle cases where the server is initially connected but is unresponsive later, I added a monitor for the last_heartbeat to allow a connection to go bad if the nsqd isn't able to keep up with sending heartbeats.

  5. I switched to StandardError as it allows compile related issues to get through.

  6. the switch to non blocking reads and writes also keeps the retry loops running. as before if one was hung, the code would just be stuck there waiting until the nsqd was taken out of that state by dropping connection or becoming responsive again.

reworking to expose the following configurations would be a good improvement as well. Let me know if you agree to the names and I can update the pull.
max_retry_sleep_seconds
max_retry_attempts
heartbeat_seconds
heartbeat_max_stale_seconds

@pbrumm
Copy link
Author

pbrumm commented Apr 2, 2019

I was able to figure out the specs. adding some now.

@pbrumm
Copy link
Author

pbrumm commented Apr 3, 2019

I pushed the nsq-cluster gem as well pull. I ran the specs against that and added the new ones.

let me know about the new variables and if that makes sense and I will update that as well.
it will definitely make the specs run faster as well.

@chen-anders
Copy link
Contributor

Sorry for the complete radio silence - I did merge your nsq-cluster change in and tested against your branch but it seems like there's still issues with the specs.

@pbrumm
Copy link
Author

pbrumm commented Nov 30, 2023

I can check. We have run this patch in prod for many years and it resolved all the lockups we had in the past.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants