adding protections to handle timeouts and dead servers. #53

pbrumm · 2019-04-01T21:12:32Z

I have ran into many timeout prone scenarios and this resolves them

if a lookupd times out then set reasonable break and move on.
if an nsqd or server times out due to being out of memory or shutting down then timeout quickly and move on.

this fixes case where nsqd would timeout and hang the discovery process so no new servers were identified. even if that nsqd was unreachable or came back.

bschwartz · 2019-04-02T03:53:38Z

Thanks for using nsq-ruby and for the PR, Pete!

You mention running into timeout prone scenarios. Is it possible to write some tests to simulate those scenarios, each in isolation, so that we can be confident these changes address them specifically?

Also, it looks like there are many places where you add timeouts or change the default timeouts to be more aggressive. We've found the existing defaults to work sufficiently, and I'd prefer not to change them without good reason, but certainly not opposed to making more things configurable.

I'd love to hear more about your set up and why you're running into some of this.

pbrumm · 2019-04-02T14:21:38Z

Yep. I noticed some of those defaults changing and will revert. I can also bring more configuration options to the top for the ones that weren't currently exposed.

I am running on linux and have trouble with the test suite. I will take a further look into ways to test these scenarios.

My use-case is for log pulling from hundreds of servers that are scaling up and down with volume. Sometimes they are removed in a way that acts like a timeout situation, or they could just get into one due to low memory.

the issue I was seeing was that new servers were not getting discovered for potentially days after they came up. And exceptions were getting thrown in discovery that exited the discovery loop so that it wasn't attempted any longer.

if the discovery loop tries connecting to a server that is blocked. (ctrl-z to test) then the discovery loop will never complete. So I added non-blocking reads and a reasonable timeout of 10 seconds to receive any data from the connection.
lookupd is more unlikely to be unresponsive but I added protections there for cases where there is bad network connections.
for the retries I split out the lookupd scenario from the directly configured nsqd situation. I feel if lookupd is providing the server lists it needs to trust that source and give up quicker on nsqds. so it retries fewer times to break out faster. If 100 servers are removed then each would go through the retry logic individually in their own threads for potentially 8 hours before giving up.
To handle cases where the server is initially connected but is unresponsive later, I added a monitor for the last_heartbeat to allow a connection to go bad if the nsqd isn't able to keep up with sending heartbeats.
I switched to StandardError as it allows compile related issues to get through.
the switch to non blocking reads and writes also keeps the retry loops running. as before if one was hung, the code would just be stuck there waiting until the nsqd was taken out of that state by dropping connection or becoming responsive again.

reworking to expose the following configurations would be a good improvement as well. Let me know if you agree to the names and I can update the pull.
max_retry_sleep_seconds
max_retry_attempts
heartbeat_seconds
heartbeat_max_stale_seconds

specs passing now

pbrumm · 2019-04-02T15:24:46Z

I was able to figure out the specs. adding some now.

…cluster gem that I will also put a pull in for.

pbrumm · 2019-04-03T21:39:57Z

I pushed the nsq-cluster gem as well pull. I ran the specs against that and added the new ones.

let me know about the new variables and if that makes sense and I will update that as well.
it will definitely make the specs run faster as well.

chen-anders · 2023-11-30T03:37:24Z

Sorry for the complete radio silence - I did merge your nsq-cluster change in and tested against your branch but it seems like there's still issues with the specs.

pbrumm · 2023-11-30T13:41:57Z

I can check. We have run this patch in prod for many years and it resolved all the lockups we had in the past.

adding protections to handle timeouts and dead servers.

11c4c2a

adding fix to have the method work as it did before

5a2cd16

specs passing now

pbrumm added 2 commits April 2, 2019 16:51

adding some specs to show the timeouts. this requires an updated nsq-…

ade2b8e

…cluster gem that I will also put a pull in for.

fixing for tls connections

c6f941e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding protections to handle timeouts and dead servers. #53

adding protections to handle timeouts and dead servers. #53

pbrumm commented Apr 1, 2019

bschwartz commented Apr 2, 2019

pbrumm commented Apr 2, 2019

pbrumm commented Apr 2, 2019

pbrumm commented Apr 3, 2019

chen-anders commented Nov 30, 2023

pbrumm commented Nov 30, 2023

adding protections to handle timeouts and dead servers. #53

Are you sure you want to change the base?

adding protections to handle timeouts and dead servers. #53

Conversation

pbrumm commented Apr 1, 2019

bschwartz commented Apr 2, 2019

pbrumm commented Apr 2, 2019

pbrumm commented Apr 2, 2019

pbrumm commented Apr 3, 2019

chen-anders commented Nov 30, 2023

pbrumm commented Nov 30, 2023