SubscribeSocket does not reconnect on network reconnection #845

Open
valeriob opened this issue Jan 17, 2020 · 27 comments

Comments

@valeriob
Contributor

Environment

NetMQ Version:    4.0.0.207
Operating System: Win10 1909
.NET Version:     Core 3.0.1

I created a console application that connects to a server via a SubscriberSocket. I run the program with Wi-Fi on and start receiving messages. If I disable Wi-Fi, the messages stop. If I re-enable Wi-Fi within ~10 seconds, the messages start arriving again; if I wait longer, no more messages ever arrive.
On some computers with the same software the problem does not occur. I investigated the problem here, but I do not know how to work around it :( somdoron/AsyncIO#35
We chose NetMQ precisely for its network resiliency, since the application will run over an intermittent Wi-Fi connection 😢

@somdoron
Member

Which side is the listener and which is the connector?
I'm going to introduce ZMTP3 soon with heartbeat, it might solve it.

If you can, try to make the publisher be the connector and the subscriber the listener.
That way, the publisher will be able to recognize disconnections.

You can also try and enable TcpKeepAlive, it might also solve it.
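
As a minimal sketch of the keepalive suggestion, assuming a subscriber that connects to a publisher: the endpoint and the timing values below are illustrative assumptions, not values taken from this thread, and whether keepalive alone is enough to recover from the Wi-Fi drop is not confirmed here.

```csharp
using System;
using NetMQ;
using NetMQ.Sockets;

class KeepaliveSubscriber
{
    static void Main()
    {
        using var sub = new SubscriberSocket();

        // Ask the OS to probe an idle TCP connection so that a dead peer
        // is eventually noticed. Set the options before Connect so they
        // apply to the connection that gets created.
        sub.Options.TcpKeepalive = true;
        sub.Options.TcpKeepaliveIdle = TimeSpan.FromSeconds(5);     // assumed value
        sub.Options.TcpKeepaliveInterval = TimeSpan.FromSeconds(1); // assumed value

        sub.Connect("tcp://127.0.0.1:5556"); // hypothetical endpoint
        sub.Subscribe("");                   // empty prefix: receive every topic

        while (true)
        {
            // Blocks until the next message arrives.
            string message = sub.ReceiveFrameString();
            Console.WriteLine(message);
        }
    }
}
```

Keepalive probing is done by the operating system's TCP stack, so how quickly a dead connection is detected depends on the platform as well as on these options.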

@somdoron
Member

Bottom line, I'm not sure it is a NetMQ problem rather than a TCP one. TCP keepalive is off by default.
If the network is down, the receiver might not be able to recognize it. ZMTP v2, which NetMQ implements, doesn't have a heartbeat mechanism. ZMTP v3 does have one, but it is disabled by default.

I'm working on a ZMTP v3 PR that should be merged in a couple of days; I will make the heartbeat commit now. If you can test it and see whether the problem is solved, that would be great.

@valeriob
Contributor Author

Thanks @somdoron. The publisher binds and the subscriber connects, since I have one server and many occasionally connected clients that all need to be notified of some events.
My problem is the clients that do not receive data when the network reconnects.

If you look at my issue on AsyncIO, somdoron/AsyncIO#35, you will see that I diagnosed that GetQueuedCompletionStatusEx never ever ever returns (it's not a timeout, I waited 10 minutes) 😄 And it happens only on some computers, not on all of them 😭

I'll try the new bits as soon as they are released!

@somdoron
Member

Yes, I saw the issue with AsyncIO, I'm not sure it is related. Give me 10 minutes

@somdoron
Member

@valeriob can you check with the following PR:

#843

You need to enable HeartbeatInterval, check out the socket option:
https://github.com/somdoron/netmq/blob/ZMTP3/src/NetMQ/SocketOptions.cs#L415

Also check out the test:
https://github.com/somdoron/netmq/blob/ZMTP3/src/NetMQ.Tests/ZMTPTests.cs

Both publisher and subscriber need the new version.
You can enable the heartbeat only on the subscriber side if you want.

@valeriob
Contributor Author

Thanks @somdoron, I tried that PR, but when I simulated bad network conditions I got an unhandled exception on a timer. I'll be able to reproduce the problem on Monday.
Meanwhile, you may be able to reproduce it yourself with this tool: https://jagt.github.io/clumsy/download.html. I just added 10% packet loss on the localhost communication, running both client and server on my computer.

@valeriob
Contributor Author

This is the exception I get:
image
I tried to enumerate earlier (var timers = m_timers[key].ToList();) just to see if I could carry on, but I got this exception:

image

@somdoron
Member

I think I fixed the issue, please check:
https://github.com/somdoron/netmq

@valeriob
Contributor Author

Thanks @somdoron, now I get this exception:

image

@somdoron
Member

Oops, I think I fixed that now:
https://github.com/somdoron/netmq/tree/ZMTP3

I will try to simulate a broken network later today.

@valeriob
Contributor Author

Thanks,
there is still a problem with a timer:

image

@somdoron
Member

somdoron commented Jan 20, 2020

I think I fixed that. You can check again

@somdoron
Member

@valeriob any updates? It is now part of master. If you can confirm that it works, I will release a beta version to NuGet.

@valeriob
Contributor Author

Hi @somdoron, sorry, I was away for a few days. I just tested commit e99b59a,
but I still got this unhandled exception:
image

@somdoron
Member

:(
What is the value of id?

@valeriob
Contributor Author

No problem, id is 1:

image

Valerio

@somdoron
Member

I was able to reproduce and fix:
somdoron@6c05444

Branch:
https://github.com/somdoron/netmq/tree/TimerInvokeNull

@valeriob
Contributor Author

valeriob commented Jan 26, 2020

Unfortunately, after 20 seconds of 80% packet drop, the server now crashes:

image

and sometimes like this :

image

The problem is that it does not recover :(
Valerio

@somdoron
Member

Any chance you are receiving from multiple sockets?

@valeriob
Contributor Author

Yes, it's possible: I tried restarting the publisher after the crash to see if the client would recover.
The test is now:
Start the publisher and the subscriber (without any network interference).

  1. Raise packet loss to 80% => the events stop coming (of course)
  2. Wait 30s
  3. Restore packet loss to 0%
  4. The subscriber no longer receives anything.
  5. Raise packet loss back to 80% => exception on the subscriber
    image

@valeriob
Contributor Author

As far as I can see, protocol v3 looks more complicated to implement correctly. I'll keep helping with the tests, but what do you think about taking a look at what @wmjordan said in this issue? somdoron/AsyncIO#35 I guess it would benefit many people.

@somdoron
Member

Only the heartbeat is a bit complicated.

Anyway, I don't think it is an AsyncIO issue; it is how TCP works by design. TCP doesn't have a heartbeat (not by default, at least), so if a connection is closed ungracefully the other side won't know about it until it tries to send.

This is why we need a heartbeat. Can you share the code you are using to test the pub and sub?

@valeriob
Contributor Author

Of course, I'll extract the bits; maybe I'm misusing something.

@valeriob
Contributor Author

Hi,
I'm testing the latest version, and adding these options to the socket appears to solve the problem:
sub.Options.HeartbeatInterval = TimeSpan.FromMilliseconds(10);
sub.Options.HeartbeatTimeout = TimeSpan.FromMilliseconds(1);
I'm setting them on the sockets that do the "connect" side of the link. Is that right?
Thanks
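
For context, a minimal sketch of a subscriber with the two heartbeat options in place might look like the following. The endpoint and the one-second and three-second values are assumptions chosen for illustration (the comment above used 10 ms and 1 ms), and both peers need a NetMQ version with ZMTP v3 heartbeat support.

```csharp
using System;
using NetMQ;
using NetMQ.Sockets;

class HeartbeatSubscriber
{
    static void Main()
    {
        using var sub = new SubscriberSocket();

        // How often a heartbeat PING is sent to the peer (assumed value).
        sub.Options.HeartbeatInterval = TimeSpan.FromSeconds(1);
        // How long to wait for traffic after a PING before the connection
        // is treated as dead and closed (assumed value).
        sub.Options.HeartbeatTimeout = TimeSpan.FromSeconds(3);

        sub.Connect("tcp://127.0.0.1:5556"); // hypothetical endpoint
        sub.Subscribe("");                   // empty prefix: receive every topic

        while (true)
        {
            // Blocks until the next message arrives.
            string message = sub.ReceiveFrameString();
            Console.WriteLine(message);
        }
    }
}
```

Once a dead connection has been closed by the heartbeat, the socket's normal reconnect logic can try the endpoint again, which matches the recovery reported here when the network comes back.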

@valeriob
Contributor Author

Hi,
I've been testing it for a few days. When I run pub and sub in two different processes the connection recovers perfectly, but if I run both publisher and subscriber in the same process, it doesn't. Here is the repro; just run the PublisherSubscriber project.
https://github.com/valeriob/NetMqNetworkFailures/tree/master/src/NetMqNetworkFailures/PublisherSubscriber

Valerio

@Serg046

Serg046 commented Mar 29, 2022

I have experienced the same issue. I then tested the same scenario with https://github.com/zeromq/clrzmq4 and got the same behavior. This seems to be by design on the zmq side: it just stops reconnecting after a while. I then tried to add manual reconnection, hoping to use the Monitor feature to intercept the Disconnected event. However, I never get that event, only Connected events. I tried something like this:

using var monitor = new NetMQMonitor(socket, "inproc://sub", SocketEvents.All);
monitor.EventReceived += (sender, eventArgs) =>
{
	Console.Write(eventArgs.SocketEvent.ToString());
	Console.WriteLine("EventReceived");
};
Task.Run(() => monitor.Start());

So I have no idea how to reconnect the sub socket. I don't even see a flag such as State on the socket that I could check in a loop.
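
One possible workaround, offered only as a sketch and not as something the maintainers recommend in this thread, is an application-level watchdog: receive with a timeout and recreate the subscriber when nothing has arrived for too long. It assumes the publisher sends at least one message within the silence limit (for example a periodic "alive" message); the endpoint, the 30-second limit, and the helper name `CreateSubscriber` are hypothetical.

```csharp
using System;
using NetMQ;
using NetMQ.Sockets;

class ResubscribingReceiver
{
    const string Endpoint = "tcp://127.0.0.1:5556"; // hypothetical endpoint

    // Assumption: the publisher is never silent for longer than this.
    static readonly TimeSpan SilenceLimit = TimeSpan.FromSeconds(30);

    // Hypothetical helper: builds a fresh, connected subscriber.
    static SubscriberSocket CreateSubscriber()
    {
        var sub = new SubscriberSocket();
        sub.Connect(Endpoint);
        sub.Subscribe(""); // empty prefix: receive every topic
        return sub;
    }

    static void Main()
    {
        var sub = CreateSubscriber();
        try
        {
            while (true)
            {
                // Returns false if no frame arrived within SilenceLimit.
                if (sub.TryReceiveFrameString(SilenceLimit, out string message))
                {
                    Console.WriteLine(message);
                    continue;
                }

                // Silence for too long: drop the socket and start over.
                Console.WriteLine("No traffic, recreating the subscriber socket");
                sub.Dispose();
                sub = CreateSubscriber();
            }
        }
        finally
        {
            sub.Dispose();
        }
    }
}
```

The trade-off is that a legitimately quiet publisher also triggers a reconnect, so the silence limit has to be longer than the longest expected gap between messages.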
