MoP Proxy seems redundant, introduces unnecessary complexity #1218

Open
tsturzl opened this issue Jan 17, 2024 · 3 comments
Comments

@tsturzl
Contributor

tsturzl commented Jan 17, 2024

Is your enhancement request related to a problem? Please describe.

In several attempts to get the MoP proxy to work properly, we have run into constant stability problems and bugs related to the MoP proxy. In the current design, each MoP broker serves only the subset of topics that its Pulsar broker serves, and the MoP proxy routes connections to the designated broker. This is complicated by the fact that the proxy does a lot of mediation, but ultimately has to manage the lifecycle of any number of MQTT connections to each broker. The current keep alive mechanism is flawed: a client that communicates frequently enough may never send a keep alive, which means the keep alive to certain brokers might not be maintained. Introducing keep alive between client and proxy, and between proxy and each broker, isn't too difficult, but it only increases the complexity of managing the lifecycle of the proxy-to-broker connections in conjunction with the client-to-proxy connection.
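To make the keep alive gap concrete, here is a minimal simulation. It is a sketch under stated assumptions, not MoP code: the broker names, the `simulate` function, and the idea that the proxy relays a client PINGREQ to every forwarding connection are all hypothetical. The only protocol fact it relies on is that an MQTT server may drop a client it has not heard from within 1.5 times the keep alive interval.

```python
# Hypothetical simulation; none of these names come from MoP.
KEEP_ALIVE = 60               # seconds, as negotiated in CONNECT
TIMEOUT = KEEP_ALIVE * 1.5    # server-side grace period from the MQTT spec


def simulate(duration, publish_interval, forward_pings=True):
    """Return the brokers that timed out the proxy's forwarding connection.

    The client publishes only to a topic owned by broker "A". Broker "B"
    holds a second forwarding connection that carries no application traffic.
    """
    last_seen = {"A": 0, "B": 0}   # last time each broker heard from the proxy
    last_client_packet = 0         # last time the client sent anything at all
    dropped = set()
    for now in range(1, duration + 1):
        if now % publish_interval == 0:
            # PUBLISH reaches broker A only and resets the client's ping timer.
            last_seen["A"] = now
            last_client_packet = now
        elif now - last_client_packet >= KEEP_ALIVE:
            # The client was silent for a full interval, so it sends PINGREQ.
            last_client_packet = now
            if forward_pings:
                # Assumption: the proxy relays the ping to every broker.
                for broker in last_seen:
                    last_seen[broker] = now
        for broker, seen in last_seen.items():
            if now - seen > TIMEOUT:
                dropped.add(broker)
    return dropped


busy_client = simulate(duration=300, publish_interval=10)
idle_client = simulate(duration=300, publish_interval=1000)
```

In this model the busy client never goes quiet long enough to ping, so the idle connection to broker "B" expires, while the completely idle client keeps both connections alive through its pings.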

Describe the solution you'd like

Remove the need for a MoP proxy altogether. Terminate MQTT as soon as possible and translate into Pulsar on the same broker the client connects to. Instead of having each MoP broker handle only the subset of topics on that broker, have the MoP broker do a lookup. If the lookup resolves locally, handle the operation much as it is handled now; if it resolves to a different broker, create a PulsarClient to that broker to handle the operation. Alternatively, a lot of work could possibly be saved by just having a PulsarClient that is already connected to all brokers, forwarding operations through it, and relying on its own mechanism to route to the correct broker.
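The lookup-then-route idea could be sketched roughly as follows. Everything here is illustrative: `Broker`, `handle_mqtt_publish`, and `pulsar_publish` are hypothetical names, the ownership map stands in for a real Pulsar topic lookup, and the direct method call stands in for a Pulsar client connection to the owning broker.

```python
class Broker:
    """Toy broker that terminates MQTT locally and routes at the Pulsar level."""

    def __init__(self, name, ownership, cluster):
        self.name = name
        self.ownership = ownership   # shared map: topic -> owning broker name
        self.cluster = cluster       # shared map: broker name -> Broker
        self.stored = []             # messages persisted on this broker
        cluster[name] = self

    def lookup(self, topic):
        # Stand-in for a real topic lookup.
        return self.ownership[topic]

    def handle_mqtt_publish(self, topic, payload):
        # The MQTT session ends here; only the translated operation moves on.
        owner = self.lookup(topic)
        if owner == self.name:
            self.stored.append((topic, payload))
            return "local"
        # Remote owner: forward as a Pulsar-level publish, not as MQTT.
        self.cluster[owner].pulsar_publish(topic, payload)
        return "remote:" + owner

    def pulsar_publish(self, topic, payload):
        # Stand-in for receiving a publish over the Pulsar protocol.
        self.stored.append((topic, payload))


ownership = {"t/a": "b1", "t/b": "b2"}
cluster = {}
b1 = Broker("b1", ownership, cluster)
b2 = Broker("b2", ownership, cluster)
```

With this shape, the MQTT connection and its state stay on the broker the client reached, regardless of which broker owns the topic.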

I get the concern of not wanting to handle the routing on the Pulsar broker, as Pulsar has obviously made the decision not to do this itself, and Pulsar has a Pulsar Proxy to do something very similar. The reality is that the MoP proxy already runs on the broker. Rather than using Pulsar to forward the requests, it forwards MQTT as MQTT messages, which creates a lot of complicated problems, especially because the MoP proxy's forwarding is not implemented as a proper MQTT client. Forwarding MQTT the way the proxy does takes special consideration, because that's not really the way MoP was designed.

It really seems like the benefits of a MoP proxy are almost entirely lost, but the complexity remains. I know there is a proposal to move the proxy onto the Pulsar Proxy, but there's not really a current effort to do that. Even if the effort were made, the problem of complexity would persist. Moving to the Pulsar Proxy would finally achieve the intended design, but as it stands the complexity of having the MoP proxy doesn't seem to add any benefit in any regard I can think of. If load balancing is an issue, I'd simply recommend telling users to run MoP behind a load balancer.

Describe alternatives you've considered
Continuously patching issues in the MoP proxy.

@tsturzl
Contributor Author

tsturzl commented Jan 26, 2024

We are going to move in this direction whether or not we agree on this change. We strongly believe the MoP proxy does not work and cannot be easily modified to work. It carries too much state between separate connections. The MQTT protocol should be terminated as soon as possible to reduce complexity.

@StevenLeRoux

@tsturzl are you aware of any current effort for better MQTT protocol support?

@tsturzl
Contributor Author

tsturzl commented Nov 26, 2024

@StevenLeRoux We are likely moving away from this project, and will likely build our own solution to mediate MQTT access into Pulsar. The problem with this project seems to be in the design: each MQTT broker instance only serves the data on that local broker. If you reach a broker directly to access a namespace bundle it doesn't hold, the MQTT connection is dropped erroneously and you'll see a log from Pulsar saying something like "bundle not served by broker". Instead of having each MQTT broker open a Pulsar protocol connection to the broker that holds the bundle, their solution was to create an MQTT proxy layer: yet another service that each broker runs, exposing another MQTT service, which looks up which broker holds the bundle and then connects to that broker's MQTT service. The forwarding logic is wrong though, and the developers don't seem to want to address the issue.

The root problem, as I've detailed in a few other tickets, is that each connection to the MQTT proxy forwards off to numerous other MQTT broker connections. This forwarding means that if your MQTT client sends an MQTT PING, it gets a response for every individual forwarding connection it has. To make matters worse, forwarding connections seem to duplicate, as in you might have 10 connections to a single broker for a single proxy connection. So you might get 30 PING responses. What makes this problem even worse is the fragmentation of state on the connection side. The proxy holds some of the state, but so do all of the forwarded connections, so if a forwarded connection dies and reconnects, the only hope is to kill the entire session, because it's really complicated to try to reconcile the state among all of these connections, which are all just MQTT connections. So if you have persistent state set, the management of that is fragmented between these connections in a way that I don't believe even works currently, though I've never tested it. The other problem with this solution is that these connections from the MoP Proxy are timing out on MQTT keep alive: not all of them get messages routinely, and MQTT keep alive only sends out the ping in the absence of other messages, so these forwarding connections routinely time out and disconnect the client.
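The PING fan-out described above works out as simple arithmetic. The sketch below is hypothetical, not MoP code: `ForwardingConnection` and `relay_ping` are invented names, and the count of three brokers is only inferred from the figures in this comment (10 duplicated connections per broker, 30 responses in total).

```python
class ForwardingConnection:
    """Stand-in for one proxy-to-broker MQTT connection."""

    def __init__(self, broker):
        self.broker = broker

    def ping(self):
        # Each forwarding connection answers the relayed PINGREQ on its own.
        return ("PINGRESP", self.broker)


def relay_ping(forwarding_connections):
    """Relay one client PINGREQ on every connection and collect the replies."""
    return [conn.ping() for conn in forwarding_connections]


# Assumed layout: 3 brokers, each with 10 duplicated forwarding connections.
connections = [
    ForwardingConnection(f"broker-{n}")
    for n in range(1, 4)
    for _ in range(10)
]
responses = relay_ping(connections)
```

A client that sent one PINGREQ expects one PINGRESP; in this model it would see `len(responses)` of them instead.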

My hope was to agree with the developers on removing this MQTT proxy, as I don't believe StreamNative has a lot of expertise around MQTT, which is fine and understandable. The problem is that they simply won't respond to me, and often completely deny that this problem exists, despite my providing several ways for them to reproduce the issue and a detailed explanation of the issue accompanied by TCP data from Wireshark. We've basically gone to the lengths of providing a full audit and breakdown of the bug, and even offered to coordinate on the solution. They have yet to fully acknowledge that the problem exists. Meanwhile, StreamNative reps assure me that big name clients use this product. Having already fixed 3 major behavioral bugs in this project, I can confidently say that whatever big name companies are using this either have exceedingly simple and non-critical use cases for it, or have simply piloted some proofs of concept without actively using it for any actual production workload.

Our design proposal is simply that, instead of running 2 MQTT services on different ports, one of which forwards to the others, the MQTT translation layer should terminate into Pulsar as soon as possible. Take MQTT data and convert it into Pulsar, and let the Pulsar proxy or the Pulsar client work out the routing to the appropriate broker. They apparently are worried about connections, because I believe the Pulsar client, when connecting to brokers directly, creates a connection per broker, which is exactly why the Pulsar Proxy (different from the MoP Proxy) exists. It's all complicated, and likely much more than it needs to be. It seems like they are trying to optimize for the number of TCP connections. Meanwhile, I'm fairly certain an MQTT client connecting to the MoP proxy and interacting with many topics creates an incredible number of forwarding connections. They deny this, despite my having observed it in both a debugger and Wireshark.
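As a rough illustration of what "terminate into Pulsar as soon as possible" could look like, here is a minimal sketch. It assumes a few things that are not from MoP or the Pulsar client library: `FakePulsarClient` is a stand-in that records sends instead of using the network, `TerminatingSession` is a hypothetical name, packets are plain dictionaries, and MQTT topic names are passed through unchanged rather than mapped to Pulsar topic names.

```python
class FakePulsarClient:
    """Stand-in for a Pulsar client; routing to brokers would happen inside it."""

    def __init__(self):
        self.sent = []

    def send(self, topic, payload):
        self.sent.append((topic, payload))


class TerminatingSession:
    """One MQTT session, fully terminated where the client connected."""

    def __init__(self, client_id, pulsar):
        self.client_id = client_id
        self.pulsar = pulsar
        self.subscriptions = set()   # all session state lives in one place

    def on_packet(self, packet):
        kind = packet["type"]
        if kind == "PINGREQ":
            # Answered locally: exactly one PINGRESP, nothing is forwarded.
            return ["PINGRESP"]
        if kind == "SUBSCRIBE":
            self.subscriptions.add(packet["topic"])
            return ["SUBACK"]
        if kind == "PUBLISH":
            # Translate to a Pulsar-level send; no MQTT leaves this process.
            self.pulsar.send(packet["topic"], packet["payload"])
            return ["PUBACK"] if packet.get("qos", 0) == 1 else []
        raise ValueError(f"unsupported packet type: {kind}")


pulsar = FakePulsarClient()
session = TerminatingSession("client-1", pulsar)
```

Because keep alive and session state are handled on the single client connection, there are no per-broker MQTT connections left to time out or to reconcile.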

We are looking at creating an alternative solution that doesn't even need to run in-process as a protocol handler, which would allow for better scaling as a separate process. The issue is that I work for a for-profit company, and this is my day job. It's much easier to contribute to an already open source project. I can't say for certain that an in-house solution we create will ever be open source, and it's not really something I'm at liberty to decide.

TL;DR, there is an issue I've gone to great lengths to detail, prove, and reproduce. The maintainer of this project has routinely rejected the issue, despite the evidence, and won't even acknowledge my requests to explain places in the code that exhibit the exact behavior I've described. We're planning, in the coming months, to replace MoP internally, but that replacement may never be open sourced, because that's not a decision I'm at liberty to make.


2 participants