Short-lived TCP connection feature request. #8408
I have 6 Logstash nodes receiving 80k events per second at peak time. The connections are a mix of plain TCP and Filebeat (TCP port 5044).
For the case of node failure, I hid the hostnames behind an internal DNS. Clients know only the domain name, and the DNS for Logstash runs with a round-robin policy.
The problem happens when I restart one or more of the Logstash servers. Because they all hold long-lived TCP connections, traffic doesn't come in to a newly started Logstash node. Here, host csb70112 was restarted, and the traffic routed to the other nodes doesn't come back (red line).
If Logstash could support a 'short-lived TCP' option, closing a connection after receiving a certain number of events or after a certain amount of time has passed (given as a config value), I think load balancing would work nicely. Would this be a possible feature?
Thanks for the report! This certainly is an interesting problem. If I understand your question, there are two possible solutions: (1) the clients periodically reconnect on their own, or (2) the Logstash server closes connections after some interval.
It would seem to me that solution 1 is preferable, as the server doesn't need to make any assumptions. I'm curious as to your thoughts?
This should be moved to the tcp output plugin repo. Regarding the concern: a failed node should be detected automatically and ignored by the output, so no "periodic reconnection" is necessary. Further, with TCP we have no way of knowing when our data is actually received by the remote side, so it is very difficult to determine a safe time to destroy a healthy connection in a way that doesn't lose data. If you want to discuss this further, please move this issue to the logstash-output-tcp plugin repo.
@jordansissel I think I read this differently. This is about the TCP input and the Beats input. The problem isn't failover detection, but rather this: say you have 4 LS hosts and 20 clients. When the clients connect, they may load balance across the hosts. However, if you lose 2 hosts, the clients connected to the two that died will reconnect to the two live ones. The problem seems to be that when two replacement hosts are brought up, the clients don't reconnect and rebalance. I'm going to reopen this so we can hear back from @dustinhyun to clarify.
I assumed tcp output. I don't know, given the description -- will wait until we have more details.
This is about the TCP/Beats input, as @andrewvc said. (I'll call it the front-end Logstash in the following.)
We have a bunch of systems. Some of them send logs via Filebeat, while others use TCP directly, and new input protocols may be added if requested. The front-end Logstash handles the various protocols and throttles input if needed. Having short-lived TCP connections at the front-end Logstash could be a much simpler solution for load balancing than asking each client to implement TCP reconnects.
@dustinhyun thank you for the clarification
I'm not sure I understand. Let's step back a moment: what load are you looking to balance? A "connection" is not a unit of "load" for either the Beats or TCP inputs. Each connection costs basically one socket but otherwise has no persistent CPU resource consumption. The data flowing over a connection is what consumes resources, not the connection itself. This is to say that a perfectly balanced connection count (N connections distributed evenly across M Logstash instances) does not mean you will have perfectly balanced resource consumption across Logstash nodes.
So let's back up and define what you mean by "load" and "balance". What is load for you? What is balance for you?
As an aside, the TCP input randomly closing an otherwise healthy connection will cause data loss of up to the total of the client send buffer + window size + server receive buffer (which can be several megabytes) for each terminated healthy connection. I am strongly against this approach for TCP because of the data loss scenarios it enables.
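As a rough illustration of that aside (a minimal Python sketch, not anything from the thread): the kernel's default socket buffer sizes already bound how much unread data a server-side close can strand, before even counting the TCP window.

```python
import socket

# A healthy TCP connection can hold data "in flight" in three places: the
# client's send buffer, the network (up to the TCP window), and the server's
# receive buffer. A server-side close discards whatever it has not yet read.
# This only inspects the kernel's default buffer sizes to show the scale.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()

print(f"default send buffer:    {sndbuf} bytes")
print(f"default receive buffer: {rcvbuf} bytes")
```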
Is it that you want to have unmanaged beats and tcp clients discover new endpoints automatically without intervention? |
Can you describe the negative impact of an underutilized Logstash node when otherwise logs are flowing without latency to other Logstash nodes? |
@jordansissel Yes, a connection is not a unit of load. But if it is reconnected periodically, DNS round robin has a chance to route the connection to another Logstash node.
AS-IS:
logstash1
[----------data over connection ---------] => logstash2 (keeping connection)
logstash3
Suggested:
[----data---] ===============================> logstash1
reconnect [---data---] ==================> logstash2
reconnect [---data---] =====> logstash3
So what I said was to manage load by reconnection. This may not be perfect, but it reduces connection starvation of a newly added Logstash node. (reference: http://www.ateam-oracle.com/long-lived-tcp-connections-and-load-balancers/)
And you're right! I missed the point you mentioned about data loss. Now I clearly understand that closing the connection on the server side is not a good idea. My idea came from the typical web server connection, which is closed (keep-alive off where possible) right after sending a response. But now I understand that this shouldn't be done in the TCP/Beats input. If I need to close the connection periodically, it should be done on the client side (the 'sender', specifically).
The negative impact we saw happened when all Logstash nodes were receiving the full traffic they could handle. To upgrade Logstash to 5.6.0, we restarted them, and one node came back slightly earlier than the others. Log traffic surged into that Logstash node (the yellow one in the graph), so we restarted it (10:43). Then the other two started to receive twice the traffic the earlier one had in total. I think the maximum event traffic a Logstash (5.6.0) node can handle is about 30k/sec in our environment. And here again, the restarted node had no connections, which means we couldn't utilize the resources of the newly added Logstash.
image: https://user-images.githubusercontent.com/4222998/30997258-87f2c882-a501-11e7-86c4-67214f4d5923.png
Now I understand why you talked about the output plugin in your earlier comment. I'll check whether the clients using my Logstash support short-lived connections, and I'll reopen this issue if they use Logstash to send logs and want this feature added. Thank you for the clarification at the implementation level.
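A minimal sketch of the client-side reconnect idea described above, assuming a hypothetical round-robin DNS name logstash.internal.example and a newline-delimited tcp input on port 5000 (both invented for illustration):

```python
import socket

LOGSTASH_HOST = "logstash.internal.example"  # hypothetical round-robin DNS name
LOGSTASH_PORT = 5000                          # hypothetical tcp input port
EVENTS_PER_CONNECTION = 10_000                # reconnect after this many events

def send_events(events):
    sock = None
    sent_on_conn = 0
    for event in events:
        if sock is None:
            # Each connect re-resolves the DNS name, so round robin can
            # route us to a different (possibly newly added) Logstash node.
            sock = socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT))
            sent_on_conn = 0
        sock.sendall(event.encode() + b"\n")
        sent_on_conn += 1
        if sent_on_conn >= EVENTS_PER_CONNECTION:
            # Client-side graceful close: the kernel delivers any buffered
            # data before the FIN, so no in-flight events are abandoned.
            sock.close()
            sock = None
    if sock is not None:
        sock.close()
```

Because the close happens on the sending side after the last write, rebalancing this way does not strand unread data the way a server-side close would.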
All this detail helps a lot!
For Beats, we are probably going to work on adding node discovery to the protocol, so Beats clients can periodically learn about the set of valid Logstash endpoints and do connection distribution using that.
For TCP, we don't have much option without an additional protocol (which is more complex, like Beats).
This is exactly why Beats has the "ttl" option.
@praseodym even the "ttl" option was not a good solution to this problem. We need to solve it without having clients and servers arbitrarily closing connections.
With the Beats protocol this can work, because the protocol is stateful and data is acknowledged. Long term, we are likely to add discovery to the Beats protocol so that it can learn about available endpoints and balance accordingly (and the TTL mode will likely be removed).
With plain TCP, I cannot think of a way that closing a healthy connection would not sometimes cause data loss. At a minimum, TCP clients doing this must set SO_LINGER in order not to abandon in-flight data when closing, and I have significant doubts that most clients will remember to do this. So at minimum we would be adding a protocol on top of TCP, with a ceremony that clients must set SO_LINGER or risk data loss.
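For concreteness, here is what that SO_LINGER ceremony looks like for a plain TCP client. This is a hedged sketch (the endpoint name and timeout are invented), not an endorsement of the approach:

```python
import socket
import struct

# Hypothetical Logstash tcp input endpoint, for illustration only.
sock = socket.create_connection(("logstash.internal.example", 5000))

# Enable SO_LINGER (l_onoff=1) with a 30-second timeout: close() will now
# block until the kernel has transmitted, and the peer has acknowledged,
# all buffered data, or until the timeout expires. Without this, a client
# that writes and immediately exits can abandon in-flight events.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 30))

sock.sendall(b"last batch of events\n")
sock.close()  # blocks for up to 30 seconds while remaining data drains
```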
@jordansissel Re: discovery in the Beats protocol; I think this would be very hard to get right as well, especially when running Logstash in autoscaled environments (e.g. on Kubernetes, as mentioned in #8242) with an external load balancer in front. Plain TCP is even harder, indeed. Maybe an idea would be to write a new lightweight Beat that works similarly to netcat, to which data can be piped.
I disagree on the difficulty, and autoscaling (whatever this means to anyone) doesn't make it any harder. Regardless, one still needs to know how to ask for the list of nodes providing a service (DNS, ZooKeeper, etcd, etc. are all used for this kind of discovery service elsewhere, for example).
For the Beats protocol, I generally do not recommend a load balancer: they tend to behave poorly with respect to actual "load" balancing over non-HTTP protocols, and they often give false negatives on health checks, causing a healthy node to be removed from the pool. Most load balancers fixate on HTTP and behave very badly with the Beats protocol, so if you insist on having one, make sure it behaves well with Beats.
Whether it is hard or not is unimportant, imo. We will very likely provide a way to discover nodes and will use this to teach Beats/Logstash about other Logstash nodes. :)
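As a sketch of the DNS flavor of that discovery (the service name is again a made-up placeholder): a client can re-resolve the name on an interval and spread its connections over every address returned, with no load balancer in the path.

```python
import socket

def discover_logstash_nodes(name="logstash.internal.example", port=5044):
    """Return the current set of IP addresses behind a DNS name.

    Re-running this periodically lets a client notice Logstash nodes that
    were added or removed, without any extra discovery protocol.
    """
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

print(discover_logstash_nodes())
```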
Given that the original description in this issue is a request for "short-lived TCP connections" implemented on the server side (the tcp input) of Logstash, and that I think I did an OK job of describing why this will cause data loss and is not something I want to make available, I'm going to close this. For other discovery or load balancing concerns, I invite you to open a new issue specific to the problem (not the solution; for example, not "connection TTLs").