All output fail if one of many outputs fails #8524
Comments
We agree that this is a problem! What is your preferred solution here? First, if there are outputs that fail without blocking the whole pipeline, that is a bug and a violation of our durability guarantees. Which ones do not behave like this? My second question is, what is your desired behavior? If an output is broken we can either:
For your use case, which is preferable? |
Hi @andrewvc, thanks for your reply.
It depends. We would prefer to temporarily store the data somewhere for recovery. However, we are also using Kafka as a message bus, from which we can retrieve the data that has already passed through and let Logstash process it again, without storing it again on Logstash's side. |
@fluency03 this is where it gets so challenging. Where should we store it while we wait? The local FS, somewhere else? What would your preference be? I'm thinking the best thing to do here would be to add a new We could later add a |
Even just dropping events in the offending output and logging a WARN to stderr would be better than the current behaviour of one plugin being able to lock up the entire logstash instance. |
@wadejensen A custom plugin would allow you to make this decision, should you want it. Logstash is designed, and intended, to never drop data. The idea of dropping data when an output is slow or unreachable is not something I will be easily convinced of. The impact of our existing design choice is that Logstash goes as fast as the slowest component. An output "failing" is a very complex subject. Failure is subjective for everyone -- there are so many symptoms which could be classified as a network partition, temporary or otherwise, and you are asking to drop data any time there is any kind of fault. In many ways, an overloaded server is fundamentally indistinguishable from a failed server. If you are open to data loss during network partitions or other faults, you have a few options for outputs:
My entire sysadmin/operations experience has informed Logstash's never-drop-data design. I am open to discussing other behaviors, but it will take more than saying "it would be better to drop" to convince me. This is not meant as a challenge, but to say that I have considered these concerns for many years and I still resist most requests to drop data during network faults. I am listening to you, though, and I appreciate this feedback and discussion. |
We have some ideas (@andrewvc's been exploring them) for adding some kind of stream branching where you can, by your pipeline's design, have lossy/asynchronous outputs in a separate pipeline but still have strong delivery attempts on other outputs. I don't know how this will look in the long-term, but it is on our radar. It's less a checkbox to enable "drop data when an output is having problems" and more a way to model your pipeline's delivery priorities. |
The wonderful thing to look forward to in 6.0 is independent pipelines (yay!). While the feature itself doesn't solve the problem you're describing, it provides easier methods to mitigate it while better, more complete solutions are worked on. Imagine a single Logstash pipeline that receives from source S, processes the events, and broadcasts them in parallel to n instances of a broker (Redis, Kafka, etc.). Then you can have independent "publishing" instances each reading from their own broker instance and shipping to the intended outbound service independently of the others. The best part of 6.0 is that all of these pipelines would exist within the same JVM, rather than separate instances. With Monitoring enabled, you'd be able to see individual flow rates for each of the "output" pipelines. In the future, Logstash may (subject to change or alteration at any time) allow you to route this traffic flow internally, removing the need for the broker altogether, via the stream branching flow that @jordansissel just mentioned. The team is aware of the shortcomings and is working on ways to improve things. |
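For reference, the layout described above could be sketched in pipelines.yml roughly like this; the pipeline IDs, config paths, and broker choice are illustrative placeholders, not a recommended setup:

```yaml
# pipelines.yml (Logstash 6.0+): one JVM, several independent pipelines.
- pipeline.id: ingest
  # receives from source S, runs the filters, and broadcasts to the brokers
  path.config: "/etc/logstash/conf.d/ingest.conf"
- pipeline.id: publish-es
  # reads its own broker (e.g. a Kafka topic) and ships to Elasticsearch
  path.config: "/etc/logstash/conf.d/publish-es.conf"
- pipeline.id: publish-downstream
  # reads another broker and ships to the downstream consumers
  path.config: "/etc/logstash/conf.d/publish-downstream.conf"
```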
@untergeek thank you for describing this better than I was able ❤️ |
@jordansissel Thanks for your response and suggestions. I think we are optimising for different things here so I doubt one will convince the other, but I'll try and express my take. You've said that intentional data loss or dropping data is not the answer, but from my perspective the current solution is unintuitive and does cause data to be lost. My use case: I have 120 baremetal nodes each running Filebeat to collect logs written to the baremetal filesystem, each of which passes logs to one of 3 dockerised Logstash instances in a round-robin format. We run a multitenant platform in which only privileged users have access to the baremetal filesystem, so this Logstash is the only access which users have for retrieving logs created by their various applications. There are outputs to elasticsearch, Kafka and other Logstash instances downstream via TCP which are managed by the users, where they can create their own filters. For me, it beggars belief that my platform Logstash service should be brought to its knees if a user takes down their downstream Kafka or Logstash for some reason. The messages I am responsible for delivering to Elasticsearch do not get delivered, and are effectively lost unless I manually retrieve them from the baremetal filesystem or replay the Filebeat events. You're not really preventing data loss, just punting the responsibility to the upstream service, or in the case of your UDP suggestion, the upstream service. I think where we clash is that our system is designed to service multiple use cases, and Logstash as designed does not cater to that; rather, it assumes there should be one Logstash per output in most cases, particularly in a multitenant environment. This just feels like a shame and a missed opportunity from my perspective, but it's not your job to make free software for me that fits my needs; it's my job to pick software which meets them. To that end, it might be beneficial to point out prominently in the documentation that Logstash in the current single-pipeline mode operates as slowly as the slowest output, as I don't think this is intuitive to users. It's an easy expectation to have, and it's understandable that we might get upset when the software doesn't do what we thought it said it would on the tin. Thank you very much for your contributions to Logstash. |
@wadejensen are you aware of the upcoming multiple pipelines feature in 6.0? Does that change things?
https://www.elastic.co/guide/en/logstash/master/multiple-pipelines.html
|
Thanks for all of your discussion. I also saw this regarding multiple pipelines:
I think this is a really good design for logical separation. |
Today, we encountered another output blockage: the elasticsearch output also blocks other outputs when the elasticsearch output plugin hits an error creating an index. |
Even though multiple pipelines could partially solve this problem, the problem still remains. For example, there could also be multiple outputs within a single pipeline, and all outputs in that pipeline can still be blocked because one output of that pipeline is blocking. |
I am also wondering whether it would be a good idea to have a 'rescue' output. For example, suppose the data I receive is somehow incomplete and is passed to the elasticsearch output, and my elasticsearch output is doing dynamic indexing like:
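(The original snippet did not survive this rendering; a typical field-based index setting looks roughly like the following, with the field name being a placeholder.)

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # hypothetical field-based index name; a malformed event can produce an
    # invalid or unwanted index name here
    index => "%{[service]}-%{+YYYY.MM.dd}"
  }
}
```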
However, due to the incomplete data, the wrong index cannot be created (and also because I have given rules to the user). Now, Logstash will keep repeating this action as follows, blocking its own output to elasticsearch, which in turn blocks all other outputs.
I am thinking: could we have a mechanism or special output plugin which will rescue us from the output failure? For example, the previous blocking is caused by wrong indexing; then what we could have might look like this:
Then the failed data would be reindexed to a separate index. Maybe it's similar to this:
That is more a case where you know what the filter plugin failures are and you act on them in the output section, whereas the rescue would be an action on the output plugins themselves. Should I open another issue for discussing output failure rescue? |
We have such a mechanism today. It's too hard to use, but it exists. This mechanism is called the dead letter queue (DLQ). https://www.elastic.co/guide/en/logstash/current/dead-letter-queues.html We only currently deliver things to the DLQ if it is something we consider "permanently undeliverable" -- and there are basically only two cases for this: One, on a mapping exception from ES which is unfixable without destroying data. Two, when an index is closed (which is debatable, since you can open the index), but it is at least a property of the event. I don't think I would consider sending 403s to the DLQ by default. Maybe we can make it configurable, but never by default. If you want such a feature, please open an issue on the logstash-output-elasticsearch repo :) |
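As a rough illustration of the DLQ mechanism described above: the real setting is dead_letter_queue.enable: true in logstash.yml, and a separate pipeline can read the queue back and reprocess it. The path and target index below are placeholders, not a prescription:

```
input {
  dead_letter_queue {
    # default DLQ location under path.data; adjust to your installation
    path => "/usr/share/logstash/data/dead_letter_queue"
    commit_offsets => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "dlq-reprocessed-%{+YYYY.MM.dd}"
  }
}
```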
Even so, as mentioned here by @jordansissel:
As far as I am concerned, this should be a design matter of Logstash in the first place, i.e., would it be more reasonable to make each plugin work separately, asynchronously, and "reactively", so that one plugin's failure won't have an impact on the others? I am wondering how the Logstash plugins (input/filter/output) work.
Thanks :) |
Having a similar issue with the elasticsearch and kinesis plugins. |
Having the same issue with the kafka and tcp output plugins: if the kafka plugin fails, then no data is delivered to fluentd over TCP.
|
#9225 may address the concerns raised in this thread. Would that approach be useful to those of you facing this issue? |
Has there ever been a solution that fixes this issue? One failure blocks all other outputs and tends to fill up logs rather quickly. eg. [2019-09-26T13:47:25,865][WARN ][logstash.outputs.syslog ] syslog tcp output exception: closing, reconnecting and resending event {:host=>"X.X.X.X", :port=>514, :exception=>#<Errno::ECONNREFUSED: Connection refused - connect(2) for "X.X.X.X" port 514>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:135:in |
@BobTheBuilder7828 Look into pipeline-to-pipeline communication, as this allows you to create discrete pipelines per output. |
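A minimal sketch of that pipeline-to-pipeline layout, with made-up pipeline names and outputs (each block below lives in its own pipeline, wired together via pipelines.yml):

```
# distributor pipeline: fans events out to one pipeline per output
output {
  pipeline { send_to => ["es_out", "tcp_out"] }
}

# es_out pipeline
input  { pipeline { address => "es_out" } }
output { elasticsearch { hosts => ["localhost:9200"] } }

# tcp_out pipeline
input  { pipeline { address => "tcp_out" } }
output { tcp { host => "10.0.0.1" port => 5000 mode => "client" } }
```

Note that with the default in-memory queues a blocked downstream pipeline still back-pressures the distributor, so this is usually combined with persistent queues on the per-output pipelines (or with ensure_delivery => false, discussed further down).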
Is there any way to "throttle" the error output and/or the retry delta? Either have a pre-determined/set period for a retry, or have it back off (30s, 60s, 120s, etc.), so as to not pollute the logs so badly? |
Exponential back-off settings will depend on the plugin. You can't disable or throttle the warning messages, but you might be able to use an API call to set the level to ERROR rather than INFO. See https://www.elastic.co/guide/en/logstash/current/logging.html for information on that. |
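Something along these lines, per the logging docs linked above (the logger name, host, and port will differ per setup); this raises the level of one noisy logger at runtime:

```
curl -XPUT 'localhost:9600/_node/logging?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "logger.logstash.outputs.syslog" : "ERROR" }'
```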
Honestly, I'd be more concerned that port 514 is unavailable on the remote side. That's a pretty standard service (syslog), and it being unavailable is an abnormality that should be logged pretty heavily. This is why for a syslog output, Logstash does not throttle retries or error messages. It's a tcp port that is expected to be open and remain open at all times. |
Yeah, I agree … just planning for if/when an output host is offline for some reason … do not want it to kill the entire thing. Thank you for your replies. |
Sadly I'm facing the same issue. In my output section, there is an if-else condition to output to our internal Elasticsearch server or an external Kafka server based on a tag on the document. But if Kafka fails (lost connection, Kafka broker not available), all output to my internal Elasticsearch also stops working. |
Sending to multiple pipelines (basically replicating the data), then acting on those individual pipelines (even if they are doing the same operation, just to a different host) was the only way I was able to get around this issue. One failed host for output kills the whole works unless you break it apart (pipeline-to-pipeline) as outlined above. |
It worked. But Logstash says 1 pipeline worker per CPU core, so it doesn't seem a feasible solution if you have many pipelines. |
That number is only a default setting. You can dial it up higher manually. |
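For example, pipeline.workers can be set per pipeline in pipelines.yml; the IDs and paths here are placeholders:

```yaml
# pipelines.yml: override the per-pipeline worker count (default: one per CPU core)
- pipeline.id: es_out
  path.config: "/etc/logstash/conf.d/es_out.conf"
  pipeline.workers: 4
- pipeline.id: kafka_out
  path.config: "/etc/logstash/conf.d/kafka_out.conf"
  pipeline.workers: 1
```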
Also, Logstash doesn't start if an output cannot connect (for example, to a RabbitMQ server). Using pipeline-to-pipeline to separate input from output doesn't work: the input starts, but the Logstash HTTP API doesn't, whereas we use this endpoint to mark the Docker container as ready. I would like to control pipeline flow better. What about a new setting
Also, in the case of multiple outputs, another new setting to configure the behavior,
Or an option in the pipeline definition, such as:
|
There are edge cases I've witnessed which can cause the outputs of multiple pipelines to become blocked. |
In my case I forward all prod data to a dev cluster; by just having multiple outputs in Logstash, the dev cluster going down takes out prod as well. Regarding Logstash's design to never drop data, that's already violated with "ensure_delivery => false"; its use is just obscured. For those who want to see the solution in config rather than in this discussion:
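A sketch of the pattern being described (not the poster's original config, which did not survive this rendering); the pipeline names are placeholders:

```
output {
  # durable branch: blocks (and back-pressures) if prod_es is unavailable
  pipeline { send_to => ["prod_es"] }
  # best-effort branch: events are dropped if the dev_es pipeline is down
  pipeline { send_to => ["dev_es"] ensure_delivery => false }
}
```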
You can also place Perhaps this could be implemented internally by having a base output class flag like "nonblocking" which sets up a unique pipeline in the background. The change is repeatable enough to be done computationally. |
@braddeicide The pipeline-to-pipeline communication solution comes with a risk. When the persisted queue of the second pipeline becomes full, the outputs of the first pipeline become blocked, so your production cluster stops receiving data. Of course you can increase the persisted queue size of the second pipeline, but disk space is not unlimited. |
We run a real-time fintech platform which manages billions of calculations daily, each of which needs to be auditable. We run one pipeline per instance, and each pipeline has 4 outputs. One of the outputs failed (TCP) and blocked all the other outputs for one of our instances. In our case, the platform must continue running even if ES isn't working and needs attention, but we must also maintain auditability of the system's actions. Our fallback is for Logstash to write the logs to disk, which we later archive away. We generate a few hundred GB of logs on a daily basis. This had been working well for us until we experienced an outage on our ES server, which in turn resulted in permanent data loss because Logstash decided to block all outputs. A policy born out of intellectual purism doesn't work when it meets the real world. Please reconsider your position and add an option to allow this fallback behaviour. Network outages happen. Plain log files are far more important than ES in a lot of cases (especially in ours; when push comes to shove, the plain text log file is the final point of truth when auditing financial data). A blip on the network should not be the cause of data loss. As for the recommended workarounds, they won't work for us. We must minimise the points of failure in our infrastructure, not increase them to accommodate an obvious flaw in the design of Logstash. |
I would really like to underline @ctgrid's last two paragraphs. This decision not to support options for higher availability and reduced data loss, while probably trivial to implement, is not easily understandable from a real-world perspective. We are outputting our logs to an ELK cluster and a subset of the logs (the auditing part) to a SIEM which isn't highly available. We cannot accept losing logs just because the SIEM is under maintenance. We would really favor @andrewvc's solution, which should be a blast.
|
Update for posterity: We looked at forking/building a plugin and after some soul searching we decided against it. We opted for a risky and unpopular approach (often for good reason) of writing our own solution using Rust and replacing logstash with it. Given the volume of real-time data we're processing and the importance of not losing said data when an endpoint goes down, this ended up being the right decision for us, and one that was well worth the investment. We were able to achieve an order of magnitude improvement in CPU and memory utilisation over logstash, which was unexpected, and very welcome. |
Logstash -> TCP: tcp output exception, EOFError: End of file reached
According to:
This is caused by the shutdown of the TCP server that Logstash's TCP client is trying to connect to, and this is indeed the case.
As stated in the comment at line 153 of tcp.rb in the logstash-output-tcp plugin:
"don't expect any reads, but a readable socket might mean the remote end closed, so read it and throw it away. we'll get an EOFError if it happens."
However:
Logstash stopped outputting events via all output plugins once the TCP output stopped.
Once the TCP output stopped working because the TCP server was down, all of our other outputs stopped working as well, such as Redis.
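A minimal sketch of the kind of multi-output pipeline being described (not the reporter's exact configuration; hosts and keys are placeholders):

```
output {
  # if this TCP endpoint disappears, the whole pipeline, including the redis
  # output below, stops emitting events
  tcp   { host => "downstream.example.com" port => 5000 mode => "client" }
  redis { host => "localhost" data_type => "list" key => "logstash" }
}
```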
According to:
Many people have encountered this situation: all outputs fail if one of many outputs fails. It has behaved like this for a long time, and according to different developers' experience, it depends on the plugin: some plugins' failures will block all outputs, but some won't.
I think this is a serious problem and it has to be fixed. I also think this is a general issue not tied to a single particular Logstash plugin, so I submitted it here.