
All output fail if one of many outputs fails #8524

Open
fluency03 opened this issue Oct 25, 2017 · 36 comments

@fluency03

Logstash -> TCP: tcp output exception, EOFError: End of file reached

According to:

This is caused by the shutdown of the TCP server that Logstash's TCP client is trying to connect to, and this was indeed the case.

As stated in tcp.rb, line 153 of the logstash-output-tcp plugin:


# don't expect any reads, but a readable socket might
# mean the remote end closed, so read it and throw it away.
# we'll get an EOFError if it happens.

However:

Logstash stopped outputting events via all output plugins once the TCP output stopped.

Once the TCP output stopped working because the TCP server was down, all of our other outputs, such as Redis, stopped working as well.

According to:

Many people have encountered this situation: all outputs fail if one of many outputs fails. It has behaved like this for a long time, and according to different developers' experience it depends on the plugin: some plugins' failures block all outputs, while others' do not.

I think this is a serious problem that has to be fixed. I also think it is a general issue not tied to one particular Logstash plugin, so I am submitting it here.

  • Version: 5.6.2
  • Operating System:
  • Config File (if you have sensitive info, please remove it):
  • Sample Data:
  • Steps to Reproduce:
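
For reference, a minimal sketch of the kind of config described above (the input and all hosts/ports are placeholders, not the actual config):

input {
  # placeholder input; in our case events actually arrive from a message bus
  stdin {}
}

output {
  # when the TCP server behind this output goes away, this output blocks...
  tcp {
    host => "some-tcp-server"   # placeholder
    port => 9000                # placeholder
    codec => "json_lines"
  }
  # ...and events stop reaching the other outputs, such as this Redis one
  redis {
    host => "some-redis-host"   # placeholder
    data_type => "list"
    key => "logstash"
  }
}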
@andrewvc
Contributor

We agree that this is a problem! What is your preferred solution here? First, if there are outputs that fail without blocking the whole pipeline, that is a bug and a violation of our durability guarantees. Which ones do not behave like this?

My second question is, what is your desired behavior?

If an output is broken we can either:

  1. Drop all messages going to it resulting in data loss to that output
  2. Buffer those messages to disk (which can't last forever)

For your use case, which is preferable?

@fluency03
Author

Hi @andrewvc, thanks for your reply.

  1. In my case, the TCP output is blocking other outputs (like Redis).

  2. What I am thinking is that it would be better to make this configurable (in logstash.yml or in logstash.conf, which is another decision) and give the option to the users.

As for which is preferable: it depends. We would prefer to temporarily store the data somewhere for recovery. However, we are also using Kafka as a message bus, from which we can retrieve the past data and have Logstash process it again, without storing it again on Logstash's side.

@andrewvc
Contributor

andrewvc commented Nov 3, 2017

@fluency03 this is where it gets so challenging. Where should we store it while we wait? The local FS, somewhere else?

What would your preference be?

I'm thinking the best thing to do here would be to add a new buffering policy to filters and outputs. It would have settings for: blocking (the current behavior) and drop (drop any events, never retry).

We could later add a buffer behavior, but there are a lot of complications and tradeoffs there that would take a while to design. Would you be OK with just dropping events in your use case, @fluency03?

@wadejensen

Even just dropping events in the offending output and logging a WARN to stderr would be better than the current behaviour of one plugin being able to lock up the entire logstash instance.

@jordansissel
Contributor

jordansissel commented Nov 4, 2017

@wadejensen A custom plugin would allow you to make this decision, should you want it.

Logstash is designed, and intended, to never drop data. The idea of dropping data when an output is slow or unreachable is not something I will be easily convinced of. The impact of our existing design choice is that Logstash goes as fast as the slowest component.

An output "failing" is a very complex subject. Failure is subjective for everyone -- there are so many symptoms which could be classified as a network partition, temporary or otherwise, and you are asking to drop data any time there is any kind of fault. In many ways, an overloaded server is fundamentally indistinguishable from a failed server.

If you are open to data loss during network partitions or other faults, you have a few options for outputs:

  1. use the UDP output, assuming DNS is functioning (?), packets will go out and it's up to the network to lose or deliver them.
  2. use something like rabbitmq or redis which allows you to drop data for anything downstream that's not listening (though a failure manifests differently here, and also requires rabbitmq/redis be online).
  3. Write a custom plugin; this could be to fork our existing plugins and implement your dropping behavior yourself.

My entire sysadmin/operations experience has informed Logstash's never-drop-data design. I am open to discussing other behaviors, but it will take more than saying "it would be better to drop" to convince me. This is not meant as a challenge, but to say that I have considered for many years these concerns and I still resist most requests to drop data during network faults. I am listening to you, though, and I appreciate this feedback and discussion.

@jordansissel
Contributor

We have some ideas (@andrewvc's been exploring them) for adding some kind of stream branching where you can, by your pipeline's design, have lossy/asynchronous outputs in a separate pipeline but still have strong delivery attempts on other outputs. I don't know how this will look in the long-term, but it is on our radar. It's less a checkbox to enable "drop data when an output is having problems" and more a way to model your pipeline's delivery priorities.

@untergeek
Member

The wonderful thing to look forward to in 6.0 is independent pipelines (yay!). While the feature itself doesn't solve the problem you're describing, it provides easier methods to mitigate it while better, more complete solutions are worked on.

Imagine a single Logstash pipeline that receives from source S, processes the events, and broadcasts them in parallel to n instances of a broker (Redis, Kafka, etc.). Then you can have independent "publishing" instances each reading from their own broker instance and shipping to the intended outbound service independent from the others. The best part of 6.0 is that all of these pipelines would exist within the same JVM, rather than separate instances. With Monitoring enabled, you'd be able to see individual flow rates for each of the "output" pipelines.

In the future, Logstash may (subject to change or alteration at any time) allow you to route this traffic flow internally, removing the need for the broker altogether, via the stream branching flow that @jordansissel just mentioned. The team is aware of the shortcomings and is working on ways to improve things.
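
For illustration, a pipelines.yml for that broker-based layout might look roughly like this (pipeline ids and paths are made up):

# pipelines.yml -- each pipeline runs independently inside the same JVM
- pipeline.id: ingest
  path.config: "/etc/logstash/conf.d/ingest.conf"      # reads from source S, fans out to the brokers
- pipeline.id: publish-es
  path.config: "/etc/logstash/conf.d/publish-es.conf"  # reads its own broker, ships to Elasticsearch
- pipeline.id: publish-tcp
  path.config: "/etc/logstash/conf.d/publish-tcp.conf" # reads its own broker, ships over TCP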

@jordansissel
Contributor

@untergeek thank you for describing this better than I was able ❤️

@wadejensen

@jordansissel Thanks for your response and suggestions. I think we are optimising for different things here so I doubt one will convince the other but I'll try and express my take.

You've said that intentional data loss or dropping data is not the answer, but from my perspective the current solution is unintuitive and does cause data to be lost.

My use case:
I have 120 baremetal nodes each running Filebeat to collect logs written to the baremetal filesystem each of which passes logs to one of 3 dockerised Logstash instances in a round robin format. We run a multitenant platform in which only privileged users have access to the baremetal filesystem, so this Logstash is the only access which users have for retrieving logs created by their various applications.

There are outputs to Elasticsearch, to Kafka, and, via TCP, to other downstream Logstash instances that are managed by the users, where they can create their own filters.

For me, it beggars belief that my platform Logstash service should be brought to its knees if a user takes down their downstream Kafka or Logstash for some reason. The messages I am responsible for delivering to Elasticsearch do not get delivered, and are effectively lost unless I manually retrieve them from the baremetal filesystem or replay the Filebeat events. You're not really preventing data loss, just punting the responsibility to the upstream service, or, in the case of your UDP suggestion, to the network.

I think where we clash is that our system is designed to serve multiple use cases, and Logstash as designed does not cater to that; rather, it assumes one Logstash per output in most cases, particularly in a multitenant environment.

This just feels like a shame and a missed opportunity from my perspective, but it's not your job to make free software that fits my needs; it's my job to pick software that meets them.

To that end, it might be beneficial to point out prominently in the documentation that Logstash in the current single-pipeline mode operates as slowly as the slowest output, as I don't think this is intuitive to users. It's an easy expectation to have, and it's understandable that we might get upset when the software doesn't do what we thought it said on the tin.

Thank you very much for your contributions to Logstash.

@andrewvc
Contributor

andrewvc commented Nov 4, 2017 via email

@fluency03
Author

Thanks, all, for your discussion.

I also saw this regarding multiple-pipelines:

Having multiple pipelines in a single instance also allows these event flows to have different performance and durability parameters (for example, different settings for pipeline workers and persistent queues). This separation means that a blocked output in one pipeline won’t exert backpressure in the other.

I think this is a really good design for logical separation.

@fluency03
Author

Today we encountered another case of output blocking: the elasticsearch output also blocks the other outputs when the elasticsearch output plugin hits an error creating an index.

@fluency03
Author

Even though multiple pipelines could partially solve this problem, the problem still remains.

For example, there can also be multiple outputs within a single pipeline, and all outputs in that pipeline can still be blocked by one blocking output.

@fluency03
Author

fluency03 commented Nov 15, 2017

I am also thinking about whether it would be a good idea to have a 'rescue' output.

For example, suppose the data I receive is somehow incomplete and is passed to the elasticsearch output, and my elasticsearch output is doing dynamic indexing like:

    index => '%{[@metadata][beat]}-%{+YYYY.MM}'
    document_type => '%{[@metadata][type]}'
    user => 'logstash'
    password => '${LOGSTASH2ELASTICSEARCH}'

However, because the data is incomplete, the resulting index name is wrong and the index cannot be created (also because I have given rules to the user logstash so that it can only create certain types of indices).

Now Logstash keeps repeating the following action, blocking its own output to Elasticsearch, which in turn blocks all other outputs.

[2017-11-15T11:27:26,293][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"security_exception", "reason"=>"action [indices:admin/create] is unauthorized for user [logstash]"})

I am thinking: could we have a mechanism or a special output plugin that rescues from output failures?

For example, the blocking above is caused by wrong indexing. Then what we could have is something like this:

try {
  elasticsearch {
    hosts => ['hostname']
    ssl => true
    cacert => '/etc/logstash/certs/root.cer'
    index => '%{[@metadata][beat]}-%{+YYYY.MM}'
    document_type => '%{[@metadata][type]}'
    user => 'logstash'
    password => '${LOGSTASH2ELASTICSEARCH}'
  }
} rescue {
  elasticsearch {
    hosts => ['hostname']
    ssl => true
    cacert => '/etc/logstash/certs/root.cer'
    index => 'failure-%{+YYYY.MM}'
    document_type => 'failure'
    user => 'logstash'
    password => '${LOGSTASH2ELASTICSEARCH}'
  }
}

Then the failed data would be reindexed into the index failure-%{+YYYY.MM}, no outputs would be blocked, and we would also have a record of which data caused the failure.

Maybe it's similar to this:

if "_jsonparsefailure" in [tags] {
  elasticsearch {
    hosts => ['hostname']
    ssl => true
    cacert => '/etc/logstash/certs/root.cer'
    index => 'failure-%{+YYYY.MM}'
    document_type => 'failure'
    user => 'logstash'
    password => '${LOGSTASH2ELASTICSEARCH}'
  }
} 

This is more like a case where you know what the filter plugin failures are and act on them in the output section.

But the rescue would be an action on output plugin failures.


Should I open another issue for discussing output failure rescue?

@jordansissel
Contributor

I am thinking: could we have a mechanism or a special output plugin that rescues from output failures?

We have such a mechanism today. It's too hard to use, but it exists. This mechanism is called the dead letter queue (DLQ). https://www.elastic.co/guide/en/logstash/current/dead-letter-queues.html

We only currently deliver things to the DLQ if it is something we consider "permanently undeliverable" -- and there are basically only two cases for this: One, on a mapping exception from ES which is unfixable without destroying data. Two, when an index is closed (which is debatable, since you can open the index), but is at least a property of the event.

I don't think I would consider sending 403s to the DLQ by default. Maybe we can make it configurable, but never by default. If you want such a feature, please open an issue on the logstash-output-elasticsearch repo :)
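
For anyone reading along, a rough sketch of the DLQ mechanism (paths and the reprocessing output are illustrative; this is not a suggestion that 403s land there):

# logstash.yml -- enable the dead letter queue (currently fed by the elasticsearch output)
dead_letter_queue.enable: true

# a separate pipeline that reprocesses DLQ entries
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dead_letter_queue"  # typically <path.data>/dead_letter_queue
    pipeline_id => "main"
    commit_offsets => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "failure-%{+YYYY.MM}"   # similar in spirit to the "rescue" index proposed above
  }
}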

@fluency03
Author

fluency03 commented Nov 16, 2017

I understand that, as mentioned here by @jordansissel:

output "failing" is a very complex subject. Failure is subjective for everyone.

But as far as I am concerned, this should be a design consideration for Logstash in the first place, i.e., would it be more reasonable to make each plugin work separately, asynchronously, and "reactively", so that one plugin's failure won't have an impact on the others?

I am wondering how the Logstash plugins (input/filter/output) work.

  • Are they running as different threads? If so, one output plugin should not block the others.
  • Are they working asynchronously or reactively? If so, there should also be no blocking between different plugins.

Thanks :)

@v2b1n

v2b1n commented Dec 8, 2017

Having a similar issue with the elasticsearch and kinesis plugins.
If the elasticsearch plugin fails (e.g. because the ES cluster is unavailable), then no data is delivered to Kinesis either.

@devopsberlin

Having the same issue with the kafka and tcp output plugins: if the kafka plugin fails, then no data is delivered to Fluentd over TCP.

output {
    kafka {
      ...
    }
    tcp {
      host => "kafka-fluentd"
      port => 24224
      codec=> "json"
    }
}

@andrewvc
Contributor

#9225 may address the concerns raised in this thread. Would that approach be useful to those of you facing this issue?

@BobTheBuilder7828

Has there ever been a solution that fixes this issue? One failure blocks all other outputs and tends to fill up logs rather quickly.

For example:

[2019-09-26T13:47:25,865][WARN ][logstash.outputs.syslog ] syslog tcp output exception: closing, reconnecting and resending event {:host=>"X.X.X.X", :port=>514, :exception=>#<Errno::ECONNREFUSED: Connection refused - connect(2) for "X.X.X.X" port 514>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:135:in initialize'", "org/jruby/RubyIO.java:876:in new'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:209:in connect'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:177:in publish'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:50:in encode'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:147:in receive'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in block in multi_receive'", "org/jruby/RubyArray.java:1792:in each'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in multi_receive'", "org/logstash/config/ir/compiler/OutputStrategyExt.java:118:in multi_receive'", "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:101:in multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:239:in block in start_workers'"], :event=>#LogStash::Event:0x2ba1184a}
[2019-09-26T13:47:26,872][WARN ][logstash.outputs.syslog ] syslog tcp output exception: closing, reconnecting and resending event {:host=>"X.X.X.X", :port=>514, :exception=>#<Errno::ECONNREFUSED: Connection refused - connect(2) for "X.X.X.X" port 514>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:135:in initialize'", "org/jruby/RubyIO.java:876:in new'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:209:in connect'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:177:in publish'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:50:in encode'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:147:in receive'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in block in multi_receive'", "org/jruby/RubyArray.java:1792:in each'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in multi_receive'", "org/logstash/config/ir/compiler/OutputStrategyExt.java:118:in multi_receive'", "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:101:in multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:239:in block in start_workers'"], :event=>#LogStash::Event:0x2ba1184a}
[2019-09-26T13:47:27,878][WARN ][logstash.outputs.syslog ] syslog tcp output exception: closing, reconnecting and resending event {:host=>"X.X.X.X", :port=>514, :exception=>#<Errno::ECONNREFUSED: Connection refused - connect(2) for "X.X.X.X" port 514>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:135:in initialize'", "org/jruby/RubyIO.java:876:in new'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:209:in connect'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:177:in publish'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:50:in encode'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:147:in receive'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in block in multi_receive'", "org/jruby/RubyArray.java:1792:in each'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in multi_receive'", "org/logstash/config/ir/compiler/OutputStrategyExt.java:118:in multi_receive'", "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:101:in multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:239:in block in start_workers'"], :event=>#LogStash::Event:0x2ba1184a}
[2019-09-26T13:47:28,885][WARN ][logstash.outputs.syslog ] syslog tcp output exception: closing, reconnecting and resending event {:host=>"X.X.X.X", :port=>514, :exception=>#<Errno::ECONNREFUSED: Connection refused - connect(2) for "X.X.X.X" port 514>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:135:in initialize'", "org/jruby/RubyIO.java:876:in new'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:209:in connect'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:177:in publish'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:50:in encode'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:147:in receive'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in block in multi_receive'", "org/jruby/RubyArray.java:1792:in each'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in multi_receive'", "org/logstash/config/ir/compiler/OutputStrategyExt.java:118:in multi_receive'", "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:101:in multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:239:in block in start_workers'"], :event=>#LogStash::Event:0x2ba1184a}
[2019-09-26T13:47:29,892][WARN ][logstash.outputs.syslog ] syslog tcp output exception: closing, reconnecting and resending event {:host=>"X.X.X.X", :port=>514, :exception=>#<Errno::ECONNREFUSED: Connection refused - connect(2) for "X.X.X.X" port 514>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:135:in initialize'", "org/jruby/RubyIO.java:876:in new'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:209:in connect'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:177:in publish'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:50:in encode'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:147:in receive'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in block in multi_receive'", "org/jruby/RubyArray.java:1792:in each'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in multi_receive'", "org/logstash/config/ir/compiler/OutputStrategyExt.java:118:in multi_receive'", "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:101:in multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:239:in block in start_workers'"], :event=>#LogStash::Event:0x2ba1184a}
[2019-09-26T13:47:30,898][WARN ][logstash.outputs.syslog ] syslog tcp output exception: closing, reconnecting and resending event {:host=>"X.X.X.X", :port=>514, :exception=>#<Errno::ECONNREFUSED: Connection refused - connect(2) for "X.X.X.X" port 514>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:135:in initialize'", "org/jruby/RubyIO.java:876:in new'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:209:in connect'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:177:in publish'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:50:in encode'", "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5/lib/logstash/outputs/syslog.rb:147:in receive'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in block in multi_receive'", "org/jruby/RubyArray.java:1792:in each'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:89:in multi_receive'", "org/logstash/config/ir/compiler/OutputStrategyExt.java:118:in multi_receive'", "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:101:in multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:239:in block in start_workers'"], :event=>#LogStash::Event:0x2ba1184a}

@untergeek
Member

untergeek commented Sep 26, 2019

@BobTheBuilder7828 Look into pipeline-to-pipeline communication, as this allows you to create discrete pipelines per output.
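
For anyone who wants a concrete starting point, here is a rough sketch of that approach (essentially the output isolator pattern from the pipeline-to-pipeline docs; ids, paths, and hosts are placeholders, and the elasticsearch/syslog outputs just stand in for whatever outputs you use):

# pipelines.yml -- one intake pipeline plus one pipeline per output
- pipeline.id: intake
  path.config: "/etc/logstash/conf.d/intake.conf"
- pipeline.id: es-out
  path.config: "/etc/logstash/conf.d/es-out.conf"
  queue.type: persisted
- pipeline.id: syslog-out
  path.config: "/etc/logstash/conf.d/syslog-out.conf"
  queue.type: persisted

# intake.conf -- inputs/filters omitted; fan events out to the per-output pipelines
output {
  pipeline { send_to => ["es-out", "syslog-out"] }
}

# es-out.conf
input  { pipeline { address => "es-out" } }
output { elasticsearch { hosts => ["localhost:9200"] } }

# syslog-out.conf -- if this endpoint is down, only this pipeline's persisted queue backs up
input  { pipeline { address => "syslog-out" } }
output {
  syslog {
    host => "X.X.X.X"
    port => 514
    protocol => "tcp"
    facility => "daemon"
    severity => "informational"
  }
}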

@BobTheBuilder7828

Is there any way to "throttle" the error output and/or the retry delta? Either have a predetermined/set period for a retry, or have it back off (30s, 60s, 120s, etc.), so as not to pollute the logs so badly?

@untergeek
Member

Exponential back-off settings will depend on the plugin. You can't disable or throttle the warning messages, but you might be able to use an API call to set the level to ERROR rather than INFO. See https://www.elastic.co/guide/en/logstash/current/logging.html for information on that.

@untergeek
Member

Honestly, I'd be more concerned that port 514 is unavailable on the remote side. That's a pretty standard service (syslog), and it being unavailable is an abnormality that should be logged pretty heavily. This is why for a syslog output, Logstash does not throttle retries or error messages. It's a tcp port that is expected to be open and remain open at all times.

@BobTheBuilder7828

Yeah, I agree … just planning for if/when an output host is offline for some reason … do not want it to kill the entire thing.

Thank you for your replies.

@duclm2609

duclm2609 commented Feb 26, 2020

Sadly I'm facing the same issue. In my output section there is an if-else condition to output either to our internal Elasticsearch server or to an external Kafka server, based on a tag on the document. But if Kafka fails (lost connection, Kafka broker not available), all output to my internal Elasticsearch stops working as well.

@BobTheBuilder7828

Sending to multiple pipelines (basically replicating the data), then acting on those individual pipelines (even if they are doing the same operation, just to a different host) was the only way I was able to get around this issue. One failed host for output kills the whole works unless you break it apart (pipeline-to-pipeline) as outlined above.

@duclm2609

Sending to multiple pipelines (basically replicating the data), then acting on those individual pipelines (even if they are doing the same operation, just to a different host) was the only way I was able to get around this issue. One failed host for output kills the whole works unless you break it apart (pipeline-to-pipeline) as outlined above.

It worked. But Logstash says to use one pipeline worker per CPU core, so it's not a feasible solution if you have multiple pipelines.

@untergeek
Member

That number is only a default setting. You can dial it up higher manually.
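
For example, pipeline.workers can be set per pipeline in pipelines.yml (ids, paths, and numbers here are only illustrative):

# pipelines.yml
- pipeline.id: main
  path.config: "/etc/logstash/conf.d/main.conf"
  pipeline.workers: 4        # the default is the number of CPU cores on the host
- pipeline.id: kafka-out
  path.config: "/etc/logstash/conf.d/kafka-out.conf"
  pipeline.workers: 2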

@ebuildy
Contributor

ebuildy commented Oct 9, 2020

Also, Logstash doesn't start if an output cannot connect (for example, to a RabbitMQ server). Using pipeline-to-pipeline to separate input from output doesn't work: the input starts, but the Logstash HTTP API doesn't, and we use that endpoint to mark the Docker container as ready.

I would like better control over the pipeline flow.

What about a new setting input_always_up:

  • false ==> current behavior
  • true ==> Logstash starts anyway, then queues events until the output is OK.

Also, in the case of multiple outputs, another new setting to configure the behavior, output_error_behavior:

  • output_queue_if_single_error ==> queue per failing output, keep sending to the other(s)
  • output_skip_error ==> ignore the output error, keep sending to the other(s)
  • output_queue_all ==> current behavior

Or, an option in the pipeline definition, like this:

input { http {} }

output {
  kafka {
    ignore_failure => false
    stop_all_outputs => true
  }
  tcp {
    ignore_failure => true
  }
}

@trexx

trexx commented Feb 3, 2021

Sending to multiple pipelines (basically replicating the data), then acting on those individual pipelines (even if they are doing the same operation, just to a different host) was the only way I was able to get around this issue. One failed host for output kills the whole works unless you break it apart (pipeline-to-pipeline) as outlined above.

There are edge cases I've witnessed which can cause the outputs of multiple pipelines to become blocked.
In particular, if one server were to encounter DNS related errors, that would block the output of all pipelines.

@braddeicide

braddeicide commented Apr 13, 2021

In my case I forward all prod data to a dev cluster; because of simply having multiple outputs in Logstash, the dev cluster going down takes out prod as well.

Regarding Logstash's design to never drop data: that is already violated by "ensure_delivery => false"; its use is just obscured.

For those who want to see the solution in config rather than in this discussion:

#Replace your second output with pipeline{}
output {
  elasticsearch {..foo..}
  pipeline {
    send_to => [descriptivestring]
  }
}

Create a new simple pipeline config, add it to pipelines.yml
input {
  pipeline {
    address => [descriptivestring]
  }
}
output {
  # your original output.
}

You can also set ensure_delivery => false on your pipeline output, but I found that even without it I can have one output fail without the default cascading-failure behavior.

Perhaps this could be implemented internally by having a base output class flag like "nonblocking" which sets up a unique pipeline in the background. The change is repeatable enough to be done computationally.

@Rickvanderwaal

@braddeicide The pipeline-to-pipeline communication solution comes with a risk: when the persisted queue of the second pipeline becomes full, the outputs of the first pipeline become blocked, so your production cluster stops receiving data.

Of course you can increase the persisted queue size of the second pipeline, but disk space is not unlimited.
I would rather drop events when my optional output is down so that my first pipeline keeps processing data.
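
For completeness, these are the per-pipeline queue settings being discussed (values are illustrative):

# pipelines.yml -- give the optional output its own bounded persisted queue
- pipeline.id: optional-out
  path.config: "/etc/logstash/conf.d/optional-out.conf"
  queue.type: persisted
  queue.max_bytes: 4gb   # once this fills, back-pressure reaches the upstream pipeline again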

@ctgrid

ctgrid commented Feb 8, 2023

@wadejensen A custom plugin would allow you to make this decision, should you want it.

Logstash is designed, and intended, to never drop data. The idea of dropping data when an output is slow or unreachable is not something I will be easily convinced of. The impact of our existing design choice is that Logstash goes as fast as the slowest component.

An output "failing" is a very complex subject. Failure is subjective for everyone -- there are so many symptoms which could be classified as a network partition, temporary or otherwise, and you are asking to drop data any time there is any kind of fault. In many ways, an overloaded server is fundamentally indistinguishable from a failed server.

If you are open to data loss during network partitions or other faults, you have a few options for outputs:

  1. use the UDP output, assuming DNS is functioning (?), packets will go out and it's up to the network to lose or deliver them.
  2. use something like rabbitmq or redis which allows you to drop data for anything downstream that's not listening (though a failure manifests differently here, and also requires rabbitmq/redis be online).
  3. Write a custom plugin; this could be to fork our existing plugins and implement your dropping behavior yourself.

My entire sysadmin/operations experience has informed Logstash's never-drop-data design. I am open to discussing other behaviors, but it will take more than saying "it would be better to drop" to convince me. This is not meant as a challenge, but to say that I have considered for many years these concerns and I still resist most requests to drop data during network faults. I am listening to you, though, and I appreciate this feedback and discussion.

We run a real-time fintech platform which performs billions of calculations daily, each of which needs to be auditable. We run one pipeline per instance, and each pipeline has 4 outputs. One of the outputs (TCP) failed and blocked all the other outputs for one of our instances. In our case, the platform must continue running even if ES isn't working and needs attention, but we must also maintain auditability of the system's actions. Our fallback is for Logstash to write the logs to disk, which we later archive away. We generate a few hundred GB of logs on a daily basis.

This had been working well for us until we experienced an outage on our ES server, which in turn resulted in permanent data loss because Logstash decided to block all outputs.

A policy born out of intellectual purism doesn't work when it meets the real world. Please reconsider your position and add an option to allow this fallback behaviour. Network outages happen. Plain log files are far more important than ES in a lot of cases (especially in ours; when push comes to shove, the plain-text log file is the final point of truth when auditing financial data). A blip on the network should not be the cause of data loss.

As for the recommended workarounds, they won't work for us. We must minimise the points of failure in our infrastructure, not increase them to accommodate an obvious flaw in the design of Logstash.

@daimoniac

I would really like to underline @ctgrid's last two paragraphs. The decision not to support options for higher availability and reduced data loss, while probably trivial to implement, is not easily understandable from a real-world perspective.

We are outputting our logs to an ELK cluster and a subset of the logs (the auditing part) to a SIEM which isn't highly available. We cannot accept losing logs just because the SIEM is under maintenance.

We would really favor @andrewvc's solution, which would be great:

I'm thinking the best thing to do here would be to add a new buffering policy to filters and outputs. It would have settings for: blocking (the current behavior) and drop (drop any events, never retry).

@ctgrid

ctgrid commented Feb 5, 2024

Update for posterity: We looked at forking/building a plugin and after some soul searching we decided against it. We opted for a risky and unpopular approach (often for good reason) of writing our own solution using Rust and replacing logstash with it. Given the volume of real-time data we're processing and the importance of not losing said data when an endpoint goes down, this ended up being the right decision for us, and one that was well worth the investment. We were able to achieve an order of magnitude improvement in CPU and memory utilisation over logstash, which was unexpected, and very welcome.
