Logstash Integration with Elasticsearch Data Streams #12178

Closed
acchen97 opened this issue Aug 14, 2020 · 31 comments

acchen97 (Contributor) commented Aug 14, 2020

Overview

This is an overview of the Logstash integration with Elasticsearch data streams. The integration will be added as a feature to the existing Elasticsearch output plugin. This will include new data stream options that will be recommended for indexing any time series datasets (logs, metrics, etc.) into Elasticsearch. The existing options will continue to be used for non-time series use cases. This feature will be available on both the default and OSS Logstash distributions.

Indexing Strategy

The data streams integration will adopt the new indexing strategy under the {type}-{dataset}-{namespace} format, leveraging the composable templates bundled in Elasticsearch starting in 7.9.
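For example, nginx access logs shipped to the production namespace would land in the logs-nginx.access-production data stream.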

The default data stream name will be logs-generic-default. This default enables users to easily correlate data across different data sources (e.g. with logs-* and logs-generic-*) in Elasticsearch. Given the new indexing strategy, the type, dataset, and namespace parts of the data stream name can all be configured separately.

As Logstash will not be fully ECS compliant until 8.0, there are caveats we need to document (or provide bootstrap checks for) to help users avoid ECS conflicts.

  • Update the Beats input, TCP input, UDP input, and grok filter. Users of these plugins should enable ECS compatibility mode to avoid ECS conflicts, as shown below. This work is in progress for the 7.9 / 7.10 timeframe.
  • Users should not introduce any ECS-conflicting fields in their pipelines when using this plugin. This will be handled more systematically in the future when we add ECS validation.
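
A sketch of enabling ECS compatibility mode on the Beats input (the port and the v1 mode value are illustrative; the available modes depend on the installed plugin version):

input {
    beats {
        port => 5044
        # opt in to ECS-compatible field naming to avoid
        # conflicts with the data stream mappings
        ecs_compatibility => "v1"
    }
}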

Example Configuration

Basic default configuration

output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
    }
}

Minimal settings to get started in Logstash 7.x. Events with the data_stream.* fields will automatically get routed to the appropriate data streams. Defaults to logs-generic-default if the fields are missing.
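
For instance, an upstream filter could populate the routing fields so events steer themselves; a minimal sketch using the mutate filter (the field values are illustrative):

filter {
    mutate {
        # these fields drive auto routing in the elasticsearch output
        add_field => {
            "[data_stream][type]"      => "logs"
            "[data_stream][dataset]"   => "nginx.access"
            "[data_stream][namespace]" => "production"
        }
    }
}

An event carrying these fields would be routed to the logs-nginx.access-production data stream.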

Customize data stream name

output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
        data_stream_timestamp => "@timestamp"
        data_stream_type => "metrics"
        data_stream_dataset => "foo"
        data_stream_namespace => "bar"
    }
}
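
With these settings, events are indexed into the metrics-foo-bar data stream (unless auto routing, which is on by default, overrides the name parts with data_stream.* event fields).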

Configuration Settings

These are the net new data-stream-specific settings that will be added to the Elasticsearch output plugin:

  • data_stream (string, optional) - defines whether data will be indexed into an Elasticsearch data stream. The data_stream_* settings will only be used if this setting is enabled. This setting supports the values true, false, and auto. Defaults to false in Logstash 7.x and auto starting in Logstash 8.0. More details on the auto behavior can be found in this issue.
  • data_stream_timestamp (timestamp, required) - the timestamp used for the data stream. Defaults to @timestamp.
  • data_stream_type (string, optional) - the data stream type used to construct the data stream name at index time. Only the values logs and metrics are allowed. This field does not support hyphens (-). Defaults to logs.
  • data_stream_dataset (string, optional) - the data stream dataset used to construct the data stream at index time. This field does not support hyphens (-). Defaults to generic.
  • data_stream_namespace (string, optional) - the data stream namespace used to construct the data stream at index time. This field does not support hyphens (-). Defaults to default.
  • data_stream_auto_routing (boolean, optional) - automatically routes events by deriving the data stream name using specific event fields with the %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace} format. If enabled, the data_stream.* event fields will take precedence over the data_stream_type, data_stream_dataset, and data_stream_namespace settings, but will fall back to them if any of the fields are missing from the event. Defaults to true.
  • data_stream_sync_fields (boolean, optional) - automatically syncs the data_stream.* event fields if they are missing from the event. This ensures the data_stream.* fields match the data stream name that events are indexed to. The field syncing behavior between this setting and the data_stream_auto_routing setting can be found in this issue. Defaults to true. A sketch of how these two settings interact follows this list.
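
As a sketch of that interaction (the dataset value is illustrative): with auto routing disabled, the plugin-configured name parts always win, and field syncing rewrites the event to match.

output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
        # ignore any data_stream.* fields on the event
        data_stream_auto_routing => false
        # always index into logs-myapp-default
        data_stream_dataset => "myapp"
        # rewrite the event's data_stream.* fields to match the name
        data_stream_sync_fields => true
    }
}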

Elastic Agent Compatibility

Logstash often acts as an intermediary for receiving data from other systems like the Elastic Agent and Kafka. For these use cases, Logstash will by default use the data_stream.type, data_stream.dataset, and data_stream.namespace event fields to derive the data stream name. This allows events from the Elastic Agent to automatically be routed to the appropriate Elasticsearch data stream when using Logstash in between. This feature can be disabled by configuring the data_stream_auto_routing setting to false.

Format: %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace}

Events received from the Elastic Agent should generally have all the data_stream.* fields populated. In the case where any of these fields are missing, the data_stream_sync_fields setting will be used to sync these fields prior to indexing.
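
A sketch of this pass-through case, assuming the Beats input is used to receive Elastic Agent traffic (the port is illustrative):

input {
    beats {
        port => 5044
    }
}
output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
        # data_stream_auto_routing defaults to true, so the agent-populated
        # data_stream.* fields select the target data stream per event
    }
}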

Limitations

The primary limitation of data streams is that they do not support updates to existing documents. Logstash users have historically used the existing Elasticsearch output plugin's capabilities to perform document updates and achieve exactly-once delivery semantics.
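
For reference, a sketch of the update-style configuration that data streams cannot replace (the index and field names are illustrative); it relies on existing plugin options and a regular index rather than a data stream:

output {
    elasticsearch {
        hosts => "hostname"
        index => "my-index"
        # a stable document id enables idempotent writes and updates
        document_id => "%{[event][id]}"
        action => "update"
        # create the document if it does not exist yet
        doc_as_upsert => true
    }
}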

Future Considerations

  • The logs-generic-default is the default data stream for generic data from Logstash and the Elastic Agent. If users express feedback that it's difficult to identify Logstash-sourced data in the shared data stream, we could consider adding a from-logstash tag to the tags ECS base field for events coming from Logstash (see the sketch after this list).
  • We want to guide users towards using the new indexing strategy, but if users express the need for more flexibility, we could introduce a free-form option for specifying the data stream name in the future, where template/ILM management would be manual.
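
A sketch of what that tagging could look like in a pipeline (the tag value follows the proposal above):

filter {
    mutate {
        # mark events as originating from Logstash
        add_tag => ["from-logstash"]
    }
}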
ruflin (Member) commented Aug 17, 2020

@acchen97 Two notes on the above issue:

  • It is data_stream.dataset and not data_stream.name
  • You use the config option host. I know this has been used a lot in the past, but I wonder if we could also change the config name to something ECS compatible?

acchen97 (Contributor Author) commented:

@ruflin thanks for your notes. I've reconciled the former in the original issue. For the latter, I think we can stick with hosts as it's accurately descriptive and also consistent with how Beats and Agent configure the ES output. I'm not sure being ECS compatible is as critical here, given this is a configuration setting rather than an event field.

ph (Contributor) commented Sep 3, 2020

@mostlyjason FYI.

enotspe commented Sep 23, 2020

Will there be cloud.id/cloud.auth support?

colinsurprenant (Contributor) commented:

@enotspe I would think so, I don't see why we would differ from the authentication strategies of the current elasticsearch output.

ph (Contributor) commented Nov 4, 2020

@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.

colinsurprenant (Contributor) commented:

@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?

When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.

And the question is more about whether to make this a configurable default behaviour or not, i.e. should we allow the user to disable it for documents that do not come from Agent and do not contain these fields?

colinsurprenant (Contributor) commented:

And as a follow-up question: if the user sets auto_routing => false and the document contains the data_stream.type, data_stream.dataset, and data_stream.namespace fields, should we overwrite those fields with the plugin-configured values?

jen-huang commented:

@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.

Those fields have validation on the agent side too, to ensure safety with ES index name constraints.

Following the discussion in elastic/kibana#75846, we implemented 20/100/100 byte length restrictions for the type, dataset, and namespace strings, respectively.

acchen97 (Contributor Author) commented Nov 4, 2020

@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?

When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.

I'm still a bit hesitant about tackling this in the first version. This would only apply to data that is not sent from Agent, and I'm not sure yet how adding it would impact those use cases. It's also not clear to me if and how these data_stream.* fields will be used in ES queries and downstream UI components. Perhaps we can wait for user feedback before deciding whether to add it. It's typically easier to add features than to remove them later. /cc @jsvd

ruflin (Member) commented Nov 5, 2020

It is important that we add these fields. The new indexing strategy requires them to be present. It is expected that all dashboards / visualizations we build, and hopefully also the ones from the community, will filter on these fields, which will make the queries behind the dashboards much faster. If the fields are not in line with the indexing strategy, things will break apart.

colinsurprenant (Contributor) commented:

As we are closing in on the release of the logstash data streams output plugin

Karrade7 commented:

@acchen97 For a non-agent use case: we have a multi-tenant strategy where each tenant has its own index, such as datalake-tenant1 and datalake-tenant2. We use Logstash to feed data and set the index to the correct tenant. Under the new indexing strategy, can this plugin support this model: logs-tenant-dataset, where tenant = the ECS field organization.id?

colinsurprenant (Contributor) commented:

@Karrade7 @acchen97 Good point. In the current model with auto_routing: true, you could, for example, use mutate filter(s) to set the value of any of the {type}-{dataset}-{namespace} fields.

But we could also provide string interpolation for the type, dataset, and namespace options; that way you could reference the value of any event field. I think this makes sense, as it adds even more flexibility.
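
A sketch of the mutate-based approach for the multi-tenant case (assuming auto routing is enabled and an [organization][id] field exists on the event):

filter {
    mutate {
        # route each tenant to its own data stream,
        # e.g. logs-generic-tenant1
        copy => { "[organization][id]" => "[data_stream][namespace]" }
    }
}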

Karrade7 commented:

@colinsurprenant I think string interpolation and flexibility in general will be important here.
I think even dataset will have issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way but the index to be named differently. A perfect example is non-compliant characters in index names: if dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index to "dl-cylance-{{organization.name}}". This did not work because some organization names contained uppercase and lowercase characters, and those documents would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.
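
As a workaround sketch for such non-compliant characters, values can be normalized before routing (the field name is illustrative):

filter {
    mutate {
        # index names must be lowercase, and the dataset/namespace
        # parts must not contain hyphens
        lowercase => ["[data_stream][namespace]"]
        gsub      => ["[data_stream][namespace]", "-", "_"]
    }
}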

colinsurprenant (Contributor) commented:

@Karrade7 I am not sure I understand your concern correctly; there are two things at play here:

  • The plugin's type, dataset, and namespace options are used primarily to create the index name when auto-routing: false. The index name will always be {type}-{dataset}-{namespace}, unless auto-routing: true, in which case the event fields [data_stream][type], [data_stream][dataset], and [data_stream][namespace] will be used; if any of these fields is missing, the corresponding plugin option will be used instead.

  • If using set_data_stream_fields: true, the event fields [data_stream][type], [data_stream][dataset], and [data_stream][namespace] will always be updated by the plugin to match the values used to create the index name.

Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? It would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.

acchen97 (Contributor Author) commented:

I think even dataset will have issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way but the index to be named differently. A perfect example is non-compliant characters in index names: if dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index to "dl-cylance-{{organization.name}}". This did not work because some organization names contained uppercase and lowercase characters, and those documents would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.

@Karrade7 hyphens are indeed not allowed in the dataset and namespace. @ph are there any restrictions on using uppercase letters in the new indexing strategy?

Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? It would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.

@colinsurprenant I believe the data_stream.* fields will need to match the data stream name. The data_stream.* fields are constant_keyword fields, and my understanding is that their values need to be the same across an entire index.

vbohata commented Dec 10, 2020

Why should "type" be limited to only logs and metrics? We currently use similar naming, but with the following options: logs, metrics, monitors (typically up/down monitors, events from Heartbeat, ...), and data (real application data, not logs).

colinsurprenant (Contributor) commented:

@vbohata this is a good question. Ultimately nothing will really prevent someone from having a "custom" data stream type other than "logs" and "metrics", but in the short term these are the only ones that have bundled ES templates and for which some visualizations will exist. The type option might have restrictions when first released, and we might allow arbitrary types in the future; this is still being evaluated.

cdino commented Dec 12, 2020

I'm a bit confused, is this plugin already in 7.10?

After I saw a presentation from @ruflin about data streams, I started digging into how to integrate metricbeat and filebeat with data streams...
I did some extra processing on the Beat and Logstash sides.

Here is what I'm adding to the Beat config:

processors:
- add_fields:
    fields:
      namespace: default
      type: metrics
    target: data_stream
- copy_fields:
    fail_on_error: false
    fields:
    - from: event.dataset
      to: data_stream.dataset
    ignore_missing: true

And on the Logstash side I'm doing this:

elasticsearch {
  id => "elasticsearch_stream"
  hosts => 'http://tiny-master:9200'
  index => "%{[data_stream][type]}-%{[data_stream][dataset]}-%{[data_stream][namespace]}"
  manage_template => false
  action => "create"
}

In Kibana, to enable all the templates and dashboards, I just add a fake agent and everything gets created.

It seems to work only with metricbeat and filebeat; some fields occasionally raise shard errors:

"failures": [
      {
        "shard": 0,
        "index": ".ds-metrics-system.service-default-000001",
        "node": "wGq5lQ7BSKOYPV9_zkPQ3Q",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Field [system.service.state_since] of type [keyword] does not support custom formats"
        }
      }
    ]

Is this maybe because I'm not setting all the fields correctly?

With auditbeat, data streams do not seem to work.

ruflin (Member) commented Dec 14, 2020

@cdino Nice work! You don't need to create a fake Agent: if you go to the Settings of an integration, there is an install button. One thing to keep in mind: there is a chance that we make some breaking changes to the packages compared to the modules. Your error might be related to this; it seems like the format of system.service.state_since might be different. Sounds like this should be a date field?

@cdino What you discovered is that if you know what you are doing, you don't need the new plugin ;-) 👏

Curious to hear what errors you got on the auditbeat side.

cdino commented Dec 14, 2020

@ruflin Thanks! Yes, I will avoid using it in production for now :) but I really like this approach; it will help us a lot in the future.
I will give auditbeat another try, but it seems that there is no data stream mapping for data_stream.[dataset]-related events.

sc7565 commented Dec 30, 2020

Is the plugin released or not? I did not find a repo or any details. I wanted to check the roadmap for it.

acchen97 (Contributor Author) commented:

@sc7565 this feature has not been released yet. It is on the near-term roadmap.

@kares kares self-assigned this Jan 14, 2021
ph (Contributor) commented Jan 18, 2021

@jsvd @kares Small question concerning the implementation: I've double-checked this issue and it's not clear to me. Will you do some check to verify whether data streams are available on the remote cluster?

kares (Contributor) commented Jan 19, 2021

I was thinking about this yesterday: what happens if there's a data_stream => true configuration and the endpoint (e.g. logs-system-default) we end up writing to is NOT a data stream? So far I am not sure there's a need to do an explicit check (and I haven't checked whether there's a reliable ES API to do so), given the event will include the data_stream.* fields. But this might change.

Will you do some check to verify if data stream is available or not on the remote cluster?

We're certainly planning a version check for ES >= 7.9 to see if data streams are available.

jsvd (Member) commented Jan 20, 2021

Discussions are ongoing with the ES team on what kind of primitives exist (such as "require_alias=true") or could be created so that data producers can ensure that they're writing to data streams and not wrongly creating indices where aliases should be.

We could create a cache of "already seen index names", but this will never be truly accurate: we could have a cache miss, confirm an alias exists, and then someone could delete the alias and template afterwards, causing Logstash to create an index instead of writing to an alias. And of course, checking per document without a cache is not performant.

kares (Contributor) commented Mar 23, 2021

A technical semi-blocker for data-stream support (for data streams, the plugin needs to check the ES version):
logstash-plugins/logstash-output-elasticsearch#1001

kares added a commit to logstash-plugins/logstash-output-elasticsearch that referenced this issue Apr 12, 2021
(initial) specification elastic/logstash#12178

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Karen Metts <[email protected]>
kares (Contributor) commented May 10, 2021

LS 7.13.0 is on track to ship data-stream support via its logstash-output-elasticsearch plugin.

@kares kares added the v7.13.0 label May 10, 2021
kares (Contributor) commented May 26, 2021

As 7.13.0 is out, we're fine to close this issue.
There's still follow-up work (data_stream_timestamp support), but most of the work outlined here has been 🚢

olegtrautvein commented:

As a matter of fact, Logstash 7.10 already writes to data streams when using an index template with a data_stream {} section defined. What are the negative aspects of using this approach? Thanks
