Logstash Integration with Elasticsearch Data Streams #12178

Closed
acchen97 opened this issue Aug 14, 2020 · 31 comments

acchen97 (Contributor) commented Aug 14, 2020

Overview

This is an overview of the Logstash integration with Elasticsearch data streams. The integration will be added as a feature to the existing Elasticsearch output plugin. This will include new data stream options that will be recommended for indexing any time series datasets (logs, metrics, etc.) into Elasticsearch. The existing options will continue to be used for non-time series use cases. This feature will be available on both the default and OSS Logstash distributions.

Indexing Strategy

The data streams integration will adopt the new indexing strategy under the {type}-{dataset}-{namespace} format, leveraging the composable templates bundled in Elasticsearch starting in 7.9.
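For example, nginx access logs shipped to the production namespace would land in the logs-nginx.access-production data stream.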

The default data stream name will be logs-generic-default. This default enables users to easily correlate data across different data sources (e.g. with logs-* and logs-generic-*) in Elasticsearch. Given the new indexing strategy, the type, dataset, and namespace parts of the data stream name can all be configured separately.

As Logstash will not be fully ECS compliant until 8.0, there are caveats we need to document (or provide bootstrap checks for) to help users avoid ECS conflicts.

  • Update the Beats input, TCP input, UDP input, and grok filter. Users of these plugins should enable ECS compatibility mode to avoid ECS conflicts, as shown below. This work is in progress for the 7.9 / 7.10 timeframe.
  • Users should not introduce any ECS-conflicting fields in their pipelines when using this plugin. This will be handled more systematically in the future when we add ECS validation.
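
A sketch of enabling ECS compatibility mode on the Beats input (the port and the v1 mode value are illustrative; the available modes depend on the installed plugin version):

input {
    beats {
        port => 5044
        # opt in to ECS-compatible field naming to avoid
        # conflicts with the data stream mappings
        ecs_compatibility => "v1"
    }
}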

Example Configuration

Basic default configuration

output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
    }
}

Minimal settings to get started in Logstash 7.x. Events with the data_stream.* fields will automatically get routed to the appropriate data streams. Defaults to logs-generic-default if the fields are missing.
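
For instance, an upstream filter could populate the routing fields so events steer themselves; a minimal sketch using the mutate filter (the field values are illustrative):

filter {
    mutate {
        # these fields drive auto routing in the elasticsearch output
        add_field => {
            "[data_stream][type]"      => "logs"
            "[data_stream][dataset]"   => "nginx.access"
            "[data_stream][namespace]" => "production"
        }
    }
}

An event carrying these fields would be routed to the logs-nginx.access-production data stream.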

Customize data stream name

output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
        data_stream_timestamp => "@timestamp"
        data_stream_type => "metrics"
        data_stream_dataset => "foo"
        data_stream_namespace => "bar"
    }
}
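
With these settings, events are indexed into the metrics-foo-bar data stream (unless auto routing, which is on by default, overrides the name parts with data_stream.* event fields).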

Configuration Settings

These are the net new data-stream-specific settings that will be added to the Elasticsearch output plugin:

  • data_stream (string, optional) - defines whether data will be indexed into an Elasticsearch data stream. The data_stream_* settings will only be used if this setting is enabled. This setting supports the values true, false, and auto. Defaults to false in Logstash 7.x and auto starting in Logstash 8.0. More details on the auto behavior can be found in this issue.
  • data_stream_timestamp (timestamp, required) - the timestamp used for the data stream. Defaults to @timestamp.
  • data_stream_type (string, optional) - the data stream type used to construct the data stream name at index time. Only the values logs and metrics are allowed. This field does not support hyphens (-). Defaults to logs.
  • data_stream_dataset (string, optional) - the data stream dataset used to construct the data stream at index time. This field does not support hyphens (-). Defaults to generic.
  • data_stream_namespace (string, optional) - the data stream namespace used to construct the data stream at index time. This field does not support hyphens (-). Defaults to default.
  • data_stream_auto_routing (boolean, optional) - automatically routes events by deriving the data stream name using specific event fields with the %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace} format. If enabled, the data_stream.* event fields will take precedence over the data_stream_type, data_stream_dataset, and data_stream_namespace settings, but will fall back to them if any of the fields are missing from the event. Defaults to true.
  • data_stream_sync_fields (boolean, optional) - automatically syncs the data_stream.* event fields if they are missing from the event. This ensures the data_stream.* fields match the data stream name that events are indexed to. The field syncing behavior between this setting and the data_stream_auto_routing setting can be found in this issue. Defaults to true. A sketch of how these two settings interact follows this list.
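
As a sketch of that interaction (the dataset value is illustrative): with auto routing disabled, the plugin-configured name parts always win, and field syncing rewrites the event to match.

output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
        # ignore any data_stream.* fields on the event
        data_stream_auto_routing => false
        # always index into logs-myapp-default
        data_stream_dataset => "myapp"
        # rewrite the event's data_stream.* fields to match the name
        data_stream_sync_fields => true
    }
}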

Elastic Agent Compatibility

Logstash often acts as an intermediary for receiving data from other systems like the Elastic Agent and Kafka. For these use cases, Logstash will by default use the data_stream.type, data_stream.dataset, and data_stream.namespace event fields to derive the data stream name. This allows events from the Elastic Agent to automatically be routed to the appropriate Elasticsearch data stream when using Logstash in between. This feature can be disabled by configuring the data_stream_auto_routing setting to false.

Format: %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace}

Events received from the Elastic Agent should generally have all the data_stream.* fields populated. In the case where any of these fields are missing, the data_stream_sync_fields setting will be used to sync these fields prior to indexing.
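
A sketch of this pass-through case, assuming the Beats input is used to receive Elastic Agent traffic (the port is illustrative):

input {
    beats {
        port => 5044
    }
}
output {
    elasticsearch {
        hosts => "hostname"
        data_stream => "true"
        # data_stream_auto_routing defaults to true, so the agent-populated
        # data_stream.* fields select the target data stream per event
    }
}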

Limitations

The primary limitation of data streams is that they do not support updates to existing documents. Logstash users have historically used the existing Elasticsearch output plugin's capabilities to perform document updates and achieve exactly-once delivery semantics.
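
For reference, a sketch of the update-style configuration that data streams cannot replace (the index and field names are illustrative); it relies on existing plugin options and a regular index rather than a data stream:

output {
    elasticsearch {
        hosts => "hostname"
        index => "my-index"
        # a stable document id enables idempotent writes and updates
        document_id => "%{[event][id]}"
        action => "update"
        # create the document if it does not exist yet
        doc_as_upsert => true
    }
}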

Future Considerations

  • The logs-generic-default is the default data stream for generic data from Logstash and the Elastic Agent. If users express feedback that it's difficult to identify Logstash-sourced data in the shared data stream, we could consider adding a from-logstash tag to the tags ECS base field for events coming from Logstash (see the sketch after this list).
  • We want to guide users towards using the new indexing strategy, but if users express the need for more flexibility, we could introduce a free-form option for specifying the data stream name in the future, where template/ILM management would be manual.
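
A sketch of what that tagging could look like in a pipeline (the tag value follows the proposal above):

filter {
    mutate {
        # mark events as originating from Logstash
        add_tag => ["from-logstash"]
    }
}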
ruflin (Member) commented Aug 17, 2020

@acchen97 Two notes on the above issue:

  • It is data_stream.dataset and not data_stream.name
  • You use the config option host. I know this has been used a lot in the past, but I wonder if we could also change the config name to something ECS compatible?

acchen97 (Contributor Author) commented:

@ruflin thanks for your notes. I've reconciled the former in the original issue. For the latter, I think we can stick with hosts as it's accurately descriptive and also consistent with how Beats and Agent configure the ES output. I'm not sure being ECS compatible is as critical here, given this is a configuration setting rather than an event field.

ph (Contributor) commented Sep 3, 2020

@mostlyjason FYI.

enotspe commented Sep 23, 2020

Will there be cloud.id/cloud.auth support?

colinsurprenant (Contributor) commented:

@enotspe I would think so, I don't see why we would differ from the authentication strategies of the current elasticsearch output.

ph (Contributor) commented Nov 4, 2020

@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.

colinsurprenant (Contributor) commented:

@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?

When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.

And the question is more about whether to make this a configurable default behaviour or not, i.e. should we allow the user to disable it for documents that do not come from Agent and do not contain these fields?

colinsurprenant (Contributor) commented:

And as a follow-up question: if the user sets auto_routing => false and the document contains the data_stream.type, data_stream.dataset, and data_stream.namespace fields, should we overwrite those fields with the plugin-configured values?

jen-huang commented:

@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.

Those fields have validation on the agent side too, to ensure safety with ES index name constraints.

Following the discussion in elastic/kibana#75846, we implemented 20/100/100 byte length restrictions for the type, dataset, and namespace strings, respectively.

acchen97 (Contributor Author) commented Nov 4, 2020

@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?

When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.

I'm still a bit hesitant about tackling this in the first version. This would only apply to data that is not sent from Agent, and I'm not sure yet how adding it would impact those use cases. It's also not clear to me if and how these data_stream.* fields will be used in ES queries and downstream UI components. Perhaps we can wait for user feedback before deciding whether to add it. It's typically easier to add features than to remove them later. /cc @jsvd

ruflin (Member) commented Nov 5, 2020

It is important that we add these fields. The new indexing strategy requires them to be present. It is expected that all dashboards / visualizations we build, and hopefully also the ones from the community, will filter on these fields, which will make the queries behind the dashboards much faster. If the fields are not in line with the indexing strategy, things will break apart.

colinsurprenant (Contributor) commented:

As we are closing in on the release of the logstash data streams output plugin

Karrade7 commented:

@acchen97 For a non-agent use case: we have a multi-tenant strategy where each tenant has its own index, such as datalake-tenant1 and datalake-tenant2. We use Logstash to feed data and set the index to the correct tenant. Under the new indexing strategy, can this plugin support this model: logs-tenant-dataset, where tenant = the ECS field organization.id?

colinsurprenant (Contributor) commented:

@Karrade7 @acchen97 Good point. In the current model with auto_routing: true, you could, for example, use mutate filter(s) to set the value of any of the {type}-{dataset}-{namespace} fields.

But we could also provide string interpolation for the type, dataset, and namespace options; that way you could reference the value of any event field. I think this makes sense, as it adds even more flexibility.
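
A sketch of the mutate-based approach for the multi-tenant case (assuming auto routing is enabled and an [organization][id] field exists on the event):

filter {
    mutate {
        # route each tenant to its own data stream,
        # e.g. logs-generic-tenant1
        copy => { "[organization][id]" => "[data_stream][namespace]" }
    }
}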

Karrade7 commented:

@colinsurprenant I think string interpolation and flexibility in general will be important here.
I think even dataset will have issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way but the index to be named differently. A perfect example is non-compliant characters in index names: if dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index to "dl-cylance-{{organization.name}}". This did not work because some organization names contained uppercase and lowercase characters, and those documents would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.
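
As a workaround sketch for such non-compliant characters, values can be normalized before routing (the field name is illustrative):

filter {
    mutate {
        # index names must be lowercase, and the dataset/namespace
        # parts must not contain hyphens
        lowercase => ["[data_stream][namespace]"]
        gsub      => ["[data_stream][namespace]", "-", "_"]
    }
}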

colinsurprenant (Contributor) commented:

@Karrade7 I am not sure I understand your concern correctly; there are two things at play here:

  • The plugin's type, dataset, and namespace options are used primarily to create the index name when auto-routing: false. The index name will always be {type}-{dataset}-{namespace}, unless auto-routing: true, in which case the event fields [data_stream][type], [data_stream][dataset], and [data_stream][namespace] will be used; if any of these fields is missing, the corresponding plugin option will be used instead.

  • If using set_data_stream_fields: true, the event fields [data_stream][type], [data_stream][dataset], and [data_stream][namespace] will always be updated by the plugin to match the values used to create the index name.

Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? It would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.

acchen97 (Contributor Author) commented:

I think even dataset will have issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way but the index to be named differently. A perfect example is non-compliant characters in index names: if dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index to "dl-cylance-{{organization.name}}". This did not work because some organization names contained uppercase and lowercase characters, and those documents would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.

@Karrade7 hyphens are indeed not allowed in the dataset and namespace. @ph are there any restrictions on using uppercase letters in the new indexing strategy?

Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? It would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.

@colinsurprenant I believe the data_stream.* fields will need to match the data stream name. The data_stream.* fields are constant_keyword fields, and my understanding is that their values need to be the same across an entire index.

vbohata commented Dec 10, 2020

Why should "type" be limited to only logs and metrics? We currently use similar naming, but with the following options: logs, metrics, monitors (typically up/down monitors, events from Heartbeat, ...), and data (real application data, not logs).

colinsurprenant (Contributor) commented:

@vbohata this is a good question. Ultimately nothing will really prevent someone from having a "custom" data stream type other than "logs" and "metrics", but in the short term these are the only ones that have bundled ES templates and for which some visualizations will exist. The type option might have restrictions when first released, and we might allow arbitrary types in the future; this is still being evaluated.

cdino commented Dec 12, 2020

I'm a bit confused, is this plugin already in 7.10?

After I saw a presentation from @ruflin about data streams, I started digging into how to integrate metricbeat and filebeat with data streams...
I did some extra processing on the Beat and Logstash sides.

Here is what I'm adding to the Beat config:

processors:
- add_fields:
    fields:
      namespace: default
      type: metrics
    target: data_stream
- copy_fields:
    fail_on_error: false
    fields:
    - from: event.dataset
      to: data_stream.dataset
    ignore_missing: true

And on the Logstash side I'm doing this:

elasticsearch {
  id => "elasticsearch_stream"
  hosts => 'http://tiny-master:9200'
  index => "%{[data_stream][type]}-%{[data_stream][dataset]}-%{[data_stream][namespace]}"
  manage_template => false
  action => "create"
}

In Kibana, to enable all the templates and dashboards, I just add a fake agent and everything gets created.

It seems to work only with metricbeat and filebeat; some fields occasionally raise shard errors:

"failures": [
      {
        "shard": 0,
        "index": ".ds-metrics-system.service-default-000001",
        "node": "wGq5lQ7BSKOYPV9_zkPQ3Q",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Field [system.service.state_since] of type [keyword] does not support custom formats"
        }
      }
    ]

Is this maybe because I'm not setting all the fields correctly?

With auditbeat, data streams do not seem to work.

ruflin (Member) commented Dec 14, 2020

@cdino Nice work! You don't need to create a fake Agent: if you go to the Settings of an integration, there is an install button. One thing to keep in mind: there is a chance that we make some breaking changes to the packages compared to the modules. Your error might be related to this; it seems like the format of system.service.state_since might be different. Sounds like this should be a date field?

@cdino What you discovered is that if you know what you are doing, you don't need the new plugin ;-) 👏

Curious to hear what errors you got on the auditbeat side.

cdino commented Dec 14, 2020

@ruflin Thanks! Yes, I will avoid using it in production for now :) but I really like this approach; it will help us a lot in the future.
I will give auditbeat another try, but it seems that there is no data stream mapping for data_stream.[dataset]-related events.

sc7565 commented Dec 30, 2020

Is the plugin released or not? I did not find a repo or any details. I wanted to check the roadmap for it.

acchen97 (Contributor Author) commented:

@sc7565 this feature has not been released yet. It is on the near-term roadmap.

@kares kares self-assigned this Jan 14, 2021
ph (Contributor) commented Jan 18, 2021

@jsvd @kares Small question concerning the implementation: I've double-checked this issue and it's not clear to me. Will you do some check to verify whether data streams are available on the remote cluster?

kares (Contributor) commented Jan 19, 2021

I was thinking about this yesterday: what happens if there's a data_stream => true configuration and the endpoint (e.g. logs-system-default) we end up writing to is NOT a data stream? So far I am not sure there's a need to do an explicit check (and I haven't checked whether there's a reliable ES API to do so), given the event will include the data_stream.* fields. But this might change.

Will you do some check to verify if data stream is available or not on the remote cluster?

We're certainly planning a version check for ES >= 7.9 to see if data streams are available.

jsvd (Member) commented Jan 20, 2021

Discussions are ongoing with the ES team on what kind of primitives exist (such as "require_alias=true") or could be created so that data producers can ensure that they're writing to data streams and not wrongly creating indices where aliases should be.

We could create a cache of "already seen index names", but this will never be truly accurate: we could have a cache miss, confirm an alias exists, and then someone could delete the alias and template afterwards, causing Logstash to create an index instead of writing to an alias. And of course, checking per document without a cache is not performant.

kares (Contributor) commented Mar 23, 2021

A technical semi-blocker for data-stream support (for data streams, the plugin needs to check the ES version):
logstash-plugins/logstash-output-elasticsearch#1001

kares added a commit to logstash-plugins/logstash-output-elasticsearch that referenced this issue Apr 12, 2021
(initial) specification elastic/logstash#12178

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Karen Metts <[email protected]>
kares (Contributor) commented May 10, 2021

LS 7.13.0 is on track to ship data-stream support via its logstash-output-elasticsearch plugin.

@kares kares added the v7.13.0 label May 10, 2021
kares (Contributor) commented May 26, 2021

As 7.13.0 is out, we're fine to close this issue.
There's still follow-up work (data_stream_timestamp support), but most of the work outlined here has been 🚢

olegtrautvein commented:

As a matter of fact, Logstash 7.10 already writes to data streams when using an index template with a data_stream {} section defined. What are the negative aspects of using this approach? Thanks
