✨ Add use_metadata option for index, id, and pipeline #1041

pemontto · 2021-09-20T15:55:15Z

Currently it's not possible to have a single elasticsearch output that can set a document_id and also use an autogenerated ID. In our case we have a large number of clusters and require at a minimum two outputs per cluster because of this, the increasing number of outputs has a noticeable performance impact.

This PR addresses that by implementing functionality similar to Beats where the index, id, and pipeline can be defined by values in the @metadata object:

{
  "message": "foo",
  "@metadata": {
    "_index": "myindex",
    "_id": "myid-1",
    "pipeline": "mypipeline"
  }
}

More than half the battle while developing locally was testing the plugin. There's very little definitive information on expected environment, the official documentation doesn't really do a good job here. This community presentation was useful, but ultimately didn't help testing. I just ended up trying to replicate the CI pipeline. As such the tests I've added are probably sub-optimal.

Happy to hear any feedback, and if necessary I can submit a PR just focussing on making the document_id behaviour similar to pipeline where empty values get dropped.

pemontto · 2021-09-20T18:20:28Z

I should also mention the document_id issue could be resolved in a different way by this previous pull request from 2016... #464

pemontto · 2021-09-21T06:43:23Z

I should also mention the document_id issue could be resolved in a different way by this previous pull request from 2016... #464

However that would only work with dynamic values (sprintf) if this was also implemented Default value for sprintf interpolation (Feature)... also outstanding since 2016

yaauie

FIrst: thank you so much for taking the time to put this together, and for including helpful docs and specs to demonstrate the behavior. I can certainly see the value of using this feature while "reindexing" existing Elasticsearch data.

A couple things to note:

In the upcoming Logstash 8, pipeline.ecs_compatibility will be on-by-default, so we need to consider where the metadata will be if we are consuming from an Elasticsearch input that is running in an ECS compatibility mode:

When ECS compatibility is disabled, docinfo_target uses the "@metadata" field as a default, with ECS enabled the plugin uses a naming convention "[@metadata][input][elasticsearch]" as a default target for placing document information.
-- Elasticsearch input plugin: Compatibility with the Elastic Common Schema (ECS)

This may be as simple as selecting our own docinfo target when the plugin is initialized based on its effective ECS compatibility mode, or we may need to add a config option (or overload the proposed use_metadata option) to allow a user to configure where on the event they expect the docinfo to be.
I've also left a note inline questioning the utility of resolving sprintf format strings. I think this is a risky feature, since it can cause a poorly-formed event to override the plugin's explicit configuration in a way that could cause very confusing behavior.

I will take the time to come back to this in a week or so after considering these things.

yaauie · 2022-01-13T17:23:29Z

docs/index.asciidoc

+  * Value type is <<boolean,boolean>>
+  * Default value is `false`
+
+Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id_`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.


It looks like _id has a trailing underscore:

Suggested change

Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id_`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.

Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.

I think we should use the Logstash field reference notation for these, instead of the Elasticsearch dot-notation:

Suggested change

Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id_`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.

Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`[@metadata][index]`), <<plugins-{type}s-{plugin}-document_id>> (`[@metadata][_id]`), and <<plugins-{type}s-{plugin}-pipeline>> (`[@metadata][pipeline]`) can be set by their respective `@metadata` fields.

yaauie · 2022-01-13T17:32:39Z

lib/logstash/outputs/elasticsearch.rb

+      event_index = event.get("[@metadata][_index]")
+      params[:_index] = event.sprintf(event_index) if event_index && !event_index.empty?
+      event_pipeline = event.get("[@metadata][pipeline]")
+      params[:pipeline] = event.sprintf(event_pipeline) if event_pipeline && !event_pipeline.empty?


The specs don't have any coverage for sprintf-ing the value in [@metadata][_index], and I personally see this as a liability since Event#sprintf will does not support default values -- when a en event does include an [@metadata][_event] with a value like logstash-%{missing} and does not have a value for that referenced field, we end up setting the event's index to the literal logstash-%{missing} which can be problematic. I would rather not support sprintf's and have a more constrained surface-area:

Suggested change

event_index = event.get("[@metadata][_index]")

params[:_index] = event.sprintf(event_index) if event_index && !event_index.empty?

event_pipeline = event.get("[@metadata][pipeline]")

params[:pipeline] = event.sprintf(event_pipeline) if event_pipeline && !event_pipeline.empty?

event_index = event.get("[@metadata][_index]")

params[:_index] = event_index if event_index && !event_index.empty?

event_pipeline = event.get("[@metadata][pipeline]")

params[:pipeline] = event_pipeline if event_pipeline && !event_pipeline.empty?

kares · 2022-01-18T10:02:31Z

lib/logstash/outputs/elasticsearch.rb

@@ -251,6 +251,9 @@ class LogStash::Outputs::ElasticSearch < LogStash::Outputs::Base
  # ILM policy to use, if undefined the default policy will be used.
  config :ilm_policy, :validate => :string, :default => DEFAULT_POLICY

+  # ILM policy to use, if undefined the default policy will be used.


✂️ the comment does not make sense - seems to have been copy pasted

roaksoax added the status:needs-triage label Sep 20, 2021

pemontto force-pushed the feature-metadata branch from e82b2aa to 03d12d0 Compare September 20, 2021 17:59

pemontto force-pushed the feature-metadata branch from 03d12d0 to a052928 Compare September 21, 2021 13:41

✨ Add use_metadata option

e6278b8

pemontto force-pushed the feature-metadata branch from a052928 to e6278b8 Compare September 21, 2021 14:17

yaauie reviewed Jan 13, 2022

View reviewed changes

kares reviewed Jan 18, 2022

View reviewed changes

kares self-assigned this Jan 19, 2022

roaksoax added status:work-in-progress status:needs-triage and removed status:needs-triage status:work-in-progress labels Jan 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

✨ Add use_metadata option for index, id, and pipeline #1041

✨ Add use_metadata option for index, id, and pipeline #1041

pemontto commented Sep 20, 2021 •

edited

Loading

pemontto commented Sep 20, 2021

pemontto commented Sep 21, 2021

yaauie left a comment

yaauie Jan 13, 2022

yaauie Jan 13, 2022

kares Jan 18, 2022

	Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id_`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.
	Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.

✨ Add use_metadata option for index, id, and pipeline #1041

Are you sure you want to change the base?

✨ Add use_metadata option for index, id, and pipeline #1041

Conversation

pemontto commented Sep 20, 2021 • edited Loading

pemontto commented Sep 20, 2021

pemontto commented Sep 21, 2021

yaauie left a comment

Choose a reason for hiding this comment

yaauie Jan 13, 2022

Choose a reason for hiding this comment

yaauie Jan 13, 2022

Choose a reason for hiding this comment

kares Jan 18, 2022

Choose a reason for hiding this comment

pemontto commented Sep 20, 2021 •

edited

Loading