Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add use_metadata option for index, id, and pipeline #1041

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pemontto
Copy link

@pemontto pemontto commented Sep 20, 2021

Currently it's not possible to have a single elasticsearch output that can set a document_id and also use an autogenerated ID. In our case we have a large number of clusters and require at a minimum two outputs per cluster because of this, the increasing number of outputs has a noticeable performance impact.

This PR addresses that by implementing functionality similar to Beats where the index, id, and pipeline can be defined by values in the @metadata object:

{
  "message": "foo",
  "@metadata": {
    "_index": "myindex",
    "_id": "myid-1",
    "pipeline": "mypipeline"
  }
}

More than half the battle while developing locally was testing the plugin. There's very little definitive information on expected environment, the official documentation doesn't really do a good job here. This community presentation was useful, but ultimately didn't help testing. I just ended up trying to replicate the CI pipeline. As such the tests I've added are probably sub-optimal.

Happy to hear any feedback, and if necessary I can submit a PR just focussing on making the document_id behaviour similar to pipeline where empty values get dropped.

@pemontto
Copy link
Author

I should also mention the document_id issue could be resolved in a different way by this previous pull request from 2016... #464

@pemontto
Copy link
Author

I should also mention the document_id issue could be resolved in a different way by this previous pull request from 2016... #464

However that would only work with dynamic values (sprintf) if this was also implemented Default value for sprintf interpolation (Feature)... also outstanding since 2016

Copy link
Contributor

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIrst: thank you so much for taking the time to put this together, and for including helpful docs and specs to demonstrate the behavior. I can certainly see the value of using this feature while "reindexing" existing Elasticsearch data.

A couple things to note:

  1. In the upcoming Logstash 8, pipeline.ecs_compatibility will be on-by-default, so we need to consider where the metadata will be if we are consuming from an Elasticsearch input that is running in an ECS compatibility mode:

    When ECS compatibility is disabled, docinfo_target uses the "@metadata" field as a default, with ECS enabled the plugin uses a naming convention "[@metadata][input][elasticsearch]" as a default target for placing document information.
    -- Elasticsearch input plugin: Compatibility with the Elastic Common Schema (ECS)

    This may be as simple as selecting our own docinfo target when the plugin is initialized based on its effective ECS compatibility mode, or we may need to add a config option (or overload the proposed use_metadata option) to allow a user to configure where on the event they expect the docinfo to be.

  2. I've also left a note inline questioning the utility of resolving sprintf format strings. I think this is a risky feature, since it can cause a poorly-formed event to override the plugin's explicit configuration in a way that could cause very confusing behavior.

I will take the time to come back to this in a week or so after considering these things.

* Value type is <<boolean,boolean>>
* Default value is `false`

Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id_`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like _id has a trailing underscore:

Suggested change
Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id_`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.
Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.

I think we should use the Logstash field reference notation for these, instead of the Elasticsearch dot-notation:

Suggested change
Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`@metadata._index`), <<plugins-{type}s-{plugin}-document_id>> (`@metadata._id_`), and <<plugins-{type}s-{plugin}-pipeline>> (`@metadata.pipeline`) can be set by their respective `@metadata` fields.
Use and preference output parameters defined in the document metadata. The <<plugins-{type}s-{plugin}-index>> (`[@metadata][index]`), <<plugins-{type}s-{plugin}-document_id>> (`[@metadata][_id]`), and <<plugins-{type}s-{plugin}-pipeline>> (`[@metadata][pipeline]`) can be set by their respective `@metadata` fields.

Comment on lines +436 to +439
event_index = event.get("[@metadata][_index]")
params[:_index] = event.sprintf(event_index) if event_index && !event_index.empty?
event_pipeline = event.get("[@metadata][pipeline]")
params[:pipeline] = event.sprintf(event_pipeline) if event_pipeline && !event_pipeline.empty?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specs don't have any coverage for sprintf-ing the value in [@metadata][_index], and I personally see this as a liability since Event#sprintf will does not support default values -- when a en event does include an [@metadata][_event] with a value like logstash-%{missing} and does not have a value for that referenced field, we end up setting the event's index to the literal logstash-%{missing} which can be problematic. I would rather not support sprintf's and have a more constrained surface-area:

Suggested change
event_index = event.get("[@metadata][_index]")
params[:_index] = event.sprintf(event_index) if event_index && !event_index.empty?
event_pipeline = event.get("[@metadata][pipeline]")
params[:pipeline] = event.sprintf(event_pipeline) if event_pipeline && !event_pipeline.empty?
event_index = event.get("[@metadata][_index]")
params[:_index] = event_index if event_index && !event_index.empty?
event_pipeline = event.get("[@metadata][pipeline]")
params[:pipeline] = event_pipeline if event_pipeline && !event_pipeline.empty?

@@ -251,6 +251,9 @@ class LogStash::Outputs::ElasticSearch < LogStash::Outputs::Base
# ILM policy to use, if undefined the default policy will be used.
config :ilm_policy, :validate => :string, :default => DEFAULT_POLICY

# ILM policy to use, if undefined the default policy will be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✂️ the comment does not make sense - seems to have been copy pasted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants