[Doc] Logstash data streams integration #966

Closed
ppf2 opened this issue Sep 1, 2020 · 6 comments
@ppf2
Member

ppf2 commented Sep 1, 2020

Data streams, a convenient, scalable way to ingest, search, and manage continuously generated time series data, were released in Elasticsearch 7.9.

While this feature is currently available in the default distribution of Elasticsearch, Logstash has not yet adopted it in its time-series indexing implementation.

The following walks you through implementing data streams integration with Logstash.

This recipe lets you work around the well-known limitation of using dynamic variables with ILM+rollover in Logstash until tighter out-of-the-box integration between Logstash and data streams is available.

Disclaimer: Keep in mind that Elasticsearch data streams only support the create action today. If a document with the specified _id already exists, the indexing operation will fail (by design).
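As an illustration of the create-only behavior (assuming a data stream named my-data-stream-app1 already exists; the stream name and document body here are made up):

```
# Rejected: data streams do not accept plain index requests with an explicit _id
PUT my-data-stream-app1/_doc/1
{ "@timestamp": "2020-09-01T00:00:00Z", "message": "hello" }

# Accepted: the _create endpoint uses op_type create, which data streams require
PUT my-data-stream-app1/_create/1
{ "@timestamp": "2020-09-01T00:00:00Z", "message": "hello" }
```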

Step 1: Create the desired ILM policy in Elasticsearch (you can use either the API or Kibana UI):

# This is an arbitrary ILM policy that performs a rollover and delete
PUT _ilm/policy/my-30g-30d-ilm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "30G"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
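After creating the policy, you can confirm it was stored (the policy name matches the example above):

```
# Returns the stored policy definition, including its phases and actions
GET _ilm/policy/my-30g-30d-ilm-policy
```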

Step 2: Create an index template using v2 templates (you can use either the API or Kibana UI). "v2 templates" refer to the new _index_template implementation in Elasticsearch.

# You can customize other index settings/mappings as part of the index template
PUT /_index_template/my-data-stream-template
{
  "index_patterns": [ "my-data-stream*" ],
  "data_stream": { },
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "my-30g-30d-ilm-policy"
    }
  }
}
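Once the template is in place, the first document indexed into a matching name auto-creates the data stream and its initial backing index; you can then inspect it with the data stream API (the stream name below is illustrative):

```
# First write auto-creates the data stream via the matching index template
POST my-data-stream-test/_doc/
{ "@timestamp": "2020-09-01T00:00:00Z", "message": "hello" }

# Shows the data stream, its generation, and its backing indices
GET _data_stream/my-data-stream-test
```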

(Optional) You can also create multiple index templates for each "type" of index/app if desired, e.g.,

PUT /_index_template/my-data-stream-app1-template
{
  "index_patterns": [ "my-data-stream-app1*" ],
  "data_stream": { },
  "priority": 200,
  "template": {
    "settings": {
      "index.number_of_shards": 1,
      "index.refresh_interval": "30s",
      "index.lifecycle.name": "my-30g-30d-ilm-policy"
    }
  }
}

PUT /_index_template/my-data-stream-app2-template
{
  "index_patterns": [ "my-data-stream-app2*" ],
  "data_stream": { },
  "priority": 200,
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.refresh_interval": "15s",
      "index.lifecycle.name": "my-30g-15d-ilm-policy"
    }
  }
}

(Optional) If you are running a hot-warm architecture, make sure to include the index.routing.allocation.require setting in the index templates so that new data stream indices are placed on the hot tier by default. The following is an example for the hot-warm deployment template on Elastic Cloud.

# Elastic Cloud uses the node attribute "data" (by default) to define 
# the tiers in a hot-warm deployment template. When "data" is set to "hot", 
# it will allocate all `my-data-stream*` indices only to the hot tier in the deployment.
# If you are not running on Elastic Cloud, your node attribute/attribute value 
# will likely be different. 

# DO NOT simply copy and paste the example below without customization
# UNLESS you know the `data:hot` node attribute is properly set up in your environment :)
PUT /_index_template/my-data-stream-template
{
  "index_patterns": [ "my-data-stream*" ],
  "data_stream": { },
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "my-30g-30d-ilm-policy",
      "index" : {
        "routing" : {
          "allocation" : {
            "require" : {
              "data" : "hot"  
            }
          }
        }
      }
    }
  }
}
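Before relying on the allocation setting, you can verify that the node attribute actually exists in your cluster:

```
# Lists each node's custom attributes; on Elastic Cloud hot-warm deployments,
# hot nodes should show attr=data with value=hot
GET _cat/nodeattrs?v
```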

Step 3: Configure Logstash Elasticsearch output

The example below assumes the field %{app_name} is already defined and populated on each event upstream of the output.

output {
  elasticsearch {
    hosts => ["https://<es_host>:<es_port>"]
    user => "elastic"
    password => "password"
    index => "my-data-stream-%{app_name}"
    # To prevent the output from interfering with the data stream setup,
    # ILM integration is explicitly disabled
    ilm_enabled => false
    # Data streams only support the create action
    action => "create"
  }
}
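To try this out without real traffic, a hypothetical test pipeline could populate app_name with a generator input (the field value here is illustrative; combine it with the elasticsearch output above):

```
input {
  generator {
    # Emits a single test event with app_name set,
    # so the output resolves to my-data-stream-app1
    count => 1
    message => "hello"
    add_field => { "app_name" => "app1" }
  }
}
```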

When Logstash substitutes the field reference %{app_name} with its value from each event, the resulting index name matches the index template defined in Step 2. As a result, the underlying data stream for each "application type" is created automatically.

Example of the resulting backing indices (with rollover) of the data streams created for each "application type":

health status index
green  open   .ds-my-data-stream-app1-000001
green  open   .ds-my-data-stream-app1-000002
green  open   .ds-my-data-stream-app2-000001
green  open   .ds-my-data-stream-app2-000002
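A listing like the one above can be produced with the cat indices API against the backing index pattern (column selection via the h parameter is optional):

```
# Backing indices of data streams are prefixed with .ds-
GET _cat/indices/.ds-my-data-stream-*?v&h=health,status,index
```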
@andreykaipov

This is really awesome! Thank you for the walkthrough!

Is the goal to eventually support variable interpolation in the ilm_rollover_alias option via this approach, or will using the index option like in your example become the recommended approach to work around #858? I can see it being an issue as it might not be apparent to end users that the underlying indices behind an interpolated ilm_rollover_alias are actually data streams and will only support create actions (at the moment).

It's probably still too early to tell, but I figured I'd ask to gauge if it's worth implementing the data streams approach for now on our end.

@karenzone
Contributor

Yes! Good stuff, @ppf2. Adding this to work in queue. Thanks for taking time to share this info with other users.

@ppf2
Member Author

ppf2 commented Sep 2, 2020

Is the goal to eventually support variable interpolation in the ilm_rollover_alias option via this approach, or will using the index option like in your example become the recommended approach to workaround #858?

I think the long term plan is for the output to have actual data stream settings (so that we don't have to deal with the unintuitive setup here: turning ILM off, setting the action option to create, etc.). This will certainly require a code change, so I will have the LS devs comment here :)

@karenzone
Contributor

karenzone commented Sep 2, 2020

@colinsurprenant FYI: Calling your attention to this issue as it relates to our docs work on datastreams.

@colinsurprenant
Contributor

Great stuff @ppf2 - FYI we are currently working on the design and implementation strategy for a new data streams output plugin which will be essentially a stripped down version of the current elasticsearch output, see elastic/logstash#12178. Please let us know if you have any feedback/comments etc!

@kares
Contributor

kares commented Jul 26, 2021

going to close this issue as elastic/logstash#12178 got shipped, let us know if there's anything more we need to do (e.g. in the docs)
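For reference, the work shipped in elastic/logstash#12178 added native data stream settings to the elasticsearch output, roughly along these lines (option names per the logstash-output-elasticsearch plugin documentation; exact defaults may vary by version):

```
output {
  elasticsearch {
    hosts => ["https://<es_host>:<es_port>"]
    # Route events to a data stream instead of a plain index;
    # the stream name resolves to <type>-<dataset>-<namespace>
    data_stream => "true"
    data_stream_type => "logs"
    data_stream_dataset => "app1"
    data_stream_namespace => "default"
  }
}
```

With these options, the manual ilm_enabled => false and action => "create" workaround from the walkthrough above is no longer needed.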

@kares kares closed this as completed Jul 26, 2021