
Caching of search results in memory #9

Open
robbles opened this issue Mar 11, 2015 · 13 comments

Comments

@robbles

robbles commented Mar 11, 2015

Is caching the search results in memory within the scope of this plugin? I'm thinking about implementing something simple and making a pull request, but I'll just make a custom plugin if it's not something that would likely be accepted.

The use case is as follows:

  • I have a large volume of event data going through logstash into ES
  • I have a much smaller set of immutable configuration-like records stored in ES that the events reference by ID
  • I would like to augment the events with the referenced records without needlessly overloading ES with searches, when most of the data would fit in memory

I think this could be accomplished by adding a simple LRU cache and two optional configuration values: the size of the cache in entries, and an identifier that uniquely represents the search. Without these parameters, the plugin would behave as usual and just hit ES every time.
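The proposal above could be sketched in a few lines of Ruby (a hypothetical illustration, not the plugin's actual code), relying on the fact that Ruby Hashes preserve insertion order:

```ruby
# Minimal LRU cache sketch (hypothetical names, not part of the plugin).
# Re-inserting a key on access keeps the least-recently-used entry at the
# front of the Hash, so eviction just removes the first key.
class LruCache
  def initialize(max_entries)
    @max_entries = max_entries
    @store = {}
  end

  # Look up `key`; on a miss, compute the value with the given block
  # (e.g. an Elasticsearch query) and cache it.
  def fetch(key)
    if @store.key?(key)
      @store[key] = @store.delete(key)  # mark as most recently used
    else
      @store[key] = yield(key)
      @store.delete(@store.first[0]) if @store.size > @max_entries
    end
    @store[key]
  end
end
```

With the two proposed settings, the filter would build one such cache of the configured size and key it on the search identifier, falling back to a plain ES query when either setting is absent.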

@roji

roji commented Oct 21, 2015

Big +1 on this.

My use case is similar: I need to augment logging events with immutable ES data. This would be a huge performance boost.

@pemontto

pemontto commented Aug 7, 2017

Absolutely +1 on this. I had actually assumed this was part of the plugin 😢

@sw-jung

sw-jung commented Oct 27, 2017

I have the same need, but this issue has gone unresolved for a long time.

So I created a new filter plugin to solve this and similar problems. Please see logstash-filter-memoize.

@acchen97

Hello all, apologies for the delay here. We are thinking through this caching feature and would love feedback from the broader community.

For each of your use cases, is configurable LRU caching sufficient for most workloads? It would require initial cache warmup, and if the ES lookup dataset changes often, it could result in more misses which would impact throughput.

For our DB lookups, we offer two caching options. The jdbc_streaming filter is used with an LRU caching strategy, while the jdbc_static filter allows for full local caching of the lookup dataset at startup, along with a periodic cache refresh option. Would a similar full local caching strategy be useful for you? Any other strategies you'd like to see?

@pemontto

@acchen97 LRU would be suitable for our use cases, though separate caches for hits and misses would be useful. A full local cache could also be very useful for us.

@acchen97

@pemontto thanks for your input. Do you mind sharing details on your lookup dataset? e.g., what kind of data, how big it is, and how often it changes.

@guyboertje

@pemontto

  • What is the cardinality of the lookup values?
  • Why separate hit and miss caches? Different eviction times?

@guyboertje

@acchen97
I can't see much more feedback on the horizon. IMO we can go ahead with porting the LRU cache code from JDBC Streaming into this filter. We can have separate hit and miss cache instances and a default_data setting that is used as the miss cache value.
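The separate hit/miss caches with a default value for misses could look roughly like this in Ruby (a hypothetical sketch; names and structure are illustrative, not ported from the JDBC Streaming code):

```ruby
# Sketch of separate hit and miss caches (hypothetical, for illustration).
# Misses are cached too, so repeated lookups of absent keys don't hit ES,
# and cached misses return a configurable default instead of nil.
class HitMissCache
  def initialize(hit_cache, miss_cache, default_data)
    @hits = hit_cache        # e.g. an LRU cache for found documents
    @misses = miss_cache     # e.g. a smaller cache for absent keys
    @default_data = default_data
  end

  def get(key)
    return @hits[key] if @hits.key?(key)
    return @default_data if @misses.key?(key)
    doc = yield(key)         # the actual Elasticsearch lookup
    if doc.nil?
      @misses[key] = true
      @default_data
    else
      @hits[key] = doc
    end
  end
end
```

Keeping the two caches as separate instances is what allows them to have different sizes or eviction times, as raised earlier in the thread.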

@CodeCorrupt

Big +1 on this! Has there been any movement since the last update?

@passing

passing commented Jun 7, 2019

@acchen97 we would also benefit from this:
We are processing logs where each document has a specific "application-name" field, referring to the application that the log came from.
We want to keep a dictionary from our CMDB in Elasticsearch that contains, for each application, the responsible business department, product group, application criticality, etc., and we would like to add this information to all log documents.
Since we are processing a few thousand logs per second, we cannot use the elasticsearch filter without caching being available.
LRU would be just fine as long as it supports setting a maximum TTL for the cached data.
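The TTL requirement could be layered on top of any cache by storing an insertion timestamp alongside each value, roughly like this (a hypothetical Ruby sketch, not plugin code):

```ruby
# Sketch of a cache with a maximum TTL (hypothetical names).
# Entries older than `ttl_seconds` are treated as misses and refetched,
# bounding how stale the enrichment data can get.
class TtlCache
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}  # key => [value, inserted_at]
  end

  # `now` is injectable to make expiry testable.
  def fetch(key, now = Time.now)
    entry = @store[key]
    if entry && (now - entry[1]) < @ttl
      entry[0]
    else
      value = yield(key)     # the Elasticsearch lookup
      @store[key] = [value, now]
      value
    end
  end
end
```

Combined with an entry limit, this gives the LRU-plus-maximum-TTL behavior described above.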

@passing

passing commented Jun 7, 2019

By the way, here's another workaround for the missing caching option:
https://reactivelabs.com/blog/2019/01/31/speeding-up-logstash-data-enrichment-with-memcached/
However, this adds quite some complexity to the Logstash configuration, besides requiring a memcached instance to be run.

@tigermatos

I know this is an old request, but just adding a huge +1. We have a Logstash configuration that handles host logs. It would be extremely useful to look up the host properties in Elasticsearch, such as Application, Customer, LogLevel, etc. This metadata (obtained from an Elasticsearch index) would not only enrich the event but drive logic: for example, if the hostname is not tagged for WARN level, drop WARN logs; or if the server belongs to Application XYZ, ship the log to the XYZ index.
Stuff like that. For us, this is only viable if we can cache results locally, to avoid overly frequent lookups. The idea is to look up only when the key (the hostname in this example) is not already found in a local hashmap. As simple as that.
We know how to do this with jdbc_static or memcached, but why build and maintain another database if the data is already in Elasticsearch?
Thanks.

@slippman

slippman commented Feb 9, 2023

+1

10 participants