Optimizing Data Storage and Retrieval for Time Series data. #9568
Comments
While this may provide gains in storage/throughput, it introduces a few trade-offs:
These use cases may apply to time-series data, and we should evaluate whether they can be supported with the proposed optimization.
This should also reduce the merging overhead if _source is not stored explicitly. +1 to @mgodwan: as we start looking into a solution, we need to carefully analyze the restrictions it would bring in terms of use cases, and whether those restrictions would be acceptable for time-series workloads specifically. Instead of skipping the _source completely, should we consider filtering the _source for the specific fields for which doc_values is enabled?
Can we try benchmarking something like this using the existing code? It's already possible to exclude some/all fields from source and it's possible to explicitly request that fields be loaded from doc values. That is, users can theoretically tune their index to do exactly what's being proposed here -- it sounds like we just want to make it easier (e.g. by defining a time series index type that excludes doc values fields from source and automatically retrieves those doc values fields at query time). I think the proposed change could help a lot. We could measure that improvement now by making some changes to the
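As a rough illustration of the manual tuning described in this comment, the sketch below builds two request bodies: an index-creation body that keeps selected fields out of the stored _source via the mapping's `_source.excludes`, and a search body that turns off _source and asks for the same fields through `docvalue_fields`. The field names come from the example in the issue description; the field types and the helper function names are assumptions made for illustration only.

```python
import json

# Assumed field types; the issue only says these are keyword/numeric fields.
PROPERTIES = {
    "hostName": {"type": "keyword"},
    "cpu": {"type": "double"},
    "timestamp": {"type": "long"},
}


def build_index_body(excluded_fields):
    """Index-creation body that drops the given fields from the stored _source."""
    return {
        "mappings": {
            "_source": {"excludes": list(excluded_fields)},
            "properties": PROPERTIES,
        }
    }


def build_search_body(doc_value_fields, size=10):
    """Search body that skips _source and loads the fields from doc values."""
    return {
        "size": size,
        "_source": False,
        "docvalue_fields": list(doc_value_fields),
        "query": {"match_all": {}},
    }


if __name__ == "__main__":
    fields = ["hostName", "cpu", "timestamp"]
    print(json.dumps(build_index_body(fields), indent=2))
    print(json.dumps(build_search_body(fields), indent=2))
```

Sending the first body when creating the index and the second to the `_search` endpoint would approximate the proposed behaviour with existing features, which is what makes a before/after benchmark possible.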
Related, if we can make retrieving from doc values "feel" the same as retrieving from source, we could transparently retrieve doc values instead of source if only fields with doc values are requested (to avoid decompressing the stored field block altogether).
Right, @msfroh, agreed: this optimization can help in all cases (_source enabled or disabled).
@msfroh, you are right: we can exclude/include fields from source and request that fields be explicitly loaded from docValues. I benchmarked indexing performance to test the exclusion of fields that are also stored as docValues (mapping below). Findings:
The overhead in this case is mostly due to the rewriting of _source at indexing time. (link) If _source is completely omitted, there is a ~25% gain in storage, a slight gain in indexing throughput (~3-4%), and a ~3% gain in P50/P90 latency. I have yet to benchmark query performance (explicitly requesting that fields be loaded from doc values).
Yes, the idea is to apply these optimisations under the hood and make them the default for time-series indexes. We can keep the source field disabled by default if all fields can be fetched from docValues, and the source can then be generated at query time. If some of the fields do not support docValues, we can keep those as stored fields and, when _source is requested, fetch partially from docValues and partially from stored fields (depending on the nature of the query).
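A minimal sketch of that query-time merge, assuming per-document values have already been read into two plain dictionaries (one from docValues, one from stored fields) keyed by dotted field name. The function names and the rule that docValues win when a field is present in both are assumptions, not part of the proposal.

```python
def set_path(target, dotted_name, value):
    """Place value into a nested dict following a dotted field name."""
    parts = dotted_name.split(".")
    node = target
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value


def derive_source(doc_values, stored_fields):
    """Rebuild a _source-like dict from doc values, falling back to stored fields.

    Both inputs map a field name to the list of values read for one document.
    """
    source = {}
    for name in sorted(set(doc_values) | set(stored_fields)):
        # Assumption: prefer doc values when a field is available in both places.
        values = doc_values[name] if name in doc_values else stored_fields[name]
        set_path(source, name, values[0] if len(values) == 1 else list(values))
    return source


if __name__ == "__main__":
    print(derive_source(
        {"host.name": ["ip-10-0-0-1"], "cpu": [0.5]},
        {"message": ["disk almost full"]},
    ))
```

A source rebuilt this way is only equivalent to the original document, not byte-identical: key order and whitespace are not preserved, which is one of the trade-offs any use case relying on the exact original _source would need to accept.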
I'm going to open a spin-off issue to target just the querying side of this, because I think doing doc value retrieval to avoid decompressing stored source might be a quick win (independent of the indexing changes).
Hi all, I have developed something called derived source (the name is inspired by derived fields). It is similar in nature to this idea. If no one is currently working on this, I'd be happy to draft an RFC to detail my approach.
@bugmakerrrrrr feel free to update this RFC or create a new one with your thoughts and link it here.
@rayshrey and I have been working on a POC to achieve the same as well. Currently, the POC works for basic use cases. We are now working on the update/get and search paths, and we plan to have an RFC in the next couple of weeks.
Background
In OpenSearch, field data is not only stored in various indexes, including inverted indexes, column stores, and BKD indexes, but also in
In the current OpenSearch, some similar functionality is already provided. When creating an index mapping, users can specify includes/excludes in the
Goal
When writing documents to OpenSearch, the source field is not stored at all. On read, the source field is reconstructed from the column store and returned to the user, and the feature stays compatible with existing reindex, script, security and other functions.
Proposal
Mapper
The schema of an index in OpenSearch is specified by its mapping, which corresponds to the definition of each mapper in the code. Mappers naturally have a nested structure: a non-leaf field corresponds to
During the mapping update phase (including creating indexes, dynamically updating mappings when indexing, etc.), the mapping is checked to verify whether each field configuration in the document supports derived source. If any field does not support it, an exception is thrown and the index creation or write is rejected. In the implementation, each type of mapper overrides the validateDerivedSource method of the parent class to check whether the corresponding configuration supports rebuilding the source field. When parsing the mapping, the
Specifically, for non-leaf fields, including the root, the following attributes need to be checked:
For leaf fields, most fields only need to be checked:
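The exact attributes to check are not reproduced in this thread, so the following is only a sketch of how the recursive validation could be shaped. It models mappers as small Python classes rather than the real OpenSearch Java classes, and the concrete rules (reject nested objects and copy_to, require doc_values or store on leaf fields) are assumptions loosely based on the restrictions discussed later in the thread.

```python
from dataclasses import dataclass, field


class DerivedSourceUnsupported(ValueError):
    """Raised when a mapping cannot be served by derived source."""


@dataclass
class LeafMapper:
    name: str
    doc_values: bool = True
    store: bool = False
    copy_to: list = field(default_factory=list)

    def validate_derived_source(self):
        # Assumed rule: copy_to makes the origin of a value ambiguous.
        if self.copy_to:
            raise DerivedSourceUnsupported(f"[{self.name}] uses copy_to")
        # Assumed rule: the value must be readable from somewhere at query time.
        if not (self.doc_values or self.store):
            raise DerivedSourceUnsupported(
                f"[{self.name}] has neither doc_values nor store enabled"
            )


@dataclass
class ObjectMapper:
    name: str
    children: list = field(default_factory=list)
    nested: bool = False

    def validate_derived_source(self):
        # Assumed rule: nested objects are rejected in this sketch.
        if self.nested:
            raise DerivedSourceUnsupported(f"[{self.name}] is a nested object")
        for child in self.children:
            child.validate_derived_source()
```

Calling `validate_derived_source()` on the root mapper during index creation or a dynamic mapping update would then either pass silently or raise, matching the "throw and reject" behaviour described above.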
Write phase
In the current writing phase, before parsing the document, the metadata fields are preprocessed, and it is also at this stage that the
For other APIs that rely on dynamically generated
Read phase
As mentioned above,
In Lucene, the field type corresponding to the
Therefore, we can consider wrapping the
When reconstructing
Background
docValues are a columnar storage format used by Lucene to store indexed data in a way that facilitates efficient aggregation, sorting, etc. Stored fields, on the other hand, are used to store the actual values of fields as they were inserted into the index. Currently the _source field stores the original document as a stored field. We can possibly skip storing the _source field in cases where the values are already stored in docValues and retrieve the field values from docValues instead. This will help reduce the storage cost significantly. Based on the nature of the query we can skip or fetch some or all of the fields from docValues to serve the search queries.
Approach
We can create a third option for the _source field -
To dynamically generate the source we:
POC
https://github.com/rayshrey/OpenSearch/tree/derived_source
Benchmarking
Started the benchmarking with
Initial findings:
To solidify these findings, we ran another benchmark with
For the
Observations:
@mgodwan, @nkumar04, @rayshrey having a derived source is a very good solution. In the k-NN plugin we are also working on building derived source for vectors. Here is the RFC for that: opensearch-project/k-NN#2377. We identified that we don't store vectors in
Also +1 on the indexing improvements: if we remove the fields from source and recovery source, the indexing improvements will be there. In my observations the indexing improvements were greater than 22% for vector workloads when source and recovery source are removed, plus various other benefits. Ref: #13490
@shwetathareja it will not always be the doc values. One example is vectors: before 2.17 they were stored as docValues, but they have now been moved to KnnVectorsFormat (provided by Lucene). :)
Yes - looks like we might be working on something similar - opensearch-project/k-NN#2377 - which probably means it's a good idea and the efforts should be merged! I was working on it just for the vector field, but I think generalization would be good and I want to ensure consistency with the feature. My approach is similar to @bugmakerrrrrr's: adding an extension at the Lucene level. I was working with a custom codec in the plugin, so it looks a bit different. PoC here: https://github.com/jmazanec15/k-NN-1/tree/derived-source-vectors. I did make it work with nested, but it requires implementing a lot of extra logic in the Lucene layer: https://github.com/jmazanec15/k-NN-1/blob/derived-source-vectors/src/main/java/org/opensearch/knn/index/codec/KNN990Codec/ParentChildHelper.java.
@bugmakerrrrrr Why are nested, index and copy_to not supported? @mgodwan @rayshrey did you consider implementing the override at the codec/Lucene level in the stored fields format, as opposed to in OpenSearch-level constructs?
+1, in my proposal, the plugin only needs to override the new methods introduced in the mapper to support derived source functionality for custom field types.
+1. For string types, it can be the stored field; each type of field may require its own unique processing logic.
+1 to @bugmakerrrrrr's points around copy_to. The field will always be ambiguous to deal with.
+1, the idea is to be able to generate the values using the available formats (e.g. doc values, stored fields, vectors). Each field type can dictate how to fetch values for the specific field so that the source returned to users can be generated. This can be done with an interface on the existing Mapper that tells whether the field type can be supported based on the associated params, and then provides a way to append those values to the source.
Having to modify the codec can be tricky, as that will require a lot of interoperability amongst codecs, which the fields declared in plugins may or may not be a candidate for (e.g. knn requires a different codec and can implement it; scaled_float does not require a separate codec and can just function with a couple of interface methods within OpenSearch Mapper constructs using the
I like the idea of having a wrapper on StoredFieldVisitor, as @bugmakerrrrrr also suggested, which will be pretty useful for hiding the underlying details from the OpenSearch integration point.
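To make the wrapper idea concrete, here is a small Python model of it. It does not use Lucene's actual Java StoredFieldVisitor API: the visitor interface (`needs_field`, `binary_field`), the `finish` hook, the toy reader and the `load_doc_values` callback are all simplified stand-ins invented for this sketch. The point it tries to show is that the caller's visitor still receives a `_source` value, while the wrapper hides the fact that it was synthesized rather than read from stored fields.

```python
import json


class SourceCollector:
    """Stands in for a caller-supplied visitor that only wants `_source`."""

    def __init__(self):
        self.source = None

    def needs_field(self, name):
        return name == "_source"

    def binary_field(self, name, value):
        if name == "_source":
            self.source = value


class DerivedSourceVisitor:
    """Wraps another visitor and supplies `_source` built from column data."""

    def __init__(self, delegate, load_doc_values):
        self.delegate = delegate
        # Hypothetical callback: doc_id -> {field name: value}.
        self.load_doc_values = load_doc_values

    def needs_field(self, name):
        # `_source` is not stored, so never ask the reader for it.
        return name != "_source" and self.delegate.needs_field(name)

    def binary_field(self, name, value):
        self.delegate.binary_field(name, value)

    def finish(self, doc_id):
        # Hand the delegate a synthesized `_source` if it asked for one.
        if self.delegate.needs_field("_source"):
            derived = json.dumps(self.load_doc_values(doc_id), sort_keys=True)
            self.delegate.binary_field("_source", derived.encode("utf-8"))


def read_document(doc_id, stored, visitor):
    """Toy stored-fields reader: offers each stored field to the visitor."""
    for name, value in stored.get(doc_id, {}).items():
        if visitor.needs_field(name):
            visitor.binary_field(name, value)
    visitor.finish(doc_id)
```

In this model the code that consumes the collected `_source` does not change at all, which is the property that makes the wrapper attractive as an integration point for plugins.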
If we decide to implement it this way, I can create a PR; I've already implemented a version of this internally.
@bugmakerrrrrr I just published a PR with the first basic changes required for implementing the derived source field. Can you review it and see if it aligns with your implementation? #17040
Thanks @mgodwan, that makes sense. @bugmakerrrrrr I like wrapping the StoredFieldVisitor as well and then overriding it in the plugin. It is very elegant and seems like the right long-term approach. For 2.19, in k-NN, I'm going to try to implement the approach in opensearch-project/k-NN#2377 to get initial feedback and then migrate to the generalized solution in the future.
Nice! There is already one PR by @rayshrey adding a mapper interface and an implementation for the keyword field, which aligns with your thoughts; we can discuss there whether the interfaces are in line. I'm also working on a PR to allow users to configure the usage of derived source for their indices, which I plan to raise once the Mapper interface is finalized so we can verify whether derived source can be supported for user mappings or not.
Thanks for the discussion, folks. A wrapper on StoredFieldVisitor is definitely a neat approach, +1.
Is your feature request related to a problem? Please describe.
In OpenSearch, documents are stored using three primary formats: indexed, stored, and docValues. In the case of time series data, a document usually consists of dimensions, a time point, and quantitative measurements that are used to monitor various aspects of a system, process, or phenomenon. In such cases, numeric, keyword and similar data types are stored both as stored fields and as docValues, each serving specific purposes related to search performance and retrieval. DocValues are a columnar storage format used by Lucene to store indexed data in a way that facilitates efficient aggregation, sorting, etc. Stored fields, on the other hand, are used to store the actual values of fields as they were inserted into the index.
For example, let's look at a document consisting of performance-related data points of an EC2 instance.
Here, hostName, hostIp and zone use keyword as the field type, and timestamp/cpu/jvm use numeric fields. Values for dimensions and measurements are stored in docValues as well as in stored fields as part of _source. The most common search queries for such a data set are aggregations like min/max/avg/sum. As we can see, the data is stored twice here, which we can possibly avoid.
Describe the solution you'd like
Currently the _source field stores the original document as a stored field. We can possibly skip storing the _source field in such cases and retrieve the field values from docValues instead. This will help reduce the storage cost significantly. Based on the nature of the query we can skip or fetch some or all of the fields from docValues to serve the search queries.