[RFC] Derived Source for Vectors #2377
Comments
I think when a model gives fp32 output, this percentage will be higher. With 128D, the number of characters is pretty low compared with something like the Cohere datasets. I have seen this percentage go as high as 80%.
Any reason why we cannot enable it by default? I think we should enable it by default. WDYT? @jmazanec15 one more benefit of removing the vector field from source is a speedup in force merge. I was running some experiments, and I saw that if we don't have the vector in the _source, there is a clearly visible speedup in the force merge of vector indices.
I didn't understand why the user needs to do this. I was thinking we would just exclude the vector field while creating the index.
That makes sense. 80% wouldn't surprise me too much.
Right - this will default to true - but there will be a setting to disable it. One reason to disable it may be for users who are pulling vectors from OpenSearch as a vector store. Reads may be slower in that case.
Oh nice - yes, I think there will be a lot of side-effect benefits from this.
They are not excluding the vector field when creating the index. It will actually be fully transparent. Thus, on search, if they do not exclude the field, it will be returned (like it is today). This keeps the experience consistent. If we wanted to exclude vector fields by default, this could be taken up separately.
can you please elaborate more on this?
Sorry, I am a little confused on this part. Let me try to ask the question again. Are we suggesting that customers exclude vector fields in the index mapping or not?
Sure, the experience is meant to be as transparent as possible. The intention is that customers have the same user experience with derived source as they had without it. In other words, derived source will not require any kind of change in user behavior. When they create an index with a knn_vector field,
they will interact with the index in the same way as if they had not enabled derived source. Thus, a search that does not exclude the field would be expected to return the source vector.
So, if they do not want the vector field returned, they would have to specify source exclusion, just as they do with normal vector indices.
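As a hypothetical sketch of the three requests described above, written as Python dicts: the field name `my_vector`, the dimension, and the document are assumptions chosen for illustration, and `apply_source_excludes` is a toy stand-in for OpenSearch's source filtering, not its implementation.

```python
# Hypothetical request bodies; field name and dimension are illustrative assumptions.
create_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "my_vector": {"type": "knn_vector", "dimension": 2},
        }
    },
}

# A k-NN query that does not mention _source: the vector comes back as usual.
search_body = {
    "query": {"knn": {"my_vector": {"vector": [1.0, 2.0], "k": 1}}},
}

# The same query with the vector excluded from the returned _source.
search_without_vector = {
    **search_body,
    "_source": {"excludes": ["my_vector"]},
}


def apply_source_excludes(source, request_body):
    """Toy stand-in for source filtering: drop top-level excluded fields."""
    excludes = request_body.get("_source", {}).get("excludes", [])
    return {k: v for k, v in source.items() if k not in excludes}


doc = {"title": "example", "my_vector": [1.0, 2.0]}
with_vector = apply_source_excludes(doc, search_body)
without_vector = apply_source_excludes(doc, search_without_vector)
```

Neither request body depends on whether derived source is enabled, which is the point being made.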
@jmazanec15
@jmazanec15 I like this [option 1], and having looked at your PoC code, I think it is pretty good.
@jmazanec15 @navneet1v is there any chance that we can implement a custom codec format which contains
@luyuncheng Yes, I think we should do the Custom Native DocValuesFormat as well for full precision vectors. It seems like it would save some space and have some other utilities as well.
Introduction
This is an RFC that presents a proposal for removing knn_vector fields from the "_source" field without loss of the OpenSearch functionality that "_source" enables. "_source" in this context refers to the per-document field in OpenSearch that stores the original source provided by the user as a StoredField in Lucene. See SourceFieldMapper for more details.
This is a followup for #1571 and #1572.
Problem
Currently, vectors for native indices are stored in 3 places by default
In an experiment with 10k 128-dimensional vectors, the size breakdown of these files was:
With BEST_COMPRESSION codec:
From this, we can see that 47%-60% of the storage goes towards _source. For more details, see: opensearch-project/OpenSearch#6356. Worse yet, for our disk-based feature with quantized vectors (vectors that get compressed in a lossy fashion), the native library files will get smaller than the FlatVectorsFormat file, so _source will take up an even larger percentage of the storage.
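To give a rough sense of why _source can dominate, the sketch below compares the uncompressed JSON text size of one vector with its binary float32 size. The values are randomly generated and rounded to six decimals (an assumption), so the ratio is only illustrative; it is not the measurement from the experiment above and it ignores stored-field compression.

```python
import json
import random
import struct

DIMENSION = 128  # same dimension as the experiment above

random.seed(0)
# Illustrative values only; real embeddings may print with more or fewer digits.
vector = [round(random.uniform(-1.0, 1.0), 6) for _ in range(DIMENSION)]

# Size of the vector as compact JSON text, as it would appear inside _source.
json_bytes = len(json.dumps(vector, separators=(",", ":")).encode("utf-8"))

# Size of the same vector as packed little-endian float32 values.
binary_bytes = len(struct.pack(f"<{DIMENSION}f", *vector))

ratio = json_bytes / binary_bytes
```

Higher-precision model output produces longer text per value, which is consistent with the comment above that the _source share can grow for fp32 embeddings.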
A typical user should not need to get the source vector from OpenSearch. Thus, storing the vectors in _source poses significant problems for users with minimal benefits:
Because of this, we generally recommend to users that they disable storing the vectors in the source. However, this has serious limitations:
So, enter “derived_source”. We take inspiration from the “derived fields” feature of OpenSearch, which uses one format of data for another purpose on the fly. The idea is that we already have the vectors available via the FlatVectorsFormat files (.vec). When we need to read the _source, we should just inject the vector fields into the _source field from the FlatVectorsFormat file. The effect will be that all functionality of OpenSearch works and we get a potential > 50% reduction in storage space for vectors.
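The idea can be sketched as a strip-on-write, inject-on-read round trip. This is a minimal illustration under stated assumptions, not the plugin's code: the function names are hypothetical, a plain dict stands in for the FlatVectorsFormat file, and only top-level fields are handled.

```python
def strip_vectors(source, vector_fields):
    """Write path: split a document into (source without vectors, vectors)."""
    stripped = {k: v for k, v in source.items() if k not in vector_fields}
    vectors = {k: source[k] for k in vector_fields if k in source}
    return stripped, vectors


def inject_vectors(stored_source, vectors):
    """Read path: rebuild the source by adding the vectors back."""
    derived = dict(stored_source)
    derived.update(vectors)
    return derived


original = {"title": "example", "my_vector": [0.5, 1.5, 2.5]}

# Only stored_source would be written to the _source stored field.
stored_source, flat_vectors = strip_vectors(original, {"my_vector"})

# On read, the caller sees a source equivalent to what was indexed.
derived_source = inject_vectors(stored_source, flat_vectors)
```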
Proposed Solutions
[Option # 1] (Preferred) Implement Custom StoredFieldsFormat in existing KNNCodec
Because the KNN plugin already implements its own Codec, we can override the StoredFieldsFormat to intercept and inject the vector fields when needed. This format would use the delegate pattern (as the k-NN plugin already does with core codecs) and only intervene with respect to accesses on the _source stored field on read and write (see PoC).
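The delegate pattern described here can be illustrated with a small Python analogue. The class names and the in-memory layout are hypothetical and do not mirror Lucene's StoredFieldsFormat API; the sketch only shows a wrapper that passes every stored field through unchanged except _source.

```python
import json


class InMemoryStoredFieldsReader:
    """Stand-in for the delegate reader: returns stored fields as bytes."""

    def __init__(self, docs):
        self._docs = docs  # doc_id -> {field_name: bytes}

    def document(self, doc_id):
        return dict(self._docs[doc_id])


class DerivedSourceStoredFieldsReader:
    """Wraps a delegate and only intervenes on the _source field."""

    def __init__(self, delegate, vector_values):
        self._delegate = delegate
        self._vector_values = vector_values  # field -> {doc_id: vector}

    def document(self, doc_id):
        fields = self._delegate.document(doc_id)
        if "_source" not in fields:
            return fields  # nothing to intercept
        source = json.loads(fields["_source"])
        for field, per_doc in self._vector_values.items():
            if doc_id in per_doc:
                source[field] = per_doc[doc_id]
        fields["_source"] = json.dumps(source).encode("utf-8")
        return fields


delegate = InMemoryStoredFieldsReader(
    {0: {"_source": b'{"title": "example"}', "_id": b"doc-0"}}
)
reader = DerivedSourceStoredFieldsReader(delegate, {"my_vector": {0: [0.5, 1.5]}})
fields = reader.document(0)
```

Because the wrapper sits below every consumer of stored fields, any code path that reads _source would see the injected vector, which is the robustness argument made for this option.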
Pros
Cons
For this option, we created a PoC to showcase feasibility. The PoC was able to support the following features:
[Option # 2] Introduce a dedicated FetchSubPhase to inject vector into source
As an alternative, as was done in #1572 by @luyuncheng, we can also create a custom FetchSubPhase in order to prepare the payload with the injected source that we can return to the caller. Generally, this is where _source gets read (though this is not guaranteed).
The general workflow for users would be:
This approach has the following pros/cons:
Pros
Cons
[Option # 3] Implement Custom StoredFieldVisitor
The security plugin has a feature called “Field-level security” where admins can limit access to different users at the field level. This feature requires that they automatically filter or mask privileged fields from _source. This is similar to what we want to do for vectors! They do this by implementing a custom StoredFieldVisitor, FlsStoredFieldsVisitor. The StoredFieldVisitor will be called in the StoredFieldsReader, for a given document and a given field. Thus, their visitor has the option to intercept the “_source” field, and filter/mask the fields they want. They use the “onIndexModule” extension point in order to inject this via a custom readerWrapper.
We could do something similar for vector derived source, where instead of filtering and masking, we inject the vector fields.
Pros
Cons
Summary
We are proposing option 1 because it keeps the UX consistent with the existing OpenSearch UX and extends a low-level enough point to be generally robust.
Proposed User Experience
The user interface will have a cluster setting that indicates whether or not to use the derived_source feature.
Open Questions
Avoid reconstruction of vectors on searches that later filter it out
In the current PoC, if someone excludes a vector field via source filtering, the StoredFieldsReader will still inject the vector into the document, and OpenSearch logic will filter it out later. Instead, we need to figure out a way to skip reconstruction in the first place if the field is going to be excluded anyway. This is a bit tricky to do and may involve a change in core. One idea is to pass this information in the FieldsVisitor and do some kind of type casting to get the information in the StoredFieldsReader component.
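The desired behavior can be sketched as follows, assuming the exclusion list somehow reaches the component that rebuilds the source. How that information would be passed in practice is the open question; the `excludes` parameter and all names here are hypothetical.

```python
class CountingVectorReader:
    """Toy vector reader that counts how many vectors it reconstructs."""

    def __init__(self, vectors):
        self._vectors = vectors  # (field, doc_id) -> vector
        self.reads = 0

    def read(self, field, doc_id):
        self.reads += 1
        return self._vectors[(field, doc_id)]


def derive_source(stored_source, doc_id, vector_fields, vector_reader, excludes=()):
    """Inject vector fields, skipping any that the caller will filter out."""
    derived = dict(stored_source)
    for field in vector_fields:
        if field in excludes:
            continue  # avoid reading a vector that would be dropped anyway
        derived[field] = vector_reader.read(field, doc_id)
    return derived


vector_reader = CountingVectorReader({("my_vector", 0): [0.5, 1.5]})
stored = {"title": "example"}

excluded = derive_source(stored, 0, ["my_vector"], vector_reader, excludes=["my_vector"])
reads_after_excluded = vector_reader.reads

included = derive_source(stored, 0, ["my_vector"], vector_reader)
reads_after_included = vector_reader.reads
```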
Next Steps