-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reuse KNNVectorFieldData for reduce disk usage #1571
base: main
Are you sure you want to change the base?
Conversation
@navneet1v Easy test:
|
My question was how we are ensuring that KNNSubphase is not running during the search and running only during the re-indexing. |
@navneet1v gotcha, I will do the continues tests like reindex and other scenarios. |
Also there is something called as _recovery_source which is added as a fallback to support the re-indexing. If you are testing locally I would recommend to remove these line of code To ensure that recovery source is never created. This recovery source gets deleted after some when if indexing is happening continuously, but I have never tested this to understand does this really happen or not. |
src/main/java/org/opensearch/knn/index/KNNVectorDVLeafFieldData.java
Outdated
Show resolved
Hide resolved
@navneet1v i tested search source and reindex scenarios with |
src/main/java/org/opensearch/knn/index/fetch/KNNFetchSubPhase.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/KNNVectorDVLeafFieldData.java
Outdated
Show resolved
Hide resolved
@jmazanec15 as I see, we say also, I see we are trying to reconstruct the vector format with but I am not sure Which what do you think which is better. |
@navneet1v I added IT tests for the search, and reindex scenarios, I think it works with |
src/main/java/org/opensearch/knn/index/KNNVectorDVLeafFieldData.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/fetch/KNNFetchSubPhase.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/KNNVectorDVLeafFieldData.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/KNNVectorDVLeafFieldData.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/fetch/KNNFetchSubPhase.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/fetch/KNNFetchSubPhase.java
Outdated
Show resolved
Hide resolved
@luyuncheng and @bugmakerrrrrr agreed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@navneet1v @luyuncheng I've checked the filter logic in FetchSourcePhase, and I think that it's too complicated to implement in this subphase.
@luyuncheng can we fix up the comments and so that we can merge this change? |
Signed-off-by: luyuncheng <[email protected]>
@navneet1v FIXED at 2a61fcd |
Thanks @luyuncheng. I have been reviewing and I think overall it looks good. Im still not confident on the nested portion, particularly the Also, can we capture a list of known limitations in the issue? Somewhere we can refer to when developing the documentation what can and cannot be done with this feature. When testing, here is what I have:
Also, do we know if it works with partially constructed non-nested documents? Are there any other limitations for non-nested case? The functionality that will work:
|
@luyuncheng Im going to work on this one a little bit and see if I can add it to 2.18! Will open up a new PR. |
@jmazanec15 how about let me create a new PR, and rebase on the master? |
@luyuncheng I rebased and started experimenting with it here: https://github.com/jmazanec15/k-NN-1/commits/vector-synthetic-source/. Please take a look! I think before raising a new PR, itd be best to figure out a couple high level approach big questions - just so we dont end up going back and forth on revisions too much. Currently, I have a few concerns with implementing synthetic source as a fetch subphase:
I was discussing with @cwperks the field level security implementation and I thought it was pretty interesting and a similar strategy might be better for our use case. They implement the onIndexModule.setReaderWrapper() (see https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/OpenSearchSecurityPlugin.java#L698). Then, when reading certain fields, they will filter them out. For instance, for source: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/configuration/DlsFlsFilterLeafReader.java#L652-L669. Given that our use case is just the opposite (we want to add fields when they are not present), it seems like this overall approach might make sense and give us a more robust solution that is compatible with a lot of features by default. The major issue with this, however, is that indexModule.setReaderWrapper() can only be set once (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/IndexModule.java#L443-L459). Also, in the javadoc, it says "The wrapped reader can filter out document just like delete documents etc. but must not change any term or document content." I might be misinterpreting this - but it would seem like FLS might be breaking this contract (@cwperks am I incorrect here?) In order for this to work, I think we would need to somehow apply the synthetic source injection before the security fls wrapper does. That being said, I am wondering if we should put an extension point in OpenSearch core that will allow fields to inject into source here in a similar manner to how FLS security is implemented. |
^ that does appear to be the case. I'd have to dive into the change that introduced that comment to understand the motivation. It looks like its a change from before the fork. The FLS/FieldMasking does not change the stored data, but it does modify the result. In the case of FieldMasking it masks the value returned or with FLS it can choose to exclude fields from the result. |
@jmazanec15 as the following code shows: highlight added before plugin's |
@jmazanec15 @cwperks , hey, if we wrapper a |
@luyuncheng I had an alternative approach that I figured might let us cover more cases around fetch. For instance, other processors implementing fetch subphases. Im curious to hear your thoughts on it. Currently, we already have our own custom codec. What if we created our own custom StoredFieldsFormat. The format would need to be incredibly light weight - it would implement a delegate pattern on the upstream. However, for the StoredFieldsReader (which implements StoredFields), we override document:
Then, we can configure the sourceConsumer to manipulate the source via other formats such as doc values reader or vector values reader. Similarly, in the future, we could think about doing the same on the write side so that we can automatically disable source for vector fields by default. This approach would allow us to:
That being said, Im not sure about:
Does this approach sound reasonable @navneet1v @shatejas @heemin32? |
@jmazanec15 let me describe my understand of the usage for a new
As my proposal for this PR, i just want to cut the disk usage, and reuse the data in docValues. in the majority case, we do not want to retrieve knn_vector from source. i like your idea, it can take some advantage as you mentioned, but i do not know how to save the disk usage for a new StoreFieldsFormat |
It would operate in the same way - exclude the vector field from source in the mapping. This would then save on disk. @luyuncheng internally, the "source" is just stored as a stored field in lucene. So, from StoredFieldsReader, we are able to access any stored fields including source. Then, in the CustomStoredFieldsReader: @Override
public void document(int docId, StoredFieldVisitor storedFieldVisitor) throws IOException {
delegate.document(docId, storedFieldVisitor);
if (!(storedFieldVisitor instanceof FieldsVisitor)) {
return;
}
BytesReference originalSource = ((FieldsVisitor) storedFieldVisitor).source()
BytesReference syntheticVector = getVectorFromDocValuesOrVectorValues(docId, field)
putSyntheticVectorIntoSource(originalSource, syntheticVector, field);
} The FieldsVisitor contains the source that will be returned. So, this would let us basically modify the source as early as possible. Thus, we would be able to support as many other features relying on source as possible |
As update - I validated that the FetchSubPhase approach will work with field level security on this branch: https://github.com/jmazanec15/k-NN-1/tree/vector-synthetic-source. So, my only concern remaining with this approach is:
As @luyuncheng mentioned, we do not need to worry about this for core fetch subphases, but it could be problematic for non-core subphases. Also, Im wondering if there are any features out there that do not read the source via the fetch phase routine. |
@jmazanec15 so, when we introduce a new Because
|
@luyuncheng Not sure Im following completely.
I think there is some confusion around Stored Fields from an OpenSearch user perspective and from a Lucene perspective. The _source field is stored in Lucene as a stored field. See here:
So, the "_source" is fetched by calling the StoredFieldsReader.document - See this FieldVisitor. So, if we implement our own StoredFieldsReader, we have a chance to intercept the "_source" stored field on the Lucene level. FLS does something similar here: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/configuration/DlsFlsFilterLeafReader.java#L89. So, taking the stored fields approach,
|
@luyuncheng This is PoC for what I am talking about: jmazanec15@ac5e3f8. |
@jmazanec15 LGTM, after i reviewed the commits, i really like this idea!!! this idea can reduce about 1/3 storage, and it would increase the cpu usage for parse json at store field. |
@jmazanec15 From user experience perspective, adding Another scenario I can think of (if we do read and write together) is the case where user explicitly mentioned And lastly, need to test that when user sends |
Description
in some scenarios, we want to
reduce the disk usage
andio throughput
for the source field. so, we would excludes knn fields in mapping which do not store the source like( this would make knn field can not be retrieve and rebuild)so I propose to use doc_values field for the vector fields. like:
Proposal
KNNVectorDVLeafFieldData
get data from docvaluesi rewrite
KNNVectorDVLeafFieldData
and makeKNN80BinaryDocValues
can return the specific knndocvalue_fields
like: (vector_field1
is knn field type)optimize result:
1m SIFT dataset, 1 shard,
with source store: 1389MB
without source store: 1055MB(-24%)
for the continues dive in to
knndocvalues
fields, I think when use faiss engine, we can usereconstruct_n
interface to retrieve the specific doc values and save the disk usage forBinaryDocValuesFormat
. or like this issue comments for redesign aKnnVectorsFormat
I added
KNNFetchSubPhase
and add a processor likeFetchSourcePhase#FetchSubPhaseProcessor
to combine thedocvalue_fields
into_source
something likesynthetic
logicIssues Resolved
#1087
#1572
KNNVectorDVLeafFieldData
can return the vectorDocValue fields like script do._source
with 1st step docValues fields response. and this way something likesynthetic source
but need explicit add value from search body likedocvalue_fields
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.