[Feature Request] Disable field data on "_id" field #12185

Open
bharath-techie opened this issue Feb 6, 2024 · 3 comments
Labels: >breaking, enhancement, Search:Query Capabilities, v3.0.0

Comments

@bharath-techie
Contributor

bharath-techie commented Feb 6, 2024

Is your feature request related to a problem? Please describe

Right now, field data on the "_id" field is enabled by default.

Users who are unaware of the implications sort on the "_id" field over a large dataset and then see a sudden increase in heap usage, because field data for the "_id" field is cached up to the cache size limit (20% of heap by default, or a custom value).

 public static final Setting<Boolean> INDICES_ID_FIELD_DATA_ENABLED_SETTING = Setting.boolSetting(
        "indices.id_field_data.enabled",
        true,
        Property.Dynamic,
        Property.NodeScope
    );

"indices.fielddata.cache.size" is the setting which decides the cache size limit

Describe the solution you'd like

Change the default of this cluster setting to 'false', so that users make a conscious decision to enable field data on the "_id" field.

 public static final Setting<Boolean> INDICES_ID_FIELD_DATA_ENABLED_SETTING = Setting.boolSetting(
        "indices.id_field_data.enabled",
        false,
        Property.Dynamic,
        Property.NodeScope
    );

Users can still use the "_id" field for sorting, aggregations, and so on by simply changing the cluster setting.
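For anyone who wants to opt back in after such a change, here is a minimal sketch of the cluster settings update. It only builds the request and does not send it. The setting key comes from the snippet above; the class name, the base URL, and the choice of `persistent` over `transient` are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class IdFieldDataSetting {
    // Setting key as quoted in this issue.
    static final String SETTING_KEY = "indices.id_field_data.enabled";

    /** Builds the JSON body for PUT _cluster/settings (persistent scope is an assumption). */
    static String settingsBody(boolean enabled) {
        return "{\"persistent\":{\"" + SETTING_KEY + "\":" + enabled + "}}";
    }

    /** Builds, but does not send, the update request. baseUrl is a placeholder. */
    static HttpRequest buildRequest(String baseUrl, boolean enabled) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(enabled), StandardCharsets.UTF_8))
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical local node; replace with a real endpoint and add auth as needed.
        HttpRequest request = buildRequest("http://localhost:9200", true);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody(true));
    }
}
```

Because the setting is declared with Property.Dynamic, an update like this should take effect without a node restart.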

Related component

Search:Query Capabilities

Describe alternatives you've considered

No response

Additional context

No response

@bharath-techie bharath-techie added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 6, 2024
@peternied peternied added >breaking Identifies a breaking change. Indexing Indexing, Bulk Indexing and anything related to indexing and removed untriaged labels Feb 7, 2024
@peternied
Member

[Triage - attendees 1 2 3]
@bharath-techie Thanks for filing this issue. As a triage team, we consider the proposed behavior change breaking, and it could be controversial.

@reta @msfroh What are your thoughts on this issue? Is there someone else who should be looped in to look at this proposal?

@shwetathareja
Member

shwetathareja commented Feb 9, 2024

This would be a breaking change. To start with, we can update the OpenSearch documentation to warn users about the implications of sorting on high-cardinality fields, including _id. We can keep this issue open to gather feedback from community users and then decide whether to do this in 3.0.

@ankitkala ankitkala removed the Indexing Indexing, Bulk Indexing and anything related to indexing label Feb 15, 2024
@msfroh
Collaborator

msfroh commented Feb 22, 2024

What are your thoughts on this issue? Is there someone else who should be looped in to look at this proposal?

I agree that it would a) be breaking and b) be a good idea. Given the opportunity for users to harm their cluster with _id fielddata, IMO we should definitely disable it by default.

Sorting by _id seems like a good option for a sorting tie-breaker, but users may not anticipate the cost in terms of heap usage. (If you really need to sort by ids as a tie-breaker, I would suggest writing the ids to a field with doc_values:true and sorting on that.)
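A rough sketch of that workaround, assuming the application copies each document's id into a separate keyword field at index time. The field name `doc_id`, the class name, and the primary sort field are hypothetical; the code only builds the mapping and search bodies.

```java
public class IdTieBreaker {
    // Hypothetical field that holds a copy of each document's id.
    static final String ID_COPY_FIELD = "doc_id";

    /** Index-creation body: a keyword field with doc_values enabled (stated explicitly). */
    static String mappingBody() {
        return "{\"mappings\":{\"properties\":{\"" + ID_COPY_FIELD
            + "\":{\"type\":\"keyword\",\"doc_values\":true}}}}";
    }

    /** Search body: sort on a primary field, then break ties on the id copy. */
    static String searchBody(String primarySortField) {
        return "{\"sort\":[{\"" + primarySortField + "\":\"desc\"},{\""
            + ID_COPY_FIELD + "\":\"asc\"}]}";
    }

    public static void main(String[] args) {
        System.out.println(mappingBody());
        // "timestamp" is a placeholder for whatever field the query really sorts on.
        System.out.println(searchBody("timestamp"));
    }
}
```

Sorting on such a field reads doc values from disk rather than building field data on the heap, which is the point of the suggestion; the cost is that the indexing client has to populate the extra field itself.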

@shwetathareja shwetathareja added the v3.0.0 Issues and PRs related to version 3.0.0 label Feb 27, 2024
@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024