-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove dict_id
from arrow_schema::field::Field
and make dictionary IDs an internal implementation detail of flight encoding/decoding
#5981
Comments
I think this would be a nice thing to do -- we have talked about it for a long time. I think the dict_id field in the schema is left over from very very early versions of arrow-rs |
Now that dict IDs are automatically assigned, how about we create a new type eg. |
Seems reasonable. We still need to do step 1 and 2 though. I realize now that we missed the 53.0.0 release so we can't do that until December and the next major release since it is a breaking change |
After trying to apply this change I realize it's actually entirely useless since like you said, after 1 and 2 there won't be any use for |
Once #6711 merges, we can change the default to not preserving dict IDs. Then the next breaking release we remove the ability to set dict IDs in the first place? |
Here is the follow up: #6788 |
This is not entirely true. It is part of Arrow's public API and is used by downstream systems such as Comet. |
As noted by @andygrove on #7017, arrow-java currently also exposes the field ID as part of its schema definition. I have double checked that the C FFI interface does not have a way to propagate this information across an FFI boundary - https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure. Provided we preserve the ability to control the IDs used by the IPC encoder, I think this shouldn't result in a loss of functionality. Perhaps @brancz you could confirm if this is possible. The conclusion of #6873 was to conduct a scream test, I am inclined to think this has failed. |
Copy/pasting from @andygrove on #7017 to get all the context in this ticket:
|
I looked into this more this morning and removed the uses of Removing these uses from our main branch, which uses arrow-rs 53.3.0, causes failures in Arrow IPC, which is to be expected if I am following along correctly. @tustvold Thanks for clarifying that this does not impact FFI. That was my main concern. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the
dict_id
field is only used for the purposes of arrow flight encoding/decoding so dictionaries can be mapped to there associated fields.This is annoying and error-prone as the user is left the responsibility of assigning these dictionary IDs and ensuring that they are unique.
#5971 adds the option to auto-assign dictionary IDs during arrow flight encoding. This can be enabled by setting the
preserve_dict_id
option inIpcWriteOptions
tofalse
(current default istrue
Describe the solution you'd like
This can be done in stages but ultimately would like to
preserve_dict_id
default tofalse
preserve_dict_id
option altogetherdict_id
field fromarrow_schema::schema::Field
entirely as it no longer has any purposeDescribe alternatives you've considered
We can leave this is as a configurable option and either only do 1 above or we can leave auto-assigning of dictionary IDs as an opt-in feature
Additional context
The text was updated successfully, but these errors were encountered: