-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hooks to json encoder to override default encoding or add support for unsupported types #7015
base: main
Are you sure you want to change the base?
Conversation
… for unsupported types
TODO: add a test showing how to change the encoding of binary arrays. |
cc @tustvold |
Done |
arrow-json/src/writer/encoder.rs
Outdated
type EncoderFactoryResult<'a> = | ||
Result<Option<(Box<dyn Encoder + 'a>, Option<NullBuffer>)>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be Option<&'a NullBuffer>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably could be for consistency, but it isn't likely to have any material performance impact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a review, and think it is probably fine to expose Encoder
, it is a relatively straightforward trait that I don't see changing. It is the reader side that I'd be much more hesitant about altering, as that is quite coupled with the parsing machinery.
That being said there is one slight wrinkle, as documented here
The implementation of Encoder for ArrayFormatter assumes it does not produce
characters that would need to be escaped within a JSON string, e.g.'"'
.
If support for user-provided format specifications is added, this assumption
may need to be revisited
I think we need to do the following:
- Remove the impl of Encoder for ArrayFormatter in favour of a private newtype wrapper
- Clearly document on Encoder the expectation that it generates valid escaped JSON
On a related note, the return type of Box<dyn Encoder + 'a>, Option<NullBuffer>
is a bit of a mouthful, and honestly makes for a bit of a funky API where the semantics of an Encoder
depend on a different data structure returned at construction time.
I wonder if a better API might be something like
pub trait Encoder {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
fn nulls(&self) -> Option<NullBuffer>;
}
I also left a comment on potentially simplifying the factory down to a single method
arrow-json/src/writer/encoder.rs
Outdated
_array: &'a dyn Array, | ||
_data_type: &DataType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we pass both DataType and Array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to, it just simplifies the implementation of the EncoderFactory
a bit because almost surely all of them are going to call array.data_type()
. I can remove if you'd prefer.
arrow-json/src/writer/mod.rs
Outdated
@@ -426,10 +438,13 @@ mod tests { | |||
|
|||
let actual: Vec<Option<Value>> = input | |||
.split(|b| *b == b'\n') | |||
.map(|s| (!s.is_empty()).then(|| serde_json::from_slice(s).unwrap())) | |||
.map(|s| { | |||
println!("{:?}", str::from_utf8(s)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upsies!
arrow-json/src/writer/mod.rs
Outdated
} | ||
} | ||
let (_, type_ids, _, buffers) = array | ||
.as_any() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use AsArray instead of manual downcasts, it is much easier to read
arrow-json/src/writer/encoder.rs
Outdated
|
||
/// Make an encoder that if returned runs after all of the default encoders. | ||
/// If this method returns None then an error will be returned. | ||
fn make_fallback_encoder<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? AFAICT we could remove this with no real loss of functionality, if anything it would be more resilient as adding a new default encoder wouldn't break existing overrides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll drop it in favor of a single method, I think it should work for our use case.
} | ||
|
||
#[derive(Debug)] | ||
struct UnionEncoder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW adding a native UnionEncoder is likely not overly complicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be nice 😄
How about I try to write one on our end and then we can upstream it if it works properly?
arrow-json/src/writer/encoder.rs
Outdated
type EncoderFactoryResult<'a> = | ||
Result<Option<(Box<dyn Encoder + 'a>, Option<NullBuffer>)>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably could be for consistency, but it isn't likely to have any material performance impact
Thank you for the review. I've pushed a (broken) version with most comments addressed aside from the part about |
|
Okay I think I understand now. Looking at the implementation it seems to me that it wouldn't be that much extra work to add formatters for Date, Timestamp, Time and Decimal, then we could ditch ArrayFormatter for JSON encoding altogether. I can do that as a prequel PR. Wdyt? |
A private newtype wrapper would avoid a non-trivial amount of code duplication whilst resolving the issue, and would be my preferred approach |
Makes sense, done! There is quite a bit of code churn from changing the |
We essentially had to vendor this module to implement this. I know this has been discussed in the past and there was pushback against exposing
Encoder
to the public API but since I had to go through the work of updating our vendored version I thought I'd at least try to make a PR to upstream this to make our lives easier in the future and bring value back to the community.In my opinion exposing this is a rather small adition the public API and any breaking changes in the future will be restricted to the JSON encoder, which is not a core part of the project. It seems worth it and addresses various requests e.g. how to encode a binary array (base64 encoded string or array of ints are both common). It has also been a stable API for ~ 1 year now.