-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure IPC stream messages are contiguous #6321
base: main
Are you sure you want to change the base?
Conversation
The failures in integration are expected; JS and Go are still producing IPC files without valid embedded streams so they will not be considered valid by the modified arrow-json-integration-test. I wasn't sure how best to expose that added validation; maybe an env var or build option or we just leave it commented until go and JS are fixed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this. I'm not an expert on this code but here's some initial thoughts from a scan of the changes.
arrow-ipc/src/writer.rs
Outdated
write_message_at_offset(writer, 0, encoded, write_options) | ||
} | ||
|
||
fn write_message_at_offset<W: Write>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: this function name confuses me slightly. If I'm writing to a "file-like object' (e.g. Write
) then there is no underlying capability to write at an offset. "Skip offset bytes" doesn't make sense in a write context (as opposed to a read context).
I think...instead...you are maybe using the offset
to determine how much padding to add at the end?
Maybe write_remaining_message
with offset
replaced by already_written
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about write_message_positioned_absolutely(writer, position, encoded, options)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, since write_message
doesn't seem widely used I could just add the already_written_len
argument there rather than adding a new fn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace how about now?
/// The number of bytes between each block of bytes, as an offset for random access | ||
block_offsets: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand what this was doing so I'm not sure I understand what it means that it is going away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the former mechanism for tracking absolute position in the output file. It was named according to the only thing it was used for: setting Block::offset
s. Since it's now used to track how much padding should be written after flatbuffers Messages and to make its expected value more obvious, I renamed it.
The integration CI test failure looks real to me: https://github.com/apache/arrow-rs/actions/runs/10603567022/job/29388207333?pr=6321 |
@alamb integration failures are expected since this PR adds validation for an aspect of the file format which not all producers guarantee yet (see also the main tracking issue). I wasn't sure how (or if) that added validation should be in this PR, any advice would be welcome. |
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
pub fn write_message<W: Write>( | ||
mut writer: W, | ||
already_written_len: usize, | ||
encoded: EncodedData, | ||
write_options: &IpcWriteOptions, | ||
) -> Result<(usize, usize), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think technically this would be a breaking change as the method is publicly exposed to the API. Could maybe deprecate this and create it as a new method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's preferable I can certainly do that; I think that was the first thing I wrote. Then it seemed that this function is only public for use in arrow-flight, so I prioritized keeping things simpler.
70c88e8
to
9877966
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get some tests please
Marking as a draft as this PR has some feedback and appears to be waiting on tests. I am trying to clear the review backlog in this crate |
Which issue does this PR close?
Closes #6311
Rationale for this change
The format guarantees that each IPC file embeds a valid IPC stream. However, when writing IPC files arrow-rs currently aligns the encapsulated flatbuffers Messages to 64 byte boundaries instead of 8 bytes. This can leave gaps of padding bytes between the Messages which are not valid in the stream format.
What changes are included in this PR?
IpcWriteOptions::alignment
Are there any user-facing changes?
No