-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataframe API v2 p2: MVP implementation #7560
Conversation
82d2b01
to
e37a795
Compare
18886e3
to
a989ac2
Compare
e37a795
to
ae03b16
Compare
let cur_index_value = streaming_state_per_component | ||
.values() | ||
// NOTE: We're purposefully ignoring RowId-related semantics here: we just want to know | ||
// the value we're looking for on the "main" index (dedupe semantics). | ||
.min_by_key(|streaming_state| streaming_state.index_value) | ||
.map(|streaming_state| streaming_state.index_value)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than doing this on every call to next_row, I suspect it might be clearer to do this whole thing as a 2-phased process.
First, work just with the Timeline data from every view-relevant chunk to materialize a new column of sorted/unique TimeInt values (note, as an added benefit this is the same input you'll want to be able to feed into sample_index_values()
anyways). This could still be done incrementally, "batch-wise" by only looking at overlapping chunks on some horizon.
Then, once we have the ability to iterate over batches of TimeInts, we iterate through them incrementally and look for the matching values from the relevant chunks, as you're doing below, which now becomes a common code-path between this implementation and sampled_index_values()
Additionally, my gut is that having batches of unique TimeInts in advance sets us up nicely for some future optimizations.
- It lets us fairly easily parallelize the per-selected-column work. Each worker can independently yield a sequence of rows matching the requested sequence of TimeInts.
- It lets us look ahead to check for matching runs in the given columns. Any time we have a matching run in a range with a single column (happy path) we can directly yield a slice of multiple rows from our column-generator.
- Similarly, null runs can quickly be identified and generated when the last TimeInt in the requested batch is less than the next available time-int for the column.
- The aggregator consuming from each of the parallel columns generators can then yield RecordBatches based on overlapping row-runs from the separate columns, which means in the happy path of dense non-overlapping chunks we return to getting nice contiguous slices again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do these improvements in follow up PRs, let's focus on landing all semantics first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed -- not a requested change. Just an observation about the structure to keep in mind as you refactor in the direction of supporting sampled_index_values()
7d1cb72
to
39cfb1a
Compare
1d76116
to
94a9c09
Compare
94a9c09
to
9ce5152
Compare
bc4f392
to
9412150
Compare
We've integrated all of this in @abey79's work-in-progress dataframe-view -- everything works semantics-wise. Next steps (future PRs):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
A first implementation of the new dataframe APIs.
The name is now very misleading though: there isn't anything dataframe-y left in here, it is a row-based iterator with Rerun semantics baked in, driven by a sorted streaming join.
It is rather slow (related: #7558 (comment)), lacks many features and is full of edge cases, but it works.
It does support dedupe-latest semantics (slowly), view contents and selections, chunk overlaps, and pagination (horribly, by virtue of implementing
Iterator
).It does not support
Clear
s, norlatest-at
sparse-filling, nor PoVs, nor index sampling. Yet.Upcoming PRs will be all about fixing these shortcomings one by one.
It should look somewhat familiar:
No tests until we have the guarantee that these are the semantics we will commit to.
Checklist
main
build: rerun.io/viewernightly
build: rerun.io/viewerCHANGELOG.md
and the migration guideTo run all checks from
main
, comment on the PR with@rerun-bot full-check
.