Dataframe API v2 p0: Chunk support for dedupe-latest semantics #7558
Conversation
That's just me being dumb: I keep forgetting that these things are not expressible given how […]. Unfortunately, that does mean that my current, fairly readable approach of index-deduping all relevant chunks and then doing a somewhat straightforward streaming join performs even worse than I thought. That's fine: we can always make it fast in a follow-up PR by adding even more cursor shenanigans, but let's integrate it all and cement the semantics first before going there. I still want to keep this implementation around, as it explicitly materializes the semantics we expect, and it can in fact be made pretty fast once we actually have a […].
Yeah, this definitely seems like an area where that would be nice. At least it looks like the relevant PRs are still moving forward on that one.
Strongly agree. At this phase, at least, no individual value should be copied more than once.
```rust
let indices = ArrowPrimitiveArray::from_vec(
    (0..untaken.0.len() as i32)
        .filter(|i| i % 2 == 0)
        .collect_vec(),
);
```
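For context, here is a plain-Rust sketch of what a `take`-style kernel does with an every-other-index selection like the one built above (the function name and types here are illustrative, not the actual Arrow API):

```rust
// Illustrative, simplified version of an Arrow-style `take`:
// gather `values[indices[k]]` into a fresh output buffer.
fn take<T: Clone>(values: &[T], indices: &[i32]) -> Vec<T> {
    indices.iter().map(|&i| values[i as usize].clone()).collect()
}

fn main() {
    let values = ["a", "b", "c", "d", "e"];
    // Every other index, as in the snippet above.
    let indices: Vec<i32> = (0..values.len() as i32).filter(|i| i % 2 == 0).collect();
    assert_eq!(take(&values, &indices), vec!["a", "c", "e"]);
}
```

Note that this always produces a fresh buffer, which is what motivates the slicing question below.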
What happens if we take a contiguous slice? Is there internal optimization for slicing when possible, or does it always copy blindly?
Always copies.
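One way such an optimization could look (a hypothetical sketch, not the actual kernel implementation): detect when the indices form a contiguous run, in which case the take could in principle be served by an offset/length slice instead of a copy.

```rust
// Hypothetical fast-path check: a `take` whose indices are a contiguous
// run could be answered with a zero-copy slice (offset + length) instead
// of gathering into a new buffer.
fn is_contiguous(indices: &[i32]) -> bool {
    indices.windows(2).all(|w| w[1] == w[0] + 1)
}

fn main() {
    // Every-other index, as in the snippet above: not contiguous, must copy.
    let strided: Vec<i32> = (0..6).filter(|i| i % 2 == 0).collect();
    assert!(!is_contiguous(&strided));

    // A contiguous run could be served by slicing alone.
    let contiguous: Vec<i32> = (2..5).collect();
    assert!(is_contiguous(&contiguous));
}
```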
Co-authored-by: Jeremy Leibs <[email protected]>
The new public API definition and nothing else. Speak now.

* Part of #7495
* Requires #7558

---

Co-authored-by: Jeremy Leibs <[email protected]>
A first implementation of the new dataframe APIs. The name is now very misleading though: there isn't anything dataframe-y left in here; it is a row-based iterator with Rerun semantics baked in, driven by a sorted streaming join.

It is rather slow (related: #7558 (comment)), lacks many features, and is full of edge cases, but it works. It does support dedupe-latest semantics (slowly), view contents and selections, chunk overlaps, and pagination (horribly, by virtue of implementing `Iterator`). It does _not_ support `Clear`s, nor `latest-at` sparse-filling, nor PoVs, nor index sampling. Yet. Upcoming PRs will be all about fixing these shortcomings one by one.

It should look somewhat familiar:
```rust
let query_cache = QueryCache::new(store);
let query_engine = QueryEngine {
    store,
    cache: &query_cache,
};

let mut query = QueryExpression2::new(timeline);
query.view_contents = Some(
    query_engine
        .iter_entity_paths(&entity_path_filter)
        .map(|entity_path| (entity_path, None))
        .collect(),
);
query.filtered_index_range = Some(ResolvedTimeRange::new(time_from, time_to));
eprintln!("{query:#?}:");

let query_handle = query_engine.query(query.clone());
// eprintln!("{:#?}", query_handle.selected_contents());
for batch in query_handle.into_batch_iter().skip(offset).take(len) {
    eprintln!("{batch}");
}
```

No tests until we have the guarantee that these are the semantics we will commit to.

* Part of #7495
* Requires #7559
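The "sorted streaming join" driving that iterator can be sketched in plain Rust (a hypothetical, heavily simplified model, not the actual query-engine code): two index-sorted streams are merged by advancing cursors in lockstep, emitting one output row per index value present in either.

```rust
// Simplified sorted streaming join over two index-sorted column streams.
// Each input row is (index, value); each output row carries an Option per
// input so gaps on either side show up as None.
fn streaming_join(
    a: &[(i64, &str)],
    b: &[(i64, &str)],
) -> Vec<(i64, Option<String>, Option<String>)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() || j < b.len() {
        let ta = a.get(i).map(|r| r.0);
        let tb = b.get(j).map(|r| r.0);
        match (ta, tb) {
            // Both streams have a row at this index: join them.
            (Some(x), Some(y)) if x == y => {
                out.push((x, Some(a[i].1.to_string()), Some(b[j].1.to_string())));
                i += 1;
                j += 1;
            }
            // Only `a` has a row at the smaller index.
            (Some(x), Some(y)) if x < y => {
                out.push((x, Some(a[i].1.to_string()), None));
                i += 1;
            }
            // Only `b` has a row at the smaller index.
            (Some(_), Some(y)) => {
                out.push((y, None, Some(b[j].1.to_string())));
                j += 1;
            }
            // One stream is exhausted: drain the other.
            (Some(x), None) => {
                out.push((x, Some(a[i].1.to_string()), None));
                i += 1;
            }
            (None, Some(y)) => {
                out.push((y, None, Some(b[j].1.to_string())));
                j += 1;
            }
            (None, None) => unreachable!(),
        }
    }
    out
}

fn main() {
    let a = [(1, "a1"), (3, "a3")];
    let b = [(2, "b2"), (3, "b3")];
    let joined = streaming_join(&a, &b);
    assert_eq!(joined.len(), 3); // indices 1, 2, 3
}
```

Because both inputs are index-sorted, the merge is a single forward pass; no hash tables or re-sorting are needed, which is what makes a cursor-based follow-up optimization plausible.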
Implements support for the new `dedupe-latest`™️ semantics on `Chunk`. This is one of the fundamental primitives required for the new upcoming dataframe APIs.

This requires the use of the `take` Arrow kernel. Unfortunately, I made the mistake of testing that new kernel, which revealed that it is allocating a lot of data when it shouldn't, so I'll have to fix it at some point in a future PR.
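As a rough illustration of what dedupe-latest means (a hand-rolled sketch, not the actual `Chunk` implementation): among consecutive rows sharing the same index value, only the last one survives.

```rust
// Sketch of dedupe-latest over index-sorted (index, value) rows:
// when several consecutive rows share an index value, the latest wins.
fn dedupe_latest(rows: &[(i64, &str)]) -> Vec<(i64, String)> {
    let mut out: Vec<(i64, String)> = Vec::new();
    for &(t, v) in rows {
        match out.last_mut() {
            // Same index as the previous row: overwrite, latest wins.
            Some(last) if last.0 == t => last.1 = v.to_string(),
            _ => out.push((t, v.to_string())),
        }
    }
    out
}

fn main() {
    let rows = [(1, "a"), (2, "b"), (2, "c"), (3, "d")];
    let deduped = dedupe_latest(&rows);
    assert_eq!(
        deduped,
        vec![
            (1, "a".to_string()),
            (2, "c".to_string()), // "b" at index 2 is shadowed by "c"
            (3, "d".to_string()),
        ]
    );
}
```

In the real implementation, the surviving row positions would be collected into an index array and gathered with the `take` kernel rather than copied row by row.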
Checklist

* Tested the web viewer using the latest `main` build: rerun.io/viewer
* Tested the web viewer using the `nightly` build: rerun.io/viewer
* Updated `CHANGELOG.md` and the migration guide

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.