Client-side chunks 4: integrations #6441
Merged
teh-cmc added the 🐍 Python API, 🦀 Rust API, and 🌊 C++ API labels on May 27, 2024
This was referenced May 27, 2024
teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 9d0d4bc to 69895e0 on May 27, 2024 16:45
teh-cmc force-pushed the cmc/dense_chunks_4_integration branch 2 times, most recently from 18d360d to 98095a5 on May 27, 2024 16:58
teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 69895e0 to 7b87a78 on May 29, 2024 07:35
teh-cmc force-pushed the cmc/dense_chunks_4_integration branch from 98095a5 to 4fa737b on May 29, 2024 07:35
@rerun-bot full-check
Started a full build: https://github.com/rerun-io/rerun/actions/runs/9282238447
jleibs approved these changes on May 30, 2024
Nice! This already feels like an improvement without even considering the future benefits.
teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 7b87a78 to 08999d7 on May 31, 2024 07:56
teh-cmc force-pushed the cmc/dense_chunks_4_integration branch from 88abaec to 1292e9d on May 31, 2024 07:57
teh-cmc added a commit that referenced this pull request on May 31, 2024:

This new and improved `re_format_arrow` ™️ brings two major improvements:
- It is now designed to format standard Arrow dataframes (aka chunks or batches), i.e. a `Schema` and a `Chunk`. In particular: chunk-level and field-level schema metadata will now be rendered properly with the rest of the table.
- Tables larger than your terminal will now do their best to fit in, while making sure to still show just enough data.

E.g. here's an excerpt of a real-world Rerun dataframe from our `helix` example:
```
cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose
```

before (`main`):
![image](https://github.com/rerun-io/rerun/assets/2910679/99169b2a-d972-439d-900a-8f122a4d5ca3)

and after:
![image](https://github.com/rerun-io/rerun/assets/2910679/3fe7acce-d646-4ff2-bfae-eb5073d17741)

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
teh-cmc added a commit that referenced this pull request on May 31, 2024:

…6438)

Introduces the new `re_chunk` crate:
> A chunk of Rerun data, encoded using Arrow. Used for logging, transport, storage and compute.

Specifically, it introduces the `Chunk` type itself, and all methods and helpers related to sorting.

A `Chunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage.

There are a lot of things that need to be sorted within a `Chunk`, and as such we must make sure to keep track of what is or isn't sorted at all times, to avoid needlessly re-sorting things every time a chunk changes hands. This necessitates a bunch of sanity checking all over the place to make sure we never end up in undefined states.

`Chunk` is not about transport, it's about providing a nice-to-work-with representation when manipulating a chunk in memory. Transporting a `Chunk` happens in the next PR.

- Fixes #1981

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
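
For readers following along, here is a minimal sketch of the sortedness bookkeeping that commit message describes: cache a "sorted" flag, update it cheaply on mutation, skip sorting when it is already set, and sanity-check the flag against the data. `SketchChunk` and all of its methods are hypothetical names invented for illustration, not the `re_chunk` API, and a real `Chunk` holds Arrow data rather than a plain `Vec<i64>`.

```rust
/// Hypothetical, simplified stand-in for a chunk: a single column of time
/// values plus a cached flag recording whether that column is sorted.
#[derive(Debug, Clone)]
pub struct SketchChunk {
    times: Vec<i64>,
    is_sorted: bool,
}

/// Returns true if `times` is in non-decreasing order.
fn is_non_decreasing(times: &[i64]) -> bool {
    times.windows(2).all(|w| w[0] <= w[1])
}

impl SketchChunk {
    /// Builds a chunk and computes the sortedness flag once, up front.
    pub fn new(times: Vec<i64>) -> Self {
        let is_sorted = is_non_decreasing(&times);
        Self { times, is_sorted }
    }

    pub fn is_sorted(&self) -> bool {
        self.is_sorted
    }

    pub fn times(&self) -> &[i64] {
        &self.times
    }

    /// Appends a value, updating the cached flag in O(1) instead of rescanning.
    pub fn push(&mut self, time: i64) {
        if let Some(&last) = self.times.last() {
            if time < last {
                self.is_sorted = false;
            }
        }
        self.times.push(time);
    }

    /// Sorts only when the cached flag says it is needed.
    /// Returns `true` if a sort was actually performed.
    pub fn sort_if_unsorted(&mut self) -> bool {
        if self.is_sorted {
            return false;
        }
        self.times.sort();
        self.is_sorted = true;
        true
    }

    /// Sanity check: the cached flag must agree with the actual data.
    pub fn sanity_check(&self) -> Result<(), String> {
        let actual = is_non_decreasing(&self.times);
        if actual == self.is_sorted {
            Ok(())
        } else {
            Err(format!(
                "cached is_sorted={} but data says {}",
                self.is_sorted, actual
            ))
        }
    }
}
```

The point of the cached flag is that a consumer receiving an already-sorted chunk pays nothing, while `sanity_check` guards against the flag and the data drifting apart.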
teh-cmc added a commit that referenced this pull request on May 31, 2024:

A `TransportChunk` is a `Chunk` that is ready for transport and/or storage. It is very cheap to go from `Chunk` to a `TransportChunk` and vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a stable ABI, and can be cheaply sent across process boundaries. `arrow2` has no `RecordBatch` type; we will get one once we migrate to `arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate Rerun-specific semantics over the wire, e.g. whether some columns are already properly sorted.

The Arrow metadata system is fairly limited -- it's all untyped strings --, but for now that seems good enough. It will be trivial to switch to something else later, if need be.

- Fixes #1760
- Fixes #1692
- Fixes #3360
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
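
To make the "untyped strings" remark concrete, here is a small sketch of round-tripping typed chunk information through a string-to-string metadata map, which is the shape Arrow schema metadata takes. Everything here is an assumption made for illustration: `ChunkInfo`, `to_metadata`, `from_metadata`, and the `sketch.*` key names are hypothetical and are not claimed to match what `TransportChunk` actually writes.

```rust
use std::collections::BTreeMap;

/// Hypothetical key names; not claimed to match the keys Rerun really uses.
const KEY_ENTITY_PATH: &str = "sketch.entity_path";
const KEY_IS_SORTED: &str = "sketch.is_sorted";

/// Typed view of the chunk-level information we want to send over the wire.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkInfo {
    pub entity_path: String,
    pub is_sorted: bool,
}

/// Encodes the typed info as untyped string key/value pairs.
pub fn to_metadata(info: &ChunkInfo) -> BTreeMap<String, String> {
    let mut md = BTreeMap::new();
    md.insert(KEY_ENTITY_PATH.to_owned(), info.entity_path.clone());
    md.insert(KEY_IS_SORTED.to_owned(), info.is_sorted.to_string());
    md
}

/// Decodes the string pairs back into typed info, validating as it goes.
pub fn from_metadata(md: &BTreeMap<String, String>) -> Result<ChunkInfo, String> {
    let entity_path = md
        .get(KEY_ENTITY_PATH)
        .ok_or_else(|| format!("missing key {:?}", KEY_ENTITY_PATH))?
        .clone();

    // Design choice of this sketch: a missing flag means "assume unsorted",
    // which is always safe, whereas a malformed value is a hard error.
    let is_sorted = match md.get(KEY_IS_SORTED) {
        None => false,
        Some(raw) => raw
            .parse::<bool>()
            .map_err(|err| format!("bad value for {:?}: {}", KEY_IS_SORTED, err))?,
    };

    Ok(ChunkInfo { entity_path, is_sorted })
}
```

Because every value is a string, all validation has to happen at decode time; that is the limitation the commit message alludes to.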
teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 08999d7 to 22f7e61 on May 31, 2024 08:44
teh-cmc added a commit that referenced this pull request on May 31, 2024:

This is a fork of the old `DataTable` batcher, and works very similarly.

Like before, this batcher will micro-batch using both space and time thresholds. There are two main differences:
- This batcher maintains a dataframe per-entity, as opposed to the old one which worked globally.
- Once a threshold is reached, this batcher further splits the incoming batch in order to fulfill these invariants:
```rust
/// In particular, a [`Chunk`] cannot:
/// * contain data for more than one entity path
/// * contain rows with different sets of timelines
/// * use more than one datatype for a given component
/// * contain more rows than a pre-configured threshold if one or more timelines are unsorted
```

Most of the code is the same, the real interesting piece is `PendingRow::many_into_chunks`, as well as the newly added tests.

- Fixes #4431

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
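
As a rough illustration of the splitting step, the sketch below partitions rows so that three of the four invariants listed above hold: one entity path per chunk, one set of timelines per chunk, and a row cap on chunks whose timelines are unsorted. It is not `PendingRow::many_into_chunks`; `Row`, `split_into_chunks`, and `max_rows_if_unsorted` are hypothetical names, components and datatypes are left out entirely, and the real batcher may split differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified row: an entity path plus one time value per timeline.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub entity_path: String,
    /// Timeline name -> time value.
    pub timelines: BTreeMap<String, i64>,
}

/// Returns true if every timeline is non-decreasing across `rows`.
/// Assumes all rows share the same set of timelines.
fn all_timelines_sorted(rows: &[Row]) -> bool {
    rows.windows(2).all(|w| {
        w[0].timelines
            .iter()
            .all(|(name, t0)| w[1].timelines.get(name).map_or(true, |t1| t0 <= t1))
    })
}

/// Splits `rows` into chunks such that each chunk:
/// * contains a single entity path,
/// * contains rows that all share the same set of timelines,
/// * has at most `max_rows_if_unsorted` rows if any timeline is unsorted.
pub fn split_into_chunks(rows: Vec<Row>, max_rows_if_unsorted: usize) -> Vec<Vec<Row>> {
    assert!(max_rows_if_unsorted > 0, "threshold must be positive");

    // Step 1: group by (entity path, timeline names). The names come out of a
    // BTreeMap, so they are already in a canonical (sorted) order.
    let mut groups: BTreeMap<(String, Vec<String>), Vec<Row>> = BTreeMap::new();
    for row in rows {
        let timeline_names: Vec<String> = row.timelines.keys().cloned().collect();
        groups
            .entry((row.entity_path.clone(), timeline_names))
            .or_default()
            .push(row);
    }

    // Step 2: cap the size of groups that are unsorted on some timeline.
    let mut chunks = Vec::new();
    for (_key, group) in groups {
        if all_timelines_sorted(&group) || group.len() <= max_rows_if_unsorted {
            chunks.push(group);
        } else {
            for part in group.chunks(max_rows_if_unsorted) {
                chunks.push(part.to_vec());
            }
        }
    }
    chunks
}
```

Sorted groups are left whole regardless of size in this sketch, since the row cap in the invariant only applies when a timeline is unsorted.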
teh-cmc force-pushed the cmc/dense_chunks_4_integration branch from 1292e9d to acd8cd8 on May 31, 2024 08:48
Labels: 🌊 C++ API, include in changelog, 🐍 Python API, 🦀 Rust API
Integrate the new chunk batcher in all SDKs, and get rid of the old one.

On the backend, we make sure to deserialize incoming chunks into the old `DataTable`s, so business can continue as usual.

Although the new batcher has a much more complicated task with all these sub-splits to manage, it is somehow already more performant than the old one 🤷♂️:

Notice the massive difference in user time.

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- `Chunk` and its shuffle/sort routines #6438
- `TransportChunk` #6439

Checklist
- `main` build: rerun.io/viewer
- `nightly` build: rerun.io/viewer

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.