Add a "Getting started" guide for the dataframe API (#7643)
### What

WIP getting started guide. A skeleton with some working code for now. Very much in need of feedback.

May need this to pass CI:
- #7720

### Checklist
* [x] I have read and agree to the [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested the web demo (if applicable):
  * Using examples from latest `main` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7643?manifest_url=https://app.rerun.io/version/main/examples_manifest.json)
  * Using full set of examples from `nightly` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7643?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json)
* [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
* [x] If applicable, add a new check to the [release checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)!
* [x] I have noted any breaking changes to the log API in `CHANGELOG.md` and the migration guide

- [PR Build Summary](https://build.rerun.io/pr/7643)
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.

Co-authored-by: gavrelina <[email protected]>
Co-authored-by: Zeljko Mihaljcic <[email protected]>
Co-authored-by: Andreas Reich <[email protected]>
1 parent d911c5c · commit 072070f · showing 8 changed files with 441 additions and 13 deletions.
---
title: Get data out of Rerun
order: 450
---

At its core, Rerun is a database. The viewer includes the [dataframe view](../reference/types/views/dataframe_view) to explore data in tabular form, and the SDK includes an API to export the data as dataframes from the recording. These features can be used, for example, to perform analysis on the data and log the results back to the original recording.

In this three-part guide, we explore such a workflow by implementing an "open jaw detector" on top of our [face tracking example](https://rerun.io/examples/video-image/face_tracking). This process is split into three steps:

1. [Explore a recording with the dataframe view](data-out/explore-as-dataframe)
2. [Export the dataframe](data-out/export-dataframe)
3. [Analyze the data and log the results](data-out/analyze-and-log)

Note: this guide uses the popular [Pandas](https://pandas.pydata.org) dataframe package. The same concepts apply to alternative dataframe packages such as [Polars](https://pola.rs).

If you just want to see the final result, jump to the [complete script](data-out/analyze-and-log.md#complete-script) at the end of the third section.
---
title: Analyze the data and log the results
order: 3
---

In the previous sections, we explored our data and exported it to a Pandas dataframe. In this section, we will analyze the data to extract a "jaw open state" signal and log it back to the viewer.
## Analyze the data

We already identified that thresholding the `jawOpen` signal at 0.15 is all we need to produce a binary "jaw open state" signal.

In the [previous section](export-dataframe.md#inspect-the-dataframe), we prepared a flat, floating point column with the signal of interest called `"jawOpen"`. Let's add a boolean column to our Pandas dataframe to hold our jaw open state:

```python
df["jawOpenState"] = df["jawOpen"] > 0.15
```
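As a quick, optional sanity check, we can compare the signal and the new state column around the tracking gap we saw earlier. Note that comparing NaN against the threshold yields `False`, so the state reads "closed" for the frames where the face is not tracked:

```python
# Inspect the signal and the derived state side by side around the gap.
print(df[["jawOpen", "jawOpenState"]][160:180])
```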
## Log the result back to the viewer

The first step is to initialize the logging SDK targeting the same recording we just analyzed.
This requires matching both the application ID and recording ID precisely.
By using the same identifiers, we're appending new data to an existing recording.
If the recording is currently open in the viewer (and it's listening for new connections), this approach enables us to seamlessly add the new data to the ongoing session.

```python
rr.init(
    recording.application_id(),
    recording_id=recording.recording_id(),
)
rr.connect()
```

_Note_: When automating data analysis, it is typically preferable to log the results to a distinct RRD file next to the source RRD (using `rr.save()`). In such a situation, it is also valid to use the same app ID and recording ID. This allows opening both the source and result RRDs in the viewer, which will display data from both files under the same recording.
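A minimal sketch of that save-to-file variant might look as follows (the output filename here is hypothetical):

```python
rr.init(
    recording.application_id(),
    recording_id=recording.recording_id(),
)

# Instead of connecting to a live viewer, write the results to an RRD
# file next to the source recording (hypothetical filename).
rr.save("face_tracking_analysis.rrd")
```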
We will log our jaw open state data in two forms:
1. As a standalone `Scalar` component, to hold the raw data.
2. As a `Text` component on the existing bounding box entity, such that we obtain a textual representation of the state in the visualization.

Here is how to log the data as a scalar:

```python
rr.send_columns(
    "/jaw_open_state",
    times=[rr.TimeSequenceColumn("frame_nr", df["frame_nr"])],
    components=[
        rr.components.ScalarBatch(df["jawOpenState"]),
    ],
)
```
We use the [`rr.send_columns()`](../../howto/send_columns.md) API to efficiently send the entire column of data in a single batch.

Next, let's log the same data as a `Text` component:
```python
target_entity = "/video/detector/faces/0/bbox"
rr.log_components(target_entity, [rr.components.ShowLabels(True)], static=True)
rr.send_columns(
    target_entity,
    times=[rr.TimeSequenceColumn("frame_nr", df["frame_nr"])],
    components=[
        rr.components.TextBatch(np.where(df["jawOpenState"], "OPEN", "CLOSE")),
    ],
)
```

Here we first log the [`ShowLabels`](../../reference/types/components/show_labels.md) component as static to enable the display of the label. Then, we use `rr.send_columns()` again to send an entire batch of text labels. We use [`np.where()`](https://numpy.org/doc/stable/reference/generated/numpy.where.html) to produce a label matching the state for each timestamp.
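As a quick illustration of that last step, here is what `np.where()` produces for a small, made-up state array:

```python
import numpy as np

state = np.array([False, True, True, False])
print(np.where(state, "OPEN", "CLOSE"))
# -> ['CLOSE' 'OPEN' 'OPEN' 'CLOSE']
```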
### Final result

With some adjustments to the viewer blueprint, we obtain the following result:

<video width="100%" autoplay loop muted controls>
    <source src="https://static.rerun.io/getting-started-data-out/data-out-final-vp8.webm" type="video/webm" />
</video>

The OPEN/CLOSE label is displayed alongside the bounding box in the 2D view, and the `/jaw_open_state` signal is visible in both the timeseries and dataframe views.
### Complete script

Here is the complete script used by this guide to load data, analyze it, and log the result back:

snippet: tutorials/data_out
**docs/content/getting-started/data-out/explore-as-dataframe.md** (72 additions, 0 deletions)
---
title: Explore a recording with the dataframe view
order: 1
---

In this first part of the guide, we run the [face tracking example](https://rerun.io/examples/video-image/face_tracking) and explore the data in the viewer.
## Create a recording

The first step is to create a recording in the viewer using the face tracking example. Check the [face tracking installation instructions](https://rerun.io/examples/video-image/face_tracking#run-the-code) for more information on how to run this example.

Here is such a recording:

<video width="100%" autoplay loop muted controls>
    <source src="https://static.rerun.io/getting-started-data-out/data-out-first-look-vp8.webm" type="video/webm" />
</video>

A person's face is visible and being tracked. Their jaws occasionally open and close. In the middle of the recording, the face is also temporarily hidden and no longer tracked.
## Explore the data

Amongst other things, the [MediaPipe Face Landmark](https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker) package used by the face tracking example outputs so-called blendshape signals, which provide information on various aspects of the facial expression. These signals are logged under the `/blendshapes` root entity by the face tracking example.

One signal, `jawOpen` (logged under the `/blendshapes/0/jawOpen` entity as a [`Scalar`](../../reference/types/components/scalar.md) component), is of particular interest for our purpose. Let's inspect it further using a timeseries view:
<picture>
  <img src="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/full.png" alt="">
  <source media="(max-width: 480px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/480w.png">
  <source media="(max-width: 768px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/768w.png">
  <source media="(max-width: 1024px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/1024w.png">
  <source media="(max-width: 1200px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/1200w.png">
</picture>

This signal indeed seems to jump from approximately 0.0 to 0.5 whenever the jaws are open. We also notice a discontinuity in the middle of the recording. This is due to the blendshapes being [`Clear`](../../reference/types/archetypes/clear.md)ed when no face is detected.
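For reference, clearing is typically done with the `Clear` archetype. Here is a minimal sketch of what the example might do when it loses track of the face (the exact entity path cleared by the example is an assumption here):

```python
import rerun as rr

# Clear all blendshape entities recursively, so that stale values
# do not linger on the timeline once the face is lost.
rr.log("/blendshapes", rr.Clear(recursive=True))
```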
Let's create a dataframe view to further inspect the data:

<picture>
  <img src="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/full.png" alt="">
  <source media="(max-width: 480px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/480w.png">
  <source media="(max-width: 768px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/768w.png">
  <source media="(max-width: 1024px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/1024w.png">
  <source media="(max-width: 1200px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/1200w.png">
</picture>
Here is how this view is configured (a blueprint sketch follows the list):
- Its content is set to `/blendshapes/0/jawOpen`. As a result, the table only contains columns pertaining to that entity (along with any timeline(s)). For this entity, a single column exists in the table, corresponding to the entity's single component (a `Scalar`).
- The `frame_nr` timeline is used as the index for the table. This means that the table will contain one row for each distinct value of `frame_nr` for which data is available.
- The rows can further be filtered by time range. In this case, we keep the default "infinite" boundaries, so no filtering is applied.
- The dataframe view has other advanced features which we are not using here, including filtering rows based on the existence of data for a given column, or filling empty cells with latest-at data.
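Should you prefer to set this up from code, the same configuration can be expressed with the blueprint API. This is a sketch based on the blueprint types available at the time of writing; consult the reference documentation for the exact names in your SDK version:

```python
import rerun.blueprint as rrb

# A dataframe view scoped to the jawOpen entity, indexed by frame_nr.
view = rrb.DataframeView(
    origin="/blendshapes/0/jawOpen",
    query=rrb.archetypes.DataframeQuery(timeline="frame_nr"),
)
```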
<!-- TODO(#7499): add link to more information on filter-is-not-null and fill with latest-at -->

Now, let's look at the actual data as represented in the above screenshot. At around frame #140, the jaws are open, and, accordingly, the `jawOpen` signal has values around 0.55. Shortly after, they close again and the signal decreases to below 0.1. Then, the signal becomes empty. This happens in rows corresponding to the period of time when the face cannot be tracked and all the signals are cleared.
## Next steps

Our exploration of the data in the viewer so far provided us with two important pieces of information useful to implement the jaw open detector.

First, we identified that the `Scalar` value contained in `/blendshapes/0/jawOpen` contains relevant data. In particular, thresholding this signal with a value of 0.15 should provide us with a binary closed/opened jaw state indicator.

Then, we explored the numerical data in a dataframe view. Importantly, the way we configured this view for our needs informs us how to query the recording from code to obtain the correct output.

<!-- TODO(#7462): improve the previous paragraph to mention copy-as-code instead -->

From there, our next step is to query the recording and extract the data as a Pandas dataframe in Python. This is covered in the [next section](export-dataframe.md) of this guide.
**docs/content/getting-started/data-out/export-dataframe.md** (204 additions, 0 deletions)
---
title: Export the dataframe
order: 2
---

In the [previous section](explore-as-dataframe.md), we explored some face tracking data using the dataframe view. In this section, we will see how we can use the dataframe API of the Rerun SDK to export the same data into a [Pandas](https://pandas.pydata.org) dataframe to further inspect and process it.
## Load the recording

The dataframe SDK loads data from an RRD file.
The first step is thus to save the recording as RRD, which can be done from the Rerun menu:

<picture style="zoom: 0.5">
  <img src="https://static.rerun.io/save_recording/ece0f887428b1800a305a3e30faeb57fa3d77cd8/full.png" alt="">
  <source media="(max-width: 480px)" srcset="https://static.rerun.io/save_recording/ece0f887428b1800a305a3e30faeb57fa3d77cd8/480w.png">
</picture>
We can then load the recording in a Python script as follows:

```python
import rerun as rr
import numpy as np  # We'll need this later.

# Load the recording.
recording = rr.dataframe.load_recording("face_tracking.rrd")
```
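As a quick check that the intended file was loaded, we can print the recording's identifiers (we will rely on these same accessors again in the final section of this guide):

```python
# The application ID and recording ID uniquely identify the recording.
print(recording.application_id())
print(recording.recording_id())
```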
## Query the data

Once we have loaded a recording, we can query it to extract some data. Here is how it is done:
```python
# Query the recording into a pandas dataframe.
view = recording.view(
    index="frame_nr",
    contents="/blendshapes/0/jawOpen"
)
table = view.select().read_all()
```
A lot is happening here, so let's go step by step:
1. We first create a _view_ into the recording. The view specifies which index column we want to use (in this case the `"frame_nr"` timeline), and which other content we want to consider (here, only the `/blendshapes/0/jawOpen` entity). The view defines a subset of all the data contained in the recording, where each row has a unique value for the index, and columns are filtered based on the value(s) provided as the `contents` argument.
2. A view can then be queried. Here we use the simplest possible form of querying by calling `select()`. No filtering is applied, and all view columns are selected. The result thus corresponds to the entire view.
3. The object returned by `select()` is a [`pyarrow.RecordBatchReader`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html). This is essentially an iterator that returns the stream of [`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow-recordbatch)es containing the queried data.
4. Finally, we use the [`pyarrow.RecordBatchReader.read_all()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_all) function to read all record batches as a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).
**Note**: queries can be further narrowed by filtering rows and/or selecting a subset of the view columns. See the reference documentation for more information.

<!-- TODO(#7499): add a link to the reference documentation -->
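For illustration, a narrowed query might look like the following sketch. It assumes the range-filtering helper available at the time of writing; check the reference documentation for the exact API in your SDK version:

```python
# Restrict the view to the first 100 frames before selecting.
filtered_view = view.filter_range_sequence(0, 100)
table_subset = filtered_view.select().read_all()
```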
Let's have a look at the resulting table:

```python
print(table)
```

Here is the result:
```
pyarrow.Table
frame_nr: int64
frame_time: timestamp[ns]
log_tick: int64
log_time: timestamp[ns]
/blendshapes/0/jawOpen:Scalar: list<item: double>
  child 0, item: double
----
frame_nr: [[0],[1],...,[412],[413]]
frame_time: [[1970-01-01 00:00:00.000000000],[1970-01-01 00:00:00.040000000],...,[1970-01-01 00:00:16.480000000],[1970-01-01 00:00:16.520000000]]
log_tick: [[34],[92],...,[22077],[22135]]
log_time: [[2024-10-13 08:26:46.819571000],[2024-10-13 08:26:46.866358000],...,[2024-10-13 08:27:01.722971000],[2024-10-13 08:27:01.757358000]]
/blendshapes/0/jawOpen:Scalar: [[[0.03306490555405617]],[[0.03812221810221672]],...,[[0.06996039301156998]],[[0.07366073131561279]]]
```
Again, this is a [PyArrow](https://arrow.apache.org/docs/python/index.html) table which contains the result of our query. Further exploring Arrow structures is beyond the scope of this guide. Yet, it is a reminder that Rerun natively stores (and returns) data in Arrow format. As such, it efficiently interoperates with other Arrow-native and/or compatible tools such as [Polars](https://pola.rs) or [DuckDB](https://duckdb.org).
## Create a Pandas dataframe

Before exploring the data further, let's convert the table to a Pandas dataframe:

```python
df = table.to_pandas()
```

Alternatively, the dataframe can be created directly, without using the intermediate PyArrow table:

```python
df = view.select().read_pandas()
```
## Inspect the dataframe

Let's have a first look at this dataframe:

```python
print(df)
```

Here is the result:

<!-- NOLINT_START -->

```
     frame_nr              frame_time  log_tick                   log_time  /blendshapes/0/jawOpen:Scalar
0           0 1970-01-01 00:00:00.000        34 2024-10-13 08:26:46.819571          [0.03306490555405617]
1           1 1970-01-01 00:00:00.040        92 2024-10-13 08:26:46.866358          [0.03812221810221672]
2           2 1970-01-01 00:00:00.080       150 2024-10-13 08:26:46.899699         [0.027743922546505928]
3           3 1970-01-01 00:00:00.120       208 2024-10-13 08:26:46.934704         [0.024137917906045914]
4           4 1970-01-01 00:00:00.160       266 2024-10-13 08:26:46.967762         [0.022867577150464058]
..        ...                     ...       ...                        ...                            ...
409       409 1970-01-01 00:00:16.360     21903 2024-10-13 08:27:01.619732          [0.07283800840377808]
410       410 1970-01-01 00:00:16.400     21961 2024-10-13 08:27:01.656455          [0.07037288695573807]
411       411 1970-01-01 00:00:16.440     22019 2024-10-13 08:27:01.689784          [0.07556036114692688]
412       412 1970-01-01 00:00:16.480     22077 2024-10-13 08:27:01.722971          [0.06996039301156998]
413       413 1970-01-01 00:00:16.520     22135 2024-10-13 08:27:01.757358          [0.07366073131561279]

[414 rows x 5 columns]
```

<!-- NOLINT_END -->
We can make several observations from this output.

- The first four columns are timeline columns. These are the various timelines the data is logged to in this recording.
- The last column is named `/blendshapes/0/jawOpen:Scalar`. This is what we call a _component column_, and it corresponds to the [`Scalar`](../../reference/types/components/scalar.md) component logged to the `/blendshapes/0/jawOpen` entity.
- Each row in the `/blendshapes/0/jawOpen:Scalar` column consists of a _list_ of (typically one) scalar.

This last point may come as a surprise, but it is a consequence of Rerun's data model, where components are always stored as arrays. This makes it possible, for example, to log an entire point cloud using the [`Points3D`](../../reference/types/archetypes/points3d.md) archetype under a single entity and at a single timestamp.
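To make that last point concrete, here is a sketch of such a batch log call (the entity path and data are made up for illustration):

```python
import numpy as np
import rerun as rr

# An entire point cloud logged under one entity, at one timestamp:
# a single Points3D archetype carries an array of 100 positions.
points = np.random.rand(100, 3)
rr.log("/point_cloud", rr.Points3D(points))
```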
Let's explore this further, recalling that, in our recording, no face was detected at around frame #170:

```python
print(df["/blendshapes/0/jawOpen:Scalar"][160:180])
```
Here is the result:

```
160      [0.0397215373814106]
161    [0.037685077637434006]
162      [0.0402931347489357]
163     [0.04329492896795273]
164      [0.0394592322409153]
165    [0.020853394642472267]
166                        []
167                        []
168                        []
169                        []
170                        []
171                        []
172                        []
173                        []
174                        []
175                        []
176                        []
177                        []
178                        []
179                        []
Name: /blendshapes/0/jawOpen:Scalar, dtype: object
```
We note that the data contains empty lists when no face is detected: from the timestamp at which the blendshapes entities are [`Clear`](../../reference/types/archetypes/clear.md)ed, the signal remains empty until a new value is logged.
While this data representation is useful in general, a flat floating point representation with NaN for missing values is typically more convenient for scalar data. This is achieved using the [`explode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) method:

```python
df["jawOpen"] = df["/blendshapes/0/jawOpen:Scalar"].explode().astype(float)
print(df["jawOpen"][160:180])
```
Here is the result:
```
160    0.039722
161    0.037685
162    0.040293
163    0.043295
164    0.039459
165    0.020853
166         NaN
167         NaN
168         NaN
169         NaN
170         NaN
171         NaN
172         NaN
173         NaN
174         NaN
175         NaN
176         NaN
177         NaN
178         NaN
179         NaN
Name: jawOpen, dtype: float64
```
This confirms that the newly created `"jawOpen"` column now contains regular, 64-bit float numbers, and missing values are represented by NaNs.

_Note_: should you want to filter out the NaNs, you may use the [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method.
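For instance, a quick sketch of that filtering:

```python
# Keep only the rows where the signal is present.
df_valid = df.dropna(subset=["jawOpen"])
```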
## Next steps

With this, we are ready to analyze the data and log the results back to the Rerun viewer, which is covered in the [next section](analyze-and-log.md) of this guide.