docs: Improved docs on Transforms #2655

Open
wants to merge 5 commits into
base: main
37 changes: 3 additions & 34 deletions doc/user_guide/encodings/index.rst
Expand Up @@ -250,7 +250,7 @@ Encoding Shorthands
For convenience, Altair allows the specification of the variable name along
with the aggregate and type within a simple shorthand string syntax.
This makes use of the type shorthand codes listed in :ref:`encoding-data-types`
as well as the aggregate names listed in :ref:`encoding-aggregates`.
as well as the aggregate names listed in :ref:`agg-func-table`.
The following table shows examples of the shorthand specification alongside
the long-form equivalent:

Expand Down Expand Up @@ -369,38 +369,7 @@ represents the mean of a third quantity, such as acceleration:
color='mean(Acceleration):Q'
)

Aggregation Functions
^^^^^^^^^^^^^^^^^^^^^

In addition to ``count`` and ``mean``, there are a large number of available
aggregation functions built into Altair:

========= =========================================================================== =====================================
Aggregate Description                                                                 Example
========= =========================================================================== =====================================
argmin    An input data object containing the minimum field value.                    N/A
argmax    An input data object containing the maximum field value.                    :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.                          :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.                               :ref:`gallery_simple_heatmap`
distinct  The count of distinct field values.                                         N/A
max       The maximum field value.                                                    :ref:`gallery_boxplot`
mean      The mean (average) field value.                                             :ref:`gallery_scatter_with_layered_histogram`
median    The median field value                                                      :ref:`gallery_boxplot`
min       The minimum field value.                                                    :ref:`gallery_boxplot`
missing   The count of null or undefined field values.                                N/A
q1        The lower quartile boundary of values.                                      :ref:`gallery_boxplot`
q3        The upper quartile boundary of values.                                      :ref:`gallery_boxplot`
ci0       The lower boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
ci1       The upper boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
stderr    The standard error of the field values.                                     N/A
stdev     The sample standard deviation of field values.                              N/A
stdevp    The population standard deviation of field values.                          N/A
sum       The sum of field values.                                                    :ref:`gallery_streamgraph`
valid     The count of field values that are not null or undefined.                   N/A
values    A list of data objects in the group.                                        N/A
variance  The sample variance of field values.                                        N/A
variancep The population variance of field values.                                    N/A
========= =========================================================================== =====================================
For a full list of available aggregates, see :ref:`agg-func-table`.


Sort Option
Expand Down Expand Up @@ -486,7 +455,7 @@ x-axis, using the barley dataset:
)

The last two charts are the same because the default aggregation
(see :ref:`encoding-aggregates`) is ``mean``. To highlight the
(see :doc:`transform/aggregate`) is ``mean``. To highlight the
difference between sorting via channel and sorting via field consider the
following example where we don't aggregate the data
and use the `op` parameter to specify a different aggregation than `mean`
Expand Down
133 changes: 129 additions & 4 deletions doc/user_guide/transform/aggregate.rst
Expand Up @@ -8,7 +8,7 @@ There are two ways to aggregate data within Altair: within the encoding itself,
or using a top level aggregate transform.

The aggregate property of a field definition can be used to compute aggregate
summary statistics (e.g., median, min, max) over groups of data.
summary statistics (e.g., :code:`median`, :code:`min`, :code:`max`) over groups of data.

If at least one fields in the specified encoding channels contain aggregate,
@dsmedia dsmedia Dec 24, 2024

Re: the sentence beginning, "If at least one fields..." --> I think this sentence could be rewritten while we're at it

the resulting visualization will show aggregate data. In this case, all
Expand Down Expand Up @@ -43,9 +43,9 @@ is made available for convenience, and is equivalent to the longer form::
# ...

For more information on shorthand encodings specifications, see
:ref:`encoding-aggregates`.
:ref:`shorthand-description`.

The same plot can be shown using an explicitly computed aggregation, using the
The same plot can be shown via an explicitly computed aggregation, using the
:meth:`~Chart.transform_aggregate` method:

.. altair-plot::
Expand All @@ -58,7 +58,96 @@ The same plot can be shown using an explicitly computed aggregation, using the
groupby=["Cylinders"]
)

For a list of available aggregates, see :ref:`encoding-aggregates`.
The alternative to using aggregate functions is to preprocess the data with
Pandas, and then plot the resulting DataFrame:

.. altair-plot::

    cars_df = data.cars()
    source = (
        cars_df.groupby('Cylinders')
        .Acceleration
        .mean()
        .reset_index()
        .rename(columns={'Acceleration': 'mean_acc'})
    )

    alt.Chart(source).mark_bar().encode(
        y='Cylinders:O',
        x='mean_acc:Q'
    )

**Note:** As mentioned in :doc:`../data`, this approach of transforming the
data with Pandas is preferable if we already have the DataFrame at hand.
Comment on lines +80 to +81

Consider 1) being more explicit about what exactly is meant by the term "at hand" and 2) being upfront in this sentence about the reason or reasons for Pandas transformations being preferable when the DataFrame is "at hand" (automatic type inference? something else also?)

Also, this suggests that data.html discusses these benefits of when a Pandas transformation is preferable, but it wasn't immediately obvious which part of this section of the docs it is referring to.


Because :code:`Cylinders` is of type :code:`int64` in the :code:`source`
DataFrame, Altair would have treated it as a :code:`quantitative` type (instead
of :code:`ordinal`) had we not specified it. Making the type of data
explicit is important since it affects the resulting plot; see
:ref:`type-legend-scale` and :ref:`type-axis-scale` for two illustrated
examples. As a rule of thumb, it is better to make the data type explicit
than to rely on an implicit type conversion.

Functions Without Arguments
^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible for aggregate functions to not
have an argument. In this case, aggregation will be performed on the column
used in the other axis.
Comment on lines +94 to +96

Suggested change
It is possible for aggregate functions to not
have an argument. In this case, aggregation will be performed on the column
used in the other axis.
Aggregate functions can be used without arguments.
In such cases, the function will automatically aggregate
the data from the column specified in the other axis.


The following chart demonstrates this by counting the number of cars with
respect to their country of origin.

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        y='Origin:N',
        # shorthand form of alt.X(aggregate='count')
        x='count()'
    )

**Note:** The :code:`count` aggregate function is of type
:code:`quantitative` by default, regardless of whether the source data is a
DataFrame, URL pointer, CSV file or JSON file.

Functions that handle categorical data (such as :code:`count`,
:code:`missing`, :code:`distinct` and :code:`valid`) are the ones that get
the most out of this feature.
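
To make the meaning of these four counting aggregates concrete, the following
is a minimal plain-Python sketch of what each one computes per group. It is
not Altair's or Vega-Lite's implementation: the ``records`` list and the
``summarize`` helper are hypothetical, and ``None`` stands in for a null field
value.

```python
from collections import defaultdict

# Hypothetical records (not the real cars dataset); None marks a null value.
records = [
    {"Origin": "USA", "Cylinders": 8, "Horsepower": 130},
    {"Origin": "USA", "Cylinders": 8, "Horsepower": None},
    {"Origin": "USA", "Cylinders": 6, "Horsepower": 95},
    {"Origin": "Japan", "Cylinders": 4, "Horsepower": 88},
    {"Origin": "Japan", "Cylinders": 4, "Horsepower": 65},
]


def summarize(rows, groupby, field, distinct_field):
    """Illustrative helper: counting aggregates for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row)

    summary = {}
    for key, group in groups.items():
        values = [row[field] for row in group]
        summary[key] = {
            # count: number of records in the group, independent of any field
            "count": len(group),
            # valid: field values that are not null
            "valid": sum(value is not None for value in values),
            # missing: field values that are null
            "missing": sum(value is None for value in values),
            # distinct: number of different values of a field
            "distinct": len({row[distinct_field] for row in group}),
        }
    return summary


summary = summarize(records, "Origin", "Horsepower", "Cylinders")
```

Note that ``count`` never looks at a field, which is why it can be written
without an argument.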

Argmin / Argmax

Suggested change
Argmin / Argmax
Argmin and Argmax Functions

^^^^^^^^^^^^^^^
Both :code:`argmin` and :code:`argmax` aggregate functions can only be used
with the :meth:`~Chart.transform_aggregate` method. Trying to use their
respective shorthand notations will result in an error. This is due to the fact
that either :code:`argmin` or :code:`argmax` functions return an object, not
values. This object then specifies the values to be selected from other
columns when encoding. One can think of the returned object as being a
dictionary, while the column serves the purpose of being a key, which then
obtains its respective value.

The true value of these functions is appreciated when we want to compare the
most **distinctive** samples from two sets of data with respect to another set
of data.

As an example, suppose we want to compare the weight of the strongest cars,
with respect to their country/region of origin. This can be done using
:code:`argmax`:

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        x='greatest_hp[Weight_in_lbs]:Q',
        y='Origin:N'
    ).transform_aggregate(
        greatest_hp='argmax(Horsepower)',
        groupby=['Origin']
    )

It is clear that Japan's strongest car is also the lightest, while that of USA
is the heaviest.
Comment on lines +119 to +147

Suggested change
Both :code:`argmin` and :code:`argmax` aggregate functions can only be used
with the :meth:`~Chart.transform_aggregate` method. Trying to use their
respective shorthand notations will result in an error. This is due to the fact
that either :code:`argmin` or :code:`argmax` functions return an object, not
values. This object then specifies the values to be selected from other
columns when encoding. One can think of the returned object as being a
dictionary, while the column serves the purpose of being a key, which then
obtains its respective value.
The true value of these functions is appreciated when we want to compare the
most **distinctive** samples from two sets of data with respect to another set
of data.
As an example, suppose we want to compare the weight of the strongest cars,
with respect to their country/region of origin. This can be done using
:code:`argmax`:
.. altair-plot::
alt.Chart(cars).mark_bar().encode(
x='greatest_hp[Weight_in_lbs]:Q',
y='Origin:N'
).transform_aggregate(
greatest_hp='argmax(Horsepower)',
groupby=['Origin']
)
It is clear that Japan's strongest car is also the lightest, while that of USA
is the heaviest.
The :code:`argmin` and :code:`argmax` functions help you find values from
one field that correspond to the minimum or maximum values in another
field. For example, you might want to find the production budget of
movies that earned the highest gross revenue in each genre.

These functions must be used with the :meth:`~Chart.transform_aggregate`
method rather than their shorthand notations. They return objects that act
as selectors for values in other columns, rather than returning values
directly. You can think of the returned object as a dictionary where the
column serves as a key to retrieve corresponding values.

To illustrate this, let's compare the weights of cars with the highest
horsepower across different regions of origin:

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        x='greatest_hp[Weight_in_lbs]:Q',
        y='Origin:N'
    ).transform_aggregate(
        greatest_hp='argmax(Horsepower)',
        groupby=['Origin']
    )

This visualization reveals an interesting contrast: among cars with the
highest horsepower in their respective regions, Japanese cars are notably
lighter, while American cars are substantially heavier.


See :ref:`gallery_line_chart_with_custom_legend` for another example that uses
:code:`argmax`. The :code:`argmin` function works analogously, selecting the
record with the minimum field value instead.
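
The "dictionary" analogy can be sketched in plain Python. This is only an
illustration of the semantics, not how Altair or Vega-Lite implements
:code:`argmax`; the ``records`` values and the ``argmax_by_group`` helper are
hypothetical.

```python
# Hypothetical records; the numbers are made up for illustration only.
records = [
    {"Origin": "USA", "Horsepower": 230, "Weight_in_lbs": 4278},
    {"Origin": "USA", "Horsepower": 150, "Weight_in_lbs": 3436},
    {"Origin": "Japan", "Horsepower": 132, "Weight_in_lbs": 2910},
    {"Origin": "Japan", "Horsepower": 88, "Weight_in_lbs": 2130},
]


def argmax_by_group(rows, groupby, field):
    """Illustrative helper: keep, per group, the whole record that maximizes `field`."""
    best = {}
    for row in rows:
        key = row[groupby]
        if key not in best or row[field] > best[key][field]:
            best[key] = row
    return best


# Plays the role of greatest_hp='argmax(Horsepower)' with groupby=['Origin'].
greatest_hp = argmax_by_group(records, "Origin", "Horsepower")

# Plays the role of the 'greatest_hp[Weight_in_lbs]' lookup in the encoding.
weights = {origin: row["Weight_in_lbs"] for origin, row in greatest_hp.items()}
```

Because the aggregate yields a whole record rather than a single number, a
second step (the bracket lookup) is needed to pick the column to plot.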

Transform Options
^^^^^^^^^^^^^^^^^
Expand All @@ -70,3 +159,39 @@ class, which has the following options:
The :class:`~AggregatedFieldDef` objects have the following options:

.. altair-object-table:: altair.AggregatedFieldDef

.. _agg-func-table:

List of Aggregation Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to ``count`` and ``average``, there are a large number of available
aggregation functions built into Altair; they are listed in the following table:

========= =========================================================================== =====================================
Aggregate Description                                                                 Example
========= =========================================================================== =====================================
Comment on lines +171 to +173

The vega-lite docs appear to list these in a more logical (if implicit) order, starting with count-related functions (including count, valid, values, missing, and distinct), moving to basic mathematical operations (sum, product), then to central tendency measures (mean/average, variance/variancep, stdev/stdevp, stderr, median), followed by distribution statistics (q1, q3, ci0, ci1), and finally ending with range functions (min/argmin, max/argmax). The ordering here appears to be in alphabetical order, though it's not strictly so (e.g. ci01). I would have a slight preference for the vega-lite-style functional organization scheme (and with explicit headings for the categories).

argmin    An input data object containing the minimum field value.                    N/A
argmax    An input data object containing the maximum field value.                    :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.                          :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.                               :ref:`gallery_simple_heatmap`

Vega-Lite docs also state

Note: ‘count’ operates directly on the input objects and return the same value regardless of the provided field.

Just mentioning in case it's worth adding here as well?

distinct  The count of distinct field values.                                         N/A
max       The maximum field value.                                                    :ref:`gallery_boxplot`
mean      The mean (average) field value.                                             :ref:`gallery_scatter_with_layered_histogram`
median    The median field value.                                                     :ref:`gallery_boxplot`
min       The minimum field value.                                                    :ref:`gallery_boxplot`
missing   The count of null or undefined field values.                                N/A
q1        The lower quartile boundary of values.                                      :ref:`gallery_boxplot`
q3        The upper quartile boundary of values.                                      :ref:`gallery_boxplot`
ci0       The lower boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
ci1       The upper boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
stderr    The standard error of the field values.                                     N/A
stdev     The sample standard deviation of field values.                              N/A
stdevp    The population standard deviation of field values.                          N/A
sum       The sum of field values.                                                    :ref:`gallery_streamgraph`
product   The product of field values.                                                N/A
valid     The count of field values that are not null or undefined.                   N/A
values    A list of data objects in the group.                                        N/A
variance  The sample variance of field values.                                        N/A
variancep The population variance of field values.                                    N/A
========= =========================================================================== =====================================
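
The table distinguishes sample statistics (:code:`stdev`, :code:`variance`)
from their population counterparts (:code:`stdevp`, :code:`variancep`). As a
rough illustration of that distinction only, the sketch below uses Python's
standard ``statistics`` module on a hypothetical list of field values; it
assumes the usual definitions (division by ``n - 1`` for the sample versions
and by ``n`` for the population versions) and does not call Altair.

```python
import statistics

# Hypothetical field values for a single group.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample statistics divide the sum of squared deviations by n - 1.
variance = statistics.variance(values)
stdev = statistics.stdev(values)

# Population statistics divide the sum of squared deviations by n.
variancep = statistics.pvariance(values)
stdevp = statistics.pstdev(values)
```

For the same data the sample versions are always at least as large as the
population versions, and the gap shrinks as the group grows.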