docs: Improved docs on Transforms #2655

37 changes: 3 additions & 34 deletions doc/user_guide/encodings/index.rst
@@ -250,7 +250,7 @@ Encoding Shorthands
For convenience, Altair allows the specification of the variable name along
with the aggregate and type within a simple shorthand string syntax.
This makes use of the type shorthand codes listed in :ref:`encoding-data-types`
as well as the aggregate names listed in :ref:`encoding-aggregates`.
as well as the aggregate names listed in :ref:`agg-func-table`.
The following table shows examples of the shorthand specification alongside
the long-form equivalent:

@@ -369,38 +369,7 @@ represents the mean of a third quantity, such as acceleration:
color='mean(Acceleration):Q'
)

Aggregation Functions
^^^^^^^^^^^^^^^^^^^^^

In addition to ``count`` and ``mean``, there are a large number of available
aggregation functions built into Altair:

========= =========================================================================== =====================================
Aggregate Description                                                                 Example
========= =========================================================================== =====================================
argmin    An input data object containing the minimum field value.                    N/A
argmax    An input data object containing the maximum field value.                    :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.                          :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.                               :ref:`gallery_simple_heatmap`
distinct  The count of distinct field values.                                         N/A
max       The maximum field value.                                                    :ref:`gallery_boxplot`
mean      The mean (average) field value.                                             :ref:`gallery_scatter_with_layered_histogram`
median    The median field value                                                      :ref:`gallery_boxplot`
min       The minimum field value.                                                    :ref:`gallery_boxplot`
missing   The count of null or undefined field values.                                N/A
q1        The lower quartile boundary of values.                                      :ref:`gallery_boxplot`
q3        The upper quartile boundary of values.                                      :ref:`gallery_boxplot`
ci0       The lower boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
ci1       The upper boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
stderr    The standard error of the field values.                                     N/A
stdev     The sample standard deviation of field values.                              N/A
stdevp    The population standard deviation of field values.                          N/A
sum       The sum of field values.                                                    :ref:`gallery_streamgraph`
valid     The count of field values that are not null or undefined.                   N/A
values    A list of data objects in the group.                                        N/A
variance  The sample variance of field values.                                        N/A
variancep The population variance of field values.                                    N/A
========= =========================================================================== =====================================
For a full list of available aggregates, see :ref:`agg-func-table`.


Sort Option
@@ -486,7 +455,7 @@ x-axis, using the barley dataset:
)

The last two charts are the same because the default aggregation
(see :ref:`encoding-aggregates`) is ``mean``. To highlight the
(see :doc:`transform/aggregate`) is ``mean``. To highlight the
difference between sorting via channel and sorting via field consider the
following example where we don't aggregate the data
and use the `op` parameter to specify a different aggregation than `mean`
133 changes: 129 additions & 4 deletions doc/user_guide/transform/aggregate.rst
@@ -8,7 +8,7 @@ There are two ways to aggregate data within Altair: within the encoding itself,
or using a top-level aggregate transform.

The aggregate property of a field definition can be used to compute aggregate
summary statistics (e.g., median, min, max) over groups of data.
summary statistics (e.g., :code:`median`, :code:`min`, :code:`max`) over groups of data.
Review comment (Member):

I do think these should have some markup, but since they aren't functions - median etc seems like the wrong choice.

Something like "median(...)" would link more closely to how you'd use it


If at least one field in the specified encoding channels contains an aggregate,
the resulting visualization will show aggregate data. In this case, all
Expand Down Expand Up @@ -43,9 +43,9 @@ is made available for convenience, and is equivalent to the longer form::
# ...

For more information on shorthand encoding specifications, see
:ref:`encoding-aggregates`.
:ref:`shorthand-description`.
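
The correspondence between a shorthand string and its long form can be sketched in plain Python. This is only an illustration under stated assumptions: ``expand_shorthand``, ``SHORTHAND`` and ``TYPE_CODES`` are hypothetical names, the pattern covers only the common ``field``, ``field:T``, ``agg(field):T`` and ``agg()`` forms with four of the type codes, and it is not Altair's own parser (which, among other things, also assigns a default type to ``count()``).

```python
import re

# Subset of the type shorthand codes used in encodings.
TYPE_CODES = {"Q": "quantitative", "O": "ordinal", "N": "nominal", "T": "temporal"}

# Hypothetical pattern: either 'agg(arg)' or a bare field name,
# optionally followed by ':' and a one-letter type code.
SHORTHAND = re.compile(
    r"^(?:(?P<aggregate>\w+)\((?P<arg>\w*)\)|(?P<field>\w+))(?::(?P<type>[QONT]))?$"
)


def expand_shorthand(shorthand):
    """Return the long-form keyword arguments for a shorthand string."""
    match = SHORTHAND.match(shorthand)
    if match is None:
        raise ValueError(f"unsupported shorthand: {shorthand!r}")
    parts = {}
    if match["aggregate"]:
        parts["aggregate"] = match["aggregate"]
        if match["arg"]:  # 'count()' has no field argument
            parts["field"] = match["arg"]
    else:
        parts["field"] = match["field"]
    if match["type"]:
        parts["type"] = TYPE_CODES[match["type"]]
    return parts


long_form = expand_shorthand("mean(Acceleration):Q")
```

Here ``long_form`` holds the keywords that would be passed to a channel such as ``alt.X`` in the long form.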

The same plot can be shown using an explicitly computed aggregation, using the
The same plot can be shown via an explicitly computed aggregation, using the
:meth:`~Chart.transform_aggregate` method:

.. altair-plot::
@@ -58,7 +58,96 @@ The same plot can be shown using an explicitly computed aggregation, using the
groupby=["Cylinders"]
)
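
As a rough illustration of what a grouped mean aggregation computes, the following plain-Python sketch groups records by one field and averages another. The records are made-up values standing in for a few rows of the cars dataset, and ``aggregate_mean`` is a hypothetical helper, not part of Altair.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; the values are invented for illustration.
rows = [
    {"Cylinders": 4, "Acceleration": 16.0},
    {"Cylinders": 4, "Acceleration": 18.0},
    {"Cylinders": 8, "Acceleration": 12.0},
    {"Cylinders": 8, "Acceleration": 11.0},
    {"Cylinders": 8, "Acceleration": 10.0},
]


def aggregate_mean(records, field, groupby, as_):
    """Group records by `groupby` and store the mean of `field` under `as_`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[groupby]].append(record[field])
    # One output record per group, as an aggregate transform would produce.
    return [{groupby: key, as_: mean(values)} for key, values in groups.items()]


result = aggregate_mean(rows, "Acceleration", "Cylinders", "mean_acc")
```

Each output record carries the grouping key and the new aggregated field, which is the shape of data the chart then encodes.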

For a list of available aggregates, see :ref:`encoding-aggregates`.
An alternative to using aggregate functions is to preprocess the data with
Pandas and then plot the resulting DataFrame:

.. altair-plot::

    cars_df = data.cars()
    source = (
        cars_df.groupby('Cylinders')
        .Acceleration
        .mean()
        .reset_index()
        .rename(columns={'Acceleration': 'mean_acc'})
    )

    alt.Chart(source).mark_bar().encode(
        y='Cylinders:O',
        x='mean_acc:Q'
    )

**Note:** As mentioned in :doc:`../data`, transforming the data with Pandas
is preferable when the data is already available as a DataFrame.
Comment on lines +80 to +81
Review comment (Contributor):

Consider 1) being more explicit about what exactly is meant by the term "at hand" and 2) being upfront in this sentence about the reason or reasons for Pandas transformations being preferable when the DataFrame is "at hand" (automatic type inference? something else also?)

Also, this suggests that data.html discusses these benefits of when a Pandas transformation is preferable, but it wasn't immediately obvious which part of this section of the docs it is referring to.

Review comment (Member):

Also, this suggests that data.html discusses these benefits of when a Pandas transformation is preferable, but it wasn't immediately obvious which part of this section of the docs it is referring to.

I think it should be referencing data-transformations


Because :code:`Cylinders` is of type :code:`int64` in the :code:`source`
DataFrame, Altair would have treated it as :code:`quantitative` rather than
:code:`ordinal` had we not specified the type. Making the data type
explicit is important since it affects the resulting plot; see
:ref:`type-legend-scale` and :ref:`type-axis-scale` for two illustrated
examples. As a rule of thumb, it is better to state the data type explicitly
than to rely on implicit type inference.

Functions Without Arguments
^^^^^^^^^^^^^^^^^^^^^^^^^^^

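Aggregate functions can be used without an argument, as in :code:`count()`.
In this case, the aggregate operates directly on the data objects in each
group rather than on a particular column, and the groups are defined by the
non-aggregated fields of the other encoding channels.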

The following chart demonstrates this by counting the number of cars with
respect to their country of origin.

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        y='Origin:N',
        # shorthand form of alt.Y(aggregate='count')
        x='count()'
    )
Comment on lines +103 to +107
Review comment (Member):

The comment seems like it meant alt.X(aggregate='count'); but I think we can do without

Suggested change
alt.Chart(cars).mark_bar().encode(
y='Origin:N',
# shorthand form of alt.Y(aggregate='count')
x='count()'
)
alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin:N'
)


**Note:** The :code:`count` aggregate function is of type
:code:`quantitative` by default, regardless of whether the source data is a
DataFrame, URL pointer, CSV file or JSON file.
Comment on lines +109 to +111
Review comment (Member):

Suggested change
**Note:** The :code:`count` aggregate function is of type
:code:`quantitative` by default, it does not matter if the source data is a
DataFrame, URL pointer, CSV file or JSON file.
.. note::
The :code:`count` aggregate function is of type :code:`quantitative` by default,
it does not matter if the source data is a DataFrame, URL pointer, CSV file or JSON file.


Functions that handle categorical data (such as :code:`count`,
:code:`missing`, :code:`distinct` and :code:`valid`) are the ones that get
the most out of this feature.
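
The difference between counting data objects and counting field values can be sketched in plain Python. The records below are invented for illustration, and the ``Counter`` expressions only mimic the table descriptions of :code:`count`, :code:`valid` and :code:`missing`; they are not how Altair or Vega-Lite compute them internally.

```python
from collections import Counter

# Hypothetical records; one Horsepower value is deliberately missing.
rows = [
    {"Origin": "USA", "Horsepower": 130},
    {"Origin": "USA", "Horsepower": None},
    {"Origin": "Japan", "Horsepower": 95},
    {"Origin": "Europe", "Horsepower": 46},
    {"Origin": "USA", "Horsepower": 150},
]

# count: the number of data objects per group; no field is inspected.
count_by_origin = Counter(row["Origin"] for row in rows)

# valid / missing: these look at one particular field inside each group.
valid_hp = Counter(row["Origin"] for row in rows if row["Horsepower"] is not None)
missing_hp = Counter(row["Origin"] for row in rows if row["Horsepower"] is None)
```

In this sketch the USA group has three data objects, of which two have a valid ``Horsepower`` and one is missing.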

Argmin / Argmax
^^^^^^^^^^^^^^^
Both the :code:`argmin` and :code:`argmax` aggregate functions can only be used
with the :meth:`~Chart.transform_aggregate` method. Trying to use their
respective shorthand notations will result in an error. This is because
:code:`argmin` and :code:`argmax` return an object rather than a single
value. This object then specifies the values to be selected from other
columns when encoding. One can think of the returned object as a
dictionary in which each column name is a key that maps to the
corresponding value.
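
The dictionary analogy can be made concrete with a small plain-Python sketch. The records and their values are invented, and ``argmax_by_group`` is a hypothetical helper that only mimics the idea of returning a whole data object per group.

```python
# Hypothetical records; the numbers are made up for illustration.
rows = [
    {"Origin": "USA", "Horsepower": 230, "Weight_in_lbs": 4278},
    {"Origin": "USA", "Horsepower": 150, "Weight_in_lbs": 3436},
    {"Origin": "Japan", "Horsepower": 132, "Weight_in_lbs": 2910},
    {"Origin": "Japan", "Horsepower": 95, "Weight_in_lbs": 2372},
]


def argmax_by_group(records, field, groupby):
    """For each group, keep the whole record in which `field` is largest."""
    best = {}
    for record in records:
        key = record[groupby]
        if key not in best or record[field] > best[key][field]:
            best[key] = record
    return best


greatest_hp = argmax_by_group(rows, "Horsepower", "Origin")

# Any other column can now be looked up on the returned record by name.
weights = {origin: record["Weight_in_lbs"] for origin, record in greatest_hp.items()}
```

The lookup ``record["Weight_in_lbs"]`` plays the role of the bracket notation used when encoding a field of the object returned by :code:`argmax`.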

These functions are most useful when we want to compare groups by the
properties of their most extreme records, that is, to look up the value of
one column in the row where another column reaches its minimum or maximum.

As an example, suppose we want to compare the weight of the strongest cars,
with respect to their country/region of origin. This can be done using
:code:`argmax`:

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        x='greatest_hp[Weight_in_lbs]:Q',
        y='Origin:N'
    ).transform_aggregate(
        greatest_hp='argmax(Horsepower)',
        groupby=['Origin']
    )

The chart shows that Japan's strongest car is also the lightest, while that
of the USA is the heaviest.

See :ref:`gallery_line_chart_with_custom_legend` for another example that uses
:code:`argmax`. The :code:`argmin` function works analogously.

Transform Options
^^^^^^^^^^^^^^^^^
@@ -70,3 +159,39 @@ class, which has the following options:
The :class:`~AggregatedFieldDef` objects have the following options:

.. altair-object-table:: altair.AggregatedFieldDef

.. _agg-func-table:

List of Aggregation Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to ``count`` and ``average``, there are a large number of available
aggregation functions built into Altair; they are listed in the following table:

========= =========================================================================== =====================================
Aggregate Description                                                                 Example
========= =========================================================================== =====================================
Comment on lines +170 to +172
Review comment (Contributor):

The vega-lite docs appear to list these in a more logical (if implicit) order, starting with count-related functions (including count, valid, values, missing, and distinct), moving to basic mathematical operations (sum, product), then to central tendency measures (mean/average, variance/variancep, stdev/stdevp, stderr, median), followed by distribution statistics (q1, q3, ci0, ci1), and finally ending with range functions (min/argmin, max/argmax). The ordering here appears to be in alphabetical order, though it's not strictly so (e.g. ci01). I would have a slight preference for the vega-lite-style functional organization scheme (and with explicit headings for the categories).

Review comment (Member):

I agree on changing the order.

I'd probably need to see the end result of adding categories though.
The naive approach of just adding a category field would add a lot of repetition

argmin    An input data object containing the minimum field value.                    N/A
argmax    An input data object containing the maximum field value.                    :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.                          :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.                               :ref:`gallery_simple_heatmap`
Review comment (Contributor):

Vega-Lite docs also state

Note: ‘count’ operates directly on the input objects and return the same value regardless of the provided field.

Just mentioning in case it's worth adding here as well?

Review comment (Member):

Vega-Lite docs also state

Note: ‘count’ operates directly on the input objects and return the same value regardless of the provided field.

Just mentioning in case it's worth adding here as well?

Maybe that phrasing could replace

"... in the other axis" (#2655 (comment))

distinct  The count of distinct field values.                                         N/A
max       The maximum field value.                                                    :ref:`gallery_boxplot`
mean      The mean (average) field value.                                             :ref:`gallery_scatter_with_layered_histogram`
median    The median field value.                                                     :ref:`gallery_boxplot`
min       The minimum field value.                                                    :ref:`gallery_boxplot`
missing   The count of null or undefined field values.                                N/A
q1        The lower quartile boundary of values.                                      :ref:`gallery_boxplot`
q3        The upper quartile boundary of values.                                      :ref:`gallery_boxplot`
ci0       The lower boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
ci1       The upper boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
stderr    The standard error of the field values.                                     N/A
stdev     The sample standard deviation of field values.                              N/A
stdevp    The population standard deviation of field values.                          N/A
sum       The sum of field values.                                                    :ref:`gallery_streamgraph`
product   The product of field values.                                                N/A
valid     The count of field values that are not null or undefined.                   N/A
values    A list of data objects in the group.                                        N/A
variance  The sample variance of field values.                                        N/A
variancep The population variance of field values.                                    N/A
========= =========================================================================== =====================================
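
The distinction between the sample and population variants in the table can be illustrated with Python's standard :code:`statistics` module. This is only a sketch of the definitions: the sample values are arbitrary, and the mapping of :code:`variance`/:code:`stdev` to the sample formulas and :code:`variancep`/:code:`stdevp` to the population formulas follows the table descriptions above.

```python
import math
import statistics

# Arbitrary sample values chosen so the population figures come out round.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample statistics divide the squared deviations by n - 1.
variance = statistics.variance(values)
stdev = statistics.stdev(values)

# Population statistics divide the squared deviations by n.
variancep = statistics.pvariance(values)
stdevp = statistics.pstdev(values)
```

For these eight values the sum of squared deviations from the mean is 32, so the population variance is 32 / 8 and the sample variance is 32 / 7.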