docs: Improved docs on Transforms #2655

Open
wants to merge 5 commits into
base: main

Conversation

tempdata73
Contributor

Notes:

  • I didn't use vl's example of using argmax on the movies dataset because this whole page uses the cars dataset as a guide. I felt like using the former would break the flow of thought. Nonetheless, I can happily incorporate it if you guys prefer that one.

Sorry for messing up the other pull request (#2654), but I finally fixed my commits and branches.

@mattijn
Contributor

mattijn commented Jul 11, 2022

Thanks for the PR! No problem of messing up the commits. Me or @joelostblom will do a review somewhere in coming days.

@dangotbanned
Member

dangotbanned commented Dec 23, 2024

> Thanks for the PR! No problem of messing up the commits. Me or @joelostblom will do a review somewhere in coming days.

@mattijn, @joelostblom I'm combing through old issues and came across this PR that apparently closes #2645

Obviously we'd need to get this branch up to date, but I wanted to check in to see whether this had actually resolved #2645.

Update

I think I've got this conflict-free now with main 😌

@dangotbanned dangotbanned changed the title Improved docs on Transforms docs: Improved docs on Transforms Dec 23, 2024
@dangotbanned dangotbanned requested a review from dsmedia December 23, 2024 19:46
@dangotbanned
Member

@dsmedia don't feel obligated to review this, just curious if you had any thoughts - since you've done a few doc PRs before?

@dsmedia
Contributor

dsmedia commented Dec 23, 2024

> @dsmedia don't feel obligated to review this, just curious if you had any thoughts - since you've done a few doc PRs before?

Sure. Will have a look this evening.

Contributor

@dsmedia dsmedia left a comment

Great doc additions! I've made some recommendations / edits here for consideration.

Comment on lines +80 to +81
**Note:** As mentioned in :doc:`../data`, this approach of transforming the
data with Pandas is preferable if we already have the DataFrame at hand.

Consider 1) being more explicit about what exactly is meant by the term "at hand" and 2) being upfront in this sentence about the reason or reasons for Pandas transformations being preferable when the DataFrame is "at hand" (automatic type inference? something else also?)

Also, this suggests that data.html discusses these benefits of when a Pandas transformation is preferable, but it wasn't immediately obvious which part of this section of the docs it is referring to.

Comment on lines +94 to +96
It is possible for aggregate functions to not
have an argument. In this case, aggregation will be performed on the column
used in the other axis.

Suggested change
-It is possible for aggregate functions to not
-have an argument. In this case, aggregation will be performed on the column
-used in the other axis.
+Aggregate functions can be used without arguments.
+In such cases, the function will automatically aggregate
+the data from the column specified in the other axis.

:code:`missing`, :code:`distinct` and :code:`valid`) are the ones that get
the most out of this feature.

Argmin / Argmax

Suggested change
-Argmin / Argmax
+Argmin and Argmax Functions

Comment on lines +119 to +147
Both :code:`argmin` and :code:`argmax` aggregate functions can only be used
with the :meth:`~Chart.transform_aggregate` method. Trying to use their
respective shorthand notations will result in an error. This is due to the fact
that either :code:`argmin` or :code:`argmax` functions return an object, not
values. This object then specifies the values to be selected from other
columns when encoding. One can think of the returned object as being a
dictionary, while the column serves the purpose of being a key, which then
obtains its respective value.

The true value of these functions is appreciated when we want to compare the
most **distinctive** samples from two sets of data with respect to another set
of data.

As an example, suppose we want to compare the weight of the strongest cars,
with respect to their country/region of origin. This can be done using
:code:`argmax`:

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        x='greatest_hp[Weight_in_lbs]:Q',
        y='Origin:N'
    ).transform_aggregate(
        greatest_hp='argmax(Horsepower)',
        groupby=['Origin']
    )

It is clear that Japan's strongest car is also the lightest, while that of USA
is the heaviest.

Suggested change
-Both :code:`argmin` and :code:`argmax` aggregate functions can only be used
-with the :meth:`~Chart.transform_aggregate` method. Trying to use their
-respective shorthand notations will result in an error. This is due to the fact
-that either :code:`argmin` or :code:`argmax` functions return an object, not
-values. This object then specifies the values to be selected from other
-columns when encoding. One can think of the returned object as being a
-dictionary, while the column serves the purpose of being a key, which then
-obtains its respective value.
-
-The true value of these functions is appreciated when we want to compare the
-most **distinctive** samples from two sets of data with respect to another set
-of data.
-
-As an example, suppose we want to compare the weight of the strongest cars,
-with respect to their country/region of origin. This can be done using
-:code:`argmax`:
-
-.. altair-plot::
-
-    alt.Chart(cars).mark_bar().encode(
-        x='greatest_hp[Weight_in_lbs]:Q',
-        y='Origin:N'
-    ).transform_aggregate(
-        greatest_hp='argmax(Horsepower)',
-        groupby=['Origin']
-    )
-
-It is clear that Japan's strongest car is also the lightest, while that of USA
-is the heaviest.
+The :code:`argmin` and :code:`argmax` functions help you find values from
+one field that correspond to the minimum or maximum values in another
+field. For example, you might want to find the production budget of
+movies that earned the highest gross revenue in each genre.
+
+These functions must be used with the :meth:`~Chart.transform_aggregate`
+method rather than their shorthand notations. They return objects that act
+as selectors for values in other columns, rather than returning values
+directly. You can think of the returned object as a dictionary where the
+column serves as a key to retrieve corresponding values.
+
+To illustrate this, let's compare the weights of cars with the highest
+horsepower across different regions of origin:
+
+.. altair-plot::
+
+    alt.Chart(cars).mark_bar().encode(
+        x='greatest_hp[Weight_in_lbs]:Q',
+        y='Origin:N'
+    ).transform_aggregate(
+        greatest_hp='argmax(Horsepower)',
+        groupby=['Origin']
+    )
+
+This visualization reveals an interesting contrast: among cars with the
+highest horsepower in their respective regions, Japanese cars are notably
+lighter, while American cars are substantially heavier.
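
To make the dictionary analogy in the text above concrete, here is a minimal plain-Python sketch of what `argmax(Horsepower)` grouped by `Origin` conceptually produces. It is only an illustration: the rows are made-up sample values (not the real cars dataset), `argmax_by_group` is a hypothetical helper rather than an Altair or Vega-Lite API, and the tie handling (the first record seen wins) is an assumption of this sketch.

```python
# Hypothetical sample rows; NOT taken from the real cars dataset.
rows = [
    {"Origin": "USA", "Horsepower": 230, "Weight_in_lbs": 4278},
    {"Origin": "USA", "Horsepower": 150, "Weight_in_lbs": 3436},
    {"Origin": "Japan", "Horsepower": 132, "Weight_in_lbs": 2910},
    {"Origin": "Japan", "Horsepower": 88, "Weight_in_lbs": 2130},
    {"Origin": "Europe", "Horsepower": 133, "Weight_in_lbs": 3410},
    {"Origin": "Europe", "Horsepower": 46, "Weight_in_lbs": 1835},
]


def argmax_by_group(records, field, groupby):
    """Return {group: record}, where record has the largest `field` in its group.

    Hypothetical helper: on ties, the first record seen is kept (an assumption).
    """
    best = {}
    for record in records:
        key = record[groupby]
        if key not in best or record[field] > best[key][field]:
            best[key] = record
    return best


# The aggregate yields a whole record per group, not a single number.
greatest_hp = argmax_by_group(rows, field="Horsepower", groupby="Origin")

# Analogue of the 'greatest_hp[Weight_in_lbs]' field reference: index the
# winning record by another column name to get the value to plot.
weights = {origin: record["Weight_in_lbs"] for origin, record in greatest_hp.items()}
```

Indexing the winning record by `Weight_in_lbs` mirrors the `greatest_hp[Weight_in_lbs]` reference in the chart, which may help explain why a plain shorthand string cannot stand in for the aggregate's result.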

argmin    An input data object containing the minimum field value.  N/A
argmax    An input data object containing the maximum field value.  :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.        :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.             :ref:`gallery_simple_heatmap`

Vega-Lite docs also state

Note: ‘count’ operates directly on the input objects and return the same value regardless of the provided field.

Just mentioning in case it's worth adding here as well?
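
For what it's worth, the distinction in that note can be sketched in plain Python. This is a rough illustration under stated assumptions, not how Vega-Lite implements it: the rows are made up, `None` stands in for a null value, and `count`, `valid`, and `missing` here are hypothetical helpers that only mimic the documented meaning of the aggregates with the same names.

```python
import math

# Hypothetical rows; None stands in for a null value.
rows = [
    {"Origin": "USA", "Horsepower": 130, "Miles_per_Gallon": 18.0},
    {"Origin": "USA", "Horsepower": None, "Miles_per_Gallon": 25.0},
    {"Origin": "Japan", "Horsepower": 95, "Miles_per_Gallon": None},
]


def is_valid(value):
    """True unless the value is None or a float NaN."""
    return value is not None and not (isinstance(value, float) and math.isnan(value))


def count(records, field=None):
    """Count records. The field is accepted but ignored, as the note describes."""
    return len(records)


def valid(records, field):
    """Count records whose value for `field` is neither null nor NaN."""
    return sum(1 for record in records if is_valid(record[field]))


def missing(records, field):
    """Count records whose value for `field` is null."""
    return sum(1 for record in records if record[field] is None)


# count gives the same answer whichever field is named; valid and missing do not.
same_count = count(rows, "Horsepower") == count(rows, "Miles_per_Gallon") == count(rows)
```

Something along these lines might also motivate why `count` is the natural candidate for the argument-free form discussed earlier in the review.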

Comment on lines +171 to +173
=========  ===========================================================================  =====================================
Aggregate  Description                                                                  Example
=========  ===========================================================================  =====================================

The Vega-Lite docs appear to list these in a more logical (if implicit) order, starting with count-related functions (including count, valid, values, missing, and distinct), moving to basic mathematical operations (sum, product), then to central tendency measures (mean/average, variance/variancep, stdev/stdevp, stderr, median), followed by distribution statistics (q1, q3, ci0, ci1), and finally ending with range functions (min/argmin, max/argmax). The ordering here appears to be alphabetical, though it's not strictly so (e.g. ci01). I would have a slight preference for the Vega-Lite-style functional organization scheme (and with explicit headings for the categories).
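
If it helps, here is a small sketch of how the rows could be reordered mechanically. The grouping and order are transcribed from the comment above; the category names and the `functional_order` helper are hypothetical, chosen only for illustration, and are not taken from the Vega-Lite docs.

```python
# Grouping and order follow the review comment; category names are illustrative.
CATEGORIES = {
    "Count": ["count", "valid", "values", "missing", "distinct"],
    "Basic arithmetic": ["sum", "product"],
    "Central tendency and spread": [
        "mean", "average", "variance", "variancep",
        "stdev", "stdevp", "stderr", "median",
    ],
    "Distribution": ["q1", "q3", "ci0", "ci1"],
    "Range": ["min", "argmin", "max", "argmax"],
}

# Flatten the categories into a rank lookup: name -> position in the scheme.
RANK = {
    name: position
    for position, name in enumerate(
        name for group in CATEGORIES.values() for name in group
    )
}


def functional_order(names):
    """Sort aggregate names by the scheme above; unknown names go last, alphabetically."""
    return sorted(names, key=lambda name: (RANK.get(name, len(RANK)), name))


# The four rows quoted earlier in this review, reordered.
reordered = functional_order(["argmin", "argmax", "average", "count"])
```

With this scheme the four rows quoted earlier would come out as `count`, `average`, `argmin`, `argmax`.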

@@ -8,7 +8,7 @@ There are two ways to aggregate data within Altair: within the encoding itself,
 or using a top level aggregate transform.

 The aggregate property of a field definition can be used to compute aggregate
-summary statistics (e.g., median, min, max) over groups of data.
+summary statistics (e.g., :code:`median`, :code:`min`, :code:`max`) over groups of data.

 If at least one fields in the specified encoding channels contain aggregate,
Contributor

@dsmedia dsmedia Dec 24, 2024

Re: the sentence beginning, "If at least one fields..." --> I think this sentence could be rewritten while we're at it

Development

Successfully merging this pull request may close these issues.

improve documentation on aggregation
4 participants