docs: Improved docs on Transforms #2655
base: main
Thanks for the PR! No problem about messing up the commits. @joelostblom or I will do a review sometime in the coming days.
@mattijn, @joelostblom I'm combing through old issues and came across this PR that apparently closes #2645. Obviously we'd need to get this branch up to date, but I wanted to check in to see if this had actually resolved #2645?

Update: I think I've got this conflict-free now with
Previous merge was super messy, due to 2 year old PR
@dsmedia don't feel obligated to review this, just curious if you had any thoughts - since you've done a few doc PRs before?
Sure. Will have a look this evening.
Great doc additions! I've made some recommendations / edits here for consideration.
**Note:** As mentioned in :doc:`../data`, this approach of transforming the
data with Pandas is preferable if we already have the DataFrame at hand.
Consider 1) being more explicit about what exactly is meant by the term "at hand" and 2) being upfront in this sentence about the reason or reasons for Pandas transformations being preferable when the DataFrame is "at hand" (automatic type inference? something else also?)
Also, this suggests that data.html discusses these benefits of when a Pandas transformation is preferable, but it wasn't immediately obvious which part of this section of the docs it is referring to.
It is possible for aggregate functions to not
have an argument. In this case, aggregation will be performed on the column
used in the other axis.
Suggested change:

Aggregate functions can be used without arguments.
In such cases, the function will automatically aggregate
the data from the column specified in the other axis.
:code:`missing`, :code:`distinct` and :code:`valid`) are the ones that get
the most out of this feature.

Argmin / Argmax
Suggested change:

Argmin and Argmax Functions
Both :code:`argmin` and :code:`argmax` aggregate functions can only be used
with the :meth:`~Chart.transform_aggregate` method. Trying to use their
respective shorthand notations will result in an error. This is due to the fact
that either :code:`argmin` or :code:`argmax` functions return an object, not
values. This object then specifies the values to be selected from other
columns when encoding. One can think of the returned object as being a
dictionary, while the column serves the purpose of being a key, which then
obtains its respective value.

The true value of these functions is appreciated when we want to compare the
most **distinctive** samples from two sets of data with respect to another set
of data.

As an example, suppose we want to compare the weight of the strongest cars,
with respect to their country/region of origin. This can be done using
:code:`argmax`:

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        x='greatest_hp[Weight_in_lbs]:Q',
        y='Origin:N'
    ).transform_aggregate(
        greatest_hp='argmax(Horsepower)',
        groupby=['Origin']
    )

It is clear that Japan's strongest car is also the lightest, while that of USA
is the heaviest.
Suggested change:

The :code:`argmin` and :code:`argmax` functions help you find values from
one field that correspond to the minimum or maximum values in another
field. For example, you might want to find the production budget of
movies that earned the highest gross revenue in each genre.

These functions must be used with the :meth:`~Chart.transform_aggregate`
method rather than their shorthand notations. They return objects that act
as selectors for values in other columns, rather than returning values
directly. You can think of the returned object as a dictionary where the
column serves as a key to retrieve corresponding values.

To illustrate this, let's compare the weights of cars with the highest
horsepower across different regions of origin:

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        x='greatest_hp[Weight_in_lbs]:Q',
        y='Origin:N'
    ).transform_aggregate(
        greatest_hp='argmax(Horsepower)',
        groupby=['Origin']
    )

This visualization reveals an interesting contrast: among cars with the
highest horsepower in their respective regions, Japanese cars are notably
lighter, while American cars are substantially heavier.
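
The "returned object acts like a dictionary" description may be easier to follow with a small plain-Python sketch of the semantics. This is not how Altair or Vega-Lite implements :code:`argmax`; the rows are made up and the ``argmax_by`` helper is hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Made-up rows for illustration; these are not values from the cars dataset.
rows = [
    {"Origin": "USA", "Horsepower": 230, "Weight_in_lbs": 4278},
    {"Origin": "USA", "Horsepower": 150, "Weight_in_lbs": 3433},
    {"Origin": "Japan", "Horsepower": 132, "Weight_in_lbs": 2910},
    {"Origin": "Japan", "Horsepower": 88, "Weight_in_lbs": 2130},
]


def argmax_by(rows, field, groupby):
    """Hypothetical helper: per group, return the whole row whose `field` is largest."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row)
    return {key: max(members, key=lambda r: r[field]) for key, members in groups.items()}


greatest_hp = argmax_by(rows, field="Horsepower", groupby="Origin")

# Each aggregate is a full row (an object), so a second lookup selects the value
# to plot, mirroring the 'greatest_hp[Weight_in_lbs]' shorthand in the chart.
weights = {origin: row["Weight_in_lbs"] for origin, row in greatest_hp.items()}
print(weights)  # {'USA': 4278, 'Japan': 2910}
```

Because the aggregate yields a row and not a single number, a shorthand such as ``argmax(Horsepower)`` on its own gives the encoding nothing to plot until a column is selected from that row.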
argmin    An input data object containing the minimum field value.                    N/A
argmax    An input data object containing the maximum field value.                    :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.                          :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.                               :ref:`gallery_simple_heatmap`
Vega-Lite docs also state
Note: ‘count’ operates directly on the input objects and return the same value regardless of the provided field.
Just mentioning in case it's worth adding here as well?
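
The quoted note can be illustrated with a small plain-Python sketch, assuming made-up rows and hypothetical ``count`` and ``valid`` helpers (neither is Altair or Vega-Lite code). Counting only looks at how many objects a group holds, so naming a field does not change the result; ``valid`` is included for contrast and approximates "not missing" with a simple ``None`` check.

```python
# Made-up rows for illustration only.
rows = [
    {"Origin": "USA", "Horsepower": 230, "Weight_in_lbs": 4278},
    {"Origin": "USA", "Horsepower": None, "Weight_in_lbs": 3433},
    {"Origin": "Japan", "Horsepower": 132, "Weight_in_lbs": 2910},
]


def count(rows, field=None):
    """Hypothetical helper: count data objects; `field` is accepted but ignored."""
    return len(rows)


def valid(rows, field):
    """Hypothetical helper, for contrast: count objects whose `field` is not None."""
    return sum(1 for row in rows if row.get(field) is not None)


usa = [row for row in rows if row["Origin"] == "USA"]
print(count(usa), count(usa, "Horsepower"), valid(usa, "Horsepower"))  # 2 2 1
```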
========= =========================================================================== =====================================
Aggregate Description                                                                 Example
========= =========================================================================== =====================================
The vega-lite docs appear to list these in a more logical (if implicit) order, starting with count-related functions (including `count`, `valid`, `values`, `missing`, and `distinct`), moving to basic mathematical operations (`sum`, `product`), then to central tendency measures (`mean`/`average`, `variance`/`variancep`, `stdev`/`stdevp`, `stderr`, `median`), followed by distribution statistics (`q1`, `q3`, `ci0`, `ci1`), and finally ending with range functions (`min`/`argmin`, `max`/`argmax`). The ordering here appears to be alphabetical, though it's not strictly so (e.g. `ci01`). I would have a slight preference for the vega-lite-style functional organization scheme (and with explicit headings for the categories).
@@ -8,7 +8,7 @@ There are two ways to aggregate data within Altair: within the encoding itself,
 or using a top level aggregate transform.

 The aggregate property of a field definition can be used to compute aggregate
-summary statistics (e.g., median, min, max) over groups of data.
+summary statistics (e.g., :code:`median`, :code:`min`, :code:`max`) over groups of data.

 If at least one fields in the specified encoding channels contain aggregate,
Re: the sentence beginning, "If at least one fields..." --> I think this sentence could be rewritten while we're at it
Notes:
Sorry for messing up the other pull request (#2654), but I finally fixed my commits and branches.