-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: make pandas and NumPy optional dependencies, don't require PyArrow for plotting with Polars/Modin/cuDF #3452
Conversation
@MarcoGorelli if all the tests are passing locally for you, try running these two: hatch run generate-schema-wrapper
hatch run update-init-file If there are any changes other than whitespace, you'll get a more informative error than https://github.com/vega/altair/actions/runs/9733301918/job/26860053484?pr=3452 Otherwise those two should clean everything up. |
https://github.com/vega/altair/actions/runs/9734437115/job/26862450764?pr=3452 @MarcoGorelli try this to run the tests on the matrix locally: hatch test --all The help for hatch test -h |
@MarcoGorelli these were two minor things. I couldn't figure out a way to request these another way. fix: add missing
|
Thank you very much for making this Pull Request. That is very much appreciated! This PR introduces a new dependency on narwhals. Narwhals is described as a lightweight and extensible compatibility layer between multiple dataframe libraries. Next to this it enables support for native dataframes (pandas still included) without converting these to a pyarrow Tables through the dataframe interchange protocol. The latest release of Altair is ~850 kB and after this PR, Altair will have the following hard dependencies:
When using Altair with only remote endpoints (like APIs that are returning JSON-data), I've the feeling that we do not need to rely on numpy. I've a few questions regarding this PR and narwhals.
if dtype == nw.Date:
columns.append(nw.col(name).dt.to_string("%Y-%m-%d")) And change a test from: "a": {"date": "2019-01-01T00:00:00"}, To: "a": {"date": "2019-01-01"}, Upon first sight, I think we cannot introduce this breaking change. Does this change still adhere to full ISO-8601 dates representation as is required by Javascript? See also past issues like this: #1027.
data = nw.from_native(data, eager_only=True, strict=False)
if isinstance(data, nw.DataFrame):
return data
if isinstance(data, DataFrameLike):
from altair.utils.data import arrow_table_from_dfi_dataframe
pa_table = arrow_table_from_dfi_dataframe(data)
return nw.from_native(pa_table, eager_only=True) I'm a bit surprised that the
DataType: TypeAlias = Union[
Dict[Any, Any], "IntoDataFrame", SupportsGeoInterface, DataFrameLike
] I remember I have discussed this with @dangotbanned recently in #3431, see #3431 (comment) onwards, but I can't find a clear answer anymore on this topic. Thanks again for this PR! Really promising! (also this PR will supersede #3384) |
Thanks for the ping @mattijn The good news is you can trust the All of the imports within the AFAIK, this should be fine to move to a regular import though
|
The user will see this message ``` altair\utils\data.py:17:5: TID251 `typing.Optional` is banned: Use `Union[T, None]` instead. `typing.Optional` is likely to be confused with `altair.Optional`, which have similar but different semantic meaning. See vega#3449 Found 1 error. ``` Partial fix for vega#3452 (comment)
Thanks so much for your quick and careful reviews @mattijn and @dangotbanned 🙏 ! Much appreciated
"in for a penny, in for a pound" - I've added a commit to do this. NumPy is used very little, mostly isinstance checks and I've caught up with Polars devs, and they're on board with using Altair in |
Thanks @MarcoGorelli When I originally suggested
I think most people would agree on the value of this, but probably best for another issue. I would expect there would be more support if the implementation was different to the existing hvPlot (polars) / hvPlot accessor. My suggestion would be for |
Thanks @dangotbanned !
Definitely -
Anyway, we can figure out the details later, whether or not Altair decides to go ahead with Narwhals (for the record, I think hvplot is amazing and its developers are really friendly, so I'd still advocating for it to be advertised in the Polars docs - but if Altair could provide a more lightweight alternative which doesn't require other heavy libraries, I think it'd be a more suitable default) |
Really awesome to see you were able to remove numpy as a hard dependency too! Its great to see this happening! A few more comments while you might still have this PR open. In my opinion, there is no need to serialize microseconds for type A single comment like Line 477 in 7312f25
This line can be removed: Line 451 in 7312f25
Thanks again for pushing this! |
A bit off topic, but regarding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me this is a +1, it would be great if there is another maintainer doing a review.
This is exciting! :) I'd also like to have a quick look. Hopefully get to it this weekend. @jonmmease if you have a moment you might want to have a look as well as you've worked on the interchange protocol implementation and hence might be aware of edge cases we need to consider. |
Thanks @jonmmease for your careful review, really appreciate it! 🙏
Aah, that's amazing! I'd missed that. Like this,
(and maybe polars.LazyFrame should also implement the interchange protocol just at the metadata level, but that's one for a separate discussion) |
Thanks for the changes @MarcoGorelli! I still have one question which I cannot really understand. Given the following example: import datetime
import polars as pl
import pyarrow as pa
import narwhals as nw
data_type_date = [{"date": datetime.date(2024,7,15)}]
local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S"
nw_selector = [nw.col('date').cast(nw.Datetime).dt.to_string(local_iso_fmt_string)]
pl_df = pl.DataFrame(data_type_date)
nw_pl_df = nw.from_native(pl_df, eager_only=True)
nw_pl_df.select(nw_selector).rows(named=True) This returns
But if I do: py_tb = pa.Table.from_pylist(data_type_date)
nw_py_tb = nw.from_native(py_tb, eager_only=True)
nw_py_tb.select(nw_selector).rows(named=True) It returns
I expected these output to be equal. |
thanks @mattijn for taking a look! Libraries differ in how they interpret strftime directives - there's an issue about this in Arrow apache/arrow#20146 Practically speaking, if
does that cause any issues? Note that, if the datetime contains fractional seconds, then the outputs match In [2]: import datetime
...: import polars as pl
...: import pyarrow as pa
...: import narwhals as nw
...:
...: data_type_date = [{"date": datetime.datetime(2024,7,15,1,2,3,123456)}]
...: local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S%.f"
...: nw_selector = [nw.col('date').cast(nw.Datetime).dt.to_string(local_iso_fmt_string)]
...:
...: pl_df = pl.DataFrame(data_type_date)
...: nw_pl_df = nw.from_native(pl_df, eager_only=True)
...: nw_pl_df.select(nw_selector).rows(named=True)
Out[2]: [{'date': '2024-07-15T01:02:03.123456'}]
In [3]: py_tb = pa.Table.from_pylist(data_type_date)
...: nw_py_tb = nw.from_native(py_tb, eager_only=True)
...: nw_py_tb.select(nw_selector).rows(named=True)
Out[3]: [{'date': '2024-07-15T01:02:03.123456'}] |
Yes for type One change with your latest comment compared to my comment above is that for me the I thought it was an issue with pyarrow, but if I do: import pyarrow.compute as pc
pc.strftime(py_tb['date'], format=local_iso_fmt_string).to_pylist() It returns
That seems right. And in the narwhals docs you link to (thanks for the link!) it is mentioned that:
That makes me even more confused with your statement:
Thanks for answering so promptly! Appreciated. |
The difference is the In [1]: import pyarrow as pa
In [2]: from datetime import date, datetime
In [3]: import pyarrow.compute as pc
In [4]: pc.strftime(pa.array([datetime(2020, 1, 1)]), '%Y-%m-%dT%H%:%M%:%S')
Out[4]:
<pyarrow.lib.StringArray object at 0x7fd513149cc0>
[
"2020-01-01T00%:00%:00.000000"
] So, in summary:
|
Interesting... Let me see if I can find something on this at the pyarrow library. Thanks! |
We have to cast the Lines 451 to 456 in 8b4b3db
|
Sure - if you're OK with a teeny-tiny bit of Polars-specific logic, this can work 👍 if dtype == nw.Date and nw.get_native_namespace(data) is get_polars():
# Polars doesn't allow formatting `Date` with time directives.
# The date -> datetime cast is extremely fast compared with `to_string`
columns.append(
nw.col(name).cast(nw.Datetime).dt.to_string(local_iso_fmt_string)
)
elif dtype == nw.Date:
columns.append(nw.col(name).dt.to_string(local_iso_fmt_string))
elif dtype == nw.Datetime:
columns.append(nw.col(name).dt.to_string(f"{local_iso_fmt_string}%.f")) Just pushed a commit to do that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address this here, we've made a new release of Narwhals which includes interchange-level support - that is to say, if the input object implements the interchange protocol, then we let you inspect the schema
Amazing! This looks great, and having the narwhals metadata interface to all DataFrame interchange protocol objects is really nice.
@mattijn I don't the difference in the inclusion of fractional seconds will make any different to Vega in the browser, so I wouldn't worry about it. That said, @MarcoGorelli 's small polars workaround looks fine as well, so I'm happy to keep that since it maintains identical results.
Thanks again for this great PR @MarcoGorelli, and for all of the work you put into narwhals to make it possible (even during the review process!). I'm ready for this to be merged, but want to pass it back over to you one last time in case any of my review comments shake something loose for you.
Please add a comment with a ✅ when you're ready for it to merge.
and column.describe_categorical["is_ordered"] | ||
and column.describe_categorical["categories"] is not None | ||
nw.is_ordered_categorical(column) | ||
and not (categories := column.cat.get_categories()).is_empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nice. I had missed that dropping Python 3.7 means we can use the walrus operator!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄 there's a few others if you're brave enough to try out auto-walrus
altair/utils/core.py
Outdated
if isinstance(data, DataFrameLike): | ||
from altair.utils.data import arrow_table_from_dfi_dataframe | ||
|
||
pa_table = arrow_table_from_dfi_dataframe(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a priority now (and I can't think of any specific scenarios that would benefit), was just curious.
For posterity, this ibis team is in the process of removing the hard pyarrow dependency for some use cases (ibis-project/ibis#9552), but it will still be required for many (all?) specific backends I think.
altair/utils/core.py
Outdated
# Early return if already a Narwhals DataFrame | ||
return data | ||
# Using `strict=False` will return `data` as-is if the object cannot be converted. | ||
data = nw.from_native(data, eager_only=True, strict=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so they get wrapped by narwhals fine (just as normal pandas DataFrames), but then we use from_native
whenever we need to be able to tell the difference between a GeoDataFrame and a regular DataFrame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the file name itself. It's currently test_dataframe_interchange.py
, and I was suggesting renaming to test_to_values_narwhals.py
.
thanks @jonmmease ! review comments look great to me, much appreciated 🙏 ✅ |
Thanks to everyone involved! This is very exciting and we should cut a release soon :) I'll open a new discussion for it. |
* refactor(typing): reduce type ignores in `api.py` * feat(typing): adds `_is_undefined` guard Direct copy from another PR, but is helpful here https://github.com/dangotbanned/altair/blob/d607c70824860ef3478d7d3996d26dc516d69369/altair/vegalite/v5/api.py#L504 * fix(typing): `ignore[index]` from #3455 * fix(typing): `[ignore[arg-type]` from #3452 * refactor(typing): use `_is_undefined` in some cases * refactor: reduce code duplication in `Chart.to_dict` * fix(typing): Recover `TopLevelMixin.resolve_` return types I removed these in 6adf564 due to how complex they were getting, but both `mypy` and `pyright` seem happy with this solution * fix(typing): narrow annotated types for `_check_if_valid_subspec` Following investigation during #3480 (comment) * refactor: remove unused `_wrapper_classes` argument from `SchemaBase` Fixes: - #3480 (comment) - #3480 (comment) - #3480 (comment)
Suggested in #3445 (comment)
closes #3456
closes #3445
closes #3473
closes #3471(this wasn't an issue to begin with)Demo
Polars can work without PyArrow nor pandas installed (EDIT: and without NumPy either!):
PyArrow table is still supported
(notice the
.to_arrow
)Ibis is still supported, and now Date dtype works too!
(previously, it would have been necessary to manually cast 'date' to 'datetime' type before plotting)
Performance
There's also a perf improvement to staying Polars-native where possible - for example, if I run
in IPython, then I see:
main
branch: 0.033584889399935494MarcoGorelli:narwhalify
branch: 0.024283270300657023That's for a (4475, 3)-shaped dataframe - for dataframe with many columns (esp. with strings / categoricals) I'd expect the difference to grow
Addendum: I hadn't noticed this originally, but this PR also makes Altair's import times about twice as fast