-
Notifications
You must be signed in to change notification settings - Fork 794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wip-feat: pandas as soft dependency #3384
Changes from all commits
110d4cd
f40d951
ba4b778
21910b1
88a8870
c14d94a
d00cd10
bd70bf4
8b41305
62ab14d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -95,3 +95,27 @@ def pyarrow_available() -> bool: | |||||
return True | ||||||
except ImportError: | ||||||
return False | ||||||
|
||||||
|
||||||
def import_pandas() -> ModuleType: | ||||||
min_version = "0.25" | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you add a comment in |
||||||
try: | ||||||
version = importlib_version("pandas") | ||||||
if Version(version) < Version(min_version): | ||||||
raise RuntimeError( | ||||||
f"The pandas package must be version {min_version} or greater. " | ||||||
f"Found version {version}" | ||||||
) | ||||||
import pandas as pd | ||||||
|
||||||
return pd | ||||||
except ImportError as err: | ||||||
raise ImportError( | ||||||
f"Serialization of the DataFrame requires\n" | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
It can also be a |
||||||
f"version {min_version} or greater of the 'pandas' package. \n" | ||||||
f"This can be installed with pip using:\n" | ||||||
f' pip install "pandas>={min_version}"\n' | ||||||
"or conda:\n" | ||||||
f' conda install -c conda-forge "pandas>={min_version}"\n\n' | ||||||
f"ImportError: {err.args[0]}" | ||||||
) from err |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,9 +25,7 @@ | |
from types import ModuleType | ||
|
||
import jsonschema | ||
import pandas as pd | ||
import numpy as np | ||
from pandas.api.types import infer_dtype | ||
|
||
from altair.utils.schemapi import SchemaBase | ||
from altair.utils._dfi_types import Column, DtypeKind, DataFrame as DfiDataFrame | ||
|
@@ -40,6 +38,7 @@ | |
from typing import Literal, Protocol, TYPE_CHECKING, runtime_checkable | ||
|
||
if TYPE_CHECKING: | ||
import pandas as pd | ||
from pandas.core.interchange.dataframe_protocol import Column as PandasColumn | ||
|
||
V = TypeVar("V") | ||
|
@@ -53,6 +52,11 @@ def __dataframe__( | |
) -> DfiDataFrame: ... | ||
|
||
|
||
def _is_pandas_dataframe(obj: Any) -> bool: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could this function be a simple There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for start reviewing this PR @binste! I don't think I can do this without importing pandas first. I tried setting up a function on which I can do some duck typing def instance(obj):
return type(obj).__name__ But found out that both polars and pandas are using the instance type There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe I'm missing something but couldn't we call the pandas import function you created in here and if it raises an importerror, we know it's not a pandas dataframe anyway. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's pragmatic, I admit. But that would be an unnecessary import of pandas if it is available in the environment, but if the data object is something else. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here's the optional import logic I added to plotly.py a while back: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/_plotly_utils/optional_imports.py if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A trick I learned from scikit-learn is to check if
If pandas was never imported, then (this is also what we do in Narwhals, were pandas/polars/etc are never explicitly imported) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just see your response here @MarcoGorelli! I also made this observation recently, see the comment I just added #3384 (comment)... |
||
"""Check if the object is an instance of a pandas DataFrame.""" | ||
return all(attr in dir(obj) for attr in ["iloc", "columns", "index"]) | ||
|
||
|
||
TYPECODE_MAP = { | ||
"ordinal": "O", | ||
"nominal": "N", | ||
|
@@ -208,7 +212,10 @@ def infer_vegalite_type( | |
---------- | ||
data: object | ||
""" | ||
typ = infer_dtype(data, skipna=False) | ||
from altair.utils._importers import import_pandas | ||
|
||
pd = import_pandas() | ||
typ = pd.api.types.infer_dtype(data, skipna=False) | ||
|
||
if typ in [ | ||
"floating", | ||
|
@@ -299,7 +306,7 @@ def numpy_is_subtype(dtype: Any, subtype: Any) -> bool: | |
return False | ||
|
||
|
||
def sanitize_dataframe(df: pd.DataFrame) -> pd.DataFrame: # noqa: C901 | ||
def sanitize_dataframe(df: "pd.DataFrame") -> "pd.DataFrame": # noqa: C901 | ||
"""Sanitize a DataFrame to prepare it for serialization. | ||
|
||
* Make a copy | ||
|
@@ -316,6 +323,9 @@ def sanitize_dataframe(df: pd.DataFrame) -> pd.DataFrame: # noqa: C901 | |
* convert dedicated string column to objects and replace NaN with None | ||
* Raise a ValueError for TimeDelta dtypes | ||
""" | ||
from altair.utils._importers import import_pandas | ||
|
||
pd = import_pandas() | ||
df = df.copy() | ||
|
||
if isinstance(df.columns, pd.RangeIndex): | ||
|
@@ -448,7 +458,7 @@ def sanitize_arrow_table(pa_table): | |
|
||
def parse_shorthand( | ||
shorthand: Union[Dict[str, Any], str], | ||
data: Optional[Union[pd.DataFrame, DataFrameLike]] = None, | ||
data: Optional[Union[DataFrameLike, "pd.DataFrame"]] = None, | ||
parse_aggregates: bool = True, | ||
parse_window_ops: bool = False, | ||
parse_timeunits: bool = True, | ||
|
@@ -601,15 +611,15 @@ def parse_shorthand( | |
# Fall back to pandas-based inference. | ||
# Note: The AttributeError catch is a workaround for | ||
# https://github.com/pandas-dev/pandas/issues/55332 | ||
if isinstance(data, pd.DataFrame): | ||
if _is_pandas_dataframe(data): | ||
attrs["type"] = infer_vegalite_type(data[unescaped_field]) | ||
else: | ||
raise | ||
|
||
if isinstance(attrs["type"], tuple): | ||
attrs["sort"] = attrs["type"][1] | ||
attrs["type"] = attrs["type"][0] | ||
elif isinstance(data, pd.DataFrame): | ||
elif _is_pandas_dataframe(data): | ||
# Fallback if pyarrow is not installed or if pandas is older than 1.5 | ||
# | ||
# Remove escape sequences so that types can be inferred for columns with special characters | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -22,6 +22,7 @@ | |||||
overload, | ||||||
Literal, | ||||||
TypeVar, | ||||||
TYPE_CHECKING, | ||||||
) | ||||||
from itertools import zip_longest | ||||||
from importlib.metadata import version as importlib_version | ||||||
|
@@ -31,7 +32,6 @@ | |||||
import jsonschema.exceptions | ||||||
import jsonschema.validators | ||||||
import numpy as np | ||||||
import pandas as pd | ||||||
from packaging.version import Version | ||||||
|
||||||
# This leads to circular imports with the vegalite module. Currently, this works | ||||||
|
@@ -44,6 +44,15 @@ | |||||
else: | ||||||
from typing_extensions import Self | ||||||
|
||||||
if TYPE_CHECKING: | ||||||
pass | ||||||
|
||||||
Comment on lines
+47
to
+49
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Aware that it's just a wip PR, thought I'd just note it anyway :) |
||||||
|
||||||
class _PandasTimestamp: | ||||||
def isoformat(self): | ||||||
return "dummy_isoformat" # Return a dummy ISO format string | ||||||
Comment on lines
+51
to
+53
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this should inherit from a Protocol as a pd.Timestamp is not an instance of |
||||||
|
||||||
|
||||||
TSchemaBase = TypeVar("TSchemaBase", bound=Type["SchemaBase"]) | ||||||
|
||||||
ValidationErrorList = List[jsonschema.exceptions.ValidationError] | ||||||
|
@@ -477,7 +486,9 @@ def _todict(obj: Any, context: Optional[Dict[str, Any]]) -> Any: | |||||
return obj.to_dict() | ||||||
elif isinstance(obj, np.number): | ||||||
return float(obj) | ||||||
elif isinstance(obj, (pd.Timestamp, np.datetime64)): | ||||||
elif isinstance(obj, (_PandasTimestamp, np.datetime64)): | ||||||
import pandas as pd | ||||||
|
||||||
return pd.Timestamp(obj).isoformat() | ||||||
else: | ||||||
return obj | ||||||
|
@@ -936,7 +947,7 @@ def to_dict( | |||||
# parsed_shorthand is removed from context if it exists so that it is | ||||||
# not passed to child to_dict function calls | ||||||
parsed_shorthand = context.pop("parsed_shorthand", {}) | ||||||
# Prevent that pandas categorical data is automatically sorted | ||||||
# Prevent that categorical data is automatically sorted | ||||||
# when a non-ordinal data type is specifed manually | ||||||
# or if the encoding channel does not support sorting | ||||||
if "sort" in parsed_shorthand and ( | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,11 +4,11 @@ | |
|
||
import numpy as np | ||
import pandas as pd | ||
from pandas.api.types import infer_dtype | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's make the tests also run without pandas installed so that we can run the whole test suite once with pandas installed and once without. Prevents us from accidentally reintroducing a hard dependency again in the future |
||
import pytest | ||
|
||
import altair as alt | ||
from altair.utils.core import parse_shorthand, update_nested, infer_encoding_types | ||
from altair.utils.core import infer_dtype | ||
|
||
json_schema_specification = alt.load_schema()["$schema"] | ||
json_schema_dict_str = f'{{"$schema": "{json_schema_specification}"}}' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct that this line is only reached if it's an old Pandas version which does not support the dataframe interchange protocol? Else it would already stop at line 43, right?
If yes, could you add a comment about this?