Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add replacement class for dataframes #1234

Merged
merged 1 commit into from
May 31, 2024
Merged

Add replacement class for dataframes #1234

merged 1 commit into from
May 31, 2024

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented May 30, 2024

Description

This PR adds a replacement class for pandas.Dataframe that has the minimum functionality as required by MF. This replacement will allow removal of the pandas dependency later on.

@cla-bot cla-bot bot added the cla:yes label May 30, 2024
Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@plypaul plypaul marked this pull request as ready for review May 30, 2024 01:52
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits inline. If we don't know about the float -> string key thing let's just proceed and deal with it if it comes up. In terms of the overall approach, I'm assuming there isn't an easier way that works reasonable well across the Pandas -> Agate boundary.

I hope for a future where we can move to some kind of standard. Seems like things are converging on Arrow, but we aren't there yet.

column_name: str
column_type: Type[CellValue]

def lower_case_column_name(self) -> ColumnDescription: # noqa: D102
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: given that method name I'd expect this to be a string. Given how it's currently used I think this is better:

def with_lower_case_column_name(self) -> ColumnDescription:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

def __iter__(self) -> Iterator[ColumnDescription]: # noqa: D105
return iter(self.column_descriptions)

@cached_property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh neat!



@dataclass(frozen=True, eq=False)
class MetricFlowDataTable:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is way more than I was expecting. I was thinking we could define a protocol that would let us implement wrappers around Agate/Pandas interfaces, cut over, and remove Pandas that way, not that we'd have a full bore data container class, but if this is what it takes let's just proceed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this would be small going in, but the type-handling threw a wrench in there. We could still move to a protocol and this can help to come up with that definition since it has a limited interface.

def _cell_sort_key(cell: CellValue) -> str:
if isinstance(cell, datetime.datetime):
return cell.isoformat()
return str(cell)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this potentially non-deterministic for floats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be tricky - did you have thoughts on an approach?

Comment on lines +59 to +64
# It should be safe to remove this once `MetricFlowDataTable` is validated as it doesn't allow timezones.
elif (
isinstance(expected_value, datetime.datetime)
and isinstance(actual_value, datetime.datetime)
# If expected has no tz but actual is UTC, remove timezone. Some engines add UTC by default.
and (actual_value.tzinfo == "UTC" and expected_value.tzinfo is None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, timezone support is coming, this may get gnarly.....

Those are tomorrow problems, though, for now copying what we've got is good.


def test_column_values_iterator(example_table: MetricFlowDataTable) -> None: # noqa: D103
assert tuple(example_table.column_values_iterator(0)) == (0, 1)
assert tuple(example_table.column_values_iterator(0)) == (0, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this supposed to be:

Suggested change
assert tuple(example_table.column_values_iterator(0)) == (0, 1)
assert tuple(example_table.column_values_iterator(1)) == ('a', 'b')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, updated.

@plypaul
Copy link
Contributor Author

plypaul commented May 31, 2024

I hope for a future where we can move to some kind of standard. Seems like things are converging on Arrow, but we aren't there yet.

I took a look at pyarrow, but that had a numpy dependency and versioning requirements.

Base automatically changed from p--py312--02 to main May 31, 2024 01:32
@plypaul plypaul merged commit 3d26700 into main May 31, 2024
15 checks passed
@plypaul plypaul deleted the p--py312--03 branch May 31, 2024 01:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants