-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add replacement class for dataframes #1234
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits inline. If we don't know about the float -> string key thing let's just proceed and deal with it if it comes up. In terms of the overall approach, I'm assuming there isn't an easier way that works reasonable well across the Pandas -> Agate boundary.
I hope for a future where we can move to some kind of standard. Seems like things are converging on Arrow, but we aren't there yet.
metricflow/data_table/mf_column.py
Outdated
column_name: str | ||
column_type: Type[CellValue] | ||
|
||
def lower_case_column_name(self) -> ColumnDescription: # noqa: D102 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: given that method name I'd expect this to be a string. Given how it's currently used I think this is better:
def with_lower_case_column_name(self) -> ColumnDescription:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
def __iter__(self) -> Iterator[ColumnDescription]: # noqa: D105 | ||
return iter(self.column_descriptions) | ||
|
||
@cached_property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh neat!
|
||
|
||
@dataclass(frozen=True, eq=False) | ||
class MetricFlowDataTable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is way more than I was expecting. I was thinking we could define a protocol that would let us implement wrappers around Agate/Pandas interfaces, cut over, and remove Pandas that way, not that we'd have a full bore data container class, but if this is what it takes let's just proceed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this would be small going in, but the type-handling threw a wrench in there. We could still move to a protocol and this can help to come up with that definition since it has a limited interface.
def _cell_sort_key(cell: CellValue) -> str: | ||
if isinstance(cell, datetime.datetime): | ||
return cell.isoformat() | ||
return str(cell) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this potentially non-deterministic for floats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be tricky - did you have thoughts on an approach?
# It should be safe to remove this once `MetricFlowDataTable` is validated as it doesn't allow timezones. | ||
elif ( | ||
isinstance(expected_value, datetime.datetime) | ||
and isinstance(actual_value, datetime.datetime) | ||
# If expected has no tz but actual is UTC, remove timezone. Some engines add UTC by default. | ||
and (actual_value.tzinfo == "UTC" and expected_value.tzinfo is None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oof, timezone support is coming, this may get gnarly.....
Those are tomorrow problems, though, for now copying what we've got is good.
|
||
def test_column_values_iterator(example_table: MetricFlowDataTable) -> None: # noqa: D103 | ||
assert tuple(example_table.column_values_iterator(0)) == (0, 1) | ||
assert tuple(example_table.column_values_iterator(0)) == (0, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this supposed to be:
assert tuple(example_table.column_values_iterator(0)) == (0, 1) | |
assert tuple(example_table.column_values_iterator(1)) == ('a', 'b') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, updated.
I took a look at |
Description
This PR adds a replacement class for
pandas.Dataframe
that has the minimum functionality as required by MF. This replacement will allow removal of thepandas
dependency later on.