Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LinkedDataFrame: Handle overwriting of columns with linkages / linking on linkages more explicitly #5

Open
aclarry opened this issue Apr 22, 2022 · 3 comments

Comments

@aclarry
Copy link
Member

aclarry commented Apr 22, 2022

Overview

When linking LinkedDataFrames, it is possible to effectively overwrite columns without explicitly doing so. Combined with the fact that LDFs will naïvely try to to link with a linkage "column" specified as an on attribute, the resulting error can be non-obvious.

This caused a bug in the GGHM demand model, presumably as a change from previous LinkedDataFrame/pandas version, when two LDFs were being linked on each other.

Example

In cheval 0.2 with pandas 1.4:

df1 = LinkedDataFrame(pd.DataFrame({"df2": [1, 2, 3, 1, 2, 3]}))
df2 = LinkedDataFrame(pd.DataFrame({"col1": ["a", "b", "c"]}))

df1.link_to(df2, "df2", on_self="df2") # The original df1["df2"] column is inaccessible
df2.link_to(df1, "df1", on_other="df2") # AttributeError: to_numpy

Here, the original column df1["df2"] which provided the index to join on df2 is inaccessible from:

  • Item lookup df1["df2"]
  • Attribute access df1.df2
  • Future linkages df2.link_to(df1, "df1", on_other="df2")

The last item is the cause of the specific issue in the GGHM model - it produced an error because df2 was trying to use the linkage df1.df2 as an index.

Explanation

In normal pandas usage, it is impossible to "accidentally" mutate/overwrite a column, whereas in LinkedDataFrames, "columns" are created implicitly by link_to. LinkedDataFrames will handle linkages as columns everywhere, including in link_to calls, which results in an error which may be non-obvious in the source, and an non-specific error message raised from pandas.

Proposed Solutions

  • Issue a warning when a linkage is created which supersedes an existing column (or have an explicit overwrite kwarg in link_to)
  • Allow LDFs to be linked back on an linkage (df2.link_to(df1, "df1", on_other="df2")), since this is realistically the only place where this issue would come up. This could either refer back to the original linkage column, or use the linkage itself to provide the index for the linkage.
  • Check if the LDF is trying to link using a linkage as an "on" instead of a normal pd.Series, and raise an explicit Exception if it does so.
@PeterKucirek
Copy link
Collaborator

The issue is in usage: the column in inaccessible because the link has been named the same as an existing column, which is bad practice, in my opinion. Ideally link names should be:

  • Pythonic (meet variable naming conventions);
  • Not override Python reserved words like for, list, else etc; and
  • Should not be named the same as an existing column in either frames

However I acknowledge that this is not explicitly stated anywhere, nor does the design do enough to protect against it. I support checking the linkage name against the columns and issuing a custom warning if there is risk of name collision.

@bccheung
Copy link
Member

Agreed, the best course of action here to perform the check for existing column names and to raise an exception if collision occurs. The onus should be on the user to use unique link and column names.

@PeterKucirek
Copy link
Collaborator

A simple name in self.columns or name in other.columns check is all that is needed, since pandas.Index implements fast containment checking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants