When linking LinkedDataFrames, it is possible to effectively overwrite columns without explicitly doing so. Combined with the fact that LDFs will naïvely try to link with a linkage "column" specified as an on attribute, the resulting error can be non-obvious.
This caused a bug in the GGHM demand model, presumably due to a change from a previous LinkedDataFrame/pandas version, when two LDFs were being linked on each other.
Example
In cheval 0.2 with pandas 1.4:
import pandas as pd
from cheval import LinkedDataFrame

df1 = LinkedDataFrame(pd.DataFrame({"df2": [1, 2, 3, 1, 2, 3]}))
df2 = LinkedDataFrame(pd.DataFrame({"col1": ["a", "b", "c"]}))
df1.link_to(df2, "df2", on_self="df2")   # The original df1["df2"] column is now inaccessible
df2.link_to(df1, "df1", on_other="df2")  # AttributeError: to_numpy
Here, the original column df1["df2"], which provided the index to join on df2, is inaccessible from:
df1["df2"]
df1.df2
df2.link_to(df1, "df1", on_other="df2")
The last item is the cause of the specific issue in the GGHM model - it produced an error because df2 was trying to use the linkage df1.df2 as an index.
Explanation
In normal pandas usage, it is impossible to "accidentally" mutate/overwrite a column, whereas in LinkedDataFrames, "columns" are created implicitly by link_to. LinkedDataFrames will handle linkages as columns everywhere, including in link_to calls, which results in an error that may be non-obvious in the source, and a non-specific error message raised from pandas.
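The shadowing can be reproduced without cheval. The sketch below is a toy stand-in, not cheval's implementation: ToyLinkedFrame and its attributes are hypothetical names, and the only assumption is that a lookup checks linkages before columns.

```python
class ToyLinkedFrame:
    """Toy stand-in for a LinkedDataFrame (hypothetical, not cheval's code)."""

    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> list of values
        self.linkages = {}            # alias -> linked ToyLinkedFrame

    def link_to(self, other, alias):
        # No collision check: an alias may silently reuse a column name.
        self.linkages[alias] = other

    def __getitem__(self, name):
        # Assumption: linkages are looked up before columns.
        if name in self.linkages:
            return self.linkages[name]
        return self.columns[name]


df1 = ToyLinkedFrame({"df2": [1, 2, 3, 1, 2, 3]})
df2 = ToyLinkedFrame({"col1": ["a", "b", "c"]})

before = df1["df2"]      # the list of join keys
df1.link_to(df2, "df2")  # alias reuses the column name
after = df1["df2"]       # now the linked frame; the keys are shadowed
```

The keys are still stored in df1.columns, but nothing that goes through the normal lookup can reach them any more, which is the situation the second link_to call runs into.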
Proposed Solutions
Issue a warning when a linkage is created which supersedes an existing column (or have an explicit overwrite kwarg in link_to)
Allow LDFs to be linked back on a linkage (df2.link_to(df1, "df1", on_other="df2")), since this is realistically the only place where this issue would come up. This could either refer back to the original linkage column, or use the linkage itself to provide the index for the linkage.
Check if the LDF is trying to link using a linkage as an "on" instead of a normal pd.Series, and raise an explicit Exception if it does so.
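As a rough sketch of the first and third proposals, the two checks could look something like the following. Both helper names are hypothetical, the overwrite flag mirrors the kwarg suggested above rather than an existing link_to parameter, and the functions take plain iterables of names so they do not depend on cheval internals.

```python
import warnings


def warn_if_shadowing(columns, alias, overwrite=False):
    """Return True if `alias` collides with a column name.

    Emits a UserWarning on collision unless `overwrite` is True.
    """
    collides = alias in set(columns)
    if collides and not overwrite:
        warnings.warn(
            f"Linkage '{alias}' shadows an existing column of the same name",
            UserWarning,
            stacklevel=2,
        )
    return collides


def require_column_for_on(linkage_aliases, on):
    """Raise an explicit error if `on` names a linkage instead of a column."""
    if on in set(linkage_aliases):
        raise ValueError(
            f"'{on}' is a linkage, not a column, and cannot be used as a join key"
        )
```

With checks like these, the example above would warn at df1.link_to(df2, "df2", on_self="df2") and fail with a specific message at the second call, instead of surfacing AttributeError: to_numpy from pandas.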
The issue is in usage: the column is inaccessible because the link has been named the same as an existing column, which is bad practice, in my opinion. Ideally link names should:
Be Pythonic (meet variable naming conventions);
Not override Python reserved words like for, list, else etc; and
Not be named the same as an existing column in either frame
However I acknowledge that this is not explicitly stated anywhere, nor does the design do enough to protect against it. I support checking the linkage name against the columns and issuing a custom warning if there is risk of name collision.
Agreed, the best course of action here is to perform the check for existing column names and to raise an exception if a collision occurs. The onus should be on the user to use unique link and column names.
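A minimal sketch of what such a check might look like, covering the three naming rules listed earlier. validate_link_name is a hypothetical helper, not part of cheval. Note that list is a builtin rather than a keyword, so the sketch checks both.

```python
import builtins
import keyword


def validate_link_name(alias, self_columns, other_columns):
    """Raise ValueError if `alias` breaks any of the naming rules."""
    if not alias.isidentifier():
        raise ValueError(f"'{alias}' is not a valid Python identifier")
    if keyword.iskeyword(alias) or hasattr(builtins, alias):
        raise ValueError(f"'{alias}' is a Python keyword or builtin name")
    # Collisions are checked against the columns of both frames.
    clashes = set(self_columns) | set(other_columns)
    if alias in clashes:
        raise ValueError(f"'{alias}' collides with an existing column name")
```

Called at the top of link_to with the alias and the columns of both frames, this would reject df1.link_to(df2, "df2", on_self="df2") before any linkage is created.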