I had to recreate this issue because, for some reason, I couldn't reopen the original one.
I have tested the fix from the main branch, but it does not seem to work as expected. The model continues to generate novel values when the column type is string and the column contains numerical values.
I have added a zip with a notebook that demonstrates the case.
echatzikyriakidis changed the title from "Bug in model.sample() when column contains integer values while column type is string. #31" to "Bug in model.sample() when column contains integer values while column type is string." on Jul 16, 2023
Is there any news on this? The PR solution does not seem to work. The correct behavior is to not parse columns containing strings as ints/floats/datetimes, even when that is possible. If a column contains strings, it is a string column. We need this refactoring so that REaLTabFormer handles string/text columns as categorical and does not generate new values as a side effect of parsing them to int/float/datetime.
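To illustrate the point with plain pandas only (this is not REaLTabFormer's internal code, and the column name and values are made up): numeric-looking strings can be parsed, but parsing changes the values, while treating the column as categorical keeps the exact set of allowed values.

```python
import pandas as pd

# Hypothetical column: identifiers stored as strings that happen to look numeric.
codes = pd.Series(["007", "042", "100"], name="code")

# Parsing succeeds, but the original strings cannot be recovered afterwards.
parsed = pd.to_numeric(codes)
round_trip = parsed.astype(str)
print(round_trip.tolist())  # ['7', '42', '100']

# Treating the column as categorical preserves the exact set of allowed values.
allowed = set(codes.astype("category").cat.categories)


def has_novel_values(sampled: pd.Series) -> bool:
    """Return True if any non-null sampled value is not an original category."""
    return not set(sampled.dropna()).issubset(allowed)
```

A check like `has_novel_values(samples["code"])` (a hypothetical helper, not part of any library) is one way to confirm whether sampled output stays within the original categories.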
Maybe we could use the following functions in the library to identify if a pd.Series column is text, integer, float, etc. and only then behave accordingly.
When data is loaded from a database with pandas' SQL functions (instead of from CSV files), the values are sometimes NumPy ints/floats rather than Python's int/float. That is why the functions above also check np.integer and np.floating: np.integer matches both np.int32 and np.int64, and np.floating similarly matches types such as np.float16 and np.float32. The functions also check the first non-null value, because some columns may have missing values.
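The helper functions referred to above are not included in this copy of the thread. As a minimal sketch of the idea, assuming hypothetical function names (these are not the author's originals and not part of REaLTabFormer), the checks could look like this:

```python
import numpy as np
import pandas as pd


def first_non_null(series: pd.Series):
    """Return the first non-null value, or None if every value is null."""
    non_null = series.dropna()
    return non_null.iloc[0] if len(non_null) else None


def is_text_column(series: pd.Series) -> bool:
    # A column whose first non-null value is a str is treated as text,
    # even if that string happens to look like a number.
    return isinstance(first_non_null(series), str)


def is_integer_column(series: pd.Series) -> bool:
    value = first_non_null(series)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (bool, np.bool_)):
        return False
    # np.integer covers np.int32, np.int64, and the other NumPy integer types.
    return isinstance(value, (int, np.integer))


def is_float_column(series: pd.Series) -> bool:
    # np.floating covers np.float16, np.float32, and np.float64.
    return isinstance(first_non_null(series), (float, np.floating))
```

Inspecting only the first non-null value is cheap, but it assumes the column is homogeneous; a mixed-type object column would need a stricter check over all values.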
Is it possible to make this refactoring? Could you please help us with this?
Hi @avsolatorio,
What do you think?
Originally posted by @echatzikyriakidis in #31 (comment)