Bug in model.sample() when column contains integer values while column type is string. #36

echatzikyriakidis · 2023-07-16T10:44:08Z

I had to recreate this issue because, for some reason, I couldn't reopen the original one.

I have tested the fix from the main branch, but it does not seem to work as expected. The model still generates novel values when the column type is string but its values are numeric.

I have added a zip with a notebook that demonstrates the case.

What do you think?

Originally posted by @echatzikyriakidis in #31 (comment)

echatzikyriakidis · 2023-09-15T13:43:17Z

Hi @avsolatorio !

Is there any news on this? The PR solution does not seem to be working. The correct behavior is to not parse columns containing strings as ints/floats/datetimes, even when such parsing is possible. If a column contains strings, it is a string column. We need this refactoring so that REaLTabFormer handles string/text columns as categorical and does not generate new values, which happens today because the values get parsed to int/float/datetime.
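
To illustrate the concern, here is a minimal sketch. It is not REaLTabFormer's actual code, and the column name and values are made up. It shows how parse-based inference can misclassify a string column whose values merely look numeric, while a check on the stored Python type does not:

```python
import pandas as pd

# Hypothetical column: postal-code-like identifiers stored as strings.
codes = pd.Series(["10115", "20095", None, "80331"], name="code")

# Parse-based inference: every non-null value parses, so the column "looks" numeric.
parsed = pd.to_numeric(codes, errors="coerce")
looks_numeric = bool(parsed.notna().equals(codes.notna()))

# Type-based inference: the stored values are str, so the column is text.
stored_as_text = bool(codes.dropna().map(lambda v: isinstance(v, str)).all())

print(looks_numeric, stored_as_text)
```

Once such a column is treated as numeric, sampled values are no longer restricted to the categories seen in the training data.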

Maybe we could use the following functions in the library to identify if a pd.Series column is text, integer, float, etc. and only then behave accordingly.

import numpy as np

# Each helper inspects the first non-null value; .iloc[0] is positional,
# so it still works when the row labelled 0 was dropped as missing.
def is_first_non_na_value_text(series_values):
    return isinstance(series_values.dropna().iloc[0], str)

def is_first_non_na_value_integer(series_values):
    return isinstance(series_values.dropna().iloc[0], (int, np.integer))

def is_first_non_na_value_numerical(series_values):
    # np.floating replaces the np.float alias, which was removed in NumPy 1.24.
    return isinstance(series_values.dropna().iloc[0], (float, np.floating))

When data is loaded from a database through pandas' SQL functions rather than from a CSV file, the values are sometimes NumPy integers/floats instead of Python's int/float. That is why the functions above also check np.integer and np.floating. np.integer matches both np.int32 and np.int64, and np.floating likewise matches np.float16, np.float32, and np.float64. The functions check the first non-null value because some columns may contain missing values.
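
As a rough sketch of how these checks could be combined into a single dispatcher: the function name `infer_column_kind` and the returned labels are hypothetical and not part of the library. Note that it uses np.floating, since the np.float alias was removed in NumPy 1.24, and that it guards against columns with no non-null values:

```python
import numpy as np
import pandas as pd

def infer_column_kind(series: pd.Series) -> str:
    """Classify a column by the type of its first non-null value (hypothetical helper)."""
    non_null = series.dropna()
    if non_null.empty:
        return "empty"
    first = non_null.iloc[0]  # positional access, independent of index labels
    if isinstance(first, str):
        return "text"
    # bool is a subclass of int, so it must be checked before the integer branch.
    if isinstance(first, (bool, np.bool_)):
        return "boolean"
    if isinstance(first, (int, np.integer)):
        return "integer"
    if isinstance(first, (float, np.floating)):
        return "float"
    return "other"

kinds = {
    "str_digits": infer_column_kind(pd.Series([None, "1", "2"])),
    "int64": infer_column_kind(pd.Series([1, 2, 3])),
    "float32": infer_column_kind(pd.Series([1.5, None], dtype="float32")),
    "all_missing": infer_column_kind(pd.Series([None, None], dtype=object)),
}
print(kinds)
```

With this kind of check, a column of digit strings is reported as text and could be routed to categorical handling instead of numeric parsing.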

Is it possible to make this refactoring? Could you please help us with this?

Thanks!

echatzikyriakidis changed the title ~~Bug in model.sample() when column contains integer values while column type is string. #31~~ Bug in model.sample() when column contains integer values while column type is string. Jul 16, 2023

efstathios-chatzikyriakidis mentioned this issue Feb 23, 2024

Python datetime.date data type is handled as str and datatype handling in general #64

Open
