Unable to roundtrip spss file, with no apparent cause #149

Closed
eirki opened this issue Nov 1, 2021 · 3 comments
Labels
bug Something isn't working


eirki commented Nov 1, 2021

I am struggling with an absolutely puzzling bug trying to save and load a dataframe to SPSS.
Could be related to #128 and WizardMac/ReadStat#241.

Steps to reproduce:

from numpy import nan
import pandas as pd
import pyreadstat

df = pd.DataFrame(
    {
        "ct": ["D", "D", nan, nan, "T"],
        "br": [2.0, 2.0, 1.0, 1.0, 2.0],
        "cn": ["D", "D", nan, nan, "T"],
    }
)

missing_ranges = {
    "ct": [
        {"lo": "88", "hi": "88"},
    ],
    "cn": [
        {"lo": "66", "hi": "66"},
    ],
}

labels = {"03": "aa", "04": "aa", "05": "aa"}

variable_value_labels = {
    "ct": labels,
    "cn": labels,
}


path = "~/myfile.sav"
pyreadstat.write_sav(
    df=df,
    dst_path=path,
    missing_ranges=missing_ranges,
    variable_value_labels=variable_value_labels,
)
pyreadstat.read_sav(path)

The traceback is:

---------------------------------------------------------------------------
ReadstatError                             Traceback (most recent call last)
<ipython-input-3-a4bee8e4a788> in <module>
     44     variable_value_labels=variable_value_labels,
     45 )
---> 46 pyreadstat.read_sav(path)

pyreadstat/pyreadstat.pyx in pyreadstat.pyreadstat.read_sav()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.run_conversion()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.run_readstat_parser()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.check_exit_status()

ReadstatError: Unable to convert string to the requested encoding (invalid byte sequence)
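For readers unfamiliar with this class of error, here is a small plain-Python illustration of what an "invalid byte sequence" is. It is only an analogy: it does not use pyreadstat or ReadStat, the sample bytes are made up, and nothing in this thread establishes which bytes in the file actually trigger the error.

```python
# Illustration only: plain Python, not the pyreadstat/ReadStat code path.
# The sample bytes are hypothetical and do not come from the .sav file above.
raw = b"caf\xe9"  # 0xE9 is "é" in Latin-1, but not a complete UTF-8 sequence

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("UTF-8 decoding failed:", err.reason)

# Latin-1 maps every byte value 0-255 to a character, so decoding never raises.
as_latin1 = raw.decode("latin-1")
print(as_latin1)
```

Because a single-byte encoding such as Latin-1 accepts any byte, a successful read with it does not by itself show that the stored strings are really Latin-1.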

The puzzling part is that if I remove any part of the above script, it runs just fine. If I change the structure of the DataFrame or metadata, or change some of the strings, it will run. It will also run if I read it with encoding="LATIN1". I am at a loss about what could cause this error, and would be grateful for any help.
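The encoding="LATIN1" workaround mentioned above could be wrapped roughly as follows. This is a minimal sketch, not part of pyreadstat: `read_with_fallback` is a hypothetical helper name, and it takes the reader function as an argument so it works with any callable that accepts an `encoding` keyword, as `pyreadstat.read_sav` does.

```python
def read_with_fallback(reader, path, encodings=(None, "LATIN1")):
    """Call reader(path, encoding=...) for each encoding until one succeeds.

    Hypothetical helper, not part of pyreadstat. Returns a tuple of
    (whatever the reader returned, the encoding that worked).
    """
    errors = []
    for encoding in encodings:
        try:
            result = reader(path, encoding=encoding)
        except Exception as err:  # pyreadstat raises ReadstatError in this thread
            errors.append((encoding, err))
            continue
        return result, encoding
    raise RuntimeError(f"all encodings failed: {errors}")


# Possible usage with the file from the report (requires pyreadstat):
# import pyreadstat
# (df, meta), used = read_with_fallback(pyreadstat.read_sav, "~/myfile.sav")
# print("read succeeded with encoding:", used)
```

Passing `None` first keeps the default behaviour and only falls back when the read fails. If the fallback is used, the string columns are worth inspecting, since Latin-1 will decode any bytes without complaint.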

Setup Information:
pandas==1.2.4
pyreadstat==1.1.3

Platform:
Ubuntu 18.04

Collaborator

ofajardo commented Nov 9, 2021

I confirm I can reproduce the issue (thanks for submitting a nice and reproducible report!). I don't know yet what is going on; however, if I upgrade pandas to version 1.3.4 the error disappears. I will investigate to see if the issue is on the pyreadstat side or the pandas side, or if there is any workaround to get older pandas versions to work. If you get any other hint about what the difference could be, please share.

ofajardo added the bug label Nov 9, 2021

Author

eirki commented Nov 12, 2021

Thanks for looking into it! I'll make sure to upgrade pandas, and let you know if I come any closer to figuring out what's causing this.

@ofajardo
Collaborator

Even more interesting: the bug is indeed there on Ubuntu 18.04, but on Ubuntu 20.04 (Linux Mint, to be precise) there is no bug, even with the old pandas and pyreadstat versions.

So this is going to be difficult to debug, but as it can be cured very easily by upgrading pandas, I am going to close it. I hope that is OK.
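Since the behaviour depends on both the pandas version and the operating system, it may help to record the environment on each machine before comparing results. The following is a sketch using only the standard library; `environment_report` is a hypothetical helper name and the default package list is an assumption based on the imports in the report.

```python
import platform
from importlib import metadata


def environment_report(packages=("pandas", "pyreadstat", "numpy")):
    """Return Python, platform and installed package versions as a dict.

    Hypothetical helper for comparing machines. Packages that are not
    installed are reported as None.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

This only covers Python-level packages. It does not capture system libraries, which may also differ between the two distributions.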
