Long string handling #118

Closed
mnizol opened this issue Mar 13, 2021 · 3 comments


@mnizol

mnizol commented Mar 13, 2021

I have a task to convert a data set containing several variables with very long strings (~64K characters each) to the SPSS SAV format. Given the SPSS variable-length limit, I started by splitting those variables into smaller segments of 32K characters each. Then, I wrote the data set to a SAV file using pyreadstat. However, I found that when reading the SAV file generated by pyreadstat back into SPSS, there were a large number of new variables in the data set (i.e., variables that I did not explicitly create), each starting with "v" and ending with an alphanumeric suffix. These variables appear to correspond to 255-byte segments of the original long string.

If I read the SAV file back into Pandas using pyreadstat, those additional variable segments are not present; working only in Python, everything behaves as expected. It is only when I try to read the generated SAV file in SPSS itself that I encounter the issue. Note, we are using SPSS v25.

Looking through the issue history of both pyreadstat and readstat, I came across the following two issues, which appear to be related to my issue:

Based on some hints from WizardMac/ReadStat#122 (comment), two changes seem to be required to work around this issue:

  1. Rename each long-string column so that its name contains 5 characters or fewer. This is apparently to leave room for the suffixes that readstat adds behind the scenes when it generates variable segments.
  2. Split the variables into smaller segments of <= 9180 characters each.

After making these two changes, everything works as expected (including in SPSS).
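The two changes can be sketched in plain Python as follows. This is only an illustration of the work-around described above, not part of pyreadstat: the helper name `split_long_column` and the `prefix` + index naming scheme are hypothetical, and the limits are simply the values reported in this issue. It also assumes one character per byte; for non-ASCII text the byte length may be larger than the character count.

```python
# Limits taken from the work-around reported in this issue (not from any official spec).
MAX_SEGMENT_CHARS = 9180
MAX_NAME_CHARS = 5


def split_long_column(prefix, values, max_chars=MAX_SEGMENT_CHARS):
    """Split each string in `values` into consecutive segments of at most
    `max_chars` characters.

    Returns a dict mapping short segment names (prefix + 1-based index)
    to lists of segments, one entry per input row. Hypothetical helper.
    """
    # Treat missing values as empty strings so every row has a segment.
    values = ["" if v is None else str(v) for v in values]
    longest = max((len(v) for v in values), default=0)
    n_segments = max(1, -(-longest // max_chars))  # ceiling division

    columns = {}
    for i in range(n_segments):
        name = f"{prefix}{i + 1}"
        # The full variable name, including the index, must stay short.
        if len(name) > MAX_NAME_CHARS:
            raise ValueError(
                f"{name!r} is longer than {MAX_NAME_CHARS} characters"
            )
        start = i * max_chars
        columns[name] = [v[start:start + max_chars] for v in values]
    return columns
```

The returned dict can be turned into a DataFrame with `pandas.DataFrame(columns)` and written with `pyreadstat.write_sav(df, path)`; joining the segments of a row in order restores the original string.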

So: I have a viable work-around for this issue, but the behavior is quite unexpected and the work-around is not at all obvious. At the very least, generating some kind of warning in the Python code if pyreadstat detects a scenario that could lead to this issue may be helpful to others.
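As a rough idea of what such a warning could look like, here is a minimal sketch of a pre-write check. It is not an existing pyreadstat feature: the function name `check_sav_long_strings` is hypothetical, the 9180 and 5 limits are the ones from the work-around above, and the 255-character threshold for treating a column as a "long string" is an assumption based on the 255-byte segments observed in SPSS.

```python
import warnings


def check_sav_long_strings(columns, max_chars=9180, max_name_chars=5,
                           long_string_threshold=255):
    """Warn about columns that may trigger the extra-variable problem.

    `columns` maps column names to lists of values. Returns the list of
    flagged column names. All thresholds are assumptions from this issue.
    """
    flagged = []
    for name, values in columns.items():
        longest = max((len(v) for v in values if isinstance(v, str)), default=0)
        too_long = longest > max_chars
        # Assumption: only columns above the threshold get segmented,
        # so only those need a short name.
        bad_name = longest > long_string_threshold and len(name) > max_name_chars
        if too_long or bad_name:
            flagged.append(name)
            warnings.warn(
                f"Column {name!r} (longest value: {longest} characters) may "
                "produce extra variables when the SAV file is opened in SPSS."
            )
    return flagged
```

Calling it on the column data before writing would surface the problem in Python instead of only later in SPSS.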

@ofajardo
Collaborator

I think the best approach would be to file an issue in ReadStat so that the problem can be fixed there. Would you like to do that? You will need a sample file.

@ofajardo
Collaborator

ofajardo commented Apr 2, 2021

Reported to Readstat for them to take a look.

@ofajardo
Collaborator

ofajardo commented May 5, 2021

Closing this as this issue is going to be tracked in #119
