Long string handling #118
Comments
I think the best approach would be to file an issue in ReadStat so that the problem can be corrected there. Would you like to do that? You will need a sample file.
Reported to ReadStat for them to take a look.
Closing this, as the issue is going to be tracked in #119.
I have a task to convert a data set containing several variables with very long strings (~64K characters each) to the SPSS SAV format. Given the SPSS variable-length limit, I started by splitting those variables into smaller segments of 32K characters each. Then, I wrote the data set to a SAV file using pyreadstat. However, I found that when reading the SAV file generated by pyreadstat back into SPSS, there were a large number of new variables in the data set (i.e., variables that I did not explicitly create), each starting with "v" and ending with an alphanumeric suffix. These variables appear to correspond to 255-byte segments of the original long string.
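For anyone trying to reproduce the splitting step described above, here is a minimal sketch in plain Python. The helper names (`split_long_string`, `split_column`), the `_1`, `_2` suffix scheme, and the default limit of 32767 are my own assumptions for illustration; they are not part of pyreadstat. Note that the sketch counts characters, while SPSS limits are expressed in bytes, so text with multi-byte characters may need a smaller `max_len`.

```python
def split_long_string(value, max_len=32767):
    """Split one string into consecutive chunks of at most max_len characters.

    max_len=32767 is an assumed default; adjust it to the limit you need.
    """
    if value is None:
        return []
    return [value[i:i + max_len] for i in range(0, len(value), max_len)]


def split_column(name, values, max_len=32767):
    """Split every value of one column and return {segment_name: [chunk per row]}.

    Rows with fewer chunks than the longest row are padded with empty strings
    so that every segment column has one entry per row.
    """
    chunks_per_row = [split_long_string(v, max_len) for v in values]
    n_segments = max((len(chunks) for chunks in chunks_per_row), default=0)
    return {
        f"{name}_{k + 1}": [
            chunks[k] if k < len(chunks) else "" for chunks in chunks_per_row
        ]
        for k in range(n_segments)
    }


# Hypothetical usage: one long value, one short value, one missing value.
segments = split_column("text", ["a" * 70000, "b" * 10, None])
```

The resulting dictionary can be passed to `pandas.DataFrame(...)` and then written with `pyreadstat.write_sav(df, path)`, which is the step where the behavior described in this issue shows up.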
If I read the SAV file back into Pandas using pyreadstat, those additional variable segments are not present; working only in Python, everything behaves as expected. It is only when I try to read the generated SAV file in SPSS itself that I encounter the issue. Note, we are using SPSS v25.
Looking through the issue history of both pyreadstat and readstat, I came across the following two issues, which appear to be related to my issue:
Based on some hints from WizardMac/ReadStat#122 (comment), there seem to be two changes required to work around this issue:
After making these two changes, everything works as expected (including in SPSS).
So: I have a viable work-around for this issue, but the behavior is quite unexpected and the work-around is not at all obvious. At the very least, generating some kind of warning in the Python code if pyreadstat detects a scenario that could lead to this issue may be helpful to others.
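To make the suggestion concrete, here is a rough sketch of the kind of check I have in mind. It is not existing pyreadstat code: the function name `warn_on_long_strings` is hypothetical, and the 255-byte threshold is an assumption based only on the segment size I observed in SPSS.

```python
import warnings


def warn_on_long_strings(columns, threshold=255, encoding="utf-8"):
    """Warn about columns whose longest string exceeds `threshold` bytes.

    `columns` maps a column name to an iterable of values; non-string values
    are ignored. Returns the list of flagged column names.
    The threshold of 255 bytes is an assumption, not a documented trigger.
    """
    flagged = []
    for name, values in columns.items():
        longest = max(
            (len(v.encode(encoding)) for v in values if isinstance(v, str)),
            default=0,
        )
        if longest > threshold:
            flagged.append(name)
            warnings.warn(
                f"Column {name!r} contains strings of up to {longest} bytes "
                f"(more than {threshold}); the written SAV file may show "
                "extra segment variables when opened in SPSS.",
                UserWarning,
                stacklevel=2,
            )
    return flagged


# Hypothetical usage before writing a file.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_on_long_strings(
        {"short": ["abc", None], "long": ["x" * 300, "y"]}
    )
```

With a pandas DataFrame, the same check could be fed with `{col: df[col] for col in df.columns}` just before the call to `pyreadstat.write_sav`.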