Long string handling #118

mnizol · 2021-03-13T20:48:48Z

I have a task to convert a data set containing several variables with very long strings (~64K characters each) to the SPSS SAV format. Given the SPSS variable-length limit, I started by splitting those variables into smaller segments of 32K characters each. Then, I wrote the data set to a SAV file using pyreadstat. However, I found that when reading the SAV file generated by pyreadstat back into SPSS, there were a large number of new variables in the data set (i.e., variables that I did not explicitly create), each starting with "v" and ending with an alphanumeric suffix. These variables appear to correspond to 255-byte segments of the original long string.

If I read the SAV file back into Pandas using pyreadstat, those additional variable segments are not present; working only in Python, everything behaves as expected. It is only when I try to read the generated SAV file in SPSS itself that I encounter the issue. Note, we are using SPSS v25.

Looking through the issue history of both pyreadstat and readstat, I came across the following two issues, which appear to be related to my issue:

Based on some hints from WizardMac/ReadStat#122 (comment), there seem to be two changes required to work around this issue:

1. Modify the variable name of long-string columns such that the name contains 5 characters or less. This is apparently to account for suffixes that readstat adds behind the scenes when it generates variable segments.
2. Split the variables into smaller segments of <= 9180 characters each.

After making these two changes, everything works as expected (including in SPSS).
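
For anyone hitting the same problem, the two changes above can be sketched roughly as follows. This is a minimal illustration, not code from pyreadstat: `split_text`, `segment_columns`, and the `txt` prefix are hypothetical names, and the limits (5-character names, 9180-character segments) are simply the values that worked in this report. The sketch counts characters as described above; if the underlying limit is really in bytes, multi-byte text may need a smaller segment size.

```python
# Limits taken from the work-around described in this issue (assumptions,
# not documented pyreadstat/ReadStat constants).
MAX_SEGMENT_CHARS = 9180
MAX_NAME_CHARS = 5


def split_text(value, size=MAX_SEGMENT_CHARS):
    """Split a string into consecutive chunks of at most `size` characters."""
    if not value:
        return [""]
    return [value[i:i + size] for i in range(0, len(value), size)]


def segment_columns(values, prefix, size=MAX_SEGMENT_CHARS):
    """Turn one column of long strings into several short-named columns.

    Returns a dict mapping each segment name (prefix + index) to a list
    holding that segment for every row; rows that run out of text get "".
    """
    chunks = [split_text(v, size) for v in values]
    n_segments = max((len(c) for c in chunks), default=1)
    names = [f"{prefix}{i}" for i in range(n_segments)]
    too_long = [n for n in names if len(n) > MAX_NAME_CHARS]
    if too_long:
        raise ValueError(f"segment names exceed {MAX_NAME_CHARS} characters: {too_long}")
    return {
        name: [c[i] if i < len(c) else "" for c in chunks]
        for i, name in enumerate(names)
    }


# Example: one ~64K-character value and one short value in the same column.
columns = segment_columns(["a" * 64000, "short"], prefix="txt")
print(list(columns))
```

The resulting dict can be turned into a DataFrame with `pandas.DataFrame(columns)` and written with `pyreadstat.write_sav(df, "out.sav")`; concatenating the segment columns in order restores the original string.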

So: I have a viable work-around for this issue, but the behavior is quite unexpected and the work-around is not at all obvious. At the very least, generating some kind of warning in the Python code if pyreadstat detects a scenario that could lead to this issue may be helpful to others.
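
To make the suggestion concrete, here is a rough sketch of the kind of check I have in mind. It is not an existing pyreadstat feature: `check_long_strings` is a hypothetical helper, and the thresholds are assumptions taken from this report (255 from the segment size I observed, 5 and 9180 from the work-around), not confirmed ReadStat rules.

```python
import warnings

# Thresholds are assumptions based on the behavior described in this issue.
SEGMENT_SIZE = 255      # strings at or below this length are left alone
MAX_SAFE_CHARS = 9180   # longest string length that worked in SPSS
MAX_SAFE_NAME = 5       # longest variable name that worked for long strings


def check_long_strings(columns):
    """Warn about columns that may show up as extra variables in SPSS.

    `columns` maps a column name to a list of values. Returns the names
    of the columns that triggered a warning.
    """
    risky = []
    for name, values in columns.items():
        longest = max((len(v) for v in values if isinstance(v, str)), default=0)
        if longest <= SEGMENT_SIZE:
            continue  # short strings are not affected
        if longest > MAX_SAFE_CHARS or len(name) > MAX_SAFE_NAME:
            warnings.warn(
                f"column {name!r} holds strings up to {longest} characters; "
                "SPSS may display it as several unmerged variables",
                UserWarning,
                stacklevel=2,
            )
            risky.append(name)
    return risky
```

Calling this on the data before writing the SAV file would at least point users toward renaming or splitting the affected columns.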

ofajardo · 2021-03-13T21:25:08Z

I think the best option would be to file an issue in ReadStat so that it can be corrected there. Would you like to do that? You will need a sample file.

ofajardo · 2021-04-02T13:11:31Z

Reported to ReadStat for them to take a look.

ofajardo · 2021-05-05T08:40:14Z

Closing this, as the issue is going to be tracked in #119.

ofajardo mentioned this issue Apr 2, 2021

long string variable split when reading in SPSS WizardMac/ReadStat#236

Open

ofajardo mentioned this issue Apr 2, 2021

Reading and Writing of long String Variables from SPSS #119

Open

ofajardo closed this as completed May 5, 2021

sjkiss mentioned this issue Jul 5, 2022

Virtual variables in ReadStat-created SAV files with long strings are visible un-merged in SPSS WizardMac/ReadStat#122

Closed
