Sav files created by IBM Proprietary SPSS Modeler Software on IBM Cloud are not properly readable by Pyreadstat(python)/Haven(R) #319

Open
ananjay-gurjar-ibm opened this issue Oct 21, 2024 · 0 comments

@ananjay-gurjar-ibm

Consider the SAV files in sav-read-issue.zip, created by IBM's proprietary SPSS Modeler on IBM Cloud. SPSS Statistics reads the original data and the corresponding metadata written in the file as follows:
[Screenshots: data and metadata as read by SPSS Statistics]

The same file, when read with Python's pyreadstat and R's haven libraries, shows up as below:

[Screenshot: the file as read by pyreadstat and haven]

The metadata in the screenshot above shows that Python correctly identified the string column as A16 (i.e. an alphanumeric string of 16 bytes), yet it read only 8 bytes of data.
The internal type code in the metadata, which specifies the length of a string column, is correctly set in the given file (see screenshot):
[Screenshot: internal type code in the file metadata]

Because a SAV file is a continuous stream of bytes, reading too few bytes for the string column shifts every field that follows, which explains the garbage value in the double column (i.e. num2).
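A minimal sketch of this misalignment, assuming a simplified uncompressed row layout (a space-padded 16-byte string followed by one 8-byte little-endian double). The helpers `pack_row` and `read_row` are hypothetical and are not ReadStat code; they only show what happens when a reader consumes 8 bytes for a column that occupies 16.

```python
import struct

def pack_row(text: str, value: float, width: int = 16) -> bytes:
    """Build one row: a space-padded string of `width` bytes, then a double."""
    raw = text.encode("utf-8").ljust(width, b" ")
    return raw + struct.pack("<d", value)

def read_row(row: bytes, string_width: int):
    """Read a row, assuming the string column occupies `string_width` bytes."""
    text = row[:string_width].decode("utf-8", errors="replace").rstrip()
    # The double is taken from whatever bytes follow the assumed string width.
    (value,) = struct.unpack_from("<d", row, string_width)
    return text, value

row = pack_row("abcdefghijklmnop", 3.5)

good = read_row(row, 16)  # width taken from the A16 metadata
bad = read_row(row, 8)    # only 8 bytes consumed for the string

print("correct width:", good)
print("short width:  ", bad)
```

With the short width, the string is truncated to its first 8 bytes and the "double" is decoded from the second half of the string, so the numeric column no longer matches what was written.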

This problem also causes Python and R to fail with "Unable to convert string to the requested encoding (invalid byte sequence)" when the file contains multiple rows. I suspect this comes from the underlying library (ReadStat) trying, from the second row onward, to decode bytes that were written for the double column as a string (strings are UTF-8 encoded).
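A rough sketch of how that suspected failure could arise, again assuming a simplified uncompressed layout and a hypothetical `pack_row` helper (this is not ReadStat's actual decoding path, and the values are made up for illustration). If a reader treats each row as 8 + 8 bytes instead of 16 + 8, the string slot of its second row lands on the bytes of the first row's double.

```python
import struct

STRING_WIDTH = 16   # real width of the A16 column
ASSUMED_WIDTH = 8   # width the faulty reader is assumed to consume

def pack_row(text: str, value: float) -> bytes:
    """Build one row: a space-padded 16-byte string, then a little-endian double."""
    return text.encode("utf-8").ljust(STRING_WIDTH, b" ") + struct.pack("<d", value)

# Two rows written back to back, as in a continuous byte stream.
data = pack_row("abcdefghijklmnop", 0.1) + pack_row("qrstuvwxyzabcdef", 0.2)

# The faulty reader believes each row is 8 (string) + 8 (double) bytes long.
misread_record_size = ASSUMED_WIDTH + 8
second_string_slot = data[misread_record_size:misread_record_size + ASSUMED_WIDTH]

try:
    second_string_slot.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # The bytes of the double 0.1 are not valid UTF-8.
    decode_error = exc

print(second_string_slot.hex(), decode_error)
```

Whether decoding fails depends on the numeric value: some doubles happen to produce bytes that are valid UTF-8, which would yield garbage text instead of an error.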

cc: @sainathmekala22
