Pyreadstat.write_sav() altering str formats and storage widths #321

Open
NERNST02 opened this issue Nov 21, 2024 · 0 comments

I am having the same issue as the one reported on the pyreadstat forum. The developer there said it needs to be escalated to you. Here are some further details:

In my case, I have found that the issue lies in the pyreadstat.write_sav() function rather than in pyreadstat.read_sav(). When I read in the data and modify it, the metadata is retained correctly; it is only altered once the sav file has been written. I had the script print the metadata formats and storage widths before writing the sav file, and cross-referenced them with the formats and storage widths read back from the output sav file:

Metadata before saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A16000', 'LocationLatitude': 'A16000', 'LocationLongitude': 'A16000', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12.0', 'ClubMainAnnualFee': 'DOLLAR12.0', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A16000', 'Q2_3': 'A16000', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....

Metadata after saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A255', 'LocationLatitude': 'A255', 'LocationLongitude': 'A255', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12', 'ClubMainAnnualFee': 'DOLLAR12', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A255', 'Q2_3': 'A255', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....
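To make the differences between the two dictionaries above easier to spot, a small comparison helper along these lines could be used. This is only a sketch: `diff_formats` is a hypothetical helper name, not part of pyreadstat, and the dictionaries below are a shortened subset of the values printed above.

```python
def diff_formats(before, after):
    """Return {column: (format_before, format_after)} for every column whose format differs."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Shortened subset of the formats printed before and after writing.
before = {
    'ResponseId': 'A50',
    'IPAddress': 'A16000',
    'ClubMainInitiationFee': 'DOLLAR12.0',
    'Q2_2': 'A16000',
    'QA1': 'F8.0',
}
after = {
    'ResponseId': 'A50',
    'IPAddress': 'A255',
    'ClubMainInitiationFee': 'DOLLAR12',
    'Q2_2': 'A255',
    'QA1': 'F8.0',
}

changes = diff_formats(before, after)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

On the full dictionaries above, the columns that differ are IPAddress, LocationLatitude, LocationLongitude, Q2_2 and Q2_3 (A16000 became A255), plus ClubMainInitiationFee and ClubMainAnnualFee (DOLLAR12.0 became DOLLAR12).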

Furthermore, string variables containing values longer than 255 bytes are written with a storage width that I presume matches the longest actual value in the dataset, e.g.:

'QE4': 'A1000'
'QG5': 'A1002'
'QG4': 'A1014' ... and so on, despite explicitly setting the metadata format and storage width to be A16000 and 16000 respectively
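One way to test the presumption that the written width equals the longest actual value would be to compare the width parsed from the format string with the longest UTF-8 byte length in the column. The following is a sketch under that assumption only; how the width is actually chosen during writing is not confirmed here. `max_utf8_bytes` and `string_format_width` are hypothetical helper names, and the sample values are made up for illustration.

```python
import re


def max_utf8_bytes(values):
    """Longest UTF-8 encoded length among the string values (non-strings are skipped)."""
    return max((len(v.encode("utf-8")) for v in values if isinstance(v, str)), default=0)


def string_format_width(fmt):
    """Parse the width from an SPSS string format such as 'A1000'; None for other formats."""
    match = re.fullmatch(r"A(\d+)", fmt)
    return int(match.group(1)) if match else None


# Made-up column values; in practice these would come from the merged DataFrame column.
sample_values = ["x" * 1000, "short", None, "é" * 10]

observed = max_utf8_bytes(sample_values)
written_width = string_format_width("A1000")
print(observed, written_width, observed == written_width)
```

If the two numbers agree for columns such as QE4, QG5 and QG4, that would support the idea that the width is derived from the data rather than from the requested format.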

I am building a database with these files, so standardization of formats is vital for me at the merging stage. I have the full script here, but the part of the script where I believe the bug is happening is as follows:

    output_file_path = os.path.join(BASE_PATH, OUTPUT_FILE)
    print("Saving merged file with metadata...")
    pyreadstat.write_sav(
        merged_data,
        output_file_path,
        variable_value_labels=merged_meta.variable_value_labels,
        column_labels=merged_meta.column_labels,
        variable_display_width=merged_meta.variable_display_width,
        variable_measure=merged_meta.variable_measure,
        variable_format=merged_meta.original_variable_types
    )
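To separate the write step from the rest of the script, a round trip on a tiny frame might look roughly like the sketch below. `roundtrip_string_format`, the column name `text` and the sample data are hypothetical; the calls used are `pyreadstat.write_sav` with `variable_format`, and `pyreadstat.read_sav` with `metadataonly=True`, whose metadata object exposes `original_variable_types` and `variable_storage_width`. I have not included expected output, since that is exactly what is in question.

```python
import os
import tempfile


def roundtrip_string_format(requested_format="A16000"):
    """Write a one-column frame with an explicit string format, read it back,
    and return (format_read_back, storage_width_read_back)."""
    # Imported lazily so the helper can be defined without the libraries installed.
    import pandas as pd
    import pyreadstat

    df = pd.DataFrame({"text": ["short value", "another value"]})  # hypothetical data
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "roundtrip.sav")
        pyreadstat.write_sav(df, path, variable_format={"text": requested_format})
        _, meta = pyreadstat.read_sav(path, metadataonly=True)
    return meta.original_variable_types["text"], meta.variable_storage_width["text"]


# Example call (requires pandas and pyreadstat to be installed):
# fmt, width = roundtrip_string_format("A16000")
# print(fmt, width)
```

Comparing the returned format with the requested one for a few widths (for example A200, A255, A16000) would show where the requested value stops being preserved.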

I would appreciate any advice or fixes for this bug. My main concern is that this needs to run for many different survey datasets that will eventually be merged. As I understand it, the variable metadata, in this case format and storage width, needs to be identical in order to merge two variables with the same column name (my next step is to rename the variables, so the names will be standardized too). That cannot be achieved with the way the str variables are currently written, unless I modify them manually.
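For context on the merging concern, this is roughly how the mismatches could be detected across several output files before merging. It is a sketch only: `find_format_conflicts` is a hypothetical helper, the file names are invented, and the format dictionaries are assumed to have been collected from each file's metadata beforehand.

```python
def find_format_conflicts(formats_by_file):
    """Return {column: {file_name: format}} for columns whose format differs between files."""
    seen = {}
    for file_name, formats in formats_by_file.items():
        for column, fmt in formats.items():
            seen.setdefault(column, {})[file_name] = fmt
    return {
        column: per_file
        for column, per_file in seen.items()
        if len(set(per_file.values())) > 1
    }


# Invented file names; the formats mirror the kind of mismatch described above.
formats_by_file = {
    "survey_a.sav": {"QE4": "A1000", "ClubName": "A200"},
    "survey_b.sav": {"QE4": "A16000", "ClubName": "A200"},
}

conflicts = find_format_conflicts(formats_by_file)
for column, per_file in conflicts.items():
    print(column, per_file)
```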
