Pyreadstat.write_sav() altering str formats and storage widths #321

NERNST02 · 2024-11-21T16:52:15Z

I am seeing the same problem as the error reported on the pyreadstat forum. The developer there said it needs to be escalated to you all, so here are some further details:

In my case, I have found that the issue lies in the pyreadstat.write_sav() function rather than in pyreadstat.read_sav(). After reading in the data and modifying it, the metadata is still correct; it is altered only once the sav file has been written. I had the script print the metadata formats and storage widths before writing the sav file and cross-referenced them with the formats and storage widths read back from the output sav file:

Metadata before saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A16000', 'LocationLatitude': 'A16000', 'LocationLongitude': 'A16000', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12.0', 'ClubMainAnnualFee': 'DOLLAR12.0', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A16000', 'Q2_3': 'A16000', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....

Metadata after saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A255', 'LocationLatitude': 'A255', 'LocationLongitude': 'A255', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12', 'ClubMainAnnualFee': 'DOLLAR12', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A255', 'Q2_3': 'A255', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....

Furthermore, string variables holding data longer than 255 bytes end up with a container width that I presume is the length of the longest actual value in the dataset, e.g.:

'QE4': 'A1000'
'QG5': 'A1002'
'QG4': 'A1014'
... and so on, despite explicitly setting the metadata format and storage width to A16000 and 16000 respectively.
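
To make the comparison above easier to reproduce, here is a minimal sketch of how the two format dictionaries can be diffed. The helper name `diff_formats` is hypothetical (it is not part of pyreadstat), and the sample entries are a few of the values copied from the output above.

```python
def diff_formats(before, after):
    """Return {variable: (format_before, format_after)} for every entry that differs."""
    changed = {}
    # Walk the union of names so variables missing on one side are reported too.
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# A handful of entries copied from the metadata printed before and after saving.
before = {
    "ResponseId": "A50",
    "IPAddress": "A16000",
    "ClubMainInitiationFee": "DOLLAR12.0",
    "Q2_2": "A16000",
}
after = {
    "ResponseId": "A50",
    "IPAddress": "A255",
    "ClubMainInitiationFee": "DOLLAR12",
    "Q2_2": "A255",
}

for name, (old, new) in diff_formats(before, after).items():
    print(f"{name}: {old} -> {new}")
```

In the real script, `before` would be merged_meta.original_variable_types and `after` would be the same attribute taken from the metadata that pyreadstat.read_sav() returns for the output file.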

I am building a database from these files, so standardized formats are vital for me at the merging stage. I have the full script here, but the part of the script where I believe the bug is happening is as follows:

    output_file_path = os.path.join(BASE_PATH, OUTPUT_FILE)
    print("Saving merged file with metadata...")
    pyreadstat.write_sav(
        merged_data,
        output_file_path,
        variable_value_labels=merged_meta.variable_value_labels,
        column_labels=merged_meta.column_labels,
        variable_display_width=merged_meta.variable_display_width,
        variable_measure=merged_meta.variable_measure,
        variable_format=merged_meta.original_variable_types
    )

I would appreciate any advice or fixes for this bug. My main concern is that this script needs to run on many different survey datasets that will eventually be merged. As I understand it, two variables with the same column name can only be merged if their metadata, in this case format and storage width, are identical (my next step is to rename the variables, so the names will be standardized too). That cannot be achieved with the way the str variables are currently written, unless I modify them manually.
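
To illustrate the merge concern, here is a rough sketch of a pre-merge check that flags string variables whose widths differ between files. The helper names and the file names `survey_a.sav` and `survey_b.sav` are hypothetical, the widths are only illustrative, and the sketch assumes string formats always look like `A<width>` as in the output above.

```python
import re

# Assumption: SPSS string formats are written as "A" followed by the width, e.g. "A16000".
_STRING_FORMAT = re.compile(r"^A(\d+)$")


def string_width(fmt):
    """Return the width of a string format such as 'A200', or None for other formats."""
    match = _STRING_FORMAT.match(fmt)
    return int(match.group(1)) if match else None


def mismatched_string_widths(formats_by_file):
    """Map each string variable whose width differs across files to {file: width}."""
    widths = {}
    for file_name, formats in formats_by_file.items():
        for variable, fmt in formats.items():
            width = string_width(fmt)
            if width is not None:
                widths.setdefault(variable, {})[file_name] = width
    # Keep only the variables that do not share one single width everywhere.
    return {
        variable: per_file
        for variable, per_file in widths.items()
        if len(set(per_file.values())) > 1
    }


# Hypothetical per-file format dictionaries, shaped like meta.original_variable_types.
formats_by_file = {
    "survey_a.sav": {"QE4": "A1000", "ClubName": "A200", "QA1": "F8.0"},
    "survey_b.sav": {"QE4": "A1014", "ClubName": "A200", "QA1": "F8.0"},
}

print(mismatched_string_widths(formats_by_file))
```

Any variable this reports would need its width standardized before the files can be merged, which is exactly the step that currently has to be done by hand.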
