Add ability to specify text encoding or disable transcoding #39
Comments
Hi, I will need an updated link to the test file, as it appears to have been deleted from Dropbox.
Asking the original reporter to upload the file again ...
One possibility is that the file self-reports as CP1252, but contains strings in another encoding. This would produce the BAD_STRING error.
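To illustrate that failure mode, here is a minimal Python sketch. It uses Python's own `cp1252` codec as a stand-in for whatever converter the library uses internally, which is an assumption; the helper name `decode_as_cp1252` is made up for this example. It shows that UTF-8 bytes sometimes decode "successfully" as CP1252 (giving mojibake) and sometimes cannot be decoded at all, which is the kind of situation that would surface as a bad-string error.

```python
# Illustration only: Python's cp1252 codec stands in for the library's converter.
# CP1252 leaves five byte values undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D.

def decode_as_cp1252(raw: bytes) -> str:
    """Strictly decode raw bytes as CP1252, raising on undefined bytes."""
    return raw.decode("cp1252")

utf8_e_acute = "é".encode("utf-8")  # b'\xc3\xa9': both bytes are defined in CP1252
utf8_a_acute = "Á".encode("utf-8")  # b'\xc3\x81': 0x81 is undefined in CP1252

# Decodes without error, but to the wrong text (mojibake).
mojibake = decode_as_cp1252(utf8_e_acute)

# Cannot be decoded at all under the declared encoding.
try:
    decode_as_cp1252(utf8_a_acute)
    failed = False
except UnicodeDecodeError:
    failed = True
```

So a file whose header says CP1252 but whose strings are really UTF-8 can fail on only a handful of values, depending on which characters they contain.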
Here is the file: https://github.com/ofajardo/readstat_test_files/blob/master/tip2020.rda
Debugging a bit I am seeing this 11-byte hex string stored in a string vector:
Not sure what this is supposed to be, but
Looking through the file, the strings look like nonsense, so I am wondering if the real encoding is something non-ASCII-based. It would help to have more information about where this file came from.
@69hed Could you please provide more information on how this file was generated and where it comes from? In R it looks OK. Interestingly, R reports the encoding of most character values as "unknown", but some are marked as UTF-8 (see arrow), and there are a few nonsense values as well. Judging by the content, my guess is that it comes from an online survey or feedback page where users can type or paste whatever they like, which gives inconsistent encodings within the same field (I have seen this situation before) ...
more examples of values in the "text" column with international characters. Some values appear to have only ascii characters:
@ofajardo The additional context helps - I guess it will be mostly UTF-8 even though the file header indicates CP1252. I'm not sure what the correct behavior is on the librdata side. Maybe provide an encoding override or the ability to request no recoding (similar to the ReadStat API).
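As a rough picture of what those two options would mean for a single string value, here is a Python sketch. Everything in it is hypothetical: `convert_value`, `override`, and `recode` are names invented for illustration and are not part of librdata, ReadStat, or pyreadr.

```python
from typing import Optional, Union

def convert_value(
    raw: bytes,
    declared: str,
    override: Optional[str] = None,
    recode: bool = True,
) -> Union[str, bytes]:
    """Hypothetical per-value conversion with the two proposed escape hatches.

    declared: encoding named in the file header.
    override: caller-supplied encoding that wins over the header (assumption).
    recode:   when False, hand back the raw bytes untouched (assumption).
    """
    if not recode:
        return raw
    return raw.decode(override or declared)

raw = "Á".encode("utf-8")  # b'\xc3\x81', not decodable as CP1252

with_override = convert_value(raw, declared="cp1252", override="utf-8")
without_recoding = convert_value(raw, declared="cp1252", recode=False)
```

With the override the value comes back as text; with recoding disabled the caller receives the original bytes and decides what to do with them.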
I think that makes sense
@ofajardo All right - I will change this issue to an "enhancement" and leave it open, since the library is currently behaving as expected for the provided file.
thanks!
My personal preference would be to allow specifying the encoding (I think that's what ReadStat does?) ... because on the Python side I am expecting UTF-8. The user could then loop through a set of candidate encodings to see which one works.
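The "loop through encodings" idea could look roughly like the sketch below. It works on raw bytes only, because the reader does not currently expose an encoding parameter; `first_working_encoding` is a hypothetical helper, and the candidate list is just an example.

```python
from typing import Iterable, Optional, Tuple

def first_working_encoding(
    raw: bytes, candidates: Iterable[str]
) -> Optional[Tuple[str, str]]:
    """Return (encoding, decoded text) for the first candidate that decodes
    raw without error, or None if every candidate fails."""
    for enc in candidates:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None

sample = "Málaga".encode("utf-8")
result = first_working_encoding(sample, ["ascii", "utf-8", "cp1252"])
```

Order matters here: a decode that succeeds is not proof that the encoding is right, and permissive single-byte encodings such as `latin-1` accept any byte sequence, so stricter candidates like UTF-8 should be tried first.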
hi
It was reported here in pyreadr that trying to open this file raises the following error:
i.e. RDATA_ERROR_CONVERT_BAD_STRING
Looking at the first 30 bytes of the file, I got the impression the file is in CP1252 (maybe I am looking in completely the wrong place, I actually don't know how this file is structured):
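For anyone who wants to repeat that kind of inspection, here is a minimal Python sketch for dumping the leading bytes of a file. It assumes the file may be wrapped in gzip, bzip2, or xz compression and sniffs the usual magic numbers before reading; `read_leading_bytes` is a hypothetical helper, and the sketch makes no claim about where the encoding is actually stored in the file.

```python
import bz2
import gzip
import lzma

def read_leading_bytes(path: str, n: int = 30) -> bytes:
    """Return the first n bytes of the file, decompressing first if the
    file starts with a gzip, bzip2, or xz magic number."""
    with open(path, "rb") as fh:
        magic = fh.read(6)

    if magic.startswith(b"\x1f\x8b"):
        opener = gzip.open          # gzip magic number
    elif magic.startswith(b"BZh"):
        opener = bz2.open           # bzip2 magic number
    elif magic.startswith(b"\xfd7zXZ\x00"):
        opener = lzma.open          # xz magic number
    else:
        opener = open               # assume the file is not compressed

    with opener(path, "rb") as fh:
        return fh.read(n)
```

Calling `read_leading_bytes("tip2020.rda").hex(" ")` would then print the bytes as space-separated hex pairs for a local copy of the file.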
Looking at the source code I was expecting to get RDATA_ERROR_UNSUPPORTED_CHARSET instead. Maybe librdata is not extracting the encoding correctly for this file?
Also, would it be possible to support non-UTF-8 files?
thanks!