Add ability to specify text encoding or disable transcoding #39

Open
ofajardo opened this issue Feb 1, 2021 · 13 comments

@ofajardo

ofajardo commented Feb 1, 2021

Hi,

It was reported here in pyreadr that trying to open this file raises the following error:

Unable to convert string to the requested encoding (invalid byte sequence)

i.e. RDATA_ERROR_CONVERT_BAD_STRING

Looking at the first 30 bytes of the file, I got the impression that it is encoded in CP1252 (I may be looking in completely the wrong place, as I don't actually know how this file is structured):

RDX3\nX\n\x00\x00\x00\x03\x00\x03\x06\x01\x00\x03\x05\x00\x00\x00\x00\x06CP1252\x00
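
For orientation, the 30 bytes above can be read field by field. The sketch below assumes the layout of R's XDR serialization format version 3: a magic line, a format marker, three big-endian 32-bit integers, and then a length-prefixed native encoding name. It is an illustration in Python, not librdata's parser, and `parse_rdata_header` is a hypothetical helper name.

```python
import struct

# The 30 bytes quoted above.
header = (b"RDX3\nX\n\x00\x00\x00\x03\x00\x03\x06\x01"
          b"\x00\x03\x05\x00\x00\x00\x00\x06CP1252\x00")


def unpack_r_version(packed):
    """Split a packed R version integer into (major, minor, patch)."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


def parse_rdata_header(buf):
    """Hypothetical helper: read the leading fields of an uncompressed
    RDX3 stream, assuming the version 3 XDR layout described above."""
    if buf[:5] != b"RDX3\n" or buf[5:7] != b"X\n":
        raise ValueError("not an uncompressed RDX3/XDR stream")
    # Four big-endian 32-bit integers start at offset 7.
    fmt_version, writer, min_reader, enc_len = struct.unpack_from(">4i", buf, 7)
    encoding = buf[23:23 + enc_len].decode("ascii")
    return {
        "format_version": fmt_version,
        "writer_version": unpack_r_version(writer),
        "min_reader_version": unpack_r_version(min_reader),
        "encoding": encoding,
    }


info = parse_rdata_header(header)
print(info)
```

Under that assumption the header declares format version 3, written by R 3.6.1, readable from R 3.5.0, with a declared native encoding of CP1252, so the encoding name is being found in the expected place.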

Looking at the source code I was expecting to get RDATA_ERROR_UNSUPPORTED_CHARSET instead. Maybe librdata is not extracting the encoding correctly for this file?

Also, would it be possible to support non-UTF-8 files?

thanks!

@evanmiller
Collaborator

Hi, I will need an updated link to the test file, as it appears to have been deleted from Dropbox.

@ofajardo
Author

Asking the original reporter to upload the file again ...

@evanmiller
Collaborator

One possibility is that the file self-reports as CP1252, but contains strings in another encoding. This would produce the BAD_STRING error.

@ofajardo
Author

@evanmiller
Collaborator

Debugging a bit, I am seeing this 11-byte sequence (shown as hex escapes) stored in a string vector:

\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e

Not sure what this is supposed to be, but \x81 is unused by Code Page 1252. As a workaround I can add //IGNORE to the iconv command to skip unrecognized characters, but this might produce unexpected output.
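
A quick way to probe these bytes outside librdata is to decode them in Python. This is only a sketch: Python's `cp1252` codec happens to treat 0x81 as undefined, which mirrors the failure described here, but its error handlers are not the same thing as iconv's `//IGNORE` and may differ in detail.

```python
raw = b"\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e"

# Strict CP1252 decoding fails on the undefined byte 0x81.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError as exc:
    cp1252_ok = False
    print("cp1252 failed:", exc)

# Roughly what skipping unrecognized bytes would yield: the 0x81 bytes
# are dropped and the rest is read as CP1252.
lossy = raw.decode("cp1252", errors="ignore")
print(repr(lossy))

# The same bytes read as UTF-8, with invalid bytes replaced by U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")
print(repr(as_utf8))
```

Read as UTF-8, everything after the first two bytes decodes cleanly to Japanese characters followed by `^_^`, whereas the lossy CP1252 reading is mojibake. That is consistent with the later guess in this thread that the strings are mostly UTF-8 despite the CP1252 header.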

@evanmiller
Collaborator

Looking through the file, the strings look like nonsense, so I am wondering if the real encoding is something non-ASCII-based. It would help to have more information about where this file came from.

@ofajardo
Author

@69hed could you please provide more information on how this file was generated/where it comes from?

Looking at it in R, it looks OK. Interestingly, R reports the encoding of most character values as "unknown", but some of them are marked as UTF-8 (see the arrow in the screenshot). There are also a few nonsense values.

image

Looking at the content, my guess is that it comes from an online survey or feedback webpage where users can type or paste whatever they like, which gives inconsistent encodings within the same field (I have seen this situation before).

@ofajardo
Author

More examples of values in the "text" column with international characters. Some values appear to contain only ASCII characters:

[954] "$7 Saké Wed Nights"
[955] "Two visits, two phenomenal sandwiches. The seasonal jalapeño with corn crema and the egg roll were perfect. Love this place!"
[956] "Does mot spécialisé in iced tea"
[957] "Sitzplatzempfehlung für freien Blick zur Bühne Tisch 12 Platz 1&2"
[958] "Je hungré fo some frieeesss!"
[959] "Coupon in VEGAS2GO® guide offers free ticket with the purchase of one. There is also an active Yelp Deal that offers a (not as good) discount if you cannot locate a guide. - E"
[960] "服务员是真的傻逼,加个座位会挡住你走路,怕是挡住你的棺材路哦,纯几把傻"

@evanmiller
Collaborator

@ofajardo The additional context helps - I guess it will be mostly UTF-8 even though the file header indicates CP1252. I'm not sure what the correct behavior is on the librdata side. Maybe provide an encoding override or the ability to request no recoding (similar to the ReadStat API).

@ofajardo
Author

I think that makes sense

evanmiller changed the title from "Problems with rda files not in UTF8" to "Add ability to specify text encoding or disable transcoding" on Mar 29, 2021

@evanmiller
Collaborator

@ofajardo All right - I will change this issue to an "enhancement" and leave it open since the library is currently behaving as expected for the provided file.

@ofajardo
Author

thanks!

@ofajardo
Author

My personal preference would be to allow specifying the encoding (I think that's what ReadStat does?), because on the Python side I am expecting UTF-8. The user could then loop through a set of candidate encodings to see which one works.
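
The "loop through encodings" idea can be sketched on raw bytes in plain Python. This assumes nothing about a future librdata or pyreadr option; `first_working_encoding` is a hypothetical helper and the candidate list is only an example.

```python
from typing import Iterable, Optional


def first_working_encoding(raw: bytes, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that decodes `raw` without error, or None."""
    for name in candidates:
        try:
            raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown encoding name.
            continue
        return name
    return None


candidates = ["utf-8", "cp1252", "latin-1"]

utf8_sample = "jalapeño".encode("utf-8")
cp1252_sample = "jalapeño".encode("cp1252")
odd_sample = b"\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e"

for sample in (utf8_sample, cp1252_sample, odd_sample):
    print(sample, "->", first_working_encoding(sample, candidates))
```

Order matters: `latin-1` accepts every byte sequence, so it belongs last and a match on it says little. More generally, decoding without an error does not prove the encoding is the right one, so the result is a hint rather than a detection.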
