Add ability to specify text encoding or disable transcoding #39

ofajardo · 2021-02-01T09:29:38Z

hi

It was reported here in pyreadr that trying to open this file raises the following error:

Unable to convert string to the requested encoding (invalid byte sequence)

i.e. RDATA_ERROR_CONVERT_BAD_STRING

Looking at the first 30 bytes of the file, I got the impression the file is in CP1252 (maybe I am looking in completely the wrong place, I actually don't know how this file is structured):

RDX3\nX\n\x00\x00\x00\x03\x00\x03\x06\x01\x00\x03\x05\x00\x00\x00\x00\x06CP1252\x00
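For readers unfamiliar with the layout, the 30 bytes above can be unpacked as in the following minimal sketch. It assumes the bytes come from an already decompressed stream and follow the version 3 R serialization header (magic, format marker, three big-endian integers, then a length-prefixed native encoding name). The function names `parse_rdx3_header` and `unpack_r_version` are made up for illustration and are not part of librdata or pyreadr.

```python
import struct

# The 30 bytes quoted above, copied verbatim.
HEADER = (b"RDX3\nX\n\x00\x00\x00\x03\x00\x03\x06\x01"
          b"\x00\x03\x05\x00\x00\x00\x00\x06CP1252\x00")


def unpack_r_version(packed: int) -> tuple:
    """Split a packed R version integer into (major, minor, patch)."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


def parse_rdx3_header(buf: bytes) -> dict:
    """Hypothetical helper: read the fixed header fields of an RDX3 stream."""
    if buf[:5] != b"RDX3\n":
        raise ValueError("not an RDX3 stream")
    if buf[5:7] != b"X\n":
        raise ValueError("only the XDR (big-endian) format is handled here")
    # Four big-endian 32-bit integers follow the 7-byte preamble:
    # format version, writer R version, minimum reader R version,
    # and the length of the native encoding name.
    fmt_version, writer, min_reader, enc_len = struct.unpack_from(">iiii", buf, 7)
    start = 7 + 16
    encoding = buf[start:start + enc_len].decode("ascii")
    return {
        "format_version": fmt_version,
        "writer_version": unpack_r_version(writer),
        "min_reader_version": unpack_r_version(min_reader),
        "encoding": encoding,
    }


info = parse_rdx3_header(HEADER)
print(info)
```

Under these assumptions the declared encoding is read from the header itself, so the CP1252 label is what the file reports about itself, not necessarily what the strings actually contain.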

Looking at the source code I was expecting to get RDATA_ERROR_UNSUPPORTED_CHARSET instead. Maybe librdata is not extracting the encoding correctly for this file?

And actually, would it be possible to support non-UTF-8 files?

thanks!


evanmiller · 2021-03-27T13:17:07Z

Hi, I will need an updated link to the test file, as it appears to have been deleted from Dropbox.

ofajardo · 2021-03-27T15:48:22Z

Asking the original reporter to upload the file again ...

evanmiller · 2021-03-28T13:07:46Z

One possibility is that the file self-reports as CP1252, but contains strings in another encoding. This would produce the BAD_STRING error.

ofajardo · 2021-03-29T13:22:20Z

Here is the file:

https://github.com/ofajardo/readstat_test_files/blob/master/tip2020.rda

evanmiller · 2021-03-29T14:01:48Z

Debugging a bit, I am seeing this 11-byte hex string stored in a string vector:

\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e

Not sure what this is supposed to be, but \x81 is unused by Code Page 1252. As a workaround I can add //IGNORE to the iconv command to skip unrecognized characters, but this might produce unexpected output.
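The failure can be reproduced outside librdata with the following rough sketch, which uses Python's built-in codecs as a stand-in for iconv (an assumption: the two are not guaranteed to behave identically, and `errors="ignore"` is only an analogue of `//IGNORE`). The helper `try_decode` is made up for illustration.

```python
# The 11-byte string quoted above, copied verbatim.
raw = b"\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e"


def try_decode(data: bytes, encoding: str):
    """Return (text, None) on success or (None, offset_of_first_bad_byte)."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc.start


# Strict CP1252 decoding fails on the very first byte, because 0x81
# has no assigned character in that code page.
cp1252_text, cp1252_bad_offset = try_decode(raw, "cp1252")

# Strict UTF-8 decoding also fails at offset 0: 0x81 is a continuation
# byte and cannot start a character.
utf8_text, utf8_bad_offset = try_decode(raw, "utf-8")

# Skipping undecodable bytes "works", but silently drops data and
# reinterprets the remaining bytes as Western European characters.
lossy_cp1252 = raw.decode("cp1252", errors="ignore")

# The last nine bytes, however, are well-formed UTF-8 on their own.
utf8_tail = raw[2:].decode("utf-8")

print(cp1252_bad_offset, utf8_bad_offset)
print(repr(lossy_cp1252))
print(utf8_tail)
```

The fact that everything after the first two bytes decodes cleanly as UTF-8 suggests the string may be UTF-8 text with a damaged or truncated first character, which is one reason skipping bytes under the declared code page can give misleading output.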

evanmiller · 2021-03-29T14:11:33Z

Looking through the file, the strings look like nonsense, so I am wondering if the real encoding is something non-ASCII-based. It would help to have more information about where this file came from.

ofajardo · 2021-03-29T14:28:29Z

@69hed could you please provide more information on how this file was generated/where it comes from?

Looking at it in R, it looks OK. Interestingly, it says that for most character values the encoding is "unknown", but some of them are UTF-8 (see arrow). There are also a few nonsense values.

Looking at the content, my guess would be that it is coming from an online survey/feedback webpage, where the user is allowed to type whatever, or copy and paste, giving you inconsistent encodings within the same field (I have seen such situations before) ...

ofajardo · 2021-03-29T14:36:28Z

More examples of values in the "text" column with international characters. Some values appear to have only ASCII characters:

 [954] "$7 Saké Wed Nights"
 [955] "Two visits, two phenomenal sandwiches. The seasonal jalapeño with corn crema and the egg roll were perfect. Love this place!"
 [956] "Does mot spécialisé in iced tea"
 [957] "Sitzplatzempfehlung für freien Blick zur Bühne Tisch 12 Platz 1&2"
 [958] "Je hungré fo some frieeesss!"
 [959] "Coupon in VEGAS2GO® guide offers free ticket with the purchase of one. There is also an active Yelp Deal that offers a (not as good) discount if you cannot locate a guide. - E"
 [960] "服务员是真的傻逼，加个座位会挡住你走路，怕是挡住你的棺材路哦，纯几把傻"

evanmiller · 2021-03-29T14:55:51Z

@ofajardo The additional context helps - I guess it will be mostly UTF-8 even though the file header indicates CP1252. I'm not sure what the correct behavior is on the librdata side. Maybe provide an encoding override or the ability to request no recoding (similar to the ReadStat API).

ofajardo · 2021-03-29T15:01:29Z

I think that makes sense

evanmiller · 2021-03-29T15:03:19Z

@ofajardo All right - I will change this issue to an "enhancement" and leave it open since the library is currently behaving as expected for the provided file.

ofajardo · 2021-03-29T15:04:03Z

thanks!

ofajardo · 2021-03-29T15:08:59Z

My personal preference would be to allow specifying the encoding (I think that's what ReadStat does?) ... because on the Python side I am expecting UTF-8. The user could loop through a bunch of encodings to see which one does the job.
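The "loop through a bunch of encodings" idea could look roughly like the sketch below. It works on raw bytes only and does not call pyreadr or librdata, since no encoding override exists in either at this point in the thread. The function name `decode_with_fallback` and the default candidate list are assumptions made for illustration.

```python
def decode_with_fallback(raw: bytes, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used).

    Hypothetical helper. Order matters: UTF-8 is strict enough that a
    successful decode is a strong hint, whereas single-byte code pages
    accept almost any input and should be tried later.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {candidates!r} could decode the input")


# A value stored as UTF-8 is decoded by the first candidate.
print(decode_with_fallback("jalapeño".encode("utf-8")))

# The same word stored as CP1252 falls through to the second candidate.
print(decode_with_fallback(b"jalape\xf1o"))
```

A decode that succeeds is not proof that the encoding is right: a single-byte code page such as Latin-1 accepts every byte sequence, so this approach only narrows the candidates. With per-value inconsistencies like the ones seen in this file, the loop would need to run per string, not once per file.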

evanmiller changed the title ~~Problems with rda files not in UTF8~~ Add ability to specify text encoding or disable transcoding Mar 29, 2021

evanmiller added the enhancement label Mar 29, 2021
