Non ASCII characters support #1
Comments
Hello, thank you for pointing this out. It seems to be an encoding issue for which I cannot find a quick fix, but I will keep trying. Just to make sure:
Anyway, I will continue looking for solutions, and I thank you for having pointed it out.
Thanks a lot for your reply and your best efforts!
@andreaspacher I ran into the same issue when working with the CSV files. Thinking about a solution: could you perhaps try to specify the encoding as UTF-8 when writing the data to CSV with write.csv? As such:

Also, when you read the current file(s) back into R on your system, do the special characters display correctly for you? Happy to try and help troubleshoot this further, as your data is super useful!
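The UTF-8 suggestion above can be sketched as a write/read round trip. The snippet that originally followed "As such:" is not preserved here, so this is only an illustration, written in Python rather than R; in R, the corresponding argument on write.csv and read.csv is fileEncoding = "UTF-8". The helper names and the file name are hypothetical, and the sample strings are taken from the issue report.

```python
import csv
import os
import tempfile


def write_rows_utf8(path, rows):
    """Write rows to a CSV file, stating the encoding explicitly."""
    # Without encoding=..., Python falls back to the platform default,
    # which is one way non-ASCII characters get mangled on some systems.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_rows_utf8(path):
    """Read the CSV back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


# Sample values from the issue report; the two-column layout is an assumption.
rows = [
    ["field", "value"],
    ["journal", "Archives de Pédiatrie"],
    ["affiliation", "Université de Nantes; Nantes, France"],
    ["editor", "François Galgani"],
]

path = os.path.join(tempfile.mkdtemp(), "editors.csv")  # hypothetical file name
write_rows_utf8(path, rows)
round_tripped = read_rows_utf8(path)
```

If the characters survive this round trip but still look wrong in another program, the likelier culprit is how that program guesses the encoding when opening the file.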
In the CSV files, most of the encoding problems should now be fixed, with a few exceptions (e.g. some Chinese characters; I will look into these last few issues soon, too). I added the fix for the wrongful hex codes in d1fb71b, and for most of the wrongful unicodes in e2448c1. I resorted to a rather manual cleaning as

Perhaps the fact that your code, @bmkramer, resulted in

And thank you, @bmkramer, for your suggestion regarding an explicit reading/writing of CSV files in UTF-8. This will certainly be helpful in the future; I integrated it (e.g. in 5bf111e). As regards the online version at https://openeditors.ooir.org, I will correct the data in a few days.
I fixed most of the issues in both the CSV files and the online web version. A few unicodes that I could not properly identify remained in the dataset; the same applies to names in Chinese characters, of which there were a few (though most often with pinyin transcriptions anyway). Most of them form part of the journals

As a note to myself, I used the following code to fix (as an example) the wrongful hex codes for the web version (in MySQL):
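The MySQL snippet referred to above is not preserved in this copy of the thread. As a rough stand-in, the same kind of cleanup can be sketched in Python. It assumes, and this is only an assumption, that the wrongful codes appear as literal tokens such as `<U+00E9>` (a code point) or `<c3><a9>` (a run of UTF-8 bytes), which is how R commonly prints characters it cannot represent in the current locale. All function names are hypothetical.

```python
import re

# Assumed token shapes; the codes actually present in the dataset may differ.
_UNICODE_TOKEN = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")
_HEX_BYTE_RUN = re.compile(r"(?:<[0-9A-Fa-f]{2}>)+")


def restore_unicode_tokens(text):
    """Replace tokens like <U+00E9> with the character they name."""
    def to_char(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            return match.group(0)  # not a valid code point, leave as-is
        return chr(code_point)
    return _UNICODE_TOKEN.sub(to_char, text)


def restore_hex_bytes(text, encoding="utf-8"):
    """Replace runs like <c3><a9> with the text those bytes encode."""
    def decode_run(match):
        hex_pairs = re.findall(r"[0-9A-Fa-f]{2}", match.group(0))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            return match.group(0)  # unidentified run, leave for manual review
    return _HEX_BYTE_RUN.sub(decode_run, text)


def clean(text):
    """Apply both repairs; anything unrecognised is left untouched."""
    return restore_hex_bytes(restore_unicode_tokens(text))
```

Leaving undecodable runs in place, rather than dropping them, matches the situation described above, where a few codes that could not be identified remained in the dataset.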
Thanks @andreaspacher for fixing the encoding issues! Unfortunately, something apparently still happens along the way that causes the CSVs to open with the unicode/ASCII codes on my system (no idea why...), but the code you included makes it easy to redo the fixes and proceed :-) I used this in af88e49 as part of a workflow to match editor affiliations to ROR IDs.
There seems to be something going on with encoding detection upstream. For instance this title

It may actually stem from the Scopus reader, as that loads an xlsx file with Latin1 encoding rather than UTF-8 (although I don't see the em dash in the Scopus list for this title, only the short ASCII dash). It is hard to tell exactly where it comes from, as the outputs in the publishers repo don't include the individual CSV outputs, only the final merges.
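To illustrate the kind of Latin1/UTF-8 mismatch suspected here, the following Python sketch shows what happens when UTF-8 bytes are decoded as Latin-1, and how the damage can sometimes be reversed. It does not reproduce the Scopus reader or the affected title, which is not preserved above; the function names and the sample title are hypothetical.

```python
def misread(text, wrong_encoding="latin-1"):
    """Simulate the bug: UTF-8 bytes decoded with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(text, wrong_encoding="latin-1"):
    """Try to undo a misread; return the input unchanged if that fails."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was probably not damaged this way, so leave it alone.
        return text


# Hypothetical title containing an em dash (U+2014) and an accented letter.
title = "Archives de P\u00e9diatrie \u2014 Suppl\u00e9ment"
garbled = misread(title)
restored = repair(garbled)
```

Text that was never misread does not normally survive the re-encoding step, so `repair` hands it back unchanged; that makes the function reasonably safe to run over a mixed column, although short or unusual strings could still be misjudged.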
Hi,
Bravo for this great initiative!
I suppose you already know that non-ASCII strings are not well supported in your data. The characters seem to be filtered out of the strings: erased or replaced.
Example from this search: https://openeditors.ooir.org/index.php?editor_query=Nantes
. Journal title: 'Archives de Pdiatrie' should be 'Archives de Pédiatrie' > character erased
. University name: 'Universit de Nantes; Nantes, France' should be 'Université de Nantes; Nantes, France' > character erased
. Editor name: 'Francois Galgani' should be 'François Galgani' > character 'ç' replaced by 'c'
If all characters could be preserved in Unicode, it would be perfect!