-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 #127
base: master
Are you sure you want to change the base?
Conversation
@@ -26,7 +26,7 @@ | |||
"mocha": "2.x" | |||
}, | |||
"dependencies": { | |||
"iconv-lite": "0.x", | |||
"iconv-lite": ">= 0.4 < 1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why limit upper bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it was limited before this change too (to 0.x
, effectively < 1
). Probably good to keep it like that for compatibility reasons. Added lower bound only because it's required for ksc5601
.
Hi @scop thanks for the PR I've reviewed encoding/decoding for Cyrillic and Hebrew and they look good (I cannot find a ksc5601 charmap to review but as long as it's using iconv...) Would you mind review my code comment and discuss why you think matching will cause false positives / unwanted results? maybe a solution with a regex such as GSM? |
Regarding the auto detection, as said, it's mostly a hunch, but let me elaborate a bit. First, I guess e.g. cyrillic and hebrew have some overlaps which can make detection produce unwanted (if not false) results between them and also ISO-8859-1, and I have no idea whatsoever about KS C 5601. Second, I think detection ending up using one of these encodings should in the end not matter much at all. What counts much more in my experience is whether the message can be eventually -- when sending it to its final destination -- encoded in GSM, or if fallback to UCS-2 is needed. These intermediate encodings aren't that interesting. Third, introducing an uncommon encoding like these could trigger issues elsewhere -- messages that were previously detected as (the widely supported) UCS-2 may now start to be encoded as one of these for little gain, but possibility in non-support in other systems where the messages are relayed to. Because of the above, amplified by the fact that I'm not personally interested in the detection mode at all currently and thus don't want to spend time around that can of worms ;), I decided to play it safe and exclude these from the detection. If someone wants to change that, it can be done later, but I suggest not delaying getting this in for that. |
Re KS C 5601, based on info at https://en.wikipedia.org/wiki/KS_X_1001 I ended up generating the test sample's expected bytes encoding with
I'm not sure if euc-kr is always a good substitute for it, but for this sample it works and I guess that's as far as code here should be concerned, the rest can be left as an assumption that iconv-lite's If you want to play with it more, maybe this would be useful? |
Anything more I can do here? |
I'm going to hold this till we figure out encoding detection |
Did not include these in auto detection, I have a hunch they'd cause false positives / unwanted results.
The test samples are "hello" translations from Google Translate.