Clarify how Hyperglot handles composed and decomposed combinations, e.g. accented letters #149
Comments
Note that for German, while it is not common, one may still get decomposed characters when dealing with files on the macOS file system, as strings are stored in something close to NFD rather than NFC. Usually the UI file dialog or Finder normalizes strings when names are copied, but the name can still be in decomposed form if obtained otherwise, such as in some applications or when files are copied to other operating systems. One can also easily stumble upon decomposed forms in library catalogues, for example in the LOC catalogue or the NYPL catalogue. The nature of the Unicode model means these decomposed forms should also be supported in German, even if they are less common in the large corpus of German data.
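To make the NFC/NFD difference concrete, here is a minimal Python sketch using only the standard library's `unicodedata` module. The sample name "Müller" and the helper `codepoints` are illustrative assumptions, not taken from Hyperglot or from any particular file system.

```python
import unicodedata

# Precomposed (NFC) spelling: "ü" is the single code point U+00FC.
name_nfc = "M\u00fcller"

# Decomposed (NFD) spelling: "u" followed by U+0308 COMBINING DIAERESIS.
name_nfd = unicodedata.normalize("NFD", name_nfc)


def codepoints(text):
    """Return the code points of `text` in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    print(codepoints(name_nfc))
    print(codepoints(name_nfd))
    # The two strings render the same but do not compare equal
    # until both are normalized to the same form.
    print(name_nfc == name_nfd)
    print(unicodedata.normalize("NFC", name_nfd) == name_nfc)
```

A font that maps only U+00FC can display the first string but needs U+0308 (plus mark positioning) for the second, which is why the two cases are worth distinguishing.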
I think there is a difference between being able to represent an orthography accurately and being able to represent all legacy input sequences of an orthography (such as those encountered in digital texts, which is what the PR comment was concerned with). We have If this were an issue, this would apply to all orthographies and all characters that have decomposable unicodes.
@moyogo I agree with that, but what you describe sounds like a recommended best practice, not a minimal requirement for language support (good-enough practice). We do not want to fail to detect a font if it is good enough. Minimality is a key notion in Hyperglot. (I would have loved to call it a principle, but I am unable to support it with a clear definition.) Frankly, I forgot about the global switch for the CLI when I wrote the issue, but the issue still stands. For some languages supporting decomposed sequences may be an essential feature; for others it seems non-essential, in theory at least. In order to add a note in the README and elsewhere, I would like to clarify our position. Sorry for the latency in my replies.
From the README:
I think changing the
I consider this clarified :)
This still needs to be clarified on the web app's About page.
Idea: For the CLI, output a short preamble before the test result that clarifies marks, decomposition and shaping checks, as well as opt-in flags, and how they affect the result.
@moyogo made the following comment in #147:
I think it makes sense to clarify this in the README, web app, and maybe even in CLI. But I wonder if it would be best as a point in some kind of language support checklist.
The handling could potentially be different for each language: requiring a combining dieresis for German is optional, while for Tlingit both <Ḵ> and <ḵ> should be supported in precomposed form and as base letters with combining marks (see #147). Perhaps this could be a flag for orthographies, @kontur?
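As a rough illustration of what such a per-orthography flag could mean, here is a Python sketch. Everything in it is hypothetical: `supported_forms`, `passes`, and the `require_both` parameter are made-up names, not Hyperglot's API, and a font is modelled simply as a set of code points. A real check would also need to look at mark positioning and shaping, which this sketch ignores.

```python
import unicodedata


def supported_forms(char, font_codepoints):
    """Report which encodings of `char` a set of code points can represent.

    Hypothetical helper: `font_codepoints` is a plain set of integers
    standing in for a font's character map.
    """
    composed = unicodedata.normalize("NFC", char)
    decomposed = unicodedata.normalize("NFD", char)
    return {
        "precomposed": all(ord(c) in font_codepoints for c in composed),
        "decomposed": all(ord(c) in font_codepoints for c in decomposed),
    }


def passes(char, font_codepoints, require_both):
    """Apply a hypothetical per-orthography flag.

    require_both=True  -> precomposed AND decomposed forms must be encodable.
    require_both=False -> either form is good enough.
    """
    forms = supported_forms(char, font_codepoints)
    if require_both:
        return forms["precomposed"] and forms["decomposed"]
    return forms["precomposed"] or forms["decomposed"]


if __name__ == "__main__":
    # A font with only the precomposed Ḵ (U+1E34) and ü (U+00FC).
    precomposed_only = {0x1E34, 0x00FC}
    print(passes("\u00fc", precomposed_only, require_both=False))  # German-style rule
    print(passes("\u1e34", precomposed_only, require_both=True))   # Tlingit-style rule
```

With this shape, the German case would set the flag off and accept a font with only U+00FC, while the Tlingit case would set it on and additionally require K plus U+0331 COMBINING MACRON BELOW.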