new script Sunuwar first data #375
I notice the date on |
(You also need to update Scripts.txt, which is what the invariants tests are complaining about.) |
This is a draft PR that I worked on during one live work session. A lot of data is missing. LineBreak.txt is an input file for the Unicode Tools. I believe that KenW has a C tool with heuristics that generates some initial data for Line_Break and other properties for new code points. As discussed elsewhere, one challenge will be how to keep these kinds of branches up to date, and later merge them, given that many changes touch the same files and lines and create conflicts. In some cases, we can probably simply regenerate the files, but that won't help when all new characters go into the same place. At a minimum, we need to try this out and write up instructions for how best to work through the various types of conflicts. |
@eggrobin I should rebase this PR before we continue. Do you have a link to the recipe for that? Such as, how to deal with conflicts during rebase, which files to revert (all & only the "extracted/" ones?), etc. |
I don’t rebase, I merge.
|
Thanks, I adjusted those commands to my environment:
|
@Ken-Whistler @Manishearth is U+11BE1 SUNUWAR SIGN PVO really punctuation (gc=Po)? From L2/21-157R:
That is, it is pronounced as part of the text, as its own specific syllable, rather than serving the usual functions of punctuation (segmenting text, emphasizing, etc.), which I think are usually silent or carried in pauses and intonation. Shouldn't a syllable have gc=Lo? |
FYI: I looked over the list of UCD files, and the list of properties in PropList.txt, and don't see what else we might need here. |
Yeah, that sounds congruent to an Om to me. Symbol might also make sense, but I think Lo is the best choice here. (It is also very similar to a shri श्री, which is not encoded atomically in Unicode but is similarly often ornate/distinct in some scripts. That has caused problems in Tamil in the past, since the ornateness has disconnected it from its etymological roots, which match its encoding model.) |
I see that the CI checks show a TestTestUnicodeInvariants failure, but I don't get that locally :-/ |
I don’t see that. You got a failure on your merge commit, made before you updated Scripts.txt, and the failure was because Scripts.txt had not been updated: https://github.com/unicode-org/unicodetools/actions/runs/6488178531/job/17620064012. But this is fixed now. |
Huh. My GitHub PR window kept updating itself with all kinds of changes, but not with the CI checks. After refreshing the window, all is well. |
Regarding sign pvo: The proposal has a Punctuation section: “There is no script-specific punctuation.” |
I disagree. Just because a punctuation sign has a conventional syllable associated with it doesn't make it a letter for our purposes. "ken@unicode.org" is read "ken at unicode dot org", and if people start decorating text with at signs or extra dots, you could read off that text, "@@@@flooby....", as "at at at at flooby dot dot dot dot". That doesn't make "@" or "." letters, just because they are associated with conventional syllables in English (or other languages). If we change this one to gc=Lo in Sunuwar, then we are opening a can of worms for pushpika in Devanagari, as well as possibly siddham signs in various scripts. The proposal is inconsistent: it claims that it doesn't add script-specific punctuation, but then turns around and encodes pushpika. That claim in the proposal is not dispositive. |
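For reference, the General_Category values of the already-encoded characters used as precedents in this thread can be checked with Python's standard `unicodedata` module. This is only a minimal sketch: the `samples` table and the `general_categories` helper are names made up for illustration, the result reflects whatever UCD version the local Python ships, and U+11BE1 itself is deliberately not queried because it may not be present in that version.

```python
import unicodedata

# Already-encoded characters mentioned as precedents in the discussion.
# The labels are only dictionary keys for readability.
samples = {
    "COMMERCIAL AT": "@",
    "FULL STOP": ".",
    "DEVANAGARI OM": "\u0950",
    "DEVANAGARI SIGN PUSHPIKA": "\uA8F8",
}


def general_categories(chars):
    """Map each label to the General_Category reported by the local UCD copy."""
    return {label: unicodedata.category(ch) for label, ch in chars.items()}


if __name__ == "__main__":
    for label, gc in general_categories(samples).items():
        print(f"{label}: gc={gc}")
```

With a reasonably recent Python, OM comes out as Lo while the at sign, the full stop, and pushpika come out as Po, which is the existing split that both sides of the argument are appealing to. |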
After adding the Script_Extensions data from the proposal, I get these failures:
|
This is a unit test for Unicode Tools class ScriptDetector, with expected sets of scripts for a small set of strings. |
This test code exercises CheckWholeScript.getConfusables() for some strings.
This seems troubling. Assuming that this code follows the security spec, it might tell us that we can expect lots of confusables implementations to get very different results for the affected common diacritics. |
What tips this over the edge for me is that it has a conventional syllable that is not otherwise representable in the script: the script represents bilabial implosives but not unvoiced ones (which are an extremely rare sound in general). I'm still somewhat torn. I see a case for Lo, I see a case for Po. I think it boils down to whether this is primarily a written-register thing or also a spoken one: is it an utterance that has a corresponding glyph, or is it a glyph that has a corresponding "read out loud" utterance that would not otherwise be used in normal speech? The proposal itself doesn't give us a clear answer there; we could ask Anshu though. I am somewhat reminded of Sumero-Akkadian DINGIR 𒀭 or LUGAL 𒈗 when used as determinatives, for the latter case (even though those are encoded as letters: if they were only used in this fancy-marker way, I would see the argument for Po). I'm not clear how pushpika is related (it's a lacuna marker, right?), but I do take the point about siddham. Consistency with siddham is an easy argument to make here and one I'd accept. cc @roozbehp
|
Off-topic, but I don’t recall seeing 𒈗 used as a determinative (LÚ 𒇽 is). |
@roozbehp @Ken-Whistler the proposed additions of "Sunu" Script_Extensions for commonly used (sc=Inherited) combining marks like U+0300 are causing problems with the Unicode Tools confusables implementation, suggesting that they would also cause problems in real implementations. @eggrobin and I suspect that we should back those out. These characters are used in many scripts, and if the assignment of scx=Sunu is not some kind of new thing that is intentional, and intended to be expanded, then we should revert it. It seems like either we should have a comprehensive scx set of scripts for a combining mark or we should leave it Inherited with no scx. Do you agree? |
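To illustrate why a partial scx set on a shared combining mark can hurt, here is a minimal sketch of a script-set intersection in the style of UTS #39's resolved script set. It is not the Unicode Tools code. The tables are toy data, the assumption that the draft would leave U+0300 with scx={Sunu} only is mine, and the augmentation step for CJK scripts is left out.

```python
# Toy Script_Extensions tables. The values are illustrative assumptions,
# not data copied from ScriptExtensions.txt or from this draft PR.
ALL = None  # stands for "every script" (Common / Inherited with no scx)


def resolved_script_set(text, scx):
    """Intersect the per-character script sets, treating ALL as neutral."""
    result = ALL
    for ch in text:
        scripts = scx.get(ch, ALL)
        if scripts is ALL:
            continue  # the character is compatible with any script
        result = set(scripts) if result is ALL else result & scripts
    return result


BASE = {"a": {"Latn"}}
INHERITED_NO_SCX = dict(BASE)               # U+0300 absent, so treated as ALL
PARTIAL_SCX = {**BASE, "\u0300": {"Sunu"}}  # hypothetical partial scx

word = "a\u0300"  # LATIN SMALL LETTER A + COMBINING GRAVE ACCENT
print(resolved_script_set(word, INHERITED_NO_SCX))  # {'Latn'}
print(resolved_script_set(word, PARTIAL_SCX))       # set()
```

Under these assumptions the ordinary Latin sequence goes from single-script to an empty intersection, i.e. it would be reported as mixed-script, which is the kind of change the failing tests appear to be flagging. |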
Re pvo:
|
Re scx=Sunu: SAH had agreed already to remove these, see https://github.com/unicode-org/sah/issues/32#issuecomment-1760147420 |
The choice of Po seems odd to me. My understanding is that this is not a case of /ɓ/ describing a punctuation mark (as in ken at unicode dot org), but rather that this character describes a spoken sound (like any other letter). Here are my notes:
If gc=Lo is problematic, I would be inclined to make this gc=So, rather than Po. |
Regarding sign PVO:
I created https://github.com/unicode-org/properties/issues/207. Please say there why you would prefer So over Po. |
[170-C8] Consensus: UTC accepts 44 Sunuwar characters in a new Sunuwar block (U+11BC0..U+11BFF), as documented in L2/21-157R, for encoding in a future version of the standard.
RMG tracking: https://github.com/unicode-org/utc-release-management/issues/32
Obsolete: https://github.com/unicode-org/properties/issues/45