new script Sunuwar first data #375

Merged: markusicu merged 9 commits into unicode-org:main from add-script-sunuwar on Oct 12, 2023

markusicu
Member

markusicu commented Dec 20, 2022

[170-C8] Consensus: UTC accepts 44 Sunuwar characters in a new Sunuwar block (U+11BC0..U+11BFF), as documented in L2/21-157R, for encoding in a future version of the standard.

RMG tracking: https://github.com/unicode-org/utc-release-management/issues/32

Obsolete: https://github.com/unicode-org/properties/issues/45

@eggrobin
Member

eggrobin commented Mar 6, 2023

I notice the date on LineBreak.txt is not getting bumped by this PR, nor does that one appear to be getting produced in Generated. Is that file manually maintained?

@eggrobin
Member

eggrobin commented Mar 6, 2023

(You also need to update Scripts.txt, which is what the invariants tests are complaining about.)

@markusicu
Member Author

This is a draft PR that I worked on during one live work session. A lot of data is missing.

LineBreak.txt is an input file for the Unicode Tools. I believe that KenW has a C tool with heuristics that generates some initial data for Line_Break and other properties for new code points.

As discussed elsewhere, one challenge will be how to keep these kinds of branches up to date, and later merge them, given that many changes touch the same files and lines and create conflicts. In some cases, we can probably simply regenerate the files, but that won't help when all new characters go into the same place. At a minimum, we need to try this out and write up instructions for how best to work through the various types of conflicts.

@markusicu
Member Author

@eggrobin I should rebase this PR before we continue. Do you have a link to the recipe for that? For example: how to deal with conflicts during the rebase, which files to revert (all & only the "extracted/" ones?), etc.

@eggrobin
Member

I don’t rebase, I merge.
I do that by pasting the first block of commands in https://gist.github.com/eggrobin/e902e30952c2f25b59fd0b285168a330#file-unicodetools-windows-cheat-sheet-md into my shell, but note:

  1. this is PowerShell, so the paths use backslashes;
  2. this assumes the remote corresponding to unicode-org is called la-vache, which might not be your case.

@markusicu
Member Author

> I don’t rebase, I merge. I do that by ...

Thanks, I adjusted those commands to my environment:

  • Linux
  • out-of-source build
  • Before running the commands below, I synced my local "main" with upstream as usual:
git merge main
# complains about merge conflicts as expected
git checkout main unicodetools/data/ucd/dev/Derived*
git checkout main unicodetools/data/ucd/dev/extracted/*
git checkout main unicodetools/data/ucd/dev/auxiliary/*
rm -r ../Generated/BIN/16.0.0.0/
rm -r ../Generated/BIN/UCD_Data16.0.0.bin
mvn -s ~/.m2/settings.xml compile exec:java \
  -Dexec.mainClass="org.unicode.text.UCD.Main" \
  -Dexec.args="version 16.0.0 build MakeUnicodeFiles" \
  -am -pl unicodetools \
  -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) \
  -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) \
  -DUNICODETOOLS_REPO_DIR=$(pwd) \
  -DUVERSION=16.0.0
# fix merge conflicts in unicodetools/src/main/java/org/unicode/text/UCD/UCD_Types.java
#   and in UCD_Names.java
# rerun mvn
cp -r ../Generated/UCD/16.0.0/* unicodetools/data/ucd/dev
rm unicodetools/data/ucd/dev/ZZZ-UNCHANGED-*
rm unicodetools/data/ucd/dev/*/ZZZ-UNCHANGED-*
rm unicodetools/data/ucd/dev/extra/*
rm unicodetools/data/ucd/dev/cldr/*
git add unicodetools/src/main/java/org/unicode/text/UCD/UCD_Names.java
git add unicodetools/src/main/java/org/unicode/text/UCD/UCD_Types.java
git add unicodetools/data
git merge --continue

@markusicu
Member Author

@Ken-Whistler @Manishearth is U+11BE1 SUNUWAR SIGN PVO really punctuation (gc=Po)?

From L2/21-157R:

The SUNUWAR SIGN PVO represents an auspicious syllable, which is articulated as the unvoiced bilabial implosive /ɓ̥/ (formerly /ƥ/) and transcribed as ‘pvo’. In spoken language, the syllable is often utter twice, ie. “pvo, pvo...”

That is, it is pronounced as part of the text, as its own specific syllable. It does not serve the usual functions of punctuation (segmenting text, adding emphasis, etc.), which I think are usually silent or carried in pauses and intonation.

Shouldn't a syllable have gc=Lo?

@markusicu
Member Author

FYI: I looked over the list of UCD files, and the list of properties in PropList.txt, and don't see what else we might need here.

@Manishearth
Member

Manishearth commented Oct 11, 2023

Yeah that sounds congruent to an Om to me (ॐ, ꣽ, ଓଁ, ௐ). Lo makes sense.

Symbol might also make sense but I think Lo is the best choice here.

(Also very similar to a shri, श्री, which is not encoded atomically in Unicode but is similarly often ornate/distinct in some scripts. It has caused problems in Tamil in the past, since the ornateness has disconnected it from its etymological roots, which match its encoding model.)
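
For reference, the General_Category values of the Om signs cited in this comment can be checked with Python's standard `unicodedata` module. This is only a minimal sketch: the code points are my reading of the characters quoted above (the Odia sign is assumed to be the sequence U+0B13 U+0B01), and U+11BE1 itself is left out because an interpreter bundling Unicode data older than 16.0 does not know it.

```python
import unicodedata

# Om signs cited in the comment above (code points are an assumption based on
# the quoted glyphs; the Odia sign is a two-code-point sequence).
OM_SIGNS = {
    "Devanagari": "\u0950",
    "Devanagari (Jain)": "\ua8fd",
    "Odia": "\u0b13\u0b01",
    "Tamil": "\u0bd0",
}


def categories(text):
    """Return the General_Category of each code point in text."""
    return [unicodedata.category(ch) for ch in text]


for script, sign in OM_SIGNS.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in sign)
    print(f"{script}: {code_points} -> {categories(sign)}")
```

The atomically encoded Om signs come out as gc=Lo, which is the precedent being appealed to here; the Odia sequence is a letter (Lo) followed by a nonspacing mark (Mn).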

@markusicu added the "question" label (Further information is requested) Oct 11, 2023
@markusicu
Member Author

I see that the CI checks show a TestTestUnicodeInvariants failure, but I don't get that locally :-/

@eggrobin
Member

> I see that the CI checks show a TestTestUnicodeInvariants failure

I don’t see that; I see
✔️ build.md / Check UCD consistency, invariants, smoke-test generators (pull_request) Successful in 12m Details

The failure was on your merge commit, from before you updated Scripts.txt: https://github.com/unicode-org/unicodetools/actions/runs/6488178531/job/17620064012. That is fixed now.

@markusicu
Member Author

> > I see that the CI checks show a TestTestUnicodeInvariants failure
>
> I don’t see that; I see ✔️ build.md ...

Huh. My GitHub PR window kept updating itself with all kinds of changes, but not with the CI checks. After refreshing the window, all is well.

@markusicu
Member Author

Regarding sign pvo: The proposal has a Punctuation section: “There is no script-specific punctuation.”

@Ken-Whistler
Contributor

I disagree. Just because a punctuation sign has a conventional syllable associated with it doesn't make it a letter for our purposes. "ken@unicode.org" is read "ken at unicode dot org", and if people start decorating text with at signs or extra dots, you could read off that text, "@@@@flooby....", as "at at at at flooby dot dot dot dot". That doesn't make "@" or "." letters, just because they are associated with conventional syllables in English (or other languages).

If we change this one to gc=Lo in Sunuwar, then we are opening a can of worms for pushpika in Devanagari, as well as possibly siddham signs in various scripts. The proposal is inconsistent in claiming that it doesn't add script-specific punctuation, but turns around and encodes pushpika. That claim in the proposal is not dispositive.

@markusicu
Member Author

After adding the Script_Extensions data from the proposal, I get these failures:

Error:  Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 11.961 s <<< FAILURE! - in org.unicode.test.TestSecurity
Error:  org.unicode.test.TestSecurity.TestScriptDetection  Time elapsed: 0.006 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: ̀ ==> expected: <[[Common]]> but was: <[[Sunuwar]]>
	at org.unicode.test.TestSecurity.TestScriptDetection(TestSecurity.java:473)

Error:  org.unicode.test.TestSecurity.TestWholeScripts  Time elapsed: 0.023 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: idSet=[:any:], source= ̀, scripts= [[Sunuwar]] ==> expected: <[SAME]> but was: <[COMMON]>
	at org.unicode.test.TestSecurity.TestWholeScripts(TestSecurity.java:569)

@markusicu
Member Author

> Error:  Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 11.961 s <<< FAILURE! - in org.unicode.test.TestSecurity
> Error:  org.unicode.test.TestSecurity.TestScriptDetection  Time elapsed: 0.006 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: ̀ ==> expected: <[[Common]]> but was: <[[Sunuwar]]>
> 	at org.unicode.test.TestSecurity.TestScriptDetection(TestSecurity.java:473)

This is a unit test for Unicode Tools class ScriptDetector, with expected sets of scripts for a small set of strings.
If the proposed Script_Extensions for Sunuwar are appropriate, then I guess we need to just update the expectations; possibly use different input strings for Common that still have no scx values.
@macchiati

@markusicu
Member Author

> Error:  org.unicode.test.TestSecurity.TestWholeScripts  Time elapsed: 0.023 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: idSet=[:any:], source= ̀, scripts= [[Sunuwar]] ==> expected: <[SAME]> but was: <[COMMON]>
> 	at org.unicode.test.TestSecurity.TestWholeScripts(TestSecurity.java:569)

This test code exercises CheckWholeScript.getConfusables() for some strings.
For "\u0300"

  • it expects to get the status SAME which means “There is a same-script confusable, like "google" and "goog1e"”
  • it now gets the status COMMON: “There is a common-only confusable, like "l" and "1"”
  • the set of scripts is from the ScriptDetector and now picks up the new scx value for this common diacritic

This seems troubling. Assuming that this code follows the security spec, it might tell us that we can expect lots of confusables implementations to get very different results for the affected common diacritics.
@macchiati @eggrobin @asmusf
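
To make the concern concrete, here is a toy model of script-set resolution, loosely in the spirit of the resolved script set in UTS #39. It is not the Unicode Tools `ScriptDetector` code: the function names and the tiny data tables are hypothetical and hand-written for illustration, not read from the UCD.

```python
# ALL stands for "compatible with every script": the toy treatment of
# Common/Inherited characters that have no Script_Extensions entry.
ALL = None


def char_script_set(ch, scx):
    """Return the script set of one character, or ALL if it is unconstrained."""
    scripts = scx.get(ch)
    if scripts is None or scripts <= {"Common", "Inherited"}:
        return ALL
    return frozenset(scripts)


def resolve(text, scx):
    """Intersect the script sets of all characters in text."""
    result = ALL
    for ch in text:
        scripts = char_script_set(ch, scx)
        if scripts is ALL:
            continue
        result = scripts if result is ALL else result & scripts
    return result


# Hypothetical data: U+0300 before and after it gets scx={Sunuwar}.
BEFORE = {"a": {"Latin"}, "\u0300": {"Inherited"}}
AFTER = {"a": {"Latin"}, "\u0300": {"Sunuwar"}}

for label, table in (("before", BEFORE), ("after", AFTER)):
    print(label, resolve("\u0300", table), resolve("a\u0300", table))
```

Under these assumptions, a lone U+0300 goes from unconstrained to Sunuwar-only, and Latin "a" followed by U+0300 resolves to the empty set, that is, it would look like a mixed-script string. Real implementations may differ in detail, but this is the kind of shift the failing expectations point to.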

@Manishearth
Member

> I disagree. Just because a punctuation sign has a conventional syllable associated with it doesn't make it a letter for our purposes

What tips this over the edge for me is that it has a conventional syllable that is not otherwise representable in the script: the script represents bilabial implosives but not unvoiced ones (which are an extremely rare sound in general).

I'm still somewhat torn. I see a case for Lo, I see a case for Po [1]. I think it boils down to whether this is primarily a written-register thing or also a spoken one: is it an utterance that has a corresponding glyph, or is it a glyph that has a corresponding "read out loud" utterance that would not otherwise be used in normal speech? The proposal itself doesn't give us a clear answer there; we could ask Anshu though.

Somewhat reminded of Sumero-Akkadian DINGIR 𒀭 or LUGAL 𒈗 when used as determinatives for the latter case (even though those are encoded as letters: if they were only used in this fancy-marker way I would see the argument for Po).

I'm not clear how pushpika is related (it's a lacuna marker, right?), but I do take the point about siddham. Consistency with siddham is an easy argument to make here and one I'd accept.

cc @roozbehp

Footnotes

  1. I also think there may be a case for Symbol but I don't care about that that much

@eggrobin
Member

> Sumero-Akkadian DINGIR 𒀭 or LUGAL 𒈗 when used as determinatives

Off-topic, but I don’t recall seeing 𒈗 used as a determinative (LÚ 𒇽 is).

@markusicu
Member Author

@roozbehp @Ken-Whistler the proposed additions of "Sunu" Script_Extensions for commonly used (sc=Inherited) combining marks like U+0300 are causing problems with the Unicode Tools confusables implementation, suggesting that they would also cause problems in real implementations.

@eggrobin and I suspect that we should back those out. These characters are used in many scripts, and if the assignment of scx=Sunu is not some kind of new thing that is intentional, and intended to be expanded, then we should revert it.

It seems like either we should have a comprehensive scx set of scripts for a combining mark or we should leave it Inherited with no scx.

Do you agree?

This reverts commit 050d5a1.
This reverts commit bdc3e79.
@markusicu removed the "question" label (Further information is requested) Oct 12, 2023
@markusicu marked this pull request as ready for review October 12, 2023 18:24
@markusicu
Member Author

markusicu commented Oct 12, 2023

Re pvo:

  • it is the Sunuwar version of a “sign siddham”
  • making it a letter auto-adds it into identifiers, for which we have no evidence
  • Ken thinks that for Om there was evidence of alternatively spelling it with regular letters, without real distinction
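
The identifier point can be illustrated with Python's own identifier rules, which are based on XID_Start / XID_Continue and so are only one profile of the default identifier syntax. A minimal sketch, using already-encoded characters as stand-ins because U+11BE1 is unknown to interpreters with pre-16.0 Unicode data:

```python
import unicodedata

# Stand-in characters: one letter-like sign and two punctuation marks.
SAMPLES = {
    "DEVANAGARI OM": "\u0950",
    "DEVANAGARI SIGN SIDDHAM": "\ua8fc",
    "COMMERCIAL AT": "@",
}


def can_start_identifier(ch):
    """True if ch on its own is a valid Python identifier."""
    return ch.isidentifier()


for name, ch in SAMPLES.items():
    print(
        f"U+{ord(ch):04X} {name}: gc={unicodedata.category(ch)}, "
        f"identifier start={can_start_identifier(ch)}"
    )
```

The gc=Lo Om is accepted as an identifier, while the gc=Po siddham sign and "@" are not, which is the side effect a change from Po to Lo would bring along.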

@markusicu
Member Author

Re scx=Sunu: SAH had agreed already to remove these, see https://github.com/unicode-org/sah/issues/32#issuecomment-1760147420

@markusicu merged commit beeef7c into unicode-org:main Oct 12, 2023
9 of 10 checks passed
@markusicu deleted the add-script-sunuwar branch October 12, 2023 21:01
@r12a

r12a commented Dec 6, 2023

The choice of Po seems odd to me. My understanding is that this is not a case of /ɓ/ describing a punctuation mark (as in ken at unicode dot org), but rather that this character describes a spoken sound (like any other letter). Here are my notes:

𑯡 U+11BE1 SUNUWAR SIGN PVO represents an 'auspicious syllable', which is uttered, often twice, before a formulaic phrase. The sign is written in salutations and benedictions, and its basic trident shape can vary in the details.

Information provided by Lal Rapacha: There is another bilabial voiceless implosive ƥ (also written as ɓ̥) as well but very rare, historical, and now obsolete or extinct. Kõits as one of the Western Kiranti languages of eastern Nepal has a relationship with Kiranti-Bayung, historically noted for such implosive sounds. Another neighbouring Kiranti-RaDhu or Wambule language has /ɓ/, /ƥ/, and /ɗ/ implosive sounds in its phonology.

If gc=Lo is problematic, I would be inclined to make this gc=So, rather than Po.

@markusicu
Member Author

Regarding sign PVO:

> If gc=Lo is problematic, I would be inclined to make this gc=So, rather than Po.

I created https://github.com/unicode-org/properties/issues/207. Please say there why you would prefer So over Po.
