new script Sunuwar first data #375

markusicu · 2022-12-20T01:06:44Z

[170-C8] Consensus: UTC accepts 44 Sunuwar characters in a new Sunuwar block (U+11BC0..U+11BFF), as documented in L2/21-157R, for encoding in a future version of the standard.

RMG tracking: https://github.com/unicode-org/utc-release-management/issues/32

Obsolete: https://github.com/unicode-org/properties/issues/45

eggrobin · 2023-03-06T14:02:17Z

I notice the date on LineBreak.txt is not getting bumped by this PR, nor does that one appear to be getting produced in Generated. Is that file manually maintained?

eggrobin · 2023-03-06T15:17:48Z

(You also need to update Scripts.txt, which is what the invariants tests are complaining about.)

markusicu · 2023-03-06T16:29:47Z

This is a draft PR that I worked on during one live work session. A lot of data is missing.

LineBreak.txt is an input file for the Unicode Tools. I believe that KenW has a C tool with heuristics that generates some initial data for Line_Break and other properties for new code points.

As discussed elsewhere, one challenge will be how to keep these kinds of branches up to date, and later merge them, given that many changes touch the same files and lines and create conflicts. In some cases, we can probably simply regenerate the files, but that won't help when all new characters go into the same place. At a minimum, we need to try this out and write up instructions for how best to work through the various types of conflicts.

markusicu · 2023-10-11T16:57:12Z

@eggrobin I should rebase this PR before we continue. Do you have a link to the recipe for that? Such as, how to deal with conflicts during rebase, which files to revert (all & only the "extracted/" ones?), etc.

eggrobin · 2023-10-11T16:59:46Z

I don’t rebase, I merge.
I do that by pasting the first block of commands in https://gist.github.com/eggrobin/e902e30952c2f25b59fd0b285168a330#file-unicodetools-windows-cheat-sheet-md into my shell, but note:

this is PowerShell and has backslashes;
this assumes the remote corresponding to unicode-org is called la-vache, which might not be the case for you.

markusicu · 2023-10-11T21:28:45Z

I don’t rebase, I merge. I do that by ...

Thanks, I adjusted those commands to my environment:

Linux
out-of-source build
Before doing the following, I sync'ed my local "main" with upstream as usual

git merge main
# complains about merge conflicts as expected
git checkout main unicodetools/data/ucd/dev/Derived*
git checkout main unicodetools/data/ucd/dev/extracted/*
git checkout main unicodetools/data/ucd/dev/auxiliary/*
rm -r ../Generated/BIN/16.0.0.0/
rm -r ../Generated/BIN/UCD_Data16.0.0.bin
mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.UCD.Main"  -Dexec.args="version 16.0.0 build MakeUnicodeFiles" -am -pl unicodetools  -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd)  -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd)  -DUNICODETOOLS_REPO_DIR=$(pwd)  -DUVERSION=16.0.0
# fix merge conflicts in unicodetools/src/main/java/org/unicode/text/UCD/UCD_Types.java
#   and in UCD_Names.java
# rerun mvn
cp -r ../Generated/UCD/16.0.0/* unicodetools/data/ucd/dev
rm unicodetools/data/ucd/dev/ZZZ-UNCHANGED-*
rm unicodetools/data/ucd/dev/*/ZZZ-UNCHANGED-*
rm unicodetools/data/ucd/dev/extra/*
rm unicodetools/data/ucd/dev/cldr/*
git add unicodetools/src/main/java/org/unicode/text/UCD/UCD_Names.java
git add unicodetools/src/main/java/org/unicode/text/UCD/UCD_Types.java
git add unicodetools/data
git merge --continue

markusicu · 2023-10-11T22:10:21Z

@Ken-Whistler @Manishearth is U+11BE1 SUNUWAR SIGN PVO really punctuation (gc=Po)?

From L2/21-157R:

The SUNUWAR SIGN PVO represents an auspicious syllable, which is articulated as the unvoiced bilabial implosive /ɓ̥/ (formerly /ƥ/) and transcribed as ‘pvo’. In spoken language, the syllable is often utter twice, ie. “pvo, pvo...”

That is, it is pronounced as part of the text, as its own specific syllable, rather than serving the usual functions of punctuation (segmenting text, emphasizing, etc.), which I think are usually silent or carried in pauses and intonation.

Shouldn't a syllable have gc=Lo?

markusicu · 2023-10-11T22:14:27Z

FYI: I looked over the list of UCD files, and the list of properties in PropList.txt, and don't see what else we might need here.

Manishearth · 2023-10-11T22:16:54Z

Yeah that sounds congruent to an Om to me (ॐ, ꣽ, ଓଁ, ௐ). Lo makes sense.

Symbol might also make sense but I think Lo is the best choice here.

(Also very similar to a shri श्री which is not encoded atomically in Unicode; but is similarly often ornate/distinct in some scripts and has caused problems in Tamil in the past since the ornateness has disconnected it from its etymological roots which match its encoding model)

markusicu · 2023-10-11T22:35:06Z

I see that the CI checks show a TestTestUnicodeInvariants failure, but I don't get that locally :-/

eggrobin · 2023-10-11T22:51:53Z

I see that the CI checks show a TestTestUnicodeInvariants failure

I don’t see that ; I see
✔️ build.md / Check UCD consistency, invariants, smoke-test generators (pull_request) Successful in 12m Details

You got a failure on your merge commit, before updating Scripts.txt, which was because you had not updated Scripts.txt: https://github.com/unicode-org/unicodetools/actions/runs/6488178531/job/17620064012. But this is fixed now.

markusicu · 2023-10-11T23:22:28Z

I see that the CI checks show a TestTestUnicodeInvariants failure

I don’t see that ; I see ✔️ build.md ...

Huh. My GitHub PR window kept updating itself with all kinds of changes, but not with the CI checks. After refreshing the window, all is well.

markusicu · 2023-10-11T23:24:55Z

Regarding sign pvo: The proposal has a Punctuation section: “There is no script-specific punctuation.”

Ken-Whistler · 2023-10-11T23:36:07Z

I disagree. Just because a punctuation sign has a conventional syllable associated with it doesn't make it a letter for our purposes. "ken@unicode.org" is read "ken at unicode dot org", and if people start decorating text with at signs or extra dots, you could read off that text, "@@@@flooby....", as "at at at at flooby dot dot dot dot". That doesn't make "@" or "." letters, just because they are associated with conventional syllables in English (or other languages). If we change this one to gc=Lo in Sunuwar, then we are opening a can of worms for pushpika in Devanagari, as well as possibly siddham signs in various scripts. The proposal is inconsistent in claiming that it doesn't add script-specific punctuation while turning around and encoding pushpika. That claim in the proposal is not dispositive.
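
For readers who want to check the General_Category values being compared here, the following is a minimal sketch using Python's standard `unicodedata` module. The helper name `gc` and the sample table are illustrative only, and the result depends on the Unicode version bundled with the interpreter; Sunuwar itself is only encoded in a later version, so the sketch sticks to long-encoded characters.

```python
import unicodedata

def gc(ch: str) -> str:
    """Return the General_Category short alias for a single character."""
    return unicodedata.category(ch)

# Illustrative samples: two ASCII punctuation marks and DEVANAGARI OM.
samples = {
    "COMMERCIAL AT": "@",
    "FULL STOP": ".",
    "DEVANAGARI OM": "\u0950",
}

categories = {name: gc(ch) for name, ch in samples.items()}
for name, cat in categories.items():
    print(f"{name}: gc={cat}")
```

The at sign and full stop report Po, while DEVANAGARI OM reports Lo, which is the contrast the two sides of this discussion are drawing on.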

markusicu · 2023-10-12T00:31:10Z

After adding the Script_Extensions data from the proposal, I get these failures:

Error:  Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 11.961 s <<< FAILURE! - in org.unicode.test.TestSecurity
Error:  org.unicode.test.TestSecurity.TestScriptDetection  Time elapsed: 0.006 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: ̀ ==> expected: <[[Common]]> but was: <[[Sunuwar]]>
	at org.unicode.test.TestSecurity.TestScriptDetection(TestSecurity.java:473)

Error:  org.unicode.test.TestSecurity.TestWholeScripts  Time elapsed: 0.023 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: idSet=[:any:], source= ̀, scripts= [[Sunuwar]] ==> expected: <[SAME]> but was: <[COMMON]>
	at org.unicode.test.TestSecurity.TestWholeScripts(TestSecurity.java:569)

markusicu · 2023-10-12T00:40:54Z

Error:  Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 11.961 s <<< FAILURE! - in org.unicode.test.TestSecurity
Error:  org.unicode.test.TestSecurity.TestScriptDetection  Time elapsed: 0.006 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: ̀ ==> expected: <[[Common]]> but was: <[[Sunuwar]]>
	at org.unicode.test.TestSecurity.TestScriptDetection(TestSecurity.java:473)

This is a unit test for Unicode Tools class ScriptDetector, with expected sets of scripts for a small set of strings.
If the proposed Script_Extensions for Sunuwar are appropriate, then I guess we need to just update the expectations; possibly use different input strings for Common that still have no scx values.
@macchiati
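
To illustrate why the expectation changed, here is a minimal sketch of script-set resolution in the spirit of the resolved script sets described in the Unicode security spec. It is not the Unicode Tools ScriptDetector code; the tables `SCRIPT`, `NO_SCX`, and `PROPOSED_SCX` are hypothetical and contain only the two code points needed for the illustration.

```python
# Sentinel meaning "every script": used for Common/Inherited characters
# that have no Script_Extensions value.
ALL = None

# Hypothetical mini-tables, not real UCD data files.
SCRIPT = {0x0061: "Latin", 0x0300: "Inherited"}
NO_SCX = {}                                       # U+0300 without scx
PROPOSED_SCX = {0x0300: frozenset({"Sunuwar"})}   # scx as taken from the proposal

def script_set(cp, scx):
    """Script set of one code point: scx if present, else derived from sc."""
    if cp in scx:
        return set(scx[cp])
    sc = SCRIPT[cp]
    return ALL if sc in ("Common", "Inherited") else {sc}

def resolved_script_set(text, scx):
    """Intersect the script sets of all characters in the string."""
    result = ALL
    for ch in text:
        s = script_set(ord(ch), scx)
        if s is ALL:
            continue  # intersecting with "every script" changes nothing
        result = s if result is ALL else result & s
    return result

for label, scx in (("no scx", NO_SCX), ("proposed scx", PROPOSED_SCX)):
    print(label, resolved_script_set("\u0300", scx), resolved_script_set("a\u0300", scx))
```

Under these assumptions, U+0300 alone resolves to "every script" without scx but to only Sunuwar with the proposed value, and "a" followed by U+0300 goes from Latin to an empty set, which is the kind of shift the failing test picks up.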

markusicu · 2023-10-12T00:53:50Z

Error:  org.unicode.test.TestSecurity.TestWholeScripts  Time elapsed: 0.023 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: idSet=[:any:], source= ̀, scripts= [[Sunuwar]] ==> expected: <[SAME]> but was: <[COMMON]>
	at org.unicode.test.TestSecurity.TestWholeScripts(TestSecurity.java:569)

This test code exercises CheckWholeScript.getConfusables() for some strings.
For "\u0300":

it expects to get the status SAME which means “There is a same-script confusable, like "google" and "goog1e"”
it now gets the status COMMON: “There is a common-only confusable, like "l" and "1"”
the set of scripts is from the ScriptDetector and now picks up the new scx value for this common diacritic

This seems troubling. Assuming that this code follows the security spec, it might tell us that we can expect lots of confusables implementations to get very different results for the affected common diacritics.
@macchiati @eggrobin @asmusf

Manishearth · 2023-10-12T03:02:26Z

I disagree. Just because a punctuation sign has a conventional syllable associated with it doesn't make it a letter for our purposes

That it has a conventional syllable that is not otherwise representable in the script is what tips this over the edge for me: the script represents bilabial implosives but not unvoiced ones (which are an extremely rare sound in general).

I'm still somewhat torn. I see a case for Lo, I see a case for Po¹. I think it boils down to whether this is primarily a written register thing or also a spoken one: is it an utterance that has a corresponding glyph, or is it a glyph that has a corresponding "read out loud" utterance that would not otherwise be used in normal speech? The proposal itself doesn't give us a clear answer there; we could ask Anshu though.

Somewhat reminded of Sumero-Akkadian DINGIR 𒀭 or LUGAL 𒈗 when used as determinatives for the latter case (even though those are encoded as letters: if they were only used in this fancy-marker way I would see the argument for Po).

I'm not clear how pushpika is related (it's a lacuna marker, right?), but I do take the point about siddham. Consistency with siddham is an easy argument to make here and one I'd accept.

cc @roozbehp

¹ I also think there may be a case for Symbol but I don't care about that that much.

eggrobin · 2023-10-12T07:08:55Z

Sumero-Akkadian DINGIR 𒀭 or LUGAL 𒈗 when used as determinatives

Off-topic, but I don’t recall seeing 𒈗 used as a determinative (LÚ 𒇽 is).

markusicu · 2023-10-12T17:16:50Z

@roozbehp @Ken-Whistler the proposed additions of "Sunu" Script_Extensions for commonly used (sc=Inherited) combining marks like U+0300 are causing problems with the Unicode Tools confusables implementation, suggesting that they would also cause problems in real implementations.

@eggrobin and I suspect that we should back those out. These characters are used in many scripts, and if the assignment of scx=Sunu is not some kind of new thing that is intentional, and intended to be expanded, then we should revert it.

It seems like either we should have a comprehensive scx set of scripts for a combining mark or we should leave it Inherited with no scx.

Do you agree?

markusicu · 2023-10-12T20:48:53Z

Re pvo:

it is the Sunuwar version of a “sign siddham”
making it a letter auto-adds it into identifiers, for which we have no evidence
Ken thinks that for Om there was evidence of alternatively spelling it with regular letters, without real distinction
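
As an illustration of the identifier point, here is a minimal sketch using Python, whose identifier syntax is based on the XID_Start and XID_Continue properties. It is one example of a consumer of those properties, not a statement about every identifier implementation; the helper name is hypothetical, and long-encoded characters stand in for the Sunuwar sign.

```python
import unicodedata

def accepted_as_identifier(ch: str) -> bool:
    """True if Python accepts the single character as a complete identifier."""
    return ch.isidentifier()

om = "\u0950"  # DEVANAGARI OM, a letter (gc=Lo)
at = "@"       # COMMERCIAL AT, punctuation (gc=Po)

for ch in (om, at):
    print(
        f"U+{ord(ch):04X} gc={unicodedata.category(ch)} "
        f"identifier={accepted_as_identifier(ch)}"
    )
```

The letter is accepted as an identifier and the punctuation mark is not, which is the automatic consequence that a gc=Lo assignment would carry.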

markusicu · 2023-10-12T20:59:56Z

Re scx=Sunu: SAH had agreed already to remove these, see https://github.com/unicode-org/sah/issues/32#issuecomment-1760147420

r12a · 2023-12-06T05:16:58Z

The choice of Po seems odd to me. My understanding is that this is not a case of /ɓ/ describing a punctuation mark (as in ken at unicode dot org), but rather that this character describes a spoken sound (like any other letter). Here are my notes:

𑯡 U+11BE1 SUNUWAR SIGN PVO represents an 'auspicious syllable', which is uttered, often twice, before a formulaic phrase. The sign is written in salutations and benedictions, and its basic trident shape can vary in the details.

Information provided by Lal Rapacha: There is another bilabial voiceless implosive ƥ (also written as ɓ̥) as well but very rare, historical, and now obsolete or extinct. Kõits as one of the Western Kiranti languages of eastern Nepal has a relationship with Kiranti-Bayung, historically noted for such implosive sounds. Another neighbouring Kiranti-RaDhu or Wambule language has /ɓ/, /ƥ/, and /ɗ/ implosive sounds in its phonology.

If gc=Lo is problematic, i would be inclined to make this gc=So, rather than Po.

markusicu · 2023-12-06T22:08:01Z

Regarding sign PVO:

If gc=Lo is problematic, i would be inclined to make this gc=So, rather than Po.

I created https://github.com/unicode-org/properties/issues/207. Please say there why you would prefer So over Po.

markusicu added 2 commits December 19, 2022 17:05

new script Sunuwar first data

68b9004

Sunuwar first generated data files

616f2e8

eggrobin added the data-for-new label Apr 18, 2023

eggrobin added the pipeline-16.0 label Sep 22, 2023

Merge branch 'main' into add-script-sunuwar

9e852f0

markusicu added 2 commits October 11, 2023 14:58

script aliases & ranges

63b5563

regen Scripts.txt

17ef16d

markusicu added the question Further information is requested label Oct 11, 2023

markusicu added 2 commits October 11, 2023 16:26

scx from proposal

bdc3e79

regen scx

050d5a1

markusicu added 2 commits October 12, 2023 11:02

Revert "regen scx"

f43c415

This reverts commit 050d5a1.

Revert "scx from proposal"

0e3de85

This reverts commit bdc3e79.

markusicu removed the question Further information is requested label Oct 12, 2023

markusicu marked this pull request as ready for review October 12, 2023 18:24

markusicu requested review from Manishearth, eggrobin and josh-hadley October 12, 2023 18:26

Manishearth approved these changes Oct 12, 2023

markusicu merged commit beeef7c into unicode-org:main Oct 12, 2023
9 of 10 checks passed

markusicu deleted the add-script-sunuwar branch October 12, 2023 21:01

srl295 mentioned this pull request Oct 18, 2023

jsp: Update JSPs to 15.1.0 #577

Merged

eggrobin mentioned this pull request Oct 24, 2023

Check in my UCD checklist #585

Merged
