UCA 16.0 from Ken #571

Merged: 14 commits, Nov 23, 2023

Conversation

markusicu (Member):

Successive changes from Ken for adding new characters to the default sort order.
Ken's files are taken verbatim; each commit message quotes the corresponding per-delta progress notes from Ken's UCA16Journal.txt.

From Ken:

Simply copy unidata-15.1.0d5.txt (7/28/2023 revision used for 15.1.0 release)
to unidata-16.0.0d1.txt and update internal date, version, etc.

Run existing 15.1.0 sifter executable with the 16.0.0 updated library for
properties, case mapping, etc., to process unidata-16.0.0d1.txt.
Verify that the output allkeys.txt is identical to the allkeys.txt
released for 15.1.0, except for the generated date header.
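
That verification (identical output except for the generated date header) can be scripted; a minimal Python sketch, assuming the header lines to ignore are the '#'-prefixed comment lines in allkeys.txt (function names are illustrative):

```python
def significant_lines(text):
    """Keep only data lines; drop '#' comment lines (which include the
    generated date header) and blank lines."""
    return [line for line in text.splitlines()
            if line.strip() and not line.startswith("#")]

def same_except_header(old_text, new_text):
    """True if two allkeys.txt snapshots differ at most in comment headers."""
    return significant_lines(old_text) == significant_lines(new_text)
```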

Current initial state archived as:

unidata-16.0.0d1.txt (1550494 bytes, 10/06/2023)

Process the diff between the released 15.1.0 UnicodeData.txt
(UnicodeData-15.1.0d3.txt) and the current latest draft of the
16.0.0 UnicodeData.txt (UnicodeData-16.0.0d7.txt). Clean this up
to just a list of all the 1177 new UnicodeData.txt records for
16.0. Run the results through a small transducer that snips out
the fields not used for the unidata.txt input for the sifter.
(This is just a simple utility I have had for years -- it would
be easy to replicate in Perl or Python, as needed.) The result
is archived as:

uc151to160add.txt (49212 bytes, 10/07/2023)

That is the source of the data fields to paste into the evolving
draft of unidata.txt for 16.0. I do it this way because the
bookkeeping is all automatic. I need to find the right place in unidata.txt
for all 1177 lines, and when the input from uc151to160add.txt has
dwindled down to 0 lines left to transfer, I know I'm completely
done.
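
The field-snipping transducer could be replicated along these lines; a Python sketch, with the kept fields (code, name, general category, decomposition) inferred from the sample unidata.txt records quoted in these notes rather than from the transducer itself:

```python
def to_unidata_record(ucd_line):
    """Snip a 15-field UnicodeData.txt record down to the 9-field shape
    seen in unidata.txt (code;name;gc;decomp;;;;;). Kept fields are an
    inference from the sample records in these notes; any artificial
    decompositions are still added by hand afterward."""
    fields = ucd_line.rstrip("\n").split(";")
    code, name, gc, decomp = fields[0], fields[1], fields[2], fields[5]
    return ";".join([code, name, gc, decomp] + [""] * 5)
```

Running the 1177 new records through something like this, then counting the lines left to transfer, reproduces the automatic bookkeeping described above.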
From Ken:

For this delta, search the input for all the new lines that can be
intercalated into unidata.txt without affecting any primary weights
or introducing any new secondary weights.

1. Move the new uppercase for 0264 (A7CB) into unidata.txt right below
the entry for 0264. This just introduces the new uppercase, and does
not impact the primary weight sequence at all.

Verify by generating allkeys.txt and examining the diff.

2. Look for any new combining marks for Brahmic scripts that should be
equated with the existing weights for Devanagari candrabindu, anusvara,
and visarga. (This is a regular feature of DUCET now, to avoid the
unnecessary proliferation of secondary weights for these for each
new script that has them, because they are never mixed and matched
across scripts.) The clear candidate for 16.0 is Tulu-Tigalari. The
3 characters in question are:

113CA;TULU-TIGALARI SIGN CANDRA ANUNASIKA;Mc;0901;;;;;
113CC;TULU-TIGALARI SIGN ANUSVARA;Mc;0902;;;;;
113CD;TULU-TIGALARI SIGN VISARGA;Mc;0903;;;;;

Checking the latest proposal document (L2/23-031) verifies that
"CANDRA ANUNASIKA" is the Tulu-Tigalari candrabindu analog.
These three lines are copied into unidata.txt right below the
corresponding entries for the Grantha analogs (closest related
script, as well as sequential in code point order). The artificial
decompositions 0901, 0902, and 0903 are manually added to the
entries.

Verify by generating allkeys.txt and examining the diff.

3. Rinse and repeat for the next new Brahmic script with any of these
three characters: Gurung Khema:

1612D;GURUNG KHEMA SIGN ANUSVARA;Mn;0902;;;;;

The order of intercalation for these entries is not critical, because they
are just being equated to something else. But I put it after 11F03,
KAWI SIGN VISARGA, to keep it in code point order in the input file.
The artificial decomposition 0902 is manually added to the entry.

Verify by generating allkeys.txt and examining the diff.

4. Note that the Kirat Rai ANUSVARA, TONPI (a bindu), and VISARGA should
*not* be equated to the Devanagari combining marks. Kirat Rai was
deliberately encoded more like an alphabet, even though it minimally
qualified as an abugida because of the inherent vowel and the
existence of a killer. It is best for Kirat Rai to just give these
three characters a primary weight in the code point order. So I
skip over them for this draft of unidata.txt. They will be processed
later when adding all the primary weights for Kirat Rai. So a no-op
at this point in the processing.

5. Note any obvious sets of compatibility decompositions that will
just result in more equivalences without impacting primary weights.
The obvious set for 16.0 is the outlined Latin capital letters in
the legacy computer symbols repertoire, 1CCD6..1CCEF. Move these
entries into unidata.txt just before the very similar set of
squared Latin capital letters, 1F130..1F149. Again, the exact placement
in unidata.txt doesn't matter, because these will all end up equated
to existing other weights, but putting 1CCD6..1CCEF in that location
makes the parallel collation treatment obvious and makes these easier
to track in the input file. No manual modification of the decomposition is
needed, as these all already have formal compatibility decompositions.

Verify by generating allkeys.txt and examining the diff.

At this point, the obvious candidates (other than digits) have been
taken care of. 31 down, 1146 to go. Note that to this point, not only
is the diff for unidata.txt easy to examine, but also the diff for
allkeys.txt is still well-formed and easy to interpret.

Archive this delta 2:

unidata-16.0.0d2.txt (1552253 bytes, 10/07/2023)
From Ken:

Delta 3 processing will focus on the digits, which if intercalated
correctly, will also not impact any primary weights. The work for these
is fairly tedious for 16.0, because there are eight new sets of digits
in the repertoire. This is also the point where it makes sense to come
to grips with where to put all the new scripts in the overall order,
so that the points where the digits are intercalated into the input
file are reasonably consistent with each other and with the decision
about the overall order of new scripts.

Looking at all the new scripts, I suggest the following:

Tulu-Tigalari should go right after Grantha. This matches code point
order (intentional in the Roadmap) and these two are historically close.

Todhri should go right after Vithkuqi. This matches code point order
(intentional in the Roadmap) and these two are both Albanian scripts.

Sunuwar should go right after Tangsa, another con-script used in
the same general area of India, NE Myanmar and the Himalayas.

Gurung Khema should go right after Sunuwar. It is also a con-script
used in Nepal and Sikkim.

Kirat Rai should go right after Gurung Khema. It is also a con-script
used in NE India and Sikkim.

The ordering of Tangsa > Sunuwar > Gurung Khema > Kirat Rai will also
match the order of the block descriptions in the 16.0 core specification.

Ol Onal should go right after Ol Chiki, another con-script for a Munda
language spoken in the same general area of NE India. (This also matches
the order of the block descriptions in the 16.0 core specification.)

Garay is an African con-script for Wolof in Senegal. Its order is
rather arbitrary, but in the core specification we put it after
Medefaidrin (from Nigeria). We can do the same for UCA, which will
put it in unidata.txt after Medefaidrin and before Adlam.

1. Myanmar Extended-C digits

Myanmar Pa'O digits (116D0..116D9) and Myanmar Eastern Pwo Karen digits
(116DA..116E3) aren't from a new script. They can be slotted in right
after the Myanmar Tai Laing digit series. And they might as well be
handled together as a set of two.

For historic reasons, and in part because of the way the sifter works,
digits have been accumulated together in unidata.txt by the digit values,
so for each new set, the digit zero value needs to be intercalated in
the set of other script digits for zero, etc.

Verify by generating allkeys.txt and examining the diff.

One of the reasons I do this deliberately, and mostly one set of digits
at a time, is that I am also looking for any anomalies that might have
crept into my update of the underlying library. Unlike in ICU,
which I presume at this point just parses the entire UCD and autogenerates
its numeric tables, I still do some of this work of table updates by hand,
so occasionally I muff an update and need to make sure I'm getting the
correct 0..9 values as expected for each new set of digits added in a version.
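
The 0..9 sanity check for each new digit set can be automated; a Python sketch using the standard unicodedata module (whose UCD version in any given Python build may predate 16.0, so the new ranges themselves may not be checkable everywhere):

```python
import unicodedata

def check_digit_run(start):
    """True if the ten code points start..start+9 carry digit values 0..9."""
    return all(unicodedata.digit(chr(start + i), None) == i
               for i in range(10))
```

For example, check_digit_run(0x0966) confirms the Devanagari digits; with a 16.0 UCD behind the library, check_digit_run(0x116D0) would cover the Myanmar Pa'O set.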

2. Sunuwar digits

Sunuwar digits (11BF0..11BF9) get intercalated after Tangsa digits, per
the above scheme.

Verify by generating allkeys.txt and examining the diff.

3. Gurung Khema digits

Gurung Khema digits (16130..16139) get intercalated after Sunuwar digits.

Verify by generating allkeys.txt and examining the diff.

4. Kirat Rai digits

Kirat Rai digits (16D70..16D79) get intercalated after Gurung Khema digits.

Verify by generating allkeys.txt and examining the diff.

5. Ol Onal digits

Ol Onal digits (1E5F1..1E5FA) get intercalated after Ol Chiki digits.

Verify by generating allkeys.txt and examining the diff.

6. Garay digits

Garay digits (10D40..10D49) get intercalated after Medefaidrin digits.

Verify by generating allkeys.txt and examining the diff.

7. Outlined digits

The outlined digits (1CCF0..1CCF9) from the legacy computer symbols repertoire
should be treated analogously to the segmented digits. These all have
explicit <font> compatibility decompositions, which will impact their
tertiary weights. The entire range can be slotted into unidata.txt
as a chunk, right after the list of segmented digits.

Verify by generating allkeys.txt and examining the diff.

At this point I'm done with the digits. I do one more comprehensive diff
between the d2 and d3 version of allkeys.txt, to verify all still looks to
be in correct order. 80 more down, 1066 to go.

Archive this delta 3:

unidata-16.0.0d3.txt (155520 bytes, 10/07/2023)
From Ken:

After scanning the remaining new characters for possible candidates for
intercalation without impacting primary or secondary weights, I spied
the following candidates.

1. Nuktas

There was a stray nukta added, 11F5A KAWI SIGN NUKTA. Most of the nuktas
are also folded to a single secondary weight, so this one won't require a
new secondary weight:

11F5A;KAWI SIGN NUKTA;Mn;093C;;;;;

Note that an explicit decomposition to 093C is added here.

Verify by generating allkeys.txt and examining the diff.

2. Vedic tone marks

The pattern in DUCET for all Vedic tone marks is just to ignore them
completely for collation. There are a couple of Vedic tone marks in 16.0
added for Tulu-Tigalari. These can be added to the ranges of ignored
Vedic accents in unidata.txt.

Verify by generating allkeys.txt and examining the diff.

3. Garay combining marks

Garay has four diacritic combining marks, 10D6A..10D6D, which do not
participate in any canonical equivalences. The combining 10D69 GARAY VOWEL SIGN E
is highly significant to the orthography, so must get a primary weight.

10D6A;GARAY CONSONANT GEMINATION MARK;Mn;;;;;;
10D6B;GARAY COMBINING DOT ABOVE;Mn;;;;;;
10D6C;GARAY COMBINING DOUBLE DOT ABOVE;Mn;;;;;;
10D6D;GARAY CONSONANT NASALIZATION MARK;Mn;;;;;;

All 4 of these are non-spacing marks above.

The proposal is unclear about the collation implications of the gemination mark,
but it seems safe to presume it should be given a secondary weight and not be
ignored.

The dot above and double dot above are effectively two nuktas. The dot above is
used on one letter to mark a native Garay ŋ, as opposed to a prenasalization.
The dot above and double dot above are used as diacritics on another letter to
indicate two borrowed Arabic sounds. That means the dot above and double dot
above need to be distinguished from each other. For these diacritic-marked
letters there are no atomic encodings -- they can only be represented as sequences.
The collation information in the proposal indicates that the sequences (e.g.,
the two sequences representing Arabic sounds: <10D76, 10D6B>, <10D76, 10D6C>)
should get primary weights. DUCET does this kind of thing for atomic characters
which have canonical decompositions, but it does *not* do it for arbitrary
sequences, since there is no "target" in the encoding to assign the primary
weight for the sequence to. Conceivably, the sifter apparatus could be extended
to allow for weighting these ghost primaries, but not now for the 16.0 draft.

The nasalization mark (10D6D) isn't clearly exemplified, and the proposal
claims it is ignored for collation.

The gemination mark can occur over the GARAY VOWEL SIGN E and also over the
COMBINING DOT ABOVE. At a minimum, its secondary weight should be distinguished
from the COMBINING DOT ABOVE.

There are other complications for the Garay collation which mean it cannot
be fully handled by a default ordering in DUCET anyway.

The net conclusions I draw for now are that:

a. 10D69 GARAY VOWEL SIGN E needs a primary weight.
b. 10D6A GARAY CONSONANT GEMINATION MARK should get a script-specific secondary weight.
c. 10D6B GARAY COMBINING DOT ABOVE can be weighted as generic above [0033].
d. 10D6C GARAY COMBINING DOUBLE DOT ABOVE can be weighted as generic nukta [00C2].
e. 10D6D GARAY CONSONANT NASALIZATION MARK can be weighted as generic above [0033].

a) and b) will be dealt with later when doing all the primary weighting for Garay.
c), d) and e) won't impact primary or secondary weights already in DUCET, so I
am dealing with them now, and bleeding them out of the set of the Garay characters
to be weighted later. The relevant hacks for them are:

10D6B;GARAY COMBINING DOT ABOVE;Mn;F8F5;;;;;
10D6C;GARAY COMBINING DOUBLE DOT ABOVE;Mn;093C;;;;;
10D6D;GARAY CONSONANT NASALIZATION MARK;Mn;F8F5;;;;;

to equate them to a generic above secondary weight (F8F5) and to the generic
nukta (093C). These entries are added to unidata.txt in the section dealing
with secondary weighting, just ahead of the section discussing Adlam secondaries,
with notes explaining the generic weight decisions.

Verify by generating allkeys.txt and examining the diff.

This concludes all the apparent candidates that don't require new primary or
secondary weights -- although there are possibly a couple other gc=Mn that
might qualify. I'm deferring those to further analysis on a per-script basis.
6 more down, 1060 to go.

Archive this delta 4:

unidata-16.0.0d4.txt (1556145 bytes, 10/07/2023)
From Ken:

I'm now at the point where further updates to unidata.txt will render the
output file (allkeys.txt) effectively undiffable. I've never felt it was
worth the effort to try to write custom tooling that would keep track of
all the relative differences in weights and report changes on those differences.
I suppose that is an opportunity for somebody who feels ambitious.

1. New Latin case pair

A7CC/A7CD is a new Latin case pair for s with diagonal stroke. The
proposal is silent about collation for these, so I'm just intercalating
them with a primary weight difference after A7A9/A7AA s with short
stroke overlay.

Generate allkeys.txt and verify that A7CC/A7CD show up properly weighted
as a case pair following A7A9/A7AA in primary order.

2. New Cyrillic case pair

1C89/1C8A is a new Cyrillic case pair for tje. The proposal is silent
about collation for these, but states that this is a Khanty letter for
[t'], i.e. [tʲ]. I'm intercalating it between Cyrillic letter twe and
Cyrillic letter Komi tje, with a primary weight distinction.

Generate allkeys.txt and verify that 1C89/1C8A show up properly weighted
as a case pair following A68C/A68D in primary order.

3. Archaic letter SHRI for Kannada and Telugu

0C5C;TELUGU ARCHAIC SHRII;Lo;;;;;;
0CDC;KANNADA ARCHAIC SHRII;Lo;;;;;;

The proposal claims that these can be collated as equivalent to the word SHRII,
i.e., the spelled out sequences. I'm adding the relevant sequences,
i.e. SHA-virama-RA-II, as a <sort> decomposition for each of these two:

0C5C;TELUGU ARCHAIC SHRII;Lo;<sort> 0C36 0C4D 0C30 0C40;;;;;
0CDC;KANNADA ARCHAIC SHRII;Lo;<sort> 0CB6 0CCD 0CB0 0CC0;;;;;

Generate allkeys.txt and verify that these two show up weighted correctly
by the decomposition.

4. Arabic Pegon letters

The proposal is silent about collation for these 4 additions.

0897;ARABIC PEPET;Mn;;;;;;

Because the pepet character was introduced to have a proper form for
what is currently commonly represented with a maddah, I give it a
<sort> equivalence to 0653 and intercalate it in unidata.txt right
after maddah. This is likely to provide better behavior for Pegon
material that may mix maddah and (in the future) 0897 pepet for this.
Revised entry in unidata.txt looks like this:

0897;ARABIC PEPET;Mn;<sort> 0653;;;;;

10EC2;ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW;Lo;;;;;;
10EC3;ARABIC LETTER TAH WITH TWO DOTS VERTICALLY BELOW;Lo;;;;;;
10EC4;ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW;Lo;;;;;;

Default collation for these nuktated Arabic letters is fairly arbitrary.
I intercalated 10EC2 just ahead of 0759. I intercalated 10EC3 just after
088C, and 10EC4 just after 08B4.

Generate allkeys.txt and verify that these pepet shows up weighted as
equivalent to maddah and the other 3 have primary weights in the expected
order.

5. Arabic combining alef overlay

The proposal is not explicit about collation for 10EFC, but implies that
it should be treated as equivalent to 0670 ARABIC LETTER SUPERSCRIPT ALEF.
I am simply equating it to 0670 in DUCET, similar to the way several
variant forms of maddah are equated to 0653.

Generate allkeys.txt and verify that 10EFC is weighted identically to
0670.

11 more down, 1049 to go.

Archive this delta 5:

unidata-16.0.0d5.txt (1557279 bytes, 10/07/2023)
From Ken:

O.k., a new day. Time to start in on the meat of the matter -- the whole script
additions in primary order. I start with the unicameral script additions.
Those are a bit simpler than any new bicameral scripts (Garay for
16.0). (See above in the Delta 3 discussion for the detailed rationale for
the placement of the new scripts -- I won't replicate that discussion here,
but just proceed, based on the placement decisions already made.)

1. Kirat Rai

I start with this one, even though it has some complexities, because it
is fresh in our minds from the extended discussion of the unicodetools PR.
First move the entire range of Kirat Rai alphabetic records (in code point
order) into unidata-16.0.0d6.txt, starting after Tangsa. This includes all
the consonants, all the vowel signs, which in Kirat Rai are actually
full standalone letters, and the two killers:

16D40;KIRAT RAI SIGN ANUSVARA;Lm;;;;;;
...
16D6C;KIRAT RAI SIGN SAAT;Lm;;;;;;

I omit the three punctuation marks for now. Those end up elsewhere in
DUCET, and it is more efficient to deal with all the punctuation additions
in a separate delta later on, after the alphabetic runs have been
established.

As noted in discussion, the ANUSVARA, TONPI (bindu), and VISARGA for Kirat
Rai are encoded as standalone modifier letters, rather than as combining
marks, and we've already decided not to try equating them to the various
combining candrabindu, visarga, etc., that use generic weights shared
with the Devanagari archetypes of the combining marks. The proposal suggests
that they be given primary distinctions and be left in code point order.
It might be better to move them to the end of the list of consonants, but
without further rationale provided, the simplest solution here is to simply
do as the proposal suggests. Indeed, the proposal states that "This sort
order with anusvara, tonpi, and visarga sorting first has been approved
by AKRS." So we just leave it that way for DUCET.

The next complication results from canonical equivalences for three
Kirat Rai vowels: AI, O, AU. In cases like this, the way to get the sifter
to introduce contractions is to surround the records in question with
the CONTRACTION pragma, to wit:

CONTRACTION

16D68;KIRAT RAI VOWEL SIGN AI;Lo;16D67 16D67;;;;;
16D69;KIRAT RAI VOWEL SIGN O;Lo;16D63 16D67;;;;;
16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67;;;;;

DEFAULT

In cases like this, as for Tamil, Kannada, etc., etc., I also drop
a comment into unidata.txt with a somewhat redundant explanation,
to remind everybody what is going on here.

The effect of the CONTRACTION pragma is to tell the sifter that for
the range of entries where it is in effect, the sifter is to go ahead and
assign a primary weight to the code point and *also* generate a
contraction entry from the decomposition, giving it the same weight
as the atomic character code point. In the absence of the CONTRACTION
pragma, such an entry is instead just entered into allkeys.txt with
the sequence of weights from the decomposition, and does not have
its own primary weight.
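
As a toy model of that behavior (hypothetical names and data shapes, not the sifter's actual code):

```python
def sift_entry(code, decomp, next_primary, contraction_pragma):
    """Model the CONTRACTION pragma semantics described above.

    Returns (output_entries, next_primary). Under the pragma, the code
    point gets its own primary weight and the decomposition becomes a
    contraction with the same weight; otherwise a decomposable entry is
    just keyed by its decomposition sequence and gets no weight of its own."""
    if decomp and contraction_pragma:
        weight = next_primary
        return [(code, weight), (tuple(decomp), weight)], next_primary + 1
    if decomp:
        return [(code, tuple(decomp))], next_primary
    return [(code, next_primary)], next_primary + 1
```

For 16D68 KIRAT RAI VOWEL SIGN AI with decomposition <16D67, 16D67>, the pragma yields two entries sharing one primary weight.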

But wait, there's more. Because of the strange encoding of Kirat Rai
vowel signs, we have a canonical closure problem for 16D6A. The
full decomposition for 16D6A is <16D63, 16D67, 16D67>, and we need
to weight that sequence with the same primary weight via contraction.
Fortunately, this is not the first time this problem has been encountered
for the sifter. A similar problem of canonical closure for a recursive
canonical decomposition occurs for 0CCB in Kannada and 0DDD in Sinhala.
The mechanism baked into the sifter to deal with this is a "secondary
decomposition", which can be added to the input entry in the decomposition
field. As a first step for handling 16D6A, I'm putting the following
entry into unidata-16.0.0d6.txt:

16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67, 16D63 16D67 16D67;;;;;

The comma delimitation in the decomposition allows for adding a secondary
decomposition. When enclosed within the CONTRACTION pragma, this generates
a second contraction using the secondary decomposition information.

This *almost* solves the problem for Kirat Rai, as it was solved for
Kannada and Sinhala. Unfortunately, the way the Kirat Rai vowels work,
there is yet *another* sequence that is canonically equivalent:
<16D63, 16D68>. That sequence is equivalent to <16D63, 16D67, 16D67>,
so it also needs a contraction that is weighted with the same primary
weight. The sifter code currently treats the Kannada and Sinhala cases
essentially as the extent of the problem -- only a *single* secondary
contraction is allowed for in the code. The syntax for this is currently

decompfield := decomp (, decomp)?

rather than

decompfield := decomp (, decomp)*

because the code was written to just look for the comma and then process
a *single* secondary decomposition value, rather than being written to
expect and process an indefinite *list* of secondary decompositions.
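
Parsing the generalized field is straightforward; a Python sketch of the decomp (, decomp)* form, assuming tags such as <sort> have already been stripped (function name is illustrative):

```python
def parse_decomp_field(field):
    """Split a unidata.txt decomposition field into the canonical
    decomposition plus a list of secondary decompositions, per the
    generalized grammar decompfield := decomp (, decomp)*.
    Each decomp is a space-separated list of hex code points."""
    if not field.strip():
        return None, []
    decomps = [tuple(int(cp, 16) for cp in part.split())
               for part in field.split(",")]
    return decomps[0], decomps[1:]
```

For the Kirat Rai AU entry, the field "16D69 16D67, 16D63 16D67 16D67" yields the canonical pair plus one secondary decomposition; any further comma-separated entries would simply extend the list.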

To fix this for Kirat Rai (and to future-proof against any similar
cases), I'm going to have to do some considerable refactoring
of the relevant decomposition handling code in the sifter, which is
a tricky and sensitive part of the code. For now
I am just going to postpone that code work until all the rest of the 16.0 input
for UCA has been taken care of.

But in the meantime, while adding the first relevant secondary decomposition
for 16D6A to unidata.txt, I also reversed the order of the secondary
decomposition for the 0CCB and 0DDD entries in unidata.txt. Now the field with
the secondary decomposition better matches what the code expects: the
*first* entry is the formal canonical decomposition string from
the UCD (which is then recursively decomposed internal to the sifter
processing), followed by a secondary decomposition, which in the Kannada
and Sinhala cases is the full decomposition. The impact on the output
in allkeys.txt is just to invert the order of two contraction lines,
but it does not affect any of the weighting per se. The other side effect
is that the output log will stop warning about encountering a "Non-binary"
canonical decomposition for 0CCB and 0DDD in the recursive decomposition.

Generate allkeys.txt and verify that Kirat Rai weights are as expected,
with special attention to the results for 16D68, 16D69, and 16D6A.
Also examine the impact of the secondary decomposition change for
0CCB and 0DDD.

O.k., this discussion for Kirat Rai and its implications is hairy enough
that I'm going to make this its own delta, without introducing more
scripts into this set of changes.

45 more down, 1004 to go.

Archive this delta 6:

unidata-16.0.0d6.txt (1559396 bytes, 10/08/2023)
From Ken:

1. Todhri

Given all the complications of Kirat Rai to start off the day, I'm rewarding
myself before lunch by dealing with an easy case: Todhri. This is just a straight
unicameral alphabet, with no complications other than two letters that have
canonical equivalent sequences.

Move the relevant entries for Todhri (105C0..105F3), in code point order, into
unidata.txt, right after Vithkuqi. Apply the CONTRACTION pragma to the
two decomposed vowels, 105C9 and 105E4.

Generate allkeys.txt and verify that Todhri weights are as expected,
including the two contractions.

2. Sunuwar

This is another simple one, another simple unicameral alphabet with no marks,
and with the desired collation order the same as the code point order.

Move the relevant entries for Sunuwar (11BC0..11BE0), in code point order, into
unidata.txt, right after Tangsa (and ahead of the Kirat Rai I just added).
Leave the one punctuation sign to deal with later.

Generate allkeys.txt and verify that Sunuwar weights are as expected.

3. Gurung Khema

Gurung Khema is a bit more complicated. This one is an abugida, and it
has decomposition and contraction issues for the vowel signs.

First move all the relevant entries for Gurung Khema (16100..1612F),
in code point order, into unidata.txt, right after Sunuwar.

The 8 multi-part vowels with decompositions, 16121..16128, need to have
the CONTRACTION pragma, as the intent is for the vowels to all get
primary weights. 3 of the multi-part vowels, 16126..16128, have full
decompositions into sequences of three parts. Because of this, as for
Kirat Rai discussed above, those three need to have the full decompositions
added in their entries as secondary decompositions. The entries
affected are:

16126;GURUNG KHEMA VOWEL SIGN O;Mn;16121 1611F, 1611E 1611E 1611F;;;;;
16127;GURUNG KHEMA VOWEL SIGN OO;Mn;16122 1611F, 1611E 16129 1611F;;;;;
16128;GURUNG KHEMA VOWEL SIGN AU;Mn;16121 16120, 1611E 1611E 16120;;;;;

A replication note for when trying to build allkeys.txt with the sifter in
unicodetools: Before the sifter will work correctly for weighting of
abugidas, the Alphabetic property has to be updated for the repertoire in
question. In particular, all gc=Mn or gc=Mc vowel signs, consonant signs,
and length marks in abugidas need to be set explicitly to Other_Alphabetic
in PropList.txt first (and the relevant derivations be run based on that).
Otherwise, during the sift process, the sifter won't see these as
alphabetic and branch down the path for primary weights, but rather will
identify them as otherwise unaccounted for combining marks, and attempt
to give them secondary weights. Anusvaras and visargas also should be
set to Other_Alphabetic, but those are already bled off in unidata.txt by being
given explicit decompositions to generic marks.

Another piece of the puzzle is that nuktas and viramas (including killers)
should be given the Diacritic property in PropList.txt, but these are
more marginal for sifter behavior. Most nuktas are now bled off with
explicit decompositions, and the viramas are almost all picked up in
the sifter via their ccc=9 values. This could become a problem in the
future if SAH insists on ccc=0 for some newly encoded viramas, at which
point the sifter code may need an update to catch any combining
mark viramas (or conjoiners and killers) with ccc=0. The example we have for
16.0, in Kirat Rai, is not a problem, because that is gc=Lm, ccc=0,
so the sifter gets its Alphabetic status from gc=Lm and assigns it a
primary weight.
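
The property dependency can be summarized as a toy predicate (the rule and names are inferred from these notes, not taken from the sifter source):

```python
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def takes_primary_path(gc, other_alphabetic=False):
    """Whether the sifter would see a character as alphabetic and branch
    down the primary-weight path: letters qualify by general category
    alone; gc=Mn/Mc marks only if PropList.txt gives them
    Other_Alphabetic. Everything else falls through to other handling
    (e.g. secondary weights for leftover combining marks)."""
    if gc in LETTER_CATEGORIES:
        return True
    if gc in {"Mn", "Mc"}:
        return other_alphabetic
    return False
```

A Gurung Khema vowel sign (gc=Mn) needs the Other_Alphabetic update first, while the Kirat Rai killers and anusvara (gc=Lm) qualify from the general category alone.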

Generate allkeys.txt and verify that Gurung Khema weights are as expected,
with special attention to the vowel contractions.

4. Tulu-Tigalari

First move all the relevant entries for Tulu-Tigalari (11380..113D0),
in code point order, into unidata.txt, right after Grantha.

Put in the CONTRACTION pragma for the 3 two-part dependent vowel signs,
113C5, 113C7, 113C8. Do the same for each of the 4 two-part independent
vowel signs, 11383, 11385, 1138E, 11391.
Those aren't contiguous in code point order, so should use
multiple entries for the pragma, to make sure they don't pick up entries
that they shouldn't.

Now, checking against the detailed specification of the collation order
in the proposal (L2/22-031), invert the order of 113B3 LLA and 113B4 RRA,
so the collation order is RRA < LLA. That seems to be a deliberate choice
in the proposal.

Next move 113D1 TULU-TIGALARI REPHA into unidata.txt, immediately after
the RA (113AC). The repha is a separately encoded form of ra.

Note that for the Tulu-Tigalari vowels, there are deliberate encoding
gaps for short e and short o. Those might be added to the encoding later
on, in which case they would intercalate neatly in the gaps, and would
fit in the same places in the primary collation order. There is one
anomaly in the specification of collation in that it specifies the
primary order for vowel sign o, even though that is not encoded.
There is also a typo indicating: vowel sign vocalic ll << vowel sign ee.
That should be a primary distinction, like all the rest. Ignored.

The au length mark (113C8) only occurs as the second part of some
two-part vowels, and basically would not be weighted alone
in most text, because it is bled by the contractions that
form the weights for the atomically encoded two-part vowels.
It makes more sense to give it a primary order
*after* the viramas, so I have reversed the position in unidata.txt,
as compared to the specification in the proposal. See the treatment
for Grantha, which has similar components.

The pluta (113D3) is not in L2/22-031. It was added later, based on
Srinidhi and Sridatta's L2/22-260.
L2/22-260 is silent about its ordering. It is a letter that serves
as a different kind of vowel lengthener. I'm giving it a primary order
after the au length mark. Again, see the comparable treatment of
the same component in Grantha.

The gemination mark (113D2) is also not in L2/22-031, but comes from
L2/22-260, which is silent about its ordering. However, the comparison
is made there to Gurmukhi addak (0A71), Khojki sign shadda (11237), and Soyombo
gemination mark (11A98). For DUCET, 0A71 is given a Gurmukhi-specific
secondary weight. The Khojki shadda is simply equated to the Arabic shadda.
The Soyombo gemination mark is given a Soyombo-specific secondary weight.
On balance, it seems best to just add a new secondary weight for the
Tulu-Tigalari gemination mark. I defer that to later, along with any
other new secondary weight additions required. Remember that Garay is
also introducing a gc=Mn gemination mark, so I have to figure out how to
deal with that one, too. So later.

Regenerate allkeys.txt, and verify that the Tulu-Tigalari weights
are as expected, with special attention to the various vowel contractions
and to the few other characters that receive primary weight not in
code point order, as noted above.

5. Ol Onal

Ol Onal is another easy one. It is a simple alphabet. The proposal (L2/22-151R)
specifies that the collation order is simply the same as the encoding order.
However, the discussion of the use of the two combining marks, MU (nasalization,
a dot above) and IKIR (lengthening, a dot below), suggests to me that it makes
more sense to give them secondary weights. That is, rather than what is
specified in the proposal:

A < A+MU < A+IKIR < A+IKIR+MU = A+MU+IKIR

what probably makes more sense for ordering is:

A << A+MU << A+IKIR << A+IKIR+MU = A+MU+IKIR

which would be accomplished better with secondary weights for MU and IKIR.
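The difference can be sketched with two-level sort keys. This is a minimal illustration only; all weight values below are invented for the sketch and are not actual DUCET weights.

```python
# Two-level UCA-style sort keys illustrating why secondary weights for
# MU and IKIR give A << A+MU << A+IKIR. All weights are invented for
# illustration; they are not actual DUCET values.
A, MU, IKIR = "a", "m", "i"

weights = {
    A:    (0x2000, 0x0020),  # base letter: primary weight, common secondary
    MU:   (0x0000, 0x00D5),  # secondary-only mark (nasalization, dot above)
    IKIR: (0x0000, 0x00D6),  # secondary-only mark (lengthening, dot below)
}

def sort_key(s):
    # Level by level: concatenate all nonzero primaries, then all secondaries.
    primaries   = [weights[c][0] for c in s if weights[c][0] != 0]
    secondaries = [weights[c][1] for c in s]
    return (primaries, secondaries)

# A, A+MU, and A+IKIR all share the primary key [0x2000], so they group
# together and are distinguished only at the second level.
assert sort_key(A) < sort_key(A + MU) < sort_key(A + IKIR)
```

The proposal's equation A+IKIR+MU = A+MU+IKIR then falls out of canonical reordering: the two marks have different combining classes, so both spellings normalize to the same mark order before weighting.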

The proposal compares these two marks to the corresponding marks in Ol Chiki,
which are spacing and given gc=Lm, and the corresponding marks in Nag Mundari,
which are non-spacing diacritics. For best consistency, I think we should
follow the pattern of Nag Mundari, which gives the two non-spacing marks
secondary weights, rather than the Ol Chiki pattern, where the spacing
modifier letters get primary weights.

In any case, since the preferred solution involves assigning new secondary
weights, I defer the MU and IKIR to a later draft when I deal with those.

So for now, for the rest of the alphabet, I move all the letters (1E5D0..1E5ED)
and the HODDOND (1E5F0), in code point order, into unidata.txt right after Ol Chiki.

Regenerate allkeys.txt, and verify that the Ol Onal weights are
as expected.

I'll save Garay for the next delta.

233 more down, 771 to go.

Archive this delta 7:

unidata-16.0.0d7.txt (1569698 bytes, 10/08/2023)
From Ken:

1. Garay

I've saved Garay for a separate delta, because it is the only new bicameral
script in the bunch, and that introduces intercalation complications for it
in unidata.txt. It also has a sukun, a gemination mark, and a reduplication mark.
The proposal implies it needs a syllabic ordering, rather than a simpler ordering.
It also has a couple variant letters, which need special handling.

Garay goes into unidata.txt after Medefaidrin, another West African bicameral
script and before Adlam, another West African bicameral script.

The first step is to move all the capital and small letters into unidata.txt.
Then I rearrange them into case pairs, in the same manner as for
Medefaidrin and Adlam. Note that this rearrangement is not strictly necessary
to get weight assignments correct for the case pairs, but it is better to
keep maintaining new bicameral scripts in the same way as for existing
ones already in unidata.txt. This consistency helps in understanding
what is going on.

Next, deal with OLD KA (10D64/10D84) and OLD NA (10D65/10D85). These are claimed
to be just variant forms of KA and NA, respectively, and are explicitly claimed
to sort equal to them. I merge them into the case pair bundles for KA and NA,
with a <sort> decomposition to make them sort similarly to KA and NA, but
with a case-distinguished tertiary weight difference.
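The intended effect can be illustrated with three-level collation elements. All weight values below are invented for the sketch; only the structure (shared primary and secondary, distinct tertiary) reflects the actual intent.

```python
# Hypothetical three-level collation elements (primary, secondary, tertiary)
# for a Garay case/variant bundle: OLD KA shares KA's primary and secondary
# weights and differs only at the tertiary level, so it sorts immediately
# next to KA rather than as a separate letter. All values are invented.
small_ka     = (0x2900, 0x0020, 0x0002)
small_old_ka = (0x2900, 0x0020, 0x0004)  # variant-distinguished tertiary
capital_ka   = (0x2900, 0x0020, 0x0008)  # case-distinguished tertiary
small_na     = (0x2901, 0x0020, 0x0002)  # next primary weight

# Tuple comparison mirrors level-by-level comparison for single elements:
# every KA variant precedes NA, whatever its tertiary weight.
assert max(small_ka, small_old_ka, capital_ka) < small_na
```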

It is completely unclear how to weight the vowels. There is an extensive
discussion of collation in L2/22-048, but the main upshot of that seems to
be that collation is conceived of in terms of a syllabic grid, rather than
in terms of the actual string used to represent each node of the syllabic
grid. I default to putting the vowels in code point order ahead of all
the consonants. Any attempt to implement the actual syllabic ordering
would require an extensive tailoring. Just putting the vowels first, in
roughly the order specified for vowels in syllables, would seem to suffice
for the default ordering in DUCET.

The gemination mark (10D6A) will get a script-specific secondary, so I defer
that for now.

Regenerate allkeys.txt, and verify that the Garay weights are
as expected, including all the case pairs and the equivalents for the
two pairs of variant letters.

51 more down, 720 to go.

Archive this delta 8:

unidata-16.0.0d8.txt (1572176 bytes, 10/08/2023)
From Ken:

All of the new scripts have been dealt with (at least for all the basic letters
and their primary orders). Next I move on to clear away the enormous pile of
miscellaneous symbols (gc=So) for 16.0, most of which are associated with the
new block of computer legacy graphic symbols.

These new symbols are just added to unidata.txt in the appropriate symbols sections
in code point order. All will get primary weights assigned, but will be marked
as variables for collation.
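In allkeys.txt, variable collation elements are flagged with "*" in place of "." before the first weight, which makes them easy to spot mechanically. A small sketch (the sample entry below is schematic: 2427 is one of the new 16.0 code points, but the weights shown are invented):

```python
# Variable collation elements in allkeys.txt are written "[*pppp.ssss.tttt]"
# rather than "[.pppp.ssss.tttt]". The entry below is schematic: the weights
# are invented for illustration.
line = "2427 ; [*0600.0020.0002] # EXAMPLE ENTRY (weights invented)"

def is_variable(allkeys_line):
    # Everything after the ";" begins with the bracketed weight list;
    # a leading "[*" flags the element as variable.
    weight_part = allkeys_line.split(";", 1)[1].strip()
    return weight_part.startswith("[*")

assert is_variable(line)
assert not is_variable("0041 ; [.2075.0020.0008] # NON-VARIABLE (weights schematic)")
```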

1. Symbols added in the 2400 block

Three new symbols (2427..2429) were added to the 2400 block. Add them to unidata.txt
after 2426, in code point order.

Regenerate allkeys.txt, and verify that these three symbols weight as expected.

2. Symbols added in the new 1CC00 block for legacy computing symbols

Several hundred symbols in the range 1CC00..1CEB3 were added in this block.
The outlined digits and letters were already dealt with above. Add the rest
of these entries into unidata.txt in code point order, ahead of the
run of entries for the Symbols for Legacy Computing block at 1FB00.

Regenerate allkeys.txt, and verify that this range of symbols weights as expected.

3. Various arrows added to the Supplemental Arrows-C 1F800 block

Ten new arrows (1F8B2..1F8BB) were added to the 1F800 block. Add them to unidata.txt
after 1F8B1, in code point order.

Regenerate allkeys.txt, and verify that this range of symbols weights as expected.

4. More legacy computing symbols

37 more legacy computing symbols (1FBCB..1FBEF) were added to the 1FB00 block.
Add them to unidata.txt, after 1FBCA, in code point order.

Regenerate allkeys.txt, and verify that this range of symbols weights as expected.

700 more down, 20 to go. Woot! Home stretch.

Archive this delta 9:

unidata-16.0.0d9.txt (1602275 bytes, 10/08/2023)
From Ken:

For this delta, focus on all the remaining script-specific punctuation and symbols.

1. Dandas and double dandas

Tulu-Tigalari and Kirat Rai both have script-specific dandas. Intercalate them
into unidata.txt roughly in code point order. This puts Kirat Rai
after Mro in the list of dandas, and it puts Tulu-Tigalari after Khojki
in the list of dandas. The Balinese inverted carik siki and inverted carik
pareren are also a variety of danda and double danda. Those can be interlaced
with the non-inverted danda and double danda for Balinese.

Regenerate allkeys.txt, and verify that these six dandas weight as expected.

2. Miscellaneous other punctuation

1B7F BALINESE PANTI BAWAK is a variant of 1B5A BALINESE PANTI. Just intercalate
after 1B5A.

10D6E GARAY HYPHEN is another script-specific hyphen. Add to the list of
script-specific hyphens in the dashes and hyphens subsection of punctuation.

113D7..113D8, the two Tulu-Tigalari pushpikas can just be intercalated in
the miscellaneous punctuation section of unidata.txt, after the Khojki
abbreviation sign and before the Newa punctuation.

16D6D KIRAT RAI SIGN YUPI is another Indic abbreviation sign. Add in
the script-specific miscellaneous punctuation section of unidata.txt,
in code point order.

11BE1 SUNUWAR SIGN PVO is an auspicious mark, similar in some ways to
a siddham mark or a Devanagari bhale. These may have particular
pronunciations, but are treated as punctuation marks. Just add to
unidata.txt in the script-specific section of miscellaneous punctuation
in code point order. That will put it right after 119E2 NANDINAGARI
SIGN SIDDHAM.

1E5FF OL ONAL ABBREVIATION SIGN. Likewise just add in the script-specific
miscellaneous punctuation section in code point order.

Regenerate allkeys.txt, and verify that these seven punctuation marks
weight as expected and are indicated as variables.

3. Miscellaneous other symbols

Garay plus sign (10D8E) and minus sign (10D8F). In the absence of any
better information about these, just intercalate in the math symbols
section of unidata.txt after 002B PLUS SIGN and 2212 MINUS SIGN,
respectively.

Regenerate allkeys.txt and verify that these two symbols weight
as expected and are indicated as variables.

4. Garay reduplication mark

10D6F GARAY REDUPLICATION MARK has its properties misconstrued in the
proposal (L2/22-048), where it is explained under section 4, Punctuation.
However, most examples of iteration or reduplication marks are
designated as gc=Lm, Extender=True in the UCD. This keeps the
reduplication or iteration mark within the context of the word for
segmentation purposes, which is usually the desired outcome. A few
similar marks have ended up as gc=Po. But the Garay proposal
specifies the mark as gc=So, which is clearly wrong.

I am updating UnicodeData.txt for 16.0 to change this character from gc=So to
gc=Lm, and then updating the underlying library for the sifter
accordingly.

With the updated interpretation of properties, 10D6F can be
intercalated in the extenders section of unidata.txt. I put it
between AAF4 MEETEI MAYEK WORD REPETITION MARK and
16B42 PAHAWH HMONG SIGN VOS NRUA.

Regenerate allkeys.txt and verify that 10D6F shows up with a
primary weight and is grouped with the extenders.

16 more down, 4 to go.

Archive this delta 10:

unidata-16.0.0d10.txt (1602884 bytes, 10/08/2023)
From Ken:

At this point there are only 4 more characters to deal with: the Garay and
Tulu-Tigalari gemination marks and the nasalization and lengthener marks
for Ol Onal. All four are gc=Mn and as discussed above should probably end
up with secondary weights.

Because the addition of script-specific secondary weights requires some
corresponding code changes to the sifter code, I have saved these four
to the last. I will provide the intercalation points for unidata.txt,
but then also spell out in detail what corresponding changes need to
be made to the source code to enable the extension of the list of defined
secondary weights. At that time I will also update the various dates
and version information in the sifter source code, so that allkeys.txt and
other output files get stamped with the correct versions and dates.

1. Ol Onal MU and IKIR (1E5EE..1E5EF).

Thinking about these overnight, there doesn't seem to be any strong
case for these requiring *script-specific* secondary weights. They
are the only combining marks used in Ol Onal. One is just a dot above
and the other is just a dot below (and only used on one letter).
Their collation implications can be handled simply by equating them
to the generic above and generic below secondary weights.

Intercalate 1E5EE and 1E5EF into unidata.txt in the combining marks
section after the 4 Nag Mundari combining marks, and weight them
as generic above and generic below.

Regenerate allkeys.txt and examine the diff against allkeys-16.0.0d10.txt.
(This can be done with a diff, because the two marks are being given
existing secondary weights.)

2. Garay and Tulu-Tigalari gemination marks (10D6A, 113D2)

These were discussed at some length above. For consistency with
similar cases in other scripts, a solution that gives them script-specific
secondary weights seems best. Both occur in scripts that have other
combining marks that these gemination marks should be distinguished from.

Introducing new secondary weights in the sifter is much less automatic
than just letting the sifter assign new primary weights. Because the
sifter's generation of allkeys.txt is still tightly coupled to the
generation of the symbolic forms used for the CTT table for ISO 14651,
the code needs to be touched to introduce new secondary weight symbols
in the correct order. (At some point soon, we might decide to give up
on generating the CTT table for ISO 14651, but that is a separate
decision that would require a significant overhaul of the code to clean
up the sifter, and shouldn't be considered now simply to avoid the
work for two new secondaries for 16.0.)

So first I detour to do the 16.0 updating of the sifter.

unisift.c

Update the two version strings and the versioned file names to 16.0.0.
The copyright year hasn't (yet) changed since this code was last touched
for 15.1.

For consistency in documentation, update the NOTDEF section of the
switch statement in unisift_ForceToSecondary() for the three Tulu-Tigalari
and one Gurung Khema mark that were equated to generic anusvara,
visarga, etc. This doesn't impact the actual code, but is good practice
for bookkeeping on these exceptions.

unisyms.c

Bump up the NUMSECONDSYMS constant from 261 to 263 for the two new weights
to be added.

There are no strong rules for which order new secondaries need to be
intercalated in, and no obvious positions, since Tulu-Tigalari and
Garay are both new. To keep things simple, I chose to add both of them
together, right after the existing entry for the Soyombo gemination mark
in secondSyms[]:

"<D11A98>", /* Soyombo gemination mark */
"<D10D6A>", /* Garay gemination mark */
"<D113D2>", /* Tulu-Tigalari gemination mark */

That requires a corresponding addition of two entries to secondSymVals[]:

/* Soyombo, Garay, Tulu-Tigalari */
    0x11A98, 0x10D6A, 0x113D2,

And then, the actual two entries for 10D6A and 113D2 need to be moved
into unidata.txt in the same relative position, right after 11A98.
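The two C tables have to stay in lockstep. The invariant can be sketched like this (a checking sketch in Python mirroring the C data; only the three entries quoted above are shown, and the snake_case names are just the Python mirror of the C tables):

```python
# The symbol "<Dxxxxx>" at index i of secondSyms[] must name the code point
# at index i of secondSymVals[]; a mismatch shifts every later symbol in the
# generated CTT. Only the three entries quoted above are shown here.
second_syms     = ["<D11A98>", "<D10D6A>", "<D113D2>"]
second_sym_vals = [0x11A98, 0x10D6A, 0x113D2]

assert len(second_syms) == len(second_sym_vals)
for sym, cp in zip(second_syms, second_sym_vals):
    # Strip "<D" and ">" and parse the hex digits in between.
    assert int(sym[2:-1], 16) == cp
```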

I also made two small tweaks to ranges in dumpCollatingSymbols() to ensure
that those ranges pick up the new Gurung Khema and Symbols for Legacy
Computing Supplement blocks when dumping collating symbols for the CTT
for 14651. Corresponding tweaks needed to be made to the ranges
in unisift_BuildSymWtTree().

This manual insertion of secondary marks is, of course, very tedious
and error-prone in cases where several new marks need to be added
for a version. It would be helpful if it could be automated a
bit more, but the work involved has never risen to the level where
it could offset the considerable effort that would be required to
figure out and implement actual automation here. The whole issue
of secondary weights in DUCET has seen an extensive amount of
custom tinkering over the years, in large part to keep the
inflation of secondary values under control, and to avoid breaking
out of the magic number range: 0021..01FF for secondaries, so
the DUCET table could keep the lowest primary weight stable
at 0200. We would have broken the bank years ago, if not for
the considerable work in custom folding of lots of secondary
weights for marks in different scripts into common, generic
secondary weight values.
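The "magic number" constraint can be stated as a simple range check. The two new 16.0 secondaries (00D3 and 00D4, per the allkeys.txt excerpt later in these notes) fit comfortably inside the range:

```python
# Secondary weights must stay within 0021..01FF so that 0200 can remain
# the lowest primary weight in DUCET.
SECONDARY_MIN, SECONDARY_MAX = 0x0021, 0x01FF
FIRST_PRIMARY = 0x0200

def secondary_ok(w):
    return SECONDARY_MIN <= w <= SECONDARY_MAX

assert secondary_ok(0x00D3) and secondary_ok(0x00D4)
assert FIRST_PRIMARY == SECONDARY_MAX + 1  # primaries begin right after
```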

Rebuild the sifter and deploy.

Regenerate allkeys.txt and verify that the two new secondary
weights have been added correctly, and that the secondary weight
range shows in the output diagnostics as having been bumped
up to 263. My first run for this in fact turned up an underlying
problem in my library when I checked for the magic number 263
and came up one short in the output. I had mistakenly given
an Alphabetic value to 10D6A, which resulted in the sifter giving
it a primary weight, instead of the desired secondary weight
next to the other two gemination marks. Correction of the
library resulted in the expected output in allkeys.txt:

11A98 ; [.0000.00D2.0002] # SOYOMBO GEMINATION MARK
10D6A ; [.0000.00D3.0002] # GARAY CONSONANT GEMINATION MARK
113D2 ; [.0000.00D4.0002] # TULU-TIGALARI GEMINATION MARK
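A quick mechanical sanity check over the excerpt above confirms the expected pattern: all three gemination marks are primary ignorable, with consecutive secondary weights.

```python
# Parse the three allkeys.txt lines quoted above and check that each mark
# is ignorable at the primary level (0000) and that the three secondary
# weights are consecutive.
import re

excerpt = """\
11A98 ; [.0000.00D2.0002] # SOYOMBO GEMINATION MARK
10D6A ; [.0000.00D3.0002] # GARAY CONSONANT GEMINATION MARK
113D2 ; [.0000.00D4.0002] # TULU-TIGALARI GEMINATION MARK
"""

weights = [re.match(r"\S+ ; \[\.(\w{4})\.(\w{4})\.", line).groups()
           for line in excerpt.splitlines()]
assert all(p == "0000" for p, s in weights)      # primary ignorable
secondaries = [int(s, 16) for p, s in weights]
assert secondaries == [0x00D2, 0x00D3, 0x00D4]   # consecutive new weights
```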

For these secondary weight changes, another necessary check
is to verify that the CTT generation picked up the correct
new symbols. For that I check the ctt14651.txt output file:

collating-symbol <D11A98>  % SOYOMBO GEMINATION MARK
collating-symbol <D10D6A>  % GARAY CONSONANT GEMINATION MARK
collating-symbol <D113D2>  % TULU-TIGALARI GEMINATION MARK
...
<D11A98>  % SOYOMBO GEMINATION MARK
<D10D6A>  % GARAY CONSONANT GEMINATION MARK
<D113D2>  % TULU-TIGALARI GEMINATION MARK
...
<U11A98> IGNORE;<D11A98>;<MIN>;<SFFFF> % SOYOMBO GEMINATION MARK
<U10D6A> IGNORE;<D10D6A>;<MIN>;<SFFFF> % GARAY CONSONANT GEMINATION MARK
<U113D2> IGNORE;<D113D2>;<MIN>;<SFFFF> % TULU-TIGALARI GEMINATION MARK

And everything seems to be in order there. If the tables in unisyms.c
are not correctly updated, these symbol assignments for the CTT can easily end
up with off-by-one errors, throwing the table completely out of
whack.

4 more down, 0 to go. Yay!

Archive this delta 11:

unidata-16.0.0d11.txt (1603293 bytes, 10/09/2023)

Generate decomps-16.0.0d11.txt (sifter -t unidata-16.0.0d11.txt) to
document the changes in decompositions for this version of UCA.
I diff this file against the released version for 15.1.0: decomps-15.1.0d4.txt
to see what changed. It shows all the new decompositions, including all
the synthetic decomposition additions for collation. It also shows
the order change for the two secondary decompositions for Kannada
and Sinhala, discussed above.

----

From Ken via email:

As usual, there are also some small modifications to the sifter source code,
particularly to deal with the introduction of new secondary weights for UCA 16.0.
I was able to pare away various secondaries and get the additions down to just
two new secondary weights --
but any addition of a secondary requires some work on the sifter code.
And a new version always requires version-string updates and a few range updates, as well.

Note that handling the full new set (1177 characters) for Unicode 16.0
in the sifter requires that some character properties outside of UnicodeData.txt
also be updated correctly.
The most important of these is Alphabetic, which depends on Other_Alphabetic in PropList.txt.
To a lesser extent, the values for Diacritic and Extender (also in PropList.txt)
might impact a few weight assignments.
I very strongly recommend that, before going much further on script-by-script UCA work,
particularly for the abugidas,
you first take a break to do a complete update of PropList.txt for 16.0.
Neglecting that step will just lead to confusion and
characters out of place later on in the UCA work,
particularly when you get to Tulu-Tigalari.
macchiati previously approved these changes Oct 18, 2023

@macchiati macchiati left a comment

Spot-checking commits. Looks good so far.

@eggrobin eggrobin marked this pull request as ready for review November 10, 2023 22:53
@eggrobin

(Closing and reopening to rerun CI.)

@eggrobin eggrobin closed this Nov 10, 2023
@eggrobin eggrobin reopened this Nov 10, 2023
@markusicu

Ken has sent more updates to the data and to the sifter code. I will try to catch up with those “soon”.

@eggrobin eggrobin marked this pull request as draft November 10, 2023 23:00
From Ken:

I've done the code refactoring that enables the processing of more than
one secondary decomposition in the input file, so we could get the full
set of contractions working for the Kirat Rai vowel au.

There was a very
substantial refactoring of the code related to processContractions().
Instead of the prior one-off branch that checked for a secondary
decomposition and handled it specially, the code now assumes that a
decomposition in the input file is a comma-separated list which usually
defaults to a single entry, if present. However, rather than refactoring
the code so that it could handle indefinite lists of decompositions for
contractions, I just have it work now with a static array of up to four
decompositions. For over a decade, we've been able to get by with two.
Kirat Rai forces us up to three. I expect it will take a while to find a
Kirat Rai forces us up to three. I expect it will take awhile to find a
situation that requires four -- but if we do eventually need to go
there, the code will continue to just work with no further updates.
Doing it this way avoided even *more* extensive changes to the code to
build up and tear down dynamic lists for this extremely edgy edge case.
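The refactored parsing can be sketched as follows. This is a hedged illustration only: the function name, the field syntax, and the sample code points are hypothetical, not the actual sifter code or Kirat Rai data.

```python
# Sketch of treating the decomposition field as a comma-separated list of
# decompositions, capped by a small static maximum rather than a dynamic
# list. Names and sample data are hypothetical.
MAX_DECOMPS = 4  # two sufficed for over a decade; Kirat Rai forces three

def parse_decomps(field):
    # "xxxx yyyy,xxxx zzzz" -> [["xxxx", "yyyy"], ["xxxx", "zzzz"]]
    decomps = [d.split() for d in field.split(",") if d.strip()]
    if len(decomps) > MAX_DECOMPS:
        raise ValueError("too many decompositions: %r" % field)
    return decomps

# The common case stays a single decomposition:
assert parse_decomps("0041 0042") == [["0041", "0042"]]
# A three-way contraction entry, in the style needed for the Kirat Rai
# vowel au (code points illustrative only):
assert len(parse_decomps("16D63 16D67,16D63 16D68,16D63 16D69")) == 3
```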
From Ken:

I've updated my library and the sifter input file, and now have a new
draft of allkeys.txt including the 15 characters added at UTC-177.

The intercalations were all easy, and an examination of the resulting
weights looks fine to me.
@markusicu

I have caught up with Ken's recent changes, and I am taking this out of draft for review and merging. I will look into the UCA test failure separately; at a minimum, I will need to hardcode sample characters for future scripts.

@markusicu markusicu marked this pull request as ready for review November 22, 2023 20:38
@markusicu markusicu requested a review from macchiati November 22, 2023 20:39
From Ken:

I removed the reference to ustbuild, which is a
separate utility that I used to house in my sifter directory, but which
[...] I moved out to a separate directory with a
separate make.
@Ken-Whistler Ken-Whistler left a comment

Changes look like they have caught up o.k.

@markusicu markusicu merged commit 2c0cd74 into unicode-org:main Nov 23, 2023
10 of 11 checks passed
@markusicu markusicu deleted the uca16-ken branch November 23, 2023 00:29