UCA 16.0 delta 11 · unicode-org/unicodetools@7973464

UCA 16.0 delta 11

From Ken:

At this point there are only 4 more characters to deal with: the Garay and
Tulu-Tigalari gemination marks and the nasalization and lengthener marks
for Ol Onal. All four are gc=Mn and as discussed above should probably end
up with secondary weights.

Because the addition of script-specific secondary weights requires some
corresponding code changes to the sifter code, I have saved these four
to the last. I will provide the intercalation points for unidata.txt,
but then also spell out in detail what corresponding changes need to
be made to the source code to enable the extension of the list of defined
secondary weights. At that time I will also update the various dates
and version information in the sifter source code, so that allkeys.txt and
other output files get stamped with the correct versions and dates.

1. Ol Onal MU and IKIR (1E5EE..1E5EF).

Thinking about these overnight, there doesn't seem to be any strong
case for these requiring *script-specific* secondary weights. They
are the only combining marks used in Ol Onal. One is just a dot above
and the other is just a dot below (and only used on one letter).
Their collation implications can be handled simply by equating them
to the generic above and generic below secondary weights.

Intercalate 1E5EE and 1E5EF into unidata.txt in the combining marks
section after the 4 Nag Mundari combining marks, and weight them
as generic above and generic below.

Regenerate allkeys.txt and examine the diff against allkeys-16.0.0d10.txt.
(This can be done with a diff, because the two marks are being given
existing secondary weights.)

2. Garay and Tulu-Tigalari gemination marks (10D6A, 113D2)

These were discussed at some length above. For consistency with
similar cases in other scripts, a solution that gives them script-specific
secondary weights seems best. Both occur in scripts that have other
combining marks that these gemination marks should be distinguished from.

Introducing new secondary weights in the sifter is much less automatic
than just letting the sifter assign new primary weights. Because the
sifter's generation of allkeys.txt is still tightly coupled to the
generation of the symbolic forms used for the CTT table for ISO 14651,
the code needs to be touched to introduce new secondary weight symbols
in the correct order. (At some point soon, we might decide to give up
on generating the CTT table for ISO 14651, but that is a separate
decision that would require a significant overhaul of the code to clean
up the sifter, and shouldn't be considered now simply to avoid the
work for two new secondaries for 16.0.)

So first I detour to do the 16.0 updating of the sifter.

unisift.c

Update the two version strings and the versioned file names to 16.0.0.
The copyright year hasn't (yet) changed since this code was last touched
for 15.1.

For consistency in documentation, update the NOTDEF section of the
switch statement in unisift_ForceToSecondary() for the three Tulu-Tigalari
marks and the one Gurung Khema mark that were equated to generic anusvara,
visarga, etc. This doesn't impact the actual code, but is good practice
for bookkeeping on these exceptions.

unisyms.c

Bump up the NUMSECONDSYMS constant from 261 to 263 for the two new weights
to be added.

There are no strong rules for which order new secondaries need to be
intercalated in, and no obvious positions, since Tulu-Tigalari and
Garay are both new. To keep things simple, I chose to add both of them
together, right after the existing entry for the Soyombo gemination mark
in secondSyms[]:

"<D11A98>", /* Soyombo gemination mark */
"<D10D6A>", /* Garay gemination mark */
"<D113D2>", /* Tulu-Tigalari gemination mark */

That requires a corresponding addition of two entries to secondSymVals[]:

/* Soyombo, Garay, Tulu-Tigalari */
0x11A98, 0x10D6A, 0x113D2,

And then, the actual two entries for 10D6A and 113D2 need to be moved
into unidata.txt in the same relative position, right after 11A98.
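
Because secondSyms[] and secondSymVals[] are parallel tables, a small
consistency check can catch a misplaced or missing entry before the sifter
is rebuilt. What follows is only a minimal sketch, not code from the sifter:
the three-entry excerpt arrays and the helper check_second_syms() are
hypothetical names introduced for illustration, and the check assumes every
symbol in the slice being tested has the <Dxxxx> form shown above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical excerpt; the real tables in unisyms.c are far longer. */
#define EXCERPT_LEN 3

static const char *const secondSymsExcerpt[EXCERPT_LEN] = {
    "<D11A98>", /* Soyombo gemination mark */
    "<D10D6A>", /* Garay gemination mark */
    "<D113D2>", /* Tulu-Tigalari gemination mark */
};

static const unsigned long secondSymValsExcerpt[EXCERPT_LEN] = {
    /* Soyombo, Garay, Tulu-Tigalari */
    0x11A98, 0x10D6A, 0x113D2,
};

/*
 * Return the index of the first entry whose "<Dxxxx>" symbol does not
 * encode the code point stored at the same index, or -1 if all agree.
 */
static int check_second_syms(const char *const *syms,
                             const unsigned long *vals, int n)
{
    int i;

    assert(syms != NULL && vals != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        unsigned long cp;

        if (sscanf(syms[i], "<D%lx>", &cp) != 1 || cp != vals[i]) {
            return i;
        }
    }
    return -1;
}
```

Calling check_second_syms(secondSymsExcerpt, secondSymValsExcerpt, EXCERPT_LEN)
returns -1 for the excerpt; swapping two values in only one of the arrays
makes it return the index of the first disagreement.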

I also made two small tweaks to ranges in dumpCollatingSymbols() to ensure
that those ranges pick up the new Gurung Khema and Symbols for Legacy
Computing Supplement blocks when dumping collating symbols for the CTT
for 14651. Corresponding tweaks needed to be made to the ranges
in unisift_BuildSymWtTree().

This manual insertion of secondary marks is, of course, very tedious
and error-prone in cases where several new marks need to be added
for a version. It would be helpful if it could be automated a
bit more, but the work involved has never risen to the level where
it could offset the considerable effort that would be required to
figure out and implement actual automation here. The whole issue
of secondary weights in DUCET has seen an extensive amount of
custom tinkering over the years, in large part to keep the
inflation of secondary values under control. The goal is to stay
inside the magic number range for secondaries, 0021..01FF, so that
the DUCET table can keep the lowest primary weight stable
at 0200. We would have broken the bank years ago, if not for
the considerable work in custom folding of many secondary
weights for marks in different scripts into common, generic
secondary weight values.
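
The budget arithmetic behind that range can be sketched as follows. The
macro and function names are hypothetical; only the three numeric values
(0021, 01FF, 0200) come from the description above, and nothing here is
taken from the sifter source.

```c
#include <assert.h>

/* Values quoted in the text above; the macro names are hypothetical. */
#define SECONDARY_MIN 0x0021u
#define SECONDARY_MAX 0x01FFu
#define PRIMARY_MIN   0x0200u

/* Number of distinct secondary weights that fit in the reserved range. */
static unsigned secondary_capacity(void)
{
    return SECONDARY_MAX - SECONDARY_MIN + 1u;
}

/* Nonzero if a secondary weight lies inside the range, below any primary. */
static int secondary_in_range(unsigned weight)
{
    return weight >= SECONDARY_MIN && weight < PRIMARY_MIN;
}

/* How many higher secondary values remain above the highest one in use. */
static unsigned secondary_headroom(unsigned highest_in_use)
{
    assert(secondary_in_range(highest_in_use));
    return SECONDARY_MAX - highest_in_use;
}
```

The range holds 479 values in total, so every secondary that is folded
into a generic weight leaves one more value free for a case that really
needs a script-specific weight.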

Rebuild the sifter and deploy.

Regenerate allkeys.txt and verify that the two new secondary
weights have been added correctly, and that the secondary weight
range shows in the output diagnostics as having been bumped
up to 263. My first run for this in fact turned up an underlying
problem in my library when I checked for the magic number 263
and came up one short in the output. I had mistakenly given
an Alphabetic value to 10D6A, which resulted in the sifter giving
it a primary weight, instead of the desired secondary weight
next to the other two gemination marks. Correction of the
library resulted in the expected output in allkeys.txt:

11A98 ; [.0000.00D2.0002] # SOYOMBO GEMINATION MARK
10D6A ; [.0000.00D3.0002] # GARAY CONSONANT GEMINATION MARK
113D2 ; [.0000.00D4.0002] # TULU-TIGALARI GEMINATION MARK
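
A check of this kind can also be done mechanically. The sketch below is
an illustration only and is not part of the sifter: the FirstCE struct and
both function names are hypothetical, and the parser handles only entries
with a single code point, reading just the first collation element on the
line.

```c
#include <assert.h>
#include <stdio.h>

/* First collation element of one allkeys.txt entry (hypothetical type). */
typedef struct {
    unsigned long cp;
    unsigned primary, secondary, tertiary;
} FirstCE;

/*
 * Parse "<cp> ; [.pppp.ssss.tttt] ..." for a single-code-point entry.
 * Returns 1 on success, 0 if the line does not have that shape
 * (comments, @version lines, and multi-code-point keys all return 0).
 */
static int parse_first_ce(const char *line, FirstCE *out)
{
    char marker; /* '.' for non-variable, '*' for variable elements */

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%lx ; [%c%x.%x.%x]", &out->cp, &marker,
               &out->primary, &out->secondary, &out->tertiary) != 5) {
        return 0;
    }
    return marker == '.' || marker == '*';
}

/* Nonzero if the element has a zero primary and a nonzero secondary. */
static int is_secondary_only(const FirstCE *ce)
{
    assert(ce != NULL);
    return ce->primary == 0 && ce->secondary != 0;
}
```

Run over the three lines above, is_secondary_only() holds for each of
them. An entry that had been given a primary weight by mistake, as 10D6A
was on the first run, would fail the test.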

For these secondary weight changes, another necessary check
is to verify that the CTT generation picked up the correct
new symbols. For that I check the ctt14651.txt output file:

collating-symbol <D11A98> % SOYOMBO GEMINATION MARK
collating-symbol <D10D6A> % GARAY CONSONANT GEMINATION MARK
collating-symbol <D113D2> % TULU-TIGALARI GEMINATION MARK
...
<D11A98> % SOYOMBO GEMINATION MARK
<D10D6A> % GARAY CONSONANT GEMINATION MARK
<D113D2> % TULU-TIGALARI GEMINATION MARK
...
<U11A98> IGNORE;<D11A98>;<MIN>;<SFFFF> % SOYOMBO GEMINATION MARK
<U10D6A> IGNORE;<D10D6A>;<MIN>;<SFFFF> % GARAY CONSONANT GEMINATION MARK
<U113D2> IGNORE;<D113D2>;<MIN>;<SFFFF> % TULU-TIGALARI GEMINATION MARK

And everything seems to be in order there. If the tables in unisyms.c
are not correctly updated, these symbol assignments for the CTT can easily end
up with off-by-one errors, throwing the table completely out of
whack.
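
For marks that own a script-specific secondary symbol, an off-by-one of
that kind shows up as a <Dxxxx> symbol that no longer matches the <Uxxxx>
code point on the same weight line. A minimal sketch of that comparison
follows; the function name is hypothetical and the line shape is assumed
to be exactly the one shown in the ctt14651.txt excerpt above. Marks that
were equated to a generic secondary are expected to point at a different
symbol, so a mismatch is not an error for them.

```c
#include <assert.h>
#include <stdio.h>

/*
 * For a CTT weight line of the shape
 *   <Uxxxx> IGNORE;<Dyyyy>;<MIN>;<SFFFF> % NAME
 * return 1 if xxxx == yyyy, 0 if they differ, and -1 if the line
 * does not have that shape at all.
 */
static int ctt_symbol_matches(const char *line)
{
    unsigned long u, d;

    assert(line != NULL);
    if (sscanf(line, "<U%lx> IGNORE;<D%lx>;", &u, &d) != 2) {
        return -1;
    }
    return u == d;
}
```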

4 more down, 0 to go. Yay!

Archive this delta 11:

unidata-16.0.0d11.txt (1603293 bytes, 10/09/2023)

Generate decomps-16.0.0d11.txt (sifter -t unidata-16.0.0d11.txt) to
document the changes in decompositions for this version of UCA.
I diff this file against the released version for 15.1.0: decomps-15.1.0d4.txt
to see what changed. It shows all the new decompositions, including all
the synthetic decomposition additions for collation. It also shows
the order change for the two secondary decompositions for Kannada
and Sinhala, discussed above.

----

From Ken via email:

As usual, there are also some small modifications to the sifter source code,
particularly to deal with the introduction of new secondary weights for UCA 16.0.
I was able to pare away various secondaries and get the additions down to just
two new secondary weights --
but any addition of a secondary requires some work on the sifter code.
And a new version always requires version updates and a few range updates, as well.

Note that handling the full new set (1177 characters) for Unicode 16.0
in the sifter requires that some character properties outside of UnicodeData.txt
also be updated correctly.
The most important of these is Alphabetic, which depends on Other_Alphabetic in PropList.txt.
To a lesser extent, the values for Diacritic and Extender (also in PropList.txt)
might impact a few weight assignments.
I very strongly recommend that, before going much further on script-by-script UCA work,
particularly for the abugidas,
you first take a break to do a complete update of PropList.txt for 16.0.
Neglecting that step will just lead to confusion and
characters out of place later on in the UCA work,
particularly when you get to Tulu-Tigalari.
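
To spot-check whether a given mark is covered by Other_Alphabetic after
such an update, the data lines of PropList.txt can be scanned directly.
The following is a rough sketch under stated assumptions: both function
names are hypothetical, the line layout is assumed to be the usual
"XXXX..YYYY ; Property # comment" or "XXXX ; Property # comment", and
property names are assumed to contain only letters and underscores.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * If `line` is a PropList.txt data line assigning `prop`, store the
 * inclusive code point range and return 1; otherwise return 0.
 */
static int parse_prop_range(const char *line, const char *prop,
                            unsigned long *first, unsigned long *last)
{
    char name[64];

    assert(line != NULL && prop != NULL && first != NULL && last != NULL);
    /* Range form: "XXXX..YYYY ; Property" */
    if (sscanf(line, "%lx..%lx ; %63[A-Za-z_]", first, last, name) == 3) {
        return strcmp(name, prop) == 0;
    }
    /* Single code point form: "XXXX ; Property" */
    if (sscanf(line, "%lx ; %63[A-Za-z_]", first, name) == 2) {
        *last = *first;
        return strcmp(name, prop) == 0;
    }
    return 0;
}

/* Nonzero if `line` assigns `prop` to a range that contains `cp`. */
static int line_assigns_prop(const char *line, const char *prop,
                             unsigned long cp)
{
    unsigned long first, last;

    return parse_prop_range(line, prop, &first, &last)
        && cp >= first && cp <= last;
}
```

Looping line_assigns_prop() over every line of the file for a code point
such as 0x10D6A gives a quick answer to whether the library will treat
that mark as Alphabetic through Other_Alphabetic.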

markusicu committed Oct 10, 2023

1 parent f710f4d commit 7973464
