From Ken:

At this point there are only 4 more characters to deal with: the Garay and Tulu-Tigalari gemination marks and the nasalization and lengthener marks for Ol Onal. All four are gc=Mn and, as discussed above, should probably end up with secondary weights. Because the addition of script-specific secondary weights requires some corresponding changes to the sifter code, I have saved these four for last. I will provide the intercalation points for unidata.txt, but then also spell out in detail what corresponding changes need to be made to the source code to enable the extension of the list of defined secondary weights. At that time I will also update the various dates and version information in the sifter source code, so that allkeys.txt and other output files get stamped with the correct versions and dates.

1. Ol Onal MU and IKIR (1E5EE..1E5EF)

Thinking about these overnight, there doesn't seem to be any strong case for these requiring *script-specific* secondary weights. They are the only combining marks used in Ol Onal. One is just a dot above and the other is just a dot below (and only used on one letter). Their collation implications can be handled simply by equating them to the generic above and generic below secondary weights.

Intercalate 1E5EE and 1E5EF into unidata.txt in the combining marks section after the 4 Nag Mundari combining marks, and weight them as generic above and generic below.

Regenerate allkeys.txt and examine the diff against allkeys-16.0.0d10.txt. (This can be done with a diff, because the two marks are being given existing secondary weights.)

2. Garay and Tulu-Tigalari gemination marks (10D6A, 113D2)

These were discussed at some length above. For consistency with similar cases in other scripts, a solution that gives them script-specific secondary weights seems best. Both occur in scripts that have other combining marks that these gemination marks should be distinguished from.
Introducing new secondary weights in the sifter is much less automatic than just letting the sifter assign new primary weights. Because the sifter's generation of allkeys.txt is still tightly coupled to the generation of the symbolic forms used for the CTT table for ISO 14651, the code needs to be touched to introduce new secondary weight symbols in the correct order. (At some point soon, we might decide to give up on generating the CTT table for ISO 14651, but that is a separate decision that would require a significant overhaul of the code to clean up the sifter, and shouldn't be considered now simply to avoid the work for two new secondaries for 16.0.)

So first I detour to do the 16.0 updating of the sifter.

unisift.c

Update the two version strings and the versioned file names to 16.0.0. The copyright year hasn't (yet) changed since this code was last touched for 15.1.

For consistency in documentation, update the NOTDEF section of the switch statement in unisift_ForceToSecondary() for the three Tulu-Tigalari marks and the one Gurung Khema mark that were equated to generic anusvara, visarga, etc. This doesn't impact the actual code, but is good practice for bookkeeping on these exceptions.

unisyms.c

Bump up the NUMSECONDSYMS constant from 261 to 263 for the two new weights to be added.

There are no strong rules for the order in which new secondaries need to be intercalated, and no obvious positions, since Tulu-Tigalari and Garay are both new. To keep things simple, I chose to add both of them together, right after the existing entry for the Soyombo gemination mark in secondSyms[]:

    "<D11A98>", /* Soyombo gemination mark */
    "<D10D6A>", /* Garay gemination mark */
    "<D113D2>", /* Tulu-Tigalari gemination mark */

That requires a corresponding addition of two entries to secondSymVals[]:

    /* Soyombo, Garay, Tulu-Tigalari */
    0x11A98, 0x10D6A, 0x113D2,

And then, the actual two entries for 10D6A and 113D2 need to be moved into unidata.txt in the same relative position, right after 11A98.
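The two tables in unisyms.c have to stay in step with each other and with NUMSECONDSYMS. A minimal sketch of that invariant follows, written in Python rather than the sifter's own C. The list names, the function name, and the rule that a symbol <Dxxxx> names code point xxxx are assumptions based only on the three entries quoted above; the real tables may contain symbols of other shapes.

```python
# Hypothetical Python mirror of the parallel tables secondSyms[] and
# secondSymVals[]. Only the three entries quoted in the commit message
# are shown; the real tables hold NUMSECONDSYMS (263) entries.
SECOND_SYMS = ["<D11A98>", "<D10D6A>", "<D113D2>"]
SECOND_SYM_VALS = [0x11A98, 0x10D6A, 0x113D2]


def check_parallel_tables(syms, vals, expected_count):
    """Return a list of problems; an empty list means the tables agree."""
    problems = []
    if len(syms) != expected_count:
        problems.append(f"{len(syms)} symbols, expected {expected_count}")
    if len(vals) != expected_count:
        problems.append(f"{len(vals)} values, expected {expected_count}")
    for index, (sym, val) in enumerate(zip(syms, vals)):
        # Assumption: the symbol is "<D" + hex code point + ">".
        if sym != f"<D{val:04X}>":
            problems.append(f"slot {index}: {sym} paired with {val:04X}")
    return problems


if __name__ == "__main__":
    # The count is 3 here only because the sample holds three entries.
    print(check_parallel_tables(SECOND_SYMS, SECOND_SYM_VALS, 3))
```

A value inserted one slot off in only one of the two tables shows up as a run of mismatched slots from that point on, which is the kind of drift this check is meant to catch early.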
I also made two small tweaks to ranges in dumpCollatingSymbols() to ensure that those ranges pick up the new Gurung Khema and Symbols for Legacy Computing Supplement blocks when dumping collating symbols for the CTT for 14651. Corresponding tweaks needed to be made to the ranges in unisift_BuildSymWtTree().

This manual insertion of secondary marks is, of course, very tedious and error-prone in cases where several new marks need to be added for a version. It would be helpful if it could be automated a bit more, but the work involved has never risen to the level where it could offset the considerable effort that would be required to figure out and implement actual automation here. The whole issue of secondary weights in DUCET has seen an extensive amount of custom tinkering over the years, in large part to keep the inflation of secondary values under control, and to avoid breaking out of the magic number range of 0021..01FF for secondaries, so that the DUCET table could keep the lowest primary weight stable at 0200. We would have broken the bank years ago, if not for the considerable work in custom folding of lots of secondary weights for marks in different scripts into common, generic secondary weight values.

Rebuild the sifter and deploy.

Regenerate allkeys.txt and verify that the two new secondary weights have been added correctly, and that the secondary weight range shows in the output diagnostics as having been bumped up to 263.

My first run for this in fact turned up an underlying problem in my library when I checked for the magic number 263 and came up one short in the output. I had mistakenly given an Alphabetic value to 10D6A, which resulted in the sifter giving it a primary weight, instead of the desired secondary weight next to the other two gemination marks.
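The verification step above can also be scripted. The following is a rough sketch, not part of the sifter: it reads single-element entries in the allkeys.txt line format and checks that a list of marks has zero primaries and consecutive secondaries inside 0021..01FF. The function names are hypothetical, and the sample lines reproduce the entries reported for the three gemination marks.

```python
import re

# Secondary range cited in the commit message; primaries start at 0200.
SECONDARY_MIN = 0x0021
SECONDARY_MAX = 0x01FF

# Matches a single-element entry such as
#   11A98 ; [.0000.00D2.0002] # SOYOMBO GEMINATION MARK
ENTRY = re.compile(
    r"^([0-9A-F]{4,6})\s*;\s*"
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]"
)

SAMPLE = """\
11A98 ; [.0000.00D2.0002] # SOYOMBO GEMINATION MARK
10D6A ; [.0000.00D3.0002] # GARAY CONSONANT GEMINATION MARK
113D2 ; [.0000.00D4.0002] # TULU-TIGALARI GEMINATION MARK
"""


def parse_entries(text):
    """Map code point -> (primary, secondary, tertiary)."""
    weights = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            cp, pri, sec, ter = (int(g, 16) for g in match.groups())
            weights[cp] = (pri, sec, ter)
    return weights


def check_secondary_run(text, code_points):
    """Return a list of problems; an empty list means the run looks right."""
    weights = parse_entries(text)
    problems = []
    previous = None
    for cp in code_points:
        if cp not in weights:
            problems.append(f"{cp:04X}: no entry found")
            previous = None
            continue
        primary, secondary, _ = weights[cp]
        if primary != 0:
            problems.append(f"{cp:04X}: has primary weight {primary:04X}")
        if not SECONDARY_MIN <= secondary <= SECONDARY_MAX:
            problems.append(f"{cp:04X}: secondary {secondary:04X} out of range")
        if previous is not None and secondary != previous + 1:
            problems.append(f"{cp:04X}: secondary does not follow {previous:04X}")
        previous = secondary
    return problems


if __name__ == "__main__":
    print(check_secondary_run(SAMPLE, [0x11A98, 0x10D6A, 0x113D2]))
```

A mark that picked up a primary weight by mistake, as 10D6A did on the first run, would be reported by the first of the three checks.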
Correction of the library resulted in the expected output in allkeys.txt:

    11A98 ; [.0000.00D2.0002] # SOYOMBO GEMINATION MARK
    10D6A ; [.0000.00D3.0002] # GARAY CONSONANT GEMINATION MARK
    113D2 ; [.0000.00D4.0002] # TULU-TIGALARI GEMINATION MARK

For these secondary weight changes, another necessary check is to verify that the CTT generation picked up the correct new symbols. For that I check the ctt14651.txt output file:

    collating-symbol <D11A98> % SOYOMBO GEMINATION MARK
    collating-symbol <D10D6A> % GARAY CONSONANT GEMINATION MARK
    collating-symbol <D113D2> % TULU-TIGALARI GEMINATION MARK
    ...
    <D11A98> % SOYOMBO GEMINATION MARK
    <D10D6A> % GARAY CONSONANT GEMINATION MARK
    <D113D2> % TULU-TIGALARI GEMINATION MARK
    ...
    <U11A98> IGNORE;<D11A98>;<MIN>;<SFFFF> % SOYOMBO GEMINATION MARK
    <U10D6A> IGNORE;<D10D6A>;<MIN>;<SFFFF> % GARAY CONSONANT GEMINATION MARK
    <U113D2> IGNORE;<D113D2>;<MIN>;<SFFFF> % TULU-TIGALARI GEMINATION MARK

And everything seems to be in order there. If the tables in unisyms.c are not correctly updated, these symbol assignments for the CTT can easily end up with off-by-one errors, throwing the table completely out of whack.

4 more down, 0 to go. Yay!

Archive this delta 11: unidata-16.0.0d11.txt (1603293 bytes, 10/09/2023)

Generate decomps-16.0.0d11.txt (sifter -t unidata-16.0.0d11.txt) to document the changes in decompositions for this version of UCA. I diff this file against the released version for 15.1.0, decomps-15.1.0d4.txt, to see what changed. It shows all the new decompositions, including all the synthetic decomposition additions for collation. It also shows the order change for the two secondary decompositions for Kannada and Sinhala, discussed above.

----

From Ken via email:

As usual, there are also some small modifications to the sifter source code, particularly to deal with the introduction of new secondary weights for UCA 16.0.
I was able to pare away various secondaries and get the additions down to just two new secondary weights -- but any addition of a secondary requires some work on the sifter code. And there are always version updates and a few range updates required for a new version, as well.

Note that handling the full new set (1177 characters) for Unicode 16.0 in the sifter requires that some character properties outside of UnicodeData.txt also be updated correctly. The most important of these is Alphabetic, which depends on Other_Alphabetic in PropList.txt. To a lesser extent, the values for Diacritic and Extender (also in PropList.txt) might impact a few weight assignments.

I very strongly recommend that, before going much further on script-by-script UCA work, particularly for the abugidas, you first take a break to do a complete update of PropList.txt for 16.0. Neglecting that step will just lead to confusion and characters out of place later on in the UCA work, particularly when you get to Tulu-Tigalari.
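One way to catch a stray Alphabetic assignment before running the sifter is to look up the marks in question in PropList.txt directly. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the sample is a three-line excerpt in the PropList.txt line format rather than the 16.0 file, and it checks Other_Alphabetic only, not the full derived Alphabetic property.

```python
# Illustrative excerpt in the PropList.txt line format (not the 16.0 data).
SAMPLE_PROPLIST = """\
0345          ; Other_Alphabetic # Mn       COMBINING GREEK YPOGEGRAMMENI
05B0..05BD    ; Other_Alphabetic # Mn  [14] HEBREW POINT SHEVA..HEBREW POINT METEG
005E          ; Diacritic # Sk       CIRCUMFLEX ACCENT
"""


def parse_proplist(text, wanted_property):
    """Return (first, last) code point ranges listed for one property."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or fields[1] != wanted_property:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def has_property(code_point, ranges):
    """True if the code point falls inside any of the ranges."""
    return any(first <= code_point <= last for first, last in ranges)


if __name__ == "__main__":
    other_alpha = parse_proplist(SAMPLE_PROPLIST, "Other_Alphabetic")
    # Marks meant to receive secondary weights should not show up here.
    for cp in (0x10D6A, 0x113D2, 0x1E5EE, 0x1E5EF):
        print(f"{cp:04X}", has_property(cp, other_alpha))
```

Run against an updated PropList.txt, the same loop would show whether any of the four marks handled in this delta would be treated as alphabetic through Other_Alphabetic.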