make CLDR radical-stroke order = UAX38 #909

Merged (1 commit) on Aug 15, 2024

markusicu
Member

@markusicu markusicu commented Aug 14, 2024

Make the CLDR radical-stroke order of CJK ideographs match the order in UAX38 section 2.1.2 Sorting Algorithm Used by the Radical-Stroke Indexes.

For

Changes:

  • Change the 64-bit encoding of the sort order, moving the simplified-ness field below the number of residual strokes.
    • As a result, the traditional and simplified forms of characters are intermingled.
    • Therefore, any output grouped by radical changes from multiple "buckets" such as 182, 182', 182'' to a single "bucket" for radical 182.
    • Code-wise, many places that used to work with a radicalNumberAndSimplified now just work with the radicalNumber.
  • Change the "extension" field from 1 bit, which worked as in UTS10 Implicit Weights, to a 4-bit field which, together with the code point, yields the same relative order as the UAX38 block|code point pair.
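
To illustrate the effect of moving the simplified-ness field below the residual strokes, here is a minimal sketch. It is not the code in RadicalStroke.java: the class name, method names, and field widths are hypothetical, the block field is left at 0, and the residual stroke counts used in main are assumed for illustration. Only the relative position of the two fields is taken from the description above.

```java
final class RadicalStrokeOrderSketch {
    // Hypothetical field widths, chosen for illustration only.
    static final int STROKES_BITS = 8;
    static final int SIMPLIFIED_BITS = 2;   // 0 = unsimplified, 1..3 = apostrophe count
    static final int BLOCK_BITS = 4;        // stand-in for the 4-bit "extension" field
    static final int CODE_POINT_BITS = 21;

    /** Previous layout in this sketch: simplified-ness sorts above residual strokes. */
    static long oldOrder(int radical, int strokes, int simplified, int block, int codePoint) {
        long key = radical;
        key = (key << SIMPLIFIED_BITS) | simplified;
        key = (key << STROKES_BITS) | strokes;
        key = (key << BLOCK_BITS) | block;
        return (key << CODE_POINT_BITS) | codePoint;
    }

    /** New layout in this sketch: simplified-ness sorts below residual strokes. */
    static long newOrder(int radical, int strokes, int simplified, int block, int codePoint) {
        long key = radical;
        key = (key << STROKES_BITS) | strokes;
        key = (key << SIMPLIFIED_BITS) | simplified;
        key = (key << BLOCK_BITS) | block;
        return (key << CODE_POINT_BITS) | codePoint;
    }

    public static void main(String[] args) {
        // 風 U+98A8 (radical 182), 风 U+98CE (radical 182'), 颩 U+98A9 (radical 182).
        // The residual stroke counts 0, 0, and 3 are assumed for illustration.
        long oldTrad = oldOrder(182, 0, 0, 0, 0x98A8);
        long oldSimp = oldOrder(182, 0, 1, 0, 0x98CE);
        long oldMore = oldOrder(182, 3, 0, 0, 0x98A9);
        long newTrad = newOrder(182, 0, 0, 0, 0x98A8);
        long newSimp = newOrder(182, 0, 1, 0, 0x98CE);
        long newMore = newOrder(182, 3, 0, 0, 0x98A9);

        // Old layout: every 182 character sorts before any 182' character.
        System.out.println(oldTrad < oldMore && oldMore < oldSimp);
        // New layout: traditional and simplified forms are intermingled by stroke count.
        System.out.println(newTrad < newSimp && newSimp < newMore);
    }
}
```

With the new field order, a simplified character with few residual strokes sorts between traditional characters of the same radical, which is consistent with the merged radical 182 line in the sample diffs.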

Other code changes

  • Some code worked with a 32-bit "short data" version of the 64-bit "order", by omitting some low-level fields and shifting down the high-level ones. I already had a TODO questioning the duplication. Now that I would have had to fiddle with the bit shifts & masks in two places, I switched the code to always work with the "long order".
  • I had two parsers for the radical-and-simplified string. I created a shared function.
  • There is a new TODO: more original-Unihan characters than before now have a code point order that differs from the CLDR (now UAX38) order. I am not planning to pursue this right now.

The modified output goes into CLDR:

The full radical-stroke order is printed into the FractionalUCA.txt file there. (file format documentation)
See the CLDR PR for the diffs.

This change also affects some of the CLDR Chinese tailoring data.

Sample FractionalUCA [radical] data diffs:

-[radical 1=⼀一:一𪛙丁-丆𠀀-𠀂𬺰𰀀万-丌亐卄𠀃-𠀆𪛚𪜀𪜁𫝀𬺱-𬺴𰀁-𰀄不-专丗𠀇-𠀌𪜂𫠡𬺵-𬺹𮯰𰀅-𰀇且-世丘-丝㐀𠀍-𠀗𫠢𫠣𬺺-𬺾𰀈-𰀊𱍐丞-丢㐁㐂𠀘-𠀚𠀜𠀞-𠀠𫝁𫠤𫠥𬺿-𬻉𰀋𱍑丣-严丽鿖𠀡-𠀤𠀦-𠀨𠀪𠀫𫝂𫠦-𫠩𬻊-𬻒𰀌𱍒並丧𠀬-𠀮𠀰-𠀴𪜃𫠪-𫠭𬻓-𬻘𰀍𱍓-𱍗鿗𠀵𠀶𠀸𠀺𠀻𪜄𫠮𬻙-𬻝𰀎-𰀑𠀽-𠁀𠤢𪜅𫠯-𫠲𬻞-𬻠𰀒-𰀕𱍘-𱍝𠁁-𠁅𪜆𫠳-𫠵𬻡-𬻥𱍞𱍟𠁆-𠁈𠁊𠁋𫠶𬻦-𬻨𰀖-𰀘𱍠𱍡𠁌𠁍𫠷-𫠼𬻩-𬻮𰀙𰀚𱍢-𱍤𠁎-𠁒𫝃𫠽𬻯𰀛𰀜𱍥䶶𠁓𠁔𫠾𫠿𬻰𰀝𱍦𱍧𠁕𠁗-𠁛𠁝𤳏𪜇𫡀𱍨𠁖𰀞𱍩𠁟𫡁𫡂𠁠𰀟𬻱𱍪]
+[radical 1=⼀一:一𪛙丁-丆𠀀-𠀂𬺰𰀀万-丌亐卄𠀃-𠀆𪛚𪜀𪜁𫝀𬺱-𬺴𰀁-𰀄不-专丗𠀇-𠀌𪜂𫠡𬺵-𬺹𰀅-𰀇𮯰且-世丘-丝㐀𠀍-𠀗𫠢𫠣𬺺-𬺾𰀈-𰀊𱍐丞-丢㐁㐂𠀘-𠀚𠀜𠀞-𠀠𫝁𫠤𫠥𬺿-𬻉𰀋𱍑丣-严丽鿖𠀡-𠀤𠀦-𠀨𠀪𠀫𫝂𫠦-𫠩𬻊-𬻒𰀌𱍒並丧𠀬-𠀮𠀰-𠀴𪜃𫠪-𫠭𬻓-𬻘𰀍𱍓-𱍗鿗𠀵𠀶𠀸𠀺𠀻𪜄𫠮𬻙-𬻝𰀎-𰀑𠀽-𠁀𠤢𪜅𫠯-𫠲𬻞-𬻠𰀒-𰀕𱍘-𱍝𠁁-𠁅𪜆𫠳-𫠵𬻡-𬻥𱍞𱍟𠁆-𠁈𠁊𠁋𫠶𬻦-𬻨𰀖-𰀘𱍠𱍡𠁌𠁍𫠷-𫠼𬻩-𬻮𰀙𰀚𱍢-𱍤𠁎-𠁒𫝃𫠽𬻯𰀛𰀜𱍥䶶𠁓𠁔𫠾𫠿𬻰𰀝𱍦𱍧𠁕𠁗-𠁛𠁝𤳏𪜇𫡀𱍨𠁖𰀞𱍩𠁟𫡁𫡂𠁠𰀟𬻱𱍪]
...
 [radical 179=⾲韭:韭韮䪞𩐁𩐂𲊦𱂍韯䪟𩐃韰𩐄韱䪠𩐅-𩐈韲䪡䪢𩐉𩐊䪣𩐋𩐍𩐎𱂎䪤𩐌𩐏-𩐓䪥𩐔-𩐖]
-[radical 180=⾳音:音竟章䪦-䪨𩐗𮧶𮧷𮸱𱂏韴韵䪩𩐘𩐙𫖗𮸲𮸳韶韷䪪𩐚-𩐝𫖘𬰹-𬰻𮧸𩐞-𩐦𬰼𮧹𮧺𲊧韸䪫䪬𩐧-𩐬𬰽𮧻𱂐𱂑𩐭-𩐰𲊨韹韺䪭𩐱-𩐴𫖙𮧼𱂒𱂓𲊩韻韼䪮䪯𩐵-𩐸𮧽韽-響𩐹-𩐾𫖚𩐿-𩑁𫖛𮧾䪰𩑂-𩑆𮧿頀𩑇𩑈𫖜𬰾𩑉𩑊]
-[radical 181=⾴頁:頁𩑋頂-頄𩑌-𩑏𬰿項-頉䪱䪲𩑐-𩑘𬱀頊-頓䪳-䪵𩑙-𩑯𫖝𮨀-𮨂𱂔𱂕頔-頚䪶-䪾𩑰-𩒎𫖞𬱁𬱂𮨃-𮨆𲊪-𲊬頛-頣頦-頬䪿-䫂𩒏-𩒭𬱃𮨇-𮨊𱂖𱂗𲊭𲊮頤頥頭-頽䫃-䫊𩒮-𩓜𫖟𫖠𬱄-𬱇𮨋𮨌𱂘𱂙𲊯頿-顊䫋-䫓𩓝-𩓿𫖡𬱈𬱉𮨍-𮨔𲊰-𲊲頾顋-顕䫔-䫝𩔀-𩔘𫖢𫖣𬱊𬱋𮨕𮨖𲊳𲊴顖-類䫞-䫧𩔙-𩔲𫖤𮨗-𮨛𱂚-𱂜𲊵𲊶顟-顣䫨-䫫𩔳-𩕈𫖥𫖦𬱌𬱍𮨜𮨝𲊷顤-顨䫬-䫱𩕉-𩕞𫖧𬱎𮨞𮨟𱂝𲊸顩-顫䫲-䫴𩕟-𩕫𫖨𬱏𮨠𮨡顬-顯𩕬-𩕽𱂞顰䫵䫶𩕾-𩖅𫖩𬱐𮨢𮨣顱顲䫷𩖆-𩖈𮨤𮨥𩖉-𩖎𬱑𱂟顳顴𩖏-𩖓𬱒]
-[radical 181'=⻚页:页-顷𬱓顸-须𫖪𮸴𱂠𲊹顼-预𫖫𫠆𬱔𬱕𱂡颅-颈𫖬𫖭𬱖-𬱚𱂢𲊺𲊻颉-颏𫖮-𫖱𬱛-𬱢𱂣-𱂨颐-颖𫖲𫖳𬱣-𬱥𱂩-𱂬𲊼𲊽颗𩖕𩖖𫖴-𫖶𬱦-𬱬𮸵𱂭-𱂰题-额𫖷𬱭-𬱯𮸶-𮸸𱂱-𱂳𲊾-𲋀颞-颡𫖸𬱰𱂴-𱂹𲋁𫖹𱂺颢颣𬱱𮸹𮸺𱂻𲋂颤𩖗𲋃颥𬱲颦𫖺颧𬱳]
-[radical 182=⾵風:風䫸𩖘𩖙𮨦颩颪䫹𩖚-𩖡颫颬䫺-䫽𩖢-𩖯𩖱-𩖳𫖻𮨧𱂼颭-颱䫾-䬃𩖴-𩗃𮨨𱂽-𱂿颲颳䬄䬅𠙬𩗄-𩗒𮨩-𮨫颴颵䬆-䬊𩗓-𩗧𮨬𱃀-𱃂𲋅颶颷䬋-䬐𩗨-𩘄𮨭-𮨯𱃃-𱃆颸-颺䬑-䬗𩘅-𩘍𩘏-𩘛𬱴𱃇-𱃉𲋆颻-飀䬘-䬚𩘎𩘜-𩘬𮨰𱃊飁-飄䬛䬜𩘭-𩘷𮨱𱃋-𱃍飅-飊䬝𩘸-𩙇飋𩙈-𩙋𩙍𮨲𱃎𱃏𲋉𲋊䬞𩙎-𩙐𫗅𲋌䬟𩙑-𩙕𫗆𱃐𲋍𩙖-𩙜𱃑飌飍𩙝𩙞𱃒𱃓𩙟𮨳𩙠-𩙤]
-[radical 182'=⻛风:风飏𱃔𫗇𫠇𬱵𬱷𱃕𲋎𲋏飐-飒𩙥𩙦𫠈𬱸𬱺𱃖𱃗𲋐𬱼𱃘-𱃚𩙧𫗈𬱽𱃛飓𩙨-𩙪𫗉𬱾-𬲀𮨴𮸻𱃜𱃝𲋑飔飖𩙫𩙬𫗊𱃞飕飗𩙭𩙮飘𮨵𱃟飙飚𩙯𬲅𬲆𮸼𱃠𩙰𫗋]
-[radical 182''=𲋄:𲋄𬱶𫖼𬱹𬱻𫖽-𫖿𬲁𬲂𲋇𲋈𫗀-𫗂𬲃𬲄𩙌𫗃𫗄𬲇𲋋𬲈]
+[radical 180=⾳音:音竟章䪦-䪨𩐗𮧶𮧷𱂏𮸱韴韵䪩𩐘𩐙𫖗𮸲𮸳韶韷䪪𩐚-𩐝𫖘𬰹-𬰻𮧸𩐞-𩐦𬰼𮧹𮧺𲊧韸䪫䪬𩐧-𩐬𬰽𮧻𱂐𱂑𩐭-𩐰𲊨韹韺䪭𩐱-𩐴𫖙𮧼𱂒𱂓𲊩韻韼䪮䪯𩐵-𩐸𮧽韽-響𩐹-𩐾𫖚𩐿-𩑁𫖛𮧾䪰𩑂-𩑆𮧿頀𩑇𩑈𫖜𬰾𩑉𩑊]
+[radical 181=⾴頁⻚页:頁𩑋页頂-頄𩑌-𩑏𬰿顶顷𬱓項-頉䪱䪲𩑐-𩑘𬱀顸-须𫖪𱂠𲊹𮸴頊-頓䪳-䪵𩑙-𩑯𫖝𮨀-𮨂𱂔𱂕顼-预𫖫𫠆𬱔𬱕𱂡頔-頚䪶-䪾𩑰-𩒎𫖞𬱁𬱂𮨃-𮨆𲊪-𲊬颅-颈𫖬𫖭𬱖-𬱚𱂢𲊺𲊻頛-頣頦-頬䪿-䫂𩒏-𩒭𬱃𮨇-𮨊𱂖𱂗𲊭𲊮颉-颏𫖮-𫖱𬱛-𬱢𱂣-𱂨頤頥頭-頽䫃-䫊𩒮-𩓜𫖟𫖠𬱄-𬱇𮨋𮨌𱂘𱂙𲊯颐-颖𫖲𫖳𬱣-𬱥𱂩-𱂬𲊼𲊽頿-顊䫋-䫓𩓝-𩓿𫖡𬱈𬱉𮨍-𮨔𲊰-𲊲颗𩖕𩖖𫖴-𫖶𬱦-𬱬𱂭-𱂰𮸵頾顋-顕䫔-䫝𩔀-𩔘𫖢𫖣𬱊𬱋𮨕𮨖𲊳𲊴题-额𫖷𬱭-𬱯𱂱-𱂳𲊾-𲋀𮸶-𮸸顖-類䫞-䫧𩔙-𩔲𫖤𮨗-𮨛𱂚-𱂜𲊵𲊶颞-颡𫖸𬱰𱂴-𱂹𲋁顟-顣䫨-䫫𩔳-𩕈𫖥𫖦𬱌𬱍𮨜𮨝𲊷𫖹𱂺顤-顨䫬-䫱𩕉-𩕞𫖧𬱎𮨞𮨟𱂝𲊸颢颣𬱱𱂻𲋂𮸹𮸺顩-顫䫲-䫴𩕟-𩕫𫖨𬱏𮨠𮨡颤𩖗𲋃顬-顯𩕬-𩕽𱂞颥𬱲顰䫵䫶𩕾-𩖅𫖩𬱐𮨢𮨣颦𫖺顱顲䫷𩖆-𩖈𮨤𮨥𩖉-𩖎𬱑𱂟颧𬱳顳顴𩖏-𩖓𬱒]
+[radical 182=⾵風⻛风𲋄:風风𲋄䫸𩖘𩖙𮨦颩颪䫹𩖚-𩖡飏𱃔颫颬䫺-䫽𩖢-𩖯𩖱-𩖳𫖻𮨧𱂼𫗇𫠇𬱵𬱷𱃕𲋎𲋏𬱶颭-颱䫾-䬃𩖴-𩗃𮨨𱂽-𱂿飐-飒𩙥𩙦𫠈𬱸𬱺𱃖𱃗𲋐𫖼𬱹𬱻颲颳䬄䬅𠙬𩗄-𩗒𮨩-𮨫𬱼𱃘-𱃚颴颵䬆-䬊𩗓-𩗧𮨬𱃀-𱃂𲋅𩙧𫗈𬱽𱃛颶颷䬋-䬐𩗨-𩘄𮨭-𮨯𱃃-𱃆飓𩙨-𩙪𫗉𬱾-𬲀𮨴𱃜𱃝𲋑𮸻𫖽颸-颺䬑-䬗𩘅-𩘍𩘏-𩘛𬱴𱃇-𱃉𲋆飔飖𩙫𩙬𫗊𱃞𫖾𫖿𬲁𬲂𲋇𲋈颻-飀䬘-䬚𩘎𩘜-𩘬𮨰𱃊飕飗𩙭𩙮𫗀-𫗂𬲃𬲄飁-飄䬛䬜𩘭-𩘷𮨱𱃋-𱃍飘𮨵𱃟飅-飊䬝𩘸-𩙇飙飚𩙯𬲅𬲆𱃠𮸼飋𩙈-𩙋𩙍𮨲𱃎𱃏𲋉𲋊𩙰𫗋𩙌𫗃𫗄𬲇𲋋䬞𩙎-𩙐𫗅𲋌䬟𩙑-𩙕𫗆𱃐𲋍𩙖-𩙚𬲈𩙛𩙜𱃑飌飍𩙝𩙞𱃒𱃓𩙟𮨳𩙠-𩙤]

Member

@kenlunde kenlunde left a comment

It is unclear whether this change accommodates a second non-Chinese simplified radical, which is expressed using three apostrophes, and which is new for Unicode Version 16.0. See the Proposed Update of UAX #38.

@markusicu
Member Author

> It is unclear whether this change accommodates a second non-Chinese simplified radical, which is expressed using three apostrophes, and which is new for Unicode Version 16.0. See the Proposed Update of UAX #38.

I did that three months ago in one of the radical-and-simplified parsers:

The refactored parser here, at the end of RadicalStroke.java, handles up to three apostrophes, quoting the 16.0 proposed.html version of UAX38.
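
For readers who want a concrete picture, parsing a radical-and-simplified string with up to three apostrophes could look roughly like the following. This is a hedged sketch, not the parser at the end of RadicalStroke.java: the class name, method name, and return shape are hypothetical, and the range 1..214 assumes the usual Kangxi radical numbering.

```java
final class RadicalParserSketch {
    /**
     * Parses a string such as "182" or "182''" (hypothetical helper).
     * Returns {radicalNumber, apostropheCount}.
     */
    static int[] parseRadical(String s) {
        // Strip trailing apostrophes and count them.
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\'') {
            end--;
        }
        int apostrophes = s.length() - end;
        if (apostrophes > 3) {
            throw new IllegalArgumentException("too many apostrophes: " + s);
        }
        if (end == 0 || end > 3) {
            throw new IllegalArgumentException("bad radical number: " + s);
        }
        // Parse the remaining one to three decimal digits.
        int radical = 0;
        for (int i = 0; i < end; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("bad radical number: " + s);
            }
            radical = radical * 10 + (c - '0');
        }
        if (radical < 1 || radical > 214) {
            throw new IllegalArgumentException("radical out of range: " + s);
        }
        return new int[] {radical, apostrophes};
    }

    public static void main(String[] args) {
        int[] parsed = parseRadical("182''");
        System.out.println(parsed[0] + " with " + parsed[1] + " apostrophes");
    }
}
```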

@markusicu
Member Author

PS: There is no character with a three-apostrophe radical as the primary kRSUnicode value. There are only two characters where such a radical is in the secondary value. This code only uses the primary value. That's why I had missed updating that version of the parser before -- it never saw a triple apostrophe.
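
As an illustration of why a parser that only reads the primary value never encounters a triple apostrophe in a secondary value, here is a small sketch. It assumes that multiple kRSUnicode values are separated by spaces and that the first one is the primary value. The class and method names are hypothetical, and the two-value string in main is made up for illustration, not taken from Unihan data.

```java
final class PrimaryRsSketch {
    /** Returns the first space-separated value of a kRSUnicode string. */
    static String primaryValue(String kRSUnicode) {
        String trimmed = kRSUnicode.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }

    /** True if the primary value uses a three-apostrophe radical. */
    static boolean primaryHasTripleApostrophe(String kRSUnicode) {
        return primaryValue(kRSUnicode).contains("'''");
    }

    public static void main(String[] args) {
        // Hypothetical two-value string: the triple apostrophe is only in the secondary value.
        String made_up = "182.4 182'''.4";
        System.out.println(primaryValue(made_up));
        System.out.println(primaryHasTripleApostrophe(made_up));
    }
}
```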

@kenlunde
Member

Ah, good. For reference, draft code point U+3347B in the Extension J block (Unicode Version 17.0) will have 212'''.4 as its primary kRSUnicode property value.

@markusicu
Member Author

FYI: The output CLDR files are now in unicode-org/cldr#3960. I have updated the description of this PR here with a link for that as well.

@markusicu markusicu requested a review from nedley August 15, 2024 01:58
@markusicu
Member Author

Could I get a review please? Or maybe a rubber stamp?
@macchiati?
@pedberg-icu approved the output in CLDR, maybe you could look at the generator changes here?

@macchiati
Member

macchiati commented Aug 15, 2024 via email

@markusicu
Member Author

done

tnx!

@markusicu markusicu merged commit 6acc90b into unicode-org:main Aug 15, 2024
28 checks passed
@markusicu markusicu deleted the rad-stroke-uax38 branch August 15, 2024 18:08
@markusicu markusicu added the uca label Aug 27, 2024