Ucdxml and TR42 #859

Open: jowilco wants to merge 15 commits into base: main

jowilco commented Jun 6, 2024

PR to make it easy to see what changes have been made to support UCDXML.

jowilco changed the title from "Ucdxml preview" to "Ucdxml and TR42" on Oct 16, 2024
jowilco marked this pull request as ready for review on October 16, 2024 at 21:13
Author

jowilco commented Oct 16, 2024

Comment on June 6 is no longer valid - we're now ready for review.

Author

jowilco commented Oct 16, 2024

@macchiati @eggrobin @markusicu - Please can you review?

@@ -310,6 +313,15 @@ Unihan_Variants ; kSpoofingVariant
Unihan_Variants ; kTraditionalVariant
Unihan_Variants ; kZVariant
Member

Should there be a line for kZhuang here? (In other words, are you getting any data for kZhuang?)

Author

The current version of UCDXML does not support kZhuang, just kZhuangNumeric.
Similar to Unikemet, we should add support either for the revised 16.0 UCDXML files, or for 17.

Comment on lines 154 to 157
cjkRSTUnicode ; kRSTUnicode
cjkReading ; kReading
cjkSrc_NushuDuben ; kSrc_NushuDuben
cjkTGT_MergedSrc ; kTGT_MergedSrc
Member

Please revert this change: Tangut and Nüshu are not CJK, they should not have a cjk alias.

The name kReading is unfortunate (since this is really Nüshu-specific), but it is what it is.
I guess you should add the comment that I should have added, saying that these are the fields from the Tangut and Nüshu source files.

Author

Fixed.

default:
throw new RuntimeException("Missing Catalog case");
}
case Enumerated:
Member

This (and the associated pile of UnicodeMaps) seems like it is going to be a bit annoying to maintain as we add properties.
Is there a reason why you are not doing something like

final UnicodeProperty property = indexUnicodeProperties.getProperty(prop);
final List<String> valueAliases = property.getValueAliases(property.getValue(codepoint));
return valueAliases.size() == 1 ? valueAliases.get(0) : valueAliases.get(1);

for most of them (special-casing Decomposition_Type etc. as needed)?

Author

I think that I was assuming that there were going to be more special cases, but I agree that as there are not, your solution is better. Implemented.
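
Editorial sketch, not part of the PR: the selection rule from the suggested snippet, isolated so it can be tried without the Unicode Tools classes. The class name `ValueAliasSketch`, the method name `pickAlias`, and the alias lists are all hypothetical placeholders; nothing here claims how `getValueAliases` orders its results.

```java
import java.util.Arrays;
import java.util.List;

public class ValueAliasSketch {
    // Same rule as the review suggestion: with a single alias use it,
    // otherwise use the second entry. An empty list would throw, exactly
    // as the suggested one-liner would.
    static String pickAlias(List<String> valueAliases) {
        return valueAliases.size() == 1 ? valueAliases.get(0) : valueAliases.get(1);
    }

    public static void main(String[] args) {
        // Placeholder alias lists, not real UCD data.
        System.out.println(pickAlias(Arrays.asList("only")));
        System.out.println(pickAlias(Arrays.asList("first", "second")));
    }
}
```

Properties that need different handling (the comment names Decomposition_Type) would still be special-cased before falling through to this rule.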

Member

macchiati commented Nov 27, 2024 via email

Member

markusicu left a comment

Hi John, I took a peek -- not yet at the new .java files... -- and we discussed some high-level things in today's Unicode Tools meeting.

Some comments below.
For publication, we are thinking that we continue to copy two of the grouped files into the repo, but run the tool as part of a publication step, so that we don't check in all of the large, highly redundant files.

@@ -204,5 +215,5 @@ Confusable_MA ; SINGLE_VALUED ; $codePoints
#Emoji ; SINGLE_VALUED ; <enum>
#Emoji_Presentation ; SINGLE_VALUED ; <enum>
#Emoji_Modifier ; SINGLE_VALUED ; <enum>
-#Emoji_Modifier_Base ; SINGLE_VALUED ; <enum>
+#Emoji_Modifier_Base ; SINGLE_VALUED ; <enum>
Member

All other lines get their indentation fixed, but this one gets it un-fixed...?

Author

I've now replaced all the tab chars with spaces to avoid this issue appearing on editors with different tab settings.

Member

Documentation should go into https://github.com/unicode-org/unicodetools/tree/main/docs, as one of:

  • some existing file that covers UCDXML (not sure if there is one)
  • a new ucdxml.md there
  • a new index.md in a new ucdxml/ folder there

Author

Moved to docs/ucdxml.md

Member

The Unicode Tools normally work with internal data files stored in https://github.com/unicode-org/unicodetools/tree/main/unicodetools/src/main/resources/org/unicode

Is it necessary to create a new, separate folder structure outside of that?
Why not add a ucdxml/ folder in the usual place?

Author

Moved to resources/org/unicode/uax42/

Member

There are many little folders with one file each, and one folder with three. Is that necessary or useful? Can we flatten these all into one folder?

Some folders like "properties" are ok, but adding lots of mini folders seems cumbersome.

Author

Done

Member

Please no spaces in file names

Author

Done


## Step 1 - Generate property value fragments

- Run org.unicode.xml.GeneratePropertyValues to populate the UNICODETOOLS_REPO_DIR/uax/uax42/fragments/ folder.
Member

This will want to be a specific, reproducible mvn command line.

Author

Done, but please check to see if this is what you were thinking.


## Step 2 - Generate TR42 index.html and index.rnc

- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform`
Member

mvn command lines should run from the root as usual, see https://github.com/unicode-org/unicodetools/blob/main/docs/build.md for examples.

Author

I think I have what you want, but...


- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform`

index.html and index.rnc will be generated in UNICODETOOLS_REPO_DIR/uax/uax42/output/
Member

The output should go into a folder under UNICODETOOLS_GEN_DIR as usual, such as UNICODETOOLS_GEN_DIR/ucdxml/17.0.0/

Author

I think I have what you want, but...

1. Clone and build [jing-trang](https://github.com/relaxng/jing-trang)
2. Run the following:
```
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>
```
Member

mvn ...

Author

I think that we should probably discuss what is in-scope and what is out-of-scope for the process and the utility. The deliverables are:

  1. The UCDXML files
  2. The TR42 HTML file
  3. The RNC file

The fragment files are generated, but are then consumed by the TR42 HTML file and the RNC file. I think that these should be in the repo, especially as some of them are created manually.

Should any of the deliverables be stored in unicodetools?
If we are planning to validate the UCDXML files using the RNC files as part of the build, we'll need a process that incorporates all steps. Should that be considered a "unit test"?

```
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>
```
Note that the UAX xml file has to be saved as NFD as the Unihan syntax regular expressions are expecting NFD.
Member

Does the tool generate the data files in NFD?
It seems like the files should come out as needed for the tool chain as well as for publication.

Author

The data files are not NFD by default, and I'm not sure that we should change the format for publication.
The rest of your question ties back to my previous comment; we could add a step to create an NFD version of the UCDXML files as part of an end-to-end process. However, is that in scope?
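
Editorial sketch, not part of the PR: one way the extra step discussed here could look, writing an NFD copy of a UCDXML file for validation while leaving the published file untouched. The class name `NfdCopySketch` and its method names are hypothetical. It assumes the input is UTF-8 and uses the JDK's `java.text.Normalizer`, which follows the Unicode version bundled with the JDK rather than the version being generated; ICU4J's `Normalizer2.getNFDInstance()` would be the alternative if that matters.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;

public class NfdCopySketch {
    // Normalizes a string to NFD using the JDK's built-in normalizer.
    static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    // Streams the source file line by line so large files are not held in
    // memory. Line breaks are not affected by normalization, so per-line
    // NFD gives the same text as normalizing the whole file.
    // Note: line endings are rewritten as "\n".
    static void writeNfdCopy(Path source, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(toNfd(line));
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: NfdCopySketch <source.xml> <target.xml>");
            return;
        }
        writeNfdCopy(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The NFD copy would then be the file passed to jing, which keeps the question of the publication format separate from the validation step.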

markusicu requested a review from echeran on December 10, 2024 at 22:30
4 participants