Commit

Merge remote-tracking branch 'la-vache/main' into modifier-ψ-and-ω
eggrobin committed Jun 6, 2024
2 parents d280919 + 184d7e5 commit f741099
Showing 19 changed files with 437 additions and 211 deletions.
48 changes: 47 additions & 1 deletion .github/workflows/cli-build-instructions.yml
@@ -275,6 +275,22 @@ jobs:
with:
repository: unicode-org/unicodetools
path: unicodetools/mine/src
- name: Checkout base UnicodeData.txt
if: ${{ github.event_name == 'pull_request'}}
uses: actions/checkout@v3
with:
ref: ${{ github.event.pull_request.base.sha }}
path: base
sparse-checkout: unicodetools/data/ucd/dev/UnicodeData.txt
- name: Compare repertoire
if: ${{ github.event_name == 'pull_request'}}
run: |
# Look for changes affecting the first two fields of UnicodeData.txt (code point and name).
sed 's/^\([^;]*;[^;]*\);.*$/\1/' unicodetools/mine/src/unicodetools/data/ucd/dev/UnicodeData.txt > merged-repertoire.txt
sed 's/^\([^;]*;[^;]*\);.*$/\1/' base/unicodetools/data/ucd/dev/UnicodeData.txt > base-repertoire.txt
set +e
diff base-repertoire.txt merged-repertoire.txt
echo "REPERTOIRE_CHANGED=$?" >> "$GITHUB_ENV"
- name: Get the CLDR_REF from pom.xml
id: cldr_ref
run: echo "CLDR_REF="$(mvn --file unicodetools/mine/src/pom.xml help:evaluate -Dexpression=cldr.version -q -DforceStdout | cut -d- -f3) >> $GITHUB_OUTPUT && cat ${GITHUB_OUTPUT}
@@ -316,6 +332,10 @@ jobs:
- name: Run command - UCA - collation validity log
run: |
cd unicodetools/mine/src
echo "REPERTOIRE_CHANGED=$REPERTOIRE_CHANGED"
if [[ ${REPERTOIRE_CHANGED:-0} -ne 0 ]]
then set +e
fi
# invoke main() in class ...UCA.Main
mvn -s .github/workflows/mvn-settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.UCA.Main" -Dexec.args="writeCollationValidityLog ICU" -am -pl unicodetools -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=$CURRENT_UVERSION
# check for output file
@@ -333,6 +353,22 @@ jobs:
with:
repository: unicode-org/unicodetools
path: unicodetools/mine/src
- name: Checkout base UnicodeData.txt
if: ${{ github.event_name == 'pull_request'}}
uses: actions/checkout@v3
with:
ref: ${{ github.event.pull_request.base.sha }}
path: base
sparse-checkout: unicodetools/data/ucd/dev/UnicodeData.txt
- name: Compare repertoire
if: ${{ github.event_name == 'pull_request'}}
run: |
# Look for changes affecting the first two fields of UnicodeData.txt (code point and name).
sed 's/^\([^;]*;[^;]*\);.*$/\1/' unicodetools/mine/src/unicodetools/data/ucd/dev/UnicodeData.txt > merged-repertoire.txt
sed 's/^\([^;]*;[^;]*\);.*$/\1/' base/unicodetools/data/ucd/dev/UnicodeData.txt > base-repertoire.txt
set +e
diff base-repertoire.txt merged-repertoire.txt
echo "REPERTOIRE_CHANGED=$?" >> "$GITHUB_ENV"
- name: Get the CLDR_REF from pom.xml
id: cldr_ref
run: echo "CLDR_REF="$(mvn --file unicodetools/mine/src/pom.xml help:evaluate -Dexpression=cldr.version -q -DforceStdout | cut -d- -f3) >> $GITHUB_OUTPUT && cat ${GITHUB_OUTPUT}
@@ -372,6 +408,16 @@ jobs:
- name: Run invariant tests
run: |
cd unicodetools/mine/src
MAVEN_OPTS="-ea" mvn -s .github/workflows/mvn-settings.xml test -am -pl unicodetools -Dtest=TestTestUnicodeInvariants#testSecurityInvariants -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=$CURRENT_UVERSION -DfailIfNoTests=false -DEMIT_GITHUB_ERRORS
echo "REPERTOIRE_CHANGED=$REPERTOIRE_CHANGED"
if [[ ${REPERTOIRE_CHANGED:-0} -ne 0 ]]
then ERROR="::notice"
else ERROR="::error"
fi
MAVEN_OPTS="-ea" mvn -s .github/workflows/mvn-settings.xml test -am -pl unicodetools -Dtest=TestTestUnicodeInvariants#testSecurityInvariants -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=$CURRENT_UVERSION -DfailIfNoTests=false -DEMIT_GITHUB_ERRORS 2>&1 | sed "s/^::error/$ERROR/"
STATUS=${PIPESTATUS[0]}
if [[ ${REPERTOIRE_CHANGED:-0} -ne 0 ]]
then exit 0
else exit $STATUS
fi
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
11 changes: 8 additions & 3 deletions docs/pipeline.md
@@ -49,7 +49,7 @@ Indic scripts only:
- [ ] Commit

---
- [ ] PropsList.txt — Add Other_Alphabetic, Diacritic, and Extender to satisfy invariants, or to taste
- [ ] PropsList.txt — Add Other_Alphabetic, Other_Lowercase, Diacritic, and Extender to satisfy invariants, or to taste
- [ ] Commit

---
@@ -67,8 +67,13 @@ PR preparation:
- [ ] PR button — Set to DRAFT pull request
- unless approved for the upcoming version
- [ ] PR button — Press
- The **Check UCA data** CI check might fail; many character additions need separate handling there,
but that is out of scope for the PAG work of preparing `data-for-new`. This will get resolved later.
- The **Check UCA data** and **Check security data invariants** CI checks are
suppressed; many character additions need separate handling there,
but that is out of scope for the PAG work of preparing `data-for-new`,
so reporting those failures could distract from real issues
in the UCD invariants.
UCA and security data issues are addressed later in the process,
before the start of β review.

## Scripts

54 changes: 39 additions & 15 deletions docs/security.md
@@ -9,8 +9,8 @@ machine-generated, then tweaked. They have names like
source/confusables-winFonts.txt. The main file is confusables-source.txt.

***There is fairly complex processing for the confusables, so carefully diff the
results. Sometimes you may get an unexpected union of two equivalence sets. Look
at Testing below for help.***
results. Sometimes you may get an unexpected union of two equivalence sets.
Look at Testing below for help.***

Look at the following spreadsheets / bugs to see if there are any additional
suggestions.
@@ -19,17 +19,38 @@ suggestions.
Suggestions](https://docs.google.com/spreadsheet/ccc?key=0ArRWBHdd5mx-dHRXelRVbXRYSVp2QTNDdTBlV1I5X1E&usp=drive_web#gid=0)**
* **[Identifier Restriction
Suggestions](https://docs.google.com/spreadsheet/ccc?key=0ArRWBHdd5mx-dEJJWkdzZzk4cDRYbEVLTmhraGN0Q3c&usp=drive_web#gid=0)**
* *[Unicode
Bugs](http://www.unicode.org/edcom/bugtrack/query?status=accepted&status=assigned&status=new&status=reopened&group=component&order=priority&col=id&col=summary&col=status&col=type&col=priority&col=milestone&col=component&owner=mark&report=10)
(under TR #36/39)*\
:construction: **TODO**: That Trac instance is gone.
Markus thinks we decided that there was nothing useful in it,
and deleted it without saving data. Check with Mark.
* *[Sample PRs](https://github.com/unicode-org/unicodetools/pull/841)

If so, assess and add to unicodetools/data/security/{version}/data/source/confusables-source.txt — *if needed.*

Then in the spreadsheets, move the "new stuff" line to the end.

### File Format
There is a brief description of the file format at the top.
Each line represents a mapping from a code point or set of code points to a sequence of one or more code points.

For example:
```
0021 ; 01C3 # ( ! → ǃ) EXCLAMATION MARK → LATIN LETTER RETROFLEX CLICK
```

The ordering of characters doesn't matter.
So it doesn't matter whether you have the above line, or
```
01C3 ; 0021 # ( ǃ → !) LATIN LETTER RETROFLEX CLICK → EXCLAMATION MARK
```
It also doesn't matter if you have identical lines; the second one will be a NOOP.

The mappings are used to generate equivalence classes.
From each equivalence class, one representative member will be chosen,
and in the resulting data file, all the other characters will map to that representative.
Because of transitivity, the equivalence class will tend to be somewhat looser than expected.
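
The grouping described above can be illustrated with a small union-find sketch. This is not the GenerateConfusables implementation. It assumes single code points on both sides (the real format allows sequences on the right), and the rule for picking the representative (smallest code point) is a placeholder, since the text above does not say how the representative is chosen. The names `build_classes` and `representative_map` are hypothetical.

```python
from collections import defaultdict


def build_classes(pairs):
    """Group items into equivalence classes from unordered (a, b) pairs."""
    parent = {}

    def find(x):
        # Return the root of x's class, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # join the two classes

    classes = defaultdict(set)
    for item in list(parent):
        classes[find(item)].add(item)
    return list(classes.values())


def representative_map(classes):
    """Map each non-representative member to its class representative.

    ASSUMPTION: the representative is the smallest code point; the real
    tool may use a different rule.
    """
    result = {}
    for members in classes:
        rep = min(members)
        for member in members:
            if member != rep:
                result[member] = rep
    return result


# The two orderings of the example line, plus a duplicate, give one class.
lines = [(0x0021, 0x01C3), (0x01C3, 0x0021), (0x0021, 0x01C3)]
classes = build_classes(lines)
mapping = representative_map(classes)
```

Because classes are joined transitively, the pairs `(1, 2)` and `(2, 3)` put `1` and `3` in the same class even though no line relates them directly. That is the sense in which the classes end up looser than expected.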

We've discussed possible future enhancements:
- Have a second, narrower mapping that is more exact.
- Allow for mappings from sequences to sequences (instead of just code points to sequences).
- Provide for context, perhaps like the Transform rules.
Eg [x { a } y → A](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Aarabic_type%3A%5D&g=&i=)

## Before generating

First, in CLDR, update the script metadata:
@@ -51,13 +72,10 @@ Run GenerateConfusables -c -b to generate the files. They will appear in two pla
* reformatted source, log
* $UNICODETOOLS_DIR/data/security/11.0.0/* *including log.txt*

**Run TestSecurity to verify that the confusable mappings are idempotent!**
The TestSecurity.java test is part of the unit test suite, run by a github CI.
It verifies that the confusable mappings are idempotent.
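
As a rough illustration of what idempotence means here: applying the mappings to an already-mapped target should change nothing. The sketch below is not the logic of TestSecurity.java. It assumes the mappings are available as a plain dictionary from a character to its target string and ignores normalization; the function names are hypothetical.

```python
def apply_mapping(mapping, text):
    """Replace each character by its mapped sequence; leave others alone."""
    return "".join(mapping.get(ch, ch) for ch in text)


def non_idempotent_sources(mapping):
    """Return source characters whose target changes when mapped again."""
    bad = []
    for source, target in mapping.items():
        if apply_mapping(mapping, target) != target:
            bad.append(source)
    return sorted(bad)


# U+01C3 maps to "!" and "!" itself is unmapped, so this is idempotent.
ok = non_idempotent_sources({"\u01c3": "!"})

# Placeholder letters: "x" maps to "y", but "y" maps on to "z",
# so the target of "x" is not stable.
broken = non_idempotent_sources({"x": "y", "y": "z"})
```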

With the same VM arguments as the generator.
Starting in 2021q3, TestSecurity needs to be run as a JUnit test.
It is also now part of the unit test suite and run on GitHub CI.

Copy the following from the output directory to the top level of the revision directory:
Copy the following from the output directory to the top level of the revision directory, and check in.

* confusables.txt
* confusablesSummary.txt
@@ -66,6 +84,12 @@ Copy the following from the output directory to the top level of the revision di
* ReadMe.txt
* xidmodifications.txt

### Review

Review the mappings to make sure that there are no surprises.
The biggest issue is if two equivalence classes are mistakenly joined.
For example, if you map b to d, then that will join the equivalence class for b with that of d.
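
One way to look for such joins during review is to compare the classes before and after a change and flag any new class that spans more than one old class. The sketch below assumes both sets of classes are already available as Python sets; the function name and the sample members other than b and d are placeholders, not real confusable data.

```python
def find_merged_classes(old_classes, new_classes):
    """Return (new_class, [old classes it absorbed]) for every join."""
    owner = {}
    for index, members in enumerate(old_classes):
        for member in members:
            owner[member] = index

    merged = []
    for members in new_classes:
        # Members absent from the old data are simply new, not a join.
        old_indices = {owner[m] for m in members if m in owner}
        if len(old_indices) > 1:
            absorbed = [old_classes[i] for i in sorted(old_indices)]
            merged.append((members, absorbed))
    return merged


# Placeholder data: mapping b to d joins the two old classes.
old = [{"b", "b2"}, {"d", "d2"}]
new = [{"b", "b2", "d", "d2"}]
joins = find_merged_classes(old, new)
```

Here `joins` has a single entry pairing the new four-member class with the two old classes it absorbed; an empty result means no classes were joined.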

### IdentifierStatus.txt & IdentifierType.txt

Markus 2020-feb-07 for Unicode 13.0:
34 changes: 16 additions & 18 deletions unicodetools/data/emoji/dev/emoji-test.txt
@@ -1,5 +1,5 @@
# emoji-test.txt
# Date: 2024-05-01, 21:25:24 GMT
# Date: 2024-06-04, 16:46:01 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1751,12 +1751,12 @@
1F936 1F3FD ; fully-qualified # 🤶🏽 E3.0 Mrs. Claus: medium skin tone
1F936 1F3FE ; fully-qualified # 🤶🏾 E3.0 Mrs. Claus: medium-dark skin tone
1F936 1F3FF ; fully-qualified # 🤶🏿 E3.0 Mrs. Claus: dark skin tone
1F9D1 200D 1F384 ; fully-qualified # 🧑‍🎄 E13.0 mx claus
1F9D1 1F3FB 200D 1F384 ; fully-qualified # 🧑🏻‍🎄 E13.0 mx claus: light skin tone
1F9D1 1F3FC 200D 1F384 ; fully-qualified # 🧑🏼‍🎄 E13.0 mx claus: medium-light skin tone
1F9D1 1F3FD 200D 1F384 ; fully-qualified # 🧑🏽‍🎄 E13.0 mx claus: medium skin tone
1F9D1 1F3FE 200D 1F384 ; fully-qualified # 🧑🏾‍🎄 E13.0 mx claus: medium-dark skin tone
1F9D1 1F3FF 200D 1F384 ; fully-qualified # 🧑🏿‍🎄 E13.0 mx claus: dark skin tone
1F9D1 200D 1F384 ; fully-qualified # 🧑‍🎄 E13.0 Mx claus
1F9D1 1F3FB 200D 1F384 ; fully-qualified # 🧑🏻‍🎄 E13.0 Mx claus: light skin tone
1F9D1 1F3FC 200D 1F384 ; fully-qualified # 🧑🏼‍🎄 E13.0 Mx claus: medium-light skin tone
1F9D1 1F3FD 200D 1F384 ; fully-qualified # 🧑🏽‍🎄 E13.0 Mx claus: medium skin tone
1F9D1 1F3FE 200D 1F384 ; fully-qualified # 🧑🏾‍🎄 E13.0 Mx claus: medium-dark skin tone
1F9D1 1F3FF 200D 1F384 ; fully-qualified # 🧑🏿‍🎄 E13.0 Mx claus: dark skin tone
1F9B8 ; fully-qualified # 🦸 E11.0 superhero
1F9B8 1F3FB ; fully-qualified # 🦸🏻 E11.0 superhero: light skin tone
1F9B8 1F3FC ; fully-qualified # 🦸🏼 E11.0 superhero: medium-light skin tone
@@ -3721,6 +3721,11 @@
1F41A ; fully-qualified # 🐚 E0.6 spiral shell
1FAB8 ; fully-qualified # 🪸 E14.0 coral
1FABC ; fully-qualified # 🪼 E15.0 jellyfish
1F980 ; fully-qualified # 🦀 E1.0 crab
1F99E ; fully-qualified # 🦞 E11.0 lobster
1F990 ; fully-qualified # 🦐 E3.0 shrimp
1F991 ; fully-qualified # 🦑 E3.0 squid
1F9AA ; fully-qualified # 🦪 E12.0 oyster

# subgroup: animal-bug
1F40C ; fully-qualified # 🐌 E0.6 snail
@@ -3777,8 +3782,8 @@
1F344 ; fully-qualified # 🍄 E0.6 mushroom
1FABE ; fully-qualified # 🪾 E16.0 leafless tree

# Animals & Nature subtotal: 161
# Animals & Nature subtotal: 161 w/o modifiers
# Animals & Nature subtotal: 166
# Animals & Nature subtotal: 166 w/o modifiers

# group: Food & Drink

@@ -3881,13 +3886,6 @@
1F960 ; fully-qualified # 🥠 E5.0 fortune cookie
1F961 ; fully-qualified # 🥡 E5.0 takeout box

# subgroup: food-marine
1F980 ; fully-qualified # 🦀 E1.0 crab
1F99E ; fully-qualified # 🦞 E11.0 lobster
1F990 ; fully-qualified # 🦐 E3.0 shrimp
1F991 ; fully-qualified # 🦑 E3.0 squid
1F9AA ; fully-qualified # 🦪 E12.0 oyster

# subgroup: food-sweet
1F366 ; fully-qualified # 🍦 E0.6 soft ice cream
1F367 ; fully-qualified # 🍧 E0.6 shaved ice
@@ -3936,8 +3934,8 @@
1FAD9 ; fully-qualified # 🫙 E14.0 jar
1F3FA ; fully-qualified # 🏺 E1.0 amphora

# Food & Drink subtotal: 138
# Food & Drink subtotal: 138 w/o modifiers
# Food & Drink subtotal: 133
# Food & Drink subtotal: 133 w/o modifiers

# group: Travel & Places
