CLDR-16249 Describe EBNF syntax more clearly #3322

macchiati · 2023-10-05T09:26:33Z

This PR completes the ticket.

ALLOW_MANY_COMMITS=true

pedberg-icu · 2023-10-05T15:58:39Z

Going ahead and merging this...

gibson042 · 2023-10-05T16:18:59Z

docs/ldml/tr35.md

+The BNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are:

+1. Bounded repetition following Perl regex syntax is allowed, such as alphanum{3,8}
+2. Constraints (well-formedness or validity) use separate notes


@macchiati this list misses some other dialect differences that I see in the document.

Suggested change

The BNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are:

1. Bounded repetition following Perl regex syntax is allowed, such as `alphanum{3,8}`

2. Whitespace inside bracketed enumerations and ranges is ignored (e.g., `[A-Z a-z]` is the same as `[A-Za-z]`)

3. A backslash may be used to escape a following "x"-prefixed hexadecimal code point (e.g., `\x20` is the same as `#x20`) or the immediately following non-alphanumeric character (e.g., `[\&\-]` is the same as `[#x26#x2D]`)

4. Constraints (well-formedness or validity) use separate notes

(backslash escaping appears in the Unicode Sets grammar, which could and probably should be expressed without it)
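To make points 2 and 3 of the suggestion concrete, here is a minimal Python sketch that rewrites an expression from the LDML dialect into plain W3C notation. The function name `to_w3c_bracket` is hypothetical, and the sketch assumes two things the suggestion does not spell out: the hex digits after `\x` are consumed greedily, and the input is a single bracketed expression or a lone escape.

```python
import re

def to_w3c_bracket(expr: str) -> str:
    """Rewrite an LDML-dialect bracket expression in plain W3C XML notation.

    Assumptions (not from tr35.md): hex digits after "\\x" are consumed
    greedily, and the input is one bracketed expression or a lone escape.
    """
    out = []
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == "\\" and i + 1 < len(expr):
            # "\xHH" escape: same meaning as "#xHH".
            m = re.match(r"x([0-9A-Fa-f]+)", expr[i + 1:])
            if m:
                out.append("#x" + m.group(1).upper())
                i += 1 + m.end()
                continue
            nxt = expr[i + 1]
            # Backslash before a non-alphanumeric character escapes it.
            if not nxt.isalnum():
                out.append("#x%X" % ord(nxt))
                i += 2
                continue
            raise ValueError(f"unsupported escape: \\{nxt}")
        if ch.isspace():
            # Whitespace inside the brackets is ignored.
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)

print(to_w3c_bracket("[A-Z a-z]"))
print(to_w3c_bracket(r"[\&\-]"))
```

Escapes such as `\u`, `\U`, and `\N` are rejected with a `ValueError`, in line with the suggestion allowing only `\x` and escaped non-alphanumeric characters.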

pedberg-icu · 2023-10-31

I filed https://unicode-org.atlassian.net/browse/CLDR-17210 to pick up the suggested additions.

gibson042 · 2023-10-31

Is there a process by which I can submit this kind of fix? Looking over CONTRIBUTING.md, docs/ is not mentioned in "Areas where contributions are welcome".

macchiati · 2024-04-03

Yes, these can be submitted in the docs/ directory. And thanks for your reviews!

gibson042 · 2024-04-03

https://unicode-org.atlassian.net/browse/CLDR-17498, #3609

gibson042 · 2024-04-03

> (backslash escaping appears in the Unicode Sets grammar, which could and probably should be expressed without it)

@macchiati To elaborate on this point: UnicodeSet syntax includes expressions like `s [A-Za-z0-9] [A-Za-z0-9_\x20]* s`, in which `\x20` is to be interpreted as matching U+0020 SPACE. That table also includes `[[\u0000-\U00010FFFF]-[uxUN]]`, in which `\u0000` is to be interpreted as matching U+0000, `\U00010FFFF` is to be interpreted as matching U+10FFFF, `[\u0000-\U00010FFFF]` is to be interpreted as matching any code point in the inclusive range between those two (i.e., any code point), and the whole expression is to be interpreted as matching any code point other than "u", "x", "U", or "N" (a clever use of UnicodeSets in defining UnicodeSets, but one that does not conform with the defined BNF syntax). The latter expression should be replaced with something like `[^uxUN]`, leaving only `\x` escapes as described in point 3 of my above suggestion (which I have just modified to disallow `\u`, `\U`, and `\N` in anticipation of this issue recurring).

P.S. There's also a typo in that table: `pValuePerl` should use `[^ \\ \}]` rather than `[^\}]`.
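The claimed equivalence between the set-difference expression and `[^uxUN]` can be checked by brute force. The sketch below is illustrative only: it uses Python's `re` module as a stand-in for the BNF dialect (an assumption, since the two notations are not the same), and the function name `same_membership` is hypothetical.

```python
import re

MAX_CODE_POINT = 0x10FFFF          # last Unicode code point
EXCLUDED = set("uxUN")             # characters removed by the set difference
NEGATED = re.compile(r"[^uxUN]")   # proposed replacement, as a Python regex

def same_membership() -> bool:
    """Compare "all code points minus [uxUN]" with [^uxUN], one code point at a time."""
    return all(
        (chr(cp) not in EXCLUDED) == bool(NEGATED.fullmatch(chr(cp)))
        for cp in range(MAX_CODE_POINT + 1)
    )

print(same_membership())
```

The loop visits every code point from U+0000 through U+10FFFF, so it takes a moment to run; it compares membership directly rather than building two large sets.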

srl295 · 2023-10-12T23:08:49Z

@gibson042's comment was posted after the merge; I'll add a note on the ticket.

macchiati · 2024-04-03T19:07:21Z

Could you capture this in the ticket?


CLDR-16249 Describe EBNF syntax more clearly

b682597

macchiati requested review from btangmu and pedberg-icu October 5, 2023 09:26

pedberg-icu approved these changes Oct 5, 2023

pedberg-icu merged commit 2e03d3c into unicode-org:main Oct 5, 2023
4 checks passed

gibson042 reviewed Oct 5, 2023
