From 49a17d9c62efd54ec114f6e4777964230d8fa8b4 Mon Sep 17 00:00:00 2001 From: Markus Scherer Date: Wed, 21 Aug 2024 15:14:50 -0700 Subject: [PATCH] Unicode 16 NamesList.html 20240821 --- unicodetools/data/ucd/dev/NamesList.html | 52 ++++++++++++++++-------- 1 file changed, 35 insertions(+), 17 deletions(-) diff --git a/unicodetools/data/ucd/dev/NamesList.html b/unicodetools/data/ucd/dev/NamesList.html index 7f015c9d8..a67236e89 100644 --- a/unicodetools/data/ucd/dev/NamesList.html +++ b/unicodetools/data/ucd/dev/NamesList.html @@ -108,7 +108,7 @@

Unicode® NamesList File Format

Date - 2024-08-19 + 2024-08-21 This Version @@ -159,8 +159,8 @@

1.0 Introduction

draft versions of the NamesList.txt file. The support for UTF-8 encoded files and the syntax for the UTF-8 charset declaration in a comment at the head of the file were introduced after Unicode 6.1.0 was published, as was the syntax for the specification of variation sequences and alternate glyphs and their respective summaries. The repertoire restriction -in comments and aliases in the names list format was loosened from the prior -limitation to U+0020..U+00FF, to include the wider range U+0020..U+02FF, as of Unicode 11.0.

+in comments and aliases in the names list format was loosened from the earlier +limitation to U+0020..U+00FF, to include the wider range U+0020..U+02FF, as of Unicode 11.0, and dropped entirely as of Unicode 16.0.0.

The same input file can be used for the preparation of drafts and final editions for ISO/IEC 10646. Earlier versions of that standard used a different style, referred to below as ISO-style. That style necessitated the presence of some @@ -281,10 +281,18 @@

2.0 NamesList File Structure charset declaration (see below). Alternatively, or in addition, a BOM may be present at the very beginning of the file, forcing the encoding to be interpreted as UTF-16 (little-endian only) or UTF-8. When - declared as UTF-8, the names list format will support use of characters in - the range U+0020..U+02FF in LINE and LABEL elements. Otherwise, + declared as UTF-8, the names list format will support use of any Unicode characters in + STRING and LABEL elements. Otherwise, the supported repertoire is limited to Latin-1, and attempted use of characters outside the Latin-1 range will result in data corruption.

+

The NamesList file format does not support styled text; each line or other element + will usually be displayed in a specific font selected for it. To allow CHAR elements + that normally use chart glyphs to better coexist with running text in LABEL and STRING + elements, a user-defined limit can be set, below which the normal selection of (chart) glyphs + for the CHAR element is overridden in favor of equivalent glyphs from a font selected for better + readability in running text. Any running text outside that range will use standard chart + glyphs, which may result in a ransom note effect. For production of the Unicode Standard + Version 16.0.0 and later, the limit is set to U+1EFF.

Several of these elements, while part of the formal definition of the file format, do not occur in final published versions of NamesList.txt in the UCD.

@@ -514,14 +522,14 @@

2.1 NamesList File Elements
 Because a LINE or an EXPAND_LINE can itself start with a special character followed by a SP or LF, an "unmarked" COMMENT_LINE should match the input in lower priority than line types that require a special character or have a more restrictive set of characters than EXPAND_LINE.
- Similarly, a SUBHEADER containing TAB "!" LF should match with a higher priority than those
+ Similarly, a SUBHEADER containing TAB "!" LF should match with a higher priority than one
 where the TAB is followed by a LINE.

2.2 NamesList File Primitives

-

The following are the primitives and terminals for the NamesList syntax.

+

The following are the primitives and terminals for the NamesList syntax. "Limit" is a user-defined value; see discussion of the implications of Limit in the notes below.

LINE:		STRING LF
 COMMENT:	"(" LABEL ")"
@@ -533,8 +541,8 @@ 

2.2 NamesList File Primitives
 TAG:		<sequence of ASCII letters>
 LCTAG:		<sequence of lowercase ASCII letters>
-STRING:		<sequence of characters in the range U+0020..U+02FF, except controls>
-LABEL:		<sequence of characters in the range U+0020..U+02FF, except controls, "(" or ")">
+STRING:		<sequence of characters, except controls>
+LABEL:		<sequence of characters, except controls, "(" or ")">
 VARSEL:		CHAR | "ALT" ( "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" )
 VARSEL_LIST:	"{" CHAR_LIST "}"
@@ -580,19 +588,27 @@

2.2 NamesList File Primitives
 of following characters.
  • The hyphen in a character range CHAR-CHAR is replaced by an EN DASH on output.
  • -
  • In a STRING or LABEL, a Unicode character outside the range - U+0000..U+02FF is displayed as is, with a glyph matching - the chart font, and not with the font that is otherwise defined for that element.
  • The NamesList.txt file is encoded in UTF-8 if the first line is a FILE_COMMENT containing the declaration "UTF-8" or any casemap variation thereof. Otherwise the file is encoded in Latin-1 (older versions). Beyond detecting the charset declaration (typically: "; charset=utf-8") the remainder of that comment is ignored. - If the file is not encoded as - UTF-8, the character repertoire for running text (anything - other than CHAR) is effectively restricted to the repertoire of Latin-1. - Otherwise, characters in the range U+0020..U+02FF - are allowed in STRING or LABEL elements, and elements derived from them.
  • + When declared as UTF-8, the NamesList format will support any Unicode character + in STRING or LABEL elements, but see further implications below. +
  • In a STRING or LABEL element, a Unicode character outside the range + U+0020..Limit is displayed with a glyph matching + the chart font, and not with the font that is otherwise defined for that element. + The Limit value is user defined. + For production of the Unicode Standard from Version 16.0.0 and later the Limit + value is set to U+1EFF. + All code points less than the Limit value can be mapped onto a font selected for best + results in running text. However, any CHAR elements contained in an EXPAND_LINE + are exempt from this and are always displayed with a glyph matching the chart font. + The net effect is a workaround for the fact that the NamesList format does + not support style runs within any element that encompasses a single unit of flowed text.
  • +
  • When drafting STRING or LABEL elements, one should note that text containing + characters outside the range U+0020..Limit may result in a ransom note effect, + as the regular text font and the chart font would alternate. This is best avoided.
  • The code chart layout program (Unibook) can accept files in several other formats. These include little-endian UTF-16, @@ -613,6 +629,8 @@
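
The charset declaration and the Limit rule described in the notes above can be illustrated with a minimal Python sketch. It is not part of the NamesList specification or of Unibook. The function names (`declares_utf8`, `font_runs`, `has_ransom_note_effect`) and the font labels `"text"` and `"chart"` are hypothetical, and the sketch assumes that a FILE_COMMENT line begins with ";" and that the range U+0020..Limit is inclusive at both ends.

```python
from itertools import groupby

# Limit value the notes give for production of Unicode 16.0.0 and later.
LIMIT = 0x1EFF


def declares_utf8(first_line: str) -> bool:
    """Return True if the first line declares UTF-8 in any case variation.

    Assumption: a FILE_COMMENT line starts with ";". The rest of the
    comment is ignored, as the notes describe.
    """
    return first_line.startswith(";") and "utf-8" in first_line.lower()


def font_runs(text: str, limit: int = LIMIT) -> list[tuple[str, str]]:
    """Split running text into (font, substring) runs.

    Characters in U+0020..limit (assumed inclusive) get the hypothetical
    "text" font; everything else falls back to the "chart" font.
    """
    def font(ch: str) -> str:
        return "text" if 0x20 <= ord(ch) <= limit else "chart"

    return [(name, "".join(chars)) for name, chars in groupby(text, key=font)]


def has_ransom_note_effect(text: str, limit: int = LIMIT) -> bool:
    """Return True if the text would mix the text font and the chart font."""
    return len({name for name, _ in font_runs(text, limit)}) > 1
```

For instance, `font_runs("a\u2192b")` yields three runs because U+2192 lies above U+1EFF, which is the kind of font alternation the drafting note advises against.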

    Modifications

    Version 16.0.0

    • Reissued for Unicode 16.0.0
    • +
    • Reflect the wider range of possible values for the user-defined Limit.
    • +
    • Added an explanation of the effect of the Limit value.

    Version 15.1.0