<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta charset="utf-8">
<title>Language Tags and Locale Identifiers for the World Wide Web</title>
<script src="https://www.w3.org/Tools/respec/respec-w3c" async="" class="remove"></script>
<script class="remove">
var respecConfig = {
specStatus: "ED",
edDraftURI: "https://w3c.github.io/ltli/",
shortName: "ltli",
editors: [
{ name: "Addison Phillips", mailto: "[email protected]",
company: "Invited Expert", w3cid: 33573 },
{ name: "Felix Sasaki",
company: "Invited Expert"
}
],
// if you wish the publication date to be other than today, set this
//publishDate: "2015-04-23",
//previousMaturity: "WD",
//previousPublishDate: "2015-04-23",
noRecTrack: true,
group: "i18n",
github: "w3c/ltli",
xref: ["i18n-glossary"],
localBiblio: {
"CLDR": {
title: "Common Locale Data Repository",
href: "http://cldr.unicode.org",
publisher: "Unicode"
},
"LDML": {
title: "Unicode Technical Standard #35: Locale Data Markup Language",
href: "https://www.unicode.org/reports/tr35/",
publisher: "Unicode",
authors: [
"Mark Davis",
"CLDR Contributors"
]}
}
};
</script>
<link rel="stylesheet" href="local.css" />
</head>
<body>
<section id="abstract">
<p>This document provides definitions and best practices related to the identification of the natural language of content in document formats, specifications, and implementations on the Web. It describes how language tags are used to indicate a user's locale preferences which, in turn, are used to process, format, and display information to the user.</p>
</section>
<section id="sotd">
<p>This is an updated Public Working Draft of "Language Tags and Locale
Identifiers for the World Wide Web". The Working Group expects this to
become a Working Group Note.</p>
<p class="note">If you wish to make comments regarding this document, please <a href="https://github.com/w3c/ltli/issues"
style="font-size: 120%;">raise a GitHub issue</a>. You may also send
email to the list <a href="mailto:[email protected]?subject=%5Bltli%5D%20">[email protected]</a>
(<a href="mailto:[email protected]?subject=subscribe">subscribe</a>,
<a href="https://lists.w3.org/Archives/Public/www-international/">archives</a>)
as mentioned below. Please include <q>[ltli]</q> at the start of your
email's subject. To make it easier to track comments, please raise
separate issues or send separate emails for each comment. All comments
are welcome.</p>
</section>
<section id="introduction">
<h2>Introduction</h2>
<p>Language tags and locales are some of the fundamental building blocks of <a>internationalization</a> (<q><a>i18n</a></q>) of the Web. In this document you will find definitions for much of the basic terminology related to this aspect of <a>internationalization</a>.</p>
<p>This document also provides the terminology and best practices, recommended by the Internationalization (I18N) Working Group, that specification authors need for the identification of <a>natural language</a> values in document formats or protocols. These (and many other) best practices, along with links to supporting materials, can also be found in the <cite>Internationalization Best Practices for Spec Developers</cite> [[INTERNATIONAL-SPECS]]. Further best practices relating to language metadata on the Web can be found in [[STRING-META]].</p>
</section>
<section id="conventions">
<h3>Document Conventions</h3>
<p>In this document [[RFC2119]] keywords in uppercase italics have their usual meaning. We also use these stylistic conventions:</p>
<p class="definition-example"><strong>Definitions</strong> appear with a different background color and decoration like this.</p>
<p class="advisement"><strong>Best practices</strong> appear with a different background color and decoration like this.</p>
<p class="issue-example" id="issue-example"><strong>Recommendations</strong> for future work appear with a different background color and decoration like this.</p>
</section>
<section id="language-terminology">
<h3>Languages and Language Tags</h3>
<p>Tags for identifying the <a>natural language</a> of content or the <a>international preferences</a> of users are one of the fundamental building blocks of the Web. The <a>language tags</a> found in Web and Internet formats and protocols are defined by [[BCP47]]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.</p>
<p>Many of the core standards for the Web include support for <a>language tags</a>; these include the <code>xml:lang</code> attribute in [[XML10]], the <code>lang</code> and <code>hreflang</code> attributes in [[HTML]], the <code>language</code> property in [[XSL10]], the <code>:lang</code> pseudo-class in CSS [[CSS3-SELECTORS]], and many others, including SVG, TTML, and SSML.</p>
<p class="definition"><dfn data-lt="natural language|language">Natural Language</dfn> (or, in this document, just <em>language</em>). The spoken, written, or signed communications used by human beings.</p>
<p>There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [[BCP47]]. "BCP" nomenclature refers to the current set of IETF RFCs that form the "best current practice".</p>
<p class="definition"><dfn data-lt="language tag|language tags">Language tag</dfn>. A string used as an identifier for a language. In this document, the term <em>language tag</em> always refers explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.</p>
<p class="advisement" id="ltli-bcp47-refer"><a class="self" href="#ltli-bcp47-refer">​</a>Specifications for the Web that require language identification MUST refer to [[BCP47]]. </p>
<p class="advisement" id="ltli-no-rfc-refs"><a href="#ltli-no-rfc-refs" class="self">​</a>Specifications SHOULD NOT refer to specific component RFCs of [[BCP47]].</p>
<p>[[BCP47]] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called <em>Tags for Identifying Languages</em> [[RFC5646]], defines the grammar, form, and terminology of language tags. The second part, called <em>Matching of Language Tags</em> [[RFC4647]], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.</p>
<p class="advisement" id="ltli-successor-ref"><a href="#ltli-successor-ref" class="self">​</a>Formulations such as "<span class="quote">RFC 5646 or its successor</span>" MAY be used, but only in cases where the specific document version is necessary.</p>
<p>While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [[RFC4646]], referring to the BCP will not incur additional compliance risk to most implementations.</p>
<p class="advisement" id="ltli-no-obsolete-refs"><a href="#ltli-no-obsolete-refs" class="self">​</a>Specifications MUST NOT reference obsolete versions of [[BCP47]], such as [[RFC1766]] or [[RFC3066]].</p>
<p class="advisement" id="ltli-obs-language-tag-ref"><a href="#ltli-obs-language-tag-ref" class="self">​</a>Specifications that need to preserve compatibility with obsolete versions of [[BCP47]] MUST reference the production <code>obs-language-tag</code> in [[BCP47]].</p>
<p>Beginning with [[RFC4646]], [[BCP47]] defined a more complex, machine-readable syntax for language tags. This syntax is stable and is not expected to change in the foreseeable future. Some specifications might desire or require compatibility with the older language tag grammar found in previous versions of BCP47 (specifically [[RFC1766]] and [[RFC3066]]). This grammar was more permissive and is described in [[BCP47]] as the ABNF production <code>obs-language-tag</code>. [[RFC4646]], which introduced the current grammar for language tags, was replaced by [[RFC5646]] as part of the current [[BCP47]].</p>
<p class="advisement" id="ltli-language-information-in-uris-req"><a href="#ltli-language-information-in-uris-req" class="self">​</a>Applications that provide language information as part of URIs (e.g. in the realm of RDF) SHOULD use [[BCP47]].</p>
<p>Currently, URIs expressing language information often use values from parts of ISO 639. This leads to ambiguity about what the proper value should be: for German, for example, either <code>de</code> from ISO 639-1 or <code>ger</code> from ISO 639-2. Using BCP 47 and its language subtag registry avoids such ambiguities: for German, the registry contains only <code>de</code>.</p>
<p class="definition"><dfn data-lt="subtag|subtags">Subtag</dfn>. A sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall <a>language tag</a>. In [[BCP47]], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).</p>
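<p>The subtag definition above can be illustrated with a minimal sketch. It checks only the general shape described here (hyphen-separated runs of one to eight ASCII letters or digits) and is deliberately coarser than the full [[BCP47]] grammar. The helper names <code>splitSubtags</code> and <code>looksLikeSubtags</code> are hypothetical and are not part of any specification.</p>

```javascript
// Coarse structural check based only on the subtag definition above:
// hyphen-separated subtags of 1 to 8 ASCII letters or digits.
// This is NOT the full BCP 47 grammar; both helper names are hypothetical.
const SUBTAG = /^[A-Za-z0-9]{1,8}$/;

function splitSubtags(tag) {
  // Case carries no distinction, so compare subtags in lowercase.
  return tag.toLowerCase().split("-");
}

function looksLikeSubtags(tag) {
  return splitSubtags(tag).every((subtag) => SUBTAG.test(subtag));
}

console.log(splitSubtags("zh-Hant-TW"));          // [ 'zh', 'hant', 'tw' ]
console.log(looksLikeSubtags("de-Latn-DE-1996")); // true
console.log(looksLikeSubtags("en--US"));          // false (empty subtag)
console.log(looksLikeSubtags("en-abcdefghi"));    // false (nine characters)
```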
<p>Selecting content or behavior based on the language tag requires a few additional concepts defined by [[BCP47]] (in [[RFC4647]]). In this document, we adopt the following terminology taken directly from [[BCP47]]:</p>
<p class="definition"><dfn data-lt="iana language subtag registry|subtag registry|registry|lstr">IANA Language Subtag Registry</dfn>. A machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: <a href="https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry">Registry</a>)</p>
<p class="advisement" id="ltli-no-subsidiary-stds"><a href="#ltli-no-subsidiary-stds" class="self">​</a>Specifications SHOULD NOT reference [[BCP47]]'s underlying standards that contribute to the <a>IANA Language Subtag Registry</a>, such as ISO639, ISO15924, ISO3166, or UN M.49.</p>
<p>Some standards might directly consume one of [[BCP47]]'s contributory standards, in which case a reference is wholly appropriate. However, in most cases, the purpose of the reference is to specify a valid list of codes and their meanings. [[BCP47]]'s <a>subtag registry</a> is stabilized and resolves ambiguity in a number of useful ways and so should be the preferred source for this type of reference.</p>
<p>[[BCP47]] defines two different levels of conformance. See <a href="https://tools.ietf.org/html/bcp47#section-2.2.9">classes of conformance</a> in [[BCP47]] for specifics. For language tags, the levels of conformance correspond to the type of checking that an implementation applies to language tag values.</p>
<p class="definition"><dfn data-lt="well-formed|well-formed language tag|well-formed language tags">Well-formed language tag</dfn>. A language tag that follows the grammar defined in [[BCP47]]. That is, it is structurally correct, consisting of ASCII letters and digit <a>subtags</a> of the prescribed length, separated by hyphens.</p>
<p class="definition"><dfn data-lt="valid|valid language tag|valid language tags">Valid language tag</dfn>. A language tag that is <a>well-formed</a> and which also conforms to the additional <a href="https://tools.ietf.org/html/bcp47#section-2.2.9">conformance requirements</a> in [[BCP47]], notably that each of the subtags appears in the IANA Language Subtag Registry.</p>
<p class="advisement" id="ltli-well-formed-req"><a href="#ltli-well-formed-req" class="self">​</a>Specifications SHOULD require that language tags be <a>well-formed</a>.</p>
<p class="advisement" id="ltli-valid-req"><a href="#ltli-valid-req" class="self">​</a>Specifications MAY require that language tags be <a>valid</a>.</p>
<p class="advisement" id="ltli-valid-content-req"><a href="#ltli-valid-content-req" class="self">​</a>Specifications SHOULD require that content authors use <a>valid language tags</a>.</p>
<p>Note that this is stricter than what is recommended for implementations.</p>
<p class="advisement" id="ltli-validator-req"><a href="#ltli-validator-req" class="self">​</a>Content validators SHOULD check if content uses <a>valid language tags</a> where feasible.</p>
<p>Checking if a tag is <a>valid</a> requires access to or a copy of the <a>registry</a> plus additional runtime logic. While content authors are advised to choose, generate, and exchange only valid values, language tag matching and other common language tag operations are designed so that validity checking is not needed. Features or functions that need to understand the specific semantic content of subtags are the main reason that a specification would normatively require <a>valid</a> tags as part of the protocol or document format.</p>
<p class="definition"><dfn data-lt="language tag extension|extension|extensions|registered extension">Language tag extension</dfn> or <em>extension</em>. A system of additional [[BCP47]] subtags introduced by a single letter or digit subtag registered with IANA and permitting additional types of language identification.</p>
<p class="advisement" id="ltli-bcp47-extension-ref"><a href="#ltli-bcp47-extension-ref" class="self">​</a>Specifications MAY reference registered extensions to [[BCP47]] as necessary.</p>
<p>In particular, [[RFC6067]] defines the <cite>BCP 47 Extension U</cite>, also known as "Unicode Locales". This extension to [[BCP47]] provides additional subtag sequences for selecting specific locale variations.</p>
<p class="advisement" id="ltli-keep-extensions"><a class="self" href="#ltli-keep-extensions">​</a>Specifications SHOULD NOT restrict the length of language tags or permit or encourage the removal of extensions.</p>
<p class="definition"><dfn data-lt="language range|range|language-range|language ranges">Language range</dfn>. A string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".</p>
<p class="definition"><dfn>Language priority list</dfn>. A collection of one or more <a>language ranges</a> identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [[RFC2616]] <code>Accept-Language</code> [[RFC3282]] header is an example of one kind of language priority list.</p>
<p class="definition"><dfn data-lt="basic language range">Basic language range</dfn>. A <a>language range</a> consisting of a sequence of subtags separated by hyphens. That is, it is identical in appearance to a language tag.</p>
<p class="definition"><dfn>Extended language range</dfn>. A <a>language range</a> consisting of a sequence of hyphen-separated subtags. In an extended language range, a subtag can either be a valid subtag or the wildcard subtag <q><code>*</code></q>, which matches any value.</p>
<aside class="example">
<p>Basic versus extended language range and language priority list</p>
<p>The string <code>de-de</code> is a basic language range. It matches, for example, the language tag <code>de-DE-1996</code>, but not the language tag <code>de-Deva</code>.</p>
<p>The string <code>de-*-DE</code> is an extended language range. It matches all of the following tags:</p>
<ul>
<li>
<p><code>de-DE</code></p>
</li>
<li>
<p><code>de-DE-x-goethe</code></p>
</li>
<li>
<p><code>de-Latn-DE-1996</code></p>
</li>
</ul>
<p><code>"en; fr; zh-Hant"</code> is a language priority list. It would be read as "<span class="quote">English before French before Chinese as written in the Traditional script</span>". Note that the syntax shown is only an example, since it depends on the protocol, application, or implementation that uses the list.</p>
</aside>
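<p>The matches in the example above can be reproduced with a short sketch of the two filtering schemes from [[RFC4647]]. The function names <code>basicFilter</code> and <code>extendedFilter</code> are hypothetical, and the code is an illustration written under that assumption rather than a normative implementation.</p>

```javascript
// Sketch of the two filtering schemes described in RFC 4647.
// Function names are hypothetical; comparison is case-insensitive.

function basicFilter(range, tag) {
  const r = range.toLowerCase();
  const t = tag.toLowerCase();
  // A basic range matches a tag that is equal to it, or that starts with
  // the range immediately followed by a hyphen. "*" matches everything.
  return r === "*" || t === r || t.startsWith(r + "-");
}

function extendedFilter(range, tag) {
  const r = range.toLowerCase().split("-");
  const t = tag.toLowerCase().split("-");
  // The first subtags must match, unless the range starts with "*".
  if (r[0] !== "*" ? r[0] !== t[0] : false) return false;
  let ri = 1;
  let ti = 1;
  while (ri !== r.length) {
    if (r[ri] === "*") {
      ri += 1;        // wildcard in the range: skip it
    } else if (ti === t.length) {
      return false;   // the tag ran out of subtags
    } else if (r[ri] === t[ti]) {
      ri += 1;        // match: advance both positions
      ti += 1;
    } else if (t[ti].length === 1) {
      return false;   // a singleton (such as "x") cannot be skipped
    } else {
      ti += 1;        // skip a non-matching subtag in the tag
    }
  }
  return true;
}

console.log(basicFilter("de-de", "de-DE-1996"));         // true
console.log(basicFilter("de-de", "de-Deva"));            // false
console.log(extendedFilter("de-*-DE", "de-Latn-DE-1996")); // true
console.log(extendedFilter("de-*-DE", "de-x-DE"));       // false
```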
<p>Some <a>language priority lists</a>, such as the <code>Accept-Language</code> [[RFC3282]] header mentioned earlier, provide "weights" for values appearing in the list. Such weighting cannot be depended on for anything other than ordering the list.</p>
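<p>As a rough illustration of using weights only for ordering, the following sketch turns an <code>Accept-Language</code>-style value into an ordered list of language ranges. The function name <code>parsePriorityList</code> is hypothetical, and the parsing is simplified: it assumes well-behaved input and drops any entry whose weight is zero or unreadable.</p>

```javascript
// Sketch: turn an Accept-Language style value into an ordered language
// priority list. parsePriorityList is a hypothetical name, and the parsing
// is simplified (no validation of the ranges themselves).
function parsePriorityList(headerValue) {
  return headerValue
    .split(",")
    .map((item, index) => {
      const [range, ...params] = item.trim().split(";");
      const q = params
        .map((p) => p.trim())
        .find((p) => p.startsWith("q="));
      // A missing weight is treated as 1, the highest preference.
      const weight = q === undefined ? 1 : Number(q.slice(2));
      return { range: range.trim(), weight, index };
    })
    // Drop empty entries and entries with a zero or unreadable weight.
    .filter((entry) => entry.range !== "" ? entry.weight > 0 : false)
    // Use the weight only to order the list; ties keep their original order.
    .sort((a, b) => b.weight - a.weight || a.index - b.index)
    .map((entry) => entry.range);
}

console.log(parsePriorityList("da, en-gb;q=0.8, en;q=0.7"));
// [ 'da', 'en-gb', 'en' ]
console.log(parsePriorityList("fr;q=0.5, zh-Hant, en;q=0.9"));
// [ 'zh-Hant', 'en', 'fr' ]
```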
<p class="advisement" id="ltli-range-type-req"><a href="#ltli-range-type-req" class="self">​</a>Specifications that define language tag matching or <a>language negotiation</a> MUST specify whether language ranges used are a <a>basic language range</a> or an <a>extended language range</a>.</p>
<p class="advisement" id="ltli-matching-result-ref"><a href="#ltli-matching-result-ref" class="self">​</a>Specifications that define language tag matching MUST specify whether the results of a matching operation contain a single result (<em>lookup</em> as defined in [[RFC4647]]), or a possibly-empty (zero or more) set of results (<em>filtering</em> as defined in [[RFC4647]]).</p>
<p class="advisement" id="ltli-matching-type-ref"><a href="#ltli-matching-type-ref" class="self">​</a>Specifications that define language tag matching MUST specify the matching algorithms available and the selection mechanism.</p>
<p>For example, JavaScript internationalization [[ECMA-402]] and [[CLDR]] provide a "best fit" algorithm which can be tailored by implementers.</p>
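<p>The difference between <em>lookup</em> and <em>filtering</em> can be sketched as follows. The function names <code>lookup</code> and <code>filterTags</code> and their parameters are hypothetical. The lookup sketch follows the progressive truncation described in [[RFC4647]], and the filtering sketch uses basic language ranges and, as a simplification, returns results in the order of the available tags.</p>

```javascript
// Sketch of RFC 4647 "lookup" (exactly one result, possibly a default) and
// "filtering" (zero or more results). All names here are hypothetical.
function lookup(priorityList, availableTags, defaultTag) {
  const byLowercase = new Map(
    availableTags.map((tag) => [tag.toLowerCase(), tag])
  );
  for (const range of priorityList) {
    if (range === "*") continue; // the wildcard range is skipped in lookup
    const subtags = range.toLowerCase().split("-");
    while (subtags.length !== 0) {
      const candidate = subtags.join("-");
      if (byLowercase.has(candidate)) return byLowercase.get(candidate);
      subtags.pop(); // truncate the range from the end
      // Also drop a trailing single-character subtag (such as "x" or "u").
      const last = subtags[subtags.length - 1];
      if ((last || "").length === 1) subtags.pop();
    }
  }
  return defaultTag;
}

function filterTags(priorityList, availableTags) {
  return availableTags.filter((tag) =>
    priorityList.some((range) => {
      const r = range.toLowerCase();
      const t = tag.toLowerCase();
      return r === "*" || t === r || t.startsWith(r + "-");
    })
  );
}

console.log(lookup(["de-CH", "fr"], ["de", "fr", "en"], "en")); // de
console.log(lookup(["ja"], ["de", "fr"], "en"));                // en
console.log(filterTags(["de"], ["de-DE", "de-AT", "fr"]));      // [ 'de-DE', 'de-AT' ]
console.log(filterTags(["ja"], ["de-DE", "fr"]));               // []
```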
</section>
<section id="i18n-terminology">
<h3>Locales and Internationalization</h3>
<p><em>This section defines basic terminology related to internationalization and localization.</em></p>
<p>Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.</p>
<p><a>Language tags</a> can also be used to identify <a>international preferences</a> associated with a given piece of content or user because these preferences are linked to the natural language, regional association, or culture of the end user. Such preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing defaults for items such as the presentation of a calendar or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, an identifier for these preferences is usually called a <a>locale</a>. The extensions to [[BCP47]] that define <a>Unicode locales</a> [[CLDR]] provide the basis for <a>internationalization</a> APIs on the Web. Notably, the JavaScript language [[ECMASCRIPT]] uses <a>Unicode locales</a> as the basis for the APIs found in [[ECMA-402]].</p>
<p class="definition"><dfn id="international-preferences" data-lt="international preferences">International Preferences.</dfn> A user's particular set of language and formatting preferences and associated cultural conventions. Software can use these preferences to correctly process or present information exchanged with that user.</p>
<p>Many kinds of <a>international preference</a> may be offered
on the Web in order for content or a service to be considered usable
and acceptable by users around the world. Some of these preferences
might include:</p>
<ul>
<li>Natural language for text processing, such as parsing, spell checking, and
grammar checking;</li>
<li>User interface language, which may include items like images,
colors, sounds, formats, and navigational elements as well as the
visible text strings;</li>
<li>Presentation (human-oriented formatting) of dates, times, numbers,
lists, and other values;</li>
<li>Collation, sorting, and organization of content (such as in a
phone book or a dictionary);</li>
<li>Alternate time-keeping and calendars, which may include holidays,
work rules, weekday/weekend distinctions, the number and
organization of months, the numbering of years, and so forth;</li>
<li>Tax or regulatory regime;</li>
<li>Currency;</li>
</ul>
<p>... and many more.</p>
<p class="definition"><dfn id="internationalization" data-lt="internationalization|I18N|internationalized">Internationalization</dfn>. The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated <code>i18n</code> because there are eighteen letters between the "I" and the "N" in the English word.</p>
<p class="definition"><dfn data-lt="localization|localized|L10N">Localization</dfn>. The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as <code>l10n</code> because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be <em>localized</em>.</p>
<p class="definition"><dfn id="locale" data-lt="locale|locales">Locale</dfn>. An identifier (such as a <a>language tag</a>) for a set of <a>international preferences</a>. Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.</p>
<p class="definition"><dfn data-lt="locale aware|locale-aware|enabled|enable">Locale-aware</dfn> (or <em>Enabled</em>). A system that can respond to changes in the <a>locale</a> with culturally and language-specific behavior or content. Generally, systems that are internationalized can support a wide range of <a>locales</a> in order to meet the <a>international preferences</a> of many kinds of users.</p>
<p><a>Language tags</a> can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.</p>
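<p>The German dictionary versus phone book case can be sketched with the <code>Intl.Collator</code> API from [[ECMA-402]], assuming the <code>co</code> (collation) key of the Unicode locale extension mentioned earlier as <cite>BCP 47 Extension U</cite>. The sample words are arbitrary, and the results depend on the collation data shipped with the runtime.</p>

```javascript
// Sketch: choosing between German sort orders with the "co" (collation)
// key of the Unicode locale extension. Results depend on the collation
// data shipped with the runtime; a runtime lacking the "phonebk" tailoring
// would fall back to the default German order.
const names = ["Ärger", "Affe"];

// Default (dictionary-style) order: "Ä" sorts together with "A".
const dictionary = [...names].sort(new Intl.Collator("de").compare);

// Phone book order: "Ä" sorts as if it were written "Ae".
const phonebook = [...names].sort(
  new Intl.Collator("de-u-co-phonebk").compare
);

console.log(dictionary); // [ 'Affe', 'Ärger' ]
console.log(phonebook);  // [ 'Ärger', 'Affe' ]
```

<p>Here the language and region alone would not tell the software which ordering the user expects; the extension carries that additional preference.</p>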
<p>Historically, locales were associated with and specific to the programming language or operating environment of the user. These application-specific identifiers often could be inferred from or converted into language tags. Some examples of locale models include Java's <code>java.util.Locale</code>, POSIX (with identifiers such as <code>de_CH@utf8</code>), Oracle databases (<code>AMERICAN_AMERICA.AL32UTF8</code>), or Microsoft's LCIDs (which used numeric codes such as <code>0x0409</code>). The relationship between several of these models, the underlying standards such as ISO639 or ISO3166, and early language tags (such as [[RFC1766]]) was entirely intentional. Implementations often mapped (and continue to map) language tags from an existing protocol, such as HTTP's Accept-Language header, to proprietary or platform-specific locale models.</p>
<p>Since the adoption of the current [[BCP47]] identifier syntax, a number of locale models have adopted BCP47 directly or provided adaptation or mappings between proprietary models and <a>language tags</a>. Notably, the development and adoption of the open-source repository of locale data known as [[CLDR]] has led to wider general adoption of <a>language tags</a> as <a>locale</a> identifiers.</p>
<p class="definition"><dfn data-lt="common locale data repository|CLDR" class="lint-ignore">Common Locale Data Repository</dfn> (or <em>[[CLDR]]</em>). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable <a>locales</a> in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.</p>
<p class="definition"><dfn data-lt="unicode locale|unicode locale identifier|unicode locale identifiers|unicode locales">Unicode Locale Identifier</dfn> or <em>Unicode Locale</em>. A <a>language tag</a> that follows the additional rules and restrictions on subtag choice defined in UTS #35 [[LDML]]. Any valid Unicode locale identifier is also a <a>valid</a> [[BCP47]] <a>language tag</a>, but a few <a>valid language tags</a> are not also valid Unicode locale identifiers.</p>
<p class="definition"><dfn data-lt="canonical Unicode locale identifier|canonical tag|canonical locale">Canonical Unicode locale identifier</dfn>. A <a>well-formed language tag</a> resulting from the application of the <a>Unicode locale identifier</a> canonicalization rules found in [[LDML]] (see <a href="https://www.unicode.org/reports/tr35/#Unicode_locale_identifier">Section 3</a>). This process converts any <a>valid</a> [[BCP47]] <a>language tag</a> into a valid <a>Unicode locale identifier</a>. For example, deprecated subtags or irregular grandfathered tags are replaced with their preferred value from the <a>IANA language subtag registry</a>.</p>
<p>[[CLDR]] defines and maintains two <a>language tag extensions</a> ([[RFC6067]] and [[RFC6497]]) that are related to <a>Unicode locale identifiers</a>. These extensions allow a <a>language tag</a> to express some <a>international preference</a> variations that go beyond linguistic or regional variation or to select formatting behavior or content when there are multiple options or user preferences within a given locale. <a>Unicode locale identifiers</a> are not required to include these extensions: they are only used when the locale being identified requires additional tailoring provided by one of these extensions. [[CLDR]] also applies specific interpretation of certain subtags when used as a locale identifier. See <a href="https://www.unicode.org/reports/tr35/tr35.html#Unicode_locale_identifier">Section 3.2</a> of [[LDML]] for details.</p>
<p>The <strong>Unicode locale <a>language tag extension</a></strong> [[RFC6067]] uses the <code>-u-</code> subtag, and provides subtags for selecting different locale-based formats and behaviors. See <a href="https://www.unicode.org/reports/tr35/tr35.html#Locale_Extension_Key_and_Type_Data">Section 3.6</a> of [[LDML]] for details.</p>
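<p>As an illustration, the <code>Intl.Locale</code> object from [[ECMA-402]] exposes the parts of a <a>Unicode locale identifier</a>, including keywords carried by the <code>-u-</code> extension. The tags used here are arbitrary examples chosen for this sketch.</p>

```javascript
// Sketch: reading the pieces of a Unicode locale identifier with Intl.Locale.
const plain = new Intl.Locale("de-DE");
console.log(plain.language, plain.region); // de DE
console.log(plain.collation);              // undefined (no -u- extension present)

// "nu" selects a numbering system and "ca" selects a calendar.
const tailored = new Intl.Locale("hi-IN-u-nu-deva-ca-indian");
console.log(tailored.numberingSystem);     // deva
console.log(tailored.calendar);            // indian
console.log(tailored.baseName);            // hi-IN
```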
<p>The <strong>transformed content <a>language tag extension</a></strong> [[RFC6497]], which uses the <code>-t-</code> subtag, provides subtags for text transformations, such as transliteration between scripts. See <a href="https://www.unicode.org/reports/tr35/tr35.html#t_Extension">Section 3.7</a> of [[LDML]] for details.</p>
<p>Unicode Locales increasingly form the basis for <a>internationalization</a> on the Web, particularly as part of the <code>Intl</code> locale framework [[ECMA-402]] in JavaScript [[ECMASCRIPT]].</p>
<p class="advisement" id="ltli-canonical-author-req"><a class="self" href="#ltli-canonical-author-req">​</a>Content authors SHOULD choose language tags that are <a>canonical Unicode locale identifiers</a>.</p>
<p>The additional content restrictions and normalization steps found in <a href="https://www.unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers">Section 3</a> of [[LDML]] provide for better interoperability and consistency than that afforded by [[BCP47]] directly.</p>
<p class="advisement" id="ltli-canonical-impl-req"><a class="self" href="#ltli-canonical-impl-req">​</a>Implementations SHOULD only emit language tags that are <a>canonical Unicode locale identifiers</a> and SHOULD normalize language tags that they consume using the rules for producing canonical tags.</p>
<p>As above, the additional content restrictions and normalization steps found in <a href="https://www.unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers">Section 3</a> of [[LDML]] provide for better interoperability and consistency than that afforded by [[BCP47]] directly. This best practice should not be interpreted as meaning that implementations need to support, generate, process, or understand either of [[CLDR]]'s extensions.</p>
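<p>In JavaScript, one way an implementation might normalize the tags it consumes is <code>Intl.getCanonicalLocales</code> from [[ECMA-402]]. The wrapper name <code>canonicalize</code> below is hypothetical. This is a sketch of one possible approach, and the exact replacements applied to deprecated subtags depend on the data available to the runtime.</p>

```javascript
// Sketch: normalizing tags with Intl.getCanonicalLocales (ECMA-402).
// canonicalize is a hypothetical wrapper that returns null instead of throwing.
function canonicalize(tag) {
  try {
    return Intl.getCanonicalLocales(tag)[0];
  } catch (error) {
    if (error instanceof RangeError) return null; // structurally invalid tag
    throw error;
  }
}

console.log(canonicalize("EN-us"));      // en-US
console.log(canonicalize("zh-hant-tw")); // zh-Hant-TW
console.log(canonicalize("en--US"));     // null
// A deprecated subtag; the replacement depends on the runtime's data.
console.log(canonicalize("iw"));
```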
<p class="advisement" id="ltli-extensions-not-required"><a class="self" href="#ltli-extensions-not-required">​</a>Content authors SHOULD NOT include <a>language tag extensions</a> in a <a>language tag</a> unless the specific application requires the additional tailoring.</p>
<p>It is important to remember that every <a>Unicode locale identifier</a> is <em>also</em> a <a>well-formed</a> [[BCP47]] language tag. <a>Unicode locale identifiers</a> do not require the use of either of [[CLDR]]'s <a>language tag extensions</a>.</p>
<p class="note">Some international and cultural preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.</p>
<p>Here are a few selected examples of <a>Unicode Locale identifiers</a> and the variations associated with them.</p>
<aside class="example" id="example-locale-variation" title="Numeric Formats and Digit Shapes">
<p>In this example, the value <code>123456789.5678</code> is formatted using the locale rules represented by the various language tags. Notice how the <code>u</code> extension and its <code>nu</code> keyword are used to select between Latin and Devanagari digit shapes in the Hindi-as-used-in-India (<code>hi-IN</code>) locale and between Latin and Arabic script digit shapes in the Arabic (<code>ar</code>) locale.</p>
<table>
<thead>
<tr>
<th style="width:20%">Variation Type</th>
<th style="width:20%">Value</th>
<th style="width:20%">Locale</th>
<th style="width:25%">Formatted Value</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=6>Numbering System</td>
<td rowspan=6><code>123456789.5678</code></td>
<td>en-US</td>
<td lang="en-US">123,456,789.5678</td>
</tr>
<tr>
<td>de</td>
<td lang="de">123.456.789,5678</td></tr>
<tr>
<td>hi-IN-u-nu-latn</td>
<td lang="hi">12,34,56,789.5678</td>
</tr>
<tr>
<td>hi-IN-u-nu-deva</td>
<td lang="hi-IN-u-nu-deva">१२,३४,५६,७८९.५६७८</td>
</tr>
<tr>
<td>ar-u-nu-latn</td>
<td dir=rtl lang="ar-u-nu-latn">123,456,789.5678</td>
</tr>
<tr>
<td>ar-u-nu-arab</td>
<td dir=rtl lang="ar-u-nu-arab">١٢٣٬٤٥٦٬٧٨٩٫٥٦٧٨</td>
</tr>
</tbody>
</table>
</aside>
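<p>The table above can be approximated in JavaScript with <code>Intl.NumberFormat</code>. This is a minimal sketch: the helper name <code>formatNumber</code> is hypothetical, and the exact output depends on the locale data shipped with the runtime.</p>

```javascript
// The same locale-neutral number, formatted under several language tags.
const value = 123456789.5678;
const numberTags = [
  'en-US', 'de',
  'hi-IN-u-nu-latn', 'hi-IN-u-nu-deva',
  'ar-u-nu-latn', 'ar-u-nu-arab',
];

// formatNumber is a hypothetical helper for this example.
function formatNumber(tag, n) {
  // maximumFractionDigits defaults to 3, so raise it to keep all four decimals.
  return new Intl.NumberFormat(tag, { maximumFractionDigits: 4 }).format(n);
}

for (const tag of numberTags) {
  // resolvedOptions() reports which numbering system the "nu" keyword selected.
  const { numberingSystem } = new Intl.NumberFormat(tag).resolvedOptions();
  console.log(tag, numberingSystem, formatNumber(tag, value));
}
```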
<aside class="example" id="example-calendar-variation" title="Date Formats and Calendars">
<p>In this example, a date value corresponding to <kbd>8 October 2020</kbd> on the Gregorian calendar is formatted using various different locales. In the tables below we present both the local-language and English (<code>en</code>) <a>locale</a> format of the same date value with different corresponding extension sequences supplied. This demonstrates the interplay between different locales and calendars when formatting a <a>locale-neutral</a> date value. Note that the <a>language tag extensions</a> can be applied to any <a>language tag</a> to modify the resulting <a>Unicode locale</a>.</p>
<p>Here are some presentational differences between English, French, and Japanese locales without using <a>language tag extensions</a> (each of which happens to use the Gregorian calendar):</p>
<table>
<thead>
<tr>
<th>Value</th>
<th>Locale</th>
<th>Formatted Value</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=3><code>2020-10-08T12:00:00Z</code></td>
<td>en</td>
<td>October 8, 2020</td>
</tr>
<tr>
<td>fr</td>
<td lang="fr">8 octobre 2020</td>
</tr>
<tr>
<td>ja</td>
<td lang="ja">2020年10月8日</td>
</tr>
</tbody>
</table>
<p>Thailand uses the Thai Buddhist calendar, which can be represented using the extension sequence <code>-u-ca-buddhist</code>. This calendar is similar to the Gregorian calendar, but uses a different year numbering scheme.</p>
<table>
<thead>
<tr>
<th>Value</th>
<th>Locale</th>
<th>Formatted Value</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=4><code>2020-10-08T12:00:00Z</code></td>
<td>en</td>
<td>October 8, 2020</td>
</tr>
<tr>
<td>th-u-ca-gregory</td>
<td lang="th-u-ca-gregory">8 ตุลาคม ค.ศ. 2020</td>
</tr>
<tr>
<td>th-u-ca-buddhist</td>
<td lang="th-u-ca-buddhist">8 ตุลาคม 2563</td>
</tr>
<tr>
<td>en-u-ca-buddhist</td>
<td lang="en-u-ca-buddhist">October 8, 2563 BE</td>
</tr>
</tbody>
</table>
<p>In addition to the Gregorian calendar, Japan uses other calendar systems for different cultural or official purposes. One such calendar is the Japanese Imperial calendar denoted by the extension sequence <code>-u-ca-japanese</code>. This calendar is also similar to the Gregorian calendar, but uses a different year numbering scheme.</p>
<table>
<thead>
<tr>
<th>Value</th>
<th>Locale</th>
<th>Formatted Value</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=3><code>2020-10-08T12:00:00Z</code></td>
<td>en</td>
<td>October 8, 2020</td>
</tr>
<tr>
<td>ja-u-ca-japanese</td>
<td lang="ja-u-ca-japanese">令和2年10月8日</td>
</tr>
<tr>
<td>en-u-ca-japanese</td>
<td lang="en-u-ca-japanese">October 8, 2 Reiwa</td>
</tr>
</tbody>
</table>
<p>Some countries or cultures use non-Gregorian calendars for official, religious, or cultural purposes. One such calendar is represented by the extension sequence <code>-u-ca-islamic</code>. This particular calendar is based on lunar months and thus <code>2020-10-08</code> (Gregorian) corresponds to the 21st day of the 2nd month (called "Safar" when rendered into English). This calendar also uses a different year numbering scheme.</p>
<table>
<thead>
<tr>
<th>Value</th>
<th>Locale</th>
<th>Formatted Value</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=3><code>2020-10-08T12:00:00Z</code></td>
<td>en</td>
<td>October 8, 2020</td>
</tr>
<tr>
<td>ar-u-ca-islamic</td>
<td dir="rtl" lang="ar-u-ca-islamic">٢١ صفر ١٤٤٢ هـ</td>
</tr>
<tr>
<td>en-u-ca-islamic</td>
<td lang="en-u-ca-islamic">Safar 21, 1442 AH</td>
</tr>
</tbody>
</table>
</aside>
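<p>A comparable sketch for dates uses <code>Intl.DateTimeFormat</code> with the <code>ca</code> keyword. The helper name <code>formatDate</code> is hypothetical, and the rendered strings depend on the runtime's locale data, so they might not match the tables above character for character.</p>

```javascript
// A locale-neutral instant; the trailing "Z" means the string is in UTC.
const instant = new Date('2020-10-08T12:00:00Z');

// formatDate is a hypothetical helper for this example.
function formatDate(tag, date) {
  // Pin the time zone so the result does not depend on the host's zone.
  const options = { dateStyle: 'long', timeZone: 'UTC' };
  return new Intl.DateTimeFormat(tag, options).format(date);
}

const calendarTags = [
  'en', 'fr', 'ja',
  'th-u-ca-gregory', 'th-u-ca-buddhist', 'en-u-ca-buddhist',
  'ja-u-ca-japanese', 'en-u-ca-japanese',
  'ar-u-ca-islamic', 'en-u-ca-islamic',
];

for (const tag of calendarTags) {
  console.log(tag, formatDate(tag, instant));
}
```

<p>The stored value never changes; only the language tag passed to the formatter does.</p>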
<p class="definition"><dfn data-lt="non-linguistic field|field|non-linguistic fields|fields">Non-linguistic Field</dfn>. Any element of a data structure not intended for the storage or interchange of natural language textual data. This includes non-string data types, such as booleans, numbers, dates, and so forth. It also includes strings, such as program or protocol internal identifiers. This document uses the term <em>field</em> as a shorthand for this concept.</p>
<p>Specifications for document formats or protocols usually define the exchange, processing, or display of various data values or data structures. The Web primarily relies on text files for the serialization and exchange of data: even raw bytes are usually transmitted using a string serialization such as base64. Thus <a>non-linguistic fields</a> on the Web are also normally made up of strings. The important distinction here is that <a>non-linguistic fields</a> are generally interpreted by or meant for consumption by the underlying application, rather than by a user.</p>
<p class="definition"><dfn data-lt="locale-neutral|locale neutral">Locale-neutral</dfn>. A <a>non-linguistic field</a> is said to be <a>locale-neutral</a> when it is stored or exchanged in a format that is not specifically appropriate for any given language, locale, or culture and which can be interpreted unambiguously for presentation in a <a>locale aware</a> way.</p>
<p>Many specifications use a serialization scheme, such as those provided by [[XMLSCHEMA11-2]] or [[JSON-LD]], to provide a <a>locale neutral</a> encoding of <a>non-linguistic fields</a> in document formats or protocols.</p>
<p>A <a>locale-neutral</a> representation might itself be linked to a specific cultural preference, but such linkages should be minimized. For example, many of the ISO 8601 date/time value serializations are linked to the Gregorian calendar, but the format, field order, separators, and visual appearance are not specifically suitable to any locale (they are intended to be machine readable) and, as shown in the <a href="#example-calendar-variation">example</a> above, the value can be converted for display into any calendar or locale.</p>
<aside class="example">
<p>Suppose your application needs to collect and store some value in a <a>field</a>. The system can use a <a>locale-neutral</a> format for storing and exchanging the value. For instance, schema languages such as [[XMLSCHEMA11-2]] or data formats such as [[JSON]] provide ready-made types for this purpose. When the user is entering or editing the value, however, the user expects to interact with a more human-friendly format. For example, if your application needed to input a user's birth date and the value they were trying to enter were <code>2020-01-31</code>:</p>
<p>The input field might look like this in HTML:</p>
<pre>&lt;input type="date" id="birthDate" value="2020-01-31" lang=… &gt;</pre>
<p>The <code>lang</code> attribute here should control the display and formatting of the value, including the expected input pattern. <em>Note that this guidance is at odds with what browsers did at the time this document was published.</em></p>
<table>
<thead>
<tr>
<th style="width:15%">Value</th>
<th style="width:22%">Language Tag</th>
<th style="width:23%">Display</th>
<th style="width:30%">Input Format Pattern</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=4><code>2020-01-31</code></td>
<td>en-GB</td>
<td>31/01/2020</td>
<td>dd/MM/yyyy</td>
</tr>
<tr>
<td>en-US</td>
<td>01/31/2020</td>
<td>MM/dd/yyyy</td>
</tr>
<tr>
<td>fr-FR</td>
<td>31-01-2020</td>
<td>dd-MM-yyyy</td>
</tr>
<tr>
<td>zh-Hans-CN</td>
<td>2020-01-31</td>
<td>yyyy-MM-dd</td>
</tr>
</tbody>
</table>
</aside>
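<p>The split between stored and displayed values can be sketched as follows. The helper name <code>displayDate</code> is hypothetical; the separators and field order come from the runtime's locale data, so for some locales they might differ from the illustrative patterns in the table above.</p>

```javascript
// Stored and exchanged form: a locale-neutral ISO 8601 calendar date.
const stored = '2020-01-31';

// displayDate is a hypothetical helper for this example.
function displayDate(isoDate, tag) {
  // A date-only ISO string is parsed as midnight UTC, so format in UTC too.
  const date = new Date(isoDate);
  return new Intl.DateTimeFormat(tag, {
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
    timeZone: 'UTC',
  }).format(date);
}

console.log(displayDate(stored, 'en-GB')); // 31/01/2020
console.log(displayDate(stored, 'en-US')); // 01/31/2020
```

<p>Only the presentation varies with the language tag; the value written back to storage stays <code>2020-01-31</code>.</p>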
<p class="definition"><dfn>Language negotiation</dfn>. The process of matching a user's <a>international preferences</a> to available locales, localized resources, content, or processing.</p>
<p class="definition"><dfn data-lt="locale fallback|fallback">Locale fallback</dfn>. The process of searching for translated content, locale data, or other resources by "falling back" from more-specific resources to more-general ones following a deterministic pattern.</p>
<p>A user's preferences are usually expressed as a <a>locale</a> or prioritized list of locales. When negotiating the language, the system follows some sort of algorithm to get the best matching content or functionality from the available resources. In many cases the language negotiation algorithm uses <a>locale fallback</a>.</p>
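<p>One simple form of <a>locale fallback</a> removes subtags from the right until a match is found, in the style of the lookup scheme described in RFC 4647. The sketch below is illustrative only: the function names <code>fallbackChain</code> and <code>negotiate</code> and the list of available resources are assumptions, and real systems often use richer matching.</p>

```javascript
// Build the chain of progressively more general tags for one preference.
function fallbackChain(tag) {
  const subtags = tag.split('-');
  const chain = [];
  while (subtags.length > 0) {
    chain.push(subtags.join('-'));
    subtags.pop();
    // A singleton such as "u" or "x" cannot end a tag, so drop it as well.
    while (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags.pop();
    }
  }
  return chain;
}

// Return the first available tag matched by any preference, in priority order.
function negotiate(preferences, available, defaultTag) {
  // Language tags compare case-insensitively.
  const byLowerCase = new Map(available.map((tag) => [tag.toLowerCase(), tag]));
  for (const preference of preferences) {
    for (const candidate of fallbackChain(preference)) {
      const match = byLowerCase.get(candidate.toLowerCase());
      if (match !== undefined) return match;
    }
  }
  return defaultTag;
}

const available = ['en', 'fr', 'zh-Hant']; // hypothetical localized resources
console.log(fallbackChain('hi-IN-u-nu-deva'));
// [ 'hi-IN-u-nu-deva', 'hi-IN-u-nu', 'hi-IN', 'hi' ]
console.log(negotiate(['zh-Hant-TW', 'en-GB'], available, 'en')); // zh-Hant
console.log(negotiate(['de-CH', 'fr-CA'], available, 'en'));      // fr
console.log(negotiate(['ja'], available, 'en'));                  // en
```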
<p class="advisement" id="ltli-format-like-doc-language"><a class="self" href="#ltli-format-like-doc-language">​</a>Specifications that present <a>fields</a> in a document format SHOULD require that data is formatted according to the language of the surrounding content.</p>
<p>When <a>non-linguistic fields</a> are presented to the user as part of a document or application, the document or application forms the "context" where the data is being viewed. Content authors or application developers need a way to make the <a>fields</a> seem like a natural part of the experience and need a way to control the presentation. This is indicated by the <a>language tag</a> of the context in which the content appears: usually <a>enabled</a> implementations interpret the tag as a <a>locale</a> in order to accomplish this. Using the runtime locale or localization of the user-agent as the locale for presenting <a>non-linguistic fields</a> should only be a last resort.</p>
<p class="advisement" id="ltli-input-like-doc-language"><a class="self" href="#ltli-input-like-doc-language">​</a>Specifications that present forms or receive input of <a>non-linguistic fields</a> in a document format or application SHOULD require that the values be presented to the user <a>localized</a> in the format of the language of the content or markup immediately surrounding the value.</p>
<p class="advisement" id="ltli-input-locale-neutrality"><a class="self" href="#ltli-input-locale-neutrality">​</a>Specifications that present, exchange, or allow the input of <a>non-linguistic fields</a> MUST use a <a>locale-neutral</a> format for storage and interchange.</p>
<p class="advisement" id="ltli-impl-input-like-doc-language"><a class="self" href="#ltli-impl-input-like-doc-language">​</a>Implementations SHOULD present <a>non-linguistic fields</a> in a document format or application using a format consistent with the language of the surrounding content and are encouraged to provide controls which are <a>localized</a> to the same <a>locale</a> for input or editing.</p>
<p>Users expect form fields and other data inputs to use a presentation for <a>non-linguistic fields</a> that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, and they expect input validation, prompting, and controls to be consistent with the content as well. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.</p>
</section>
<section id="metadata-versus-text-processing">
<h3>Choosing between metadata and text-processing language</h3>
<p>There are two common uses for language tags in document formats, protocols, and specifications. In some cases, language tags are used to provide metadata about the intended audience for collections of content, such as at the record or document level. In other cases, language tags are used to identify the language of specific bits of text in order to facilitate text processing.</p>
<section id="intended-audience">
<h5>The language of the intended audience</h5>
<p>Metadata that describes the language of the intended audience is about <strong>the document as a whole</strong>. Such metadata may be used for searching, serving the right language version, classification, etc. Where there are language changes in a document, information about the language of the intended audience is not specific enough to support the kind of text processing needed for text-to-speech, styling, automatic font assignment, etc.</p>
<p>The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.</p>
<p>On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.</p>
<p>There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.</p>
<p>Metadata about the language of the intended audience is usually best declared outside the document, such as in the HTTP <span class="kw" translate="no">Content-Language</span> header.</p>
</section>
<section>
<h5>The text-processing language</h5>
<p>When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text (such as voice browsers, spell checkers, or style processors) can process the text in a language-appropriate manner. So we are, by necessity, talking about associating a single language with a specific range of text.</p>
<p>This specificity distinguishes the declaration of the language for text-processing from that of the language of the intended audience.</p>
<p>The language for text-processing is usually best declared using attributes on elements, including setting a document-wide default.</p>
<aside class="example">
<p>For example, the <span class="kw" translate="no">html</span> element in [[HTML]] contains all of the content of the document, so setting the <span class="kw" translate="no">lang</span> attribute sets the text-processing language for the whole document except where locally overridden. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, e.g. a French phrase in an English paragraph:</p>
<pre>&lt;html lang="en" dir="ltr"&gt;
&lt;head&gt;
&lt;title&gt;This example is in English&lt;/title&gt;
...
&lt;/head&gt;
&lt;body&gt;
&lt;h1&gt;This also inherits from &lt;code&gt;html&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;The following example is in French:
&lt;!-- Text-processing in French inside the 'span' tag --&gt;
&lt;span lang="fr"&gt;cet exemple est en français&lt;/span&gt;
&lt;!-- Text-processing reverts to English here --&gt;
&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
</aside>
<aside class="note">
<p>The text-processing language can also be used as the locale identifier, such as when the user-agent must format data or when setting the <span class="kw" translate="no">Intl.Locale</span> for a JavaScript formatting function.</p>
</aside>
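<p>To illustrate how an inherited text-processing language might double as a formatting locale, the sketch below models a document as plain objects rather than real DOM nodes. The tree, the helper names <code>effectiveLang</code> and <code>formatInContext</code>, and the choice of <code>und</code> as the fallback are all assumptions made for this example.</p>

```javascript
// Hypothetical miniature document tree: a node may declare "lang" or inherit it.
const htmlNode = { name: 'html', lang: 'en', parent: null };
const paragraphNode = { name: 'p', parent: htmlNode };
const spanNode = { name: 'span', lang: 'fr', parent: paragraphNode };

// Find the nearest declared language, walking from the node towards the root.
function effectiveLang(node) {
  for (let current = node; current !== null; current = current.parent) {
    if (current.lang) return current.lang;
  }
  return 'und'; // "undetermined" when nothing is declared
}

// Format a number using the node's text-processing language as the locale.
function formatInContext(node, number) {
  const locale = new Intl.Locale(effectiveLang(node));
  return new Intl.NumberFormat(locale.toString(), {
    maximumFractionDigits: 2,
  }).format(number);
}

console.log(effectiveLang(paragraphNode));           // en (inherited from html)
console.log(effectiveLang(spanNode));                // fr (locally overridden)
console.log(formatInContext(paragraphNode, 1234.5)); // 1,234.5
console.log(formatInContext(spanNode, 1234.5));      // French grouping and decimal comma
```

<p>In a browser, a similar lookup could be done with the DOM method <code>element.closest('[lang]')</code> instead of the hand-rolled tree walk.</p>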
</section>
</section>
<section id="further-reading">
<h2>Further Reading</h2>
<p>The Internationalization WG has additional best practices and other references, such as articles on language tag choice. These include: </p>
<ul>
<li><a href="https://www.w3.org/TR/string-meta">Strings on the Web: Language and Direction Metadata</a> [[STRING-META]]</li>
<li><a href="https://www.w3.org/International/questions/qa-choosing-language-tags">Choosing a Language Tag</a></li>
<li><a href="https://www.w3.org/International/articles/language-tags/">Language Tags in HTML and XML</a></li>
<li><a href="https://www.w3.org/International/questions/qa-no-language">Tagging text with no language</a></li>
<li><a href="https://www.w3.org/International/questions/qa-i18n">Localization vs. Internationalization</a></li>
</ul>
</section>
<section class="appendix" id="revisionlog">
<h3>Revision Log</h3>
<p>Changes to this document following the <a href="http://www.w3.org/TR/2015/WD-ltli-20150423/">Working Draft</a> of 2015-04-23 are available via the <a href="https://github.com/w3c/ltli/commits/gh-pages">GitHub commit log</a>. This document has been significantly restructured since that revision. Notably:</p>
<ul>
<li>As this document is targeting Note status, removed mention of "normative" and "informative" and converted all references to standard ones.</li>
<li>Restructured document to be more accessible. Some sections were reordered or removed.</li>
<li>Removed the WS-I18N appendix.</li>
</ul>
<p>The following changes were made since the revision of 2006-06-20.</p>
<ul>
<li>Converted the format to ReSpec.</li>
<li>All references to RFC3066bis were updated to BCP 47 or to RFC 5646 or
RFC 4647 as appropriate.</li>
<li>References to HTML were changed to point to HTML5.</li>
<li>Imported and rewrote the text formerly contained in
[[WS-I18N-SCENARIOS]] defining internationalization, locale, and other
important terms.</li>
<li>Modified and reorganized the other sections of this document. Moved
the Web services materials to an appendix.</li>
<li>Modified the SOTD to reflect our use of GitHub.</li>
</ul>
<p>The following log records changes that have been made to this document
since the <a href="http://www.w3.org/TR/2006/WD-ltli-20060419/">publication
in April 2006</a>.</p>
<ul>
<li>
<p>The informative introductory section has been rewritten thoroughly,
including the description of the scope of the document, of
application scenarios and of the separation of locale versus natural
language.</p>
</li>
<li>
<p>Terms which rely on [[BCP47]] are not <em>defined</em> anymore,
but only <em>reference</em> these documents. In addition, examples
for these terms were created.</p>
</li>
<li>
<p>The requirements for language and locale values have been taken out
of the conformance section and are now placed in the body of the document.</p>
</li>
<li>
<p>A revision log has been created.</p>
</li>
</ul>
</section>
<section id="acknowledgements">
<h2>Acknowledgements</h2>
<p>The Internationalization Working Group would like to acknowledge the
following contributors to this specification:</p>
<ul>
<li>Felix Sasaki, for editing this document and publishing the 2006 WD</li>
<li>Mark Davis, Y.Umaoka, and others for contributions to the BCP47
extensions</li>
<li>WS-I18N contributors: this document was created to satisfy
requirement R005 in [[WS-I18N-REQ]].</li>
</ul>
</section>
</body>
</html>