Network Working Group M. Duerst
Request for Comments: 3987 W3C
Category: Standards Track M. Suignard
Microsoft Corporation
January 2005
Internationalized Resource Identifiers (IRIs)
Status of This Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2005).
Abstract
This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the Uniform Resource
Identifier (URI). An IRI is a sequence of characters from the
Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to
URIs is defined, which means that IRIs can be used instead of URIs,
where appropriate, to identify resources.
The approach of defining a new protocol element was chosen instead of
extending or changing the definition of URIs. This was done in order
to allow a clear distinction and to avoid incompatibilities with
existing software. Guidelines are provided for the use and
deployment of IRIs in various protocols, formats, and software
components that currently deal with URIs.
Table of Contents
1. Introduction
   1.1. Overview and Motivation
   1.2. Applicability
   1.3. Definitions
   1.4. Notation
2. IRI Syntax
   2.1. Summary of IRI Syntax
   2.2. ABNF for IRI References and IRIs
Duerst & Suignard Standards Track [Page 1]
RFC 3987 Internationalized Resource Identifiers January 2005
3. Relationship between IRIs and URIs
   3.1. Mapping of IRIs to URIs
   3.2. Converting URIs to IRIs
        3.2.1. Examples
4. Bidirectional IRIs for Right-to-Left Languages
   4.1. Logical Storage and Visual Presentation
   4.2. Bidi IRI Structure
   4.3. Input of Bidi IRIs
   4.4. Examples
5. Normalization and Comparison
   5.1. Equivalence
   5.2. Preparation for Comparison
   5.3. Comparison Ladder
        5.3.1. Simple String Comparison
        5.3.2. Syntax-Based Normalization
        5.3.3. Scheme-Based Normalization
        5.3.4. Protocol-Based Normalization
6. Use of IRIs
   6.1. Limitations on UCS Characters Allowed in IRIs
   6.2. Software Interfaces and Protocols
   6.3. Format of URIs and IRIs in Documents and Protocols
   6.4. Use of UTF-8 for Encoding Original Characters
   6.5. Relative IRI References
7. URI/IRI Processing Guidelines (informative)
   7.1. URI/IRI Software Interfaces
   7.2. URI/IRI Entry
   7.3. URI/IRI Transfer between Applications
   7.4. URI/IRI Generation
   7.5. URI/IRI Selection
   7.6. Display of URIs/IRIs
   7.7. Interpretation of URIs and IRIs
   7.8. Upgrading Strategy
8. Security Considerations
9. Acknowledgements
10. References
    10.1. Normative References
    10.2. Informative References
A. Design Alternatives
   A.1. New Scheme(s)
   A.2. Character Encodings Other Than UTF-8
   A.3. New Encoding Convention
   A.4. Indicating Character Encodings in the URI/IRI
Authors' Addresses
Full Copyright Statement
1. Introduction
1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of
natural languages. This usage has many advantages: Such URIs are
easier to memorize, easier to interpret, easier to transcribe, easier
to create, and easier to guess. For most languages other than
English, however, the natural script uses characters other than A -
Z. For many people, handling Latin characters is as difficult as
handling the characters of other scripts is for those who use only
the Latin alphabet. Many languages with non-Latin scripts are
transcribed with Latin letters. These transcriptions are now often
used in URIs, but they introduce additional ambiguities.
The infrastructure for the appropriate handling of characters from
local scripts is now widely deployed in local versions of operating
system and application software. Software that can handle a wide
variety of scripts and languages at the same time is increasingly
common. Also, increasing numbers of protocols and formats can carry
a wide range of characters.
This document defines a new protocol element called Internationalized
Resource Identifier (IRI) by extending the syntax of URIs to a much
wider repertoire of characters. It also defines "internationalized"
versions corresponding to other constructs from [RFC3986], such as
URI references. The syntax of IRIs is defined in section 2, and the
relationship between IRIs and URIs in section 3.
Using characters outside of A - Z in IRIs brings some difficulties.
Section 4 discusses the special case of bidirectional IRIs, section 5
various forms of equivalence between IRIs, and section 6 the use of
IRIs in different situations. Section 7 gives additional informative
guidelines, and section 8 security considerations.
1.2. Applicability
IRIs are designed to be compatible with recommendations for new URI
schemes [RFC2718]. The compatibility is provided by specifying a
well-defined and deterministic mapping from the IRI character
sequence to the functionally equivalent URI character sequence.
Practical use of IRIs (or IRI references) in place of URIs (or URI
references) depends on the following conditions being met:
a. A protocol or format element should be explicitly designated to
be able to carry IRIs. The intent is not to introduce IRIs into
contexts that are not defined to accept them. For example, XML
schema [XMLSchema] has an explicit type "anyURI" that includes
IRIs and IRI references. Therefore, IRIs and IRI references can
be in attributes and elements of type "anyURI". On the other
hand, in the HTTP protocol [RFC2616], the Request URI is defined
as a URI, which means that direct use of IRIs is not allowed in
HTTP requests.
b. The protocol or format carrying the IRIs should have a mechanism
to represent the wide range of characters used in IRIs, either
natively or by some protocol- or format-specific escaping
mechanism (for example, numeric character references in [XML1]).
c. The URI corresponding to the IRI in question has to encode
original characters into octets using UTF-8. For new URI
schemes, this is recommended in [RFC2718]. It can apply to a
whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
or the URN syntax [RFC2141]). It can apply to a specific part of
a URI, such as the fragment identifier (e.g., [XPointer]). It
can apply to a specific URI or part(s) thereof. For details,
please see section 6.4.
1.3. Definitions
The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277], and [ISO10646].
character: A member of a set of elements used for the organization,
control, or representation of data. For example, "LATIN CAPITAL
LETTER A" names a character.
octet: An ordered sequence of eight bits considered as a unit.
character repertoire: A set of characters (in the mathematical
sense).
sequence of characters: A sequence of characters (one after another).
sequence of octets: A sequence of octets (one after another).
character encoding: A method of representing a sequence of characters
as a sequence of octets (maybe with variants). Also, a method of
(unambiguously) converting a sequence of octets into a sequence of
characters.
charset: The name of a parameter or attribute used to identify a
character encoding.
UCS: Universal Character Set. The coded character set defined by
ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].
IRI reference: Denotes the common usage of an Internationalized
Resource Identifier. An IRI reference may be absolute or
relative. However, the "IRI" that results from such a reference
only includes absolute IRIs; any relative IRI references are
resolved to their absolute form. Note that in [RFC2396] URIs did
not include fragment identifiers, but in [RFC3986] fragment
identifiers are part of URIs.
running text: Human text (paragraphs, sentences, phrases) with syntax
according to orthographic conventions of a natural language, as
opposed to syntax defined for ease of processing by machines
(e.g., markup, programming languages).
protocol element: Any portion of a message that affects processing of
that message by the protocol in question.
presentation element: A presentation form corresponding to a protocol
element; for example, using a wider range of characters.
create (a URI or IRI): With respect to URIs and IRIs, the term is
used for the initial creation. This may be the initial creation
of a resource with a certain identifier, or the initial exposition
of a resource under a particular identifier.
generate (a URI or IRI): With respect to URIs and IRIs, the term is
used when the IRI is generated by derivation from other
information.
1.4. Notation
RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses
various special notations to denote such characters in examples.
In text, characters outside US-ASCII are sometimes referenced by
using a prefix of 'U+', followed by four to six hexadecimal digits.
To represent characters outside US-ASCII in examples, this document
uses two notations: 'XML Notation' and 'Bidi Notation'.
XML Notation uses a leading '&#x', a trailing ';', and the
hexadecimal number of the character in the UCS in between. For
example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this
notation, an actual '&' is denoted by '&amp;'.
Bidi Notation is used for bidirectional examples: Lowercase letters
stand for Latin letters or other letters that are written left to
right, whereas uppercase letters represent Arabic or Hebrew letters
that are written right to left.
To denote actual octets in examples (as opposed to percent-encoded
octets), the two hex digits denoting the octet are enclosed in "<"
and ">". For example, the octet often denoted as 0xc9 is denoted
here as <c9>.
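As an illustration of the XML Notation described above, the following Python sketch converts between actual characters and the notation. The function names to_xml_notation and from_xml_notation are hypothetical and not part of this specification. The reverse direction relies on html.unescape from the Python standard library, which also accepts named references that this document does not use.

```python
import html


def to_xml_notation(text: str) -> str:
    """Write non-ASCII characters as &#xHH...; and a literal '&' as &amp;."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ord(ch) > 0x7F:
            # Hexadecimal number of the character in the UCS.
            out.append(f"&#x{ord(ch):X};")
        else:
            out.append(ch)
    return "".join(out)


def from_xml_notation(text: str) -> str:
    """Turn XML Notation back into the characters it stands for."""
    return html.unescape(text)


print(to_xml_notation("http://r\u00e9sum\u00e9.example.org"))
print(from_xml_notation("&#x44F;") == "\u044f")
```

The examples in this document can be pasted through from_xml_notation to obtain the character sequences they denote.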
In this document, the key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" are to be interpreted as described in [RFC2119].
2. IRI Syntax
This section defines the syntax of Internationalized Resource
Identifiers (IRIs).
As with URIs, an IRI is defined as a sequence of characters, not as a
sequence of octets. This definition accommodates the fact that IRIs
may be written on paper or read over the radio as well as stored or
transmitted digitally. The same IRI may be represented as different
sequences of octets in different protocols or documents if these
protocols or documents use different character encodings (and/or
transfer encodings). Using the same character encoding as the
containing protocol or document ensures that the characters in the
IRI can be handled (e.g., searched, converted, displayed) in the same
way as the rest of the protocol or document.
2.1. Summary of IRI Syntax
IRIs are defined similarly to URIs in [RFC3986], but the class of
unreserved characters is extended by adding the characters of the UCS
(Universal Character Set, [ISO10646]) beyond U+007F, subject to the
limitations given in the syntax rules below and in section 6.1.
Otherwise, the syntax and use of components and reserved characters
is the same as that in [RFC3986]. All the operations defined in
[RFC3986], such as the resolution of relative references, can be
applied to IRIs by IRI-processing software in exactly the same way as
they are for URIs by URI-processing software.
Characters outside the US-ASCII repertoire are not reserved and
therefore MUST NOT be used for syntactical purposes, such as to
delimit components in newly defined schemes. For example, U+00A2,
CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
the 'iunreserved' category. This is similar to the fact that it is
not possible to use '-' as a delimiter in URIs, because it is in the
'unreserved' category.
2.2. ABNF for IRI References and IRIs
Although it might be possible to define IRI references and IRIs
merely by their transformation to URI references and URIs, they can
also be accepted and processed directly. Therefore, an ABNF
definition for IRI references (which are the most general concept and
the start of the grammar) and IRIs is given here. The syntax of this
ABNF is described in [RFC2234]. Character numbers are taken from the
UCS, without implying any actual binary encoding. Terminals in the
ABNF are characters, not bytes.
The following grammar closely follows the URI grammar in [RFC3986],
except that the range of unreserved characters is expanded to include
UCS characters, with the restriction that private UCS characters can
occur only in query parts. The grammar is split into two parts:
Rules that differ from [RFC3986] because of the above-mentioned
expansion, and rules that are the same as those in [RFC3986]. For
rules that are different from those in [RFC3986], the names of the
non-terminals have been changed as follows. If the non-terminal
contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i'
has been prefixed.
The following rules are different from those in [RFC3986]:
   IRI            = scheme ":" ihier-part [ "?" iquery ]
                    [ "#" ifragment ]
   ihier-part     = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-rootless
                  / ipath-empty
   IRI-reference  = IRI / irelative-ref
   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]
   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
   irelative-part = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-noscheme
                  / ipath-empty
   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
   iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
   ihost          = IP-literal / IPv4address / ireg-name
   ireg-name      = *( iunreserved / pct-encoded / sub-delims )
   ipath          = ipath-abempty   ; begins with "/" or is empty
                  / ipath-absolute  ; begins with "/" but not "//"
                  / ipath-noscheme  ; begins with a non-colon segment
                  / ipath-rootless  ; begins with a segment
                  / ipath-empty     ; zero characters
   ipath-abempty  = *( "/" isegment )
   ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
   ipath-noscheme = isegment-nz-nc *( "/" isegment )
   ipath-rootless = isegment-nz *( "/" isegment )
   ipath-empty    = 0<ipchar>
   isegment       = *ipchar
   isegment-nz    = 1*ipchar
   isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                        / "@" )
                  ; non-zero-length segment without any colon ":"
   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"
   iquery         = *( ipchar / iprivate / "/" / "?" )
   ifragment      = *( ipchar / "/" / "?" )
   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD
   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
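The 'ucschar' and 'iprivate' productions are plain code point ranges, so membership can be tested without any encoding step. The following Python sketch transcribes the ranges above. The names UCSCHAR_RANGES, IPRIVATE_RANGES, and classify are hypothetical and chosen only for illustration.

```python
# Ranges transcribed from the 'ucschar' production.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR_RANGES += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
# Plane 14 starts at E1000, not E0000.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# Ranges transcribed from the 'iprivate' production (query part only).
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def classify(code_point: int):
    """Return 'ucschar', 'iprivate', or None for any other code point."""
    if any(lo <= code_point <= hi for lo, hi in UCSCHAR_RANGES):
        return "ucschar"
    if any(lo <= code_point <= hi for lo, hi in IPRIVATE_RANGES):
        return "iprivate"
    return None


print(classify(0x00A2))   # CENT SIGN
print(classify(0xE000))   # first private-use code point in the BMP
print(classify(ord("A")))
```

Note that ASCII letters return None here: they are part of 'iunreserved' through ALPHA, not through 'ucschar'.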
Some productions are ambiguous. The "first-match-wins" (a.k.a.
"greedy") algorithm applies. For details, see [RFC3986].
The following rules are the same as those in [RFC3986]:
   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
   port           = *DIGIT
   IP-literal     = "[" ( IPv6address / IPvFuture ) "]"
   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
   IPv6address    =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"
   h16            = 1*4HEXDIG
   ls32           = ( h16 ":" h16 ) / IPv4address
   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet
   dec-octet      = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255
   pct-encoded    = "%" HEXDIG HEXDIG
   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
This syntax does not support IPv6 scoped addressing zone identifiers.
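As one small illustration of how these shared rules translate into code, the 'dec-octet' and 'IPv4address' productions can be written as a regular expression. This is a sketch only: the names DEC_OCTET, IPV4ADDRESS, and is_ipv4address are hypothetical, and the alternatives are listed longest first so that the regular expression engine does not stop at a shorter match.

```python
import re

# dec-octet: 250-255, 200-249, 100-199, 10-99, 0-9 (no leading zeros).
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_ipv4address(text: str) -> bool:
    """True if the whole string matches the IPv4address production."""
    return IPV4ADDRESS.fullmatch(text) is not None


print(is_ipv4address("192.0.2.1"))
print(is_ipv4address("256.0.2.1"))
```

A host that fails this test is not an error by itself; under the 'ihost' rule it may still match 'ireg-name'.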
3. Relationship between IRIs and URIs
IRIs are meant to replace URIs in identifying resources for
protocols, formats, and software components that use a UCS-based
character repertoire. These protocols and components may never need
to use URIs directly, especially when the resource identifier is used
simply for identification purposes. However, when the resource
identifier is used for resource retrieval, it is in many cases
necessary to determine the associated URI, because currently most
retrieval mechanisms are only defined for URIs. In this case, IRIs
can serve as presentation elements for URI protocol elements. An
example would be an address bar in a Web user agent. (Additional
rationale is given in section 3.1.)
3.1. Mapping of IRIs to URIs
This section defines how to map an IRI to a URI. Everything in this
section also applies to IRI references and URI references, as well as
to components thereof (for example, fragment identifiers).
This mapping has two purposes:
Syntactical. Many URI schemes and components define additional
syntactical restrictions not captured in section 2.2.
Scheme-specific restrictions are applied to IRIs by converting
IRIs to URIs and checking the URIs against the scheme-specific
restrictions.
Interpretational. URIs identify resources in various ways. IRIs also
identify resources. When the IRI is used solely for
identification purposes, it is not necessary to map the IRI to a
URI (see section 5). However, when an IRI is used for resource
retrieval, the resource that the IRI locates is the same as the
one located by the URI obtained after converting the IRI according
to the procedure defined here. This means that there is no need
to define resolution separately on the IRI level.
Applications MUST map IRIs to URIs by using the following two steps.
Step 1. Generate a UCS character sequence from the original IRI
format. This step has the following three variants,
depending on the form of the input:
a. If the IRI is written on paper, read aloud, or otherwise
represented as a sequence of characters independent of
any character encoding, represent the IRI as a sequence
of characters from the UCS normalized according to
Normalization Form C (NFC, [UTR15]).
b. If the IRI is in some digital representation (e.g., an
octet stream) in some known non-Unicode character
encoding, convert the IRI to a sequence of characters
from the UCS normalized according to NFC.
c. If the IRI is in a Unicode-based character encoding (for
example, UTF-8 or UTF-16), do not normalize (see section
5.3.2.2 for details). Apply step 2 directly to the
encoded Unicode character sequence.
Step 2. For each character in 'ucschar' or 'iprivate', apply steps
2.1 through 2.3 below.
2.1. Convert the character to a sequence of one or more octets
using UTF-8 [RFC3629].
2.2. Convert each octet to %HH, where HH is the hexadecimal
notation of the octet value. Note that this is identical
to the percent-encoding mechanism in section 2.1 of
[RFC3986]. To reduce variability, the hexadecimal notation
SHOULD use uppercase letters.
2.3. Replace the original character with the resulting character
sequence (i.e., a sequence of %HH triplets).
The above mapping from IRIs to URIs produces URIs fully conforming to
[RFC3986]. The mapping is also an identity transformation for URIs
and is idempotent; applying the mapping a second time will not
change anything. Every URI is by definition an IRI.
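The two steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names iri_to_uri and _TO_ENCODE are hypothetical, the input is assumed to be a Python string (a sequence of UCS characters), and the caller is assumed to pass normalize=True only for variants a and b of step 1.

```python
import re
import unicodedata

# Character class covering 'ucschar' and 'iprivate' from section 2.2.
_RANGES = (
    "\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF"
    + "".join(f"{chr(p << 16)}-{chr((p << 16) | 0xFFFD)}" for p in range(1, 14))
    + "\U000E1000-\U000EFFFD"
    + "\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD"
)
_TO_ENCODE = re.compile(f"[{_RANGES}]")


def iri_to_uri(iri: str, normalize: bool = False) -> str:
    """Map an IRI to a URI following steps 1 and 2."""
    # Step 1: NFC only when the input did not come from a Unicode encoding.
    if normalize:
        iri = unicodedata.normalize("NFC", iri)
    # Step 2: UTF-8 encode each matching character (2.1), write each octet
    # as %HH with uppercase hex digits (2.2), and substitute it (2.3).
    return _TO_ENCODE.sub(
        lambda m: "".join(f"%{octet:02X}" for octet in m.group().encode("utf-8")),
        iri,
    )


uri = iri_to_uri("http://www.example.org/ros\u00e9")
print(uri)
print(iri_to_uri(uri) == uri)  # applying the mapping again changes nothing
```

Because only characters in 'ucschar' or 'iprivate' are touched, existing percent-encodings and all US-ASCII characters pass through unchanged, which is what makes the mapping an identity on URIs.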
Systems accepting IRIs MAY convert the ireg-name component of an IRI
as follows (before step 2 above) for schemes known to use domain
names in ireg-name, if the scheme definition does not allow
percent-encoding for ireg-name:
Replace the ireg-name part of the IRI by the part converted using the
ToASCII operation specified in section 4.1 of [RFC3490] on each
dot-separated label, and by using U+002E (FULL STOP) as a label
separator, with the flag UseSTD3ASCIIRules set to TRUE, and with the
flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE
otherwise.
The ToASCII operation may fail, but this would mean that the IRI
cannot be resolved. This conversion SHOULD be used when the goal is
to maximize interoperability with legacy URI resolvers. For example,
the IRI
"http://r&#xE9;sum&#xE9;.example.org"
may be converted to
"http://xn--rsum-bpad.example.org"
instead of
"http://r%C3%A9sum%C3%A9.example.org".
An IRI with a scheme that is known to use domain names in ireg-name,
but where the scheme definition does not allow percent-encoding for
ireg-name, meets scheme-specific restrictions if either the
straightforward conversion or the conversion using the ToASCII
operation on ireg-name results in a URI that meets the
scheme-specific restrictions.
Such an IRI resolves to the URI obtained by converting the IRI with
the ToASCII operation applied to ireg-name. Implementations do not
have to perform this conversion as long as they produce the same result.
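The ireg-name conversion can be tried out with the "idna" codec in the Python standard library, which implements the ToASCII operation of [RFC3490] label by label. This is a sketch under assumptions: the function name ireg_name_to_ascii is hypothetical, and the codec does not expose the UseSTD3ASCIIRules or AllowUnassigned flags, so the flag settings required above are not reproduced here.

```python
from typing import Optional


def ireg_name_to_ascii(ireg_name: str) -> Optional[str]:
    """Apply ToASCII to each dot-separated label of an ireg-name.

    Returns None when the conversion fails, in which case the IRI
    cannot be resolved.
    """
    try:
        return ireg_name.encode("idna").decode("ascii")
    except UnicodeError:
        return None


host = ireg_name_to_ascii("r\u00e9sum\u00e9.example.org")
print(f"http://{host}")
```

Only the ireg-name is passed through the codec; the rest of the IRI still goes through the general mapping in step 2.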
Note: The difference between variants b and c in step 1 (using
normalization with NFC, versus not using any normalization)
accounts for the fact that in many non-Unicode character
encodings, some text cannot be represented directly. For example,
the word "Vietnam" is natively written "Vi&#x1EC7;t Nam"
(containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
in NFC, but a direct transcoding from the windows-1258 character
encoding leads to "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL
LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW).
Direct transcoding of other 8-bit encodings of Vietnamese may lead
to other representations.
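The effect described in this note can be reproduced with the unicodedata module of the Python standard library. In the sketch below the two spellings are written as escape sequences so that the difference stays visible; the variable names are hypothetical.

```python
import unicodedata

# U+1EC7: LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
precomposed = "Vi\u1EC7t Nam"
# U+00EA followed by U+0323: what a direct transcoding produces
transcoded = "Vi\u00EA\u0323t Nam"

print(precomposed == transcoded)
print(unicodedata.normalize("NFC", transcoded) == precomposed)
```

The two strings render identically but compare as different until NFC is applied, which is why variant b normalizes and variant c does not.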
Note: The uniform treatment of the whole IRI in step 2 is important
to make processing independent of URI scheme. See [Gettys] for an
in-depth discussion.
Note: In practice, whether the general mapping (steps 1 and 2) or the
ToASCII operation of [RFC3490] is used for ireg-name will not be
noticed if mapping from IRI to URI and resolution is tightly
integrated (e.g., carried out in the same user agent). But
conversion using [RFC3490] may be able to better deal with
backwards compatibility issues in case mapping and resolution are
separated, as in the case of using an HTTP proxy.
Note: Internationalized Domain Names may be contained in parts of an
IRI other than the ireg-name part. It is the responsibility of
scheme-specific implementations (if the Internationalized Domain
Name is part of the scheme syntax) or of server-side
implementations (if the Internationalized Domain Name is part of
'iquery') to apply the necessary conversions at the appropriate
point. Example: Trying to validate the Web page at
http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
example.org, which would convert to a URI of
http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
example.org. The server side implementation would be responsible
for making the necessary conversions to be able to retrieve the
Web page.
Systems accepting IRIs MAY also deal with the printable characters in
US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
"{", "}", "|", "\", "^", and "`", in step 2 above. If these
characters are found but are not converted, then the conversion
SHOULD fail. Please note that the number sign ("#"), the percent
sign ("%"), and the square bracket characters ("[", "]") are not part
of the above list and MUST NOT be converted. Protocols and formats
that have used earlier definitions of IRIs including these characters
MAY require percent-encoding of these characters as a preprocessing
step to extract the actual IRI from a given field. This
preprocessing MAY also be used by applications allowing the user to
enter an IRI.
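The optional handling of these printable US-ASCII characters can be sketched with urllib.parse.quote from the Python standard library. The names KEEP and escape_for_uri are hypothetical. One simplification should be kept in mind: quote percent-encodes every character outside its safe set, so it also escapes control characters and all non-ASCII characters rather than only 'ucschar' and 'iprivate'.

```python
from urllib.parse import quote

# Reserved characters plus "%". Together with the unreserved characters,
# which quote() never escapes, these are passed through unchanged, so
# "#", "%", "[" and "]" are not converted.
KEEP = ":/?#[]@!$&'()*+,;=" + "%"


def escape_for_uri(iri: str) -> str:
    """Percent-encode characters that are not allowed in URIs."""
    return quote(iri, safe=KEEP)


print(escape_for_uri("http://www.example.org/a b|c#frag%20[1]"))

# For contrast: treating the whole IRI as arbitrary content to be embedded
# in a URI component escapes the delimiters and the "%" as well.
print(quote("http://www.example.org/red%09ros\u00e9#red", safe=""))
```

The first call leaves the structure of the IRI intact, while the second produces a string that is no longer the same identifier.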
Note: In this process (in step 2.3), characters allowed in URI
references and existing percent-encoded sequences are not encoded
further. (This mapping is similar to, but different from, the
encoding applied when arbitrary content is included in some part
of a URI.) For example, an IRI of
"http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
converted to
"http://www.example.org/red%09ros%C3%A9#red", not to something
like
"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
Note: Some older software transcoding to UTF-8 may produce illegal
output for some input, in particular for characters outside the
BMP (Basic Multilingual Plane). As an example, for the IRI with
non-BMP characters (in XML Notation):
"http://example.com/&#x10300;&#x10301;&#x10302;"
which contains the first three letters of the Old Italic alphabet,
the correct conversion to a URI is
"http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"
3.2. Converting URIs to IRIs
In some situations, converting a URI into an equivalent IRI may be
desirable. This section gives a procedure for this conversion. The
conversion described in this section will always result in an IRI
that maps back to the URI used as an input for the conversion (except
for potential case differences in percent-encoding and for potential
percent-encoded unreserved characters). However, the IRI resulting
from this conversion may not be exactly the same as the original IRI
(if there ever was one).
URI-to-IRI conversion removes percent-encodings, but not all
percent-encodings can be eliminated. There are several reasons for
this:
1. Some percent-encodings are necessary to distinguish percent-
encoded and unencoded uses of reserved characters.
2. Some percent-encodings cannot be interpreted as sequences of
UTF-8 octets.
(Note: The octet patterns of UTF-8 are highly regular.
Therefore, there is a very high probability, but no guarantee,
that percent-encodings that can be interpreted as sequences of
UTF-8 octets actually originated from UTF-8. For a detailed
discussion, see [Duerst97].)
3. The conversion may result in a character that is not appropriate
in an IRI. See sections 2.2, 4.1, and 6.1 for further details.
Conversion from a URI to an IRI is done by using the following steps
(or any other algorithm that produces the same result):
1. Represent the URI as a sequence of octets in US-ASCII.
2. Convert all percent-encodings ("%" followed by two hexadecimal
digits) to the corresponding octets, except those corresponding
to "%", characters in "reserved", and characters in US-ASCII not
allowed in URIs.
3. Re-percent-encode any octet produced in step 2 that is not part
of a strictly legal UTF-8 octet sequence.
4. Re-percent-encode all octets produced in step 3 that in UTF-8
represent characters that are not appropriate according to
sections 2.2, 4.1, and 6.1.
5. Interpret the resulting octet sequence as a sequence of characters
encoded in UTF-8.
This procedure will convert as many percent-encoded characters as
possible to characters in an IRI. Because there are some choices
when step 4 is applied (see section 6.1), results may vary.
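The five steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the function names are hypothetical, re-encoded octets are written in uppercase hexadecimal, and the step 4 check only covers the bidirectional formatting characters forbidden by section 4.1. The further restrictions of sections 2.2 and 6.1 are not modelled.

```python
import re

# Octets of the "unreserved" ASCII characters.
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# LRM, RLM, LRE, RLE, PDF, LRO, RLO (simplified stand-in for step 4).
BIDI_FORMATTING = set("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(octets: bytes) -> str:
    return "".join(f"%{b:02X}" for b in octets)


def _convert_run(run: str) -> str:
    """Convert one run of consecutive percent-encodings."""
    data = bytes.fromhex(run.replace("%", ""))
    out = []
    j = 0
    while j < len(data):
        if data[j] < 0x80:
            # Step 2: only unreserved ASCII is decoded; "%", reserved
            # characters and ASCII not allowed in URIs keep their escape.
            keep = run[3 * j:3 * j + 3]
            out.append(chr(data[j]) if data[j] in UNRESERVED else keep)
            j += 1
            continue
        for n in (2, 3, 4):
            chunk = data[j:j + n]
            try:
                ch = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            # Step 4: re-encode characters not appropriate in an IRI.
            out.append(_pct(chunk) if ch in BIDI_FORMATTING else ch)
            j += len(chunk)
            break
        else:
            # Step 3: octet is not part of a strictly legal UTF-8 sequence.
            out.append(_pct(data[j:j + 1]))
            j += 1
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    uri.encode("ascii")  # Step 1: raises UnicodeEncodeError if not US-ASCII
    # Step 5 is implicit: the result is a Python string of characters.
    return PCT_RUN.sub(lambda m: _convert_run(m.group(0)), uri)
```

Python's strict UTF-8 decoder rejects overlong forms and encoded surrogates, which is why it is used here to approximate "strictly legal" sequences.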
Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that a character encoding other than UTF-8 was
used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html".
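The round-trip problem can be reproduced with Python's standard `urllib.parse` module. The snippet below is only an illustration of why guessing is forbidden; the variable names are arbitrary, and `quote` is used here as a stand-in for the IRI-to-URI mapping.

```python
from urllib.parse import quote, unquote

original = "http://www.example.org/r%E9sum%E9.html"

# Forbidden shortcut: guess that the octets are iso-8859-1.
guessed_iri = unquote(original, encoding="iso-8859-1")

# Mapping the guessed IRI back to a URI uses UTF-8, as required,
# so the result no longer matches the URI we started from.
remapped = quote(guessed_iri, safe=":/")

print(remapped)              # http://www.example.org/r%C3%A9sum%C3%A9.html
print(remapped == original)  # False
```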
3.2.1. Examples
This section shows various examples of converting URIs to IRIs. Each
example shows the result after each of the steps 1 through 5 is
applied. XML Notation is used for the final result. Octets are
denoted by "<" followed by two hexadecimal digits followed by ">".
The following example contains the sequence "%C3%BC", which is a
strictly legal UTF-8 sequence, and which is converted into the actual
character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
u-umlaut).
1. http://www.example.org/D%C3%BCrst
2. http://www.example.org/D<c3><bc>rst
3. http://www.example.org/D<c3><bc>rst
4. http://www.example.org/D<c3><bc>rst
5. http://www.example.org/D&#xFC;rst
The following example contains the sequence "%FC", which might
represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
iso-8859-1 character encoding. (It might represent other characters
in other character encodings. For example, the octet <fc> in
iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.) Because
<fc> is not part of a strictly legal UTF-8 sequence, it is
re-percent-encoded in step 3.
1. http://www.example.org/D%FCrst
2. http://www.example.org/D<fc>rst
3. http://www.example.org/D%FCrst
4. http://www.example.org/D%FCrst
5. http://www.example.org/D%FCrst
The following example contains "%e2%80%ae", which is the percent-
encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
Section 4.1 forbids the direct use of this character in an IRI.
Therefore, the corresponding octets are re-percent-encoded in step 4.
This example shows that the case (upper- or lowercase) of letters
used in percent-encodings may not be preserved. The example also
contains a punycode-encoded domain name label (xn--99zt52a), which is
not converted.
1. http://xn--99zt52a.example.org/%e2%80%ae
2. http://xn--99zt52a.example.org/<e2><80><ae>
3. http://xn--99zt52a.example.org/<e2><80><ae>
4. http://xn--99zt52a.example.org/%E2%80%AE
5. http://xn--99zt52a.example.org/%E2%80%AE
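The generic procedure leaves the label "xn--99zt52a" untouched. For readers who want to see what the label stands for, the following sketch decodes it with Python's built-in `idna` codec, which implements the RFC 3490 operations. The helper name `label_to_unicode` is hypothetical, and using this codec is an assumption about tooling, not something this document prescribes.

```python
def label_to_unicode(label: str) -> str:
    """Decode one ASCII domain name label with Python's "idna" codec."""
    return label.encode("ascii").decode("idna")


decoded = label_to_unicode("xn--99zt52a")
# The label decodes to the two characters U+7D0D U+8C46.
print([f"U+{ord(ch):04X}" for ch in decoded])  # ['U+7D0D', 'U+8C46']
```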
Implementations with scheme-specific knowledge MAY convert
punycode-encoded domain name labels to the corresponding characters
by using the ToUnicode procedure. Thus, for the example above, the
label "xn--99zt52a" may be converted to U+7D0D U+8C46 (Japanese
Natto), leading to the overall IRI of
"http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE".
4. Bidirectional IRIs for Right-to-Left Languages
Some UCS characters, such as those used in the Arabic and Hebrew
scripts, have an inherent right-to-left (rtl) writing direction.
IRIs containing these characters (called bidirectional IRIs or Bidi
IRIs) require additional attention because of the non-trivial
relation between logical representation (used for digital
representation and for reading/spelling) and visual representation
(used for display/printing).
Because of the complex interaction between the logical
representation, the visual representation, and the syntax of a Bidi
IRI, a balance is needed between various requirements. The main
requirements are
1. user-predictable conversion between visual and logical
representation;
2. the ability to include a wide range of characters in various
parts of the IRI; and
3. minor or no changes or restrictions for implementations.
4.1. Logical Storage and Visual Presentation
When stored or transmitted in digital representation, bidirectional
IRIs MUST be in full logical order and MUST conform to the IRI syntax
rules (which include the rules relevant to their scheme). This
ensures that bidirectional IRIs can be processed in the same way as
other IRIs.
Bidirectional IRIs MUST be rendered by using the Unicode
Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be
rendered in the same way as they would be if they were in a
left-to-right embedding; i.e., as if they were preceded by U+202A,
LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can
also be done in a higher-level protocol (e.g., the dir='ltr'
attribute in HTML).
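In plain text, the embedding can be added at display time and removed again before the IRI is stored or processed. The following Python sketch shows one way to do this; the function names are hypothetical, and the example IRI with two Hebrew letters is made up for illustration.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_for_display(iri: str) -> str:
    """Return the IRI surrounded by LRE ... PDF for rendering only."""
    return f"{LRE}{iri}{PDF}"


def unwrap_from_display(text: str) -> str:
    """Strip a surrounding LRE ... PDF pair, if present."""
    if text.startswith(LRE) and text.endswith(PDF):
        return text[1:-1]
    return text


iri = "http://example.org/\u05d0\u05d1"   # stored in logical order
shown = wrap_for_display(iri)             # what is handed to the renderer
assert unwrap_from_display(shown) == iri  # the stored IRI is unchanged
```

Keeping the wrapping in a separate display step reflects that the embedding characters only control rendering and are not part of the stored IRI.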
There is no requirement to use the above embedding if the display is
still the same without the embedding. For example, a bidirectional
IRI in a text with left-to-right base directionality (such as used
for English or Cyrillic) that is preceded and followed by whitespace
and strong left-to-right characters does not need an embedding.
Also, a bidirectional relative IRI reference that only contains
strong right-to-left characters and weak characters and that starts
and ends with a strong right-to-left character and appears in a text
with right-to-left base directionality (such as used for Arabic or
Hebrew) and is preceded and followed by whitespace and strong
characters does not need an embedding.
In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
sufficient to force the correct display behavior. However, the
details of the Unicode Bidirectional algorithm are not always easy to
understand. Implementers are strongly advised to err on the side of
caution and to use embedding in all cases where they are not
completely sure that the display behavior is unaffected without the
embedding.
The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
higher-level protocols to influence bidirectional rendering. Such
changes by higher-level protocols MUST NOT be used if they change the
rendering of IRIs.
The bidirectional formatting characters that may be used before or
after the IRI to ensure correct display are not themselves part of
the IRI. IRIs MUST NOT contain bidirectional formatting characters
(LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual
rendering of the IRI but do not appear themselves. It would
therefore not be possible to input an IRI with such characters
correctly.
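A validator can check this requirement with a simple scan. The sketch below is one possible approach; the function name `find_bidi_formatting` is hypothetical, and it checks only the seven characters listed above.

```python
from typing import List, Tuple

# The bidirectional formatting characters that an IRI must not contain.
BIDI_FORMATTING = {
    "\u200e": "LRM",
    "\u200f": "RLM",
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202c": "PDF",
    "\u202d": "LRO",
    "\u202e": "RLO",
}


def find_bidi_formatting(iri: str) -> List[Tuple[int, str]]:
    """Return (index, name) for each bidi formatting character found."""
    return [
        (i, BIDI_FORMATTING[ch])
        for i, ch in enumerate(iri)
        if ch in BIDI_FORMATTING
    ]


print(find_bidi_formatting("http://example.org/\u202eabc"))  # [(19, 'RLO')]
print(find_bidi_formatting("http://example.org/abc"))        # []
```

An explicit table is used because LRM and RLM are ordinary strong characters in terms of their bidirectional class, so a check based on character class alone would miss them.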
4.2. Bidi IRI Structure
The Unicode Bidirectional Algorithm is designed mainly for running
text. To make sure that it does not affect the rendering of
bidirectional IRIs too much, some restrictions on bidirectional IRIs
are necessary. These restrictions are given in terms of delimiters
(structural characters, mostly punctuation such as "@", ".", ":", and
"/") and components (usually consisting mostly of letters and
digits).
The following syntax rules from section 2.2 correspond to components
for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
isegment-nz, isegment-nz-nc, iquery, and ifragment.
Specifications that define the syntax of any of the above components
MAY divide them further and define smaller parts to be components
according to this document. As an example, the restrictions of
[RFC3490] on bidirectional domain names correspond to treating each
label of a domain name as a component for schemes with ireg-name as a
domain name. Even where the components are not defined formally, it
may be helpful to think about some syntax in terms of components and