title: "Extended Legacy Format (ELF)" subtitle: Schemas date: 8 October 2019 numbersections: true ...

Extended Legacy Format (ELF):
Schemas

{.ednote ...} This is an exploratory draft of the schema mechanism which provides an extensibility and validation framework for FHISO's proposed suite of Extended Legacy Format (ELF) standards. This document is not endorsed by the FHISO membership, and may be updated, replaced or obsoleted by other documents at any time.

Comments on this draft should be directed to the tsc-public@fhiso.org mailing list. {/}

FHISO's Extended Legacy Format (or ELF) is a hierarchical serialisation format and genealogical data model that is fully compatible with GEDCOM, but with the addition of a structured extensibility mechanism. It also clarifies some ambiguities that were present in GEDCOM and documents best current practice.

The GEDCOM file format developed by The Church of Jesus Christ of Latter-day Saints is the de facto standard for the exchange of genealogical data between applications and data providers. Its most recent version is GEDCOM 5.5.1 which was produced in 1999, but despite many technological advances since then, GEDCOM has remained unchanged.

{.note} Strictly, [GEDCOM 5.5] was the last version to be publicly released back in 1996. However a draft dated 2 October 1999 of a proposed [GEDCOM 5.5.1] was made public; it is generally considered to have the status of a standard and has been widely implemented as such.

FHISO are undertaking a program of work to produce a modernised yet backward-compatible reformulation of GEDCOM under the name ELF, the new name having been chosen to avoid confusion with any other updates or extensions to GEDCOM, or any future use of the name by The Church of Jesus Christ of Latter-day Saints. This document is one of five that form the initial suite of ELF standards, known collectively as ELF 1.0.0:

ELF: Primer. This is not a formal standard, but is being released alongside the ELF standards to provide a broad overview of ELF written in a less formal style. It gives particular emphasis to how ELF differs from GEDCOM.
ELF: Serialisation Format. This standard defines a general-purpose serialisation format based on the GEDCOM data format which encodes a dataset as a hierarchical series of lines, and provides low-level facilities such as escaping.
ELF: Schemas. This standard defines flexible extensibility and validation mechanisms on top of the serialisation layer. Although it is an optional component of ELF 1.0.0, future extensions to ELF will be defined using ELF schemas.
ELF: Date, Age and Time Microformats. This standard defines microformats for representing dates, ages and times in arbitrary calendars, together with how they are applied to the Gregorian, Julian, French Republican and Hebrew calendars.
ELF: Data Model. This standard defines a data model based on the lineage-linked GEDCOM form, reformulated to be usable with the ELF serialisation model and schemas. It is not a major update to the GEDCOM data model, but rather a basis for future extension and revision.

Conventions used

Where this standard gives a specific technical meaning to a word or phrase, that word or phrase is formatted in bold text in its initial definition, and in italics when used elsewhere. The key words must, must not, required, shall, shall not, should, should not, recommended, not recommended, may and optional in this standard are to be interpreted as described in [RFC 2119].

An application is conformant with this standard if and only if it obeys all the requirements and prohibitions contained in this document, as indicated by use of the words must, must not, required, shall and shall not, and the relevant parts of its normative references. Standards referencing this standard must not loosen any of the requirements and prohibitions made by this standard, nor place additional requirements or prohibitions on the constructs defined herein.

{.note} Derived standards are not allowed to add or remove requirements or prohibitions on the facilities defined herein so as to preserve interoperability between applications. Data generated by one conformant application must always be acceptable to another conformant application, regardless of what additional standards each may conform to.

This standard depends on FHISO's Basic Concepts for Genealogical Standards standard. To be conformant with this standard, an application must also be conformant with the referenced parts of [Basic Concepts]. Concepts defined in that standard are used here without further definition.

{.note} In particular, the precise meanings of string, character, whitespace, whitespace normalisation and term are given in [Basic Concepts].

Certain facilities in this standard are described as deprecated, which is a warning that they are likely to be removed from a future version of this standard. This has no bearing on whether a conformant application must implement the facility: they may be required, recommended or optional as described in this standard.

Indented text in grey or coloured boxes does not form a normative part of this standard, and is labelled as either an example or a note.

{.ednote} Editorial notes, such as this, are used to record outstanding issues, or points where there is not yet consensus; they will be resolved and removed for the final standard. Examples and notes will be retained in the standard.

The grammar given here uses the form of EBNF notation defined in §6 of [XML], except that no significance is attached to the capitalisation of grammar symbols. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming syntax may be accepted and processed by a conforming application in an implementation-defined manner, providing a warning is issued to the user, except where this standard says otherwise.

{.note} In this form of EBNF, whitespace is only permitted where it is explicitly stated in the grammar. It is not automatically permitted between arbitrary tokens in the grammar.

The grammar productions in this standard use the S and Char productions defined in §2 of [Basic Concepts] to match any non-empty sequence of whitespace characters or any valid character, respectively.

This standard uses the prefix notation, as defined in §4.3 of [Basic Concepts], when discussing specific terms. The following prefix bindings are assumed in this standard:

elf https://terms.fhiso.org/elf/

ex https://example.com/

{.note} Although prefix notation is included in this standard document (see {§prefix}), that is only in the context of serialised data. When used outside of a serialised example, prefix notation is simply a notational convenience to make the standard easier to read.

{.ednote} Review the previous note.

Overview {#overview}

The ELF serialisation format is a structured, line-based text format for encoding data in a hierarchical manner that is both machine-readable and human-readable.

At a logical level, an ELF document is built from structures, the name ELF gives to the basic hierarchical data structures used to represent data. Each structure consists of:

an optional structure identifier, which, if present, is a string used to uniquely identify the structure within the document;
a type identifier, which is a term that encodes the meaning of the structure;
an optional payload, which is either a string or a pointer to another structure; and
a sequence of zero or more child structures known as its substructures.

{.note} GEDCOM includes a means for splitting a logical document into multiple physical documents, sometimes called volumes. This dates to an era when documents were commonly stored and shipped on floppy disks, and a large GEDCOM document might exceed the storage capacity of a single disk. This functionality is no longer necessary and is not widely implemented in present applications. This functionality is not included in ELF.

A top-level structure which is not a substructure of any other structure is called a record. An ELF document or dataset can have arbitrarily many records.

{.ednote} This is either not strictly true or at least misleading, because HEAD and TRLR are not records. Probably.

{.note} The expressiveness of ELF is similar to that of XML. ELF's structures serve the same role as elements in XML, and nest similarly. But unlike XML, which has a single root-level element, an ELF dataset typically has multiple records.

The ELF serialisation format is a general purpose format that can be used to represent arbitrary data, depending on the type identifiers used in the dataset. A particular set of type identifiers, together with their meanings and restrictions on how they are to be used, is called a data model.

{.note} The ELF serialisation format is designed to be useable with various data models; however it is anticipated that most files using the ELF serialisation format will use the data model described in [ELF Data Model], which is based on and compatible with GEDCOM's lineage-linked form.

At a lexical level, a structure is encoded as a sequence of lines, each terminated with a line break. The first line encodes the type identifier and payload of the structure, while any substructures are encoded in order on subsequent lines. Each line consists of the following components, in order:

a level, which is a non-negative decimal integer that records how many levels of substructures deep the current structure is nested;
an optional structure identifier, which is an identifier written between two "at" signs (U+0040) that can be referenced by a pointer in the payload of another structure;
a tag, which encodes the type identifier of the structure; and
the optional payload of the structure encoded by the line.

{.example ...}

    0 HEAD
    1 GEDC
    2 VERS 5.5.1
    2 ELF 1.0.0
    2 FORM LINEAGE-LINKED
    1 CHAR UTF-8
    0 INDI
    1 NAME Charlemagne
    0 TRLR

This ELF document has three lines with level 0 which mark the start of the three top-level structures or records. These records have, respectively, two, one and zero substructures, which are denoted by the lines with level 1. The structure represented by the line with a CHAR tag is a substructure of the HEAD record because there is no intervening line with level one less than 1 (i.e. 0); the structure represented by the NAME line naming Charlemagne is a substructure of the INDI record, as that is the preceding line with a level 0. The TRLR record is an example of a record with no substructures.

Five of the lines in this example document have a payload. For example, the payload of the FORM line is the string "LINEAGE-LINKED", while the payload of the NAME line is the string "Charlemagne". None of the lines in this example have payloads which are pointers, nor do any have a structure identifier. {/}
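{.example ...} As a rough illustration of the four line components listed above, the following Python sketch splits a single line string into its level, optional structure identifier, tag and optional payload. The names `Line`, `LINE_RE` and `parse_line` are hypothetical, and the tag pattern `[0-9A-Za-z_]+` is an assumption made for this sketch only; the normative grammar for lines is not reproduced in this section.

```python
import re
from typing import NamedTuple, Optional


class Line(NamedTuple):
    level: int
    xref_id: Optional[str]   # structure identifier, including both "@" signs
    tag: str
    payload: Optional[str]


# Assumed shape: level, optional @ID@, tag, optional payload, with the
# components separated by spaces or tabs.
LINE_RE = re.compile(
    r"^(?P<level>[0-9]+)[ \t]+"
    r"(?:(?P<xref_id>@[^@]+@)[ \t]+)?"
    r"(?P<tag>[0-9A-Za-z_]+)"
    r"(?:[ \t](?P<payload>.*))?$"
)


def parse_line(text: str) -> Line:
    """Split one line string (without its line break) into its components."""
    m = LINE_RE.match(text)
    if m is None:
        raise ValueError(f"not a well-formed line: {text!r}")
    return Line(int(m["level"]), m["xref_id"], m["tag"], m["payload"])
```

Calling `parse_line("2 VERS 5.5.1")` yields level 2, tag `VERS` and payload `5.5.1`, with no structure identifier. {/}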

ELF applications {#applications}

A conformant application which parses the ELF serialisation format is called an ELF parser. A conformant application which outputs data in the ELF serialisation format is called an ELF writer.

{.note} Many applications will be both ELF parsers and ELF writers.

The input to an ELF parser and output of an ELF writer is an octet stream, which is a sequence of 8-bit bytes or octets each with a value between 0 and 255.

{.note} An octet stream is typically read from or written to a disk or the network. This standard does not define how these should be read, nor how the octets are represented in storage or in transit on a network.

This standard defines how an octet stream is parsed into a dataset, and how a dataset is serialised into an octet stream. Overviews of these processes can be found in {§parsing} and {§serialising}, respectively. An octet stream which this standard requires an ELF parser to be able to read is called a conformant source.

{.note} An octet stream which an ELF parser must be able to read successfully, but can process in an implementation-defined manner is nonetheless a conformant source.

An octet stream which is not a conformant source is called a non-conformant source. If the input to an ELF parser is not a conformant source, unless this standard says otherwise, the application must either terminate processing that octet stream or present a warning or error message to the user. If it continues processing, it does so in an implementation-defined manner.

This standard also recognises a class of application which reads data in the ELF serialisation format, applies a small number of changes to that data, and immediately produces output in the ELF serialisation format which is identical to the input, octet for octet, other than where the requested changes have been made. Such an application is called an ELF editor.

{.note} ELF editors are intended to be small programs or scripts that apply simple modifications to datasets, typically with little or no human interaction. For example, a script which replaces some particular deprecated feature in the dataset with an equivalent would be an ELF editor. This definition of an ELF editor is not intended to include large, feature-rich applications which read ELF into an internal database, allow users to view and modify most aspects of the data, and later export it as ELF.

ELF editors are not required to conform to the full requirements of an ELF parser or ELF writer. The only requirement this standard places on ELF editors is that, when acting on a conformant source, they must either generate output which is a conformant source, or present a warning or error message to the user, or terminate.

{.note} This is a considerably weaker requirement than that placed on ELF parsers and ELF writers. In particular, there is no requirement for an ELF editor to detect invalid input, as an ELF parser is generally required to; nor do the stricter requirements on the output allowed from ELF writers apply. These relaxations allow ELF editors to do in-place editing of the octet stream, without fully parsing those parts of their input which are not going to be changed.

Parsing {#parsing}

The parsing process can be summarised as follows:

An octet stream is converted to a sequence of line strings by

a. determining its character encoding by

 i.  detecting a *character encoding* per {§detected-enc}, and 
 ii. using that *detected character encoding* to look for a
     *specified character encoding* in the *serialisation
     metadata* per {§specified-enc};

b. converting octets to characters using that character encoding; and

c. splitting on line breaks per {§line-strings}.

Line strings are parsed as lines by parsing the level, tag, xref_id, and payload of each line.
Lines are parsed into xref structures by
- re-merging CONC and CONT-split payloads; violations of splitting rules are ignored
- using levels to properly nest xref structures
xref structures are parsed into tagged structures by simultaneously
- converting xrefs to pointers, with a special "point to null" if this fails
- unescaping @ characters
- preserving valid escapes and removing others
- converting unicode escapes into their represented characters.
The tagged structures that represent the schema are parsed.
Tags are converted into structure type identifiers using the schema, and the resulting structures are placed in the metadata or document as appropriate. Tags with no corresponding structure type identifier are converted into appropriate undefined tag identifiers.
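{.example ...} The step of using levels to nest structures can be sketched in Python as follows. This is only an illustration under simplifying assumptions: lines are given as already-parsed `(level, tag, payload)` triples, CONC and CONT merging and xref handling are omitted, and the names `Node` and `nest` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    tag: str
    payload: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def nest(lines: List[Tuple[int, str, Optional[str]]]) -> List[Node]:
    """Turn (level, tag, payload) triples into a list of records."""
    records: List[Node] = []
    stack: List[Node] = []        # stack[i] is the open structure at level i
    for level, tag, payload in lines:
        if level > len(stack):
            raise ValueError(f"level {level} skips a nesting level")
        node = Node(tag, payload)
        del stack[level:]         # close structures at this level or deeper
        (stack[-1].children if stack else records).append(node)
        stack.append(node)
    return records
```

A line at level 0 empties the stack and so starts a new record; any other line becomes a substructure of the most recent line whose level is one less. {/}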

Serialisation {#serialising}

The semantics of serialisation are defined by the following procedural outline.

Each structure is assigned a tag based on its structure type identifier, superstructure type identifier, and a schema which MAY be augmented during serialisation to allow all structures to have a tag.
The tagged structures are ordered and additional tagged structures created to represent serialisation metadata.

This step cannot happen before tagging because tagging may generate serialisation metadata that needs to be included in the tagged structures.
Payloads are converted to create xref structures by simultaneously
- assigning xref_ids and replacing pointer-valued payloads with string-valued xrefs
- escaping @ characters
- preserving valid escapes
- escaping unrepresentable characters
Semantically, these actions must happen concurrently because none of them should be applied to the others' results.

This step cannot happen before tagging because tags are needed to determine the set of valid escapes. This step cannot happen before adding serialisation metadata because it is applied to the serialisation metadata as well.
The dataset is converted to a sequence of lines by
- assigning levels
- splitting payloads, if needed, using CONT and CONC
- ordering substructures in a preorder traversal of the tagged structures
This step cannot happen before payload conversion because valid split points are dependent on proper escaping. This step must happen before encoding as octets because valid split points are determined by character, not octet.
The sequence of lines is converted to an octet stream by
- concatenating the lines with line-break terminators
- converting strings to octets using the character encoding
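{.example ...} The last two steps of this outline (assigning levels in a preorder traversal, then joining and encoding the lines) might look roughly like the Python sketch below. It assumes tagging and payload conversion have already happened, leaves out xref_ids and CONT/CONC splitting, and models a structure as a hypothetical `(tag, payload, substructures)` tuple; `to_lines` and `to_octets` are illustrative names.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical shape for an already-escaped structure.
Struct = Tuple[str, Optional[str], list]


def to_lines(structs: List[Struct], level: int = 0) -> Iterator[str]:
    """Yield one line per structure in preorder, using depth as the level."""
    for tag, payload, subs in structs:
        yield f"{level} {tag}" if payload is None else f"{level} {tag} {payload}"
        yield from to_lines(subs, level + 1)


def to_octets(structs: List[Struct], encoding: str = "utf-8",
              line_break: str = "\n") -> bytes:
    """Terminate every line with the same line break, then encode."""
    text = "".join(line + line_break for line in to_lines(structs))
    return text.encode(encoding)
```

Encoding happens last, so any decision that depends on characters rather than octets has already been made by the time `encode` is called. {/}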

Constructs

This document uses five externally-visible constructs: dataset, metadata, document, structure, and octet stream. For clarity of presentation, it also uses several intermediate constructs internally: line, xref structure, and tagged structure. Each is defined in {§glossary}.

Glossary {#glossary}

{.ednote} Line, octet, octet stream, record and structure are now defined in {§overview}, while character encoding is defined in {§parsing-linestrs}, and line break is defined in {§line-strings}. The notion of a delimiter is being removed. Dataset and document are very nearly defined in {§overview} too, but we don't currently discuss metadata there — this is an issue which needs resolving.

Character encoding : The scheme used to map between an octet stream and a string of characters.

Dataset : Metadata and a document.

Delimiter : A sequence of one or more space or tabulation characters.

    Delim ::= [#x20#x9]+

During serialisation, a single space (U+0020) SHOULD be used
each place a *delimiter* is expected.

Document : An unordered set of structures.

Line :

1. A level, a non-negative integer
2. An optional xref_id
3. A tag, a string matching production Tag
4. An optional payload, which is a string containing any number of characters, but which must not contain a line-break.

Line break : A sequence of one or more newline and/or carriage return characters.

    LB ::= [#xA#xD]+

During serialisation, each *line break* MUST be one of 

- a single newline (U+000A)
- a single carriage return (U+000D)
- a single carriage return followed by a single newline (U+000D U+000A)

The same string SHOULD be used each place a *line break* is expected.
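{.example ...} A minimal Python sketch of this definition follows. It treats any run of newline and carriage return characters as one line break when reading, and lists the three forms permitted when writing. The names `LB`, `split_line_strings` and `WRITER_LINE_BREAKS` are hypothetical, and dropping empty strings at the ends of the input is a choice made for this sketch.

```python
import re

LB = re.compile(r"[\n\r]+")                  # mirrors the LB production

WRITER_LINE_BREAKS = ("\n", "\r", "\r\n")    # forms permitted on output


def split_line_strings(text: str) -> list:
    """Split text on line breaks, discarding empty leading/trailing pieces."""
    return [piece for piece in LB.split(text) if piece]
```

Because a line break is one or more such characters, a blank line between two lines is absorbed into a single line break rather than producing an empty line string. {/}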

Metadata : A collection of structures intended to describe information about the dataset as a whole.

The relative order of *structures* with the same *structure type identifier* SHALL be preserved within this collection;
the relative order of *structures* with distinct *structure type identifiers* is not defined by this specification.

Octet : One of 256 values, often represented as the numbers 0 through 255. Also called a "byte."

Octet Stream : A sequence of octets.

Record : A structure, tagged structure, or xref structure whose superstructure is the document.

ELF Schema : Information needed to correctly parse tagged structures into structures: a mapping between structure type identifiers and tags and metadata relating to valid escapes and prefixes.

Serialisation Metadata : Tagged structures inserted during serialisation and removed (with all their substructures) during parsing. They are used to serialise the character encoding and ELF schema as well as to separate the metadata and the document.

Structure :

- A structure type identifier, which is a term.
- Optionally, a payload which is one of
    - A pointer to another structure, which must be a record within the same dataset.
    - A string or subtype thereof.
- One superstructure, which is one of
    - Another structure; superstructure links MUST be acyclic.
    - The document.
    - The metadata.
- A collection of any number of substructures, which are structures.

    The relative order of *structures* with the same *structure type identifier* SHALL be preserved within this collection;
    the relative order of *structures* with distinct *structure type identifiers* is not defined by this specification.

Superstructure type identifier : A term identifying the type of the superstructure of a structure. If the superstructure is the document, this is elf:Document. If the superstructure is the metadata, this is elf:Metadata. Otherwise, this is the structure type identifier of the structure's superstructure.

{.note} Superstructure type identifier is not transitive, applying only to the immediate superstructure.

{.example ...} Suppose an elf:INDIVIDUAL_RECORD is the superstructure of an elf:GRADUATION and the elf:GRADUATION is the superstructure of an elf:AGE_AT_EVENT. The superstructure type identifier of the elf:AGE_AT_EVENT is elf:GRADUATION, not elf:INDIVIDUAL_RECORD. {/}
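{.example ...} The definition can be expressed as a small Python function. The `Structure` class and the use of plain strings to stand for the document and the metadata are assumptions of this sketch, not part of the data model.

```python
from dataclasses import dataclass
from typing import Union

DOCUMENT = "elf:Document"
METADATA = "elf:Metadata"


@dataclass
class Structure:
    type_id: str
    # Either another Structure, or one of the two container markers above.
    superstructure: Union["Structure", str]


def superstructure_type_id(s: Structure) -> str:
    """Look only at the immediate superstructure; never walk further up."""
    parent = s.superstructure
    return parent if isinstance(parent, str) else parent.type_id
```

For a chain of three structures, the function applied to the innermost one returns the type identifier of the middle one, which shows that the relation is not transitive. {/}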

Tagged Structure : Like a structure, except

- it has a *tag* instead of a *structure type identifier*.
- its *substructures* are stored in a sequence with defined order, not in a partially-ordered collection.

Undefined tag identifier : A term containing a single # (U+0023) with elf:Undefined before it and a string matching production Tag after it. The string after the U+0023 is called the tag of the undefined tag identifier.

Xref Structure : Like a tagged structure, except

- it may have an optional *xref_id*.
- its payload, if present, is always a *string*, not a *pointer*.

Encoding with `@`

ELF uses the character U+0040 (commercial at, @) to encode several special cases when encoding a tagged structure as an xref structure. In particular,

pointers are encoded by assigning an xref_id to the pointed-to tagged structure and using it as an xref in the pointing payload
characters outside the character encoding are encoded as unicode escapes
escapes that are not preserved escapes are removed
@ that are not part of escapes are encoded as @@

All of these steps involve @s, and MUST NOT be applied to one another's @s; semantically, they are applied concurrently.

During parsing, there is an inherent ambiguity when there are several contiguous @ in the payload. These SHALL be resolved in an earliest-match-first order.
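{.example ...} One way to picture earliest-match-first resolution is as a left-to-right tokeniser that, at each position, tries an escape first, then a pair of U+0040, then a lone U+0040, and finally a run of other characters. The Python sketch below follows that reading; `TOKEN` and `decompose` are hypothetical names, and the escape pattern is transcribed from the Escape production of this standard.

```python
import re

TOKEN = re.compile(
    r"@#[A-Z][^\n\r@]*@ "   # an escape: "@#", a type letter, text, "@ "
    r"|@@"                  # an encoded literal @
    r"|@"                   # a lone @
    r"|[^@]+"               # a run of characters other than @
)


def decompose(payload: str) -> list:
    """Split a payload into tokens, always taking the earliest match."""
    return TOKEN.findall(payload)
```

Since every character is either U+0040 or not, one of the alternatives always matches, so the tokens concatenate back to the original payload. {/}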

{.example ...} The following xref structure's payloads are split into sequences as indicated:

payload of xref structure    decomposed as

"name@example.com"           "name", "@", "example.com"
"name@@example.com"          "name", "@@", "example.com"
"name@@@example.com"         "name", "@@", "@", "example.com"
"name@@@@example.com"        "name", "@@", "@@", "example.com"
"some@#XYZ@ thing"           "some", "@#XYZ@ ", "thing"
"some@@#XYZ@ thing"          "some", "@@", "#XYZ", "@", " thing"
"some@@@#XYZ@ thing"         "some", "@@", "@#XYZ@ ", "thing"

{/}

Pointer conversion

If a tagged structure is pointed to by the pointer-valued payload of another tagged structure, the pointed-to tagged structure's corresponding xref structure SHALL be given an xref_id, a string matching production XrefID.

XrefID  ::= "@" ID "@"
ID      ::= [0-9A-Z_a-z] [#x20-#x3F#x41-#x7E]*

It MUST NOT be the case that two different xref structures be given the same xref_id. Conformant implementations MUST NOT attach semantic importance to the contents of an xref_id.

It is RECOMMENDED that an xref_id be no more than 22 characters (20 characters plus the leading and trailing U+0040).

{.note} [GEDCOM 5.5.1] REQUIRED that xref_id be no more than 22 characters. ELF weakens this to a RECOMMENDATION.
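{.example ...} The XrefID production and the length recommendation translate directly into a regular expression and a length check. The following Python sketch is illustrative only; `XREF_ID`, `is_xref_id` and `within_recommended_length` are hypothetical names.

```python
import re

# "@", one of [0-9A-Z_a-z], then any of U+0020..U+007E except "@", then "@".
XREF_ID = re.compile(r"@[0-9A-Z_a-z][\x20-\x3F\x41-\x7E]*@")


def is_xref_id(s: str) -> bool:
    """True if the whole string matches production XrefID."""
    return XREF_ID.fullmatch(s) is not None


def within_recommended_length(s: str) -> bool:
    """True if the xref_id, including both "@" signs, is at most 22 characters."""
    return len(s) <= 22
```

An application could reject strings failing `is_xref_id` but merely warn about those failing `within_recommended_length`, since the latter is a recommendation only. {/}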

Each record SHOULD be given an xref_id; each non-record structure SHOULD NOT; and each serialisation metadata tagged structure MUST NOT be given an xref_id.

{.ednote} Since a pointed-to structure SHALL have an xref_id and a non-record SHOULD NOT, implicitly a structure SHOULD NOT point to a non-record. We should probably either make that explicit or remove it; the latter may make more sense as what is pointed to seems to be more a data model decision than a serialisation decision. However, GEDCOM is fairly clear that pointers to non-records might in the future be enabled with a non-standard xref_id syntax.

The xref structure that corresponds to a tagged structure with a pointer-valued payload has, as its payload, an xref: a string identical to the xref_id of the xref structure corresponding to the pointed-to tagged structure.

When parsing, if xref payloads are encountered that do not correspond to exactly one xref structure's xref_id, that payload SHALL be converted to a pointer to a record with tag "UNDEF", which SHALL NOT have a payload nor substructures. It is recommended that one such "UNDEF" tagged structure be inserted for each distinct xref.

{.note} The undefined pointer rule is designed to minimize the information loss in the event of a bad serialised input.

{.note} This rule does not handle pointer-to-wrong-type; the information needed to determine that is not known by the serialisation layer and thus must be handled by the data model instead.

{.ednote} We could also allow pointer-to-nothing or pointer-to-multiple-things to be dropped from the dataset, and/or provide disambiguation heuristics for pointer-to-multiple-things situations. This draft does not do so as it is not obvious that the benefit is worth the complexity.
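{.example ...} The undefined pointer rule might be implemented along the lines of the Python sketch below. The `Xref` class, the `resolve` function and the dictionary used to share one "UNDEF" record per distinct xref are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Xref:                       # hypothetical, simplified xref structure
    tag: str
    xref_id: Optional[str] = None
    payload: Optional[str] = None


def resolve(records: List[Xref], xref: str, undef: Dict[str, Xref]) -> Xref:
    """Return the unique structure with this xref_id, else an UNDEF record."""
    matches = [r for r in records if r.xref_id == xref]
    if len(matches) == 1:
        return matches[0]
    # Zero or several matches: reuse one "UNDEF" record per distinct xref,
    # created without payload or substructures.
    return undef.setdefault(xref, Xref("UNDEF"))
```

Keeping the `undef` dictionary for the whole parse means that two payloads with the same unresolvable xref end up pointing at the same "UNDEF" record. {/}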

Escape preservation and removal

An escape is a substring of a string-valued payload of either a tagged structure or xref structure which matches production Escape. Its escape type is the portion of the escape that was matched by EscType.

Escape   ::= "@#" EscType EscText "@ "
EscType  ::= [A-Z]
EscText  ::= [^#xA#xD#x40]*

If the escape type is U (U+0055), the escape is a unicode escape and its handling is discussed in {§unicode-escape}; otherwise, it is handled according to this section.

Serialisation

If an escape is in the payload of a tagged structure whose tag is an escape preserving tag, and if the escape's escape type is in the tag's set of preserved escape types, then the escape SHALL be preserved unmodified in the corresponding xref structure's payload.

{.example} If a "DATE" tagged structure has payload "ABT @#DJULIAN@ 1540", its corresponding xref structure's payload is also "ABT @#DJULIAN@ 1540".

Otherwise, a modification of the escape SHALL be placed in the xref structure's payload which is identical to the original escape except that each of the two @ SHALL be replaced with a pair of consecutive U+0040 @.

{.example} If a "NOTE" tagged structure has payload "ABT @#DJULIAN@ 1540", its corresponding xref structure's payload is "ABT @@#DJULIAN@@ 1540".

Parsing

If an escape is in the payload of an xref structure whose tag is an escape preserving tag, and the escape's escape type is in the tag's set of preserved escape types, the escape SHALL be preserved unmodified in the corresponding tagged structure's payload.

{.example} If a "DATE" xref structure has payload "ABT @#DJULIAN@ 1540", its corresponding tagged structure's payload is also "ABT @#DJULIAN@ 1540".

Otherwise, the escape SHALL be omitted from the corresponding tagged structure's payload.

{.example} If a "NOTE" xref structure has payload "ABT @#DJULIAN@ 1540", its corresponding tagged structure's payload is "ABT 1540".

{.note} The decision to remove most escapes is motivated in part because [GEDCOM 5.5.1] does not provide any meaning for an escape other than a date escape. This caused some ambiguity in how such escapes were handled, which ELF seeks to remove. Lacking a semantics to assign these escapes, ELF chooses to simply remove them. Implementations that had assigned semantics to them were actually imposing non-standard semantics to those payloads which are more accurately handled by using an alternative ELF schema to map those tags to different structure type identifiers with those semantics documented.
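{.example ...} The serialisation and parsing rules of this section can be sketched in Python as a pair of substitutions. The sketch assumes the caller already knows the set of preserved escape types for the tag in question (passed as `preserved`), excludes unicode escapes by leaving type U out of the pattern, and ignores the interaction with adjacent pairs of U+0040; the function names are hypothetical.

```python
import re

# Escapes of every type except U, which is handled separately.
ESCAPE = re.compile(r"@#([A-TV-Z])[^\n\r@]*@ ")


def serialise_escapes(payload: str, preserved: set) -> str:
    """Keep preserved escapes; double both @ of every other escape."""
    def repl(m):
        if m.group(1) in preserved:
            return m.group(0)
        return m.group(0).replace("@", "@@")
    return ESCAPE.sub(repl, payload)


def parse_escapes(payload: str, preserved: set) -> str:
    """Keep preserved escapes; omit every other escape."""
    return ESCAPE.sub(
        lambda m: m.group(0) if m.group(1) in preserved else "", payload)
```

With `preserved={"D"}` both functions leave `"ABT @#DJULIAN@ 1540"` unchanged; with an empty set, the first doubles the two U+0040 and the second drops the escape together with its trailing space. {/}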

Unicode escapes {#unicode-escape}

{.note} [GEDCOM 5.5.1] neither has a notion of unicode escape nor any other feature for achieving the same end. Unicode escapes are designed to provide a means for encoding any character in any character encoding in a way that is maximally backwards-compatible with [GEDCOM 5.5.1].

Any character MAY be represented with a unicode escape consisting of:

The three characters U+0040, U+0023, and U+0055 (i.e., "@#U")
A hexadecimal encoding of the character's code point
The two characters U+0040 and U+0020 (i.e., "@ ")

{.ednote} Can we make the final space optional?

A unicode escape MUST be used for each character that cannot be encoded in the target character encoding; and SHOULD NOT be used otherwise.

{.ednote} Earlier drafts of this specification suggested using @#U20@ in place of U+0020 when a line's payload begins or ends with a space. Given the inherent ambiguity in the handling of delimiters at the ends of a line's payloads, it is not clear if that idea was better than simply clarifying that ambiguity.

{.example} If a tagged structure's payload is "João" and the character encoding is ASCII, the xref structure's payload is "Jo@#UE3@ o" (or "Joa@#U303@ o" if the original used a combining diacritic).
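{.example ...} A Python sketch of unicode escaping follows. It replaces each character that the target encoding cannot represent with an escape built from its code point, and performs the reverse substitution when parsing. Writing the code point in upper-case hexadecimal without leading zeros is an assumption of this sketch, as are the function names; Python codec names such as `"ascii"` stand in for the character encoding.

```python
import re


def escape_unrepresentable(payload: str, encoding: str) -> str:
    """Replace characters the encoding cannot represent with @#U...@ escapes."""
    out = []
    for ch in payload:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(f"@#U{ord(ch):X}@ ")
    return "".join(out)


UNICODE_ESCAPE = re.compile(r"@#U([0-9A-Fa-f]+)@ ")


def unescape_unicode(payload: str) -> str:
    """Replace each unicode escape with the character it represents."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), payload)
```

Characters that the encoding can represent are passed through untouched, in line with the rule that unicode escapes are not to be used where they are unnecessary. A code point beyond U+10FFFF would make `chr` raise `ValueError`, which this sketch does not handle. {/}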

{.ednote} Unicode escapes use GEDCOM's general escape sequence syntax, in which the character after the @# prefix denotes the type of escape. In GEDCOM 5.5, the only escape type is D, used for calendars, though other escapes have been used in older versions of GEDCOM. In particular, A was used for auxiliary file references, C was used to switch character set, F was used for file inclusion, and L was used to specify the length of a block of non-GEDCOM data included immediately after the escape. FHISO are unlikely to reuse these escapes, unless for a compatible purpose.

Encoding `@`s

{.ednote ...} It might be worthwhile to restrict this entire section to non-escape preserving tags; without that we have a (somewhat obscure) problem with the current system:

Consider the escape-preserving tag DATE. A serialisation/parsing sequence applied to the string "@@#Dx@@ yz" yields

encoded "@@#Dx@@ yz"
decoded "@#Dx@ yz"
encoded "@#Dx@ yz" -- not with @@ because it matches a date escape {/}

During serialisation, each U+0040 (@) that is not part of an escape SHALL be encoded as two consecutive U+0040 (@@).

{.example} The tagged structure payload "name@example.com" is serialised as the xref structure payload "name@@example.com"

During parsing, each consecutive pair of U+0040 (@@) SHALL be parsed as a single U+0040 (@).

{.example} The xref structure payload "name@@example.com" is parsed as the tagged structure payload "name@example.com"

During parsing, a lone U+0040 is left unmodified.

{.example} If an xref structure's payload is "name@example.com", it is parsed as the tagged structure payload "name@example.com"; that in turn will be re-serialised as "name@@example.com".
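{.example ...} The two rules of this section can be sketched in Python as below. The sketch treats every substring matching production Escape as an escape to be left alone when serialising, which is a simplification; `encode_ats` and `decode_ats` are hypothetical names.

```python
import re

# Capturing group so that re.split keeps the escapes in the result.
ESCAPE = re.compile(r"(@#[A-Z][^\n\r@]*@ )")


def encode_ats(payload: str) -> str:
    """Double every @ that is not part of an escape."""
    parts = ESCAPE.split(payload)            # odd indices hold the escapes
    return "".join(part if i % 2 else part.replace("@", "@@")
                   for i, part in enumerate(parts))


def decode_ats(payload: str) -> str:
    """Turn each @@ pair into a single @, left to right; keep a lone @."""
    return payload.replace("@@", "@")
```

`str.replace` scans from the left without overlapping, so a run of three U+0040 decodes to two, matching the earliest-match-first rule. {/}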

Serialisation metadata

The tagged structures representing the dataset are ordered as follows:

A serialisation metadata tagged structure with tag "HEAD" and the following substructures:
- A serialisation metadata tagged structure with tag "CHAR" and payload identifying the character encoding used; see {§encoding} for details.
- A serialisation metadata tagged structure with tag "SCHMA" and no payload, with substructures encoding the ELF Schema; see {§schema} for details.
- Each tagged structure with the superstructure type identifier elf:Metadata, in an order consistent with the partial order of structures present in the metadata.
Each tagged structure with the superstructure type identifier elf:Document, in arbitrary order.
A serialisation metadata tagged structure with tag "TRLR" and no payload or substructures.
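{.example ...} The ordering above could be assembled as in the following Python sketch. Structures are modelled as hypothetical `(tag, payload, substructures)` tuples, the metadata list is assumed to be in an acceptable relative order already, and `order_dataset` is an illustrative name.

```python
from typing import List, Optional, Tuple

Struct = Tuple[str, Optional[str], list]   # (tag, payload, substructures)


def order_dataset(metadata: List[Struct], document: List[Struct],
                  char_name: str, schema_subs: List[Struct]) -> List[Struct]:
    """Wrap metadata in HEAD, follow with the document, and end with TRLR."""
    head_subs: List[Struct] = [
        ("CHAR", char_name, []),           # character encoding name
        ("SCHMA", None, schema_subs),      # ELF Schema, no payload
    ]
    head_subs += metadata                  # assumed already suitably ordered
    return [("HEAD", None, head_subs), *document, ("TRLR", None, [])]
```

The document records are passed through in the order given, which is permitted because their order is arbitrary. {/}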

Character encoding names {#encoding}

The character encoding SHALL be serialised in the "CHAR" tagged structure's payload using the encoding name given in the following table:

Encoding Description

ASCII The US version of ASCII defined in [ASCII].

ANSEL The extended Latin character set for bibliographic use defined in [ANSEL].

UNICODE Either the UTF-16LE or the UTF-16BE encodings of Unicode defined in [ISO 10646].

UTF-8 The UTF-8 encodings of Unicode defined in [ISO 10646].

{.note} This value is read as the specified character encoding per {§specified-enc}.

It is REQUIRED that the encoding used be able to represent all code points within the string; unicode escapes (see {§unicode-escape}) allow this to be achieved for any supported encoding. It is RECOMMENDED that UTF-8 be used for all datasets.
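One way an application might test whether a string can be represented directly in the chosen encoding is sketched below. The mapping from "CHAR" payloads to Python codec names is an assumption of this sketch and is not defined by this specification; ANSEL is omitted because the Python standard library provides no ANSEL codec, and UNICODE is mapped to UTF-16LE only for brevity.

```python
# Hypothetical mapping from CHAR payloads to Python codec names.
CHAR_TO_CODEC = {
    "ASCII": "ascii",
    "UNICODE": "utf-16-le",
    "UTF-8": "utf-8",
}


def representable(text: str, char_payload: str) -> bool:
    """Return True if every code point of text encodes directly."""
    try:
        text.encode(CHAR_TO_CODEC[char_payload])
    except UnicodeEncodeError:
        return False
    return True


needs_escapes = not representable("Zoë", "ASCII")
```

A string for which `representable` returns false would need unicode escapes (see {§unicode-escape}) before it could be serialised in that encoding.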

ELF Schema {#schema}

The ELF Schema is a serialisation metadata tagged structure with tag "SCHMA" and no payload; it may contain as substructures any number of external schema structures, prefix abbreviation structures, IRI definition structures, and escape preservation structures.

If, during parsing, no ELF Schema is found, the default ELF schema defined in {§default-schema} SHALL be used.

{.ednote} Do we need to make the default dependent on the GEDC metadata?

If multiple ELF Schemas are found, they SHALL be treated as if all of their substructures were part of the same ELF schema.

During serialisation exactly one ELF Schema SHOULD be included.

External schema structure

An external schema structure is a tagged structure with an ELF Schema as its superstructure, tag SCHMA, no substructures, and an IRI as its payload. The IRI SHOULD use the http or https scheme and an HTTP GET request sent to it with an Accept header of application/x-fhiso-elf1-schema SHOULD return a dataset serialised in accordance with this specification containing an ELF Schema defining the full data model in structure type descriptions.

{.ednote} Is application/x-fhiso-elf1-schema a MIME-type we are happy with?

{.example ...} When using the [ELF Data Model] version 1.0.0, the serialisation schema could be serialised as

0 HEAD
1 SCHMA
2 SCHMA https://fhiso.org/TR/elf-data-model/v1.0.0

{/}

{.example} An HTTP GET request with an Accept header of application/x-fhiso-elf1-schema sent to https://fhiso.org/TR/elf-data-model/v1.0.0 will return the contents of {§default-schema} or the equivalent.

When retrieving a serialised dataset via an HTTP GET request to the IRI of an external schema structure, all contents of that dataset except ELF Schemas SHALL be ignored. Additional external schema structures SHOULD NOT be present within that ELF Schema and if they are, they MAY be ignored.

{.note} The recommendation against external schema structures inside other external schema structures is designed to simplify parsing.
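The retrieval request described in this section might be built as in the following sketch, which uses Python's standard `urllib.request` module. The helper name `build_schema_request` is illustrative, and the media type is the one currently proposed in this draft. The request is constructed but not sent, and the sketch assumes the IRI is already a valid URI; an IRI containing non-ASCII characters would need converting first.

```python
import urllib.request

SCHEMA_MEDIA_TYPE = "application/x-fhiso-elf1-schema"


def build_schema_request(iri: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for an external schema."""
    # A Request without a data argument uses the GET method.
    return urllib.request.Request(iri, headers={"Accept": SCHEMA_MEDIA_TYPE})


request = build_schema_request("https://fhiso.org/TR/elf-data-model/v1.0.0")
```

Passing `request` to `urllib.request.urlopen` would perform the retrieval; everything in the returned dataset other than its ELF Schemas would then be ignored.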

Prefix abbreviation structure {#prefix}

{.ednote} Should this section cite §4.3 of [Basic Concepts] instead of its current text?

A prefix abbreviation structure is a tagged structure with an ELF Schema as its superstructure, tag PRFX, and no substructures. Its payload consists of two whitespace-separated tokens: the first is a prefix and the second is that prefix's corresponding IRI.

To prefix expand a string, if that string begins with a defined prefix followed by a colon (U+003A :) then replace that prefix and colon with the prefix's corresponding IRI. To prefix shorten a string, replace it with a string that prefix expansion would convert to the original string.

{.example ...} Given a PRFX

2 PRFX elf https://fhiso.org/elf/

the IRI https://fhiso.org/elf/ADDRESS may be abbreviated as elf:ADDRESS. {/}
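Prefix expansion and prefix shortening can be sketched as below. The function names and the use of a dictionary to hold the defined prefixes are assumptions of this sketch. Where no defined prefix yields a string that expands back to the original, the sketch returns the string unchanged.

```python
def prefix_expand(s: str, prefixes: dict[str, str]) -> str:
    """Replace a leading 'prefix:' with the prefix's IRI if the prefix is defined."""
    head, colon, tail = s.partition(":")
    if colon and head in prefixes:
        return prefixes[head] + tail
    return s


def prefix_shorten(s: str, prefixes: dict[str, str]) -> str:
    """Return a string that prefix_expand converts back to s."""
    for prefix, iri in prefixes.items():
        if s.startswith(iri):
            candidate = f"{prefix}:{s[len(iri):]}"
            # Only accept candidates that round-trip to the original string.
            if prefix_expand(candidate, prefixes) == s:
                return candidate
    return s


PREFIXES = {"elf": "https://fhiso.org/elf/"}
short = prefix_shorten("https://fhiso.org/elf/ADDRESS", PREFIXES)
full = prefix_expand(short, PREFIXES)
```

Note that an IRI such as https://example.com/x is left untouched by `prefix_expand` unless "https" has itself been defined as a prefix.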

IRI definition structure

An IRI definition structure is a tagged structure with an ELF Schema as its superstructure and tag "IRI". Its payload is an IRI, which MAY be prefix shortened during serialisation and MUST be prefix expanded during parsing. The remainder of this section calls this prefix expanded payload $I$. An IRI definition structure may have, as substructures, any number of supertype definition structures and tag definition structures.

A supertype definition structure is a tagged structure with an IRI definition structure as its superstructure, tag "ISA", and no substructures. Its payload is a structure type identifier which MAY be prefix shortened during serialisation and MUST be prefix expanded during parsing. The remainder of this section calls this prefix expanded payload $I'$. Each supertype definition structure encodes a single supertype definition, specifying that $I'$ is a supertype of $I$.

{.example ...} That elf:ParentPointer is a supertype of elf:PARENT1_POINTER can be encoded in a supertype definition structure as

2 IRI elf:PARENT1_POINTER
3 ISA elf:ParentPointer

{/}

A tag definition structure is a tagged structure with an IRI definition structure as its superstructure, tag "TAG", and no substructures. Its payload is a whitespace-separated list of two or more tokens. The first token $T$ MUST match production Tag; each remaining token $S$ is an IRI, which MAY be prefix shortened during serialisation and MUST be prefix expanded during parsing. Each such $S$ encodes a tag definition between structure type identifier $I$ and (tag, superstructure type identifier) pair $(T, S)$.

{.example ...} The following tag definitions

the structure type identifier of "HUSB" is elf:Parent1Age if its superstructure is an elf:FamilyEvent.
the structure type identifier of "HUSB" is elf:PARENT1_POINTER if its superstructure is an elf:FAM_RECORD.
the structure type identifier of "FORM" is elf:MULTIMEDIA_FORMAT if its superstructure is an elf:MULTIMEDIA_RECORD.
the structure type identifier of "FORM" is elf:MULTIMEDIA_FORMAT if its superstructure is an elf:MULTIMEDIA_FILE_REFERENCE.
the structure type identifier of "EMAIL" is elf:ADDRESS_EMAIL if its superstructure is an elf:Agent.
the structure type identifier of "EMAI" is elf:ADDRESS_EMAIL if its superstructure is an elf:Agent.

can be encoded in tag definition structures as

0 HEAD
1 SCHMA
2 PRFX elf https://fhiso.org/elf/
2 IRI elf:PARENT1_POINTER
3 TAG HUSB elf:FAM_RECORD
2 IRI elf:Parent1Age
3 TAG HUSB elf:FamilyEvent
2 IRI elf:MULTIMEDIA_FORMAT
3 TAG FORM elf:MULTIMEDIA_FILE_REFERENCE elf:MULTIMEDIA_RECORD
2 IRI elf:ADDRESS_EMAIL
3 TAG EMAIL elf:Agent
3 TAG EMAI elf:Agent

{/}
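As an illustration, the tag definition structures in the example above could be read into a lookup table as sketched below. This is a simplified sketch, not a full parser: it ignores line levels, assumes each PRFX line precedes its first use, and silently overwrites a repeated $(T, S)$ pair rather than detecting a conflict. The names `expand` and `read_tag_definitions` are illustrative.

```python
SCHEMA_LINES = """\
0 HEAD
1 SCHMA
2 PRFX elf https://fhiso.org/elf/
2 IRI elf:PARENT1_POINTER
3 TAG HUSB elf:FAM_RECORD
2 IRI elf:Parent1Age
3 TAG HUSB elf:FamilyEvent
2 IRI elf:MULTIMEDIA_FORMAT
3 TAG FORM elf:MULTIMEDIA_FILE_REFERENCE elf:MULTIMEDIA_RECORD
2 IRI elf:ADDRESS_EMAIL
3 TAG EMAIL elf:Agent
3 TAG EMAI elf:Agent
""".splitlines()


def expand(s, prefixes):
    """Prefix expand s using the prefixes seen so far."""
    head, colon, tail = s.partition(":")
    return prefixes[head] + tail if colon and head in prefixes else s


def read_tag_definitions(lines):
    """Map each (tag, superstructure type IRI) pair to a structure type IRI."""
    prefixes, definitions, current_iri = {}, {}, None
    for line in lines:
        _level, tag, *rest = line.split()
        if tag == "PRFX":
            prefixes[rest[0]] = rest[1]
        elif tag == "IRI":
            current_iri = expand(rest[0], prefixes)  # this is I
        elif tag == "TAG":
            t, *supers = rest  # T followed by one or more S
            for s in supers:
                definitions[(t, expand(s, prefixes))] = current_iri
    return definitions


definitions = read_tag_definitions(SCHEMA_LINES)
```

The single TAG line with two superstructure tokens produces two entries in the table, one per $(T, S)$ pair.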

Escape-preserving tags

{.note} This entire section, and all of the related functionality, is present to help cope with the idiosyncratic behaviour of date escapes in [GEDCOM 5.5.1]. Escapes in previous editions of GEDCOM were serialisation-specific and if encountered in ELF should generally be ignored, but date escapes are instead part of a microformat. While escape-preserving tags are not elegant, they are adequate to handle this idiosyncrasy.

{.ednote} I wrote the above note from somewhat fuzzy memory. It might be good to review and summarise all the uses of escapes in various GEDCOM releases...

Some tags may be defined as escape-preserving tags, each with a list of single-character preserved escape types each of which MUST match production UserEscType.

UserEscType ::= [A-TV-Z]

An escape preservation structure is a tagged structure with an ELF schema as its superstructure, tag "ESC", and no substructures. Its payload is composed of two whitespace-separated tokens; the first is the escape-preserving tag and the second is a concatenation of all preserved escape types of that tag; each preserved escape type SHOULD be included in the second token only once.

Two escape preservation structures MUST NOT differ only in the set of preserved escape sequences they define for a given tag.

Escape-preserving tags are included for backwards compatibility, and MUST NOT be used for new extensions.

{.note} The only known escape-preserving tag is "DATE", with the preserved escape type of "D"

{.example ...} The following is the only escape preservation structure in ELF 1.0.0:

0 HEAD
1 SCHMA
2 ESC DATE D

{/}

{.example ...} The following defines tag _OLD_EXTENSION to preserve G and Q escapes:

0 HEAD
1 SCHMA
2 ESC _OLD_EXTENSION QG

The ESC could have equivalently been written as

2 ESC _OLD_EXTENSION GQ

or even

2 ESC _OLD_EXTENSION QGGQQQGGGG

... though that last version is needlessly redundant and verbose and is NOT RECOMMENDED.

Such a definition MUST NOT be used except as backwards compatibility support for an escape-dependent _OLD_EXTENSION that predates ELF 1.0.0. {/}

{.note} This specification uses tag and not structure type to indicate escape preservation because the main motivating case (DATE) applies it to all of the several structure types that share that tag.
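The payload of an escape preservation structure could be read as in the following sketch, which treats the second token as a set so that repeated escape types are harmless. The function name `read_escape_preservation` is illustrative, and raising `ValueError` for a character outside production UserEscType is a choice made for this sketch rather than behaviour required by this specification.

```python
import re

# Production UserEscType ::= [A-TV-Z]
USER_ESC_TYPE = re.compile(r"[A-TV-Z]")


def read_escape_preservation(payload: str) -> tuple[str, frozenset[str]]:
    """Split an ESC payload into its tag and its set of preserved escape types."""
    tag, types = payload.split()
    for esc_type in types:
        if not USER_ESC_TYPE.fullmatch(esc_type):
            raise ValueError(f"{esc_type!r} does not match UserEscType")
    return tag, frozenset(types)


date_rule = read_escape_preservation("DATE D")
```

With this representation the payloads "_OLD_EXTENSION QG", "_OLD_EXTENSION GQ" and "_OLD_EXTENSION QGGQQQGGGG" all yield the same result.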

Tags

Definitions {#tags}

A tag is a string that matches production Tag

Tag ::= [0-9a-zA-Z_]+

A tag SHOULD be no more than 15 characters in length.

{.note} [GEDCOM 5.5.1] required tags to be unique within the first 15 characters and no more than 31 characters in length. As the memory constraints that motivated those requirements are no longer common, ELF has changed that to a recommendation instead.

A tag SHOULD begin with an underscore (_, U+005F) unless it is defined in a FHISO standard.

{.note} [GEDCOM 5.5.1] required all tags other than those it defined to begin with an underscore. ELF's use of structure type identifiers largely obviates that need, but it remains recommended in ELF 1.0.0 to support legacy systems that have special-case handling for underscore-prefixed tags. FHISO is considering removing that recommendation in a subsequent version of ELF.

{.example} "HEAD" is a valid tag; so is "_UUID". "23" and "UUID" are also valid, but SHOULD NOT be used as they are not defined in a FHISO standard and do not begin with an underscore. "_UNCLE_OF_THE_BRIDE" is valid, but SHOULD NOT be used as it is 19 characters long, more than the 15-character recommended maximum length.
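The distinction between what is valid and what is merely recommended can be sketched as follows. The function names and the `fhiso_defined` flag are assumptions of this sketch; an application would need its own way of knowing which tags are defined in a FHISO standard.

```python
import re

# Production Tag ::= [0-9a-zA-Z_]+
TAG = re.compile(r"[0-9a-zA-Z_]+")


def is_tag(s: str) -> bool:
    """Return True if s matches production Tag."""
    return TAG.fullmatch(s) is not None


def tag_advice(tag: str, fhiso_defined: bool = False) -> list[str]:
    """List the SHOULD-level recommendations that a valid tag does not follow."""
    advice = []
    if len(tag) > 15:
        advice.append("longer than 15 characters")
    if not fhiso_defined and not tag.startswith("_"):
        advice.append("not FHISO-defined and lacks a leading underscore")
    return advice


advice_for_long_tag = tag_advice("_UNCLE_OF_THE_BRIDE")
```

A tag that matches the production but has a non-empty advice list is still a valid tag; the list only records departures from the recommendations.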

Structure type identifiers are serialised as tags by using tag definitions and supertypes, as outlined below.

Supertypes

A supertype definition specifies one structure type identifier that is defined to be a supertype of another.

{.example ...} The following are example supertype definitions in the default ELF schema:

elf:FamilyEvent is a supertype of elf:MARRIAGE
elf:Event is a supertype of elf:FamilyEvent
elf:Agent is a supertype of elf:SUBMITTER_RECORD
elf:Record is a supertype of elf:SUBMITTER_RECORD {/}

An eventual supertype of a structure type identifier is either

the structure type identifier itself
an eventual supertype of at least one of the structure type identifier's supertypes

{.example ...} Continuing the previous example,

elf:MARRIAGE, elf:FamilyEvent, and elf:Event are eventual supertypes of elf:MARRIAGE
elf:FamilyEvent and elf:Event are eventual supertypes of elf:FamilyEvent
elf:SUBMITTER_RECORD, elf:Agent, and elf:Record are eventual supertypes of elf:SUBMITTER_RECORD {/}

If $X$ is an eventual supertype of $Y$, then $Y$ is an eventual subtype of $X$.

{.example ...} Continuing the previous example,

elf:MARRIAGE, elf:FamilyEvent, and elf:Event are eventual subtypes of elf:Event
elf:SUBMITTER_RECORD and elf:Agent are eventual subtypes of elf:Agent
elf:SUBMITTER_RECORD and elf:Record are eventual subtypes of elf:Record {/}
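The definitions of eventual supertype and eventual subtype can be sketched as a graph traversal. The dictionary layout and function names below are assumptions of this sketch, and prefix-shortened identifiers are used only for brevity; the data is taken from the example supertype definitions above.

```python
# Each key maps to the set of its directly defined supertypes.
SUPERTYPES = {
    "elf:MARRIAGE": {"elf:FamilyEvent"},
    "elf:FamilyEvent": {"elf:Event"},
    "elf:SUBMITTER_RECORD": {"elf:Agent", "elf:Record"},
}


def eventual_supertypes(type_id, supertypes=SUPERTYPES):
    """Return type_id itself plus everything reachable via supertype definitions."""
    seen, stack = set(), [type_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(supertypes.get(current, ()))
    return seen


def eventual_subtypes(type_id, supertypes=SUPERTYPES):
    """Return every known identifier that has type_id as an eventual supertype."""
    known = set(supertypes) | {type_id}
    for direct in supertypes.values():
        known |= direct
    return {t for t in known if type_id in eventual_supertypes(t, supertypes)}


marriage_supertypes = eventual_supertypes("elf:MARRIAGE")
```

The `seen` set also keeps the traversal finite if a set of supertype definitions were ever to contain a cycle.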

The supertype definitions in this specification are intended only to facilitate tag definitions and MUST NOT be taken to indicate any semantic relationship between the structure types they describe.

{.note} It is expected that underlying data models will often define a semantic supertype-like relationship that mirrors the supertype definitions in this document; see [ELF Data Model] for an example of what this might look like. The prohibition against assuming such from the supertype definitions alone provides a clearer separation between data model and serialisation.

{.ednote} We could decide to REQUIRE that any supertype definition has meaning in the underlying data model; I chose not to do so in this draft as it required discussing semantics, which this specification otherwise does not need to do.

Tag definitions {#tag-definitions}

The correspondence between tags and structure type identifiers is provided by a set of tag definitions. Each tag definition gives the unique structure type identifier that a particular tag corresponds to if its superstructure type identifier is an eventual subtype of a given superstructure type identifier.

{.example ...} The following are example tag definitions in the default ELF schema:

the structure type identifier of "HUSB" is elf:Parent1Age if its superstructure is an elf:FamilyEvent.
the structure type identifier of "HUSB" is elf:PARENT1_POINTER if its superstructure is an elf:FAM_RECORD.
the structure type identifier of "CAUS" is elf:CAUSE_OF_EVENT if its superstructure is an elf:Event.

If a tagged structure has tag "CAUS" and superstructure type identifier elf:MARRIAGE, its structure type identifier is elf:CAUSE_OF_EVENT because of the last of the above tag definitions and because elf:MARRIAGE is an eventual subtype of elf:Event. {/}

The set of tag definitions and supertype definitions MUST NOT provide two (or more) different structure type identifiers for any single structure.

{.example ...} The following, taken together, are not permitted

elf:Agent is a supertype of elf:SUBMITTER_RECORD.
elf:Record is a supertype of elf:SUBMITTER_RECORD.
ex:AgentKind is the structure type identifier of an "_EX_KIND" if its superstructure is an elf:Agent.
ex:RecordKind is the structure type identifier of an "_EX_KIND" if its superstructure is an elf:Record.

These provide two contradictory tag definitions for the tag "_EX_KIND" as a substructure of an elf:SUBMITTER_RECORD. {/}

{.example ...} The following, taken together, are permitted

elf:Agent is a supertype of elf:SUBMITTER_RECORD.
elf:Record is a supertype of elf:SUBMITTER_RECORD.
ex:Kind is the structure type identifier of an "_EX_KIND" if its superstructure is an elf:Agent.
ex:Kind is the structure type identifier of an "_EX_KIND" if its superstructure is an elf:Record.

These provide two tag definitions for the tag "_EX_KIND" as a substructure of an elf:SUBMITTER_RECORD, but because both provide the same structure type identifier they are permitted. {/}

A tag definition is said to apply to a structure if and only if the structure's structure type identifier is that of the tag definition and its superstructure type identifier is an eventual subtype of the tag definition's superstructure type identifier.

A tag definition is said to apply to a tagged structure if and only if the tagged structure's tag is that of the tag definition and its superstructure type identifier is an eventual subtype of the tag definition's superstructure type identifier.
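A check for the contradictory case described earlier in this section could be sketched as below, reusing the "_EX_KIND" examples. The data layout and function names are assumptions of this sketch, and prefix-shortened identifiers are used only for brevity.

```python
SUPERTYPES = {"elf:SUBMITTER_RECORD": {"elf:Agent", "elf:Record"}}

NOT_PERMITTED = {
    ("_EX_KIND", "elf:Agent"): "ex:AgentKind",
    ("_EX_KIND", "elf:Record"): "ex:RecordKind",
}
PERMITTED = {
    ("_EX_KIND", "elf:Agent"): "ex:Kind",
    ("_EX_KIND", "elf:Record"): "ex:Kind",
}


def eventual_supertypes(type_id, supertypes):
    """Return type_id itself plus everything reachable via supertype definitions."""
    seen, stack = set(), [type_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(supertypes.get(current, ()))
    return seen


def types_for(tag, super_type, tag_defs, supertypes):
    """Collect the structure type identifiers given by every applicable tag definition."""
    ancestors = eventual_supertypes(super_type, supertypes)
    return {iri for (t, s), iri in tag_defs.items() if t == tag and s in ancestors}


conflict = types_for("_EX_KIND", "elf:SUBMITTER_RECORD", NOT_PERMITTED, SUPERTYPES)
agreed = types_for("_EX_KIND", "elf:SUBMITTER_RECORD", PERMITTED, SUPERTYPES)
```

A result containing more than one identifier indicates a set of definitions that is not permitted; a result with exactly one identifier is unambiguous even when several tag definitions apply.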

Serialisation {#tag-serialisation}

During serialisation, a conformant application SHALL ensure the presence of sufficient tag definitions that each structure has a defined tag, creating new tag definitions if needed to achieve this end.

{.note} The above is not the same as saying that a tag definition is created for each structure type identifier because a structure with identifier "elf:Undefined" or an undefined tag identifier has a defined tag without a tag definition.

New tag definitions may be selected arbitrarily, subject to the limitations on tags (see {§tags}) and tag definitions (see {§tag-definitions}) and to the following:

the tag MUST NOT be "CONT", "CONC", "ERROR", "UNDEF", or the tag of any undefined tag identifier in the dataset.

{.note} "CONT", "CONC", "ERROR", and "UNDEF" are special tags that can be created at any location within the dataset during deserialisation.

the structure type identifier MUST NOT be any of "elf:Document", "elf:Metadata", "elf:Undefined", or an undefined tag identifier.

{.note} "elf:Undefined" structures are used for errors and are serialised differently than other structures.

the (tag, superstructure type identifier) pair MUST NOT be any of (HEAD, elf:Document), (TRLR, elf:Document), (CHAR, elf:Metadata), or (SCHMA, elf:Metadata).

{.note} These tags and contexts are reserved for encoding serialisation metadata.

all tag definitions for a given structure type identifier SHOULD use the same tag

{.note} [GEDCOM 5.5.1] never intentionally violates the above RECOMMENDATION, but via a typo it provides both EMAI and EMAIL as tags for elf:ADDRESS_EMAIL. Other aliases exist due to similar mistakes in applications and to multiple extensions inserting the same concept via different tags. The ability to handle these aliases is the reason this is a RECOMMENDATION, not a REQUIREMENT, in ELF.

the tag definitions in the default ELF Schema (see {§default-schema}) SHOULD be used in place of any alternative tag definitions for the same structures in the same contexts.

Each structure is converted to a tagged structure with the tag being

UNDEF if the structure type identifier is elf:Undefined.
The tag of the undefined tag identifier if the structure type identifier is an undefined tag identifier.
The tag from one of the tag definitions that applies to that structure otherwise.

In the event that more than one such tag exists, applications SHOULD select the same tag in each instance where this choice exists.

{.note} If processing structures into tagged structures in place, it may be easiest to perform a postorder traversal of each structure hierarchy; this way the superstructure of a structure being converted will still have a structure type identifier, not a tag, which will simplify looking up applicable tag definitions.

The substructures of a tagged structure are stored in a sequence, not set. This ordering of substructures of a tagged structure MUST maintain the relative order of those substructures that were ordered in the corresponding structure. It is RECOMMENDED that all substructures with the same tag be grouped together, but doing so is NOT REQUIRED.

{.example ...} Consider the following structure hierarchy

elf:INDIVIDUAL_RECORD
- elf:BIRTH
  - elf:DATE_VALUE 20 JUN 1881
- elf:GRADUATIONs:
  1. elf:GRADUATION
    - elf:AGE_AT_EVENT 18
  2. elf:GRADUATION
    - elf:AGE_AT_EVENT 22

This may be converted to any of the following three tagged structure hierarchies, though the second is NOT RECOMMENDED:

INDI
1. BIRTH
  - DATE 20 JUN 1881
2. GRAD
  - AGE 18
3. GRAD
  - AGE 22

INDI
1. GRAD
  - AGE 18
2. BIRTH
  - DATE 20 JUN 1881
3. GRAD
  - AGE 22

INDI
1. GRAD
  - AGE 18
2. GRAD
  - AGE 22
3. BIRTH
  - DATE 20 JUN 1881

However, the following puts the tagged structure graduations in a different order than the corresponding structure graduations and MUST NOT be used:

INDI
1. GRAD
  - AGE 22
2. GRAD
  - AGE 18
3. BIRTH
  - DATE 20 JUN 1881 {/}
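One way to follow the recommendation to group substructures with the same tag, without disturbing their relative order, is sketched below. Representing each substructure as a (tag, content) pair is a simplification made for this sketch, and it assumes the input sequence already respects any ordering present in the corresponding structure.

```python
def group_by_tag(substructures):
    """Group substructures by tag, keeping relative order within each tag."""
    first_seen = {}
    for tag, _content in substructures:
        first_seen.setdefault(tag, len(first_seen))
    # sorted() is stable, so same-tag substructures keep their original order.
    return sorted(substructures, key=lambda sub: first_seen[sub[0]])


ungrouped = [("GRAD", "AGE 18"), ("BIRTH", "DATE 20 JUN 1881"), ("GRAD", "AGE 22")]
grouped = group_by_tag(ungrouped)
```

Applied to the second hierarchy in the example above, this yields the third: both graduations first, in their original order, followed by the birth.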

Parsing

When parsing tagged structures into structures, add the structure type identifier from the applicable tag definition.

If there is no applicable tag definition, or if there are multiple applicable tag definitions providing different structure type identifiers, then the structure type identifier SHALL be elf:Undefined if the tag is UNDEF, or the undefined tag identifier constructed by concatenating elf:Undefined# and the tag otherwise.

{.note} The special tag "ERROR" does not require special handling; because it never has a tag definition, it becomes the undefined tag identifier elf:Undefined#ERROR.
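The parsing rule, including its fallback to undefined identifiers, can be sketched as follows. The data layout and function names are assumptions of this sketch, the sample definitions come from earlier examples in this document, and prefix-shortened identifiers are used only for brevity.

```python
SUPERTYPES = {
    "elf:MARRIAGE": {"elf:FamilyEvent"},
    "elf:FamilyEvent": {"elf:Event"},
}
TAG_DEFS = {("CAUS", "elf:Event"): "elf:CAUSE_OF_EVENT"}


def eventual_supertypes(type_id, supertypes):
    """Return type_id itself plus everything reachable via supertype definitions."""
    seen, stack = set(), [type_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(supertypes.get(current, ()))
    return seen


def resolve_type(tag, super_type, tag_defs=TAG_DEFS, supertypes=SUPERTYPES):
    """Return the structure type identifier for a tagged structure."""
    ancestors = eventual_supertypes(super_type, supertypes)
    found = {iri for (t, s), iri in tag_defs.items() if t == tag and s in ancestors}
    if len(found) == 1:
        return found.pop()
    # No applicable definition, or several that disagree.
    return "elf:Undefined" if tag == "UNDEF" else "elf:Undefined#" + tag


cause_type = resolve_type("CAUS", "elf:MARRIAGE")
```

As the note above says, "ERROR" needs no special case in this sketch: with no tag definition it falls through to the undefined tag identifier.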

References

Normative references

[ANSEL] : NISO (National Information Standards Organization). ANSI/NISO Z39.47-1993. Extended Latin Alphabet Coded Character Set for Bibliographic Use. 1993. (See http://www.niso.org/apps/group_public/project/details.php?project_id=10.) Standard withdrawn, 2013.

[Basic Concepts] : FHISO (Family History Information Standards Organisation). Basic Concepts for Genealogical Standards. Public draft. (See https://fhiso.org/TR/basic-concepts.)

[ASCII] : ANSI (American National Standards Institute). ANSI X3.4-1986. Coded Character Sets -- 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII). 1986.

[ISO 10646] : ISO (International Organization for Standardization). ISO/IEC 10646:2014. Information technology — Universal Coded Character Set (UCS). 2014.

[RFC 2119] : IETF (Internet Engineering Task Force). RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Scott Bradner, 1997. (See http://tools.ietf.org/html/rfc2119.)

[XML] : W3C (World Wide Web Consortium). Extensible Markup Language (XML) 1.1, 2nd edition. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, and John Cowan eds., 2006. W3C Recommendation. (See https://www.w3.org/TR/xml11/.)

Other references

[GEDCOM 5.5.1] : The Church of Jesus Christ of Latter-day Saints. The GEDCOM Standard, draft release 5.5.1. 2 Oct 1999.

[GEDCOM 5.5] : The Church of Jesus Christ of Latter-day Saints. The GEDCOM Standard, release 5.5. 1996.

[XML Names] : W3C (World Wide Web Consortium). Namespaces in XML 1.1, 2nd edition. Tim Bray, Dave Hollander, Andrew Layman and Richard Tobin, eds., 2006. W3C Recommendation. (See https://www.w3.org/TR/xml-names11/.)

[ELF Data Model] : FHISO (Family History Information Standards Organisation). Extended Legacy Format (ELF): Data Model.

[Unicode] : The Unicode Consortium. The Unicode Standard – Core Specification, version 12.1.0. See https://www.unicode.org/versions/Unicode12.1.0/.

Appendix A: Default Schema {#default-schema}

The following is a minimal ELF file with the default ELF Schema, which includes all tag definitions and supertype definitions listed in [ELF Data Model].

{#include schema.ged}

\vfill
