HoneyBadgerFish
The NeXML standard (https://github.com/nexml/nexml/wiki/NeXML-Manual) describes how to express the core data of a phylogenetic study in XML.
The standard also allows arbitrary key-value pairs to be added to any entity through the use of `meta` child elements.
Each `meta` can be either of type LiteralMeta or ResourceMeta.
Because Open Tree's study curation app's manipulations are primarily the addition, deletion, and changing of these `meta` elements, it makes sense for us to make them easy to access.
In a naive transformation of NeXML to JSON, finding a `meta` property requires iterating through every child `meta` object, checking the "@property" for the desired property name, and then looking for the value in one of a few places ("@content" or "$" for LiteralMeta elements, and "@href" or "$" for ResourceMeta).
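The cost of that naive lookup can be sketched as follows. This is only an illustration: the function name `find_meta_naive` and the sample `otu` dict are hypothetical, and the check of "@rel" for ResourceMeta elements is an assumption based on the `rel` attribute described in Rule 8.

```python
def find_meta_naive(element, prop_name):
    """Collect the values of every meta child of `element` named `prop_name`.

    `element` is assumed to be a dict from a direct BadgerFish-style
    conversion, with its meta children stored under a "meta" key.
    """
    metas = element.get("meta", [])
    if isinstance(metas, dict):
        # Strict BadgerFish stores a lone child as a bare object, not an array.
        metas = [metas]
    values = []
    for meta in metas:
        if meta.get("@property") == prop_name:
            # LiteralMeta: the value is in "@content" or in the text ("$").
            values.append(meta.get("@content", meta.get("$")))
        elif meta.get("@rel") == prop_name:
            # ResourceMeta (assumed to be named by "@rel"): "@href" or "$".
            values.append(meta.get("@href", meta.get("$")))
    return values


# Hypothetical sample input in the naive (direct BadgerFish) form.
otu = {
    "meta": [
        {"@property": "ot:ottId", "@datatype": "xsd:integer", "$": "415973"},
        {"@rel": "ot:studyPublication",
         "@href": "http://dx.doi.org/10.3732/ajb.94.12.2026"},
    ]
}
print(find_meta_naive(otu, "ot:ottId"))
```

Every lookup scans all of the meta children, and the value comes back as an uncoerced string.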
The `ot:*` key-value pairs that the Open Tree project is using to add extra info are documented on the NexSON page.
The NexSON files are produced using a syntactic convention based on the BadgerFish convention (see below).
The XML tree will be mirrored as a tree of JS objects. The topmost object contains the root of the XML tree. Each element in the NeXML is processed using the following rules, such that an XML element becomes a JS object inside its parent.
The first 4 rules deviate only slightly from BadgerFish (see the Note in rule #3).
1. The XML element name becomes the name of the property in the parent JS object.
2. The text value of the XML element is contained in the `$` property of the object. Whitespace is stripped from the ends. If the text value of an XML element is broken up by intervening child elements, the `$` of the object is produced by stripping leading and trailing whitespace from each fragment and concatenating the fragments.
3. The child elements in XML map to an array of objects. Note: in BadgerFish single elements are mapped to single JS objects. In the NeXML schema, all of the core objects can be repeated. So an array (of any length) is a more natural mapping. Missing elements are omitted (not written as empty arrays).
4. XML attributes become properties of the object, with a name formed by prefixing `@` to the attribute name.

So:

    <alice charlie="david">bob</alice>

at the top level would become:

    {"alice": [{ "$" : "bob", "@charlie" : "david" }]}
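Rules 1-4 can be sketched with Python's standard `xml.etree.ElementTree` module. The function names are hypothetical, and the sketch assumes that an element with no text gets no `$` property. It ignores the namespace and `meta` rules, and ElementTree rewrites namespaced names to `{uri}local` form, so it only behaves as described on un-namespaced input.

```python
import xml.etree.ElementTree as ET


def element_to_obj(elem):
    """Convert one XML element to a dict following rules 1-4 only."""
    obj = {}
    # Rule 4: attributes become "@"-prefixed properties.
    for name, value in elem.attrib.items():
        obj["@" + name] = value
    # Rule 2: strip each text fragment, then concatenate the fragments.
    # The fragments are the leading text plus the tail after each child.
    fragments = [elem.text or ""] + [child.tail or "" for child in elem]
    text = "".join(fragment.strip() for fragment in fragments)
    if text:  # assumption: no "$" property when there is no text
        obj["$"] = text
    # Rules 1 and 3: each child goes into an array keyed by its element name.
    for child in elem:
        obj.setdefault(child.tag, []).append(element_to_obj(child))
    return obj


def xml_to_obj(xml_text):
    """Wrap the converted root element in the topmost object."""
    root = ET.fromstring(xml_text)
    return {root.tag: [element_to_obj(root)]}


print(xml_to_obj('<alice charlie="david">bob</alice>'))
```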
Rules 5 and 6 deal with XML namespaces. They mainly differ from BadgerFish in that the namespaces are only added to the root object:

5. The default namespace becomes the `$` property of an `@xmlns` object, and other namespaces become properties of that object. The names of the properties are the names of the XML namespaces without the "xmlns:" qualifier. So

        <alice xmlns="http://some-namespace" xmlns:charlie="http://some-other-namespace">bob</alice>

   as a top-level object becomes:

        {"alice": [{ "$" : "bob", "@xmlns" : { "$": "http://some-namespace", "charlie": "http://some-other-namespace"}}]}

   Unlike BadgerFish, this `@xmlns` is only added to the root object.
6. Prefixes in an element or attribute name are just treated as part of the name (no substitution of the URL or cropping of the element name to exclude the prefix).
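A minimal sketch of Rule 5, using the standard `xml.dom.minidom` module because it exposes namespace declarations as ordinary attributes of the root element. The function name `root_xmlns` is hypothetical, and this shows only how the `@xmlns` object could be assembled, not how any existing converter does it.

```python
from xml.dom import minidom


def root_xmlns(xml_text):
    """Build the "@xmlns" object for the root element of a document."""
    root = minidom.parseString(xml_text).documentElement
    xmlns = {}
    for name, value in root.attributes.items():
        if name == "xmlns":
            # The default namespace goes in the "$" property.
            xmlns["$"] = value
        elif name.startswith("xmlns:"):
            # Other namespaces are keyed by their prefix, minus "xmlns:".
            xmlns[name[len("xmlns:"):]] = value
    return xmlns


doc = ('<alice xmlns="http://some-namespace" '
       'xmlns:charlie="http://some-other-namespace">bob</alice>')
print(root_xmlns(doc))
```

Under Rule 6 a converter would not resolve prefixes any further: a name such as `ot:tag` is used verbatim as (part of) the JSON key.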
Rules 7-9 are special case handling of meta elements:

7. If an element has a `meta` child element with `xsi:type="nex:LiteralMeta"` then it must have
   - a `property` attribute; we will call the value of this attribute prop-value;
   - a `datatype` attribute specifying which `xsd:` datatype the element holds; we will call the value of this attribute datatype-value; and
   - the data in a `content` attribute OR in the text content of the element; we will call this the content-value.

   This sort of meta element will appear in the parent object under a name with a `^` prefix followed by prop-value. The content-value will be coerced to the JavaScript type that corresponds to datatype-value.
   The exact representation of the property depends on what needs to be conveyed:
   - Rule 7A: If there are no other attributes of the meta element needing to be mapped, then the key-value pair will have a JS primitive type as its value.
   - Rule 7B: If there are other attributes that need to be written (such as an `id` attribute), then the value will be a JS object with content-value stored in the `$` field.
8. If an element has a `meta` child element with `xsi:type="nex:ResourceMeta"` then it must have
   - a `rel` attribute; we will call the value of this attribute prop-value;
   - the data in an `href` attribute OR a nested `meta` element; we will call this the content-value.

   This sort of meta element will appear in the parent object under a name with a `^` prefix followed by prop-value. The value will be a JS object with:
   - Rule 8A: if the data is in an `href` attribute, then an `@href` property will hold the href string.
   - Rule 8B: if a nested `meta` element holds the data, then a `$` property will map to a JavaScript object that holds the representation of the inner `meta`.
9. Many of the meta attributes can only occur once per element. To streamline the `meta` encoding (and as an exception to Rule 3 above) we use the BadgerFish convention for dealing with cardinality:
   - Rule 9A: If there is one element that maps to a property name, the value is the object described above (either a primitive for simple `nex:LiteralMeta`-type elements, or a full JS object otherwise).
   - Rule 9B: If there are multiple elements that map to a property name, then the value of the property is an array which holds each of the object representations as described above.

Note that the type hints (`datatype` and `xsi:type` attributes) are not present in the JSON.
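Rules 7-9 can be sketched as two small helpers. All names here (`coerce`, `encode_meta`, `add_meta`) are hypothetical. The table of `xsd:` types treated as integers or floats is an assumption, as is passing a nested `meta` in already-encoded form. The attribute dicts use the names as written in the XML.

```python
INT_TYPES = {"xsd:integer", "xsd:int", "xsd:long"}      # assumed mapping
FLOAT_TYPES = {"xsd:float", "xsd:double", "xsd:decimal"}  # assumed mapping
TYPE_HINTS = {"property", "rel", "datatype", "content", "xsi:type"}


def coerce(raw, datatype):
    """Coerce content-value to the type matching datatype-value."""
    if datatype in INT_TYPES:
        return int(raw)
    if datatype in FLOAT_TYPES:
        return float(raw)
    if datatype == "xsd:boolean":
        return raw.strip().lower() in ("true", "1")
    return raw


def encode_meta(attrs, text=None, nested=None):
    """Return the (key, value) pair for one meta element."""
    # Attributes other than the name and type hints must be kept (e.g. id).
    extra = {"@" + k: v for k, v in attrs.items() if k not in TYPE_HINTS}
    if attrs["xsi:type"] == "nex:LiteralMeta":
        raw = attrs["content"] if "content" in attrs else (text or "").strip()
        content = coerce(raw, attrs.get("datatype"))
        if not extra:
            return "^" + attrs["property"], content       # Rule 7A
        extra["$"] = content
        return "^" + attrs["property"], extra             # Rule 7B
    if nested is not None:
        extra["$"] = nested                               # Rule 8B
    return "^" + attrs["rel"], extra                      # Rule 8A: "@href" kept


def add_meta(parent, key, value):
    """Store the pair in the parent, following the cardinality rules."""
    if key not in parent:
        parent[key] = value                               # Rule 9A
    elif isinstance(parent[key], list):
        parent[key].append(value)                         # Rule 9B
    else:
        parent[key] = [parent[key], value]                # Rule 9B


otu = {}
add_meta(otu, *encode_meta({"property": "ot:ottId", "datatype": "xsd:integer",
                            "xsi:type": "nex:LiteralMeta"}, text="415973"))
print(otu)
```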
Reverse translation is possible by relying on the following rules:
- If the value is a primitive, then `nex:LiteralMeta` will be used.
- If the value is an object with a `$` that is a primitive, then `nex:LiteralMeta` will be used.
- If the value is an object with an `@href` property, then `nex:ResourceMeta` will be used.
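The reverse-translation rules amount to a small type-inference function. The sketch below uses hypothetical names. Treating an object whose `$` is itself an object as `nex:ResourceMeta` is an assumption drawn from Rule 8B, and the `xsd:boolean` and `xsd:string` cases in `guess_datatype` are likewise assumptions; the document only says that `xsd:integer` and `xsd:float` are produced by introspection.

```python
def meta_xsi_type(value):
    """Infer the xsi:type to emit for one value stored under a "^" key."""
    if not isinstance(value, dict):
        return "nex:LiteralMeta"          # a bare primitive
    if "@href" in value:
        return "nex:ResourceMeta"         # object with an href
    if "$" in value and not isinstance(value["$"], (dict, list)):
        return "nex:LiteralMeta"          # object whose "$" is a primitive
    # Assumption: a nested meta under "$" implies ResourceMeta (Rule 8B).
    return "nex:ResourceMeta"


def guess_datatype(primitive):
    """Infer the datatype attribute by introspecting a primitive value."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(primitive, bool):
        return "xsd:boolean"              # assumption
    if isinstance(primitive, int):
        return "xsd:integer"
    if isinstance(primitive, float):
        return "xsd:float"
    return "xsd:string"                   # assumption


print(meta_xsi_type({"$": "Ancyromonas sigmoides", "@id": "bogus"}))
```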
10. If there is an `about` attribute with a value that refers to the same element's `id`, then an `@about` is not present in the JSON.
11. The top-level object in JSON will have a `@nexml2json` property that maps to a version string such as "1.0.0a" or "1.0.0". Direct BadgerFish translations to JSON will lack this property, or will have a version string that starts with "0." (because most projects tweak the BadgerFish rules at least a little bit, it seems like a good idea to leave some room in the 0... namespace for distinguishing between versions of JSON produced by those conventions).
There are three ways (that we are aware of) that a roundtrip of XML -> JSON -> XML might not result in identical syntax:
- The attribute and element order is not preserved. This is a trivial barrier to using diff to test roundtrips, but not a serious issue.
- Introspection will provide the `datatype` of `nex:LiteralMeta` elements. This means that `xsd:integer` and `xsd:float` values will be used for integer and floating point numbers. Thus the details of the meta properties (e.g. integer vs long or float vs double) may not be "round-trip-able". We do not know of cases in NeXML documents in which these fine-grained distinctions of type are needed.
- A LiteralMeta form of `meta` can store its value in a `content` attribute or the text body of the element. Both of these map to `$` in JSON, so the exact placement cannot be recovered. This is not a substantive concern, as there is no indication in the NeXML standard that the two locations for the data should affect handling of the data.
The NeXML snippet below was pieced together from multiple files. So it does not make sense biologically. It was constructed to be valid NeXML and to show a diversity of the meta cases that introduce complexity:
The version-controlled home for the file is at https://github.com/OpenTreeOfLife/api.opentreeoflife.org/blob/roundtrip2xml/nexson-validator/tests/nexml/otu.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <nex:nexml xmlns:nex="http://www.nexml.org/2009"
               xmlns="http://www.nexml.org/2009"
               version="0.9"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:ot="http://purl.org/opentree/nexson"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
               xmlns:tb="http://purl.org/phylo/treebase/2.0/terms#"
               xmlns:skos="http://www.w3.org/2004/02/skos/core#">
      <meta property="ot:tag" xsi:type="nex:LiteralMeta" datatype="xsd:string">cpDNA</meta>
      <meta property="ot:tag" xsi:type="nex:LiteralMeta" datatype="xsd:string">ingroup added</meta>
      <meta property="ot:candidateTreeForSynthesis" xsi:type="nex:LiteralMeta" datatype="xsd:string">tr1</meta>
      <otus id="ob1">
        <otu about="#otu88801" id="otu88801" label="Ancyromonas sigmoides">
          <meta property="ot:ottId" xsi:type="nex:LiteralMeta" datatype="xsd:integer">415973</meta>
          <meta property="ot:originalLabel" id="bogus" xsi:type="nex:LiteralMeta" datatype="xsd:string">Ancyromonas sigmoides</meta>
          <meta href="http://dx.doi.org/10.3732/ajb.94.12.2026" rel="ot:studyPublication" xsi:type="nex:ResourceMeta"/>
          <meta content="7002" datatype="xsd:long" id="m0" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
          <meta href="http://purl.uniprot.org/taxonomy/94215" id="meta4912509" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
          <meta href="http://purl.uniprot.org/taxonomy/102624" id="meta4912517" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
        </otu>
      </otus>
      <trees id="tb1" otus="ob1">
        <tree id="tr1" xsi:type="nex:FloatTree">
          <node id="n1" otu="otu88801"/>
          <node id="n0"/>
          <edge id="e0" source="n0" target="n1"/>
        </tree>
      </trees>
    </nex:nexml>
will be represented as (there is not much of interest after the `otu` object):

    {
      "nex:nexml": {
        "@version": "0.9",
        "@xmlns": {
          "$": "http://www.nexml.org/2009",
          "nex": "http://www.nexml.org/2009",
          "ot": "http://purl.org/opentree/nexson",
          "skos": "http://www.w3.org/2004/02/skos/core#",
          "tb": "http://purl.org/phylo/treebase/2.0/terms#",
          "xsd": "http://www.w3.org/2001/XMLSchema#",
          "xsi": "http://www.w3.org/2001/XMLSchema-instance"
        },
        "^ot:candidateTreeForSynthesis": "tr1",  # Rule 7A, 9A
        "^ot:tag": ["cpDNA", "ingroup added"],  # Rule 7A, 9B
        "otus": [{
          "@id": "ob1",
          "otu": [{
            "@id": "otu88801",
            "@label": "Ancyromonas sigmoides",
            "^ot:originalLabel": {  # Rule 7B, 9A
              "$": "Ancyromonas sigmoides",
              "@id": "bogus"
            },
            "^ot:ottId": 415973,  # Rule 7A, 9A
            "^ot:studyPublication": {  # Rule 8A, 9A
              "@href": "http://dx.doi.org/10.3732/ajb.94.12.2026"
            },
            "^skos:closeMatch": [{  # Rule 8A, 9B
              "@href": "http://purl.uniprot.org/taxonomy/94215",
              "@id": "meta4912509"
            }, {
              "@href": "http://purl.uniprot.org/taxonomy/102624",
              "@id": "meta4912517"
            }],
            "^tb:identifier.taxon": {  # Rule 7B, 9A
              "$": 7002,
              "@id": "m0"
            }
          }]
        }],
        "trees": [{
          "@id": "tb1",
          "@otus": "ob1",
          "tree": [{
            "@id": "tr1",
            "@xsi:type": "nex:FloatTree",
            "edge": [{
              "@id": "e0",
              "@source": "n0",
              "@target": "n1"
            }],
            "node": [{
              "@id": "n1",
              "@otu": "otu88801"
            }, {
              "@id": "n0"
            }]
          }]
        }]
      }
    }
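Because of Rule 9, a client reading a `^`-prefixed key has to be ready for either a single value or an array. A minimal sketch of a normalizing accessor follows; the name `get_meta_values` and the trimmed-down `nexml` dict are hypothetical.

```python
def get_meta_values(obj, prop_name):
    """Return the values stored for a meta property as a list.

    A missing key gives an empty list, a single value (Rule 9A) is wrapped
    in a list, and an array (Rule 9B) is returned unchanged.
    """
    key = "^" + prop_name
    if key not in obj:
        return []
    value = obj[key]
    return value if isinstance(value, list) else [value]


# A fragment of the converted example, written as a Python dict.
nexml = {
    "^ot:candidateTreeForSynthesis": "tr1",
    "^ot:tag": ["cpDNA", "ingroup added"],
}
print(get_meta_values(nexml, "ot:tag"))
print(get_meta_values(nexml, "ot:candidateTreeForSynthesis"))
```

Compared with the naive BadgerFish form, each lookup is a single key access instead of a scan over all meta children.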
We can probably avoid supporting this form - it was proposed in email, but not implemented.
This representation is very similar to the `@nexml2json=1.1.*` with the following exception: a "byId" representation is used for some fields rather than an array. In this representation:
- a single object is used in place of the array in the 1.0.0 syntax,
- the only permitted keys in the object are the `id` attributes of the elements,
- the value associated with the key is an object identical to the 1.0.0 representation except that the `@id` is not included,
- the NeXML form of the object is a sequence of elements, one for each key-value pair.
Specifically:
- Instead of `node` and `edge` arrays, the tree representation is expressed as:
  - `internalEdge` and `terminalEdge` arrays instead of `edge` (which if concatenated would recreate the `edge` array of the 1.0.* representation).
  - `leafById` and `internalNodeById` objects are used instead of a `node` array, and:
  - The `^ot:isLeaf` field is omitted (since presence in `leafById` conveys this info).
- an `otuByID` object replaces an `otu` array.
- an `otusByID` object replaces an `otus` array and the parent (nexml) object will have a `^ot:otusElementOrder` key with an array of otus IDs to supply the order of the otus elements.
- a `treesByID` object replaces a `trees` array and the parent (nexml) object will have a `^ot:treesElementOrder` key with an array of trees IDs to supply the order of the trees elements.
- a trees group object will have a `^ot:treeElementOrder` key with an array of tree IDs to supply the order of the tree elements.
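The array-to-"byId" change can be sketched as a pair of helpers. The function names are hypothetical; the key names `otusByID` and `^ot:otusElementOrder` are the ones listed above. Since this form was proposed but not implemented, this is only an illustration of the idea.

```python
def to_by_id(elements):
    """Turn a 1.0.* array of elements into (byId object, id order)."""
    by_id = {}
    order = []
    for element in elements:
        element_id = element["@id"]
        # The value is the same object, minus its "@id" (now the key).
        by_id[element_id] = {k: v for k, v in element.items() if k != "@id"}
        order.append(element_id)
    return by_id, order


def from_by_id(by_id, order):
    """Rebuild the 1.0.* array, restoring "@id" and the element order."""
    return [dict(by_id[element_id], **{"@id": element_id})
            for element_id in order]


nexml = {"otus": [{"@id": "ob1", "otu": []}, {"@id": "ob2", "otu": []}]}
by_id, order = to_by_id(nexml.pop("otus"))
nexml["otusByID"] = by_id
nexml["^ot:otusElementOrder"] = order
print(nexml)
```

Keeping the order array alongside the object is what lets the original sequence of elements be regenerated.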
This is the form (1.2.1) that MTH thinks should be stored in serialized form, but on-the-fly translation could make that decision less important for tools other than the api.opentree.org services.
This is the same as syntax 1.1.* except:
- the `internalEdge` and `terminalEdge` arrays are replaced by an `edgeBySourceId` object with the following rules:
  - The only permitted keys in the object are the `@source` attributes of the edges,
  - The value associated with the key is an object whose keys are the ids of the edges that have that `@source`. The `@source` would not need to be included in a minimally sized representation, but it is retained because most clients will want to create "edgeById" and/or "edgeByTargetId" maps; the duplication here allows all 3 maps to share references to the same object. Note: in 1.2.0 the value was an array of edges; that is no longer supported by peyotl.
- Each object in the `tree` array will have a "^ot:rootNodeId" property that holds the ID of the node of the tree that is not the `@target` of any edge. The `@root` property is still retained in that node. The "^ot:specifiedRoot" is not identical to this, because that property is used to determine if the rooting is arbitrary.
- Instead of `leafById` and `internalNodeById` there is just a `nodeById` object; there is still no `^ot:isLeaf` required because internal node ids will be keys in `edgeBySourceId`, enabling a fast answer to the "isLeaf" question.
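As an illustration of the three changes above, the sketch below derives `nodeById`, `edgeBySourceId` and "^ot:rootNodeId" from 1.0.*-style `node` and `edge` arrays. The function names are hypothetical, and dropping `@id` from each edge object (because it is already the key) is an assumption carried over from the "byId" convention.

```python
def index_tree(nodes, edges):
    """Build the 1.2-style lookup objects from node and edge arrays."""
    node_by_id = {n["@id"]: {k: v for k, v in n.items() if k != "@id"}
                  for n in nodes}
    edge_by_source = {}
    targets = set()
    for e in edges:
        # "@source" is deliberately retained inside the edge object.
        edge = {k: v for k, v in e.items() if k != "@id"}
        edge_by_source.setdefault(e["@source"], {})[e["@id"]] = edge
        targets.add(e["@target"])
    # The root is the one node that is not the "@target" of any edge.
    roots = [node_id for node_id in node_by_id if node_id not in targets]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return {"nodeById": node_by_id,
            "edgeBySourceId": edge_by_source,
            "^ot:rootNodeId": roots[0]}


def is_leaf(tree, node_id):
    """A node is a leaf if it is not the source of any edge."""
    return node_id not in tree["edgeBySourceId"]


tree = index_tree(
    nodes=[{"@id": "n1", "@otu": "otu88801"}, {"@id": "n0"}],
    edges=[{"@id": "e0", "@source": "n0", "@target": "n1"}],
)
print(tree["^ot:rootNodeId"], is_leaf(tree, "n1"))
```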
This representation allows for a very rapid construction of the tree:
- Start at "^ot:rootNodeId".
- Build the tree in preorder by looking up all of the outgoing edges in edgeBySourceId.

Each of these lookups can be done in constant time, so the tree can be built in O(N) time without any code to deal with partially connected trees during the building process or any additional memory. Subtrees can also be built by starting at the MRCA.
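A minimal sketch of that traversal, assuming a tree object with the `edgeBySourceId` and "^ot:rootNodeId" keys described above. The function name `preorder` and the sample tree are hypothetical. This version uses an explicit stack and returns a child map rather than building any particular tree class.

```python
def preorder(tree, start=None):
    """Visit nodes in preorder from the root (or from `start`).

    Returns (visit order, mapping of node id -> list of child ids).
    """
    edge_by_source = tree["edgeBySourceId"]
    stack = [start if start is not None else tree["^ot:rootNodeId"]]
    order = []
    children = {}
    while stack:
        node_id = stack.pop()
        order.append(node_id)
        # One dictionary lookup gives every outgoing edge of this node.
        outgoing = edge_by_source.get(node_id, {})
        kids = [edge["@target"] for edge in outgoing.values()]
        children[node_id] = kids
        # Push in reverse so the first child is visited next.
        stack.extend(reversed(kids))
    return order, children


tree = {
    "^ot:rootNodeId": "n0",
    "edgeBySourceId": {
        "n0": {"e0": {"@source": "n0", "@target": "n1"},
               "e1": {"@source": "n0", "@target": "n2"}},
        "n1": {"e2": {"@source": "n1", "@target": "n3"}},
    },
}
order, children = preorder(tree)
print(order)
```

Passing the id of an MRCA as `start` walks only the subtree below that node.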
BadgerFish is one of several schemes for rendering XML as JSON. Several sites, including a site that appears to be the original, and several refinements were consulted in developing the mapping appropriate for NeXML.
Correctness of translation was verified by using a backtranslator and validating the resulting XML using the validator on the NeXML home page.
We were straying from strict BadgerFish by not emitting the active XML namespaces in each object, and occasionally omitting the "datatype" for "meta" elements.
Given that roundtripping a file required special tools, we decided to take the leap and clean up several aspects of the BadgerFish mapping to make data access easier on clients and reduce the size of NexSON.
MTH intends to add logic to the API code to produce our old form (close to a straight BadgerFish conversion) via the API layer if a call includes an `output_nexml2json=0.*` argument.
Jim Allman, Karen Cranston, Cody Hinchliff, Mark Holder, Peter Midford, and Jonathan Rees participated in discussions and design of NexSON.