While many vocabularies exist that claim to be able to describe dataset metadata, none of them is able to describe some of the most common real world dataset deployments. This is in line with the empirical observation that usable dataset metadata descriptions do not occur in the wild.
LDmeta is an attempt to formulate a dataset metadata vocabulary that can actually be used to describe datasets, dataset distributions, and the locations on the internet from which such distributions can be downloaded (or otherwise accessed).
This section enumerates criteria that should be met by a dataset metadata vocabulary in order to be useful.
VoID fails at this criterion, since it specifically allows inaccurate data to be specified, while at the same time preventing accurate data from being specified.
Both DCAT and VoID fail at this since properties like void:format,
dcat:byteSize, dcat:last-modified, and dcat:mediaType can only be
specified for datasets (VoID) or distributions (DCAT).
VoID fails at this, since it defines a dataset as:
A set of RDF triples that are published, maintained or aggregated by a single provider.
The RDF 1.1 standards specify a dataset as consisting of a single default graph and an arbitrary number of named graphs.
LDmeta is a dataset metadata vocabulary that tries to implement the
criteria specified in the previous section. We use alias ldm to
denote IRI prefix https://ldm.cc/.
ldm:Dataset a rdfs:Class;
rdfs:comment "<div lang=\"en\"><p>An RDF dataset is a collection of RDF graphs, and comprises:</p><ul><li>Exactly one default graph, being an RDF graph. The default graph does not have a name and MAY be empty.</li><li>Zero or more named graphs. Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph. Graph names are unique within an RDF dataset.</li></ul></div>"^^rdf:HTML;
rdfs:label "dataset"@en.
ldm:Distribution a rdfs:Class;
rdfs:comment "A concrete representation of a dataset in terms of electronic files."@en;
rdfs:label "distribution"@en.
ldm:File a rdfs:Class;
rdfs:comment "A binary file that can be stored on a computer and transmitted over a network."@en;
rdfs:label "file"@en.
ldm:byteSize a rdf:Property;
rdfs:comment "The number of bytes contained in a specific file."@en;
rdfs:domain ldm:File;
rdfs:label "byte size"@en;
rdfs:range xsd:nonNegativeInteger.
ldm:distribution a rdf:Property;
rdfs:comment "A dataset can have one or more distributions."@en;
rdfs:domain ldm:Dataset;
rdfs:label "distribution"@en;
rdfs:range ldm:Distribution.
ldm:downloadLocation a rdf:Property;
rdfs:comment "The location from which an electronic file can be downloaded. The same file may be downloaded from multiple locations."@en;
rdfs:domain ldm:File;
rdfs:label "download location"@en;
rdfs:range xsd:anyURI.
ldm:encoding a rdf:Property;
rdfs:comment "The encoding of a specific file."@en;
rdfs:domain ldm:File;
rdfs:label "encoding"@en;
rdfs:range xsd:string.
ldm:file a rdf:Property;
rdfs:comment "A distribution can contain one or more files."@en;
rdfs:domain ldm:Distribution;
rdfs:label "file"@en;
rdfs:range ldm:File.
ldm:fileName a rdf:Property;
rdfs:comment "The name of a file. This is typically composed of a base name, followed by a dot, followed by a file extension."@en;
rdfs:domain ldm:File;
rdfs:label "file name"@en;
rdfs:range xsd:string.
ldm:mediaType a rdf:Property;
rdfs:comment "The Media Type of a specific file."@en;
rdfs:domain ldm:File;
rdfs:label "media type"@en;
rdfs:range xsd:string.
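As a minimal sketch of how these terms might fit together, the following describes one dataset with one distribution consisting of a single file. The ex: namespace, the file name, the byte size, the encoding value, and both download locations are hypothetical and only serve as illustration; the byte size is typed as xsd:nonNegativeInteger, in line with the range that this README motivates for ldm:byteSize.

```turtle
@prefix ldm: <https://ldm.cc/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/> .   # hypothetical namespace

ex:dataset a ldm:Dataset ;
  ldm:distribution ex:distribution .

ex:distribution a ldm:Distribution ;
  ldm:file ex:file .

# All file-level metadata is attached to the file itself.
ex:file a ldm:File ;
  ldm:fileName "data.nt" ;
  ldm:mediaType "application/n-triples" ;
  ldm:encoding "UTF-8" ;
  ldm:byteSize "1024"^^xsd:nonNegativeInteger ;
  # The same file may be downloaded from multiple locations.
  ldm:downloadLocation
    "https://example.org/data.nt"^^xsd:anyURI ,
    "https://mirror.example.org/data.nt"^^xsd:anyURI .
```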
For example, the following is a correct DCAT description. It is not
possible to publish <file1> and <file2> as part of the same
distribution, since the Media Types can then no longer be related to
the appropriate files.
<dataset> a dcat:Dataset;
dcat:distribution
<distribution1>,
<distribution2>.
<distribution1> a dcat:Distribution;
dcat:downloadURL <file1>;
dcat:mediaType "text/xml".
<distribution2> a dcat:Distribution;
dcat:downloadURL <file2>;
dcat:mediaType "application/json".
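For comparison, here is a sketch of how the same two files might be described with the LDmeta terms defined above. Because ldm:mediaType has ldm:File as its domain, both files can belong to one distribution while each keeps its own Media Type. The ex: namespace and the resource names are hypothetical.

```turtle
@prefix ldm: <https://ldm.cc/> .
@prefix ex:  <https://example.org/> .   # hypothetical namespace

ex:dataset a ldm:Dataset ;
  ldm:distribution ex:distribution .

# One distribution that consists of two files.
ex:distribution a ldm:Distribution ;
  ldm:file ex:file1 , ex:file2 .

# Each Media Type is stated on the file it belongs to.
ex:file1 a ldm:File ;
  ldm:mediaType "text/xml" .

ex:file2 a ldm:File ;
  ldm:mediaType "application/json" .
```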
The range of dcat:byteSize is xsd:decimal. Since it is not possible
for byte sizes to be non-whole numbers or negative numbers, this
range specification is unnecessarily broad, facilitating incorrect
descriptions. (This is why the range of ldm:byteSize is
xsd:nonNegativeInteger.)
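The difference can be illustrated with a small sketch. The first description uses a value that is a legal xsd:decimal, and therefore fits the range described above, yet makes no sense as a file size. The second uses the narrower datatype, under which only whole, non-negative sizes are well-typed. The ex: namespace and the values are hypothetical.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ldm:  <https://ldm.cc/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.org/> .   # hypothetical namespace

# Well-typed as xsd:decimal, but meaningless as a byte size.
ex:distribution1 a dcat:Distribution ;
  dcat:byteSize "-12.5"^^xsd:decimal .

# With xsd:nonNegativeInteger, "-12.5" would be an ill-typed literal.
ex:file1 a ldm:File ;
  ldm:byteSize "1024"^^xsd:nonNegativeInteger .
```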
Some VoID statistics properties are about the number of syntactic
terms (e.g., void:distinctSubjects, void:distinctObjects), while
others are about the number of semantic objects (e.g., void:classes,
void:properties). In practice, this leads to confusion, as most
people seem to use the semantic properties in a syntactic way. E.g.,
the following RDF snippet will often be characterized as containing
one distinct property, even though it in fact contains two such
properties:
rdfs:subClassOf rdfs:domain rdfs:Class.
On the Semantic Web, it is generally not possible to determine
whether two (syntactic) terms denote the same property or class.
This means that the properties void:properties and void:classes
cannot be specified. For example, the following RDF snippet contains
two distinct class-denoting terms, but this does not imply that the
snippet also references two distinct classes.
<s:s> a <c:c> , <d:d>.
Since in Linked Data it is common for data sources to make (schema) assertions about terms that also appear in other data sources, some other data source may or may not contain the following triple:
<c:c> owl:sameAs <d:d>.
Is it possible to represent the very common case in which one dataset
is serialized into two or more RDF serialization formats? VoID
includes the void:feature property, which seems to be included for
this specific purpose. With this property it is possible to describe
a dataset that has two dump files, one of which is encoded in RDF/XML
while the other is encoded in Turtle. Unfortunately, it is not
possible to express which file uses which format.
<dataset> a void:Dataset;
void:dataDump
<file1>,
<file2>;
void:feature
formats:RDF_XML,
formats:Turtle.
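A sketch of how the same situation might be described with the LDmeta terms defined above: each serialization becomes its own distribution, and the format is recorded on the individual file, so it stays clear which file uses which format. The ex: namespace and the resource names are hypothetical, and modelling each serialization as a separate distribution is an assumption about the intended usage.

```turtle
@prefix ldm: <https://ldm.cc/> .
@prefix ex:  <https://example.org/> .   # hypothetical namespace

ex:dataset a ldm:Dataset ;
  ldm:distribution ex:rdfxmlDistribution , ex:turtleDistribution .

# RDF/XML serialization of the dataset.
ex:rdfxmlDistribution a ldm:Distribution ;
  ldm:file ex:file1 .

ex:file1 a ldm:File ;
  ldm:mediaType "application/rdf+xml" .

# Turtle serialization of the same dataset.
ex:turtleDistribution a ldm:Distribution ;
  ldm:file ex:file2 .

ex:file2 a ldm:File ;
  ldm:mediaType "text/turtle" .
```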
The VoID vocabulary standard contains the following piece of text:
As a general rule, statistics in VoID can always be provided as approximate numbers.
This statement allows the formulation of dataset metadata descriptions that are incorrect. However, it has a far worse and more far-reaching consequence: VoID not only allows inaccurate metadata to exist, it also prevents accurate metadata from being expressed.
Suppose I want to publish a dataset with exactly 1,000 triples, and I want to assert that as a fact:
<dataset> a void:Dataset;
void:triples "1000"^^xsd:nonNegativeInteger.
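A data consumer that reads the above Turtle snippet is unable to determine the exact size of the described dataset, even if the data consumer believes the metadata description to be true: because VoID statistics may always be approximate, the stated value cannot be distinguished from an estimate.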
This README file uses the following RDF prefix declarations: