Home | Getting Started
If you are new to publishing schema.org metadata, here are some general tips for getting started.
- Goal
- Approach
- Prerequisites
- Introduction
- Using schema.org
- Resource Modification Time
- JSON-LD Graph Techniques
To provide a place for the scientific data community to work out how best to implement schema.org and other external vocabularies on web pages by publishing guidance documents. Pull requests and GitHub Issues are welcome!
- To be pragmatic with our use of schema.org and external vocabulary adoption.
- To consider schema.org classes and properties first before considering external vocabularies.
- Use JSON-LD in our guidance documents for simplicity and terseness as compared to Microdata and RDFa. For more, see Why JSON-LD? from the Conventions document.
- Presently, the Google Rich Results Tool enforces use of schema.org classes and properties by displaying an error whenever external vocabularies are used. schema.org proposes linking to external vocabularies using the schema:additionalType property. While this property is defined as a sub-property of rdf:type, its data type is a literal. The Schema.org Validator, however, allows the use of external vocabularies. We encourage the use of JSON-LD '@type' for typing classes to external vocabularies. For more, see Typing to External Vocabularies from the Conventions document.
- See Governance for how we will govern the project.
- See Conventions for guidance on creating/editing guidance documents.
JSON-LD is valid JSON, so standard developer tools that support JSON can be used. For some specific JSON-LD and schema.org help though, there are some other resources.
JSON-LD resources https://json-ld.org
Generating the JSON-LD is best done via libraries like those you can find at https://json-ld.org. There are libraries for JavaScript, Python, PHP, Ruby, Java, C# and Go. While JSON-LD is just JSON and can be generated in many ways, these libraries can generate output that conforms to the JSON-LD specification.
JSON-LD playground https://json-ld.org/playground/
The playground is hosted at the very useful JSON-LD web site. You can explore examples of JSON-LD and view how they convert to RDF, flatten, etc. Note that JSON-LD is not associated with schema.org. It can be used for much more, so most examples at the JSON-LD website don't use schema.org, and the site will NOT check whether you are using schema.org types and properties correctly; it will only check that your JSON-LD is well-formed.
- We assume that you've heard about schema.org and have already decided that it's useful to you.
- We assume that you have a general understanding of what may describe a scientific dataset.
Let's go!
There is an emerging practice to leverage structured metadata to aid in the discovery of web-based resources. Much of this work is taking place in the context (no pun intended) of schema.org. This approach has extended to the resource type Dataset. This page will present approaches, tools and references that will aid in the understanding and development of schema.org in JSON-LD and its connection to external vocabularies. For a more thorough presentation on this, visit the Google AI Blog entry of January 24 2017 at https://ai.googleblog.com/2017/01/facilitating-discovery-of-public.html .
JSON-LD should be incorporated into the landing page HTML inside the <head></head> as a <script> element with a type of application/ld+json.
<html>
<head>
...
<script id="schemaorg" type="application/ld+json">
{
"@context": "https://schema.org/",
"@id": "http://opencoredata.org/id/dataset/bcd15975-680c-47db-a062-ac0bb6e66816",
"@type": "Dataset",
"description": "Janus Thermal Conductivity for ocean drilling ..."
}
</script>
...
</head>
...
</html>
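A harvester reading this markup could pull the JSON-LD out of the page roughly as follows. This is a minimal sketch using only the Python standard library; the `JsonLdExtractor` class and `extract_jsonld` function are illustrative names, not part of any schema.org tooling, and real pages may need more defensive parsing.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content arrives as raw text, possibly in several chunks.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html_text):
    """Return a list of JSON-LD documents embedded in the given HTML text."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.documents


# Sample page mirroring the landing page example above.
LANDING_PAGE = """<html>
<head>
<script id="schemaorg" type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@id": "http://opencoredata.org/id/dataset/bcd15975-680c-47db-a062-ac0bb6e66816",
  "@type": "Dataset",
  "description": "Janus Thermal Conductivity for ocean drilling ..."
}
</script>
</head>
<body></body>
</html>"""

datasets = extract_jsonld(LANDING_PAGE)
```

Because JSON-LD is valid JSON, the extracted text can be handed straight to `json.loads`; checking that it is also valid schema.org is a separate step.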
The context in a JSON-LD document defines the namespaces used in the document and their mappings to URIs when they are referenced using prefix notation. The JSON-LD 1.1 specification provides many rules that impact how the context is loaded and how it is retrieved, but ultimately the goal is to define a context map with the namespace mappings for each vocabulary used in the document. For the schema.org vocabulary specifically, the official namespace is http://schema.org/ (note this is not an https URI), but the context file for schema.org can be retrieved from the https web location at https://schema.org by following the JSON-LD processing rules. For providers, this translates to a few simple recommendations.
- We recommend retrieving the context file from its https location using the following syntax:
{
"@context": "https://schema.org/",
"@type": "Dataset",
"name": "Example dataset title",
...
}
Using this approach, the schema.org namespace will be set to http URIs. For example, the Dataset type will be expanded to http://schema.org/Dataset.
- Should you need to define additional namespaces in your context, it can be done by expanding the context using a JSON array as follows:
{
"@context": [
"https://schema.org/",
{
"prov": "http://www.w3.org/ns/prov#"
}
],
"@type": "Dataset",
"name": "Example dataset title",
"prov:wasDerivedFrom": {
"@id": "https://doi.org/10.xxxx/Dataset-1"
}
}
Note the square brackets, in which the first entry is the URL of a context file to be retrieved, and the second value is a JSON object to be combined with the retrieved context. This approach still retrieves the context from the secure https URL at schema.org, but then adds an additional namespace for the prov vocabulary to the context. Now, terms from the PROV namespace can be referenced using prefix notation (e.g., prov:wasDerivedFrom).
- Additional approaches to defining the context are possible, but users should take care to ensure that the terms within schema.org use the http://schema.org/ namespace as defined in the official schema.org context file. Because contributors to schema.org are working towards accepting https://schema.org/ as an equivalent namespace URI for all terms, processors should treat schema.org terms in the http and https URI spaces as equivalent, but providers might find it safer to continue to use http://schema.org/ as the official namespace for now. This particularly applies when defining a default vocabulary for un-prefixed terms, in which case we recommend using "@vocab": "http://schema.org/" if this is necessary. That said, most users should not need to define @vocab in typical usage.
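The namespace behaviour described in these recommendations can be illustrated with a short sketch. It is deliberately simplified and is not the JSON-LD expansion algorithm; the `PREFIXES` map, `expand_term`, and `normalize_schema_iri` are hypothetical names chosen for this example, and a real processor should rely on a JSON-LD library.

```python
SCHEMA_HTTP = "http://schema.org/"
SCHEMA_HTTPS = "https://schema.org/"

# Hypothetical prefix map mirroring the context used in the example above.
PREFIXES = {
    "schema": SCHEMA_HTTP,
    "prov": "http://www.w3.org/ns/prov#",
}


def expand_term(term, prefixes=PREFIXES, vocab=SCHEMA_HTTP):
    """Expand 'prefix:name' or a bare term into a full IRI (simplified)."""
    if "://" in term:
        return term  # already an absolute IRI
    prefix, sep, local = term.partition(":")
    if sep:
        # Known prefixes are expanded; unknown ones are left untouched.
        return prefixes[prefix] + local if prefix in prefixes else term
    # Bare terms fall back to the default vocabulary.
    return vocab + term


def normalize_schema_iri(iri):
    """Map https://schema.org/ IRIs onto the official http namespace."""
    if iri.startswith(SCHEMA_HTTPS):
        return SCHEMA_HTTP + iri[len(SCHEMA_HTTPS):]
    return iri
```

A consumer comparing types from different providers could pass every expanded IRI through `normalize_schema_iri` so that both URI spaces are treated as equivalent.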
Many harvesters and aggregators depend on the existence of a sitemap.xml file on your site that lists all of the dataset landing pages from your site that you want to be harvested and indexed for search. Google Dataset Search, DataONE, and Geocodes all can make use of a sitemap to more efficiently harvest your site. A sitemap is a simple text file that lists each page that you want harvested. This can contain any webpage, but in this context we specifically want to list pages that contain a schema:Dataset entry to be harvested. Here's an example sitemap.xml file listing two Dataset landing pages, along with their lastmod date:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2GH9BB0Q</loc>
<lastmod>2021-12-06</lastmod>
</url>
<url>
<loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2ST7DZ2Q</loc>
<lastmod>2021-12-07T12:15:05Z</lastmod>
</url>
</urlset>
Note the <lastmod> field, which indicates the last date on which the page was modified. It is formatted as a W3C DateTime and may vary in precision. Most harvesters will use that date along with the HTTP Last-Modified header to determine if a page has changed since the last time that a harvest was attempted. Keeping accurate lastmod values can greatly improve the efficiency of indexing your catalog, as only the few items that have changed will need to be indexed.
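To illustrate how a harvester might use lastmod, the following sketch lists the sitemap entries modified after a previous harvest. It uses only the Python standard library; `parse_w3c_datetime` and `changed_since` are illustrative names, the parser handles only the full-date and date-time forms shown above, and treating a date without a time zone as UTC is an assumption of this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample sitemap mirroring the example above.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2GH9BB0Q</loc>
    <lastmod>2021-12-06</lastmod>
  </url>
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2ST7DZ2Q</loc>
    <lastmod>2021-12-07T12:15:05Z</lastmod>
  </url>
</urlset>"""


def parse_w3c_datetime(text):
    """Parse YYYY-MM-DD or a full date-time into an aware UTC datetime."""
    value = text.strip()
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: values without a time zone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def changed_since(sitemap_xml, last_harvest):
    """Return the <loc> of each entry modified after last_harvest."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # Entries without lastmod cannot be skipped safely, so include them.
        if lastmod is None or parse_w3c_datetime(lastmod) > last_harvest:
            changed.append(loc)
    return changed


last_harvest = datetime(2021, 12, 6, 12, 0, tzinfo=timezone.utc)
changed = changed_since(SITEMAP, last_harvest)
```

With the sample data, only the second landing page is newer than the harvest time, so only that page would need to be fetched again.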
Location: The sitemap.xml file can be located anywhere on your site that is above the path in the hierarchy in which your pages are listed. Typically, the sitemap.xml is placed at the root of the site, but other locations can be used as well. A great way to indicate to harvesters where your sitemap is located is to include it in your robots.txt file, which is basically an instruction manual for harvesters, at the root of your web site. For example, you might have a robots.txt file with the following contents:
User-agent: *
Sitemap: https://arcticdata.io/sitemap1.xml
Sitemaps are limited to 50,000 records and 50MB, so if your site is larger than that you can break up your sitemap into multiple files, linked together using a sitemap index. Details about sitemap-index.xml and other aspects of sitemaps are provided on the https://sitemaps.org site, as well as in the Google sitemap documentation.
By providing a sitemap and advertising its location, you make it simple for harvesters to find and index your Dataset listings.
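As a rough illustration of the discovery step, a harvester could read the Sitemap entries from a robots.txt file like this. The `sitemap_urls` function is a hypothetical helper written for this example; it only looks for Sitemap lines and ignores every other robots.txt directive.

```python
def sitemap_urls(robots_txt):
    """Return the URLs advertised on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split only on the first colon so the URL scheme stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


ROBOTS_TXT = """User-agent: *
Sitemap: https://arcticdata.io/sitemap1.xml
"""

advertised = sitemap_urls(ROBOTS_TXT)
```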
For each schema.org type, such as Person or Event, there are fields that let you specify more information about that type. Each of these fields has an expected data type that is defined in the documentation, as you can see in Figure 1.
Figure 1. schema.org field data types
The expected data type for each field appears in the middle column. The left column is the name of the field, the middle column is the data type, and the right column is the description of the field.
Every data type is either a resource or a literal. Resources refer to other schema.org types. For example, a Dataset type has a field called 'author' whose data type can be either a 'Person' or an 'Organization'. Because 'Person' and 'Organization' are other schema.org "types" with their own fields, they are called resources. In JSON-LD, you specify resources by using curly brackets {}:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}
In the JSON-LD above, the 'author' is a resource of type 'Person'. Fields that simply have a value are called literal data types. For example, the 'Person' type above has a 'name' of "Jane Goodall" - a literal text value.
Schema.org defines six literal, or primitive, data types: Text, Number, Boolean, Date, DateTime, and Time. Text has two special variations: URL, and HTML for text that should be interpreted as HTML.
When using schema.org, literal data types are not specified using curly brackets {} as these are reserved for specifying 'objects' or 'resources' such as other schema.org types like Person, Organization, etc. First, let's see how to use a primitive data type by using fields of CreativeWork, the superclass for Dataset.
Imagine we want to say the name of our Creative Work is "Passenger Manifest for H.M.S. Titanic". The name field of CreativeWork specifies that it expects Text as the data type. We would use it in this way:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic"
}
Let's say we want to specify the version number of our manifest using the version field of CreativeWork, which expects a Number. To specify numbers in JSON-LD, we omit the quotation marks surrounding the value:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1
}
Now, let's specify the URL of our manifest using the url field of CreativeWork, an inherited field from Thing. This field expects a valid URL represented as Text:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv"
}
Using the Boolean value, we can specify that our manifest is accessible for free using the field isAccessibleForFree by using the text true or false and omitting the quotes:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true
}
To specify the datePublished, which allows either a Date or DateTime, as a Date, we can use any ISO 8601 date format by wrapping the date in double-quotes:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29"
}
To specify the dateModified as a DateTime, we must follow the ISO 8601 format for combining date and time representations using the form [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z"
}
Time is a rarely-used data type because it must represent a point in time recurring on multiple days following the XML Schema definition using the form hh:mm:ss[Z|(+|-)hh:mm] (see XML schema for details).
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z"
}
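When the JSON-LD is generated by code rather than written by hand, the literal types above map naturally onto native values. The following sketch, assuming Python's standard `json` module and an illustrative `record` dictionary, shows numbers and booleans serialized without quotes and dates serialized as ISO 8601 strings.

```python
import json
from datetime import date, datetime, timezone

record = {
    "@context": "https://schema.org/",
    "@type": "CreativeWork",
    "name": "Passenger Manifest for H.M.S. Titanic",  # Text
    "version": 1,                                      # Number: no quotes
    "isAccessibleForFree": True,                       # Boolean: becomes true
    # Date and DateTime have no native JSON type, so use ISO 8601 strings.
    "datePublished": date(2018, 7, 29).isoformat(),
    "dateModified": datetime(2018, 7, 30, 14, 30, tzinfo=timezone.utc).isoformat(),
}

jsonld_text = json.dumps(record, indent=2)
```

Note that `isoformat()` writes the UTC offset as `+00:00` rather than `Z`; both are valid ISO 8601 time zone designators.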
The HTML data type is a special variation of the Text data type. In some cases where Text is the expected data type, our actual data type may be HTML (because we are dealing with web pages). In this case, the schema.org JSON-LD context defines HTML to mean rdf:HTML, the data type for specifying that a string of text should be interpreted as HTML. Let's say that we have a description of our manifest and want to use the description field, but we have HTML inside that text. Using a plain text value as we did above for the name field, we would specify the description as:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z",
  "description": "<h3>Acquisition</h3><p>The data was acquired from an office outside of <a href=\"https://en.wikipedia.org/wiki/New_York_City\">New York City</a>."
}
However, to specify that the description field should be interpreted as HTML, you specify description as a resource, setting the @type of that resource to "HTML" and placing the HTML string in a JSON-LD property @value:
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z",
  "description": {
    "@type": "HTML",
    "@value": "<h3>Acquisition</h3><p>The data was acquired from an office outside of <a href=\"https://en.wikipedia.org/wiki/New_York_City\">New York City</a>."
  }
}
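HTML strings often contain double quotes, which must be escaped inside JSON. A small sketch of producing the typed value programmatically follows; `html_literal` is a hypothetical helper name, and the serializer takes care of the escaping.

```python
import json


def html_literal(markup):
    """Wrap markup as a typed value; the schema.org context maps HTML to rdf:HTML."""
    return {"@type": "HTML", "@value": markup}


record = {
    "@context": "https://schema.org/",
    "@type": "CreativeWork",
    "name": "Passenger Manifest for H.M.S. Titanic",
    "description": html_literal(
        "<p>The data was acquired from an office outside of "
        '<a href="https://en.wikipedia.org/wiki/New_York_City">New York City</a>.</p>'
    ),
}

# json.dumps escapes the attribute quotes inside the HTML string.
serialized = json.dumps(record, indent=2)
```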
All schema.org resources should make use of the @type property, which 'classifies' the resource as a specific type. For example, an un-typed resource would look like:
{
  "@context": "https://schema.org/",
  "name": "My Dataset"
}
Even though the above resource has a name of 'My Dataset', harvesters are unaware that your intent was to classify it as a Dataset. Un-typed resources are not valid schema.org resources, and so they require the @type property:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "My Dataset"
}
In some cases, it is useful to multi-type a resource. One example of this may be a data repository. A data repository typically functions both as an 'Organization' that employs people and has an address, and as a 'Service' to its user community. To assign multiple types to a resource, we use JSON arrays:
{
  "@context": "https://schema.org/",
  "@type": ["Organization", "Service"],
  "name": "My Data Repository"
}
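Because @type may be absent, a single string, or an array, consumers usually normalize it before testing for a type. A minimal sketch, with `declared_types` and `has_type` as illustrative helper names:

```python
def declared_types(resource):
    """Return the @type of a resource as a list, whatever form it was given in."""
    declared = resource.get("@type")
    if declared is None:
        return []  # un-typed resource
    if isinstance(declared, str):
        return [declared]  # single type
    return list(declared)  # multi-typed resource


def has_type(resource, type_name):
    """Check whether the resource declares the given type name."""
    return type_name in declared_types(resource)


untyped = {"@context": "https://schema.org/", "name": "My Dataset"}
dataset = {"@context": "https://schema.org/", "@type": "Dataset", "name": "My Dataset"}
repository = {
    "@context": "https://schema.org/",
    "@type": ["Organization", "Service"],
    "name": "My Data Repository",
}
```

This compares the compact type names only; it does not expand them against the context, so it assumes all documents use the same schema.org context.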
All schema.org types may be found here.
An indication of when a resource was modified is valuable to a consumer for a variety of reasons.
A consumer tracking changes in a collection of SO:Dataset or similar resources being advertised with a sitemap.xml or similar mechanism has at least three timestamps that can be examined to determine if an already retrieved resource may have been modified: the schema.org/dateModified property in the JSON-LD, the Last-Modified time reported by the web server, and the <lastmod> time that may be reported in a sitemap.xml document.
The schema.org/dateModified value should be considered authoritative for indicating when the resource was modified. The Last-Modified header should reflect the corresponding schema.org/dateModified entry. This provides an important hint for consumers, for example as to whether a cached copy of a resource should be updated. Similarly, the <lastmod> entry should reflect the Last-Modified header and the schema.org/dateModified value.
A typical pattern for a consumer interested in synchronizing a cache of resources is:
- Examine the sitemap for new or updated entries using hints from <lastmod>.
- Retrieve the resource directly or preview it with an HTTP HEAD request. A Last-Modified header provides a hint as to whether the resource should be retrieved.
- Examine the schema.org/dateModified property of the resource(s) extracted from the retrieved resource.
Providing accurate hints early in the process can reduce requirements for effectively sharing data resources.
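The decision a consumer makes from these hints might be sketched as follows. This is one possible reading of the pattern above, not a prescribed algorithm: `should_update` is a hypothetical function that uses the most authoritative hint available (dateModified, then Last-Modified, then lastmod), and treating a time without a zone as UTC is an assumption of this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_iso(text):
    """Parse an ISO 8601 date or date-time into an aware UTC datetime."""
    value = text.strip()
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: values without a time zone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def should_update(cached_at, lastmod=None, last_modified_header=None,
                  date_modified=None):
    """Decide whether a cached copy is stale, given whichever hints are known."""
    if date_modified is not None:
        # schema.org/dateModified is treated as authoritative.
        return parse_iso(date_modified) > cached_at
    if last_modified_header is not None:
        return parsedate_to_datetime(last_modified_header) > cached_at
    if lastmod is not None:
        return parse_iso(lastmod) > cached_at
    return True  # no hints at all: refresh to be safe


cached_at = datetime(2018, 12, 1, tzinfo=timezone.utc)
```

In practice the cheaper hints are consulted first (sitemap, then a HEAD request) and the resource is only downloaded and parsed when they suggest a change.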
Each schema.org instance derived from schema.org/CreativeWork may have a dateModified property to indicate "The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed." This property should be provided with any instance of schema.org/Dataset or any other schema.org entity published in a landing page or through other mechanisms. The JSON spec does not include a built-in type for date time values, but the general consensus and a sensible practice is to represent a date time value as a time zone aware ISO 8601 formatted string. For example:
{
"dateModified": "2018-12-10T13:45:00.000Z"
}
A schema.org instance is typically embedded in a landing page or may be accessed directly as a JSON-LD document over the HTTP protocol. HTTP resource providers (i.e. web servers) may include a Last-Modified header which contains the date and time at which the origin server believes the resource was last modified. The format for the date value follows the RFC 2616 specification. For example:
Last-Modified: Mon, 10 Dec 2018 13:45:00 GMT
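A server that already stores dateModified can derive a matching Last-Modified value from it. The sketch below uses `email.utils.format_datetime` from the Python standard library; `http_last_modified` is an illustrative helper name, and sub-second precision is dropped because the HTTP date format does not carry it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def http_last_modified(date_modified):
    """Convert an ISO 8601 dateModified string to an HTTP date string."""
    value = date_modified.strip()
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    moment = datetime.fromisoformat(value)
    if moment.tzinfo is None:
        # Assumption: values without a time zone are interpreted as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    # HTTP dates are always expressed in GMT.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


header_value = http_last_modified("2018-12-10T13:45:00.000Z")
```

Here `header_value` is the string to place after `Last-Modified: ` in the response headers.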
A sitemap.xml document provides a mechanism for a resource server to advertise available resources. Each <url> element may include a <lastmod> tag to indicate when the resource identified by the <url>/<loc> was last modified. The specification is fairly loose, indicating that a date in the W3C Datetime format of YYYY-MM-DD may be sufficient. However, for the purposes of content synchronization, a higher precision is desirable and should be provided where possible. For example:
2018-12-10T13:45:00.000Z
JSON-LD documents represent a graph model, even though at times that graph is implicit rather than being named. Here are some techniques that may be useful when constructing such graphs.
Unlike plain JSON, collections in JSON-LD are unordered [1, 2]. In cases where ordering of items needs to be preserved, we can use the @list keyword to specify that order should be preserved for a collection. Ordered lists would be important, for example, when a list of authors or creators should be ordered as intended when rendering a view of the metadata, or when a list of bounding box coordinates in an array needs to come in a particular order.
In the following example, the list of creator items is not ordered, and so client tools could return the creator names in any order, and different tools may return them in different orders. This would be problematic for building a citation, for example.
Example 1. Ordering for this list of creators will not be preserved:
{
"@context": "https://schema.org/",
"@id": "unordered_01",
"@type": "Dataset",
"creator": [
{
"@id": "https://www.sample-data-repository.org/person/51317",
"@type": "Person",
"name": "Dr Uta Passow"
},
{
"@id": "https://www.sample-data-repository.org/person/50663",
"@type": "Person",
"name": "Dr Mark Brzezinski"
}
]
}
To order a list, use the JSON-LD @list keyword, as shown in Example 2:
Example 2. Order will be preserved for this list of creators:
{
"@context": "https://schema.org/",
"@id": "order_01",
"@type": "Dataset",
"creator": {
"@list": [
{
"@id": "https://www.sample-data-repository.org/person/51317",
"@type": "Person",
"name": "Dr Uta Passow"
},
{
"@id": "https://www.sample-data-repository.org/person/50663",
"@type": "Person",
"name": "Dr Mark Brzezinski"
}
]
}
}
Ordering may be specified globally within the document by specifying the container type in a context. For example, after retrieving the context file from schema.org, we can define the schema:creator to be a list container globally in the document using the @container property:
Example 3. Ordering of a list of creators is preserved anywhere such a list appears within the context.
{
"@context": [
"https://schema.org/",
{
"creator": {
"@container": "@list"
}
}
],
"@id": "order_02",
"@type": "Dataset",
"creator": [
{
"@id": "https://www.sample-data-repository.org/person/51317",
"@type": "Person",
"name": "Dr Uta Passow"
},
{
"@id": "https://www.sample-data-repository.org/person/50663",
"@type": "Person",
"name": "Dr Mark Brzezinski"
}
]
}
With this technique, ordering can be set once in the context using @list, and then order will be preserved any time that concept is used in the document.
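When records are generated from a database, the explicit @list form is easy to produce with a small helper. The sketch below is only an illustration: `ordered` is a hypothetical function name, and it builds the explicit form from Example 2 without performing any JSON-LD processing.

```python
import json


def ordered(items):
    """Wrap a sequence in a JSON-LD @list so consumers preserve its order."""
    return {"@list": list(items)}


creators = [
    {
        "@id": "https://www.sample-data-repository.org/person/51317",
        "@type": "Person",
        "name": "Dr Uta Passow",
    },
    {
        "@id": "https://www.sample-data-repository.org/person/50663",
        "@type": "Person",
        "name": "Dr Mark Brzezinski",
    },
]

dataset = {
    "@context": "https://schema.org/",
    "@id": "order_01",
    "@type": "Dataset",
    "creator": ordered(creators),
}

jsonld_text = json.dumps(dataset, indent=2)
```

If the context already declares creator as a list container, as in Example 3, a plain array should be used instead of this wrapper.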