Commit

Merge pull request #12 from EUDAT-B2FIND/guidelines
Guidelines
mkurtz authored Aug 8, 2017
2 parents 39a60e3 + a9c9157 commit a7fb2c8
Showing 3 changed files with 224 additions and 84 deletions.
76 changes: 65 additions & 11 deletions ckanext/b2find/templates/ckanext/guidelines/harvesting.html
@@ -11,22 +11,76 @@
<div class="module-content">
<h1 class="page-heading">{{ _('Harvesting of Metadata') }}</h1>

<h2><strong class="headerboxbodylogo">Contains</strong></h2>
<h2><strong class="headerboxbodylogo">Contents</strong></h2>
<br/>&nbsp;&nbsp;&nbsp;<a href="#channels">Harvesting channels</a>
<br/>&nbsp;&nbsp;&nbsp;&nbsp;<a href="#oai-pmh">OAI-PMH</a>
<br/>&nbsp;&nbsp;&nbsp;&nbsp;<a href="#json-api">JSON-API</a>
<br/>&nbsp;&nbsp;&nbsp;&nbsp;<a href="#csw">CSW2.0</a>
<br/>&nbsp;&nbsp;&nbsp;<a href="#initial_uptake">Initial uptake of a new data provider</a>
<br/>&nbsp;&nbsp;&nbsp;<a href="#operational_ingestion">Synchronous and operational Ingestion</a>

<h3><a name="channels">Harvesting channels</a></h3>
<h2><a name="channels">Harvesting channels</a></h2>

<p>EUDAT-B2FIND supports several protocols to fetch metadata records from data provider sites.</p>
<p>Harvesting is the process of automatically fetching remote metadata. This section describes how EUDAT-B2FIND harvests metadata records from data provider sites. While OAI-PMH, as the de facto standard for metadata harvesting, is preferred, B2FIND also supports other APIs, as described in the section <a href="#channels">Harvesting channels</a>. Once one of these transfer methods has been successfully implemented, B2FIND first performs an initial uptake of a few test samples to analyse their content, as described in the section <a href="#initial_uptake">Initial uptake of a new data provider</a>. As soon as the harvesting and mapping have been consolidated and the data provider gives its consent, the metadata are published in the B2FIND database and an operational and stable ingestion process is established (see the section <a href="#operational_ingestion">Synchronous and operational Ingestion</a>).</p>

<h3><a name="oai-pmh">OAI-PMH</a></h3>

OAI-PMH is the metadata harvesting protocol preferably used by EUDAT-B2FIND to fetch metadata directly from the data providers within research communities. The simplicity of the protocol allows a controlled and easy-to-manage transfer of metadata, and only a little information must be provided to enable B2FIND to perform the harvesting process:
<ul>
<li>OAI endpoint: This is the URL of the OAI provider server at the data provider site, which must be open for OAI-PMH read requests.</li>
<li>OAI mdprefix: This is the OAI acronym for the metadata schema in which the provided XML records are encoded.</li>
<li>OAI sets (optional): It is recommended to group your records in subsets, because this simplifies controlled harvesting.</li>
</ul>

<h4>Example</h4>
To harvest all Dublin Core records (OAI mdprefix is <em>oai_dc</em>) from the subset <em>ANDS.CENTRE-1</em> of the OAI provider of DataCite (<em>oai.datacite.org/oai</em>), we submit an HTTP request with the ‘verb’ <em>ListRecords</em> and the following OAI options set:


<code>https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=ANDS.CENTRE-1</code>
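<p>As an illustration only, the request above could be assembled from the three pieces of information listed earlier (endpoint, mdprefix, optional set). This is a minimal Python sketch, not the B2FIND harvester; the function name <code>build_list_records_url</code> is a hypothetical helper introduced here.</p>

```python
from urllib.parse import urlencode


def build_list_records_url(endpoint, metadata_prefix, oai_set=None):
    """Return an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:
        # The set is optional; leave it out to address the whole repository.
        params["set"] = oai_set
    return endpoint + "?" + urlencode(params)


url = build_list_records_url(
    "https://oai.datacite.org/oai", "oai_dc", "ANDS.CENTRE-1"
)
print(url)
```

<p>Omitting the third argument yields a request without the <code>set</code> option.</p>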

<ul>
<li>Preferably the OAI-PMH v2.0 protocol is used for harvesting metadata directly from the data providers of the research communities. If needed, EUDAT also provides support for setting up an OAI-PMH provider. How this can be done by installing the jOAI software is also described in module 02 of the B2FIND Training

EUDAT-B2FIND preferably uses the OAI-PMH v2.0 protocol for harvesting metadata directly from the data providers of the research communities.
If needed, EUDAT also provides support for setting up an OAI-PMH provider. How this can be done by installing the jOAI software is also described in module 02 of the <a href="https://github.com/EUDAT-Training/B2FIND-Training"> B2FIND Training </a> </p>
</li>
<li>Furthermore metadata are fetched via the JSON-API ...</li>
<li>For geo portals which provide a corresponding interface, harvesting via the protocol CSW2.0 is possible. The implementation is under development.</li>
</ul>
<li>

<p>The community Seadatanet ( see seadatanet.org ) exposes georeferenced metadata via the base geonetwork portal with URL endpoint http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET . To retrieve the ISO19139 XML records (namespace specification gmd:MD_Metadata ) B2FIND submits a GetRecords request as follows :
http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET?SERVICE=CSW&REQUEST=GetRecords&VERSION=2.0.2&typeNames=gmd:MD_Metadata
If necessary, EUDAT will help data providers to enable OAI-PMH harvesting of their metadata. Please also check module 02 of the <a href="..." > B2FIND Training materials </a>, where you will find a step-by-step guide to set up and configure an OAI server. For detailed documentation of the OAI-PMH protocol, we refer to <a href="http://www.openarchives.org/OAI/openarchivesprotocol.html" > http://www.openarchives.org/OAI/openarchivesprotocol.html </a>.</p>

<h3><a name="json-api">JSON-API</a></h3>

<p>Some data providers offer their metadata encoded as JSON records, which can be retrieved, queried and browsed via a REST API. Such an API is generally RESTful and returns results in JSON, as it follows the JSON API specification.</p>
<h4>Example</h4>
The community GBIF (see gbif.org) provides their metadata via the JSON-API at the base URL http://api.gbif.org/v1.
The following request retrieves the first 100 JSON records from the repository:
<code>http://api.gbif.org/v1/dataset?offset=0&limit=100</code>
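<p>To go beyond the first 100 records, the same request can be repeated with an increasing <code>offset</code>. The following Python sketch only generates the request URLs; the helper name <code>page_urls</code> and the total of 250 records are assumptions made for illustration, not values taken from GBIF.</p>

```python
from urllib.parse import urlencode


def page_urls(base_url, total, limit=100):
    """Yield one request URL per page of at most `limit` records."""
    for offset in range(0, total, limit):
        yield base_url + "?" + urlencode({"offset": offset, "limit": limit})


# Hypothetical total of 250 records, i.e. three pages of up to 100 records.
urls = list(page_urls("http://api.gbif.org/v1/dataset", total=250))
for u in urls:
    print(u)
```

<p>In practice the total would be read from the API response rather than fixed in advance.</p>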

<h3><a name="csw">CSW</a></h3>


<p>Catalog Service for the Web (CSW) is a standard for exposing a catalogue of geospatial records in XML on the Internet (over HTTP). The catalogue is made up of records that describe geospatial data and services. B2FIND uses a CSW 2.0 implementation to harvest XML records from so-called GeoNetwork portals.</p>

<h4>Example</h4>
<p>The community Seadatanet (see seadatanet.org) exposes georeferenced metadata via the base geonetwork portal with the URL endpoint <code>http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET</code>. To retrieve the ISO19139 XML records (namespace specification <code>gmd:MD_Metadata</code>), B2FIND submits a <code>GetRecords</code> request as follows:

<code>http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET?SERVICE=CSW&REQUEST=GetRecords&VERSION=2.0.2&typeNames=gmd:MD_Metadata</code></p>
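<p>To make the structure of this request explicit, it can be split into the endpoint and its key-value parameters. This is a small Python sketch; <code>split_request</code> is a hypothetical helper and not part of any CSW library.</p>

```python
from urllib.parse import parse_qsl, urlsplit


def split_request(url):
    """Split a request URL into its endpoint and a dict of query parameters."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return endpoint, dict(parse_qsl(parts.query))


endpoint, params = split_request(
    "http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET"
    "?SERVICE=CSW&REQUEST=GetRecords&VERSION=2.0.2&typeNames=gmd:MD_Metadata"
)
print(endpoint)
for key, value in params.items():
    print(key, "=", value)
```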

<h2><a name="initial_uptake">Initial uptake of a new data provider</a></h2>

<p>Once one of the harvesting methods has been deployed successfully and is working, B2FIND starts with an initial harvesting of a few metadata records. These samples are analysed, the metadata elements are mapped to the correct database indices, and the metadata records are uploaded to a B2FIND test / development server.</p>

<p>When both harvesting and mapping are at least functionally working, some of the issues already mentioned in the paragraph ‘ProvidingMetadata’ have to be negotiated with the data provider:
<ul>
<li>Scope and extent : Shrink the metadata exposed to B2FIND to those which refer to ‘research data’. Best practice would be to gather all records, which are foreseen to get published in B2FIND, in dedicated subsets.</li>
<li>Grouping and partitioning : Choose the subsets (e.g. OAI sets) which should be harvested (in some cases whole subsets can be assigned to a ‘Discipline’ or grouped into ‘sub-communities’ in the B2FIND portal).</li>
<li>
Selection, assigning and mapping : Check the quality of the mapping of your specific fields to the B2FIND metadata schema (see paragraph ‘Mapping onto EUDAT-B2FIND Schema’ ).</li>
</ul>
</p>

<h2><a name="operational_ingestion">Synchronous and Operational Ingestion</a></h2>

<p>In the long term it is important not only to have a reliable and sustainable harvesting mechanism established, but also to implement a frequent harvesting schedule. This ensures sufficient synchronicity between the provider (community) database and the EUDAT service (B2FIND).
With OAI-PMH, the parameter ‘from’ can be used to harvest only records which were newly created or changed during a given period. If, for instance, an update interval of once a week is agreed, B2FIND establishes a cron job that is triggered on a weekly basis with the option ‘from’ set to a date one week earlier.</p>
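<p>The weekly update described above can be sketched as follows. This is an assumption-laden Python illustration, not the operational B2FIND job: the helper name <code>incremental_url</code> is hypothetical, the endpoint and metadata prefix are reused from the OAI-PMH example, and the run date is fixed so that the result is reproducible.</p>

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def incremental_url(endpoint, metadata_prefix, today, days=7):
    """Build a ListRecords URL restricted to records changed in the last `days` days."""
    since = today - timedelta(days=days)
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # OAI-PMH accepts dates in the form YYYY-MM-DD for 'from'.
        "from": since.isoformat(),
    }
    return endpoint + "?" + urlencode(params)


# A scheduled job would pass date.today(); a fixed date is used here.
url = incremental_url("https://oai.datacite.org/oai", "oai_dc", date(2017, 8, 8))
print(url)
```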


</article>
{% endblock %}
28 changes: 21 additions & 7 deletions ckanext/b2find/templates/ckanext/guidelines/introduction.html
@@ -15,8 +15,10 @@ <h4><i class="icon-wrench"></i> UNDER CONSTRUCTION <i class="icon-wrench"></i></
</div>

<div class="module-content">
<h1 class="page-heading">{{ _('Introduction') }}</h1>

<p>Welcome to the guidelines of the metadata service EUDAT-B2FIND for data providers, which help repository managers to publish metadata of research data in the EUDAT metadata catalogue B2FIND. In particular, information about the requirements for successful integration in B2FIND is provided.</p>

<p>Welcome to the guidelines of the metadata service EUDAT-B2FIND for data providers. These guidelines are intended to provide information about the requirements for successful integration in B2FIND.</p>

<!--
<h2><strong class="headerboxbodylogo">Contents</strong></h2>
@@ -37,23 +39,27 @@ <h3><a name="audience">Intended Audience</a></h3>

<h3><a name="audience">Objective</a></h3>

<p>EUDAT-B2FIND gathers together metadata related to research output of many heterogeneous sources, with the aim of providing a discovery portal allowing search over a wide cross-disciplinary scope and access the underlying data collections. </p>

<p>The B2FIND guidelines will provide instructions for data providers to expose their metadata to the B2FIND catalogue.</p>
<p>EUDAT-B2FIND gathers together diverse metadata related to research output of many heterogeneous sources, with the aim of providing a unified discovery portal allowing widespread and cross-disciplinary search and access to the underlying data collections. </p>

<p>The B2FIND guidelines provide instructions to be followed and policies to be fulfilled during the establishment of the associated ingestion workflow.
</p>


<h3><a name="principles">Open research data principles</a></h3>

<p>In general EUDAT advocates 'best practices' which follow the so-called FAIR principles. We refer here to the paper <a href="http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf"> Guidelines on FAIR Data Management in Horizon 2020 </a>. Amongst other principles, the concordat promotes:
<p>EUDAT advocates 'best practices' which follow the so-called FAIR principles. We refer here to the paper <a href="http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf"> Guidelines on FAIR Data Management in Horizon 2020 </a>. Amongst other principles, the concordat promotes:
<ul>
<li>the availability of data supporting scholarly publications,</li>
<li>the use of data repositories, </li>
<li>the value of data curation to enable data access and reuse,</li>
<li> support for developing researchers’ data skills,</li>
<li> cultural norms of academia that ensure individuals can gain credit for data sharing.</li>
</ul>
</p>

As far as these principles affect metadata, we address them in the following sections.</p>
<p>To the extent that these principles affect metadata, we address them in the following sections.</p>
<p>In general, great importance is attached to providing a low-barrier approach that allows easy integration in B2FIND.</p>

<h3><a name="content">Contents of the Guidelines</a></h3>
<p>The guidelines are divided into three thematic paragraphs:
@@ -66,16 +72,24 @@ <h3><a name="content">Contents of the Guidelines</a></h3>

<h3><a name="workflow">B2FIND Workflow</a></h3>

<p> The associated ingestion workflow is divided in the three sub processes harvesting, mapping and uploading as schematically shown in the following figure.</p>
<p>The B2FIND metadata ingestion workflow is schematically shown in the following figure.</p>

<figure>
<img border="1" vspace="1" class="left" src="/images/B2FIND_Workflow.png" width="600" height="120" alt="B2FIND ingestion workflow">
<figcaption> The B2FIND ingestion workflow</figcaption>
</figure>

<p>Apart from the first (MD Generation) and last (MD Uploading) steps, the sub-processes ‘MD Providing’, ‘MD Harvesting’ and ‘MD Mapping’ correspond to the paragraphs of the guidelines.</p>
<p>While we do not go into the technical details, we refer
<ul>
<li>data managers who are interested in the step by step implementation of the whole MD ingestion workflow to the <a href="https://github.com/EUDAT-Training/B2FIND-Training"> B2FIND-Training </a> and </li>
<li>developers who are interested in the underlying software and the source code to the GitHub repository <a href="https://github.com/EUDAT-B2FIND"> EUDAT-B2FIND </a>.</li>

<h3><a name="versions">Versions</a></h3>
<ul>
<li>1.0 May 2017 - Initial publication</li>
<li>1.0 August 2017 - Initial publication on <a href="http://b2find.eudat.eu"> the productive B2FIND instance </a></li>
<li>0.3 June 2017 - Initial publication on <a href="http://trng-b2find.eudat.eu"> the training instance </a>.</li>
<li>0.2 May 2017 - Reviewed version and first draft for web pages</li>
<li>0.1 February 2017 - Initial internal reviewed document</li>
<li>0.0.2 January 2017 - added changes and proposals - still working paper</li>
<li>0.0.1 December 2016 - Early draft, B2FIND team.</li>