\documentclass[a4paper,12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{enumitem}
\usepackage{geometry}
\usepackage{natbib}
\geometry{margin=1in}
\newcommand{\emailauthors}{\href{mailto:[email protected],[email protected]}{[email protected], [email protected]}}
\bibliographystyle{abbrvnat}
\setcitestyle{authoryear,open={(},close={)}}
\title{How to Publish UNIS Data - A Step by Step Guide}
\author{
Luke Marsden, Stein Haaland \\
\emailauthors
}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
The most important product of research is knowledge. At UNIS, most of the knowledge is derived from in-situ observations of nature during field work. To preserve this information, UNIS has teamed up with Svalbard Integrated Earth Observation System (SIOS) to make scientific data available and easily accessible according to the Findable Accessible Interoperable Reusable (FAIR) principles \citep{wilkinson2016fair}.
This document outlines a step-by-step procedure to go from digital measurements to data availability via the SIOS portal.
\end{abstract}
\tableofcontents
\section*{Version Information}
\label{sec:version-info}
\begin{itemize}
\item Version 1.0 - Initial release - \today
\end{itemize}
\newpage
\section{Data Collection, Preparation and Quality Control}
\label{sec:data-collection-quality-control}
\subsection{Think ahead}
It is a good idea to have publishing in mind already when data acquisition takes place, so record where and when the data were obtained, the equipment used, the conditions, etc., and whether the data can be classified as open (UNIS Data Classification - see \url{https://unissvalbard.sharepoint.com/Data%20Handling%20Science/ScienceDataGuidelines.pdf?web=1}). Also make sure your observations are properly calibrated and have proper units. Erroneous data should not be archived, and uncalibrated datasets should be clearly labelled as such in the metadata.
\subsection{Consistent and structured files}
Converting customised files to a FAIR-compliant data format can be time-consuming. However, time can be saved by ensuring that the files are well structured and populated in a consistent way.
\begin{itemize}
\item Be consistent in how you are entering values.
\begin{itemize}
\item Don't mix numbers and characters in the same column
\item Have a convention for missing values. Either leave the cells blank or use some fill value that you indicate elsewhere in your file
\end{itemize}
\item Use clear column headers. Don't use abbreviations.
\item Specify the units for each column in the headers or in a separate file.
\item Use consistent naming conventions for variables and also file names.
\item Cell formatting in Excel will disappear when you load the data into Python, R, Matlab, or other software that you might use to work with your data. Don't rely on Excel formatting to make your data understandable.
\end{itemize}
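The points above can be sketched in a short, self-contained example using Python's standard library. The column names and the fill value \texttt{-999.0} are illustrative choices, not values prescribed by UNIS; the point is that one convention is chosen, declared once, and applied consistently:

```python
import csv
import io

# Hypothetical columns with units in the headers, and a single fill value
# (-999.0) for missing temperatures -- both illustrative choices, declared
# once and used consistently throughout the file.
header = ["time_utc", "depth_m", "sea_water_temperature_degC"]
rows = [
    ["2024-06-07T13:10:23Z", "10.12", "5.33"],
    ["2024-06-07T13:11:23Z", "10.56", "-999.0"],  # missing value -> fill value, not "n/a"
    ["2024-06-07T13:12:23Z", "10.23", "5.25"],
]

# Write and read back via an in-memory buffer, standing in for a file on disk
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)

buf.seek(0)
reader = csv.DictReader(buf)
# Every value parses as a float because the column never mixes numbers
# with text such as "missing" or "n/a"
temps = [float(row["sea_water_temperature_degC"]) for row in reader]
valid = [t for t in temps if t != -999.0]
print(valid)  # [5.33, 5.25]
```

Because the missing-value convention is consistent, a data user can filter out fill values with one line rather than handling a mixture of blanks, text labels, and numbers.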
\subsection{Use templates}
If an individual uses the same template for multiple data collections, they can write some code to automate converting their data. If communities develop templates, code can be shared or software can be developed for converting the templates. Data in the templates themselves can also be shared and more easily understood by other members of the community. Both data sharing and conversion to CF-NetCDF or Darwin Core Archives can be further simplified if the CF conventions and/or Darwin Core terms are considered when designing templates.
If you will later be creating a Darwin Core Archive or CF-NetCDF file and you like to work with spreadsheets, consider using the Nansen Legacy template generator \citep{templategenerator, marsden2024nansen}.\\
\url{https://www.nordatanet.no/aen/template-generator}.\\
Key features:
\begin{itemize}
\item Separate configurations to help you create either a Darwin Core Archive or CF-NetCDF file
\item Select from a full list of Darwin Core terms and CF standard names to use as column headers
\item Outlines required and recommended data and metadata terms
\item Descriptions for each term as notes when you select a cell
\item Cell restrictions to prevent you from entering invalid values
\end{itemize}
If the template generator helps you and your work, consider citing it just like you would a paper or data publication.
%As a very simple example, we use a time series of sea temperatures taken outside Svalbard. The measurement series looks something like:
%
%\begin{table}[h!]
%\centering
%\caption{Example time series of sea temperatures}
%\label{table:sea-temperatures}
%\begin{tabular}{lcccc}
%\hline
%\textbf{Time UTC} & \textbf{Depth} & \textbf{Latitude} & \textbf{Longitude} & \textbf{Temperature} \\
%\hline
%2024-06-07 13:10:23 & 10.12 & 78.3122 & 8.1012 & 5.33 \\
%2024-06-07 13:11:23 & 10.56 & 78.3163 & 8.1032 & 5.27 \\
%2024-06-07 13:12:23 & 10.23 & 78.3196 & 8.1061 & 5.25 \\
%\hline
%\end{tabular}
%\end{table}
\section{Suitable Data Formats}
\label{sec:suitable-data-formats}
\subsection{Why choice of data format matters}
As a scientific community, we are now quite good at publishing our data in a way that makes them both findable and accessible. However, FAIR data must also be interoperable and reusable. Central to the FAIR principles is the requirement that data and metadata be fully readable and understandable by machines, a point emphasised throughout by \citet{wilkinson2016fair}. This machine-readability is crucial for building efficient and effective services on top of data at scale. Examples of such services include the integration of multiple datasets, on-the-fly visualisation of datasets, and the development of monitoring and forecasting systems. This is increasingly important in the age of big data.
Some particularly noteworthy services that deserve attention are:
\begin{itemize}
\item \textbf{Destination Earth}: A project to develop a digital twin of the Earth. \url{https://destination-earth.eu/}
\item \textbf{Global Biodiversity Information Facility (GBIF)}: An international network and data infrastructure that provides open access to data about all types of life on Earth, enabling research and informed decision-making in biodiversity conservation. \url{https://www.gbif.org/}
\item \textbf{Copernicus}: The European Union's Sentinel Earth observation programme, providing comprehensive remote-sensing data for environmental monitoring, climate change analysis, and disaster management. \url{https://dataspace.copernicus.eu/explore-data}
\end{itemize}
These projects exemplify the potential of FAIR data in enabling advanced research, integrated environmental monitoring, and comprehensive data-driven decision-making systems. By publishing FAIR data, UNIS can contribute to services like these.
\subsubsection{Should all data be made FAIR-compliant?}
This is a question that sparks some debate. Whilst it is relatively simple to publish certain data in FAIR-compliant data formats (e.g. time series of data, vertical profiles, biodiversity data), for some data it is more complicated (e.g. data from complex experiments, qualitative data). It is up to the data management community to make this process easier for scientists by providing support and developing tools to simplify the data transformation.
The UNIS data policy \citep{UNISdatapolicy} mandates that ``All data are published in a self describing form ... where possible''. This does not explicitly mention FAIR data, so to elaborate:
\begin{itemize}
\item Data that can be published in a FAIR-compliant format should be, wherever plausible (i.e. it can be done within an acceptable timeframe by an experienced user; new users should expect this to take more time).
\item For complex datasets, if there are certain important variables that can be published in FAIR-compliant formats, these variables should be.
\item Some data might take an excessive time to publish in a FAIR-compliant data format and it simply isn't worth the time that would need to be invested to do so. In some cases the data don't fit well into such formats. In other cases, suitable data formats are not available. In such cases, we should be aiming for human-readable data. See section \ref{sec:humanreadable}.
\end{itemize}
If you are not sure whether your data should be published in a FAIR-compliant data format, email us.
Email: \emailauthors \\
\subsection{What Makes a Data Format FAIR-compliant}
To be considered FAIR-compliant, a data format must adhere to the following principles:
\begin{itemize}
\item \textbf{Software Independence:} Data formats should not be tied to a specific software. FAIR-compliant formats must be accessible using various software applications across different operating systems.
\item \textbf{Inclusion of Metadata:} All necessary metadata required to understand and utilise the data should be embedded within the file.
\item \textbf{Use of Controlled Vocabularies:} Both data and metadata terms should be derived from controlled vocabularies. A controlled vocabulary is a well-documented glossary of terms that is accessible online and actively maintained. Each term should include:
\begin{itemize}
\item A clear description
\item Instructions on how to use the term
\item A unique identifier to distinguish it from other terms
\end{itemize}
\item \textbf{Standardised Structure:} There should be defined conventions for the placement of data and metadata within the file. All files following these conventions should organise their data and metadata in the same way. The specific conventions adhered to by the dataset should be explicitly stated.
\item \textbf{Machine Readability/Interoperability:} The aforementioned points are crucial in reducing the variability in file creation, thereby enhancing interoperability. This ensures that software or services can be developed to work with all datasets following the same conventions.
\item \textbf{Documentation:} The data format and conventions should be well-documented, with comprehensive guidelines and examples provided to facilitate its use by others. Good documentation ensures that users can correctly interpret and utilise the data.
\item \textbf{Versioning:} It is essential to maintain version control for data formats to track changes and updates over time. Additionally, data formats should ensure that past versions remain supported by future versions to guarantee long-term usability.
\end{itemize}
SIOS provides interoperability guidelines (\url{https://sios-svalbard.org/sites/sios-svalbard.org/files/common/SDMS_Interoperability_Guidelines.pdf}) with details regarding suitable and less suitable data formats. In the sections that follow, we describe a few data formats suitable for UNIS data and point to resources available to help you create them.
Please note that this list might not cover all suitable FAIR-compliant data formats. If you think something is missing, please let us know by emailing the authors.\\ \\
Email: \emailauthors \\
We also acknowledge that there may not be a suitable FAIR-compliant data format in all cases.
\subsection{CF-NetCDF}
NetCDF is designed to facilitate the creation, access, and sharing of array-based scientific data with any number of dimensions.
\subsubsection{Data that should be published in a CF-NetCDF file}
\begin{itemize}
\item \textbf{Meteorological data:} such as temperature, pressure, and humidity measured at different altitudes and times
\item \textbf{Oceanographic data:} such as sea temperature, salinity, or the concentration of chlorophyll a or different nutrients.
\item \textbf{Model outputs:} consisting of multi-dimensional data arrays representing various environmental parameters, perhaps over time.
\item \textbf{Environmental science:} such as air quality, hydrology, and soil moisture
\item \textbf{Point clouds/DEMs}
\item \textbf{Some geological data:} such as sedimentary logs, borehole measurements
\end{itemize}
There is an ongoing effort to promote the use of NetCDF files across more disciplines – a movement quickly gaining traction. NetCDF files are suitable for any array-oriented scientific data with any number of dimensions. Using the same data format across multiple disciplines facilitates cross-disciplinary collaboration; scientists from different fields can more easily access and understand each other’s data without being impeded by unfamiliar formats.
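To illustrate what such a file contains, the header of a small CF-NetCDF file for a sea-temperature time series might look like the following CDL (the text representation printed by \texttt{ncdump -h}). The variable, dimension, and attribute values here are hypothetical; consult the CF and ACDD conventions for the full lists of required attributes:

```text
netcdf sea_temperature_timeseries {
dimensions:
    time = 3 ;
variables:
    double time(time) ;
        time:standard_name = "time" ;
        time:units = "seconds since 1970-01-01T00:00:00Z" ;
    double sea_water_temperature(time) ;
        sea_water_temperature:standard_name = "sea_water_temperature" ;
        sea_water_temperature:units = "degree_Celsius" ;
        sea_water_temperature:_FillValue = -999. ;

// global attributes (discovery metadata following ACDD):
    :title = "Sea temperature time series near Svalbard" ;
    :Conventions = "CF-1.8, ACDD-1.3" ;
    :creator_name = "Jane Doe" ;
}
```

Note that the metadata (standard names, units, fill values, discovery attributes) travel inside the file itself, which is what makes the format self-describing and machine-readable.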
\subsubsection{Tutorials}
\begin{itemize}
\item \textbf{Introduction to CF-NetCDF}: \url{https://lhmarsden.github.io/Introduction_to_CF-NetCDF} - includes tools and software that can be used to work with NetCDF files without code.
\item \textbf{Working with NetCDF files in Python}: \url{https://lhmarsden.github.io/NetCDF_in_Python_from_beginner_to_pro} \citep{marsden_netcdf_python}
\item \textbf{Working with NetCDF files in R}: \url{https://lhmarsden.github.io/NetCDF_in_R_from_beginner_to_pro} \citep{marsden_netcdf_r}
\end{itemize}
\subsubsection{Validators}
There are validators you can use to ensure that your files are compliant with the CF and ACDD conventions before you publish them, for example:
\begin{itemize}
\item \url{https://compliance.ioos.us/index.html}
\item \url{https://sios-svalbard.org/dataset_validation/form} (requires a SIOS account)
\end{itemize}
\subsubsection{Recommendations}
\begin{itemize}
\item NetCDF files should be encoded adhering to the Climate and Forecast (CF) conventions (\url{https://cfconventions.org/}). NetCDF is a container format, like JSON and XML, and as such is not a recommended data format if it doesn't adhere to CF.
\item CF-NetCDF files should also include global attributes according to the
Attribute Convention for Dataset Discovery (\url{https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3}).
\item It is \textbf{not} recommended to combine data from several stations in
a single NetCDF/CF file.
\end{itemize}
\subsection{Darwin Core Archive}
Darwin Core is a data standard originally developed for biodiversity informatics, though its scope has expanded to cover various types of data associated with one or more organisms. Darwin Core includes:
\begin{itemize}
\item Darwin Core terms: A controlled vocabulary of terms - \url{https://dwc.tdwg.org/terms/}
\item Darwin Core Archive: A more-or-less FAIR-compliant data format
\end{itemize}
\subsubsection{Data that should be published in a Darwin Core Archive}
\begin{itemize}
\item{Biodiversity data}
\item{Measurements or facts related to organisms (e.g. color, leaf size, height)}
\item{Fossil specimens}
\item{Some experimental data}
\item{Species lists derived from DNA data}
\end{itemize}
\subsubsection{Where to publish a Darwin Core Archive}
\begin{itemize}
\item \textbf{Marine data:} The Norwegian Marine Data Centre. Once published, the data will automatically become available via OBIS \url{www.obis.org} and GBIF \url{www.gbif.org}. Contact \href{mailto:[email protected]}{[email protected]} to start the process.
\item \textbf{Terrestrial data: } GBIF Norway. Contact \href{mailto:[email protected]}{[email protected]} to start the process. Luke also has admin credentials to their Integrated Publishing Toolkit at the time of writing.
\end{itemize}
\subsubsection{Tutorials}
The following document has been created to help you create and publish a Darwin Core Archive for scientific data, including where you should publish it.
\url{https://github.com/lhmarsden/Darwin_Core_Archive_workshop/blob/main/How_to_create_a_DwCA_for_scientists.pdf}
\subsubsection{Validators}
The Integrated Publishing Toolkit includes a validator to check that your data are okay before publishing. We recommend that you publish your data using the Integrated Publishing Toolkit.
\subsubsection{Recommendations}
\begin{itemize}
\item Use the Nansen Legacy template generator (\url{https://www.nordatanet.no/aen/template-generator/config%3DDarwin%20Core}) to prepare your data. It allows you to create structured spreadsheets containing separate sheets for each core or extensions. It provides you with requirements and recommendations for what terms you should be including as well as descriptions for each term.
\item Publish your data using the Integrated Publishing Toolkit. Nodes are hosted by either GBIF Norway or the Norwegian Marine Data Centre.
\item We advise that you publish measurements of the physical environment (e.g. soil moisture content if quantitative, air temperature, wind speed) separately in CF-NetCDF files. These data are useful to people who are not necessarily interested in your `biological' data. Each publication can reference the other in its metadata so that someone interested in the data can see that they are related and how.
\end{itemize}
\subsection{Human-readable data}
\label{sec:humanreadable}
Data that can't plausibly be published in FAIR-compliant data formats should be published in a way that is human-readable and understandable. This means that people who are not experts in your field should be able to understand your data using only the files that you have published.
You should consider the following:
\begin{itemize}
\item Use descriptive names for your variables. Don't use abbreviations. Bonus points if you take terms from controlled vocabularies and refer to the URI of the term used in your publication, e.g. \url{http://vocab.nerc.ac.uk/collection/P07/current/CFSN0023/} - clicking on this URI will give you a description for the term. You can search for terms here: \url{https://vocab.nerc.ac.uk/search_nvs/}.
\item Don't rely on cell formatting (colours, bold font, merged cells) in software like Excel to make your data understandable. This formatting is lost when data are loaded into other software like R, Python, Matlab etc.
\item Use software-independent files. Someone should not have to download extra software to read your data. CSV files are preferred to XLSX files.
\item You should publish a README file in PDF format alongside your publication that includes:
\begin{itemize}
\item Metadata related to each data file published (where, when, how, and by whom the data were collected, and what instruments or equipment were used). Consider following the list of requirements here: \url{https://adc.met.no/node/4}. These terms are usually used to describe a whole dataset, so state what each applies to if the files differ.
\item A paragraph (or a few) describing your data. Don't assume that the data user has or will read your paper.
\item A description of what is included in each file published and how each file is formatted.
\item Descriptions and units for each column header/variable name used.
\end{itemize}
\end{itemize}
\section{Verification and Review}
\label{sec:verification-review}
Perform an internal review of your dataset before final submission. Just as with any other publication, all authors should have a chance to review the dataset and grant their approval before it is published.
Also, please make sure to run your data file through a validator if possible, to ensure that it adheres to the conventions you state it follows. See the sections above for validators for different file formats.
\section{Selecting a Data Centre}
\subsection{What makes a good data centre}
There are hundreds of data centres. Where should you publish your data? A good data centre should:
\begin{itemize}
\item Provide your dataset with a DOI, unique to your dataset.
\item Ensure that your data will remain available through time. Data centres can achieve this by routinely opening and checking files to ensure that they have not become lost or corrupted. Some data centres (e.g. Zenodo) don't offer this service.
\item Make your data as easy to find as possible.
\end{itemize}
Let’s elaborate on that last point. Data are important in their own right, independent of any associated paper publication. It can be surprising which datasets get used again. Furthermore, you should not consider your dataset in isolation. Your data are your contribution to a much larger collection of similar data that someone might want to use altogether. This is particularly relevant in the age of big data.
A good data centre should require you to provide thorough `discovery' metadata (metadata that helps someone find the data through a search engine, e.g. when and where the data were collected, by whom, and some keywords). When assessing a data centre, you can test this yourself: how easy is it to speculatively find data from a certain region, collected in a certain year, for example?
There are hundreds of data centres. It is impractical for a potential data user to search through all of them. Data access portals aim to make data from different data centres available through a single searchable catalogue. According to the UNIS data management plan, all UNIS data should be made available via the SIOS data access portal, which is a catalogue of data relevant to Svalbard. SIOS does not host any data themselves; they harvest metadata from contributing data centres to provide links to the data.
\subsection{Data centres that contribute to SIOS}
The best and easiest way to make your data available via the SIOS access portal is to publish to data centres that contribute to SIOS. These are listed in the section `Allocation of resources' of the SIOS data management plan (\url{https://sios-svalbard.org/system/files/common/Documents/SIOS_Data_Management_Plan.pdf}).
\subsection{The main data centres for UNIS data}
Of the data centres that contribute to SIOS, Norwegian data centres should be prioritised for UNIS data.
\subsubsection{NIRD Research Data Archive (NIRD RDA)}
\begin{itemize}
\item \textbf{Link to site:} \url{https://archive.norstore.no/}
\item \textbf{Email address:} \href{mailto:[email protected]}{[email protected]}
\item \textbf{How to make data available via SIOS:} Metadata collection form, see section \ref{sec: linking data to sios}.
\item \textbf{How to start the process}: Data collection form on their site. For datasets too large to upload, begin the data collection form and email them to arrange data transfer.
\item \textbf{Use for: } any data
\end{itemize}
\subsubsection{Norwegian Marine Data Centre (NMDC)}
\begin{itemize}
\item \textbf{Link to site:} \url{https://www.nmdc.no}
\item \textbf{Email address:} \href{mailto:[email protected]}{[email protected]}
\item \textbf{How to make data available via SIOS:} Automatic
\item \textbf{How to start the process}: Email them
\item \textbf{Use for: } any marine data
\end{itemize}
\subsubsection{Norwegian Polar Data Centre (NPDC)}
\begin{itemize}
\item \textbf{Link to site:} \url{https://data.npolar.no/home/}
\item \textbf{Email address:} \href{mailto:[email protected]}{[email protected]}
\item \textbf{How to make data available via SIOS:} Automatic
\item \textbf{How to start the process}: Email them
\item \textbf{Use for: } any data
\end{itemize}
\subsubsection{Arctic Data Centre (MET)}
\begin{itemize}
\item \textbf{Link to site:} \url{https://adc.met.no/}
\item \textbf{Email address:} \href{mailto:[email protected]}{[email protected]}
\item \textbf{How to make data available via SIOS:} Automatic
\item \textbf{How to start the process}: Email them
\item \textbf{Use for: } any data that contributes to the objectives of MET (e.g. meteorology, physical oceanography, sea ice, remote sensing, air pollution, models and climate analysis).
\end{itemize}
\subsubsection{Norwegian Institute for Air Research (NILU)}
\begin{itemize}
\item \textbf{Link to site:} \url{https://www.nilu.com/open-data/}
\item \textbf{Email address:} \href{mailto:[email protected]}{[email protected]}
\item \textbf{How to make data available via SIOS:} Automatic
\item \textbf{How to start the process}: Email them
\item \textbf{Use for: } atmospheric composition data
\end{itemize}
\subsection{Exceptions - publishing your data elsewhere}
Data can be published to the data centres listed above in most cases. However, some data services are the established default for a certain type of data. Crucially, people actively use these services to look for data for research or for making data-driven decisions. We want UNIS to contribute data to these services.
If you think we have missed something, please email us.\\
\emailauthors
\subsubsection{GBIF}
All biodiversity data should be available via GBIF (\url{www.gbif.org}). GBIF is part of an interlinked network of data centres that host and share biodiversity data, along with OBIS (\url{www.obis.org}), Living Norway (\url{https://livingnorway.no/}) and iNaturalist (\url{https://www.inaturalist.org/}).
If you have marine data, you can publish them to the Norwegian Marine Data Centre and they will be made available via SIOS, GBIF and OBIS automatically.
For terrestrial data, it is possible to publish a Darwin Core Archive with a data centre that contributes to SIOS and use the same DOI to publish it to GBIF.
It is likely that GBIF will be harvestable via SIOS in the next few years.
\subsubsection{Sequence data}
The International Nucleotide Sequence Database Collaboration (INSDC) is a global collaboration of independent governmental or non-profit organisations that manage nucleotide sequence databases.
Participating databases are listed at
\url{https://www.insdc.org/about-insdc/#vf-tabs__section-participating-databases}.
This includes the European Nucleotide Archive (ENA) and GenBank.
Data published to these services should be linked to SIOS using the metadata collection form as described in section \ref{sec: linking data to sios}.
\section{Making your data available via SIOS}
\label{sec: linking data to sios}
If you publish your data to a data centre that SIOS is harvesting from, your data will automatically be made available via the SIOS data access portal. You might find that they are available via other access portals too!
Data published to a data centre that does not contribute to the SIOS data access portal must be linked manually using a metadata collection form hosted by SIOS: \url{https://sios-svalbard.org/metadata-collection-form}. Please note that SIOS are not able to create services on top of data linked manually to SIOS (e.g. visualisation, aggregation). Only a link to the data will be provided.
\subsection{NIRD RDA}
Note that at the time of writing, NIRD RDA is not being harvested by SIOS. This is likely to change in the near future.
There will soon be a way to publish CF-NetCDF files to NIRD RDA through NorDataNet (\url{https://www.nordatanet.no/en}). This will:
\begin{itemize}
\item Retrieve most of the discovery metadata from the CF-NetCDF file so you don't have to enter it again.
\item Make the data available via SIOS
\end{itemize}
To see whether this tool is now available:
\begin{enumerate}[label=\arabic*.]
\item Visit \href{https://www.nordatanet.no/en}{\texttt{https://www.nordatanet.no/en}}
\item Go to \texttt{Submit data} in the navigation bar at the top
\item Select \texttt{Submit data as NetCDF/CF}
\end{enumerate}
\section{Citing Your Data}
Include the DOI in your publications and share it with collaborators. You can include the citation for your data in your list of references, just as you would any other publication.
Properly citing datasets is important for several reasons:
\begin{itemize}
\item \textbf{Credit to data creators}: Provides proper credit to the data creators in a way that academia recognises.
\item \textbf{Tracking data use}: Some data centres scan through the references of publications for the DOIs of the datasets they host to provide statistics on data use on the landing page of the dataset. Tracking the impact and reach of datasets provides feedback to data creators.
\item \textbf{Easier to find the data}: Providing the full citation makes it easier for someone to find the dataset, which is crucial for replicating the study or reusing the data for another purpose.
\item \textbf{Legal and ethical responsibility}: Fulfills legal and ethical obligations to acknowledge the original data sources.
\item \textbf{Supporting data sharing initiatives}: Encourages the culture of data sharing by showing that datasets are valuable and acknowledged.
\end{itemize}
Some journals require you to include a data availability statement. This is fine, but it should be in addition to (not instead of) the citation in your list of references.
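As an illustration, a dataset entry in your \texttt{.bib} file might look like the following. All fields here are invented placeholders; use the citation text given on the dataset's landing page, and keep the DOI exactly as issued:

```text
@misc{doe2024seatemp,
  author       = {Doe, Jane},
  title        = {Sea temperature observations near Svalbard, June 2024},
  year         = {2024},
  publisher    = {Norwegian Marine Data Centre},
  doi          = {10.xxxx/xxxxx},
  howpublished = {\url{https://doi.org/10.xxxx/xxxxx}}
}
```

Citing the entry with \verb|\citep{doe2024seatemp}| then places the dataset in your reference list alongside your other sources, where data centres can find and count it.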
\bibliography{ref}
\end{document}