RedBloodCellOutputStrategy
This page discusses possible output formats for red blood cells in HemeLB (ticket #708).
XDR and HDF5 were initially suggested as possible output formats; however, comparisons with CDF, netCDF and others are also included.
Features required from an output format are:
- A C/C++ API (potentially python too)
- Compatibility with MPI-IO
- Ability to store timeseries data efficiently
- Ability to grow with number of timesteps
- Ability to grow with number of RBCs (red blood cells are created and destroyed at several timesteps during a simulation)
- Ability to change the level of detail in the output for each RBC (barycentres, normals, facets, etc.)
h3. XDR
XDR (eXternal Data Representation) is a standard for data serialisation (http://tools.ietf.org/html/rfc1014). It specifies a single, platform-independent representation for each data type (big-endian byte order, two's complement integers, IEEE 754 floating point with fixed mantissa and exponent widths) so that the same data can be exchanged between platforms with different native representations. It is not a file format itself but is used by other file formats such as netCDF. It is also used by NFS to send data across networks and by R to store workspace variables to disk.
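As an illustration of what the XDR C API looks like in practice, the sketch below serialises a cell barycentre using the traditional Sun RPC routines from <rpc/xdr.h> (on modern glibc systems these may be provided by libtirpc instead); the file name and data are hypothetical:

<pre>
#include <cstdio>
#include <rpc/xdr.h>   // may require libtirpc on modern glibc

int main() {
    double barycentre[3] = {1.0, 2.5, -0.5};

    FILE* fp = std::fopen("rbc.xdr", "wb");
    XDR xdrs;
    xdrstdio_create(&xdrs, fp, XDR_ENCODE);  // attach an XDR stream to the FILE*

    // Each value is written in XDR's canonical (big-endian, IEEE 754) form,
    // so the file can be read back on any platform using XDR_DECODE.
    for (double& component : barycentre)
        xdr_double(&xdrs, &component);

    xdr_destroy(&xdrs);
    std::fclose(fp);
    return 0;
}
</pre>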
h3. HDF5
HDF5 (Hierarchical Data Format version 5) was originally developed by NCSA and is now supported by the HDF Group (http://www.hdfgroup.org/HDF5). It is released under a BSD-like Open Source Compatible Licence (https://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/COPYING).
HDF5 organises data into Datasets, Groups and Attributes. Datasets contain multidimensional data with the number of dimensions fixed at creation time. Each dimension is given a current size and a maximum size at creation time; a dimension whose maximum is declared unlimited can grow later, provided the Dataset uses chunked storage. Datasets can be organised into Groups in a similar fashion to the Unix filesystem, where an HDF5 Group is analogous to a directory and an HDF5 Dataset is analogous to a file (soft/hard links are also supported within files). Datasets and Groups can also carry Attributes that describe the data they contain. It is up to the application to define how data is organised within an HDF5 file.
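As a minimal illustration of this data model, the (serial) sketch below creates one Group for a cell and writes a fixed-size barycentre Dataset into it; the file layout and names are hypothetical, not a proposed HemeLB schema, and error checking is omitted:

<pre>
#include <hdf5.h>

int main() {
    hid_t file  = H5Fcreate("rbc.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/cell_0000",
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // A 1x3 Dataset holding one barycentre; both dimensions are fixed here.
    hsize_t dims[2] = {1, 3};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(group, "barycentre", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    double xyz[3] = {1.0, 2.5, -0.5};
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, xyz);

    H5Dclose(dset); H5Sclose(space); H5Gclose(group); H5Fclose(file);
    return 0;
}
</pre>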
HDF5 can be built in two different configurations (which ship under the same library name, although they are incompatible). The serial HDF5 library provides Fortran, C, C++, Java and Python interfaces but is not thread-safe and does not support MPI-IO, although the latter can be worked around by making a single MPI rank responsible for writing all the data. The parallel HDF5 library only has a C interface, but is thread-safe and supports MPI-IO. (https://www.hdfgroup.org/hdf5-quest.html#p5thread)
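The sketch below shows how the parallel library is typically driven: a file access property list selects the MPI-IO driver, and a transfer property list requests collective writes. The property-list calls are the standard HDF5 ones; the surrounding structure is only illustrative:

<pre>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // File access property list telling HDF5 to use the MPI-IO driver.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("rbc_parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // Transfer property list requesting collective I/O: all ranks
    // participate in each H5Dwrite call on a shared Dataset.
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    // ... create Datasets, select each rank's hyperslab, H5Dwrite(..., dxpl, ...) ...

    H5Pclose(dxpl);
    H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
</pre>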
HDF5 can store, index and look up time series data efficiently and also supports various levels of compression for the output files. Additional timesteps can be appended to an existing file by declaring the 'time' dimension of a Dataset extensible (unlimited) when it is created, as shown in the sketch below.
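A sketch of this, assuming a hypothetical per-cell 'barycentre' Dataset: the time dimension is declared unlimited at creation, the Dataset is chunked (and optionally compressed), and a timestep is appended by extending the Dataset and writing into the new hyperslab:

<pre>
#include <hdf5.h>

// Create a (time x 3) Dataset whose time dimension is unlimited, then
// append one timestep. Extensible Datasets require chunked storage.
void append_barycentre(hid_t file, const double xyz[3]) {
    hsize_t dims[2]    = {0, 3};
    hsize_t maxdims[2] = {H5S_UNLIMITED, 3};
    hsize_t chunk[2]   = {64, 3};            // 64 timesteps per chunk

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 4);                 // optional gzip compression

    hid_t space = H5Screate_simple(2, dims, maxdims);
    hid_t dset  = H5Dcreate2(file, "barycentre", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    // Grow the Dataset by one row and write into the new slab.
    hsize_t newdims[2] = {1, 3};
    H5Dset_extent(dset, newdims);
    hid_t filespace = H5Dget_space(dset);
    hsize_t start[2] = {0, 0}, count[2] = {1, 3};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, xyz);

    H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Sclose(space); H5Pclose(dcpl);
}
</pre>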
How to store varying levels of detail about the RBCs is up to the application to decide. One solution would be simply to create Datasets that store all possible RBC details ahead of time. Another, better, solution would be to store the additional details in separate Datasets sharing the same time dimension, allowing them to be cross-referenced with each other; a sketch of such a layout is given below. A similar approach could be taken to grow the file with the number of cells.
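For example, a layout along the following lines (all names purely illustrative) would give each level of detail its own extensible Dataset within a per-cell Group:

<pre>
/parameters                      (Attributes: timestep length, lattice info, ...)
/cells/cell_0042/time            (T          simulation step for each output row)
/cells/cell_0042/barycentre      (T x 3      always written)
/cells/cell_0042/normals         (T x F x 3  written only if requested)
/cells/cell_0042/facet_areas     (T x F      written only if requested)
</pre>

All Datasets within one cell's Group share the same time axis, so a single row index identifies the same timestep across every level of detail.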
h3. CDF
CDF (Common Data Format) is developed by NASA. It is broadly similar to HDF5, but lacks Groups and parallel I/O. It has Fortran, C and Java APIs as well as optional Perl and C# APIs. Utilities exist to convert between netCDF/HDF5 and CDF in both directions.
h3. netCDF
From http://www.unidata.ucar.edu/software/netcdf/docs/: "NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data."
It is open source and is available on GitHub: https://github.com/Unidata/netcdf-c.
The current version (4) appears to be a wrapper around HDF5 that adds a few features, such as support for more than one expandable array dimension (http://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html#build_parallel). If no netCDF-specific extensions are used then the files created are also valid HDF5 files. The netCDF-C and netCDF-Java libraries implement C and Java interfaces, respectively, and both support MPI-IO. An official C++ interface was developed but has not been kept up to date; it is deprecated in favour of the C interface and does not support MPI-IO. A new C++ interface is under development at https://github.com/Unidata/netcdf-cxx4. Third-party Python APIs are available, as are converters between different file formats such as CDF, HDF4 and HDF5. (http://www.unidata.ucar.edu/software/netcdf/software.html#Python)
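As a sketch of the netCDF-4 C interface with MPI-IO (this requires a netCDF build with parallel support; the parallel entry points live in <netcdf_par.h>, and the file and variable names here are hypothetical):

<pre>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int ncid;
    // Create an HDF5-backed netCDF-4 file opened for parallel access.
    nc_create_par("rbc.nc", NC_NETCDF4 | NC_MPIIO,
                  MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

    int time_dim, xyz_dim, varid;
    nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim);  // expandable dimension
    nc_def_dim(ncid, "xyz", 3, &xyz_dim);
    int dims[2] = {time_dim, xyz_dim};
    nc_def_var(ncid, "barycentre", NC_DOUBLE, 2, dims, &varid);
    nc_enddef(ncid);

    // Request collective access to the variable, then write one timestep.
    nc_var_par_access(ncid, varid, NC_COLLECTIVE);
    size_t start[2] = {0, 0}, count[2] = {1, 3};
    double xyz[3] = {1.0, 2.5, -0.5};
    nc_put_vara_double(ncid, varid, start, count, xyz);

    nc_close(ncid);
    MPI_Finalize();
    return 0;
}
</pre>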
The example programs that ship with netCDF are drawn from the earth sciences; they associate values with grid dimensions as well as timesteps, which might be useful here. (http://www.unidata.ucar.edu/software/netcdf/docs/examples1.html)
Best practices for writing netCDF files: http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html
Discussion of how to represent timeseries data in netCDF: http://www.unidata.ucar.edu/software/netcdf/time/
h3. Others
Other options (Facebook's Thrift, Google's Protocol Buffers, ...) are mainly used to serialise data for RPC. They can be used to serialise data to disk, but in most cases this requires redefining the structs/classes used (for example, Thrift requires describing the data in its own IDL and generating code with a custom compiler).