seaMass Input format (smi)

Andrew Dowsey edited this page Nov 29, 2016 · 6 revisions

This format can be used to process data through seaMass that cannot be converted to mzMLb with ProteoWizard. The format is based on netCDF4-compatible HDF5 and is very close to seaMass' internal processing format. Whereas seaMass performs some data transformation from mzMLb to its internal representation, with this generic format it is the user's job to ensure that the input is exactly as seaMass requires: the input data must consist of a number of bins, each denoted by a start and end m/z, with an ion count for each bin. Please create your input file with the following HDF5 datasets in the HDF5 root group (the data type of each dataset is unimportant):

1-dimensional data

  • bin_counts: A 1-dimensional vector of the ion counts in each bin, sorted by increasing m/z.

  • bin_edges: A 1-dimensional vector with length equal to the length of bin_counts + 1. Contains the start m/z for each bin followed by the end m/z for the last bin.

  • exposures: An optional 1-dimensional vector of length one. This contains the fraction of the sample actually exposed to the detector. By default this is 1.0, but can be less where automatic gain control (AGC) is used.
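As a rough illustration of the 1-dimensional layout, the sketch below checks the length relationship between bin_counts and bin_edges and then writes the datasets to the root of an HDF5 file with h5py. The helper names check_1d and write_smi_1d, the 64-bit float type, and the strictly increasing edge check are assumptions made for this example, not part of seaMass. Whether seaMass needs any netCDF4 metadata beyond plain HDF5 datasets is not stated on this page, so treat the writer as a starting point only.

```python
def check_1d(bin_counts, bin_edges):
    """Validate the 1-dimensional layout (hypothetical helper, not part of seaMass)."""
    if len(bin_edges) != len(bin_counts) + 1:
        raise ValueError("bin_edges must have len(bin_counts) + 1 entries")
    # Edges are the start m/z of each bin followed by the end m/z of the last bin,
    # so this sketch assumes they must be strictly increasing.
    if any(lo >= hi for lo, hi in zip(bin_edges, bin_edges[1:])):
        raise ValueError("bin_edges must be strictly increasing")


def write_smi_1d(path, bin_counts, bin_edges, exposure=None):
    """Write the datasets into the HDF5 root group (sketch, not an official writer)."""
    import h5py  # third-party; imported here so check_1d works without h5py installed

    check_1d(bin_counts, bin_edges)
    with h5py.File(path, "w") as f:
        f.create_dataset("bin_counts", data=bin_counts, dtype="f8")
        f.create_dataset("bin_edges", data=bin_edges, dtype="f8")
        if exposure is not None:
            # Optional vector of length one; leaving it out means the default of 1.0.
            f.create_dataset("exposures", data=[exposure], dtype="f8")


# Three bins covering m/z 400.0 to 400.3, hence four edges.
counts = [12.0, 30.0, 7.0]
edges = [400.0, 400.1, 400.2, 400.3]
check_1d(counts, edges)
```

Calling write_smi_1d("example.smi", counts, edges, exposure=0.8) would then produce a file with the three datasets; the file name is only a placeholder.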

2- or more dimensional data

  • bin_counts and bin_edges: As above, but please sort the spectra by acquisition time and concatenate them, so that each dataset is a single 1-dimensional vector.

  • spectrum_index: A 1-dimensional vector of length equal to the number of input spectra. Contains the start index (zero-based indexing) of each spectrum in the bin_counts dataset. Note that since bin_edges contains one extra value per spectrum, the start index of the i-th spectrum in bin_edges is given by spectrum_index[i] + i.

  • start_times: If there is only a single extra dimension (e.g. retention time for standard LC-MS, or ion mobility time for IMS without chromatography), this is a 1-dimensional vector of length equal to the number of input spectra, containing the scan start time for each spectrum. If there are multiple extra dimensions, make this a 2-dimensional dataset with one column of spectrum start times per extra dimension.

  • finish_times: Same as start_times but for the scan finish times for each spectrum.

  • exposures: An optional 1-dimensional vector of length equal to the number of input spectra. This contains the fraction of the sample actually exposed to the detector. By default this is 1.0 for all spectra denoting that all the sample is exposed, but in cases where automatic gain control (AGC) is used, this can vary from spectrum to spectrum.
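The indexing rule for concatenated spectra can be sketched as follows. This is a minimal example in plain Python, assuming each spectrum arrives as a (counts, edges) pair already sorted by acquisition time; the function names concatenate_spectra and spectrum_slices are hypothetical. It builds bin_counts, bin_edges and spectrum_index, then recovers one spectrum using the spectrum_index[i] + i offset into bin_edges.

```python
def concatenate_spectra(spectra):
    """Flatten (counts, edges) pairs into the concatenated layout (hypothetical helper).

    `spectra` must already be sorted by acquisition time.
    """
    bin_counts, bin_edges, spectrum_index = [], [], []
    for counts, edges in spectra:
        if len(edges) != len(counts) + 1:
            raise ValueError("each spectrum needs len(counts) + 1 edges")
        spectrum_index.append(len(bin_counts))  # zero-based start in bin_counts
        bin_counts.extend(counts)
        bin_edges.extend(edges)
    return bin_counts, bin_edges, spectrum_index


def spectrum_slices(i, bin_counts, bin_edges, spectrum_index):
    """Recover the counts and edges of spectrum i from the concatenated vectors."""
    start = spectrum_index[i]
    # spectrum_index has no trailing entry, so the last spectrum ends at len(bin_counts).
    end = spectrum_index[i + 1] if i + 1 < len(spectrum_index) else len(bin_counts)
    # Each earlier spectrum contributed one extra edge, hence the "+ i" offset,
    # and spectrum i itself has one more edge than it has counts.
    return bin_counts[start:end], bin_edges[start + i:end + i + 1]


spectra = [
    ([5.0, 9.0], [400.0, 400.1, 400.2]),
    ([3.0, 4.0, 8.0], [500.0, 500.1, 500.2, 500.3]),
]
bin_counts, bin_edges, spectrum_index = concatenate_spectra(spectra)
print(spectrum_index)  # [0, 2]
print(spectrum_slices(1, bin_counts, bin_edges, spectrum_index))
```

The start_times, finish_times and exposures datasets would sit alongside these with one entry (or one row) per spectrum, in the same order as spectrum_index.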
