Question about on-disk storage (netcdf-3) #2224

gsjaardema · 2022-02-10T20:41:13Z

If I have a variable double time(time_step) with dimension time_step = UNLIMITED, how is time stored on disk:

will it be contiguous following the fixed-size header;
will it be stored in time_step separate pieces spread across the file (assume I have several other variables that also use the unlimited dimension);
will it be stored in 1 or more chunks of some known size spread across the file?

I have always assumed that it is contiguous, but a user is seeing some strange behavior when querying this variable: the read time looks proportional to the size of time_step, which would not make sense if the variable were stored contiguously...

WardF (Maintainer) · 2022-02-10T20:46:49Z

So assuming netCDF3, completely absent the libhdf5-supplied storage layer, there aren't any 'chunks', so to speak. The file itself is contiguous, although the byte locations for the data stored along the unlimited dimension might not be contiguous, particularly (I think) if the file has been appended to. This is off the top of my head; let me take a closer look, starting here.

WardF (Maintainer) · 2022-02-10T20:48:15Z

Converting this to a discussion since it's an open question, which (depending on where we end up) may have an 'action item' we can create an issue for. This will also help me evaluate how well the 'discussion' tools GitHub provides fit into our workflow :).

DennisHeimbigner (Collaborator) · 2022-02-10T20:51:20Z

Ward's comment about netcdf-3 is correct.
In netcdf-4, since it uses HDF5, the values are stored in a B-tree
whose leaves are the chunks. This effectively makes them non-contiguous.

ethanrd · 2022-02-11T07:18:05Z

In netCDF-3, all variables that use the unlimited dimension (also called record variables) are interleaved along the unlimited dimension. So a record variable is not contiguous unless it is the only record variable in the file. The comments in the netCDF-3 file format specification (NUG Appendix B, “File Format Specification”) are the best description I've found:

The data for all record variables are stored interleaved at the end of the file.

Each record consists of the n-th slab from each record variable, for example x[n,...], y[n,...], z[n,...] where the first index is the record number, which is the unlimited dimension index.
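The interleaving described in the specification can be sketched as simple offset arithmetic. This is only an illustration: the function name `record_slab_offset`, the starting offset 1024, and the three variable sizes are made up, and the per-record sizes are assumed to already be padded to 4-byte boundaries.

```python
def record_slab_offset(begin, vsizes, var_index, n):
    """Byte offset of the n-th slab of one record variable.

    begin     -- hypothetical offset where the record section starts
    vsizes    -- per-record byte size of each record variable, in file order
    var_index -- position of the variable of interest within vsizes
    n         -- zero-based record number
    """
    recsize = sum(vsizes)                    # one record holds one slab of every variable
    within_record = sum(vsizes[:var_index])  # slabs of earlier variables come first
    return begin + n * recsize + within_record

# Three hypothetical record variables x, y, z with 8, 40 and 16 bytes per record.
vsizes = [8, 40, 16]
offsets_y = [record_slab_offset(1024, vsizes, 1, n) for n in range(3)]
print(offsets_y)  # consecutive slabs of y are 64 bytes apart, not 40
```

Because the stride between slabs of `y` is the full record size rather than the size of `y` itself, the slabs are only adjacent when `y` is the sole record variable.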

Alexander-Barth · 2023-10-11T12:21:22Z

I am wondering if nc_inq_var_chunking should report such a variable (a variable with an unlimited dimension stored in a NetCDF 3 file) as chunked, since it is stored in a non-contiguous way on disk. Currently it is reported as contiguous with chunk size 0.

DennisHeimbigner (Collaborator) · 2023-10-11T18:32:50Z

Possibly. But since the layout on disk looks in no way like chunking, I think it would be misleading to report it
as chunked. BTW, what chunk size are you proposing?

Alexander-Barth · 2023-10-11T19:37:06Z

> But since the layout on disk looks in no way like chunking

I understand that HDF5 uses a more sophisticated approach to chunking, but is it not the case in both formats that the array is decomposed into individual blocks or chunks, and that within every chunk the data is stored contiguously? Between different chunks there can be gaps (where data from other variables is stored). For best performance, one should process a complete chunk at a time.

For example, in the case below, a chunk size of (1,11,10) for the variable var and a chunk size of (1) for the variable time would be a natural choice (for netCDF-3), as a single record holds one element of the variable time followed by 11*10 elements of the variable var.

dimensions:
    lon = 10 ;
    lat = 11 ;
    time = UNLIMITED ; // (12 currently)
variables:
    float time(time) ;
    float var(time, lat, lon) ;
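As a worked check of the proposal, the sketch below derives the suggested per-record "chunk" shapes and the bytes each variable contributes to one record. The helper `per_record_layout` is hypothetical, and a 4-byte float is assumed for both variables, as in the listing above.

```python
from math import prod

FLOAT_BYTES = 4  # size of a netCDF-3 "float"

def per_record_layout(variables):
    """Return (name, natural_chunk_shape, bytes_per_record) for each record variable.

    `variables` maps a name to its full shape, record dimension first.
    """
    layout = []
    for name, shape in variables.items():
        chunk = (1,) + tuple(shape[1:])          # one record's worth of the variable
        layout.append((name, chunk, prod(chunk) * FLOAT_BYTES))
    return layout

example = {"time": (12,), "var": (12, 11, 10)}
layout = per_record_layout(example)
record_bytes = sum(nbytes for _, _, nbytes in layout)

for name, chunk, nbytes in layout:
    print(name, chunk, nbytes)
print("bytes per record:", record_bytes)
```

Under these assumptions each record is 444 bytes: 4 bytes of time followed by 440 bytes of var, so the largest contiguous run of var is one (1,11,10) slab.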

DennisHeimbigner (Collaborator) · 2023-10-11T19:49:48Z

The netcdf-c layout for unlimited is AFAIK this:

Let N be the current size of the single unlimited dimension.
Let R be the number of variables that have the unlimited dimension; call them V1, V2,...VR.
Then there is an allocated space on disk into which the data for ALL unlimited variables are placed.
This allocated space consists of a sequence of N "records", and the i'th record looks like this:

V1[i],V2[i],...VR[i]

That is, the i'th record is the concatenation of the i'th element of each of the R variables.

This layout seems to me to not be similar to HDF5 chunks in any way.

rafaqz · 2023-10-12T06:58:57Z

But if we were to process a variable larger than memory, the fastest read pattern would be treating these strips along the unlimited dimension as chunks?

That suggests the (1,11,10) chunk size proposed by @Alexander-Barth still makes sense, even if the layout is different from HDF5.

Alexander-Barth · 2023-10-12T11:21:26Z

I agree that the implementation of chunking/record variables is different in NetCDF4 and NetCDF3.
Some time ago, I implemented the netcdf-c layout for unlimited variables in pure Julia, following the approach that @DennisHeimbigner described:

https://github.com/Alexander-Barth/NetCDF3.jl/blob/main/src/variables.jl#L140-L146

A contiguous variable (as unlimited variables are currently reported by nc_inq_var_chunking) would imply that one can read the whole data in a single read operation (or wouldn't it?). However, for unlimited variables one would need to seek from one chunk to the next.
If you have 3 records, then on disk one would have (as you have shown):

V1[1],V2[1],...VR[1],  V1[2],V2[2],...VR[2],  V1[3],V2[3],...VR[3]

The block V1[2] is not directly next to the block V1[1].
The exact file position of each chunk certainly differs between NetCDF3 and NetCDF4, as they are implemented differently. But from the user's perspective it is interesting to know the largest block of contiguously stored data for a given variable (the chunk size) in order to choose the optimal read pattern, as @rafaqz pointed out.

Or maybe for NetCDF there is a different definition of what a contiguous on-disk layout is that I am not aware of.
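To make the seek-per-record point concrete, here is a toy sketch that builds an in-memory byte buffer in the interleaved order shown above and reads V1 back. It is not a netCDF-3 reader: there is no header, only two made-up variables (2 and 3 floats per record), and the sizes and values are invented for illustration.

```python
import io
import struct

N_RECORDS = 3
V1_LEN, V2_LEN = 2, 3            # hypothetical values per record for V1 and V2
FMT_V1 = f">{V1_LEN}f"           # netCDF-3 stores data big-endian
FMT_V2 = f">{V2_LEN}f"
V1_BYTES = struct.calcsize(FMT_V1)
RECSIZE = V1_BYTES + struct.calcsize(FMT_V2)

# Build a toy "record section": V1[0],V2[0], V1[1],V2[1], ...
buf = io.BytesIO()
for i in range(N_RECORDS):
    buf.write(struct.pack(FMT_V1, *[10.0 * i + k for k in range(V1_LEN)]))
    buf.write(struct.pack(FMT_V2, *([-1.0] * V2_LEN)))

# Reading all of V1 takes one seek plus one read per record.
v1, reads = [], 0
for i in range(N_RECORDS):
    buf.seek(i * RECSIZE)                    # jump over the V2 slab in between
    v1.extend(struct.unpack(FMT_V1, buf.read(V1_BYTES)))
    reads += 1

print(v1, reads)
```

The number of separate reads grows with the number of records, which is the behavior a "contiguous" label would not lead a user to expect.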

DennisHeimbigner (Collaborator) · 2023-10-12T17:36:35Z

One problem is that the terms "contiguous" and "chunked" come from the concepts
as implemented in HDF5 (and inherited by netcdf-4). So neither term is accurate
for netcdf-3 unlimited record layout.
But let me back up a bit. Is there a specific reason you want to change how netcdf-c reports storage
on netcdf-3 from contiguous to chunked?

Alexander-Barth · 2023-10-13T07:27:10Z

The package NCDatasets was recently integrated with the package DiskArrays, which provides special types for chunked and contiguous arrays. The idea is that a user (or library developer) can provide specialized functions for one or the other case when performance is important. So far we have used nc_inq_var_chunking to decide whether the DiskArray type should be a chunked or a contiguous array. However, for variables with unlimited dimensions, the data is not actually contiguous on disk and the chunked DiskArray type should probably be used instead. This made me wonder whether the output of nc_inq_var_chunking is actually appropriate.

In any case, we can adapt NCDatasets to add a special case for NetCDF-3 files.

ethanrd · 2023-10-13T15:15:30Z

The idea of more explicitly representing how a netCDF-3 unlimited dimension affects the layout of the data came up years ago but never went further. Describing the layout in terms of chunking seems like a nice way to capture that information. Just like for nc-4/HDF5 and Zarr, it would provide users with some guidance on how various access patterns might perform, comparatively.

DennisHeimbigner (Collaborator) · 2023-10-13T19:04:01Z

But the netcdf-3 access pattern won't look anything like the netcdf-4 chunk-based pattern.

2 replies

ethanrd Oct 13, 2023

@DennisHeimbigner - I'm not sure I understand what you mean. Do you mean the libraries will have to locate and read the bytes differently depending on whether it is a nc-4, Zarr, or nc-3 dataset?

I was thinking about this in terms of information to help a user decide how they will request data. For the example given above with a chunking scheme of (1,11,10), the user would know that a request that cuts through time will be slower than a request that cuts across lat or lon. Whether the dataset is stored as nc-4, Zarr, or nc-3, that chunking scheme would indicate that the data for neighboring time steps are not in the same "chunk", whereas neighboring lat/lon points, for a given time step, are in the same "chunk".

rafaqz Oct 13, 2023

@ethanrd this kind of information is exactly what we need in DiskArrays.jl

The actual structure on disk is not so important, but chunking patterns that give the fastest lazy read are useful.

DennisHeimbigner (Collaborator) · 2023-10-13T21:44:33Z

The fastest read pattern for nc3 is to read the same element for all variables at once,
i.e. V1[1],V2[1],...VR[1].
But note that on nc3, that requires issuing R instances of nc_get_vara and hoping that the nc3 code
is caching records in memory.

2 replies

rafaqz Oct 13, 2023

In DiskArrays.jl we are working at a higher level of abstraction - it's a generic chunked read/write library used by many packages.

We don't have the capacity to optimise for chunked reads across multiple variables. But it is useful to know the fastest lazy read pattern for any single variable, like the (1,11,10) chunk pattern.

This lets us choose to read e.g. (4,11,10) rather than the slower (10,11,4), for example, knowing only the chunk pattern.
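One rough way to compare the two request shapes is to count how many separate contiguous byte runs each one touches in a netCDF-3 file. The sketch below assumes row-major order inside a record and that at least one other record variable sits between consecutive records of var; the function name `contiguous_runs` is hypothetical and not part of DiskArrays.jl or netcdf-c.

```python
def contiguous_runs(shape, count):
    """Count the contiguous byte runs touched by a hyperslab request.

    shape -- full variable shape, record (unlimited) dimension first
    count -- number of elements requested along each dimension
    Assumes row-major order inside a record and that other record
    variables separate consecutive records of this variable.
    """
    runs = 1
    inner_full = True  # True while every dimension seen so far is fully covered
    # Walk the fixed dimensions from fastest-varying to slowest-varying.
    for dim, cnt in reversed(list(zip(shape[1:], count[1:]))):
        if inner_full:
            inner_full = (cnt == dim)   # a partial dimension ends the run
        else:
            runs *= cnt                 # each index here starts a new run
    return runs * count[0]              # records are never adjacent on disk

shape = (12, 11, 10)                    # var(time, lat, lon)
print(contiguous_runs(shape, (4, 11, 10)))
print(contiguous_runs(shape, (10, 11, 4)))
```

Under these assumptions the (4,11,10) request touches 4 runs (one full slab per record), while (10,11,4) touches 110 (eleven short rows in each of ten records), even though the second request returns the same number of elements.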

Alexander-Barth Oct 17, 2023

This would require that a user knows the order in which chunks are written to disk. Currently there is no API to request it, and adding one would be overkill in my opinion (I know that one can deduce it from the variable identifier for NC3 files, but I don't think that this is part of the API). But there is an API to check whether a variable is contiguous on disk or not (and to get the chunk size).

Another point related to the current API: if you create a NetCDF4 file with an unlimited dimension, a variable using that dimension will be reported as chunked even without explicitly calling nc_def_var_chunking, and the data will be chunked automatically along the unlimited dimension. Changing the behavior of nc_inq_var_chunking for NetCDF 3 would make the API more uniform.

DennisHeimbigner (Collaborator) · 2023-10-17T18:52:31Z

If you are reading a single nc3 variable, then the fastest way is to read v[0],v[1],...v[n]. But of course, this will
read a whole record for each v[i].
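If each v[i] access really does pull in a whole record, the cost for a small variable can be estimated with a few lines of arithmetic. This is a sketch under that assumption only; the actual I/O depends on the library's buffering, and the helper `record_read_cost` is hypothetical. The numbers reuse the time/var example from earlier in the thread (4-byte time inside 444-byte records, 12 records).

```python
def record_read_cost(n_records, vsize, recsize):
    """Compare useful bytes with bytes touched when every access pulls in a whole record."""
    useful = n_records * vsize      # bytes that belong to the requested variable
    touched = n_records * recsize   # bytes spanned if whole records are read
    return useful, touched, touched / useful

# time(time): 4 bytes per record inside 444-byte records, 12 records.
useful, touched, factor = record_read_cost(n_records=12, vsize=4, recsize=444)
print(useful, touched, factor)
```

Both totals grow linearly with the number of records, which would be consistent with the read time scaling with time_step that prompted the original question.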
