Question about on-disk storage (netcdf-3) #2224
Replies: 16 comments 4 replies
-
Assuming netCDF-3, with the libhdf5-supplied storage layer completely absent, there aren't any 'chunks', so to speak. The file itself is contiguous, although the byte locations of the data stored along the unlimited dimension might not be contiguous, particularly (I think) if the file has been appended to. This is off the top of my head; let me take a closer look, starting here.
-
Converting this to a discussion since it's an open question which, depending on where we end up, may produce an 'action item' we can create an issue for. This will also help me evaluate how well the 'discussion' tools GitHub provides fit into our workflow :).
-
Ward's comment about netcdf-3 is correct.
-
In netCDF-3, all variables that use the unlimited dimension (also called record variables) are interleaved along the unlimited dimension. So a record variable is not contiguous unless it is the only record variable in the file. The comments in the netCDF-3 file format specification (NUG Appendix B, "File Format Specification") are the best description I've found:
-
I am wondering if nc_inq_var_chunking should report such a variable (a variable with an unlimited dimension stored in a NetCDF 3 file) as chunked, since it is stored in a non-contiguous way on disk. Currently it is reported as contiguous with chunk size 0.
-
Possibly. But since the layout on disk looks in no way like chunking, I think it would be misleading to report it
-
I understand that HDF5 uses a more sophisticated approach to chunking, but isn't it true in both cases that the array is decomposed into individual blocks or chunks, that the data within every chunk is stored contiguously, and that there can be gaps between chunks (where data from other variables is stored)? For best performance, one should process a complete chunk at a time. For example, in the case below, a chunk size of (1,11,10) for the variable
-
The netcdf-c layout for unlimited is AFAIK this:
That is, the i'th record is the concatenation of the i'th element of each of the R variables. This layout does not seem to me to be similar to HDF5 chunks in any way.
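As a rough illustration of the record layout described above, the byte offset of each variable's i'th record can be computed from the per-record sizes. This is only a sketch: the function name `record_layout`, the two example variables, and the starting offset of 1024 are made up for illustration, and it assumes every per-record size is already a multiple of 4 bytes so that no padding is involved.

```python
from itertools import accumulate

def record_layout(vsizes, first_record_start, n_records):
    """Map each record variable index to the byte offsets of its records.

    vsizes: bytes occupied by one record of each record variable,
            in the order the variables appear in the file.
    """
    recsize = sum(vsizes)  # one full record = one slab of every record variable
    # Offset of each variable inside a single record.
    start_in_record = [0] + list(accumulate(vsizes))[:-1]
    return {
        v: [first_record_start + i * recsize + start_in_record[v]
            for i in range(n_records)]
        for v in range(len(vsizes))
    }

# Hypothetical file: time (one double per record) and temp (11 x 10 floats per record).
vsizes = [8, 11 * 10 * 4]
layout = record_layout(vsizes, first_record_start=1024, n_records=3)
for v, offsets in layout.items():
    print(v, offsets)
```

Consecutive records of the same variable are `sum(vsizes)` bytes apart, so with more than one record variable no single variable occupies one contiguous run of bytes.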
-
But if we were to process a variable larger than memory, wouldn't the fastest read pattern be to treat these strips along the unlimited dimension as chunks? That suggests the (1,11,10) chunk size proposed by @Alexander-Barth still makes sense even if the layout is different from HDF5.
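To make the "strips as chunks" idea concrete, here is a toy sketch that builds an interleaved data section in memory and reads one variable back a record at a time. It is not a real netCDF-3 file (there is no header), and the variable names, the 11 x 10 shape, and the helper names are assumptions chosen to match the (1,11,10) example.

```python
import struct

NLAT, NLON = 11, 10   # hypothetical per-record shape of "temp"
N_RECORDS = 4

def record_chunk_shape(shape):
    """Chunk-like shape for a record variable: one record, all remaining dimensions."""
    return (1,) + tuple(shape[1:])

# Each toy record holds one double ("time") followed by NLAT*NLON floats ("temp").
# Values are big-endian, as in netCDF-3.
rec_fmt = f">d{NLAT * NLON}f"
recsize = struct.calcsize(rec_fmt)
data = b"".join(
    struct.pack(rec_fmt, float(i), *[float(i)] * (NLAT * NLON))
    for i in range(N_RECORDS)
)

def read_temp_strip(buf, i):
    """Read record i of temp, which is a single contiguous run of bytes."""
    start = i * recsize + 8            # skip this record's time value
    nbytes = NLAT * NLON * 4
    return struct.unpack(f">{NLAT * NLON}f", buf[start:start + nbytes])

# Process the variable strip by strip instead of loading it whole.
strips = [read_temp_strip(data, i) for i in range(N_RECORDS)]
print(record_chunk_shape((N_RECORDS, NLAT, NLON)), recsize, len(strips))
```

Each strip is contiguous on its own, while successive strips are separated by the other record variables, which is the property a chunk shape of (1,11,10) would communicate to a reader.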
-
I agree that the implementation of the chunking/record variable is different in NetCDF4 and NetCDF3. https://github.com/Alexander-Barth/NetCDF3.jl/blob/main/src/variables.jl#L140-L146 A contiguous variable (as they are currently reported by
The block Or maybe for NetCDF there is a different definition of what a contiguous on-disk layout is that I am not aware of.
-
One problem is that the terms "contiguous" and "chunked" come from the concepts
-
The package NCDatasets was recently integrated with the package DiskArrays, which provides special types for chunked and contiguous arrays. The idea is that a user (or library developer) can provide specialized functions for one or the other case when performance is important. So far we used In any case, we can adapt NCDatasets to add a special case for NetCDF-3 files.
-
The idea of more explicitly representing how a netCDF-3 unlimited dimension affects the layout of the data came up years ago but never went further. Describing the layout in terms of chunking seems like a nice way to capture that information. Just like for nc-4/HDF5 and Zarr, it would provide users with some guidance on how various access patterns might perform, comparatively.
-
But the netcdf-3 access pattern won't look anything like the netcdf-4 chunk-based pattern.
-
The fastest read pattern for nc3 is to read the same element but for all variables at once.
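A small sketch of why that is: reading one record of every variable touches a single contiguous byte range, whereas reading one variable across all records touches one range per record unless it is the only record variable. The function names and sizes below are hypothetical, and padding is ignored.

```python
def ranges_for_record(i, recsize, first_record_start=0):
    """All record variables for record i: a single contiguous byte range."""
    start = first_record_start + i * recsize
    return [(start, start + recsize)]

def ranges_for_variable(var_start_in_record, vsize, recsize, n_records,
                        first_record_start=0):
    """One record variable across all records: one range per record,
    merged with the previous range when the two are adjacent."""
    ranges = []
    for i in range(n_records):
        start = first_record_start + i * recsize + var_start_in_record
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], start + vsize)
        else:
            ranges.append((start, start + vsize))
    return ranges

# An 8-byte time value inside a 448-byte record, over 1000 records.
print(len(ranges_for_variable(0, 8, 448, 1000)))   # many small, separate ranges
# The same variable when it is the only record variable (recsize == vsize).
print(ranges_for_variable(0, 8, 8, 1000))          # collapses to one range
```

Under these assumptions, reading a small record variable such as a time coordinate costs one separate access per record, which would be consistent with a read time that grows with the length of the unlimited dimension.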
-
If you are reading a single nc3 variable, then the fastest way is to read v[0],v[1],...v[n]. But of course, this will
-
If I have a variable `double time(time_step)` with dimension `time_step = UNLIMITED`, how is `time` stored on the disk: as `time_step` separate pieces spread across the file (assume I have several other variables also with at least one unlimited dimension)? I have always assumed that it is contiguous, but a user is seeing some strange behavior querying this variable which makes it look like the read time is proportional to the size of `time_step`, which would not make sense if the variable were stored contiguously...