NetCDF and N-Dimensional Variable Support #26

Merged: 28 commits, Feb 12, 2024

Commits
f588583
added support for v2 and v3 attribute messages; consistent use of war…
jpswinski Dec 18, 2023
f2e659d
added perf profiling to utils
jpswinski Dec 20, 2023
404584a
adding hyperslicing
jpswinski Dec 20, 2023
5d01f88
bones of multidimensional support
jpswinski Dec 22, 2023
aa3ab20
Fixed enableFill parameter
jpswinski Dec 22, 2023
fc8cbb1
initial code for reading hyperslices from compact and contiguous data…
jpswinski Jan 12, 2024
032de62
written out multidimensional code... not working
jpswinski Jan 16, 2024
f5779f5
made some parameter names more consistent
jpswinski Jan 23, 2024
a3a750b
fix for chunk slice calculation
jpswinski Jan 23, 2024
4996eba
support for multidimensional datasets added to xarray backend
jpswinski Feb 1, 2024
f5ac835
uncommented section that was left commented
jpswinski Feb 1, 2024
8084120
cosmetic updates to imports; conversions added as an option to utils
jpswinski Feb 2, 2024
e7f7a7c
added multiprocessing of chunk data
jpswinski Feb 2, 2024
713968e
fixed shared memory leaks
jpswinski Feb 2, 2024
9e1f22c
fixed reading slice for ndims=1
jpswinski Feb 5, 2024
f5000b7
fixes for multidimensional reading
jpswinski Feb 5, 2024
c12ed6f
multidimensional chunking is working when checking for high level nod…
jpswinski Feb 6, 2024
14f87cc
fixed check in 2d test
jpswinski Feb 6, 2024
2ad3680
disable multiprocessing for xarray by default
jpswinski Feb 6, 2024
770549a
merging in changes from main manually
jpswinski Feb 6, 2024
2775f97
#27 - fixes for reading symbol table
jpswinski Feb 7, 2024
523a5c9
Merge branch 'netcdf-2' of github.com:ICESat2-SlideRule/h5coro into n…
jpswinski Feb 7, 2024
f4d76e2
updated the benchmarking code
jpswinski Feb 7, 2024
66ed242
fix for reading slice
jpswinski Feb 8, 2024
1028186
added AOS test
jpswinski Feb 8, 2024
d3d1dcc
merged PR #28
jpswinski Feb 8, 2024
455723b
add docstring for group argument
rwegener2 Feb 12, 2024
7b40f0c
add 2d access to examples
rwegener2 Feb 12, 2024
README.md (8 changes: 3 additions & 5 deletions)
@@ -17,9 +17,7 @@

For a full list of which parts of the HDF5 specification **h5coro** implements, see the [compatibility](#compatibility) section at the end of this readme. The major limitations currently present in the package are:
* The code only implements a subset of the HDF5 specification. **h5coro** has been shown to work on a number of different datasets, but depending on the version of the HDF5 C library used to write the file, and what options were used during its creation, it is very possible that some part of **h5coro** will need to be updated to support reading it. Hopefully, over time as more of the spec is implemented, this will become less of a problem.
* The code is not optimized for local file access. If you are reading data that is local, you will get much better performance from `h5py`.
* It is a read-only library and has no functionality to write HDF5 data.
- * It targets one dimensional datasets and is not optimized for high dimensionality data. In practice this means that if you subset a dataset that is two or more dimensions, only the first dimension will be subsetted with all of the other dimensions collapsed into one serial array of elements. This limitation will be addressed in future releases.

## Installation

@@ -59,8 +57,8 @@ from h5coro import h5coro, s3driver
h5obj = h5coro.H5Coro(f'{my_bucket}/{path_to_hdf5_file}', s3driver.S3Driver)

# (3) read
- datasets = [{'dataset': '/path/to/dataset1', 'startrow': 0, 'numrows': h5coro.ALL_ROWS},
-             {'dataset': '/path/to/dataset2', 'startrow': 324, 'numrows': 50}]
+ datasets = [{'dataset': '/path/to/dataset1', 'hyperslice': []},
+             {'dataset': '/path/to/dataset2', 'hyperslice': [324, 374]}]
promise = h5obj.readDatasets(datasets=datasets, block=True)

# (4) display
@@ -154,7 +152,7 @@ We follow a standard Forking Workflow for code changes and additions. Submitted
| ___Bogus Message___ | <span style="color:red">No</span> | | Unversioned |
| ___Group Info Message___ | <span style="color:red">No</span> | | Version 0 |
| ___Filter Pipeline Message___ | <span style="color:green">Yes</span> | Version 1, 2 | |
- | ___Attribute Message___ | <span style="color:blue">Partial</span> | Version 1 | Version 2, 3 |
+ | ___Attribute Message___ | <span style="color:blue">Partial</span> | Version 1, 2, 3 | Shared message support for v3 |
| ___Object Comment Message___ | <span style="color:red">No</span> | | Unversioned |
| ___Object Modification Time (Old) Message___ | <span style="color:red">No</span> | | Unversioned |
| ___Shared Message Table Message___ | <span style="color:red">No</span> | | Version 0 |
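For reference, the updated README example above replaces the `startrow`/`numrows` arguments with a single `hyperslice` argument. Below is a minimal sketch of reading a two-dimensional subset with the new interface, assuming one `(start, stop)` pair per dimension as in the benchmark further down; the bucket path and dataset name are hypothetical.

from h5coro import h5coro, s3driver

# hypothetical bucket/object and dataset name
h5obj = h5coro.H5Coro('my-bucket/path/to/granule.h5', s3driver.S3Driver)

# one (start, stop) pair per dimension: rows 0-49, columns 0-9
datasets = [{'dataset': '/path/to/2d/dataset', 'hyperslice': [(0, 50), (0, 10)]}]
promise = h5obj.readDatasets(datasets=datasets, block=True)

# the promise is indexed by dataset name and yields the sliced rows
for row in promise['/path/to/2d/dataset']:
    print(row)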
benchmarks/aos_profiler.py (134 changes: 134 additions & 0 deletions)
@@ -0,0 +1,134 @@
# imports
import time
import pathlib
import s3fs
import h5py
import boto3
import h5coro
from h5coro import s3driver, logger

# configure h5coro
import logging
logger.config(logLevel=logging.CRITICAL)

# h5py test
def eso_h5py_test(hyperslice, with_s3fs):

    # test parameters
    granule = 'OR_ABI-L1b-RadF-M6C14_G16_s20192950450355_e20192950500042_c20192950500130.nc'
    bucketname = 'eso-west2-curated'
    objectname = 'AOS/PoR/geo/GOES-16-ABI-L1B-FULLD/2019/295/04/' + granule
    objecturl = 's3://' + bucketname + '/' + objectname
    filename = '/data/AOS/' + granule
    variable = '/Rad'

    # remove any stale local copy first
    try:
        pathlib.Path(filename).unlink()
    except OSError:
        pass

    # read dataset
    start = time.perf_counter()
    if with_s3fs:
        s3 = s3fs.S3FileSystem()
        fp = h5py.File(s3.open(objecturl, 'rb'), mode='r')
    else:
        s3 = boto3.client('s3')
        with open(filename, 'wb') as f:
            s3.download_fileobj(bucketname, objectname, f)
        fp = h5py.File(filename, mode='r')
    duration1 = time.perf_counter() - start

    # sum dataset
    start = time.perf_counter()
    total = 0
    if len(hyperslice) > 0:
        for row in fp[variable][hyperslice[0][0]:hyperslice[0][1]]:
            total += sum(row[hyperslice[1][0]:hyperslice[1][1]])
    else:
        for row in fp[variable]:
            total += sum(row)
    duration2 = time.perf_counter() - start

    # return results
    return duration1, duration2, total

# h5coro test
def eso_h5coro_test(hyperslice):

    # test parameters
    granule = 'eso-west2-curated/AOS/PoR/geo/GOES-16-ABI-L1B-FULLD/2019/295/04/OR_ABI-L1b-RadF-M6C14_G16_s20192950450355_e20192950500042_c20192950500130.nc'
    variable = '/Rad'
    datasets = [{"dataset": variable, "hyperslice": hyperslice}]
    credentials = {"profile": "default"}

    # read dataset
    start = time.perf_counter()
    h5obj = h5coro.H5Coro(granule, s3driver.S3Driver, errorChecking=True, verbose=False, credentials=credentials, multiProcess=False)
    promise = h5obj.readDatasets(datasets, block=True, enableAttributes=False)
    duration1 = time.perf_counter() - start

    # sum dataset
    start = time.perf_counter()
    total = 0
    for row in promise[variable]:
        total += sum(row)
    duration2 = time.perf_counter() - start

    # return results
    return duration1, duration2, total

# h5py - hypersliced - with s3fs
hyperslice=[(0,10), (0,10)]
request_time, sum_time, result = eso_h5py_test(hyperslice, True)
print(f'\ns3fs: {hyperslice}\n======================')
print(f'Result = {result}')
print(f'Opening Time = {request_time:.3f} secs')
print(f'Summing Time = {sum_time:.3f} secs')
print(f'Total Time = {sum_time + request_time:.3f} secs')

# h5py - full dataset - with s3fs
hyperslice=[]
request_time, sum_time, result = eso_h5py_test(hyperslice, True)
print(f'\ns3fs: {hyperslice}\n======================')
print(f'Result = {result}')
print(f'Opening Time = {request_time:.3f} secs')
print(f'Summing Time = {sum_time:.3f} secs')
print(f'Total Time = {sum_time + request_time:.3f} secs')

# h5py - hypersliced - download
hyperslice=[(0,10), (0,10)]
request_time, sum_time, result = eso_h5py_test(hyperslice, False)
print(f'\nh5py: {hyperslice}\n======================')
print(f'Result = {result}')
print(f'Opening Time = {request_time:.3f} secs')
print(f'Summing Time = {sum_time:.3f} secs')
print(f'Total Time = {sum_time + request_time:.3f} secs')

# h5py - full dataset - download
hyperslice=[]
request_time, sum_time, result = eso_h5py_test(hyperslice, False)
print(f'\nh5py: {hyperslice}\n======================')
print(f'Result = {result}')
print(f'Opening Time = {request_time:.3f} secs')
print(f'Summing Time = {sum_time:.3f} secs')
print(f'Total Time = {sum_time + request_time:.3f} secs')

# h5coro - hypersliced
hyperslice=[(0,10), (0,10)]
request_time, sum_time, result = eso_h5coro_test(hyperslice=hyperslice)
print(f'\nh5coro: {hyperslice}\n======================')
print(f'Result = {result}')
print(f'Opening Time = {request_time:.3f} secs')
print(f'Summing Time = {sum_time:.3f} secs')
print(f'Total Time = {sum_time + request_time:.3f} secs')

# h5coro - full dataset
hyperslice=[]
request_time, sum_time, result = eso_h5coro_test(hyperslice=hyperslice)
print(f'\nh5coro: {hyperslice}\n======================')
print(f'Result = {result}')
print(f'Opening Time = {request_time:.3f} secs')
print(f'Summing Time = {sum_time:.3f} secs')
print(f'Total Time = {sum_time + request_time:.3f} secs')
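The benchmark above constructs `H5Coro` with `multiProcess=False`. Commit e7f7a7c adds multiprocessing of chunk data, and commit 2ad3680 disables it by default for xarray; below is a minimal sketch of turning it on for a read, assuming the `multiProcess` flag is the only switch required (the file path is hypothetical).

import h5coro
from h5coro import s3driver

# hypothetical file; multiProcess=True enables the chunk-level multiprocessing added in this PR
h5obj = h5coro.H5Coro('my-bucket/path/to/file.nc', s3driver.S3Driver, multiProcess=True)
promise = h5obj.readDatasets([{'dataset': '/Rad', 'hyperslice': [(0, 100), (0, 100)]}], block=True)

# sum the sliced rows, mirroring the benchmark's access pattern
total = sum(sum(row) for row in promise['/Rad'])
print(total)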