Error opening doi="10.5067/H93644NLXWX9" data #370

Closed
jrbourbeau opened this issue Nov 29, 2023 · 7 comments · Fixed by fsspec/filesystem_spec#1440
Labels
type: bug Something isn't working

Comments

@jrbourbeau
Collaborator

I was working with someone who ran into an issue opening some doi="10.5067/H93644NLXWX9" data files. Here's a similar example:

import datetime as dt
import earthaccess
import h5py

# Get data granules
results = earthaccess.search_data(
    doi="10.5067/H93644NLXWX9",
    temporal=("2022-02-01", "2022-02-02"),
)
# Open file
f = earthaccess.open(results[:1])[0]
print(f.fs.info(f.path))   # Note: `'size': None` here
# Hand off to h5py
result = h5py.File(f, mode="r")

which outputs the following:

Granules found: 25
Opening 1 granules, approx size: 0.92 GB
QUEUEING TASKS | : 1it [00:00, 954.34it/s]
PROCESSING TASKS | : 100%|██████████| 1/1 [00:04<00:00,  4.57s/it]
COLLECTING RESULTS | : 100%|██████████| 1/1 [00:00<00:00, 23831.27it/s]
{'name': 'https://data.gesdisc.earthdata.nasa.gov/data/OCO2_DATA/OCO2_L1B_Science.11r/2022/032/oco2_L1bScGL_40349a_220201_B11006r_220505132311.h5', 'size': None, 'ETag': '"85458100bcb44a3e7a584e2b87aff489"', 'type': 'file'}
Traceback (most recent call last):
  File "/Users/james/projects/nsidc/earthaccess/aronne-test.py", line 12, in <module>
    result = h5py.File(f, mode="r")
  File "/Users/james/mambaforge/envs/earthaccess-dev/lib/python3.9/site-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/Users/james/mambaforge/envs/earthaccess-dev/lib/python3.9/site-packages/h5py/_hl/files.py", line 235, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
  File "h5py/h5fd.pyx", line 155, in h5py.h5fd.H5FD_fileobj_get_eof
  File "h5py/h5fd.pyx", line 155, in h5py.h5fd.H5FD_fileobj_get_eof
  File "h5py/h5fd.pyx", line 155, in h5py.h5fd.H5FD_fileobj_get_eof
  File "/Users/james/mambaforge/envs/earthaccess-dev/lib/python3.9/site-packages/fsspec/spec.py", line 1743, in seek
    nloc = self.size + loc
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

Note that the file size reported by f.fs.info(f.path) is None. For other datasets this value is an integer (e.g. the same code with doi="10.5067/LESQUBLWS18H" works just fine).
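
To illustrate why a missing size surfaces as this particular TypeError, here is a minimal sketch. The traceback suggests h5py seeks relative to the end of the file to find its length, and that seek adds the offset to the file's size. `ToyRemoteFile` and `try_seek_to_end` are hypothetical names for illustration only; this is not fsspec's actual class.

```python
import io


class ToyRemoteFile:
    """Hypothetical stand-in for a buffered remote file (not fsspec's class)."""

    def __init__(self, size):
        self.size = size  # None when the server did not report a length
        self.loc = 0

    def seek(self, loc, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            nloc = loc
        elif whence == io.SEEK_CUR:
            nloc = self.loc + loc
        elif whence == io.SEEK_END:
            # Seeking from the end needs the size; None + int raises TypeError
            nloc = self.size + loc
        else:
            raise ValueError(f"invalid whence: {whence}")
        if nloc < 0:
            raise ValueError("seek before start of file")
        self.loc = nloc
        return self.loc


def try_seek_to_end(f):
    """Seek to end-of-file and return the position, or the error message."""
    try:
        return f.seek(0, io.SEEK_END)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_seek_to_end(ToyRemoteFile(size=100)))
print(try_seek_to_end(ToyRemoteFile(size=None)))
```

With a known size the seek succeeds; with `size=None` it fails in the same way as the traceback above.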

It's not totally clear to me if this is an issue with how earthaccess is asking for the data, something going wrong in the backend server where the data is hosted, or something else. I'm hoping others have a better sense for where we should fix things.

@betolink
Member

I could download the file with earthaccess, and curl can reach it using bearer tokens. I wonder if there are some unusual redirects that fsspec is not handling correctly for this particular dataset when we open it.

@jrbourbeau
Collaborator Author

cc @martindurant in case you have any thoughts

@betolink
Member

betolink commented Nov 29, 2023

The weird thing is that this works just fine:

granule = "https://data.gesdisc.earthdata.nasa.gov/data/OCO2_DATA/OCO2_L1B_Science.11r/2022/032/oco2_L1bScGL_40349a_220201_B11006r_220505132311.h5"

fs = earthaccess.get_fsspec_https_session()

with fs.open(granule) as f:
    print(f.read(10))

The code that earthaccess uses to open the files is https://github.com/nsidc/earthaccess/blob/69f9e46dfda72ae82045a81635f489dcb041c4f3/earthaccess/store.py#L46C5-L46C16. The main difference is that we use the EarthAccessFile wrapper and do not use a context manager to open the files. For most cases this is not a problem.

I also ran this without a context manager and it worked:

granule = "https://data.gesdisc.earthdata.nasa.gov/data/OCO2_DATA/OCO2_L1B_Science.11r/2022/032/oco2_L1bScGL_40349a_220201_B11006r_220505132311.h5"
fs = earthaccess.get_fsspec_https_session()

f = fs.open(granule)
print(f.read(10))

My guess is that there is a bug in the way earthaccess opens the files in the .open() method. It's still weird that it works for other datasets.

@martindurant

The size normally comes from the response to a HEAD request, or from the content-length in a GET request; None means neither provided it. The size is required for random access (which h5py uses to jump around) in conjunction with readahead caching, because you can't ask for bytes beyond the end of the file.
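
As a rough sketch of why read-ahead caching depends on the size: the cache fetches extra bytes past each request, and the end of that window has to be clamped to the end of the file. `readahead_range` and its parameters are hypothetical names for illustration, not fsspec's cache API.

```python
def readahead_range(start, length, blocksize, size):
    """Return the (start, end) byte range a read-ahead cache might request.

    start, length: the read the caller asked for
    blocksize: how many extra bytes to prefetch (assumed policy)
    size: total file size in bytes, or None if the server did not report it
    """
    if size is None:
        # Without the size there is no way to know where the file ends
        raise ValueError("file size is required to clamp the read-ahead window")
    end = min(size, start + length + blocksize)
    return start, end


# Near the end of a 50-byte file the window is clamped to the file size
print(readahead_range(start=0, length=10, blocksize=100, size=50))
```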

For kerchunking HDF5 files, the "first" cacher is better, because most, but not all, metadata is near the start of the file. In that case you wouldn't need the size, so long as the server does at least support byte-range gets. However, this path is not currently implemented in code.

print(f.read(10))

This is probably done by streaming and then cancelling the read after enough bytes are read.

The raw URL seems to return a 303, so this is why HEAD doesn't work? I tried a GET with stream manually and it did return the size, so I'll have a dig around.

@martindurant

OK I have it: the response headers were missing the "encoding" field. Or, actually, it was there but blank (which is not allowed: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding#syntax )
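
A minimal sketch of the kind of header check involved, assuming the size is only trusted when the body is not compressed (for a compressed response, Content-Length counts the compressed bytes rather than the file's bytes). `size_from_headers` is a hypothetical helper written for illustration; it is not the code from fsspec/filesystem_spec#1440.

```python
def size_from_headers(headers):
    """Return the file size implied by HTTP response headers, or None.

    A blank Content-Encoding is treated the same as a missing one or
    "identity", so a server that sends an empty value still yields a size.
    """
    # HTTP header names are case-insensitive
    lowered = {key.lower(): value for key, value in headers.items()}
    length = lowered.get("content-length")
    if length is None:
        return None
    encoding = lowered.get("content-encoding", "").strip().lower()
    if encoding in ("", "identity"):
        return int(length)
    # Compressed body: Content-Length is not the size of the underlying file
    return None


print(size_from_headers({"Content-Length": "1024", "Content-Encoding": ""}))
print(size_from_headers({"Content-Length": "1024", "Content-Encoding": "gzip"}))
```

A check that requires the header to be absent or exactly "identity" would return None for the blank value, which matches the `'size': None` seen in the original report.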

@jrbourbeau
Collaborator Author

Thanks for digging into this one @martindurant! I can confirm fsspec/filesystem_spec#1440 fixes the original failing example here 👍

@jrbourbeau
Collaborator Author

Closing as fsspec/filesystem_spec#1440 has been merged. Thanks again @martindurant

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in earthaccess project Dec 1, 2023