Can I choose parallel=True like in xr.open_mfdataset? #70
intake-thredds doesn't use open_mfdataset and therefore won't use parallel. (See intake-thredds/intake_thredds/source.py, line 101 at a659ae8.)
Have you tried a smaller example, e.g. just 2 files instead of 127? Try with f11?.
Thank you for a thorough, reproducible example, @kthyng!
It's my understanding that the bottleneck is in these lines: intake-thredds/intake_thredds/source.py, lines 96 to 99 at a659ae8.
As far as I can tell, there is no reason why we can't parallelize the for-loop (in some way), since there are no dependencies within this loop. We could use Python's built-in concurrent.futures or Dask delayed.
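For instance, the loop over catalog entries could be fanned out with concurrent.futures. This is just a sketch: `open_all` and the stand-in opener are hypothetical names, standing in for whatever xarray open call source.py makes per file, so the example runs without a THREDDS server.

```python
from concurrent.futures import ThreadPoolExecutor

def open_all(urlpaths, open_one):
    # Threads fit here because opening a remote dataset is I/O-bound;
    # open_one would be something like xr.open_dataset in source.py.
    # pool.map preserves the input order of urlpaths.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(open_one, urlpaths))

# Stand-in opener so the sketch runs without a server:
datasets = open_all(['url_a', 'url_b', 'url_c'], lambda u: {'source': u})
```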
@andersy005 could intake-thredds be refactored to use open_mfdataset (to be changed also in intake-xarray) so it can use parallel, or could we implement dask delayed here in intake-thredds? And would that speed up the call, or do you think there is a different bottleneck?
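A dask.delayed version of the same idea might look like the sketch below; again the per-file opener is a stand-in dict (the real loop would wrap an xarray open call), so this is an illustration of the pattern, not intake-thredds code.

```python
import dask

@dask.delayed
def open_one(urlpath):
    # Stand-in for the real per-file open (e.g. xr.open_dataset)
    return {'source': urlpath}

# Build one lazy task per file, then execute them all at once;
# dask runs the tasks in parallel and returns results in order.
tasks = [open_one(u) for u in ['url_a', 'url_b']]
datasets = list(dask.compute(*tasks))
```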
We could refactor intake-thredds and implement our own drivers, but I'm wondering if it's worth the effort? :) I like the current approach of outsourcing the data loading work to …
I think intake-xarray should pass …
Yes, I saw that. @martindurant's idea seems relatively straightforward: from the … could be used in opendap.py. But I'm not as sure what should change in intake-thredds/intake_thredds/source.py, lines 84 to 103 at a659ae8.
works for me:

```python
import intake
import xarray as xr

dates = xr.cftime_range(start='2021-12-29', periods=3, freq='1D')
date = dates[0]
cat_url = 'https://opendap.co-ops.nos.noaa.gov/thredds/catalog/NOAA/LEOFS/MODELS/catalog.xml'
source = intake.open_thredds_merged(
    cat_url,
    path=[date.strftime('%Y'),
          date.strftime('%m'),
          date.strftime('%d'),
          date.strftime('nos.leofs.fields.f11?.%Y%m%d.t12z.nc')],
    concat_kwargs={'dim': 'time',
                   'data_vars': 'minimal',
                   'coords': 'minimal',
                   'compat': 'override',
                   'combine_attrs': 'override'},
    xarray_kwargs=dict(
        drop_variables=['siglay', 'siglev', 'Itime2'],
    ),
)
source.to_dask()
```

```
Dataset(s): 100%|██████████████████████████████| 10/10 [01:25<00:00,  8.51s/it]

<xarray.Dataset>
Dimensions:  (nele: 11509, node: 6106, three: 3, time: 10, maxnode: 11,
              maxelem: 9, four: 4, siglay: 20, siglev: 21)
```

This only fetches 10 datasets and took ~30 sec.
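As the name concat_kwargs suggests, these options appear to be forwarded to xarray's concat step (an assumption about intake-thredds internals); here is a tiny self-contained illustration of what they do, using two stand-in datasets in place of real model files:

```python
import xarray as xr

# Two tiny stand-in datasets playing the role of consecutive model files
ds1 = xr.Dataset({'temp': ('time', [1.0])}, coords={'time': [0]})
ds2 = xr.Dataset({'temp': ('time', [2.0])}, coords={'time': [1]})

# Same options as the concat_kwargs in the call above:
# only variables with a 'time' dimension are concatenated, and
# conflicting coords/attrs are resolved from the first dataset.
combined = xr.concat([ds1, ds2], dim='time', data_vars='minimal',
                     coords='minimal', compat='override',
                     combine_attrs='override')
```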
@aaronspring Thanks for demonstrating that. That makes sense that it will work for fewer files, but I do really need to be able to access more files at once to be able to construct enough of a model time series to work with. Plus, I would think that access to the files should be comparable whether using …
As I have been trying to work on this, I see that a difficulty in the change I'm looking for is that a catalog entry would need to be created that has a list of urlpaths, to then be passed to …
I had a realization since I posted this! I can get the behavior I want with … With the small code change in that PR, I can do this (just showing 2 files to save space):
With the full 127 file locations this took 2.5 min, so it seems comparable to running with just … I'll close this next week if people don't have anything else they want to say at that point.
I would like to keep this issue open for a parallelisation of intake-thredds.
Hi! I stumbled across this intake driver and it looks like what I need, so thank you so much for your work!

I have already made an example for myself where I have a bunch of file locations that are on a thredds server, and then I read them in with xr.open_mfdataset(), like so: … This ended up being about 2 minutes for 127 files.

But, I need to get this combined dataset represented in an intake catalog, which is where intake-thredds comes in. I think I have properly mapped the keywords I used in xr.open_mfdataset to the API in intake-thredds like this: … But, when I then try to look at the resultant combined, lazily loaded Dataset with source.to_dask(), it takes forever to load and breaks with "Bad Gateway" before it can finish. The only difference I can see is that I don't know how to use parallel=True in the call to intake.open_thredds_merged, which I used in calling xr.open_mfdataset.

Is there a way to use parallel=True? Thank you for your help!
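The original xr.open_mfdataset code block did not survive in this thread, but a call of the kind described might look roughly like the sketch below. The kwargs are assumptions mirroring the concat_kwargs used in the intake-thredds call earlier in the thread, and `open_combined`/`urlpaths` are hypothetical names.

```python
import xarray as xr

def open_combined(urlpaths):
    # Sketch only: kwargs here are guesses based on the concat_kwargs
    # shown earlier in the thread, not the poster's actual call.
    return xr.open_mfdataset(
        urlpaths,
        parallel=True,          # open each file in a dask.delayed task
        combine='nested',
        concat_dim='time',
        data_vars='minimal',
        coords='minimal',
        compat='override',
        combine_attrs='override',
    )
```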