Is your feature request related to a problem? Please describe.
Currently, there doesn't seem to be any source class for Intake-esm collections, meaning that any Intake catalog containing them must use intake_esm.esm_datastore as the driver (as seen in Pangeo's climate catalog):
plugins:
  source:
    - module: intake_esm
sources:
  cmip6_gcs:
    args:
      esmcol_obj: "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
    description: 'CMIP6 in Google Cloud Storage'
    driver: intake_esm.esm_datastore
    metadata: {}
This means that accessing these entries directly calls the intake_esm.esm_datastore constructor and consequently loads the Intake-esm collection's underlying DataFrame into memory:
In [1]: import intake
In [2]: cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/climate.yaml")
In [3]: cat["cmip6_gcs"]
Out[3]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>
This can be a computationally expensive task for larger collections, and in some cases completely unnecessary if we only wish to view the metadata of the collection's entry.
Describe the solution you'd like
The implementation of an ESMSource class, similar to intake-xarray's ZarrSource, which would store the initial arguments needed to create an esm_datastore, but wouldn't initialize it until a dedicated method was called.
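A minimal sketch of that deferred-initialization idea follows. It is not part of intake-esm: the class name LazyESMSource, its describe() and to_datastore() methods, and the factory parameter are all hypothetical names chosen for illustration, and passing the collection URL as the first positional argument to intake_esm.esm_datastore is an assumption about its constructor.

```python
from typing import Any, Callable, Optional


class LazyESMSource:
    """Hypothetical sketch: keep esm_datastore arguments, build it on demand."""

    def __init__(
        self,
        esmcol_obj: str,
        description: str = "",
        metadata: Optional[dict] = None,
        factory: Optional[Callable[..., Any]] = None,
        **kwargs: Any,
    ) -> None:
        # Only store the arguments; nothing is downloaded or parsed here.
        self.esmcol_obj = esmcol_obj
        self.description = description
        self.metadata = metadata or {}
        self._kwargs = kwargs
        self._factory = factory  # hypothetical hook, handy for testing
        self._datastore: Any = None

    def describe(self) -> dict:
        # Cheap metadata lookup that never touches the collection itself.
        return {
            "esmcol_obj": self.esmcol_obj,
            "description": self.description,
            "metadata": self.metadata,
        }

    def to_datastore(self) -> Any:
        # The expensive step: runs once, and only when explicitly requested.
        if self._datastore is None:
            factory = self._factory
            if factory is None:
                # Deferred import, so merely crawling a catalog stays light.
                import intake_esm

                # Assumption: the collection URL is the first argument.
                factory = intake_esm.esm_datastore
            self._datastore = factory(self.esmcol_obj, **self._kwargs)
        return self._datastore
```

With something like this, crawling a catalog would only call describe() on each entry, and the DataFrame would be loaded solely for the entries where to_datastore() is invoked.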
This ESMSource could then be supplied as a driver in Intake catalogs, making it substantially faster to crawl catalogs containing ESM collections.
Describe alternatives you've considered
The current implementation of ESM collections within Intake catalogs works fine for accessing individual collections. When crawling catalogs with ESM collections, I typically use cat._entries["some_esm_collection"] to avoid directly loading the collections. This succeeds in getting the metadata of an ESM collection without opening it, but it is cumbersome when crawling catalogs with mixed entry types.
Looking through intake-esm/source.py, it seems I've spoken too soon! There are the ESMDataSource and ESMGroupDataSource classes, which can be used as drivers for Intake, although their behavior is different from something like intake-xarray.ZarrSource.
In particular, the source classes look for a pandas.Series or pandas.DataFrame as input, respectively, which I'm not exactly sure how to do in Intake - would this be accomplished by providing something like the output of pandas.*.to_json() but YAML formatted?
Regardless, I'm happy to conceptualize a data source class that takes the URL of an ESM collection as its primary argument (maybe called ESMCollectionDataSource?).