ESMSource class for collections in Intake catalogs #304

Open
charlesbluca opened this issue Dec 9, 2020 · 3 comments
Labels: enhancement, feature

Comments

@charlesbluca

Is your feature request related to a problem? Please describe.

Currently, there doesn't seem to be any source class for Intake-esm collections, meaning that any Intake catalog containing them must use intake_esm.esm_datastore as the driver (as seen in Pangeo's climate catalog):

plugins:
  source:
    - module: intake_esm

sources:
  cmip6_gcs:
    args:
      esmcol_obj: "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
    description: 'CMIP6 in Google Cloud Storage'
    driver: intake_esm.esm_datastore
    metadata: {}

This means that accessing these entries directly calls the intake_esm.esm_datastore constructor and consequently loads the Intake-esm collection's underlying DataFrame into memory:

In [1]: import intake

In [2]: cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/climate.yaml")

In [3]: cat["cmip6_gcs"]
Out[3]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>

This can be a computationally expensive task for larger collections, and in some cases completely unnecessary if we only wish to view the metadata of the collection's entry.

Describe the solution you'd like

An ESMSource class, similar to intake-xarray's ZarrSource, that stores the arguments needed to create an esm_datastore but doesn't initialize it until a dedicated method is called:

In [4]: cat["cmip6_gcs"]
Out[4]: <name: cmip6_gcs>

In [5]: cat["cmip6_gcs"].load()
Out[5]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>

This ESMSource could then be supplied as a driver in Intake catalogs, making it substantially faster to crawl catalogs containing ESM collections.
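
To make the deferred-construction idea concrete, here is a minimal sketch. It is only an illustration, not an existing intake-esm API: the class name LazyESMSource, its opener parameter, and the load() caching behavior are all assumptions, and it deliberately doesn't subclass any Intake base class. The default opener assumes intake_esm.esm_datastore accepts the collection URL as its first argument, as the esmcol_obj entry in the catalog above suggests.

```python
from typing import Any, Callable, Dict, Optional


def _default_opener(esmcol_obj: str, **kwargs: Any) -> Any:
    # Imported lazily so that building the wrapper never touches intake-esm.
    # Assumption: esm_datastore takes the collection URL as its first argument.
    import intake_esm

    return intake_esm.esm_datastore(esmcol_obj, **kwargs)


class LazyESMSource:
    """Hypothetical stand-in for the proposed ESMSource (not an intake-esm API)."""

    def __init__(
        self,
        name: str,
        esmcol_obj: str,
        opener: Callable[..., Any] = _default_opener,
        description: str = "",
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> None:
        # Only cheap attributes are stored here; nothing is downloaded or parsed.
        self.name = name
        self.esmcol_obj = esmcol_obj
        self.description = description
        self.metadata = metadata or {}
        self._opener = opener
        self._kwargs = kwargs
        self._datastore: Any = None

    def load(self) -> Any:
        # Build the underlying datastore on first use, then reuse it.
        if self._datastore is None:
            self._datastore = self._opener(self.esmcol_obj, **self._kwargs)
        return self._datastore

    def __repr__(self) -> str:
        return f"<name: {self.name}>"
```

With something like this, the entry's repr and metadata stay cheap, and the collection's DataFrame is only read when load() is called.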

Describe alternatives you've considered

The current implementation of ESM collections within Intake catalogs works fine for accessing single collections. When crawling catalogs that contain ESM collections, I typically use cat._entries["some_esm_collection"] to avoid loading the collections directly. This gets the metadata of an ESM collection without opening it, but it relies on a private attribute and becomes cumbersome when crawling catalogs with mixed entry types.
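
Another way to look at entry metadata without instantiating any driver, sketched here under the assumption that the catalog YAML has already been parsed into a plain dict (for example with a YAML loader). The helper name summarize_sources is hypothetical and not part of Intake or intake-esm; it only reads the sources mapping shown in the catalog above.

```python
from typing import Any, Dict, Mapping


def summarize_sources(catalog: Mapping[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Collect driver, description and metadata per source without opening anything."""
    summary: Dict[str, Dict[str, Any]] = {}
    for name, spec in catalog.get("sources", {}).items():
        summary[name] = {
            "driver": spec.get("driver"),
            "description": spec.get("description", ""),
            "metadata": spec.get("metadata", {}),
        }
    return summary


# The catalog from the issue description, written out as an already-parsed dict.
catalog = {
    "plugins": {"source": [{"module": "intake_esm"}]},
    "sources": {
        "cmip6_gcs": {
            "args": {
                "esmcol_obj": "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
            },
            "description": "CMIP6 in Google Cloud Storage",
            "driver": "intake_esm.esm_datastore",
            "metadata": {},
        }
    },
}

for name, info in summarize_sources(catalog).items():
    print(name, info["driver"], info["description"])
```

This sidesteps the private attribute, but it only works on the raw catalog file and doesn't follow nested catalogs, so it is no substitute for a proper lazy source class.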

@andersy005
Member

@charlesbluca, I think this is a great idea, would you be interested in submitting a PR? :)

@andersy005 added the enhancement and feature labels Dec 10, 2020
@charlesbluca
Author

Sure! I'll use this issue for any outstanding questions I have in working on this.

@charlesbluca
Author

Looking through intake-esm/source.py, it seems I've spoken too soon! There are the ESMDataSource and ESMGroupDataSource classes, which can be used as drivers for Intake, although their behavior is different from something like intake-xarray.ZarrSource.

In particular, the source classes look for a pandas.Series or pandas.DataFrame as input, respectively, which I'm not exactly sure how to do in Intake - would this be accomplished by providing something like the output of pandas.*.to_json() but YAML formatted?

Regardless, I'm happy to conceptualize a data source class that takes the URL of an ESM collection as its primary argument (maybe called ESMCollectionDataSource?).
