ESMSource class for collections in Intake catalogs #304

charlesbluca · 2020-12-09T21:24:02Z

Is your feature request related to a problem? Please describe.

Currently, there doesn't seem to be any source class for Intake-esm collections, meaning that any Intake catalog containing them must use intake_esm.esm_datastore as the driver (as seen in Pangeo's climate catalog):

plugins:
  source:
    - module: intake_esm

sources:
  cmip6_gcs:
    args:
      esmcol_obj: "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
    description: 'CMIP6 in Google Cloud Storage'
    driver: intake_esm.esm_datastore
    metadata: {}

This means that accessing these entries directly calls the intake_esm.esm_datastore constructor and consequently loads the Intake-esm collection's underlying DataFrame into memory:

In [1]: import intake

In [2]: cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/climate.yaml")

In [3]: cat["cmip6_gcs"]
Out[3]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>

This can be computationally expensive for larger collections, and it is completely unnecessary when we only want to view the metadata of the collection's entry.

Describe the solution you'd like

I'd like an ESMSource class, similar to intake-xarray's ZarrSource, that stores the arguments needed to create an esm_datastore but doesn't initialize it until a dedicated method is called:

In [4]: cat["cmip6_gcs"]
Out[4]: <name: cmip6_gcs>

In [5]: cat["cmip6_gcs"].load()
Out[5]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>

This ESMSource could then be supplied as a driver in Intake catalogs, making it substantially faster to crawl catalogs containing ESM collections.
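To make the idea concrete, here is a minimal sketch of the deferred-loading behavior, not an actual Intake driver. The class name LazyESMSource and the factory parameter are hypothetical, introduced only for illustration; a real implementation would presumably subclass Intake's data source base class and call intake_esm.esm_datastore itself.

```python
from typing import Any, Callable, Dict, Optional


class LazyESMSource:
    """Hypothetical lazy wrapper: stores how to build a collection, builds it on load()."""

    def __init__(
        self,
        name: str,
        esmcol_obj: str,
        factory: Callable[..., Any],
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> None:
        # Only cheap attributes are stored here; nothing is fetched or parsed.
        self.name = name
        self.esmcol_obj = esmcol_obj
        self.metadata = metadata or {}
        self._factory = factory      # assumption: a callable accepting esmcol_obj=...
        self._kwargs = kwargs
        self._collection: Any = None

    def load(self) -> Any:
        """Build the collection on first call and reuse it afterwards."""
        if self._collection is None:
            self._collection = self._factory(esmcol_obj=self.esmcol_obj, **self._kwargs)
        return self._collection

    def __repr__(self) -> str:
        # Mirrors the lightweight repr proposed above.
        return f"<name: {self.name}>"
```

With intake-esm installed, passing factory=intake_esm.esm_datastore should give the behavior described above, since the catalog entry already supplies esmcol_obj as a keyword argument. Viewing the entry would then cost nothing until load() is called.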

Describe alternatives you've considered

The current implementation of ESM collections within Intake catalogs works fine for accessing a single collection. When crawling catalogs with ESM collections, I typically use cat._entries["some_esm_collection"] to avoid loading the collections directly. This succeeds in getting the metadata of an ESM collection without opening it, but it relies on a private attribute and becomes cumbersome when crawling catalogs with mixed entry types.
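The workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not Intake's API: it assumes each value in the entries mapping exposes a describe() method returning a dict, and the "description" and "driver" keys are assumptions about that dict. FakeEntry and summarize_entries are hypothetical names; FakeEntry stands in for a real catalog entry so the example runs without Intake.

```python
from typing import Any, Dict, Mapping


class FakeEntry:
    """Hypothetical stand-in for a catalog entry that has not been instantiated."""

    def __init__(self, description: str, driver: str) -> None:
        self._description = description
        self._driver = driver

    def describe(self) -> Dict[str, Any]:
        # Returns metadata only; no data source is constructed.
        return {"description": self._description, "driver": self._driver}


def summarize_entries(entries: Mapping[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Collect per-entry metadata without opening any of the sources."""
    summary: Dict[str, Dict[str, Any]] = {}
    for name, entry in entries.items():
        info = entry.describe()
        summary[name] = {
            "description": info.get("description"),
            "driver": info.get("driver"),
        }
    return summary


entries = {
    "cmip6_gcs": FakeEntry("CMIP6 in Google Cloud Storage", "intake_esm.esm_datastore"),
}
print(summarize_entries(entries))
```

In a real session the mapping would be cat._entries, which is why the approach is awkward: it depends on a private attribute rather than on the public item access that triggers the load.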


andersy005 · 2020-12-10T20:38:32Z

@charlesbluca, I think this is a great idea, would you be interested in submitting a PR? :)

charlesbluca · 2020-12-11T15:19:35Z

Sure! I'll use this issue for any outstanding questions I have in working on this.

charlesbluca · 2020-12-11T16:10:59Z

Looking through intake-esm/source.py, it seems I've spoken too soon! There are the ESMDataSource and ESMGroupDataSource classes, which can be used as drivers for Intake, although their behavior is different from something like intake-xarray.ZarrSource.

In particular, the two source classes expect a pandas.Series and a pandas.DataFrame as input, respectively, and I'm not sure how to supply those from an Intake catalog. Would this be accomplished by providing something like the output of pandas.*.to_json(), but formatted as YAML?

Regardless, I'm happy to conceptualize a data source class that takes the URL of an ESM collection as its primary argument (maybe called ESMCollectionDataSource?).

charlesbluca mentioned this issue Dec 9, 2020

Creating ESM source for Intake-esm entries intake/intake#552

Closed

andersy005 added the enhancement and feature labels Dec 10, 2020
