Iterating through catalog items is awkward and slow #66
Comments
If you do a profile, I assume that all the time is being spent in instantiating the Source objects, which involves (I think) talking to the server - but you don't actually need this just to look at the hrefs. You should be able to iterate the entries dict (without instantiating) with
which is the method used in internal methods such as walk, to get all the entries. Each entry has a |
To better understand what's going on, I added some logging to all our
diff --git a/intake_stac/catalog.py b/intake_stac/catalog.py
index 145c590..47da8f6 100644
--- a/intake_stac/catalog.py
+++ b/intake_stac/catalog.py
@@ -1,3 +1,4 @@
+import logging
import os.path
import warnings
@@ -7,6 +8,8 @@ from intake.catalog.local import LocalCatalogEntry
from pkg_resources import get_distribution
__version__ = get_distribution('intake_stac').version
+logger = logging.getLogger(__name__)
+
# STAC catalog asset 'type' determines intake driver:
# https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#media-types
@@ -116,6 +119,7 @@ class StacCatalog(AbstractStacCatalog):
Load the STAC Catalog.
"""
for subcatalog in self._stac_obj.get_children():
+ logger.info("Loading subcatalog %s from %s", subcatalog, self)
if isinstance(subcatalog, pystac.Collection):
# Collection subclasses Catalog, so check it first
driver = StacCollection
@@ -131,6 +135,7 @@ class StacCatalog(AbstractStacCatalog):
)
for item in self._stac_obj.get_items():
+ logger.info("Loading item %s from %s", item, self)
self._entries[item.id] = LocalCatalogEntry(
name=item.id,
description='',
@@ -182,6 +187,7 @@ class StacItemCollection(AbstractStacCatalog):
if not self._stac_obj.ext.implements('single-file-stac'):
raise ValueError("StacItemCollection requires 'single-file-stac' extension")
for feature in self._stac_obj.ext['single-file-stac'].features:
+ logger.info("Loading feature %s from %s", feature, self)
self._entries[feature.id] = LocalCatalogEntry(
name=feature.id,
description='',
@@ -232,6 +238,7 @@ class StacItem(AbstractStacCatalog):
Load the STAC Item.
"""
for key, value in self._stac_obj.assets.items():
+ logger.info("Loading asset %s from %s", key, self)
self._entries[key] = StacAsset(key, value)
def _get_metadata(self, **kwargs):
Here's pystac loading an asset from a fairly nested catalog:
In [1]: import pystac
In [2]: %time pcat = pystac.read_file("https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json")
CPU times: user 11.9 ms, sys: 1.02 ms, total: 12.9 ms
Wall time: 317 ms
In [3]: %time pcat.get_child("CBERS4A").get_child("CBERS4A-WPM").get_child("CBERS4A WPM 209").get_child("CBERS4A WPM 209/139").get_item("CBERS_4A_WPM_20200730_209_139_L4").get_assets()['B0']
...:
CPU times: user 47.2 ms, sys: 11.8 ms, total: 58.9 ms
Wall time: 1.67 s
Out[3]: <Asset href=s3://cbers-pds/CBERS4A/WPM/209/139/CBERS_4A_WPM_20200730_209_139_L4/CBERS_4A_WPM_20200730_209_139_L4_BAND0.tif>
And here's intake-stac:
In [4]: import intake, coloredlogs; coloredlogs.install()
In [5]: %time cat = intake.open_stac_catalog("https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json")
2021-06-24 10:21:37 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4> from <Intake catalog: CBERS>
2021-06-24 10:21:37 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4A> from <Intake catalog: CBERS>
CPU times: user 153 ms, sys: 9.79 ms, total: 163 ms
Wall time: 966 ms
In [6]: %time cat['CBERS4A']["CBERS4A-WPM"]["CBERS4A WPM 209"]["CBERS4A WPM 209/139"]["CBERS_4A_WPM_20200730_209_139_L4"]["B0"]
2021-06-24 10:21:43 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Collection id=CBERS4A-WPM> from <Intake catalog: CBERS4A>
2021-06-24 10:21:44 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Collection id=CBERS4A-MUX> from <Intake catalog: CBERS4A>
2021-06-24 10:21:44 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Collection id=CBERS4A-WFI> from <Intake catalog: CBERS4A>
2021-06-24 10:21:45 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4A WPM 209> from <Intake catalog: CBERS4A-WPM>
2021-06-24 10:21:46 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4A WPM 209/139> from <Intake catalog: CBERS4A WPM 209>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading item <Item id=CBERS_4A_WPM_20200730_209_139_L4> from <Intake catalog: CBERS4A WPM 209/139>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset thumbnail from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset metadata from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B0 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B1 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B2 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B3 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B4 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
CPU times: user 757 ms, sys: 999 ms, total: 1.76 s
Wall time: 4.2 s
I think that we make an HTTP request (in both intake-stac and pystac) each time we load a sub-catalog / collection. IIUC though, |
We have been going through a similar discussion at intake-sdmk (lazy dict linked), where it would be infeasible to fetch all the metadata of all the possible catalogs and items when creating the hierarchy. |
Thanks. In this case things might be easier. I think we always know the keys for subcatalogs / sub-collections:
In [18]: pcat.to_dict()
Out[18]:
{'type': 'Catalog',
'id': 'CBERS',
'stac_version': '1.0.0',
'description': "Catalogs from CBERS 4/4A missions' imagery on AWS",
'links': [{'rel': <RelType.SELF: 'self'>,
'href': 'https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json',
'type': <MediaType.JSON: 'application/json'>},
{'rel': <RelType.ROOT: 'root'>,
'href': './catalog.json',
'type': <MediaType.JSON: 'application/json'>},
{'rel': 'child', 'href': './CBERS4/catalog.json'},
{'rel': 'child', 'href': './CBERS4A/catalog.json'}],
'stac_extensions': [],
'title': 'CBERS on AWS'}
In [19]: pcat.get_child_links()
Out[19]:
[<Link rel=child target=<Catalog id=CBERS4>>,
<Link rel=child target=<Catalog id=CBERS4A>>]
So we can know the keys of sub-collections / catalogs statically, without additional HTTP requests. Accessing that key would require loading the catalog (another HTTP request). I think this also applies to child items in a Catalog, just using I'm unsure how dynamic catalogs complicate this. |
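To illustrate the point about static keys, here is a minimal sketch that reads the child links straight out of a catalog dict like the one above, with no further requests. The helper name `child_hrefs` is hypothetical and not part of pystac or intake-stac. Note that the link entries in this catalog carry only an href, not the child's id, so the sketch returns hrefs.

```python
from typing import List
from urllib.parse import urljoin


def child_hrefs(catalog: dict, base_url: str) -> List[str]:
    """Return absolute hrefs of every 'child' link in a STAC catalog dict.

    Hypothetical helper: it only inspects the already-loaded JSON and
    makes no network requests.
    """
    return [
        urljoin(base_url, link["href"])
        for link in catalog.get("links", [])
        if link.get("rel") == "child"
    ]


# Trimmed copy of the catalog dict shown above.
catalog = {
    "type": "Catalog",
    "id": "CBERS",
    "links": [
        {"rel": "self", "href": "https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json"},
        {"rel": "root", "href": "./catalog.json"},
        {"rel": "child", "href": "./CBERS4/catalog.json"},
        {"rel": "child", "href": "./CBERS4A/catalog.json"},
    ],
}

hrefs = child_hrefs(catalog, "https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json")
```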
Here's a proposed solution: loading a Catalog / Collection will eagerly store the child links, available on
self._links: Dict[str, StacObject] = {link.target.id: link.target for link in stac_obj.get_child_links()}
Then when users look up an item, we'll check for the key in
My two unknowns right now:
|
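A rough sketch of what such a lazy lookup could look like, written as plain Python rather than against the real intake or pystac classes. `LazyChildren`, `fake_loader`, and the key-to-href mapping are all assumptions made for illustration; in practice the loader would be whatever fetches and parses one child catalog.

```python
from collections.abc import Mapping
from typing import Callable, Dict, Iterator, List


class LazyChildren(Mapping):
    """Dict-like whose keys are known up front; values load on first access."""

    def __init__(self, hrefs: Dict[str, str], loader: Callable[[str], object]):
        self._hrefs = dict(hrefs)  # key -> href, known without extra requests
        self._loader = loader      # hypothetical: loads one child from its href
        self._cache: Dict[str, object] = {}

    def __getitem__(self, key: str) -> object:
        if key not in self._hrefs:
            raise KeyError(key)
        if key not in self._cache:
            # Only now pay the cost (e.g. one HTTP request) for this child.
            self._cache[key] = self._loader(self._hrefs[key])
        return self._cache[key]

    def __contains__(self, key: object) -> bool:
        # Override Mapping's default, which would call __getitem__ and load.
        return key in self._hrefs

    def __iter__(self) -> Iterator[str]:
        return iter(self._hrefs)

    def __len__(self) -> int:
        return len(self._hrefs)


calls: List[str] = []


def fake_loader(href: str) -> dict:
    """Stand-in for a real fetch; records which hrefs were requested."""
    calls.append(href)
    return {"loaded_from": href}


children = LazyChildren(
    {"CBERS4": "./CBERS4/catalog.json", "CBERS4A": "./CBERS4A/catalog.json"},
    fake_loader,
)
keys = list(children)          # listing keys triggers no loads
known = "CBERS4" in children   # membership test triggers no loads either
first = children["CBERS4A"]    # first access loads this child only
again = children["CBERS4A"]    # second access is served from the cache
```

With this shape, walking down a nested path loads one child per level instead of every sibling at each level.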
Normally, a getitem returns one of these as its value - which is primitive laziness (entries should be cheap to create, data source instances can have a fixed cost). In your case, you might still make an entry if there is any scope for parameters. But since you have your laziness up front in the dict-like, you could maybe skip the entry altogether. |
A common need is getting URLs from item assets within a catalog, which involves iterating over hundreds of items. Here is a quick example:
Initializing the catalog is fast!
Iterating through items is slow. I'm a bit confused by the syntax too. I find myself wanting to use an integer index to get the first item in a catalog (first_item = catalog[0]) or to simplify the code block below to
hrefs = [item.band.metadata.href for item in catalog]
(currently, iterating through a catalog returns item IDs as strings).
As for speed, it only takes microseconds to iterate through the underlying JSON via sat-stac
@martindurant any suggestions here? I'm a bit perplexed about where the code lives to handle list(catalog) or
for item in catalog:
    ...
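For comparison, here is a minimal sketch of collecting hrefs directly from already-loaded STAC item JSON, which is the kind of iteration described above as taking microseconds. It does not use intake-stac or sat-stac; `asset_hrefs` and the example items and bucket name are made up for illustration. It relies only on the STAC item layout, where each item has an `assets` mapping whose values carry an `href`.

```python
from typing import Iterable, List


def asset_hrefs(items: Iterable[dict], asset_key: str) -> List[str]:
    """Collect the href of one named asset from each item dict that has it.

    Hypothetical helper operating on plain STAC item dicts.
    """
    return [
        item["assets"][asset_key]["href"]
        for item in items
        if asset_key in item.get("assets", {})
    ]


# Made-up items; only the "assets" -> key -> "href" layout follows the STAC spec.
items = [
    {"id": "item-1", "assets": {"B0": {"href": "s3://example-bucket/item-1_B0.tif"}}},
    {"id": "item-2", "assets": {"B0": {"href": "s3://example-bucket/item-2_B0.tif"}}},
    {"id": "item-3", "assets": {"thumbnail": {"href": "s3://example-bucket/item-3.png"}}},
]

hrefs = asset_hrefs(items, "B0")
```

Items without the requested asset are skipped rather than raising, which is a design choice of this sketch and may not suit every catalog.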