[Analysis page] Query for datasets via collection metadata? #658
Comments
A few quick thoughts here about high level full catalog searches without collection filters:

stac-api/collection/items/search|aggregation (answer specific questions)
When we implement some aggregation functionality, we will have lots of opportunity for innovation and will be able to support investigations like:

stac-api/collections (provide a little spatial temporal info about all collections)
The collections endpoint gives us gross information about where and when collections have coverage. There is a lot of flexibility in the descriptive metadata we add to collection records including more precise geometry. RE
Relevant properties for using only the stac-api/collections response and one suggestion

Smallsat data explorer
For the case in which a user arrives at an explore interface and simply wants to know what collections have any data within a time and area of interest, we should look into how the smallsat explorer supports completely open ended searches with a sampling grid. Is this something we can do? I think the backend is very similar.
I used the collections endpoint in #666. I think the main concern with this approach is that we can filter datasets only by their bbox, so spatially sparse datasets can show up in the list and still return empty results. Check the preview and let me know what you think / if the filter can be better fine-tuned.
Wow, that turnaround was quick. I am sure we will hit the challenge with spatially (or temporally) sparse datasets eventually, but this solution is better than the current situation, at least for the GHG datasets. I would rather show a few too many datasets (and then have empty plots) than too few. We need to run a few random tests and validate that the results are as expected. All datasets that (possibly) have any data within the query should be listed. To address the spatial case in the future, maybe we could compute the real coverage upon ingest (
This PR uses the `collections` endpoint to get all the collections, and filters them on the client side based on the AOI/date range that the user inputs. I followed the guidance provided in this issue:

- #658

A few things to note

- I am not sure how item search works. Currently, the code catches all the datasets whose bbox intersects the AOI and whose date domain overlaps the selected date range.
- The `collections` endpoint doesn't offer a detailed spatial extent. The bbox is a bounding rectangle that encloses all the data points. If a dataset is sparse, like nightlight and plume, a user can see an empty chart like the screenshot below. @anayeaye and I talked, and it might be helpful to have a flag to signal that a dataset is spatially sparse (something similar to `is_periodic`, but for spatial extent).

![Screen Shot 2023-09-20 at 2 35 39 PM](https://github.com/NASA-IMPACT/veda-ui/assets/4583806/7f87f3d7-9ad2-4fdd-a206-fd7fcab86e6b)

@anayeaye thanks for your help 🙇 and let me know if you see anything unexpected!

## Related issues

Supersedes / temporarily replaces #534
Done!
Currently, we query for datasets available for the user-defined area and date range of interest by asking STAC for all items and then finding all collections.
This approach has several issues: it is costly, it transfers a lot of data to the client that is not needed, and it currently does not return all collections that should be returned, probably because not all items are loaded due to some limit / no pagination.
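To make the cost of the item-based approach concrete, here is a minimal sketch of what a complete version of it would have to do: walk every page of an item search and collect the distinct collection ids. It assumes the search response is a GeoJSON FeatureCollection whose `links` array carries a `rel: "next"` entry with a ready-to-use `href` (the GET flavour of STAC API paging). `fetch_json`, the URLs, and the sample pages are hypothetical stand-ins, not code from this project.

```python
def collection_ids_from_item_search(fetch_json, first_url, max_pages=100):
    """Follow rel="next" links and gather the collection id of every item.

    fetch_json is a hypothetical callable that takes a URL and returns the
    parsed JSON page; max_pages is a safety cap, not part of any API.
    """
    seen = set()
    url = first_url
    pages = 0
    while url is not None and pages < max_pages:
        page = fetch_json(url)
        for feature in page.get("features", []):
            collection_id = feature.get("collection")
            if collection_id is not None:
                seen.add(collection_id)
        # Stop when the server no longer advertises a next page.
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
        pages += 1
    return seen


# Hypothetical two-page response, served from memory instead of the network.
fake_pages = {
    "page-1": {
        "features": [{"collection": "a"}, {"collection": "a"}],
        "links": [{"rel": "next", "href": "page-2"}],
    },
    "page-2": {
        "features": [{"collection": "b"}, {"collection": "c"}],
        "links": [],
    },
}

all_ids = collection_ids_from_item_search(fake_pages.__getitem__, "page-1")
first_page_only = collection_ids_from_item_search(fake_pages.__getitem__, "page-1", max_pages=1)
```

Stopping after the first page (`first_page_only`) misses collections `b` and `c`, which is the kind of incomplete result described above. Reading every page fixes that but makes the transfer cost grow with the number of items.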
We have been discussing adding an aggregation endpoint to the STAC API / pgSTAC that could perform these queries in the database. However, the same issue remains there: these queries are very costly, and pgSTAC (unlike ElasticSearch) is not too fast for them.
An alternative solution is to make use of the total bounding box and date range information on the STAC collection level: STAC collection metadata already contains this information and we would just need to do the intersection in the client. While this approach is less accurate than the item query for edge cases where data coverage is sparse with large gaps, it is a lot faster and could at least limit the number of collections to query.
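A minimal sketch of that client-side intersection, assuming each collection record follows the STAC layout where `extent.spatial.bbox[0]` is the overall `[west, south, east, north]` box and `extent.temporal.interval[0]` is the overall `[start, end]` pair, with `null` marking an open end. The collection ids and extents below are made up for illustration, and the sketch ignores 3D bounding boxes and boxes that cross the antimeridian.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse an RFC 3339 timestamp; None stands for an open-ended interval."""
    if value is None:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def bbox_intersects(a, b):
    """True if two 2D [west, south, east, north] boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def interval_overlaps(interval, start, end):
    """True if the collection interval overlaps the requested [start, end]."""
    lo, hi = parse_time(interval[0]), parse_time(interval[1])
    return (lo is None or lo <= end) and (hi is None or hi >= start)


def filter_collections(collections, aoi_bbox, start, end):
    """Return ids of collections whose overall extent may contain data for the query."""
    matches = []
    for collection in collections:
        extent = collection["extent"]
        overall_bbox = extent["spatial"]["bbox"][0]
        overall_interval = extent["temporal"]["interval"][0]
        if bbox_intersects(overall_bbox, aoi_bbox) and interval_overlaps(
            overall_interval, start, end
        ):
            matches.append(collection["id"])
    return matches


# Hypothetical collection records, trimmed to the fields used here.
example_collections = [
    {"id": "global-monthly", "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [["2016-01-01T00:00:00Z", "2022-12-31T00:00:00Z"]]}}},
    {"id": "regional-ongoing", "extent": {
        "spatial": {"bbox": [[-125, 24, -66, 50]]},
        "temporal": {"interval": [["2019-01-01T00:00:00Z", None]]}}},
    {"id": "old-survey", "extent": {
        "spatial": {"bbox": [[-10, 35, 30, 60]]},
        "temporal": {"interval": [["2000-01-01T00:00:00Z", "2005-12-31T00:00:00Z"]]}}},
]

matching = filter_collections(
    example_collections,
    aoi_bbox=[-100, 30, -90, 40],
    start=datetime(2021, 1, 1, tzinfo=timezone.utc),
    end=datetime(2021, 12, 31, tzinfo=timezone.utc),
)
```

The test is deliberately inclusive: a collection is kept whenever its overall extent touches the query, so sparse datasets can still be listed and then return no data for the selected area.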
We will push for developing an aggregation function on our STAC backend, but that will take a while to develop. In the meantime, replacing the current approach with the fast collection metadata method would be great.
Acceptance criteria