[Analysis page] Query for datasets via collection metadata? #658

Closed
1 task
j08lue opened this issue Sep 19, 2023 · 4 comments

@j08lue
Contributor

j08lue commented Sep 19, 2023

Currently, we find the datasets available for the user-defined area and date range of interest by asking STAC for all matching items and then collecting the collections those items belong to.

This approach has several issues: it is costly, it transfers a lot of data to the client that is not needed, and it currently does not return all the collections it should, probably because not all items are loaded due to some limit or missing pagination.

We have been discussing adding an aggregation endpoint to the STAC API / pgSTAC that could perform these queries in the database. However, the same issue remains there: these queries are very costly, and pgSTAC (unlike Elasticsearch) is not particularly fast at them.

An alternative solution is to make use of the total bounding box and date range information on the STAC collection level: STAC collection metadata already contains this information and we would just need to do the intersection in the client. While this approach is less accurate than the item query for edge cases where data coverage is sparse with large gaps, it is a lot faster and could at least limit the number of collections to query.
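A minimal sketch of what that client-side intersection could look like, assuming the collections have already been fetched from the collections endpoint. The `extent.spatial.bbox` and `extent.temporal.interval` fields follow the STAC Collection spec; the function names (`filterCollections`, `bboxesIntersect`, `intervalsOverlap`) and the sample data are hypothetical, not taken from the VEDA codebase. Boxes that cross the antimeridian are not handled.

```typescript
// Minimal subset of a STAC Collection: only the fields this sketch reads.
interface StacCollection {
  id: string;
  extent: {
    spatial: { bbox: number[][] };
    temporal: { interval: (string | null)[][] };
  };
}

type Bbox2D = [number, number, number, number]; // [west, south, east, north]

// Reduce a 2D (4-value) or 3D (6-value) STAC bbox to its 2D part.
// Missing values become NaN, so a malformed bbox never matches.
function to2D(bbox: number[]): Bbox2D {
  const [a = NaN, b = NaN, c = NaN, d = NaN, e = NaN] = bbox;
  return bbox.length === 6 ? [a, b, d, e] : [a, b, c, d];
}

// Assumption: neither box crosses the antimeridian.
function bboxesIntersect(a: Bbox2D, b: Bbox2D): boolean {
  return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
}

// A null start or end in a STAC interval means open-ended.
function intervalsOverlap(interval: (string | null)[], start: Date, end: Date): boolean {
  const [startStr, endStr] = interval;
  const from = startStr ? Date.parse(startStr) : -Infinity;
  const to = endStr ? Date.parse(endStr) : Infinity;
  return from <= end.getTime() && start.getTime() <= to;
}

// Keep every collection whose extent touches both the AOI and the date range.
function filterCollections(
  collections: StacCollection[],
  aoi: Bbox2D,
  start: Date,
  end: Date
): StacCollection[] {
  return collections.filter(
    (c) =>
      c.extent.spatial.bbox.some((b) => bboxesIntersect(to2D(b), aoi)) &&
      c.extent.temporal.interval.some((i) => intervalsOverlap(i, start, end))
  );
}

// Hypothetical sample data for illustration only.
const exampleCollections: StacCollection[] = [
  {
    id: "global-monthly",
    extent: {
      spatial: { bbox: [[-180, -90, 180, 90]] },
      temporal: { interval: [["2015-01-01T00:00:00Z", null]] },
    },
  },
  {
    id: "regional-2010",
    extent: {
      spatial: { bbox: [[100, -10, 120, 10]] },
      temporal: { interval: [["2010-01-01T00:00:00Z", "2010-12-31T23:59:59Z"]] },
    },
  },
];

const matchingIds = filterCollections(
  exampleCollections,
  [-10, 35, 5, 45],
  new Date("2020-01-01"),
  new Date("2020-06-30")
).map((c) => c.id);
```

Because the test is inclusive on both axes, it errs toward listing a collection rather than dropping it, at the cost of false positives for sparse datasets.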

We will push for developing an aggregation function on our STAC backend, but that will take a while. In the meantime, replacing the current approach with the fast collection metadata method would be great.

Acceptance criteria

  • Tested whether collection metadata can be used to query for collections that cover area/time of interest
@hanbyul-here hanbyul-here self-assigned this Sep 19, 2023
@anayeaye

A few quick thoughts here about high level full catalog searches without collection filters:

stac-api/collection/items/search|aggregation (answer specific questions)

When we implement some aggregation functionality, we will have lots of opportunity for innovation and will be able to support investigations like:

  • A disaster happened in this AOI, I want to know what VEDA collections have recent spectral data
  • I am doing a historical study and I want to know what VEDA collections have measured precipitation in an AOI-TOI
  • I am starting a project and want to see how much data is available, as an item count per collection, matching specified item metadata filters

stac-api/collections (provide a little spatial temporal info about all collections)

The collections endpoint gives us gross information about where and when collections have coverage. There is a lot of flexibility in the descriptive metadata we add to collection records including more precise geometry.

RE:

> An alternative solution is to make use of the total bounding box and date range information on the STAC collection level: STAC collection metadata already contains this information and we would just need to do the intersection in the client. While this approach is less accurate than the item query for edge cases where data coverage is sparse with large gaps, it is a lot faster and could at least limit the number of collections to query.

Relevant properties when using only the stac-api/collections response, plus one suggestion:

  • extent.temporal
  • dashboard:is_periodic
  • dashboard:time_density
  • extent.spatial
  • New dashboard:continuous_spatial_distribution or :is_spotlight, ... or some other indicator of data that do not have the same spatial coverage at all times. This property could be used to trigger different behavior or to inform the user to not expect global coverage over the entire timespan of the dataset
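As a rough illustration of how a client might act on these properties, here is a sketch that turns them into a user-facing caveat. It assumes `dashboard:is_periodic` is a boolean and `dashboard:time_density` a string. `dashboard:continuous_spatial_distribution` is only the property proposed above, assumed here to be a boolean where `false` means coverage varies over time. The `coverageNotice` helper and its messages are hypothetical.

```typescript
// Only the collection properties discussed above; all optional because
// a given collection record may not carry them.
interface DashboardCollection {
  id: string;
  "dashboard:is_periodic"?: boolean;
  "dashboard:time_density"?: string;
  // Proposed in this thread, not an existing property.
  // Assumption: false means the spatial coverage is not the same at all times.
  "dashboard:continuous_spatial_distribution"?: boolean;
}

// Return a caveat to show next to a dataset, or null if none applies.
function coverageNotice(c: DashboardCollection): string | null {
  const notes: string[] = [];
  if (c["dashboard:continuous_spatial_distribution"] === false) {
    notes.push("spatial coverage varies over time; the selected area may have no data");
  }
  // Assumption: a non-periodic collection has irregular time steps.
  if (c["dashboard:is_periodic"] === false) {
    notes.push("time steps are irregular; the selected dates may have no data");
  }
  return notes.length > 0 ? c.id + ": " + notes.join("; ") : null;
}

// Hypothetical collection record for illustration only.
const sparseExampleNotice = coverageNotice({
  id: "sparse-example",
  "dashboard:is_periodic": true,
  "dashboard:time_density": "month",
  "dashboard:continuous_spatial_distribution": false,
});
```

Returning `null` when no flag is set keeps the default behavior unchanged for collections that do not carry the new property.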

Smallsat data explorer

For the case in which a user arrives at an explore interface and simply wants to know which collections have any data within a time and area of interest, we should look into how the smallsat explorer supports completely open-ended searches with a sampling grid. Is this something we can do? I think the backend is very similar.
https://github.com/NASA-IMPACT/csdap-frontend/
https://csdap.earthdata.nasa.gov/explore/

[Image: smallsat]

@hanbyul-here
Collaborator

I used the collections endpoint in #666. I think the main concern with this approach is that we can filter datasets only by their bbox, so spatially sparse datasets can be listed and then return empty results. Check the preview and let me know what you think, or if the filter can be fine-tuned further.

@j08lue
Contributor Author

j08lue commented Sep 20, 2023

Wow, that turnaround was quick.

I am sure we will hit the challenge with spatially (or temporally) sparse datasets eventually, but this solution is better than the current situation, at least for the GHG datasets. Better to show a few too many datasets (and then have empty plots) than too few.

We need to run a few random tests and validate that the results are as expected. All datasets that (possibly) have any data within the query should be listed.

To address the spatial case in the future, maybe we could compute the real coverage upon ingest (union(existing_geom, new_geom)) and store that in addition to the max bbox. 🤷
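One way to picture the ingest-time idea is the sketch below. It is not a true geometry union (that would need a geometry library or a database function): a coarse set of occupied grid cells stands in for `union(existing_geom, new_geom)`, stored alongside the running max bbox. The names `Coverage`, `ingest`, `mayHaveData` and the 10-degree cell size are all hypothetical.

```typescript
type Bbox = [number, number, number, number]; // [west, south, east, north]

interface Coverage {
  bbox: Bbox | null; // running max bbox over all ingested items
  cells: Set<string>; // occupied grid cells, keyed "col:row"
}

const CELL_DEG = 10; // hypothetical grid resolution in degrees

// All grid cells a bbox touches. Conservative: a box that ends exactly on a
// cell edge also claims the neighboring cell, so coverage is never missed.
function cellsFor(b: Bbox): string[] {
  const keys: string[] = [];
  const c0 = Math.floor(b[0] / CELL_DEG);
  const c1 = Math.floor(b[2] / CELL_DEG);
  const r0 = Math.floor(b[1] / CELL_DEG);
  const r1 = Math.floor(b[3] / CELL_DEG);
  for (let c = c0; c <= c1; c++) {
    for (let r = r0; r <= r1; r++) {
      keys.push(c + ":" + r);
    }
  }
  return keys;
}

// Fold one new item footprint into the stored coverage (returns a new object).
function ingest(cov: Coverage, item: Bbox): Coverage {
  const prev = cov.bbox;
  const bbox: Bbox = prev
    ? [
        Math.min(prev[0], item[0]),
        Math.min(prev[1], item[1]),
        Math.max(prev[2], item[2]),
        Math.max(prev[3], item[3]),
      ]
    : item;
  const cells = new Set<string>(cov.cells);
  for (const key of cellsFor(item)) {
    cells.add(key);
  }
  return { bbox, cells };
}

// True if any grid cell touched by the AOI has had data ingested.
function mayHaveData(cov: Coverage, aoi: Bbox): boolean {
  return cellsFor(aoi).some((key) => cov.cells.has(key));
}

// Two far-apart hypothetical items: the max bbox spans the gap between them,
// but the cell set does not.
let exampleCoverage: Coverage = { bbox: null, cells: new Set<string>() };
exampleCoverage = ingest(exampleCoverage, [-100, 30, -95, 35]);
exampleCoverage = ingest(exampleCoverage, [100, -10, 105, -5]);
```

With this data, an AOI in the gap between the two items falls inside the max bbox yet touches no occupied cell, which is the case the max bbox alone cannot rule out.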

hanbyul-here added a commit that referenced this issue Sep 22, 2023
This PR uses `collections` endpoint to get all the collections, and
filters them on the client side based on aoi/date range that user
inputs. I followed the guidance provided in this issue:

- #658

A few things to note
- I am not sure how item search works. Currently, the code catches all
the datasets with the bbox that intersects with the AOI && the date
domain that overlaps with the selected date range.
- `collections` endpoint doesn't offer a detailed spatial extent. The
bbox is a single rectangle that encloses all the data points, not the
actual footprint. If a dataset is sparse, like nightlight and plume, a
user can see an empty chart like the screenshot below. @anayeaye and I
talked, and it might be helpful to have a flag to signal that a dataset
is spatially sparse (something similar to `is_periodic`, but for spatial
extent).


![Screen Shot 2023-09-20 at 2 35 39 PM](https://github.com/NASA-IMPACT/veda-ui/assets/4583806/7f87f3d7-9ad2-4fdd-a206-fd7fcab86e6b)


@anayeaye thanks for your help 🙇 and let me know if you see anything
unexpected!

## Related issues

Supersedes / temporarily replaces
#534
@j08lue
Contributor Author

j08lue commented Sep 28, 2023

Done!
