Better communicate search_data spatial filter behavior expectations: polygon/bbox have no effect on "global" datasets, polygon/bbox do not clip/subset data #515

Open
ronygolderku opened this issue Apr 11, 2024 · 3 comments
Labels
impact: documentation Improvements or additions to documentation

Comments

@ronygolderku

ronygolderku commented Apr 11, 2024

A "global" dataset would be one where each granule covers the entire earth. How can we communicate to users when this is the case, and their polygon/bbox query is having no effect? Since we only receive the matching granules, how can we know for sure that every granule in a collection is global so we can know that the users' spatial filters are having no effect? Is there perhaps a boolean in the CMR metadata we can use?
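One possible direction, sketched under stated assumptions: inspect the bounds of the granules that a search returned and warn when all of them are global. The helper names below (`is_global`, `warn_if_filter_is_noop`) are hypothetical and not part of earthaccess, and the sketch assumes each granule's bounds have already been reduced to a plain (west, south, east, north) tuple. Because it only sees the returned granules, it cannot prove that the whole collection is global.

```python
import warnings

# Hypothetical helpers, not part of earthaccess.
# Boxes are (west, south, east, north) tuples in degrees.
GLOBAL_BBOX = (-180.0, -90.0, 180.0, 90.0)


def is_global(bbox, tol=1e-6):
    """Return True if bbox spans the whole globe within a tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(bbox, GLOBAL_BBOX))


def warn_if_filter_is_noop(granule_bboxes):
    """Warn when every returned granule is global.

    Only the granules that came back from a search are inspected, so a
    True result is a strong hint, not proof, that the filter did nothing.
    """
    if granule_bboxes and all(is_global(b) for b in granule_bboxes):
        warnings.warn(
            "All returned granules cover the whole globe; the spatial "
            "filter probably had no effect on the search results.",
            stacklevel=2,
        )
        return True
    return False


# Two global granules, as for a daily global gridded product.
print(warn_if_filter_is_noop([GLOBAL_BBOX, (-180, -90, 180, 90)]))
# One regional granule in the mix, so no warning is raised.
print(warn_if_filter_is_noop([GLOBAL_BBOX, (110, -35, 120, -30)]))
```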


Description:
I am having trouble accessing the GHRSST MUR dataset with the earthaccess library. I followed the tutorial for accessing the dataset, and everything works until I try to filter the data using a bounding box or polygon.

Here's the process I've followed:

  1. I use the earthaccess.search_data function to search for the dataset, specifying the required parameters, including short_name and temporal, and optionally bounding_box or polygon:
         mur_results = earthaccess.search_data(short_name='MUR-JPL-L4-GLOB-v4.1',
                                               polygon=[(115.10, -31.44), (115.10, -32.77), (115.77, -32.77), (115.77, -31.44), (115.10, -31.44)])
  2. The search returns the expected number of granules based on the temporal parameter.
  3. However, when I attempt to open the granules using xr.open_mfdataset(earthaccess.open(mur_results), engine='h5netcdf'), my PC becomes unresponsive and the process gets stuck. The output looks like this:
     (image)
     I also tried bounding_box and got the same result:
         mur_results = earthaccess.search_data(short_name='MUR-JPL-L4-GLOB-v4.1', bounding_box=('115.10', '-32.77', '115.77', '-31.44'))
     Additionally, when I examine the output, I see that the full latitude (-90, 90) and longitude (-180, 180) ranges are included, which suggests that the bounding box or polygon filter is not working.
     This is causing significant inconvenience and delays in accessing the dataset, and it consumes a large amount of system resources.
     One more thing: the rest of that tutorial subsets the data by applying a slice to the dataset. In that case, it is not clear to me what the polygon or bounding_box parameter is for.

Could you please investigate why the bounding box or polygon filter is not working as expected and provide guidance on how to resolve this issue? Thank you.

@mfisher87
Collaborator

mfisher87 commented Apr 11, 2024

Thanks for your report! It looks to me like that tutorial needs a fix. The bounding box isn't doing anything because every granule in that collection covers the whole earth.

>>> r = earthaccess.search_data(short_name = 'MUR-JPL-L4-GLOB-v4.1', temporal = ('2012-05-21', '2012-08-20'), bounding_box = ('-125.41992','45.61181','-116.64844','49.2315'))
Granules found: 92
>>> r = earthaccess.search_data(short_name = 'MUR-JPL-L4-GLOB-v4.1', temporal = ('2012-05-21', '2012-08-20'))
Granules found: 92

These bounding box and polygon parameters only filter the granules returned by the search (CMR under the hood); they don't clip or subset the actual data inside those data files. Can you recommend a way that earthaccess could provide clarity to users who run into this scenario? This is a fairly common issue for our users, and it would be amazing if the software could help them understand why they're seeing what they're seeing and how to move forward. More discussion: #467
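To make the distinction concrete, here is a minimal sketch of the subsetting step that has to happen after the search, using only the standard library and a synthetic 0.25-degree grid in place of a real granule. The function name `bbox_index_slices` is hypothetical. With xarray, the equivalent label-based selection would be `ds.sel(lat=slice(south, north), lon=slice(west, east))`, assuming the coordinates are named `lat` and `lon` and are sorted in ascending order.

```python
from bisect import bisect_left, bisect_right


def bbox_index_slices(lats, lons, bbox):
    """Return (lat_slice, lon_slice) selecting grid points inside bbox.

    lats and lons must be sorted ascending; bbox is
    (west, south, east, north) in degrees.
    """
    west, south, east, north = bbox
    lat_slice = slice(bisect_left(lats, south), bisect_right(lats, north))
    lon_slice = slice(bisect_left(lons, west), bisect_right(lons, east))
    return lat_slice, lon_slice


# A synthetic 0.25-degree global grid standing in for one global granule.
lats = [-90 + 0.25 * i for i in range(721)]
lons = [-180 + 0.25 * i for i in range(1441)]

# The bounding box from the report above.
bbox = (115.10, -32.77, 115.77, -31.44)
lat_slice, lon_slice = bbox_index_slices(lats, lons, bbox)

# Only the grid points inside the box remain after slicing.
print(lats[lat_slice])
print(lons[lon_slice])
```

The search decides which files come back; a step like this decides which part of each file is kept.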

@mfisher87 mfisher87 added the "impact: documentation" (Improvements or additions to documentation) label Apr 11, 2024
@mfisher87 mfisher87 changed the title Apr 11, 2024
    from: Issue with Bounding Box/Polygon in earthaccess Not Functioning Correctly
    to: Better communicate search_data spatial filter behavior expectations: polygon/bbox have no effect on "global" datasets, polygon/bbox do not clip/subset data
@andypbarrett
Collaborator

This behavior could definitely be documented more clearly. My suggestion is that we do this in three places: in the docstrings for search_data and search_datasets; in the user guide/how-to section of the docs; and in a more detailed reference section that documents the behavior of CMR.

It is important to note that the result is expected and correct. What needs to be made clear in the documentation, as @mfisher87 notes, is that earthaccess is performing a spatial filtering of data granules and not a subsetting or clipping of the dataset to the bounds. In a spatially filtered search, granules with bounding boxes that intersect the bounding box or polygon passed to search_data are returned. For datasets and granules that cover the globe, any valid bounding box or polygon will intersect. For datasets, such as some of the MODIS or ICESat-2 products, that are global in extent but have granules that cover a smaller region, the behavior of search_datasets and search_data will be different. In these cases, passing a spatial filter to search_data will return only those granules with bounding boxes that intersect the region of interest, but search_datasets would return MODIS and ICESat-2 for any region of interest because the bounds of the dataset are global.
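The intersection rule can be illustrated with a small planar sketch. This is not how CMR implements spatial search; it assumes plain (west, south, east, north) boxes that do not cross the antimeridian, and the granule names and footprints are invented for illustration.

```python
def bboxes_intersect(a, b):
    """Return True if two (west, south, east, north) boxes overlap.

    Simplification: neither box crosses the antimeridian.
    """
    a_west, a_south, a_east, a_north = a
    b_west, b_south, b_east, b_north = b
    return (a_west <= b_east and b_west <= a_east
            and a_south <= b_north and b_south <= a_north)


def filter_granules(granules, query_bbox):
    """Keep whole granules whose bounds intersect the query box."""
    return [name for name, bounds in granules.items()
            if bboxes_intersect(bounds, query_bbox)]


query = (115.10, -32.77, 115.77, -31.44)

# Hypothetical footprints: every granule of a global product is global,
# while a tiled product has granules covering smaller regions.
global_granules = {"day1": (-180, -90, 180, 90), "day2": (-180, -90, 180, 90)}
tiled_granules = {"tile_a": (110, -40, 120, -30), "tile_b": (-10, 40, 0, 50)}

# Every global granule matches, so the filter removes nothing.
print(filter_granules(global_granules, query))
# Only the tile that overlaps the query box is kept.
print(filter_granules(tiled_granules, query))
```

In both cases complete granules are returned; nothing inside a granule is clipped to the query box.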

An approach for documenting this behavior could be:

For the docstrings, add the following notes to search_data

Complete granules with spatial bounds that intersect **bounding_box**  or **polygon** are returned.

For search_datasets we could add

Datasets with spatial bounds that intersect **bounding_box**  or **polygon** are returned.

For the user guide we can include examples of searches for global and regional datasets for search_datasets; and examples of searches for data granules that cover a region of the dataset bounds (e.g. MODIS or ICESat-2).

We can include a deep dive in the reference documentation that can be linked to from the user guide. @asteiker created a figure for a tutorial that demonstrates the difference between a search filter and a subset:

https://nasa-openscapes.github.io/earthdata-cloud-cookbook/examples/Earthdata-cloud-clinic.html

We could produce a similar diagram for a global dataset.

@mfisher87
Collaborator

> For the user guide we can include examples of searches for global and regional datasets for search_datasets; and examples of searches for data granules that cover a region of the dataset bounds (e.g. MODIS or ICESat-2).

I think pictures illustrating the selection would go a long way in the user guide!
