The cloud_hosted flag for granule queries doesn't work #565

Open
chuckwondo opened this issue May 10, 2024 · 5 comments
Labels
needs: decision needs: help Extra attention is needed type: bug Something isn't working

Comments

@chuckwondo
Collaborator

As discovered in discussion of #563, using the cloud_hosted parameter for a granule query does not work.

This reproduces the problem:

import earthaccess

results = earthaccess.search_data(
    short_name="VIIRSJ1_L2_OC",
    version="R2022.0",
    cloud_hosted=True,
    temporal=("2024-02-27 00:00:00", "2024-02-27 23:59:00"),
    count=10,
    bounding_box=(-180, 0, 0, 90),
)

The specified collection is not cloud hosted, so the query should return an empty list of results, but instead returns a non-empty list of results.

Alternatively, instead of returning an empty list of results, we could raise an exception. If we take this route, we would need to decide whether to use a built-in type, such as ValueError or TypeError, or define a custom exception.
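
If the exception route were taken, one possible shape is a custom exception that subclasses ValueError, so callers that already catch the built-in type keep working. This is only a sketch: the names `CloudHostedMismatchError` and `check_cloud_hosted` are hypothetical and not part of earthaccess.

```python
class CloudHostedMismatchError(ValueError):
    """Hypothetical: no collection matches the requested cloud_hosted value."""

    def __init__(self, short_name: str, cloud_hosted: bool) -> None:
        self.short_name = short_name
        self.cloud_hosted = cloud_hosted
        hosted = "cloud-hosted" if cloud_hosted else "on-prem"
        super().__init__(
            f"No {hosted} collection found for short_name={short_name!r}"
        )


def check_cloud_hosted(
    short_name: str, requested: bool, matching_collections: list
) -> None:
    """Raise if a collection query filtered by cloud_hosted came back empty."""
    if not matching_collections:
        raise CloudHostedMismatchError(short_name, requested)
```

Subclassing a built-in keeps both options from the paragraph above open: users can catch either ValueError or the more specific type.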

@chuckwondo
Collaborator Author

Another option would be to eliminate the cloud_hosted parameter from granule queries, particularly given that it is not actually directly supported by the underlying CMR Search API. Only collection queries support it. Thus, this parameter requires us to make an implicit collection query under the covers, prior to submitting the granule search (if there is a collection with a cloud_hosted value matching the parameter value).

By eliminating the parameter, it is up to the user to either know whether or not the collection is cloud hosted, or to issue a separate collection query first to determine whether or not it is cloud hosted. Given that we would need to make such a collection query under the covers anyway, if we keep the cloud_hosted parameter for granule queries, there would be no difference in performance. In fact, by not implicitly performing the collection query, the user is able to avoid the extra query, if they already know whether or not the collection is cloud hosted. Further, being explicit over implicit is the 2nd principle of The Zen of Python, so it is worth considering.
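
To make the explicit alternative concrete, here is a minimal sketch of the two-step flow a user would write themselves. The helper name `search_granules_if_cloud_hosted` is hypothetical, and the two search functions are passed in as arguments so the sketch runs without network access. It also assumes that each collection result exposes its concept ID at `["meta"]["concept-id"]` and that the granule search accepts a `concept_id` keyword.

```python
from typing import Any, Callable, List


def search_granules_if_cloud_hosted(
    search_datasets: Callable[..., List[Any]],
    search_data: Callable[..., List[Any]],
    short_name: str,
    version: str,
    **granule_filters: Any,
) -> List[Any]:
    """Query collections explicitly, then granules only if one matched."""
    collections = search_datasets(
        short_name=short_name, version=version, cloud_hosted=True
    )
    if not collections:
        # No cloud-hosted collection matches, so skip the granule query.
        return []
    # Assumption: the concept ID lives under meta/concept-id.
    concept_id = collections[0]["meta"]["concept-id"]
    return search_data(concept_id=concept_id, **granule_filters)
```

With earthaccess installed, the intended call would pass `earthaccess.search_datasets` and `earthaccess.search_data` as the first two arguments. A user who already knows the collection is cloud hosted can skip the helper and call `search_data` directly, which is the saved query mentioned above.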

@betolink
Member

Thanks for framing this problem, @chuckwondo. I'm inclined to retain the cloud_hosted parameter at the granule level in order to save our users the extra query. Likewise, there is no DOI parameter at the granule level, and (anecdotally) this is one of the most useful features in the search_data method according to users.

@asteiker
Member

asteiker commented Nov 1, 2024

@chuckwondo I believe this example aligns with this particular problem:

results = earthaccess.search_data(
    doi='10.5067/ATLAS/ATL15.004',
    bounding_box=(180, 60, -180, 90),  # (lower_left_lon, lower_left_lat, upper_right_lon, upper_right_lat)
    cloud_hosted=True,
)

This is an example in a CryoCloud book tutorial by @mrsiegfried and @wsauthoff which returns files back from the on-prem copy of the data, with the same DOI. If you do the same search in Earthdata Search and set their "Available in Earthdata Cloud" filter, the correct cloud-hosted collection is returned.

This is especially problematic for DAACs including NSIDC who are still migrating to Earthdata Cloud and have both on-prem and cloud-hosted collections available.

This may be another good use of a decision committee per #761. I'm also inclined to retain these parameters in search_data() for simplicity, as long as the behavior is properly documented and we ensure that it does the right thing by applying the CMR cloud_hosted collection filter before the granule filter.

@asteiker
Member

asteiker commented Nov 1, 2024

Interestingly, this behavior is only problematic if search_data() is given a DOI instead of a short_name. For example:

results = earthaccess.search_data(
    short_name='ATL06',
    # doi='10.5067/ATLAS/ATL06.006',
    cloud_hosted=True,
    temporal=("2023-02-01 00:00:00", "2024-02-27 23:59:00"),
    bounding_box=(10, 0, 20, 90),
    count=1,
)

This will return the correct cloud hosted granule results. If you swap to DOI instead, it will return the ECS (on-prem)-hosted file.

@andypbarrett
Collaborator

I think the problem is that DOI is only searchable at the collection level (and also not all granules have a DOI), so internally search_data uses DataCollections to get the concept_id.

However, this call to DataCollections does not know if cloud_hosted has been set or not. It just blindly grabs the concept_id for the first collection returned.

I think this should be a separate issue.

One solution would be to add .cloud_hosted(self.cloud_hosted) to L923

result = earthaccess.search.DataCollections().doi('10.5067/ATLAS/ATL06.006').cloud_hosted(False).get()
result[0]["meta"]["s3-links"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[50], line 2
      1 result = earthaccess.search.DataCollections().doi('10.5067/ATLAS/ATL06.006').get()
----> 2 result[0]["meta"]["s3-links"]

KeyError: 's3-links'
result = earthaccess.search.DataCollections().doi('10.5067/ATLAS/ATL06.006').cloud_hosted(True).get()
result[0]["meta"]["s3-links"]
['nsidc-cumulus-prod-protected/ATLAS/ATL06/006',
 'nsidc-cumulus-prod-public/ATLAS/ATL06/006']
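
The output above suggests a way to illustrate the difference between "grab the first collection" and "grab the first collection that matches the flag". The sketch below works on plain dictionaries shaped like the results shown. The helper names `is_cloud_hosted` and `pick_concept_id` are hypothetical, and using the presence of `s3-links` as a proxy for cloud hosting is an assumption drawn only from this output, as is the `concept-id` key under `meta`. It is not how earthaccess resolves DOIs today; passing `.cloud_hosted(...)` to the collection query, as suggested above, would filter on the CMR side instead.

```python
from typing import Any, Dict, List, Optional


def is_cloud_hosted(collection: Dict[str, Any]) -> bool:
    """Assumption: a collection with non-empty meta/s3-links is cloud hosted."""
    return bool(collection.get("meta", {}).get("s3-links"))


def pick_concept_id(
    collections: List[Dict[str, Any]], cloud_hosted: Optional[bool]
) -> Optional[str]:
    """Return the concept ID of the first collection matching cloud_hosted.

    With cloud_hosted=None the first collection wins, which mirrors the
    behavior described in this comment.
    """
    for collection in collections:
        if cloud_hosted is None or is_cloud_hosted(collection) == cloud_hosted:
            return collection["meta"]["concept-id"]
    return None
```

For a DOI shared by an on-prem and a cloud-hosted collection, `pick_concept_id(results, True)` would skip the on-prem entry even when it is listed first.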

@asteiker asteiker moved this from 🆕 New to Needs Decision in earthaccess project Jan 7, 2025
Projects
Status: Needs Decision - Backlog
Development

No branches or pull requests

4 participants