Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collection search uses paging as opposed to search-after header #483

Closed
doug-newman-nasa opened this issue Mar 5, 2024 · 23 comments
Closed
Labels
type: bug Something isn't working

Comments

@doug-newman-nasa
Copy link
Contributor

The following code uses page_size and page number to iterate through results:
https://github.com/nsidc/earthaccess/blob/8fe60974ce0f6e5d6f8fbec679afb96f12f1506f/earthaccess/search.py#L282C13-L282C64
Limit could be set to a value sufficiently high to cause CMR problems. Search-After is used to combat this.

Documentation on search-after: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#search-after

See similar fix to another CMR python library here: nasa/python_cmr@1702100

@doug-newman-nasa doug-newman-nasa changed the title Search queries appear to be using paging as opposed to Search queries appear to be using paging as opposed to search-after header Mar 5, 2024
@doug-newman-nasa
Copy link
Contributor Author

I plan on fixing this in the same way I did with nasa/python_cmr.

@betolink
Copy link
Member

betolink commented Mar 5, 2024

Hi Doug! thanks for reporting this. We do use search-after for granule search since those results will be in the thousands/millions.

if "CMR-Search-After" in response.headers:

That being said, it would be good practice to also use it for collections, I think that's the line you're referring to.

@mfisher87 mfisher87 changed the title Search queries appear to be using paging as opposed to search-after header Collection search uses paging as opposed to search-after header Mar 5, 2024
@mfisher87 mfisher87 added the type: bug Something isn't working label Mar 5, 2024
@mfisher87
Copy link
Collaborator

Great catch! Luis resolved this long ago for granules (#145), and collection search didn't even cross my mind at the time 😆 It's just not something I normally do I guess.

@doug-newman-nasa
Copy link
Contributor Author

Yeah, I got thrown because you have two classes in the same file! I would suggest factoring out the granule code to be used in the dataset code too since it should be the same except for marshaling the results.

@doug-newman-nasa
Copy link
Contributor Author

Looking more closely at the granule get method,
Search-After is only used if the format of the requested data is umm_json?

elif self._format == "umm_json":

@mfisher87
Copy link
Collaborator

👀 Not sure why that's like that! The whole conditional is a bit confusing to me. I think we badly need a refactor on that piece of core code 😁

@mfisher87
Copy link
Collaborator

@doug-newman-nasa would you be interested in attending one of our bi-weekly hackathons?

@betolink
Copy link
Member

betolink commented Mar 6, 2024

Yeah, this is mainly because earthaccess has only parsers for umm_json. Initially I wanted to write parsers for json, iso19115 and echo10 but I've always run into irregularities with the metadata and umm_json has been the most predictable flavor of the schemas.

@doug-newman-nasa
Copy link
Contributor Author

So, the reason I'm 'looking' here is whenever I find myself in a repo for a python wrapper for CMR I immediately check they aren't deep-paging. You aren't where it counts (granule) but are in other places (collection). But it pointed out some redundancy in your two classes. I'd like to tackle some refactoring while addressing the main issue. Discussing that at a bi-weekly hackathon would be extremely useful (at least to me).

@mfisher87
Copy link
Collaborator

@betolink Do you think the next step is to remove the temporary (?) code for the other parsers? Or start considering writing parsers for the other available CMR formats?

... it pointed out some redundancy in your two classes. I'd like to tackle some refactoring while addressing the main issue. Discussing that at a bi-weekly hackathon would be extremely useful (at least to me).

🙇 This will be so immensely appreciated! It's hard to get refactoring on core code like that done because we have many high-demand features and bugs splitting our attention! That doesn't make the refactors any less important, though. Looking forward to seeing you in two weeks!

@betolink
Copy link
Member

betolink commented Mar 6, 2024

@mfisher87 I'm leaning towards just supporting umm_json for now and when the time comes we can map CMR responses to the proper results parsers. This brings me to 2 related topics,

  • We use umm_json because of its completeness and consistency but, could the same be achieved by GraphQL queries that in theory are faster?
  • If we refactor the way we handle response it would be interesting to follow what pystac-client does, the results are wrapped into a data structure that contains stac items, we could do something similar. This could bring UX improvements to the way we present the data.

@doug-newman-nasa
Copy link
Contributor Author

doug-newman-nasa commented Mar 6, 2024

GraphQL queries that in theory are faster?

GraphQL might be faster if you are doing multiple queries. The primary utility of GraphQL is reducing the number of queries by the client but there is still potentially 'multiple queries' going on behind GraphQL. If you are getting everything you need from umm_json then GraphQL isn't going to speed things up. It's still taking your query and querying the CMR API. But it might have utility elsewhere.

the results are wrapped into a data structure that contains stac items

If you want stac items why not use the CMR STAC API?

@betolink
Copy link
Member

betolink commented Mar 6, 2024

We do not necessarily need the stac items as such, just the way the response is handled. eg.

results = earthaccess.search_data(**params)

Right now, results is a list of parsed umm_json items. Each instance of this list is an enhanced Python dictionary with some convenient custom methods like data_links() etc.

I'm thinking of refactoring this in a way that results will be a wrapped on a class that contains the same umm_json items but the handy methods can operate on the entire list. The semantics are a bit different but that will save users all those for loops to collect links from the results or filter granules by a particular criteria etc Maybe this is not as urgent, I just thought it could be interesting to explore! this is the class in pystac-client: https://pystac-client.readthedocs.io/en/latest/api.html#pystac_client.ItemSearch

@mfisher87
Copy link
Collaborator

Just to check if we're on the same page, you're thinking a search would return a Results object containing Granule objects and having some special methods that can make operations on the whole list easier? I've been thinking this might be useful as well. E.g. results.get_links()?

@betolink
Copy link
Member

betolink commented Mar 6, 2024

@mfisher87 correct! we could also implement pagination like in the stac search results results.next_page() if we don't want to load all the results in one go.

@mfisher87
Copy link
Collaborator

I love that. Or we could promote a generator usage pattern? next(results)? Or to get them all list(results)!

@doug-newman-nasa
Copy link
Contributor Author

Started work on this. What I want to do is add VCR to your testing. I think I only see one test that actually queries CMR: test_data_links. So there is nothing that really tests collection search url construction or results parsing at the collection level. Using VCR we can test this and remove the need to hit CMR each time the test is run. You are also relying on the result contents returned by test_data_links not changing in the future. Thoughts?

@mfisher87
Copy link
Collaborator

TIL VCR! I'm having one of those "how have I not heard of this?" moments ;) That's a really cool idea. I'm all for this! Would you be using the betamax implementation?

@mfisher87
Copy link
Collaborator

cc @danielfromearth this may be helpful for unit tests for #426

@doug-newman-nasa
Copy link
Contributor Author

I'm using https://vcrpy.readthedocs.io/en/latest
I should have a PR ready today.

@doug-newman-nasa
Copy link
Contributor Author

PR submiited: #494

@mfisher87
Copy link
Collaborator

I can't stress enough how much I feel this has improved our tests and will improve our testing practices going forward. Thank you! 💯

@chuckwondo
Copy link
Collaborator

Fixed by #494

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in earthaccess project May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
Status: Done
Development

No branches or pull requests

4 participants