Collection search uses paging as opposed to search-after header #483

doug-newman-nasa · 2024-03-05T03:17:22Z

The following code uses page_size and page number to iterate through results:
https://github.com/nsidc/earthaccess/blob/8fe60974ce0f6e5d6f8fbec679afb96f12f1506f/earthaccess/search.py#L282C13-L282C64
A limit set high enough forces deep paging, which can cause problems for CMR. The Search-After header exists to avoid this.

Documentation on search-after: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#search-after

See similar fix to another CMR python library here: nasa/python_cmr@1702100
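
For illustration, here is a minimal sketch of a search-after loop, under stated assumptions. `fetch_page` is a hypothetical callable (not part of earthaccess or python_cmr) that performs one CMR request with the given headers and returns the parsed items together with the response headers. The only CMR behaviour relied on is the `CMR-Search-After` header from the documentation linked above: the value returned with one page is sent back with the next request, so no page number is involved.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# One page of results: (parsed items, response headers).
Page = Tuple[List[dict], Dict[str, str]]


def iter_search_after(
    fetch_page: Callable[[Dict[str, str]], Page], limit: int
) -> Iterator[dict]:
    """Yield up to `limit` items, following the CMR-Search-After header."""
    request_headers: Dict[str, str] = {}
    yielded = 0
    while yielded < limit:
        items, response_headers = fetch_page(request_headers)
        if not items:
            return  # an empty page means the result set is exhausted
        for item in items:
            if yielded == limit:
                return
            yield item
            yielded += 1
        # Assumes header names arrive with exactly this casing; a
        # case-insensitive mapping would not need that assumption.
        token = response_headers.get("CMR-Search-After")
        if token is None:
            return  # no token returned, so there is nothing more to request
        request_headers = {"CMR-Search-After": token}
```

Because the function is a generator, a caller that stops early never requests the remaining pages.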


doug-newman-nasa · 2024-03-05T03:25:02Z

I plan on fixing this in the same way I did with nasa/python_cmr.

betolink · 2024-03-05T04:38:00Z

Hi Doug! thanks for reporting this. We do use search-after for granule search since those results will be in the thousands/millions.

earthaccess/earthaccess/search.py

Line 645 in 8fe6097

if "CMR-Search-After" in response.headers:

That being said, it would be good practice to also use it for collections. I think that's the line you're referring to.

mfisher87 · 2024-03-05T12:24:04Z

Great catch! Luis resolved this long ago for granules (#145), and collection search didn't even cross my mind at the time 😆 It's just not something I normally do I guess.

doug-newman-nasa · 2024-03-05T21:49:42Z

Yeah, I got thrown because you have two classes in the same file! I would suggest factoring out the granule code to be used in the dataset code too since it should be the same except for marshaling the results.
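
One possible shape for that refactor, as a sketch only: a shared base class owns the search-after loop and each subclass supplies just the marshaling step. All names here (`BaseQuery`, `SketchCollections`, `SketchGranules`, `fetch_page`, `marshal`) are hypothetical and are not the classes in earthaccess/search.py. `fetch_page` is assumed to perform one request and return the parsed items plus the response headers.

```python
from typing import Any, Callable, Dict, List, Tuple

# One page of results: (parsed items, response headers).
Page = Tuple[List[dict], Dict[str, str]]


class BaseQuery:
    """Hypothetical shared base: owns paging, delegates marshaling."""

    def __init__(self, fetch_page: Callable[[Dict[str, str]], Page]) -> None:
        self._fetch_page = fetch_page

    def marshal(self, item: dict) -> Any:
        """Convert one raw item into a result object (subclass hook)."""
        raise NotImplementedError

    def get(self, limit: int) -> List[Any]:
        """Return up to `limit` marshaled results using search-after."""
        results: List[Any] = []
        headers: Dict[str, str] = {}
        while len(results) < limit:
            items, response_headers = self._fetch_page(headers)
            results.extend(self.marshal(item) for item in items)
            token = response_headers.get("CMR-Search-After")
            if not items or token is None:
                break  # nothing further to request
            headers = {"CMR-Search-After": token}
        return results[:limit]


class SketchCollections(BaseQuery):
    def marshal(self, item: dict) -> dict:
        # Placeholder marshaling; a real class would build a collection object.
        return {"kind": "collection", **item}


class SketchGranules(BaseQuery):
    def marshal(self, item: dict) -> dict:
        # Placeholder marshaling; a real class would build a granule object.
        return {"kind": "granule", **item}
```

With this layout the paging logic exists once, so a fix such as switching to search-after applies to both kinds of search.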

doug-newman-nasa · 2024-03-05T23:59:35Z

Looking more closely at the granule get method, is Search-After only used if the format of the requested data is umm_json?

earthaccess/earthaccess/search.py

Line 642 in 8fe6097

elif self._format == "umm_json":

mfisher87 · 2024-03-06T00:23:54Z

👀 Not sure why that's like that! The whole conditional is a bit confusing to me. I think we badly need a refactor on that piece of core code 😁

mfisher87 · 2024-03-06T00:29:18Z

@doug-newman-nasa would you be interested in attending one of our bi-weekly hackathons?

betolink · 2024-03-06T00:30:32Z

Yeah, this is mainly because earthaccess only has parsers for umm_json. Initially I wanted to write parsers for json, iso19115 and echo10, but I've always run into irregularities with the metadata, and umm_json has been the most predictable flavor of the schemas.

doug-newman-nasa · 2024-03-06T00:35:50Z

So, the reason I'm 'looking' here is whenever I find myself in a repo for a python wrapper for CMR I immediately check they aren't deep-paging. You aren't where it counts (granule) but are in other places (collection). But it pointed out some redundancy in your two classes. I'd like to tackle some refactoring while addressing the main issue. Discussing that at a bi-weekly hackathon would be extremely useful (at least to me).

mfisher87 · 2024-03-06T00:38:30Z

@betolink Do you think the next step is to remove the temporary (?) code for the other parsers? Or start considering writing parsers for the other available CMR formats?

... it pointed out some redundancy in your two classes. I'd like to tackle some refactoring while addressing the main issue. Discussing that at a bi-weekly hackathon would be extremely useful (at least to me).

🙇 This will be so immensely appreciated! It's hard to get refactoring on core code like that done because we have many high-demand features and bugs splitting our attention! That doesn't make the refactors any less important, though. Looking forward to seeing you in two weeks!

betolink · 2024-03-06T02:55:43Z

@mfisher87 I'm leaning towards just supporting umm_json for now; when the time comes we can map CMR responses to the proper results parsers. This brings me to 2 related topics:

1. We use umm_json because of its completeness and consistency, but could the same be achieved by GraphQL queries that in theory are faster?
2. If we refactor the way we handle responses, it would be interesting to follow what pystac-client does: the results are wrapped into a data structure that contains stac items, and we could do something similar. This could bring UX improvements to the way we present the data.

doug-newman-nasa · 2024-03-06T03:01:17Z

GraphQL queries that in theory are faster?

GraphQL might be faster if you are doing multiple queries. The primary utility of GraphQL is reducing the number of queries made by the client, but there are still potentially 'multiple queries' going on behind GraphQL. If you are getting everything you need from umm_json then GraphQL isn't going to speed things up. It's still taking your query and querying the CMR API. But it might have utility elsewhere.

the results are wrapped into a data structure that contains stac items

If you want stac items why not use the CMR STAC API?

betolink · 2024-03-06T03:14:32Z

We do not necessarily need the stac items as such, just the way the response is handled. eg.

results = earthaccess.search_data(**params)

Right now, results is a list of parsed umm_json items. Each item in this list is an enhanced Python dictionary with some convenient custom methods like data_links() etc.

I'm thinking of refactoring this so that results is wrapped in a class that contains the same umm_json items, but the handy methods can operate on the entire list. The semantics are a bit different, but that will save users all those for loops to collect links from the results or filter granules by particular criteria, etc. Maybe this is not as urgent, I just thought it could be interesting to explore! This is the class in pystac-client: https://pystac-client.readthedocs.io/en/latest/api.html#pystac_client.ItemSearch

mfisher87 · 2024-03-06T13:51:07Z

Just to check if we're on the same page, you're thinking a search would return a Results object containing Granule objects and having some special methods that can make operations on the whole list easier? I've been thinking this might be useful as well. E.g. results.get_links()?
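
A rough sketch of what such a wrapper could look like, purely as an assumption about the design being discussed. `Granule`, `Results`, `get_links()` and `filter()` are hypothetical names, and the `RelatedUrls` lookup assumes UMM-G style entries with `URL` and `Type` keys; the real earthaccess items may be shaped differently.

```python
from typing import Callable, Iterable, Iterator, List


class Granule(dict):
    """Stand-in for a parsed umm_json item (a dict with helper methods)."""

    def data_links(self) -> List[str]:
        # Assumes UMM-G style RelatedUrls entries; real items may differ.
        related = self.get("umm", {}).get("RelatedUrls", [])
        return [r["URL"] for r in related if r.get("Type") == "GET DATA"]


class Results:
    """Hypothetical wrapper whose methods act on every granule at once."""

    def __init__(self, granules: Iterable[Granule]) -> None:
        self._granules = list(granules)

    def __len__(self) -> int:
        return len(self._granules)

    def __iter__(self) -> Iterator[Granule]:
        return iter(self._granules)

    def get_links(self) -> List[str]:
        """Collect data links from all granules in one call."""
        return [link for g in self._granules for link in g.data_links()]

    def filter(self, keep: Callable[[Granule], bool]) -> "Results":
        """Return a new Results holding only granules where keep(g) is true."""
        return Results(g for g in self._granules if keep(g))
```

Since `Results` is still iterable and has a length, code that loops over the current list of items would keep working.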

betolink · 2024-03-06T15:17:36Z

@mfisher87 correct! We could also implement pagination like in the STAC search results, e.g. results.next_page(), if we don't want to load all the results in one go.

mfisher87 · 2024-03-06T16:10:58Z

I love that. Or we could promote a generator usage pattern? next(results)? Or to get them all list(results)!
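
The two ideas can coexist. The sketch below is one hypothetical way to support `next(results)`, `list(results)` and `results.next_page()` on the same object. `LazyResults` is an invented name, and `pages` is assumed to be any iterable that produces one list of items per request, for example a generator that follows the search-after header.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


class LazyResults:
    """Hypothetical lazy result set: pages are pulled only when needed."""

    def __init__(self, pages: Iterable[List[dict]]) -> None:
        self._pages: Iterator[List[dict]] = iter(pages)
        self._buffer: Deque[dict] = deque()

    def __iter__(self) -> "LazyResults":
        return self

    def __next__(self) -> dict:
        # Refill from the next page; StopIteration from the page iterator
        # propagates and ends iteration once every page is consumed.
        while not self._buffer:
            self._buffer.extend(next(self._pages))
        return self._buffer.popleft()

    def next_page(self) -> List[dict]:
        """Return the rest of the current page, or the next page if none is buffered."""
        if self._buffer:
            page = list(self._buffer)
            self._buffer.clear()
            return page
        return next(self._pages, [])
```

A caller who wants everything can still write `list(results)`, while a caller who only needs the first few items never triggers the later requests.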

doug-newman-nasa · 2024-03-14T21:12:17Z

Started work on this. What I want to do is add VCR to your testing. I think I only see one test that actually queries CMR: test_data_links. So there is nothing that really tests collection search url construction or results parsing at the collection level. Using VCR we can test this and remove the need to hit CMR each time the test is run. You are also relying on the result contents returned by test_data_links not changing in the future. Thoughts?
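
To make the record-and-replay idea concrete, here is a deliberately simplified, stdlib-only illustration. It is not vcrpy's API: vcrpy records at the HTTP layer and stores interactions in cassette files, whereas this sketch wraps a hypothetical `fetch(url)` function and keeps responses in a JSON file. The names `replayable`, `fetch` and `cassette` are invented for the example.

```python
import json
from pathlib import Path
from typing import Callable


def replayable(
    fetch: Callable[[str], dict], cassette: Path
) -> Callable[[str], dict]:
    """Wrap `fetch` so each URL is requested once and replayed afterwards."""
    # Load earlier recordings if the cassette file already exists.
    recorded = json.loads(cassette.read_text()) if cassette.exists() else {}

    def wrapper(url: str) -> dict:
        if url not in recorded:
            recorded[url] = fetch(url)  # first run: call the real service
            cassette.write_text(json.dumps(recorded, indent=2))
        return recorded[url]  # later runs: served from the recording

    return wrapper
```

Once the recording is committed alongside the tests, the assertions run against fixed content, so they neither hit CMR nor break when the live results change.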

mfisher87 · 2024-03-15T00:10:59Z

TIL VCR! I'm having one of those "how have I not heard of this?" moments ;) That's a really cool idea. I'm all for this! Would you be using the betamax implementation?

mfisher87 · 2024-03-15T00:27:40Z

cc @danielfromearth this may be helpful for unit tests for #426

doug-newman-nasa · 2024-03-15T11:51:58Z

I'm using https://vcrpy.readthedocs.io/en/latest
I should have a PR ready today.

doug-newman-nasa · 2024-03-15T14:20:48Z

PR submitted: #494

mfisher87 · 2024-03-15T15:45:47Z

I can't stress enough how much I feel this has improved our tests and will improve our testing practices going forward. Thank you! 💯

chuckwondo · 2024-05-09T15:28:59Z

Fixed by #494

github-project-automation bot added this to earthaccess project Mar 5, 2024

github-project-automation bot moved this to 🆕 New in earthaccess project Mar 5, 2024

doug-newman-nasa changed the title ~~Search queries appear to be using paging as opposed to~~ Search queries appear to be using paging as opposed to search-after header Mar 5, 2024

mfisher87 changed the title ~~Search queries appear to be using paging as opposed to search-after header~~ Collection search uses paging as opposed to search-after header Mar 5, 2024

mfisher87 added the type: bug Something isn't working label Mar 5, 2024

chuckwondo closed this as completed May 9, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in earthaccess project May 9, 2024
