-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collection search uses paging as opposed to search-after header #483
Comments
I plan on fixing this in the same way I did with nasa/python_cmr. |
Hi Doug! thanks for reporting this. We do use earthaccess/earthaccess/search.py Line 645 in 8fe6097
That being said, it would be good practice to also use it for collections, I think that's the line you're referring to. |
Great catch! Luis resolved this long ago for granules (#145), and collection search didn't even cross my mind at the time 😆 It's just not something I normally do I guess. |
Yeah, I got thrown because you have two classes in the same file! I would suggest factoring out the granule code to be used in the dataset code too since it should be the same except for marshaling the results. |
Looking more closely at the granule get method, earthaccess/earthaccess/search.py Line 642 in 8fe6097
|
👀 Not sure why that's like that! The whole conditional is a bit confusing to me. I think we badly need a refactor on that piece of core code 😁 |
@doug-newman-nasa would you be interested in attending one of our bi-weekly hackathons? |
Yeah, this is mainly because earthaccess has only parsers for |
So, the reason I'm 'looking' here is whenever I find myself in a repo for a python wrapper for CMR I immediately check they aren't deep-paging. You aren't where it counts (granule) but are in other places (collection). But it pointed out some redundancy in your two classes. I'd like to tackle some refactoring while addressing the main issue. Discussing that at a bi-weekly hackathon would be extremely useful (at least to me). |
@betolink Do you think the next step is to remove the temporary (?) code for the other parsers? Or start considering writing parsers for the other available CMR formats?
🙇 This will be so immensely appreciated! It's hard to get refactoring on core code like that done because we have many high-demand features and bugs splitting our attention! That doesn't make the refactors any less important, though. Looking forward to seeing you in two weeks! |
@mfisher87 I'm leaning towards just supporting
|
GraphQL might be faster if you are doing multiple queries. The primary utility of GraphQL is reducing the number of queries by the client but there is still potentially 'multiple queries' going on behind GraphQL. If you are getting everything you need from umm_json then GraphQL isn't going to speed things up. It's still taking your query and querying the CMR API. But it might have utility elsewhere.
If you want stac items why not use the CMR STAC API? |
We do not necessarily need the stac items as such, just the way the response is handled. eg. results = earthaccess.search_data(**params) Right now, I'm thinking of refactoring this in a way that results will be a wrapped on a class that contains the same |
Just to check if we're on the same page, you're thinking a search would return a |
@mfisher87 correct! we could also implement pagination like in the stac search results |
I love that. Or we could promote a generator usage pattern? |
Started work on this. What I want to do is add VCR to your testing. I think I only see one test that actually queries CMR: test_data_links. So there is nothing that really tests collection search url construction or results parsing at the collection level. Using VCR we can test this and remove the need to hit CMR each time the test is run. You are also relying on the result contents returned by test_data_links not changing in the future. Thoughts? |
cc @danielfromearth this may be helpful for unit tests for #426 |
I'm using https://vcrpy.readthedocs.io/en/latest |
PR submiited: #494 |
I can't stress enough how much I feel this has improved our tests and will improve our testing practices going forward. Thank you! 💯 |
Fixed by #494 |
The following code uses page_size and page number to iterate through results:
https://github.com/nsidc/earthaccess/blob/8fe60974ce0f6e5d6f8fbec679afb96f12f1506f/earthaccess/search.py#L282C13-L282C64
Limit could be set to a value sufficiently high to cause CMR problems. Search-After is used to combat this.
Documentation on search-after: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#search-after
See similar fix to another CMR python library here: nasa/python_cmr@1702100
The text was updated successfully, but these errors were encountered: