
Link fetcher needs to handle change to OpenSearch limit/offset pagination #45

Closed

ceholden opened this issue Nov 21, 2024 · 2 comments


ceholden commented Nov 21, 2024

Background

On November 12th, the ESA OpenSearch Catalogue API introduced a change to how search pagination works that impacts our "link fetcher" scheduled search:
https://documentation.dataspace.copernicus.eu/APIs/Others/UpcomingChanges.html#catalogue-api-change-parameters-limits

Specifically, for the OpenSearch endpoint we use:

OpenSearch interface: maximum value for ‘(page - 1) * maxRecords + index - 1’ will be set to 10 000, where by default maxRecords = 20, page = 1 and index = 1; maximum value for ‘index’ will be set to 10001

For example, this search query (which requests index=10002, so ‘(page - 1) * maxRecords + index - 1’ = 10001 > 10 000) reproduces the issue:

https://catalogue.dataspace.copernicus.eu/resto/api/collections/Sentinel2/search.json?processingLevel=S2MSI1C&publishedAfter=2024-11-09T00%3A00%3A00Z&publishedBefore=2024-11-09T23%3A59%3A59Z&startDate=2024-10-10T00%3A00%3A00Z&sortParam=published&sortOrder=desc&maxRecords=100&index=10002&exactCount=1
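For reference, here is a minimal reproduction sketch using Python and the requests library. This is not part of the link fetcher codebase; the parameters are copied from the URL above.

```python
import requests

# Hypothetical reproduction of the failing search request; parameters are
# copied from the example URL above.
SEARCH_URL = (
    "https://catalogue.dataspace.copernicus.eu/resto/api/collections/"
    "Sentinel2/search.json"
)
params = {
    "processingLevel": "S2MSI1C",
    "publishedAfter": "2024-11-09T00:00:00Z",
    "publishedBefore": "2024-11-09T23:59:59Z",
    "startDate": "2024-10-10T00:00:00Z",
    "sortParam": "published",
    "sortOrder": "desc",
    "maxRecords": 100,
    "index": 10002,  # past the new limit on the starting index
    "exactCount": 1,
}

response = requests.get(SEARCH_URL, params=params)
print(response.status_code)  # expect 400
print(response.text)  # expect a message like "Input should be less than or equal to 10001"
```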

Impact

The impact on our system is that our "link fetcher" application encounters an error on the last few batches of link fetching. For example, recent executions have found a total of ~10,500 results. Our link fetcher pulls a maximum of 100 results at a time, so we end up requesting index=10001 and receive a 400 "Bad Request" error: Input should be less than or equal to 10001.

The consequence is that we do not fetch the links for the last few hundred result items, potentially missing granules.

Resolution

There are at least a few approaches we could take to mitigate this issue. In the longer term, we should be able to avoid it entirely by switching to the "granule created" Subscriptions API (see PR for implementation).

In the meantime, there are at least 2 styles of approach we could take:

  1. Refine our search to prevent >10,000 search results
    • ESA provides an endpoint that describes the search query parameters we could include, https://catalogue.dataspace.copernicus.eu/resto/api/collections/Sentinel2/describe.xml
    • A relatively straightforward way to do this would be to include platform=[S2A | S2B | S2C] in our query
    • Pro:
      • Splitting the query by platform is relatively trivial and would "just work" once Sentinel-2C begins regular processing operations.
    • Con:
      • This approach requires relatively more work in the form of additional link fetching orchestration (e.g., one query per platform)
  2. When necessary, grow our maxRecords=[int] to encompass all remaining search results (see the sketch after this list)
    • e.g., if we have 10,500 total results, our link fetcher would expand maxRecords once we approach 10,000 so that the final search request covers the remaining (totalResults - currentIndex) records and finishes reading all results
    • The limit on the maxRecords parameter appears to be 2,000, based on the 400 "Bad Request" we get when trying to grow this parameter further (Input should be less than or equal to 2000.)
    • Pro:
      • This would require the least amount of change to our current setup
    • Con:
      • This is pretty fragile, as it relies on the assumption that the total number of search results will stay below 12,000. For example, this would probably break if Sentinel-2A, -2B, and -2C are all producing granules at the same time.
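Here is a rough sketch of what option 2 could look like. This is illustrative only, under the assumptions that starting index values up to 10001 are accepted (per the ESA announcement) and that maxRecords can grow up to 2,000; the names (plan_requests, PAGE_SIZE, etc.) are hypothetical and not from our codebase.

```python
# Illustrative sketch of option 2 (not our actual link fetcher code).
MAX_INDEX = 10_001         # assumption: starting index values up to 10001 are accepted
MAX_RECORDS_LIMIT = 2_000  # per the 400 error noted above
PAGE_SIZE = 100            # our current batch size

def plan_requests(total_results: int) -> list[tuple[int, int]]:
    """Return (index, maxRecords) pairs that cover total_results without
    requesting an index past MAX_INDEX, growing the final page if needed."""
    plan = []
    index = 1
    while index <= total_results:
        remaining = total_results - index + 1
        if index + PAGE_SIZE > MAX_INDEX:
            # The next normal-sized batch would need an index past the limit,
            # so grow this request to cover everything that is left.
            size = min(remaining, MAX_RECORDS_LIMIT)
        else:
            size = min(remaining, PAGE_SIZE)
        plan.append((index, size))
        index += size
    return plan

# With ~10,500 total results: batches of 100 up to index 9,901, then one
# final request at index 10,001 covering the remaining 500 records.
print(plan_requests(10_500)[-3:])  # [(9801, 100), (9901, 100), (10001, 500)]
```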

@sharkinsspatial and @chuckwondo might have other suggestions for ways to fix this!

Acceptance Criteria

  • We resolve or at least mitigate the current issue with our link fetcher
  • We re-run any failed link fetching StepFunction invocations to ensure we catch anything we missed while this error was occurring
ceholden self-assigned this Nov 21, 2024
@chuckwondo (Collaborator)

I suggest option 2, but even simpler: just bump maxRecords from 100 to 2000 and be done with it -- no need to add logic to "grow" the value at the end.

Given that this adjustment is a stop-gap measure until we flip the switch to the new subscription-based solution, this should hopefully be the only thing we need to do until then.

If for some reason we bump up against the 12K limit before we make the switch, we can revisit this at that time. For the moment, I don't think the extra effort for a more complicated solution is necessary.
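For context on the 12K figure, a rough back-of-the-envelope check (constants taken from the limits discussed above, assuming a starting index of 10001 is still accepted):

```python
# With maxRecords=2000, page starts fall at index 1, 2001, ..., 10001.
# The last allowed starting index is 10001, so the most results a simple
# fixed-size pagination can cover is:
MAX_INDEX = 10_001
MAX_RECORDS = 2_000
print((MAX_INDEX - 1) + MAX_RECORDS)  # 12000
```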

@ceholden (Collaborator, Author)

Thanks @chuckwondo! I wasn't sure if there was a reason to keep the current request limit so low (100), so I was inclined to keep it, but I don't see a technical reason why we couldn't use the maximum allowed limit (2000). The query takes a bit longer (~10 sec vs ~2 sec), but I don't think our Lambda function would ever time out, because ~10 seconds is well within the 60 second "bail early" threshold. With the higher limit we'd also send fewer requests to ESA, which should be less stressful for their system: limit/offset pagination requires the database to read and discard results, so that operation would happen fewer times.

I'll have a PR up shortly; I might need to update the integration tests, but I've already updated all the unit tests.
