
Link fetcher needs to handle change to OpenSearch limit/offset pagination #45

Closed

ceholden opened this issue Nov 21, 2024 · 2 comments


ceholden commented Nov 21, 2024

Background

On November 12th, the ESA OpenSearch Catalogue API introduced a change to how search pagination works that impacts our "link fetcher" scheduled search:
https://documentation.dataspace.copernicus.eu/APIs/Others/UpcomingChanges.html#catalogue-api-change-parameters-limits

Specifically, for the OpenSearch endpoint we use:

OpenSearch interface: maximum value for ‘(page - 1) * maxRecords + index - 1’ will be set to 10 000, where by default maxRecords = 20, page = 1 and index = 1; maximum value for ‘index’ will be set to 10001

For example, this search query (which requests index=10002, so ‘(page - 1) * maxRecords + index - 1’ = 10001 > 10 000) reproduces the issue:

https://catalogue.dataspace.copernicus.eu/resto/api/collections/Sentinel2/search.json?processingLevel=S2MSI1C&publishedAfter=2024-11-09T00%3A00%3A00Z&publishedBefore=2024-11-09T23%3A59%3A59Z&startDate=2024-10-10T00%3A00%3A00Z&sortParam=published&sortOrder=desc&maxRecords=100&index=10002&exactCount=1
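For reference, here is a minimal reproduction sketch using Python and the requests library. This is not part of the link fetcher codebase; the parameters are copied from the URL above.

```python
import requests

# Hypothetical reproduction of the failing search request; parameters are
# copied from the example URL above.
SEARCH_URL = (
    "https://catalogue.dataspace.copernicus.eu/resto/api/collections/"
    "Sentinel2/search.json"
)
params = {
    "processingLevel": "S2MSI1C",
    "publishedAfter": "2024-11-09T00:00:00Z",
    "publishedBefore": "2024-11-09T23:59:59Z",
    "startDate": "2024-10-10T00:00:00Z",
    "sortParam": "published",
    "sortOrder": "desc",
    "maxRecords": 100,
    "index": 10002,  # past the new limit on the starting index
    "exactCount": 1,
}

response = requests.get(SEARCH_URL, params=params)
print(response.status_code)  # expect 400
print(response.text)  # expect a message like "Input should be less than or equal to 10001"
```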

Impact

The impact on our system is that our "link fetcher" application encounters an error on the last few batches of link fetching. For example, recent executions have found a total of ~10,500 results. Our link fetcher pulls a maximum of 100 results at a time, so we end up requesting index=10001 and receive a 400 "Bad Request" error: Input should be less than or equal to 10001.

The consequence is that we do not fetch the links for the last few hundred result items, potentially missing granules.

Resolution

There are at least a few approaches we could take to mitigate this issue. In the longer term, we should be able to avoid it entirely by switching to the "granule created" Subscriptions API (see PR for implementation).

In the meantime, there are at least 2 styles of approach we could take:

  1. Refine our search to prevent >10,000 search results
    • ESA provides an endpoint that describes the search query parameters we could include, https://catalogue.dataspace.copernicus.eu/resto/api/collections/Sentinel2/describe.xml
    • A relatively straightforward way to do this would be to include platform=[S2A | S2B | S2C] in our query
    • Pro:
      • Splitting the query by platform is relatively trivial and would "just work" once Sentinel-2C begins regular processing operations.
    • Con:
      • This approach requires relatively more work in the form of additional link fetching orchestration (e.g., one query per platform)
  2. When necessary, grow our maxRecords=[int] to encompass all remaining search results (see the sketch after this list)
    • e.g., if we have 10,500 total results, our link fetcher would expand maxRecords once we approach 10,000 so that the final search request covers the remaining (totalResults - currentIndex) records and finishes reading all results
    • The limit on the maxRecords parameter appears to be 2,000, based on the 400 "Bad Request" we get when trying to grow this parameter further (Input should be less than or equal to 2000.)
    • Pro:
      • This would require the least amount of change to our current setup
    • Con:
      • This is pretty fragile, as it relies on the assumption that the total number of search results will stay below 12,000. For example, this would probably break if Sentinel-2A, -2B, and -2C are all producing granules at the same time.
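Here is a rough sketch of what option 2 could look like. This is illustrative only, under the assumptions that starting index values up to 10001 are accepted (per the ESA announcement) and that maxRecords can grow up to 2,000; the names (plan_requests, PAGE_SIZE, etc.) are hypothetical and not from our codebase.

```python
# Illustrative sketch of option 2 (not our actual link fetcher code).
MAX_INDEX = 10_001         # assumption: starting index values up to 10001 are accepted
MAX_RECORDS_LIMIT = 2_000  # per the 400 error noted above
PAGE_SIZE = 100            # our current batch size

def plan_requests(total_results: int) -> list[tuple[int, int]]:
    """Return (index, maxRecords) pairs that cover total_results without
    requesting an index past MAX_INDEX, growing the final page if needed."""
    plan = []
    index = 1
    while index <= total_results:
        remaining = total_results - index + 1
        if index + PAGE_SIZE > MAX_INDEX:
            # The next normal-sized batch would need an index past the limit,
            # so grow this request to cover everything that is left.
            size = min(remaining, MAX_RECORDS_LIMIT)
        else:
            size = min(remaining, PAGE_SIZE)
        plan.append((index, size))
        index += size
    return plan

# With ~10,500 total results: batches of 100 up to index 9,901, then one
# final request at index 10,001 covering the remaining 500 records.
print(plan_requests(10_500)[-3:])  # [(9801, 100), (9901, 100), (10001, 500)]
```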

@sharkinsspatial and @chuckwondo might have other suggestions for ways to fix this!

Acceptance Criteria

  • We resolve or at least mitigate the current issue with our link fetcher
  • We re-run any failed link fetching StepFunction invocations to ensure we catch anything we missed while this error was occurring
ceholden self-assigned this Nov 21, 2024
@chuckwondo (Collaborator)

I suggest option 2, but even simpler: just bump maxRecords from 100 to 2000 and be done with it -- no need to add logic to "grow" the value at the end.

Given that this adjustment is a stop-gap measure until we flip the switch to the new subscription-based solution, this should hopefully be the only thing we need to do until then.

If for some reason we bump up against the 12K limit before we make the switch, we can revisit this at that time. For the moment, I don't think the extra effort for a more complicated solution is necessary.
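For context on the 12K figure, a rough back-of-the-envelope check (constants taken from the limits discussed above, assuming a starting index of 10001 is still accepted):

```python
# With maxRecords=2000, page starts fall at index 1, 2001, ..., 10001.
# The last allowed starting index is 10001, so the most results a simple
# fixed-size pagination can cover is:
MAX_INDEX = 10_001
MAX_RECORDS = 2_000
print((MAX_INDEX - 1) + MAX_RECORDS)  # 12000
```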

@ceholden (Collaborator, Author)

Thanks @chuckwondo! I wasn't sure if there was a reason to keep the current request limit so low (100), so I was inclined to keep it, but I don't see a technical reason why we couldn't use the maximum allowed limit (2000). The query takes a bit longer (~10 sec vs ~2 sec), but I don't think our Lambda function would ever time out, because ~10 seconds is well within the 60 second "bail early" threshold. With the higher limit we'd also send fewer requests to ESA, which should be less stressful for their system: limit/offset pagination requires the database to read and discard results, so that operation would happen fewer times.

I'll have a PR up shortly; I might need to update the integration tests, but I've already updated all the unit tests.
