Add RPDE tracing to help debug common errors in RPDE implementation #665

lukehesluke · 2024-03-21T15:16:14Z

This issue relates to #476, so please read the description for that issue first. That issue should be completed before this one: it tells the user what problem they need to resolve, whereas this issue gives the user information that may help them resolve it.

So, with #476 complete, the Test Suite user now knows that a particular timeout has happened because either:

1. Their Test Interface Create Opportunity Endpoint has failed to put the Opportunity in the RPDE feed.
2. An Opportunity was not correctly updated in the Opportunity RPDE feed with an expected capacity change (e.g. capacity would be expected to increase after a successful cancellation).
3. An Order or OrderProposal was not correctly updated in either of their respective RPDE feeds after some action is invoked.

(these points correspond with the scenarios described in #476)

From working with many different integrations, I can confirm that a very common issue that comes up is a broken implementation of RPDE. Here are some very common examples of RPDE implementation issues which can cause this kind of timeout error in tests:

1. Opportunities/Orders/OrderProposals are correctly inserted into the feed but, when they are updated, they are not pushed to the end of the feed.
2. The paging itself is broken, such that Broker Microservice misses some items even though it follows the next URL of each RPDE page (as the RPDE spec requires).
3. With scenario 2 above (the expected capacity change), the issue could simply be that the capacity was not updated to the correct number.
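To illustrate the first kind of issue, here is a minimal sketch of a check that could be run over traced pages. `findOrderingViolations` is a hypothetical helper, not something that exists in Broker today, and it assumes that `modified` is numeric and that pages are passed in the order in which they were fetched. It reports any item whose `modified` is lower than that of the item before it, which is what one would tend to see when an updated item is left in its old position rather than moved to the end of the feed.

```javascript
/**
 * Hypothetical helper: scan RPDE pages (in fetch order) and report every item
 * whose `modified` is lower than that of the item immediately before it.
 * Assumes `modified` is numeric (or a numeric string) in this feed.
 */
function findOrderingViolations(pages) {
  const violations = [];
  let previous = null;
  pages.forEach((page, pageIndex) => {
    for (const item of page.items) {
      if (previous !== null && Number(item.modified) < Number(previous.modified)) {
        violations.push({
          pageIndex,
          id: item.id,
          modified: item.modified,
          previousModified: previous.modified,
        });
      }
      previous = item;
    }
  });
  return violations;
}
```

An empty result does not prove the feed is correct (this sketch does not look at paging or IDs at all), but a non-empty one points at a specific item and page to inspect.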

In any of these cases, the user would have a profoundly better ability to diagnose what happened if they could see the progress that Broker Microservice made through their RPDE feed.

High Level Solution Design

When an RPDE issue occurs, of the kind that #476 should more properly clarify to the user as an RPDE issue, the user should be shown some information which informs them:

What RPDE pages Broker Microservice visited in the appropriate RPDE feed (e.g. the Opportunities RPDE feed when there is an issue with an Assert Opportunity Capacity stage) since it started listening for the RPDE feed update.
Some of the contents of those RPDE pages that it visited. Obviously more content is more useful, but there may be resource-based constraints here. Some key contents that should be included for each page: the next URL; the ID and modified of each item in the page; the fully expanded contents of any items that have the same ID as the one that is being listened for (this is essential for the Assert Opportunity Capacity scenario, and would clearly demonstrate that the Opportunity did update, but to the wrong capacity value).
From Broker Microservice's cache, its most up-to-date version of the Opportunity/Order/OrderProposal in question. If the item has never been seen by Broker, then this will be made clear.
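As a rough sketch of the per-page "key contents" described above: the function below reduces a fetched RPDE page to its next URL, the ID and modified of each item, and the full contents of any item whose ID matches the one being listened for. The name `summarisePage` and the shape of its return value are assumptions made for illustration only.

```javascript
/**
 * Hypothetical helper: reduce an RPDE page to the fields that matter for tracing.
 * `listenedId` is the ID of the item that the listener is waiting for.
 */
function summarisePage(page, listenedId) {
  return {
    next: page.next,
    // Only the ID and modified of every item, to keep the trace small
    items: page.items.map(({ id, modified }) => ({ id, modified })),
    // Fully expanded contents of any item matching the listened-for ID
    matches: page.items.filter((item) => item.id === listenedId),
  };
}
```

For the Assert Opportunity Capacity scenario, `matches` is the part that would show the Opportunity arriving with an unexpected capacity value.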

Starter-for-10 Solution Proposal

Glossary:

Broker-RPDE-Listen-Op (BRLO): Every time Broker Microservice is instructed to listen out for changes to an Order/OrderProposal/Opportunity in its respective RPDE feed. There are a few different API endpoints that Broker exposes for this. To find all of the different ways in which BRLOs happen, look to the Broker API endpoints that get called by each of the following FlowStages: packages/openactive-integration-tests/test/helpers/flow-stages/fetch-opportunities.js, packages/openactive-integration-tests/test/helpers/flow-stages/order-feed-update.js and packages/openactive-integration-tests/test/helpers/flow-stages/assert-opportunity-capacity.js.

The Proposal:

More information is always better, so what if Broker Microservice just stored a copy of every RPDE page that it fetched while performing a BRLO?
- Specifically, Broker Microservice would simply save every fetched RPDE page to a new file. It could then associate each BRLO with both a first page (which could be either the 1st page fetched after a BRLO is initialized or the last page fetched before the BRLO is initialized) and a last page, which would be set once the BRLO has completed (i.e. the item has been found or a timeout has occurred).
- As long as these files are stored sequentially, a first page and last page should be sufficient. A user can start at the first page and check out successive pages until the last page (for the BRLO in question) is reached
The test output should include, for every stage which requires setting up a BRLO, a link to the first and last pages (in the local filesystem) of the respective RPDE feed. So, this may look like: (first page) ../../openactive-broker-microservice/output/rpde-pages/orders/primary/18.json; (last page) ../../openactive-broker-microservice/output/rpde-pages/orders/primary/39.json. It would also need some way of showing the previous state of the item that was being searched for, in a way that doesn't clutter the results (e.g. maybe it's a hidden section that can be toggled to visible by clicking something).
- A stretch goal or perhaps a goal for a subsequent issue might be to create a web page which simplifies the process, for a user, of looking through a given set of pages. This could be a route in Broker, which could be accessed like http://localhost:3000/rpde-feed-viewer?feedType=orders&auth=primary&firstPage=18&lastPage=39, which just renders a page at a time and provides "next" and "back" buttons for flicking through.
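The first-page/last-page bookkeeping in the proposal could be sketched roughly as follows. Everything here is hypothetical: the class `RpdePageTrace`, its method names, and the use of an in-memory array as a stand-in for the sequentially numbered files (`0.json`, `1.json`, and so on). This sketch takes the "first page fetched after the BRLO is initialized" option for the first page; the other option mentioned above would work equally well.

```javascript
/**
 * Hypothetical sketch of per-feed page tracing.
 * `this.pages` stands in for sequentially numbered files on disk.
 */
class RpdePageTrace {
  constructor() {
    this.pages = [];
    this.brlos = new Map();
  }

  /** Store a fetched page and return its sequence number. */
  recordPage(page) {
    this.pages.push(page);
    return this.pages.length - 1;
  }

  /** Mark the start of a BRLO: its first page is the next page to be fetched. */
  startBrlo(brloId) {
    this.brlos.set(brloId, { firstPage: this.pages.length, lastPage: null });
  }

  /** Mark the end of a BRLO (item found or timeout): its last page is the latest one fetched. */
  endBrlo(brloId) {
    const brlo = this.brlos.get(brloId);
    brlo.lastPage = this.pages.length - 1;
    return brlo;
  }

  /** All pages fetched while the BRLO was active (empty if none were fetched). */
  pagesFor(brloId) {
    const { firstPage, lastPage } = this.brlos.get(brloId);
    return lastPage === null ? [] : this.pages.slice(firstPage, lastPage + 1);
  }
}
```

Because pages are stored sequentially, the two numbers returned by `endBrlo` are enough to build the first/last page links for the test output.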

Things to look out for:

This proposal will create a LOT of files, as RPDE feeds can be large and Broker polls quite aggressively. Creating a huge number of files may take up too much space or hit the inode limit on a Linux system, and would otherwise add a performance overhead to every RPDE page fetch.
- The vast majority of these will just be repeated polls of the (current) last page of the feed, so an obvious optimisation would be to save a new page only if it differs from the last one saved.
To aid in tracing visibility, each cached RPDE page should contain any useful additional info, like the timestamp when the page was fetched and the page's URL.
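A minimal sketch of those two points together, assuming one recorder per feed: a page is only saved when its URL or contents differ from the last saved page, and each saved entry is wrapped with the fetch timestamp and URL. `createPageRecorder` and the field names `fetchedAt`, `url` and `page` are hypothetical, and the `saved` array again stands in for files on disk. Comparing serialized JSON is a simplification; a hash of the response body would serve the same purpose.

```javascript
/**
 * Hypothetical sketch: record RPDE pages for one feed, skipping repeated polls
 * that return exactly the same page as the last one saved.
 * `now` is injectable so that timestamps can be controlled in tests.
 */
function createPageRecorder(now = () => new Date().toISOString()) {
  const saved = [];
  let lastKey = null;
  return {
    saved,
    /** Returns true if the page was saved, false if it was a duplicate poll. */
    record(url, page) {
      const key = `${url}\n${JSON.stringify(page)}`;
      if (key === lastKey) return false;
      lastKey = key;
      // Wrap the page with tracing metadata
      saved.push({ fetchedAt: now(), url, page });
      return true;
    },
  };
}
```

With this approach, repeated polls of an unchanged last page cost one string comparison each and no additional storage.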

Implementing this issue may also help identify nuanced logic errors within the test suite itself (#545), though this is hard to confirm without implementing it.

This issue was spawned from #607

lukehesluke added this to OpenActive Infrastructure Mar 21, 2024

github-project-automation bot moved this to 💡Ideas in OpenActive Infrastructure Mar 21, 2024

lukehesluke changed the title ~~[DRAFT] Add RPDE tracing to test suite to help debug most common error in RPDE implementation~~ Add RPDE tracing to help debug common errors in RPDE implementation Mar 22, 2024

lukehesluke mentioned this issue Mar 22, 2024

[ELABORATION] Add RPDE tracing to test suite to help debug most common error in RPDE implementation #607

Open

lukehesluke mentioned this issue Oct 15, 2024

Allow Broker to ignore irrelevant updates when listening for Order/Opportunities (Listener Item Expectations) #698

Open
