
Improve subquery timing #721

Closed
tokebe opened this issue Sep 6, 2023 · 5 comments
Assignees
Labels
enhancement New feature or request On Test Related changes are deployed to Test server

Comments

@tokebe
Member

tokebe commented Sep 6, 2023

Currently, BTE queues subqueries by organizing them into buckets. Each bucket may only carry 3 queries per API, with no limit on the number of APIs per bucket. Buckets are then executed sequentially, with each subsequent bucket waiting for the previous to complete all contained queries.

This setup has advantages -- namely, it prevents BTE from making too many queries to the same API at once. However, it also substantially limits BTE's subquery throughput. In most cases, a bucket will contain only 3 queries, all to the same API, so only 3 queries are in flight at a time, each bucket waiting on its slowest query until all buckets are complete. Additionally, there's no theoretical limit on the number of outbound queries, provided they're all to different APIs (though, in practice, this isn't an issue).
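The bucket model described above can be sketched roughly as follows (the `SubQuery` shape and function names here are illustrative, not BTE's actual implementation):

```typescript
// Illustrative sketch of the bucket model described above; types and
// names are hypothetical, not BTE's actual code.
interface SubQuery {
  api: string;
  run: () => Promise<unknown>;
}

const MAX_PER_API_PER_BUCKET = 3;

// Greedily assign each subquery to the first bucket holding fewer than
// 3 queries for its API; there is no cap on distinct APIs per bucket.
function groupIntoBuckets(queries: SubQuery[]): SubQuery[][] {
  const buckets: SubQuery[][] = [];
  for (const q of queries) {
    let placed = false;
    for (const bucket of buckets) {
      const countForApi = bucket.filter((b) => b.api === q.api).length;
      if (countForApi < MAX_PER_API_PER_BUCKET) {
        bucket.push(q);
        placed = true;
        break;
      }
    }
    if (!placed) buckets.push([q]);
  }
  return buckets;
}

// Buckets run sequentially: each bucket waits for every query in the
// previous one, so progress is gated by the slowest query per bucket.
async function runBuckets(buckets: SubQuery[][]): Promise<unknown[]> {
  const results: unknown[] = [];
  for (const bucket of buckets) {
    results.push(...(await Promise.all(bucket.map((q) => q.run()))));
  }
  return results;
}
```

Note how 4 queries to one API plus 1 to another still produce two buckets, and the second cannot start until the first fully drains.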

We should instead use a different model that allows for higher throughput. Some APIs publish rate limits, which BTE already attempts to respect (albeit naively). These should be better taken into account, and for APIs whose limits are unknown, BTE should better handle 429 responses to automatically determine a limit -- one which could be saved both to some local global (for local, non-Redis installations) and in Redis (for live instances). Additionally, BTE should use a number of 'slots' and start new subqueries as slots open up, rather than waiting for the slowest query of a bucket to complete.
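One possible shape for the slot model: a global slot pool plus a per-API concurrency cap that can be lowered when a 429 is observed. This is only a sketch of the idea under discussion; the class, the halve-on-429 policy, and the error shape are all assumptions, not BTE's design:

```typescript
// Hypothetical slot-based scheduler: queries start as soon as a global
// slot AND a per-API slot are free, instead of waiting on a bucket.
interface SubQuery {
  api: string;
  run: () => Promise<unknown>;
}

class SlotScheduler {
  private inFlight = 0;
  private perApi = new Map<string, number>();
  // Learned per-API caps; a live instance could persist these to Redis,
  // a local installation to some in-process global.
  private apiLimit = new Map<string, number>();

  constructor(private totalSlots: number, private defaultApiLimit: number) {}

  private limitFor(api: string): number {
    return this.apiLimit.get(api) ?? this.defaultApiLimit;
  }

  runAll(queries: SubQuery[]): Promise<void> {
    const pending = [...queries];
    return new Promise<void>((resolve) => {
      const pump = () => {
        if (pending.length === 0 && this.inFlight === 0) return resolve();
        for (let i = 0; i < pending.length; ) {
          const q = pending[i];
          const apiCount = this.perApi.get(q.api) ?? 0;
          if (this.inFlight < this.totalSlots && apiCount < this.limitFor(q.api)) {
            pending.splice(i, 1); // claim a slot and start immediately
            this.inFlight++;
            this.perApi.set(q.api, apiCount + 1);
            q.run()
              .catch((err: { status?: number }) => {
                // Assumed policy: on a 429, halve this API's learned
                // limit (floor of 1) so future scheduling backs off.
                if (err?.status === 429) {
                  this.apiLimit.set(
                    q.api,
                    Math.max(1, Math.floor(this.limitFor(q.api) / 2)),
                  );
                }
              })
              .finally(() => {
                this.inFlight--;
                this.perApi.set(q.api, (this.perApi.get(q.api) ?? 1) - 1);
                pump(); // a slot opened up; try to start the next query
              });
          } else {
            i++;
          }
        }
      };
      pump();
    });
  }
}
```

Unlike the bucket model, a slow query here only occupies its own slot; everything else keeps flowing around it.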

Further consideration for the model to implement can be determined in further discussion after some tinkering has been done.

@tokebe tokebe added the enhancement New feature or request label Sep 6, 2023
@tokebe tokebe self-assigned this Sep 6, 2023
@colleenXu
Collaborator

I'm wondering if scrolling BioThings queries need special consideration here (I don't know how they relate to "buckets", and I'm also not sure whether they're included in the logs of how many sub-queries are executed/successful).

@tokebe
Member Author

tokebe commented Sep 8, 2023

Note: we also want to add a (probably configurable) retry system, so if a query errors out for most non-400, non-timeout reasons, it can be recycled back into the queue (probably only once or twice).
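A minimal sketch of that retry idea, assuming a hypothetical error shape with `status` and `timeout` fields (not BTE's actual error type), where 400-class errors and timeouts are not requeued but 429s remain retryable:

```typescript
// Hypothetical retry wrapper: requeue on most non-400, non-timeout
// failures, at most `maxRetries` extra attempts (configurable).
interface QueryError {
  status?: number;
  timeout?: boolean;
}

function isRetryable(err: QueryError): boolean {
  if (err.timeout) return false; // timeouts: don't requeue
  // 400-class errors won't succeed on retry, except 429 (rate limit).
  if (
    err.status !== undefined &&
    err.status >= 400 &&
    err.status < 500 &&
    err.status !== 429
  )
    return false;
  return true;
}

async function runWithRetry<T>(
  run: () => Promise<T>,
  maxRetries = 1, // "probably only once or twice"
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await run();
    } catch (err) {
      lastErr = err;
      if (!isRetryable(err as QueryError)) throw err;
    }
  }
  throw lastErr;
}
```

In the slot model discussed above, the equivalent would be pushing the failed subquery back onto the pending queue rather than looping in place.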

@tokebe tokebe added the On Dev Related changes are deployed to Dev server label Nov 28, 2023
@tokebe tokebe added On CI Related changes are deployed to CI server and removed On Dev Related changes are deployed to Dev server labels Dec 7, 2023
@colleenXu
Collaborator

@tokebe It looks like the related PRs aren't linked. Is that okay? (For example: maybe this one?)

@tokebe
Member Author

tokebe commented Dec 7, 2023

@tokebe tokebe added On Test Related changes are deployed to Test server and removed On CI Related changes are deployed to CI server labels Dec 20, 2023
@tokebe
Member Author

tokebe commented Feb 21, 2024

Relevant changes deployed to Prod.

@tokebe tokebe closed this as completed Feb 21, 2024