Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/SK-592 | Refactor gRPC server #490

Merged
merged 8 commits into from
Dec 7, 2023
Merged

Feature/SK-592 | Refactor gRPC server #490

merged 8 commits into from
Dec 7, 2023

Conversation

Wrede
Copy link
Member

@Wrede Wrede commented Nov 24, 2023

There is a lot of redundancy in the server logic of the combiner. Also logging is messy and hard to understand, introduce logger for different verbosity . Some functions/endpoints can be merged to improve readability and reduce complexity.

Update:

gRPC endpoint Report has been removed and replaced with ListActiveClients. The combiner interfaces (interfaces.py) has list_active_clients(queue_name) which will return the number of "online" clients attached to the specific combiner. The Queue (also reffered as channel in the logic) is either MODEL_UPDATE_REQUESTS or MODEL_VALIDATION_REQUESTS.
Obeserve when the controller loadbalancer LeatPacked calls list_active_clients() it defaults to MODEL_UPDATE_REQUESTS, which means it the loadbalancer is based on the number of online "training" clients, and does not consider "validation" clients.

The way client status is managed has changed. The combiner (gRPC server) now also keeps track of the "online"/"offline" status of clients using the state dict. A client is considered online if it's gRPC context (stream) has been active for the last 10 seconds, or if the client has sent a hearbeat within 10 seconds. The status of clients are then pushed to the statestore DB, bulk update (mongodb update_many) using a deamon process, this push occur every 10 seconds. This should be made configurable in the future. It's also possible to use the gRPC endpoint ListActiveClients, this will aslo push the status updates to statestore DB once called.
The decision for this approach is:

  1. For scalability we want to avoid sending to many DB updates, now it's based on bulk updates and only clients that have changed state are updated. Also it should be configurable (the timeout for the deamon or completely disable it and solely use the ListActiveClients endpoint to update the DB)
  2. There should be a way to solely do request based updates, this is what ListActiveClients enable.

@Wrede Wrede added the HOLD label Nov 24, 2023
@Wrede Wrede requested a review from ahellander November 24, 2023 16:08
Copy link
Member

@ahellander ahellander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good imo. I agree that we should refactor the streams, there should only be one connection for update-messages. But why not first finish these improvements to error handling and logging and try to merge those?

Copy link
Member

@ahellander ahellander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, go go go :-)

@Wrede Wrede removed the HOLD label Dec 7, 2023
@Wrede Wrede merged commit 3df403e into master Dec 7, 2023
15 checks passed
@Wrede Wrede deleted the feature/SK-592 branch December 7, 2023 11:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants