[Serve] Not using previously failed replica when retrying a failed request #3916

Open

wants to merge 9 commits into master
Conversation

@cblmemo cblmemo (Collaborator) commented Sep 5, 2024

Previously, the retry logic in our load balancer just selected any ready replica, which could be the same replica that failed last time, since its status might not have been updated on the LB yet (the sync with the controller only happens every 20s). This PR avoids replicas that have already failed for a request and tries other replicas instead.

Also, update the controller LB sync interval to 10s to keep it aligned with the probing interval (which is also 10s).
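
For reference, a minimal sketch of the retry loop this PR moves toward, inside the load balancer's request handler. The select_replica(request, failed_replica_urls) signature is taken from the diff below; the proxy call, exception handling, and retry limit are illustrative placeholders rather than the exact SkyServe code:

failed_replica_urls: Set[str] = set()  # assumes `from typing import Set`
retry_cnt = 0
while True:
    retry_cnt += 1
    with self._client_pool_lock:
        # Ask the policy to skip replicas that already failed for this
        # request, so a retry prefers a different replica.
        ready_replica_url = self._load_balancing_policy.select_replica(
            request, failed_replica_urls)
    if ready_replica_url is None:
        break  # No ready replica at all; give up on this request.
    try:
        return await self._proxy_request_to(ready_replica_url, request)
    except Exception:  # pylint: disable=broad-except
        # Record the failure so the next iteration avoids this replica.
        failed_replica_urls.add(ready_replica_url)
        if retry_cnt >= max_retries:  # illustrative retry limit
            raise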

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll (Collaborator) left a comment

Thanks for the fix @cblmemo! Looks mostly good to me.

  while True:
      retry_cnt += 1
      with self._client_pool_lock:
          ready_replica_url = self._load_balancing_policy.select_replica(
-             request)
+             request, failed_replica_urls)
Collaborator

It seems that in this case we will never retry after a transient network issue. How about allowing retries on previously failed URLs if retry_cnt has not reached the maximum number of retries?

If that causes too much retry overhead, we could probably reduce the interval between retries?

Collaborator Author

This is only the retry logic within a single request; the replica will still be selected by subsequent requests. This PR is mainly for the case where the replica is in the NOT_READY state on the controller but that state has not been synced to the LB yet; for a transient network issue, the replica will convert back to the READY state soon. Besides, if all possible replicas are in failed_replica_urls, we will still choose from them.

Collaborator

Oh, I am talking about a single request. Should that specific request fail directly if replicas are available but there is just a transient network issue while the load balancer is sending the request to the replica? I think our original purpose for this retry was to retry on network issues. I am proposing the following to allow retries on the same replica if we have not reached the max retry count yet.

if ready_replica_url is None and failed_replica_urls:
    failed_replica_urls = []

Collaborator Author

Fixed! PTAL again

@Michaelvll Michaelvll (Collaborator) left a comment

Thanks for the update @cblmemo!

@@ -160,11 +160,19 @@ async def _proxy_with_retries(
      # SkyServe supports serving on Spot Instances. To avoid preemptions
      # during request handling, we add a retry here.
      retry_cnt = 0
+     failed_replica_urls: List[str] = []
Collaborator

How about using a Set?
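
For reference, a tiny sketch of what the Set version might look like (the URL is hypothetical; the point is that a set deduplicates repeated failures and gives O(1) membership checks when select_replica filters replicas):

from typing import Set

# Per-request record of replicas that have already failed.
failed_replica_urls: Set[str] = set()
failed_replica_urls.add('http://replica-1:8080')  # hypothetical replica URL
assert 'http://replica-1:8080' in failed_replica_urls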

Collaborator Author

Good point! Updated. Thanks!

Comment on lines 163 to 171
# Here we try to not retry those failed replicas for the case when the
# replica is in a NOT_READY state but does not sync-ed to the load
# balancer yet. However, we still maintain a per-request failed replica
# list instead of the global one to avoid the case for transient
# networking issues, and letting new requests to retry them. Since our
# LB policy is a global one, when the request rate is high, it is likely
# that multiple retries on a single request will use the same replica.
# Here we use the failed replica list to keep track of the failures
# happened on every request and try to avoid them in the next retry.
Collaborator

Suggested change
- # Here we try to not retry those failed replicas for the case when the
- # replica is in a NOT_READY state but does not sync-ed to the load
- # balancer yet. However, we still maintain a per-request failed replica
- # list instead of the global one to avoid the case for transient
- # networking issues, and letting new requests to retry them. Since our
- # LB policy is a global one, when the request rate is high, it is likely
- # that multiple retries on a single request will use the same replica.
- # Here we use the failed replica list to keep track of the failures
- # happened on every request and try to avoid them in the next retry.
+ # We keep track of the failed replicas for the current request, because
+ # we have a global round-robin policy, and if there is a large load,
+ # all retries for the same request can go to the same replica by chance.
+ # If the same replica is in `NOT_READY` state but the new state has not
+ # been synced from the controller, the current request will fail.
+ #
+ # We maintain a per-request failed replica list instead of the global one to
+ # allow multiple requests to still try failed replicas for one request in case
+ # that replica is failed by transient network issue.

Collaborator Author

This looks awesome! Thanks

Collaborator Author

I changed
"to allow multiple requests to still try xxx for one request"
to
"to allow multiple requests to still try xxx for one time"
since the former is slightly confusing (multiple req for one req).

Comment on lines 180 to 184
if ready_replica_url is None and failed_replica_urls:
    failed_replica_urls = []
    ready_replica_url = (
        self._load_balancing_policy.select_replica(
            request, failed_replica_urls))
Collaborator

This may increase the index twice for round robin. Should we instead just check if `len(failed_replica_urls) == len(ready_replica)` (we can add an interface `LBPolicy.num_ready_replica()` for the latter) before we call the previous `select_replica`, and set `failed_replica_urls` to empty before calling `select_replica`?
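
A sketch of that proposal, assuming the policy already tracks a ready_replicas list (the method name follows the comment above and the surrounding loop is abbreviated, so treat this as illustrative rather than the final diff):

from typing import List

class LoadBalancingPolicy:
    def __init__(self) -> None:
        self.ready_replicas: List[str] = []

    def num_ready_replicas(self) -> int:
        return len(self.ready_replicas)

# In the retry loop: reset the per-request record *before* selecting, so
# select_replica is called only once per retry and the round-robin index
# advances only once.
if (len(failed_replica_urls) ==
        self._load_balancing_policy.num_ready_replicas()):
    failed_replica_urls.clear()
ready_replica_url = self._load_balancing_policy.select_replica(
    request, failed_replica_urls)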

Collaborator Author

Yeah that is a good call. Done!

@cblmemo cblmemo requested a review from Michaelvll September 19, 2024 19:50
@cblmemo cblmemo (Collaborator Author) commented Oct 1, 2024

@Michaelvll bump for this

@cblmemo cblmemo (Collaborator Author) commented Oct 10, 2024

bump for review @Michaelvll

Comment on lines +178 to +179
if (len(failed_replica_urls) ==
        self._load_balancing_policy.num_ready_replicas()):
Collaborator

Just curious, is it possible that num_ready_replicas is updated while we are retrying in this failure loop, for example when the replicas are scaled down? There could be a concurrency issue here.

Should we instead check `if not (set(self._load_balancing_policy.ready_replicas) - failed_replica_urls):`?
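
A sketch of that alternative check, assuming failed_replica_urls is the per-request set from the diff and the policy exposes a ready_replicas list (names follow the comment above):

# Replicas that are currently ready and have not yet failed for this request.
# Using the live ready set means a concurrent scale-down cannot leave the
# length-equality check permanently out of sync.
remaining = set(
    self._load_balancing_policy.ready_replicas) - failed_replica_urls
if not remaining:
    # Every currently-ready replica already failed for this request; clear
    # the record so possibly-transient failures can be retried.
    failed_replica_urls.clear()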

while True:
    retry_cnt += 1
    with self._client_pool_lock:
        # If all replicas are failed, clear the record and retry them
        # again as some of them might be transient networking issues.
        if (len(failed_replica_urls) ==
Collaborator

I am wondering how effective the new failed_replica_urls is compared to the original globally increasing index. Can we simulate the case where a replica goes down under different loads, and look at the success rate / latency?

@Michaelvll Michaelvll (Collaborator) left a comment

Sorry, I did not mean to approve this PR. I feel we need some benchmarks to see whether this handling is actually effective.
