[Serve] Not using previously failed replica when retrying a failed request #3916
base: master
Changes from all commits: 847d78e, 97e944f, 56ccf63, 39444aa, 0f4c9ea, f94491c, 2898816, a6980b7, 615181c
@@ -2,7 +2,7 @@
 import asyncio
 import logging
 import threading
-from typing import Dict, Union
+from typing import Dict, Set, Union

 import aiohttp
 import fastapi
@@ -160,11 +160,26 @@ async def _proxy_with_retries(
         # SkyServe supports serving on Spot Instances. To avoid preemptions
         # during request handling, we add a retry here.
         retry_cnt = 0
+        # We keep track of the failed replicas for the current request:
+        # because we have a global round-robin policy, under heavy load all
+        # retries for the same request can go to the same replica by chance.
+        # If that replica is in the `NOT_READY` state but the new state has
+        # not been synced from the controller yet, the current request will
+        # fail.
+        #
+        # We maintain a per-request failed-replica set instead of a global
+        # one so that other requests can still try a failed replica once, in
+        # case it only failed due to a transient network issue.
+        failed_replica_urls: Set[str] = set()
         while True:
             retry_cnt += 1
             with self._client_pool_lock:
+                # If all ready replicas have failed for this request, clear
+                # the record and retry them again, as some failures might be
+                # transient networking issues.
+                if (len(failed_replica_urls) ==
+                        self._load_balancing_policy.num_ready_replicas()):
Comment on lines +178 to +179:

Reviewer: Just curious, is it possible that the [...]? Should we instead check [...]?
+                    failed_replica_urls.clear()
                 ready_replica_url = self._load_balancing_policy.select_replica(
-                    request)
+                    request, failed_replica_urls)
Reviewer: Seems if this is the case, we will never retry after a transient network issue. How about we allow retrying on previously failed URLs if retry_cnt has not reached the maximum number of retries? If that causes too much overhead for retries, we can probably reduce the interval between retries.

Author: This is only the retry logic within one request, and the replica will still be selected by following requests. This PR is mainly for the case when the replica is in the `NOT_READY` state [...].

Reviewer: Oh, I am talking about a single request. Should we fail directly for that specific request if there are replicas available but there was just a transient network issue while the load balancer was sending that request to the replica? I think our original purpose for this retry is to retry on network issues. I am proposing the following to allow retries for the same replica if we have not reached the max retry count yet:

    if ready_replica_url is None and failed_replica_urls:
        failed_replica_urls = []

Author: Fixed! PTAL again.
                 if ready_replica_url is None:
                     response_or_exception = fastapi.HTTPException(
                         # 503 means that the server is currently
@@ -184,6 +199,8 @@ async def _proxy_with_retries(
                     # 499 means a client terminates the connection
                     # before the server is able to respond.
                     return fastapi.responses.Response(status_code=499)
+            assert ready_replica_url is not None
+            failed_replica_urls.add(ready_replica_url)
             # TODO(tian): Fail fast for errors like 404 not found.
             if retry_cnt == constants.LB_MAX_RETRY:
                 if isinstance(response_or_exception, fastapi.HTTPException):
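
Read together, the two hunks give the retry flow roughly the shape below. This is a condensed, self-contained sketch under simplifying assumptions: `proxy_request`, `policy`, and `MAX_RETRY` stand in for the real proxy call, `self._load_balancing_policy`, and `constants.LB_MAX_RETRY`, and the client-pool lock and HTTP status handling are omitted.

```python
# Condensed sketch of the retry flow implemented by the two hunks above.
from typing import Awaitable, Callable, Optional, Set

MAX_RETRY = 3  # Stand-in for constants.LB_MAX_RETRY.


async def proxy_with_retries(
        request: object, policy: 'RoundRobinPolicy',
        proxy_request: Callable[[str, object], Awaitable[object]]) -> object:
    failed_replica_urls: Set[str] = set()  # Per-request failure record.
    retry_cnt = 0
    while True:
        retry_cnt += 1
        # If every ready replica has already failed for this request, clear
        # the record so transient network failures get another chance.
        if len(failed_replica_urls) == policy.num_ready_replicas():
            failed_replica_urls.clear()
        url: Optional[str] = policy.select_replica(request, failed_replica_urls)
        if url is None:
            raise RuntimeError('No ready replicas.')  # 503 in the real code.
        try:
            return await proxy_request(url, request)
        except Exception:  # The real code inspects the response/exception.
            failed_replica_urls.add(url)
            if retry_cnt == MAX_RETRY:
                raise
```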
Reviewer: I am wondering how effective the new failed_replica_urls is compared to the original globally increasing index. Can we simulate the case when a replica went down under different loads, and see the success rate / latency?
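
One rough way to answer this is a small offline simulation: keep a "dead" replica in the ready list (as if it were preempted but the controller has not synced the new state yet), replay many requests through both strategies, and compare success rates. The sketch below uses a made-up failure model and parameters and is only a starting point, not a benchmark of the actual load balancer.

```python
# Made-up offline simulation: compare request success rate with and without
# excluding previously failed replicas during per-request retries.
import random
from typing import List, Set

random.seed(0)

MAX_RETRY = 3
REPLICAS: List[str] = [f'replica-{i}' for i in range(3)]
# A preempted replica that the controller still reports as ready.
DEAD: Set[str] = {'replica-0'}


def handle_request(exclude_failed: bool, cursor: List[int]) -> bool:
    """Returns True if the request succeeds within MAX_RETRY attempts."""
    failed: Set[str] = set()
    for _ in range(MAX_RETRY):
        if exclude_failed and len(failed) == len(REPLICAS):
            failed.clear()
        # Round-robin over the shared cursor, optionally skipping replicas
        # this request has already failed on.
        url = None
        for _ in range(len(REPLICAS)):
            candidate = REPLICAS[cursor[0] % len(REPLICAS)]
            cursor[0] += 1
            if not exclude_failed or candidate not in failed:
                url = candidate
                break
        if url is None:
            return False
        if url in DEAD:
            failed.add(url)
            # Concurrent requests advance the shared cursor before we retry,
            # which is how retries can land on the same replica by chance.
            cursor[0] += random.randint(0, 2 * len(REPLICAS))
            continue
        return True
    return False


def success_rate(exclude_failed: bool, num_requests: int = 10000) -> float:
    cursor = [0]  # Shared round-robin cursor, as in a global policy.
    successes = 0
    for _ in range(num_requests):
        successes += handle_request(exclude_failed, cursor)
    return successes / num_requests


if __name__ == '__main__':
    print('success rate without exclusion:', success_rate(False))
    print('success rate with exclusion:   ', success_rate(True))
```

Extending the sketch with per-attempt latencies would also give a rough latency comparison, but measuring the real load balancer under load remains the more reliable check.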