[🧹 CHORE]: Autoscale. First look, bugs, proposals #2086
Comments
Hey @trin4ik 👋🏻
What is the point in waiting at all? If no workers are available, I would argue you should create a new one immediately and only use the
An application may work just fine when all workers are fully loaded and the wait timeout is within a few hundred milliseconds. Furthermore, there is no single condition called "no workers available": many threads can wait for a worker, and if we allocate workers for them immediately, you may see a huge spike of 100 (or `max_workers`) workers allocated at the same time (the maximum for the dynamic allocator).
"No workers available" is just the condition where no worker is idle. Isn't that the point of dynamically scaling workers? To allocate additional workers when they are all busy? If you set your max_workers to 100, you would expect to be able to handle 100 workers, so what's the problem? This is what FPM does and it has very similar worker scaling if you use I don't know about the internals of how this works, but waiting more than 50-100ms for a worker seems to defeat the purpose of auto-scaling. |
Maybe you could consider a "scale-in-delay" parameter: it would wait a minimum of x ms between starting new workers. This way you get an immediate response for the first dynamic worker but avoid the spike you mentioned.
Yes, which is why it's suggested that we split the timeouts.
Well, the point is that there is no reason to wait at all. If you have saturated your worker pool but your CPU is doing nothing because you're waiting for IO (or whatever — something typical for web applications), there is no reason to wait. If a delay is necessary for some technical reason, it should be set very low by default.
Yes, it's true, the delay should be very low, and that's exactly what this issue is about. Zero delay would lead to overhead, it seems to me.
But why is it a problem to create 5 more workers if all 10 are busy? For this example to make sense, you'd have to receive 15 requests in < 50ms. That's a very tight gap, I'd say, if the traffic pattern won't continue at all (and hence still require a lot of workers). But I understand your point. I think maybe you can get the best of both worlds with two configs. So every 50ms you would be allowed to create up to 2 more workers, if still necessary. This way you'd immediately go from 10 to 12 in your example, but not from 12 to 14 before 50ms had passed. Edit: Okay, it seems we already have spawn rate. So yeah, I guess we agree. I didn't read the docs at all.
I don't want to limit spawning workers by a timeout; I want to spawn workers (and allocate resources, which can be substantial) only after a timeout like 50ms. If after 50ms the workers are still busy, then new workers should be spawned. That's my case. Another case, which I also described, raises the issue of the long start time of workers, when it is a good idea to start creating new workers before all the current ones are busy.
No duplicates 🥲.
What should be improved or cleaned up?
Starting with 2024.3.0 we have autoscale workers 😍
It's a useful feature and, of course, I immediately went to test it out. After a little discussion in Discord (https://discord.com/channels/538114875570913290/1314816983090593803), we came to the conclusion that some things still need to be improved.
1. `allocate_timeout` is redundant at the moment.

Before autoscale, `allocate_timeout` was responsible for the startup timeout of the worker (https://docs.roadrunner.dev/docs/error-codes/allocate-timeout). Now `allocate_timeout` is also used as a debounce when spawning new workers in autoscale, i.e. before `EventNoFreeWorkers` fires, the pool waits for `allocate_timeout` and only then adds workers.

The obvious problem is that these should be different options in the configuration, since the timeout for creating a new worker and the delay between creating new workers in the pool are different values. The default `allocate_timeout` is 60s; for worker startup that might be okay, but as the timeout before allocating new dynamic workers in the pool it's too long. For example, if all workers are in `working` status and we get a new lightweight request from a user, the user will wait `allocate_timeout` (60 seconds) before the pool spawns new workers for the request.

It is suggested that `allocate_timeout` be split into two options:

- `allocate_timeout`, exactly what it was before.
- `dynamic_allocator.debounce_timeout`, the waiting time when all the workers are in `working` status before the `EventNoFreeWorkers` event. `debounce_timeout` is a working title; it may be different.

Questions for the community:

- `debounce_timeout`?

2. Sometimes we need to spawn new workers before `EventNoFreeWorkers`

If our workers have a long warmup, like needing to open a big SQLite database or load an AI model, etc., we want to spawn new workers in advance. We're ready for the overhead, as long as it's delay-free for the user.

In this case, we want to control spawning new workers before `EventNoFreeWorkers` is fired, for example, when there are fewer than 2 free workers (status `ready`).

It is suggested to add a new option `dynamic_allocator.min_ready_workers` (working title). If we have `min_ready_workers: 2` and the pool has fewer than 2 workers in `ready` status, the pool fires `EventMinReadyWorkers` and spawns new workers from the configuration. Of course, the `EventMinReadyWorkers` event should fire with the `debounce_timeout`.

Questions for the community:

- `min_ready_workers`?
- `EventMinReadyWorkers`, or just fire `EventNoFreeWorkers`?
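Putting both proposals together, the pool config might look like the fragment below. This is a sketch: `debounce_timeout` and `min_ready_workers` are the hypothetical option names proposed in this issue, not existing RoadRunner settings, and the surrounding `dynamic_allocator` options are shown as I understand them from the current docs:

```yaml
pool:
  # existing option: how long a single worker may take to start
  allocate_timeout: 60s
  dynamic_allocator:
    max_workers: 25
    spawn_rate: 10
    idle_timeout: 10s
    # proposed (hypothetical): how long the pool waits with all workers
    # in "working" status before EventNoFreeWorkers fires and new
    # workers are spawned — instead of reusing allocate_timeout
    debounce_timeout: 50ms
    # proposed (hypothetical): spawn new workers in advance whenever
    # fewer than this many workers are in "ready" status
    min_ready_workers: 2
```

With this split, a lightweight request arriving while all workers are busy would wait at most `debounce_timeout` (50ms) instead of `allocate_timeout` (60s) before new workers are spawned.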
?Bugs: