
[🧹 CHORE]: Autoscale. First look, bugs, proposals #2086

Open
1 task done
trin4ik opened this issue Dec 7, 2024 · 10 comments
Labels: C-enhancement (Category: enhancement, meaning improvements of a current module, transport, etc.) · S-RFC (Request for comments)


trin4ik commented Dec 7, 2024

No duplicates 🥲.

  • I have searched for a similar issue.

What should be improved or cleaned up?

Starting with 2024.3.0 we have worker autoscaling 😍
It's a useful feature and, of course, I immediately went to test it out. After a little discussion in Discord (https://discord.com/channels/538114875570913290/1314816983090593803), we came to the conclusion that some things still need to be improved.

1. allocate_timeout is currently overloaded.

Before autoscale, allocate_timeout was responsible only for the startup timeout of a worker (https://docs.roadrunner.dev/docs/error-codes/allocate-timeout).
Now allocate_timeout is also used as a debounce when spawning new workers in autoscale: before EventNoFreeWorkers fires, the pool waits for allocate_timeout and only then adds workers.
The obvious problem is that these should be different options in the configuration, since the timeout for creating a new worker and the delay between creating new workers in the pool are different values. The default allocate_timeout is 60s, which may be fine for worker startup, but it is far too long as the delay before allocating new dynamic workers in the pool. For example, if all workers are in working status and a new lightweight request arrives, the user will wait allocate_timeout (60 seconds) before the pool spawns new workers for that request.

It is suggested that allocate_timeout be split into two options:

  1. allocate_timeout: exactly what it was before.
  2. dynamic_allocator.debounce_timeout: the waiting time while all workers are in working status before the EventNoFreeWorkers event fires. debounce_timeout is a working title; it may end up being different.
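To make the proposal concrete, here is a hypothetical .rr.yaml sketch of the split. debounce_timeout is only a working title from this proposal, not a released RoadRunner option, and the surrounding values (num_workers, max_workers) are illustrative:

```yaml
# Hypothetical sketch: debounce_timeout is a proposed name, not an existing option.
http:
  pool:
    num_workers: 10
    # Unchanged: how long the pool waits for a single worker to start.
    allocate_timeout: 60s
    dynamic_allocator:
      max_workers: 25
      spawn_rate: 5
      # Proposed: how long all workers must stay busy before
      # EventNoFreeWorkers fires and new workers are spawned.
      debounce_timeout: 50ms
```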

Questions for the community:

  1. What should debounce_timeout be named?
  2. Any other suggestions and comments.

2. Sometimes we need to spawn new workers before EventNoFreeWorkers

If our workers have a long warmup, e.g. they need to open a big SQLite database or load an AI model, we want to spawn new workers in advance. We're ready to pay the overhead, as long as there is no delay for the user.
In this case, we want to control spawning new workers before EventNoFreeWorkers fires, for example when there are fewer than 2 free workers (status ready).

It is suggested that a new option dynamic_allocator.min_ready_workers (working title) be added. If we have min_ready_workers: 2 and the pool has fewer than 2 workers in ready status, the pool fires EventMinReadyWorkers and spawns new workers according to the configuration. Of course, the EventMinReadyWorkers event should respect the debounce_timeout.
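A sketch of how this could look in config. min_ready_workers is only a working title proposed in this issue, not an existing option, and the numeric values are illustrative:

```yaml
# Hypothetical sketch: min_ready_workers is only a proposal from this issue.
http:
  pool:
    num_workers: 10
    dynamic_allocator:
      max_workers: 25
      spawn_rate: 5
      # Proposed: when fewer than this many workers are in "ready" status,
      # fire EventMinReadyWorkers (debounced) and spawn new workers, so that
      # slow-warmup workers (big SQLite DB, AI model) are ready ahead of demand.
      min_ready_workers: 2
```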

Questions for the community:

  1. What should min_ready_workers be named?
  2. Do we need a new event like EventMinReadyWorkers, or should we just fire EventNoFreeWorkers?
  3. Any other suggestions and comments.

Bugs:

  1. [🐛 BUG]: Downscale workers didn't work correctly #2092
@trin4ik trin4ik added the C-enhancement Category: enhancement. Meaning improvements of current module, transport, etc.. label Dec 7, 2024
@rustatian
Member

Hey @trin4ik 👋🏻
Thank you for the valuable feedback 👍🏻 I agree about separating allocate_timeout into 2 options: the good old allocate_timeout plus a dedicated option for the dynamic pool timeout.
MinReadyWorkers is also an interesting idea. I need to double-check that these new options won't affect performance, because, you know, why use RR if it'd be slower than FPM 😃

@rustatian rustatian added the S-RFC Request: Request for comments label Dec 7, 2024
@rustatian rustatian moved this to 🔖 Ready in Jira 😄 Dec 7, 2024
@nickdnk
Member

nickdnk commented Dec 10, 2024

What is the point in waiting at all? If no workers are available, I would argue you should create a new one immediately and only use the allocate_timeout logic once the maximum number of dynamic workers is reached.

@rustatian
Member

An application may work just fine when all workers are fully loaded and the wait for a free worker stays within a few hundred milliseconds. Furthermore, there is no single condition called "no workers available": many threads can be waiting for a worker, and if we allocate one for each of them immediately, you may see a huge spike of 100 workers (or whatever max_workers is set to, the maximum for the dynamic allocator) allocated at the same time.

@nickdnk
Member

nickdnk commented Dec 10, 2024

"No workers available" is just the condition where no worker is idle. Isn't that the point of dynamically scaling workers? To allocate additional workers when they are all busy?

If you set your max_workers to 100, you would expect to be able to run 100 workers, so what's the problem? This is what FPM does, and it has very similar worker scaling if you use pm = dynamic mode (https://www.php.net/manual/en/install.fpm.configuration.php). Maybe there is a small delay in FPM for performance reasons, but it's definitely not measured in seconds.

I don't know about the internals of how this works, but waiting more than 50-100ms for a worker seems to defeat the purpose of auto-scaling.

@nickdnk
Member

nickdnk commented Dec 10, 2024

Maybe you could consider a "scale-in-delay" parameter, so the pool waits a minimum of x ms between starting new workers. This way you get an immediate response for the first dynamic worker but avoid the spike you mentioned.

@rustatian rustatian added this to the v2025.1.0 milestone Dec 10, 2024
@trin4ik
Author

trin4ik commented Dec 10, 2024

> I don't know about the internals of how this works, but waiting more than 50-100ms for a worker seems to defeat the purpose of auto-scaling.

Yes, which is why it's suggested that we split the timeouts. allocate_timeout is too large for debouncing the start of new workers, but I don't see the point of removing the debounce completely.

@nickdnk
Member

nickdnk commented Dec 10, 2024

> > I don't know about the internals of how this works, but waiting more than 50-100ms for a worker seems to defeat the purpose of auto-scaling.
>
> Yes, which is why it's suggested that we split the timeouts. allocate_timeout is too large for debouncing the start of new workers, but I don't see the point of removing the debounce completely.

Well, the point is that there is no reason to wait at all. If you have saturated your worker pool but your CPU is doing nothing because you're waiting for IO (or whatever - but something typical for web applications), there is no reason to wait. If a delay is necessary for some technical reason, it should be set very low by default.

@trin4ik
Author

trin4ik commented Dec 10, 2024

> Well, the point is that there is no reason to wait at all. If you have saturated your worker pool but your CPU is doing nothing because you're waiting for IO (or whatever, but something typical for web applications), there is no reason to wait. If a delay is necessary for some technical reason, it should be set very low by default.

Yes, that's true, the delay should be very low, and that's exactly what this issue is about.

Zero delay would lead to overhead, it seems to me.
Example: a high-load service with many requests, a default worker pool of 10, spawn_rate 5. Every time all workers are busy for at least 1ms, RR will create +5 workers. Waiting at least 50ms seems like a good solution, at least that's what I would want to see in my production. The other thing is that allocate_timeout is certainly not a suitable value for this.

@nickdnk
Member

nickdnk commented Dec 10, 2024

But why is it a problem to create 5 more workers if all 10 are busy? For this example to make sense, you'd have to receive 15 requests in < 50ms. That's a very tight window, I'd say, if the traffic pattern won't continue at all (and hence still require a lot of workers). But I understand your point. I think maybe you can get the best of both worlds with two configs:

Something like:
worker_max_spawn_rate: 2
worker_spawn_delay: 50ms

So every 50ms you will be allowed to create up to 2 more workers, if still necessary. This way you'd immediately go from 10 to 12 in your example, but not from 12 to 14 before 50ms had passed.
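As a sketch, the two options from this comment combined into one config. Both names (worker_max_spawn_rate, worker_spawn_delay) are proposals from this discussion, not an existing RoadRunner API:

```yaml
# Hypothetical sketch: both option names are proposals from this discussion.
http:
  pool:
    num_workers: 10
    dynamic_allocator:
      max_workers: 100
      # Proposed: cap on how many workers may be spawned per step.
      worker_max_spawn_rate: 2
      # Proposed: minimum wait between spawn steps; the first step is immediate.
      worker_spawn_delay: 50ms
```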

Edit: Okay it seems we already have spawn rate. So yea, I guess we agree. I didn't read the docs at all.

@trin4ik
Author

trin4ik commented Dec 10, 2024

I don't want to limit worker spawning with a timeout; I want to spawn workers (and allocate the resources, which can be substantial) only after a timeout like 50ms. If after 50ms the workers are still busy, then new workers should be spawned.

That's my case. The other case, which I also described, raises the issue of workers with a long start time, when it is a good idea to start creating new workers before all the current ones are busy.
