Processes stack up on concurrent restarts #45
Replacements must be spawned before killing the workers, otherwise there will be a zero-availability period between killing the workers and the replacements becoming ready. Edit: additionally, old workers will continue to run without receiving new requests for as long as they are still servicing long-running old requests, or until the kill timeout expires. Blocking restarts while one is in progress could be a good optional feature. |
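For reference, the "drain then kill after a timeout" behaviour described above looks roughly like the sketch below. This is only an illustration using Node's built-in cluster module, not this library's actual implementation, and retireWorker and the 30-second timeout in the usage comment are made-up names and values.

const cluster = require('cluster');

// Gracefully retire a worker: stop routing it new connections, let it finish
// in-flight requests, and hard-kill it only if it exceeds the kill timeout.
// Usage (hypothetical): for (const id in cluster.workers) retireWorker(cluster.workers[id], 30000);
function retireWorker(worker, killTimeoutMs) {
  worker.disconnect();                       // no new connections for this worker
  const timer = setTimeout(function() {
    worker.kill('SIGKILL');                  // force-kill once the timeout expires
  }, killTimeoutMs);
  worker.once('exit', function() {
    clearTimeout(timer);                     // exited in time, cancel the forced kill
  });
}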
I believe if you kill first and then spawn, there will always be at least N-1 workers, which may be preferable to sometimes having up to N*2 workers. I tested and it can for sure be 100% available using either strategy, given you have 2 or more workers. Either way it's not 100% available now due to the unaccounted-for edge case. |
You mean using rolling restarts? That could be an option too. |
Correct. The primary goal of this issue would be blocking concurrent restarts so the cluster remains 100% available; without that, I don't feel like it's really production ready (what if a restart takes longer than usual and someone accidentally deploys again before it's done restarting?). According to siege, you'll drop around 10-20% of traffic to the Node.js backend, which is really risky.
A secondary goal would be to support rolling restarts, where you always have at least N-1 workers online and no more than N workers online. There could be a setting to preserve the current strategy of spawning N replacement workers, where you end up with between N and N*2 workers online. My performance testing shows that rolling restarts (>= N-1 workers and <= N workers at all times) perform better than sometimes having up to N*2 workers: with N-1 workers, the remaining workers just pick up the slack, whereas with N*2 workers they start competing for CPU and other system resources, since you now have more workers spawned than you have CPU cores, not to mention that some apps are memory intensive.
So the primary goal would be 100% uptime despite any edge cases. The secondary goal would be a "rolling restart" strategy that doesn't impact performance as much while a restart is occurring. |
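A minimal sketch of the rolling strategy described above (never fewer than N-1 and never more than N workers), assuming Node's built-in cluster module rather than this library's internals; rollingRestart is a made-up name for illustration only.

const cluster = require('cluster');

// Restart workers one at a time: retire an old worker first, then fork its
// replacement and wait until it is listening before moving on. The worker
// count stays between N-1 and N for the whole restart.
async function rollingRestart() {
  for (const id of Object.keys(cluster.workers)) {
    const oldWorker = cluster.workers[id];
    if (!oldWorker) continue;

    // Ask the old worker to finish in-flight requests and exit
    // (a real implementation would also need a kill timeout here).
    oldWorker.disconnect();
    await new Promise(function(resolve) { oldWorker.once('exit', resolve); });

    // Fork the replacement and wait until it is accepting connections.
    const replacement = cluster.fork();
    await new Promise(function(resolve) { replacement.once('listening', resolve); });
  }
}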
This workaround seems to solve the problem:

let reloading = false;

process.on('SIGUSR2', function() {
  if (reloading) {
    console.log('Got SIGUSR2, already reloading so ignoring it...');
    return;
  }
  reloading = true;
  console.log('Got SIGUSR2, reloading cluster...');
  // 'cluster' is the reloadable cluster instance created at startup;
  // clear the flag once reload() reports completion via its callback.
  cluster.reload(() => reloading = false);
}); |
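A possible variation on the workaround above (an untested sketch, using the same cluster instance and reload(callback) call) that remembers one pending reload instead of dropping the extra signal, so a deploy that races an in-progress restart still gets applied; pendingReload and startReload are made-up names.

let reloading = false;
let pendingReload = false;

process.on('SIGUSR2', function() {
  if (reloading) {
    console.log('Got SIGUSR2 while reloading, queueing one more reload...');
    pendingReload = true;        // remember that another reload was requested
    return;
  }
  startReload();
});

function startReload() {
  reloading = true;
  console.log('Got SIGUSR2, reloading cluster...');
  cluster.reload(function() {
    reloading = false;
    if (pendingReload) {         // a restart was requested mid-reload; run it now
      pendingReload = false;
      startReload();
    }
  });
}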
Send SIGUSR2 and wait: my 2 processes come back up as 2 processes every time. Using siege, availability stays at 100%.
Send SIGUSR2 twice rapidly: my 2 processes turn into 5 processes and I get an error in the console:
Error [ERR_IPC_DISCONNECTED]: IPC channel is already disconnected
Using siege, availability drops from 100% to around 90%. Eventually the number of running processes decreases back down to 2.
I would have expected it to kill a worker before spawning a replacement; instead it appears to spawn replacements before killing the worker being replaced, which means the total number of processes sometimes exceeds the number of CPUs (undesired). Also, 90% availability instead of 100% when issuing restarts in quick succession is an issue. A quick fix could be to toggle a flag and block restarts if there's already one pending.