
Processes stack up on concurrent restarts #45

Open
joshribakoff opened this issue Jan 13, 2018 · 5 comments


joshribakoff commented Jan 13, 2018

Send SIGUSR2 and wait: my 2 processes come back up as 2 processes every time. Using siege, availability stays at 100%.

Send SIGUSR2 twice rapidly: my 2 processes turn into 5 processes and I get an error in the console: `Error [ERR_IPC_DISCONNECTED]: IPC channel is already disconnected`.

Siege shows availability dropping from 100% to around 90%. Eventually the number of running processes decreases back down to 2.

I would have expected it to kill a worker before spawning a replacement; instead it appears to spawn replacements before killing the workers being replaced, which means the total number of processes sometimes exceeds the number of CPUs (undesired). Also, 90% availability instead of 100% when issuing restarts in quick succession is an issue. A quick fix could be to toggle a flag and block restarts if there's already one pending.


spion commented Jan 13, 2018

Replacements must be spawned before killing the workers; otherwise there would be a period of zero availability between killing the workers and the replacements becoming ready.

Edit: additionally, old workers will continue to run without receiving new requests as long as they are still servicing long-running old requests, or until the kill timeout expires.

Blocking restarts while one is in progress could be a good optional feature.
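For illustration, spawn-before-kill might look roughly like this. This is a minimal sketch using Node's built-in cluster module in the master process, not recluster's actual implementation; `replaceWorker` and `KILL_TIMEOUT_MS` are hypothetical:

```js
const cluster = require('cluster');

// Assumed value for illustration only.
const KILL_TIMEOUT_MS = 30 * 1000;

function replaceWorker(oldWorker) {
    const replacement = cluster.fork();

    // Retire the old worker only once the replacement is accepting
    // connections, so there is never a zero-availability window.
    replacement.once('listening', () => {
        // Stop routing new requests to the old worker; it keeps servicing
        // in-flight long-running requests until it exits on its own.
        oldWorker.disconnect();

        // Force-kill if it still hasn't exited when the kill timeout expires.
        const timer = setTimeout(() => oldWorker.kill('SIGKILL'), KILL_TIMEOUT_MS);
        oldWorker.once('exit', () => clearTimeout(timer));
    });
}
```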


joshribakoff commented Jan 13, 2018 via email


spion commented Jan 14, 2018

> I tested and it can for sure be 100% available using either strategy, given you have 2 or more workers.

You mean using rolling restarts? That could be an option too.

@joshribakoff

Correct, the primary goal of this issue would be blocking concurrent restarts so it remains 100% available. Without that, I don't feel like it's really production-ready (what if a restart takes longer than usual and someone accidentally deploys again before it's done restarting?). According to siege, you'll drop around 10-20% of traffic to the Node.js backend, which is a real risk.

A secondary goal would be to support rolling restarts, where you always have at least N-1 workers online and no more than N workers online. There could be a setting to preserve the current strategy of spawning N replacement workers, where you may have between N and N*2 workers online.

My performance testing shows that rolling restarts (>= N-1 workers and <= N workers at all times) perform better than sometimes having up to N*2 workers. With N-1 workers, or rolling restarts, the other workers just pick up the slack. With N*2 workers, they start competing for CPU and other system resources, since you now have more workers spawned than CPU cores; not to mention some apps are memory-intensive.

So the primary goal would be 100% uptime, despite any edge cases. The secondary goal would be a "rolling restart" strategy that doesn't impact performance as much while a restart is occurring.
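For illustration, a rolling restart along these lines might look roughly like this. This is a sketch against Node's built-in cluster module, not an existing recluster option; `rollingRestart` is a hypothetical helper:

```js
const cluster = require('cluster');

// Replace workers one at a time, so the worker count stays between
// N-1 and N throughout the restart.
function rollingRestart(done) {
    const workers = Object.values(cluster.workers);

    function replaceNext(i) {
        if (i >= workers.length) {
            if (done) done();
            return;
        }

        const old = workers[i];

        // Retire one worker first so we never exceed N processes, then
        // fork its replacement and wait until it is serving traffic
        // before moving on to the next worker.
        old.once('exit', () => {
            const replacement = cluster.fork();
            replacement.once('listening', () => replaceNext(i + 1));
        });
        old.disconnect();
    }

    replaceNext(0);
}
```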

@joshribakoff

This workaround seems to solve the problem.

```js
const path = require('path');
const recluster = require('recluster');

// Assumed setup, shown for completeness: `cluster` here is the recluster
// instance, not Node's built-in cluster module.
const cluster = recluster(path.join(__dirname, 'server.js'));
cluster.run();

let reloading = false;
process.on('SIGUSR2', function() {
    if (reloading) {
        console.log('Got SIGUSR2, already reloading so ignoring it...');
        return;
    }

    reloading = true;
    console.log('Got SIGUSR2, reloading cluster...');

    // Clear the flag once the reload completes, re-enabling restarts.
    cluster.reload(() => reloading = false);
});
```
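Note that the flag is only cleared in reload's completion callback, so a second SIGUSR2 arriving mid-reload is dropped rather than queued.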
