deployment: extend "at scale" page #126

VMois opened this issue May 20, 2022 · 0 comments
VMois commented May 20, 2022

Take the notes below by @tiborsimko and use them to extend our "Deployment at scale" page.

For a test deployment you can start very small; even a single-node Kubernetes
deployment on one machine of the type m2.xlarge (8 CPU, 16 GB RAM) should be
more than enough.
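
For illustration, such a single-node test cluster could be created with kind (the tool choice here is an assumption, not part of the notes above; any single-node Kubernetes distribution would do):

```sh
# Minimal sketch: throw-away single-node test cluster using kind (assumed tool).
kind create cluster --name reana-test
kubectl get nodes   # should list exactly one node acting as control plane and worker
```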

However, if you are already thinking about creating a small multi-user,
multi-node cluster that could easily grow with time, then I'd recommend
starting with at least 5 such nodes, assigned the following responsibilities
(a kubectl labelling sketch follows the list):

- 1 node labelled reana.io/system=infrastructure that will run the REANA
  frontend and backend infrastructure services;

- 1 node labelled reana.io/system=infrastructuredb that will run the PostgreSQL
  DB service (unless you already have a DB service running outside of the
  cluster that can be reused, so that you do not have to host the DB yourself);

- 1 node labelled reana.io/system=infrastructuremq that will run the RabbitMQ
  service;

- 1 node labelled reana.io/system=runtimebatch that will run the user runtime
  batch workflow orchestration pods (CWL/Serial/Snakemake/Yadage);

- 1 node labelled reana.io/system=runtimejobs that will run the user runtime
  job pods (generated by those workflow orchestration pods).
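
As a sketch of how such a node layout could be applied, the labels above can be attached with kubectl; the node names node-1 … node-5 are placeholders for your actual node names:

```sh
# Placeholders node-1..node-5 stand for your real node names (see: kubectl get nodes).
kubectl label node node-1 reana.io/system=infrastructure
kubectl label node node-2 reana.io/system=infrastructuredb
kubectl label node node-3 reana.io/system=infrastructuremq
kubectl label node node-4 reana.io/system=runtimebatch
kubectl label node node-5 reana.io/system=runtimejobs
```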

With such a setup, you can keep the 3 infrastructure nodes fixed and scale the
2 runtime nodes (1 batch, 1 jobs) up to e.g. 50 runtime nodes (10 batch, 40
jobs) as your needs grow.

For example, 1 runtime batch node can comfortably run 8 concurrent user
workflows at full speed (since 1 node has 8 cores). So, if you need 80 users
running at full speed, then 10 such runtime batch nodes may be needed.

For example, if your physics workflows are typically such that 1 workflow
generates 4 parallel n-tupling jobs, then you may want to add 4 runtime job
nodes for each runtime batch node in the system, so that everything can run
optimally at a sustainable full speed. (Provided memory is sufficient; if not,
machine types with more RAM may be necessary.)
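
As a back-of-the-envelope sizing sketch using the figures from the two examples above (8 cores per node, 80 concurrent users, 4 parallel jobs per workflow):

```sh
# Rough node-count arithmetic; the inputs mirror the example figures above.
users=80            # concurrent users running at full speed
cores_per_node=8    # cores per runtime batch node
jobs_per_workflow=4 # parallel jobs spawned by one workflow

batch_nodes=$(( users / cores_per_node ))          # -> 10 runtime batch nodes
job_nodes=$(( batch_nodes * jobs_per_workflow ))   # -> 40 runtime job nodes
echo "runtime batch nodes: ${batch_nodes}, runtime job nodes: ${job_nodes}"
```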

We have tried the above three-infrastructure-node setup for clusters of O(1k)
cores and everything scaled very nicely.

If you'd like to aim even higher, say 5k cores, then using 8 CPU / 16 GB nodes
is not optimal: in our scalability tests we saw slowdowns and huge loads on the
Kubernetes master node. Using larger machine flavours (32 CPU or more) would be
preferable here. But I guess these types of considerations can wait for now.
@tiborsimko tiborsimko self-assigned this Mar 13, 2023
@mdonadoni mdonadoni added this to 0.95.0 Aug 8, 2024
@mdonadoni mdonadoni moved this to In work in 0.95.0 Aug 8, 2024